Lingjiao Zhang
University of Pennsylvania

Model-based phenotyping in Electronic Health Records with data for anchor-labeled cases and unlabeled patients

Lingjiao Zhang, Xiruo Ding, Naveen Muthu, Imran Ajmal, Jason Moore, Daniel Herman, Jinbo Chen

Objective: Building a classifier for a binary phenotype from Electronic Health Records (EHRs) normally requires a curated dataset of patients with the phenotype (“cases”) and without it (“controls”). Manual annotation of gold-standard cases and controls takes significant time and expert knowledge. For some phenotypes, it is feasible to identify a group of cases based on medical knowledge but infeasible to identify control patients. Data are then available only for identified cases and unlabeled patients.
Methods: We propose a maximum likelihood approach to analyzing such “positive-only” data to develop an EHR phenotyping model. Our framework relies on the definition of a binary “anchor variable” that summarizes domain knowledge regarding the phenotype of interest. A positive anchor variable indicates a case, whereas a negative anchor variable is uninformative about the true phenotype status. Conditional on the phenotype, the anchor variable is independent of other predictors. Once the anchor variable is specified, the EHR data can be considered a random sample consisting of a group of anchor-labeled cases and a large number of unlabeled patients. Our maximum likelihood approach efficiently uses both labeled and unlabeled data to develop a logistic regression phenotyping model. It also yields consistent estimates of phenotype prevalence and anchor sensitivity. Additionally, we propose novel statistical methods for assessing model calibration and predictive accuracy.
Results: We evaluated the performance of our method through theoretical and simulation studies, considering a range of phenotype prevalences and varying degrees of model goodness-of-fit. We applied our method to develop a preliminary model for identifying patients with primary aldosteronism (PA) in the University of Pennsylvania Health System, which achieved an area under the ROC curve (AUC) of 0.99.
Conclusion: Our likelihood approach combines domain expertise, summarized by an anchor variable, with a vast amount of unlabeled data to develop a model for accurate case-control identification in EHRs. It largely spares researchers from labor-intensive manual labeling, thereby greatly increasing the efficiency of EHR phenotyping. Importantly, our method is transportable because well-defined anchor variables can be shared among institutions.
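Illustrative sketch: The two anchor assumptions stated in Methods imply that, for a patient with predictors x, P(anchor = 1 | x) = c * P(phenotype = 1 | x), where c is the anchor sensitivity. The Python sketch below fits a likelihood of that form to simulated data. It is a minimal illustration under stated assumptions, not the authors' implementation: the single predictor, the simulated parameter values, the plain gradient-ascent fit, and every function and variable name are hypothetical choices made for the example.

```python
import math
import random


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def anchor_prob(params, x):
    """P(S=1 | x) = c * P(Y=1 | x), with c = expit(a) and a logistic model for Y."""
    b0, b1, a = params
    q = expit(a) * expit(b0 + b1 * x)
    return min(max(q, 1e-12), 1.0 - 1e-12)  # clip to keep the logs finite


def log_likelihood(params, xs, ss):
    """Log-likelihood of the observed anchor labels S given the predictor x."""
    total = 0.0
    for x, s in zip(xs, ss):
        q = anchor_prob(params, x)
        total += math.log(q) if s == 1 else math.log(1.0 - q)
    return total


def gradient(params, xs, ss):
    """Analytic gradient of the log-likelihood with respect to (b0, b1, a)."""
    b0, b1, a = params
    c = expit(a)
    g0 = g1 = ga = 0.0
    for x, s in zip(xs, ss):
        p = expit(b0 + b1 * x)
        q = anchor_prob(params, x)
        w = (s - q) / (q * (1.0 - q))   # d logL / dq
        dq_deta = c * p * (1.0 - p)     # dq / d(b0 + b1 * x)
        g0 += w * dq_deta
        g1 += w * dq_deta * x
        ga += w * p * c * (1.0 - c)     # dq / da
    return g0, g1, ga


def fit(xs, ss, n_iter=300):
    """Gradient ascent with step halving; only accepts steps that raise the likelihood."""
    params = [0.0, 0.0, 0.0]
    ll = log_likelihood(params, xs, ss)
    n = len(xs)
    for _ in range(n_iter):
        g = gradient(params, xs, ss)
        step = 1.0
        while step > 1e-10:
            cand = [p + step * gi / n for p, gi in zip(params, g)]
            cand_ll = log_likelihood(cand, xs, ss)
            if cand_ll > ll:
                params, ll = cand, cand_ll
                break
            step /= 2.0
        else:
            break  # no improving step found; stop early
    return params, ll


# Simulated positive-only data (all parameter values are made up for illustration).
random.seed(0)
xs, ss = [], []
for _ in range(1000):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < expit(-1.0 + 2.0 * x) else 0  # unobserved phenotype
    s = 1 if (y == 1 and random.random() < 0.6) else 0       # anchor flags cases only
    xs.append(x)
    ss.append(s)

ll_start = log_likelihood([0.0, 0.0, 0.0], xs, ss)
params_hat, ll_hat = fit(xs, ss)
c_hat = expit(params_hat[2])                                  # anchor sensitivity
prev_hat = sum(expit(params_hat[0] + params_hat[1] * x) for x in xs) / len(xs)
print("estimated anchor sensitivity:", round(c_hat, 3))
print("estimated phenotype prevalence:", round(prev_hat, 3))
```

In this sketch the prevalence estimate is simply the average fitted phenotype probability over the sample. With few anchor-positive patients or weak predictors the likelihood surface can be flat in c, so a real analysis would likely need a more careful optimizer and standard errors than this example provides.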