Addressing Incomplete and Missing EHR Data in Implementation Science

Project Summary: The purpose of this pilot is to develop methods to address incomplete and missing electronic health record (EHR) data to advance low-burden evaluation of implementation science outcomes in community-based learning health systems.

Data from EHR systems represent an important opportunity for cost-effective implementation science (IS) research: they provide rich data on large populations, over long time frames, and at relatively low cost. Notwithstanding this substantial potential, the use of EHR data for research purposes requires consideration of many potential threats to validity. One such threat is incomplete or missing data in the EHR which, depending on why the data are incomplete or missing and how the analysis proceeds, can result in selection bias and, in turn, compromise the generalizability of the results.

Unfortunately, existing methods for missing data generally fail to acknowledge the inherent complexity and heterogeneity of EHR data, in particular the interplay of numerous decisions by patients, health care providers and health systems that must all occur for complete data to be observed. As such, their application may fail to resolve selection bias. To ensure high-quality, rigorous and generalizable results from IS research activities that use EHR data, this project aims to develop a suite of novel statistical methods, including: new “blended analysis strategies” that combine and build on the strengths of multiple imputation and inverse-probability weighting; new study designs and analysis methods that acknowledge the complex nature of EHR data; new approaches for model and variable selection in risk prediction studies; and new sensitivity analysis methods for settings where the data are missing not at random.
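
As a rough illustration of why missing EHR data can induce selection bias, and of the basic idea behind inverse-probability weighting, the following minimal sketch compares a complete-case mean with a stratified inverse-probability-weighted mean. It is not one of the project's methods. The toy records, the `visit_pattern` stratum, and the function names are all hypothetical, and the sketch assumes the outcome is missing at random within each stratum.

```python
from collections import defaultdict

# Hypothetical toy data: a lab value "y" is recorded for every frequent
# visitor but for only half of the infrequent visitors (None = missing).
records = [
    {"visit_pattern": "frequent", "y": 9.0},
    {"visit_pattern": "frequent", "y": 7.0},
    {"visit_pattern": "frequent", "y": 9.0},
    {"visit_pattern": "frequent", "y": 7.0},
    {"visit_pattern": "infrequent", "y": 5.0},
    {"visit_pattern": "infrequent", "y": 7.0},
    {"visit_pattern": "infrequent", "y": None},  # true value unobserved
    {"visit_pattern": "infrequent", "y": None},  # true value unobserved
]


def complete_case_mean(rows):
    """Mean of the outcome among rows where it was recorded."""
    observed = [r["y"] for r in rows if r["y"] is not None]
    return sum(observed) / len(observed)


def ipw_mean(rows, stratum_key):
    """Weighted mean using the inverse of the estimated probability
    that the outcome is observed within each stratum."""
    total = defaultdict(int)
    seen = defaultdict(int)
    for r in rows:
        total[r[stratum_key]] += 1
        if r["y"] is not None:
            seen[r[stratum_key]] += 1

    numerator = 0.0
    denominator = 0.0
    for r in rows:
        if r["y"] is None:
            continue
        stratum = r[stratum_key]
        weight = total[stratum] / seen[stratum]  # 1 / P(observed | stratum)
        numerator += weight * r["y"]
        denominator += weight
    return numerator / denominator


cc = complete_case_mean(records)
ipw = ipw_mean(records, "visit_pattern")
print(f"complete-case mean: {cc:.3f}")
print(f"IPW mean:           {ipw:.3f}")
```

In this toy example the complete-case mean over-represents frequent visitors, whereas the weighted mean gives each observed infrequent visitor a weight of 2 to stand in for an unobserved peer. The correction depends entirely on the missing-at-random assumption within strata, which is the kind of assumption the summary notes may not hold for real EHR data.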

Project Lead: Sebastien Haneuse