The Lin Lab is at the forefront of developing and applying scalable statistical and machine learning methods for analyzing large datasets encompassing the genome, exome, exposome, and phenome. Our research spans diverse areas, including big and complex genetic and genomic data studies, variant functional annotations, gene-environment interactions, multi-phenotype analysis, polygenic risk prediction, and heritability estimation. Additionally, we explore integrative analysis of various data types, Mendelian randomization, causal mediation analysis, federated and transfer learning, single-cell genomics, complex observational studies, and the analysis of COVID-19 epidemic data.


[Image: STAAR flowchart]

The Lin Lab recently developed methods for functionally informed rare variant association testing in whole-genome sequencing (WGS) and biobank datasets (STAAR, STAARPipeline), including meta-analysis (metaSTAAR) and multi-trait analysis (multiSTAAR). Current extensions to this work include integrating single-cell sequencing data into such analyses to boost statistical power and interpretation (cellSTAAR), as well as methods for analyzing gene-environment interactions. As part of these efforts, the FAVOR database was created to provide comprehensive genome-wide annotations. We also helped develop improved methods for polygenic risk scores in diverse populations (CT-SLEB). The Lin Lab plays an instrumental role in advancing large-scale whole-genome and whole-exome analysis through its contributions to innovative genetic-based ethnicity prediction methods and rigorous quality control protocols.

The Lin Lab is also dedicated to researching the genetic, environmental, and lifestyle factors that contribute to lung cancer. We are actively involved in developing innovative methods to identify and interpret crucial genetic variants relevant to squamous cell lung cancer, lung adenocarcinoma, and non-small cell lung cancers (NSCLCs), drawing on multiomic annotation and smoking history.

Dr. Xihong Lin

Dr. Lin’s theoretical and computational statistical research includes statistical methods for testing a large number of complex hypotheses, causal inference, statistical and machine learning methods for large matrices, prediction models using high-dimensional data, federated and transfer learning, cloud-based statistical computing, mixed models, nonparametric and semiparametric regression, and statistical methods for epidemiological studies.

Dr. Lin’s statistical methodological research has been supported by the MERIT Award (R37) (2007-2015), the Outstanding Investigator Award (OIA) (R35) (2015-2029) from the National Cancer Institute (NCI), and an R01 grant from the National Heart, Lung, and Blood Institute. She is a multiple PI of a Predictive Modeling Center of the Impact of Genomic Variation on Function (IGVF) Program of the National Human Genome Research Institute (NHGRI), and a multiple PI of the U19 grant on Integrative Analysis of Lung Cancer Etiology and Risk from NCI. She is also the contact PI of the T32 training grant on interdisciplinary training in statistical genetics and computational biology. She is the former contact PI of the Program Project (P01) on Statistical Informatics in Cancer Research from NCI, and the former contact PI of the Harvard Analysis Center (U19) of the Genome Sequencing Program of NHGRI.

Dr. Lin was active in the early phase of the COVID-19 pandemic. She is one of the corresponding authors of the JAMA and Nature papers analyzing the Wuhan COVID-19 data on transmission, public health interventions, and epidemiological characteristics. She is the senior author of the 2021 Journal of the American Statistical Association Discussion paper on modeling COVID transmission dynamics in the US. In Spring 2020, Dr. Lin served on the State of Massachusetts COVID-19 Task Force and testified before the UK Parliament’s Committee of Science and Technology on COVID responses.

Dr. Lin was elected to the National Academy of Medicine in 2018 and the National Academy of Sciences in 2023. She received the 2002 Mortimer Spiegelman Award from the American Public Health Association, the 2006 Committee of Presidents of Statistical Societies (COPSS) Presidents’ Award, the 2017 COPSS F.N. David Award, the 2008 Janet L. Norwood Award for Outstanding Achievement of a Woman in Statistics, the 2022 National Institute of Statistical Sciences Jerome Sacks Award for Outstanding Cross-Disciplinary Research, and the 2022 Marvin Zelen Leadership in Statistical Science Award. She is an elected fellow of the American Statistical Association (ASA), the Institute of Mathematical Statistics, and the International Statistical Institute.

Dr. Lin is the former Chair of the Committee of Presidents of Statistical Societies (COPSS) (2010-2012) and a former member of the Committee on Applied and Theoretical Statistics (CATS) of the National Academy of Sciences. She is the founding chair of the US Biostatistics Department Chair Group, and the founding co-chair of the Young Researcher Workshop of the Eastern North American Region (ENAR) of the International Biometric Society. She co-launched the Section of Statistical Genetics and Genomics of the American Statistical Association and served as section chair. She is the former Coordinating Editor of Biometrics and the founding co-editor of Statistics in Biosciences. She has served on a large number of committees of many statistical societies, numerous NIH and NSF review panels, and several National Academies committees.

Areas of Interest

  • WGS association studies: UK Biobank, TOPMed, GSP, All of Us, etc.
  • Integration of single-cell & multi-omics data in WGS analysis
  • Prioritizing causal variants with functional annotations (IGVF consortium)
  • Single-cell RNA-sequencing & functional annotation tool development
  • Quality control for large-scale WGS/WES data & rare variant analysis
  • Scalable methods for polygenic risk score construction & improving risk prediction accuracy
  • Lung cancer epidemiology (ILCCO), cardiovascular diseases, & sleep apnea research
  • Statistical genetics/genomics, causal inference, and Mendelian randomization
  • Pathway/network analysis and integrative data analysis
  • Focus on common diseases, genes, environment, epigenetics
  • Nonparametric/semiparametric regression, mixed models, correlated data analysis
  • Measurement error in genetic epidemiology/environmental genetics/genomics research