PQG Working Group
September 28 @ 1:00 pm - 2:00 pm
Postdoctoral Research Fellow, Biostatistics
Harvard T.H. Chan School of Public Health
Scalable and accurate mixed effects model to account for relatedness and population structure in multi-ethnic PheWAS using sparse ancestry-adjusted genetic relatedness matrix
In genetic association studies, generalized linear mixed effects models (GLMMs) are commonly used to control for relatedness among samples by modelling familial and cryptic relationships with a genetic relationship matrix (GRM). Existing GLMM methods use an empirical GRM to account for sample relatedness, which works well for association analysis in studies of mostly homogeneous populations. However, these methods are not suitable for analyzing recent multi-ethnic whole genome sequencing (WGS) studies with heterogeneous populations. A standard approach to adjusting for population stratification in multi-ethnic studies is to include the principal components (PCs) as fixed effects. Since the empirical GRM also contains population structure information, using both the PCs and the empirical GRM in the model can lead to “double-fitting” of the population structure. Moreover, because population structure confounds the empirical GRM, using it can mis-specify the familial relationships, which can potentially result in loss of power and miscalibrated type I error rates. Existing methods that rely on the sparsity of the empirical GRM also break down, because population structure makes the empirical GRM dense.
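To make the double-fitting concern concrete, the model can be written in a generic form. The notation below is an assumption introduced for illustration and does not come from the abstract: $g$ is a link function, $X_i$ holds covariates including the PCs, $G_i$ is the tested genotype, $b$ is the random effect, $\Phi$ is the GRM, and $z_s$ is the vector of standardized genotypes at variant $s$ out of $M$ variants.

```latex
\begin{align}
  g(\mu_i) &= X_i \alpha + G_i \beta + b_i, \\
  b &\sim N(0, \tau \Phi), \\
  \Phi_{\text{empirical}} &= \frac{1}{M} \sum_{s=1}^{M} z_s z_s^{\top}.
\end{align}
```

When the PCs are computed from the same standardized genotypes, they are the leading eigenvectors of $\Phi_{\text{empirical}}$, so the same population structure enters the model once through $X_i \alpha$ and again through $b$.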
Here, we propose a scalable GLMM method for multi-ethnic studies that uses a sparse ancestry-adjusted GRM to model sample relatedness and accounts for population structure by including the ancestry-informative principal components as fixed effects. By separating distant ancestry from familial relationships, our method provides a scalable and accurate way to analyze large multi-ethnic studies, especially some of the recent WGS studies, with accurate type I error control and improved power to detect associations. To support the entire pipeline for WGS data analysis, we further propose a scalable method for estimating the sparse ancestry-adjusted GRM using efficient distributed computation techniques. It can compute the sparse ancestry-adjusted GRM for the entire UK Biobank dataset of more than 450,000 subjects in less than nine hours using only 45 CPUs and 40 GB of memory overall. Using numerical simulations and an application to the entire UK Biobank dataset, we demonstrate that our method scales to association analyses with more than 450,000 subjects, controls type I error, and improves power compared to existing GLMM methods.
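The idea of an ancestry-adjusted sparse GRM can be illustrated with a small, dense sketch. This is not the speaker's algorithm, which the abstract does not spell out. It assumes a PC-Relate-style estimator in which each variant is regressed on the PCs to obtain individual-specific allele frequencies, after which small entries are zeroed. The function name, the `threshold` parameter, and the clipping bounds are hypothetical choices for illustration, and forming the full n-by-n matrix as done here would not scale to biobank-sized data.

```python
import numpy as np


def ancestry_adjusted_sparse_grm(G, pcs, threshold=0.05):
    """Illustrative PC-Relate-style estimator (an assumption, not the talk's method).

    G         : (n_samples, n_snps) array of genotype dosages in {0, 1, 2}
    pcs       : (n_samples, k) array of ancestry principal components
    threshold : hypothetical cutoff; entries with |value| below it are dropped

    Returns (rows, cols, vals), the nonzero entries in COO triplet form.
    """
    n = G.shape[0]
    # Design matrix: intercept plus ancestry PCs.
    X = np.column_stack([np.ones(n), pcs])
    # Regress each variant on the PCs to get individual-specific
    # allele frequencies mu[i, s], an estimate of E[G[i, s] / 2 | PCs].
    beta, *_ = np.linalg.lstsq(X, G / 2.0, rcond=None)
    # Clip so the variance term below stays positive (bounds are arbitrary).
    mu = np.clip(X @ beta, 0.01, 0.99)
    # Residual genotype after removing the ancestry-predicted mean.
    resid = G - 2.0 * mu
    sd = np.sqrt(mu * (1.0 - mu))
    # Kinship: sum of residual cross-products over 4 * sum of sd products.
    kinship = (resid @ resid.T) / (4.0 * (sd @ sd.T))
    grm = 2.0 * kinship
    # Sparsify: keep only entries that indicate close relatedness.
    grm[np.abs(grm) < threshold] = 0.0
    rows, cols = np.nonzero(grm)
    return rows, cols, grm[rows, cols]
```

The returned triplets can be passed to a sparse-matrix constructor such as `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))`. Because the ancestry-predicted allele frequencies absorb the population-level differences, pairs of unrelated individuals from the same population should fall below the threshold, which is what keeps the matrix sparse.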