Research

Efficient Cross-Trait Penalized Regression Increases Prediction Accuracy in Large Cohorts using Secondary Phenotypes
We introduce cross-trait penalized regression (CTPR), a powerful and practical approach for multi-trait polygenic risk prediction in large cohorts. Specifically, we propose a novel cross-trait penalty function with the Lasso and the minimax concave penalty (MCP) to incorporate the shared genetic effects across multiple traits for large-sample GWAS data. Our approach extracts information from the secondary traits that is beneficial for predicting the primary trait based on individual-level genotypes and/or summary statistics. Our novel implementation of a parallel computing algorithm makes it feasible to apply our method to biobank-scale GWAS data. We illustrate our method using large-scale GWAS data (~1M SNPs) from the UK Biobank (N=456,837). We show that our multi-trait method outperforms the recently proposed multi-trait analysis of GWAS (MTAG) in predictive performance. With UK Biobank data, the prediction accuracy for height with the aid of BMI improves from R²=35.8% (MTAG) to 42.5% (MCP+CTPR) or 42.8% (Lasso+CTPR). (Chung et al. Nature Communications (2019))
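As a rough illustration of the two sparsity penalties named in the abstract, the sketch below evaluates the Lasso and MCP penalties for a vector of SNP effect sizes. It uses the standard textbook definitions of both penalties. The function names and the parameters `lam` and `gamma` are illustrative assumptions, and the cross-trait term that CTPR adds on top of these penalties is not shown because the abstract does not specify its form.

```python
def lasso_penalty(beta, lam):
    """Lasso (L1) penalty for one coefficient: lam * |beta|."""
    return lam * abs(beta)


def mcp_penalty(beta, lam, gamma):
    """Minimax concave penalty (MCP) for one coefficient.

    Grows like the Lasso near zero, then flattens to the constant
    gamma * lam**2 / 2 once |beta| exceeds gamma * lam.
    """
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def sparsity_penalty(betas, lam, gamma=None):
    """Sum the per-coefficient penalty; gamma=None selects the Lasso."""
    if gamma is None:
        return sum(lasso_penalty(b, lam) for b in betas)
    return sum(mcp_penalty(b, lam, gamma) for b in betas)
```

Because MCP levels off for large coefficients, it penalizes strong effects less than the Lasso does, which is the usual motivation for offering both options.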

A Genome-Wide Cross-Trait Analysis from UK Biobank Highlights the Shared Genetic Architecture of Asthma and Allergic Diseases
Clinical and epidemiological data suggest that asthma and allergic diseases are associated and may share a common genetic etiology. We analyzed genome-wide SNP data for asthma and allergic diseases in 33,593 cases and 76,768 controls of European ancestry from UK Biobank. Two publicly available independent genome-wide association studies were used for replication. We have found a strong genome-wide genetic correlation between asthma and allergic diseases (rg = 0.75, P = 6.84 × 10⁻⁶²). Cross-trait analysis identified 38 genome-wide significant loci, including 7 novel shared loci. Computational analysis showed that shared genetic loci are enriched in immune/inflammatory systems and tissues with epithelium cells. Our work identifies common genetic architectures shared between asthma and allergy and will help to advance understanding of the molecular mechanisms underlying co-morbid asthma and allergic diseases. (Zhu et al. Nature Genetics (2018))

Integrative Approaches for Large-scale Transcriptome-Wide Association Studies
Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance of one or multiple proteins. Here we introduce a powerful strategy that integrates gene expression measurements with summary association statistics from large-scale genome-wide association studies (GWAS) to identify genes whose cis-regulated expression is associated with complex traits. We leverage expression imputation from genetic data to perform a transcriptome-wide association study (TWAS) to identify significant expression-trait associations. We applied our approaches to expression data from blood and adipose tissue measured in ~3,000 individuals overall. We imputed gene expression into GWAS data from over 900,000 phenotype measurements to identify 69 new genes significantly associated with obesity-related traits (BMI, lipids and height). Many of these genes are associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits. (Gusev et al. Nature Genetics (2016))
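To make the summary-statistic side of this strategy concrete, the sketch below combines per-SNP GWAS z-scores into one gene-level z-score using expression-prediction weights and an LD (SNP correlation) matrix. This is one common form of a summary-based TWAS test, z = w'z_gwas / sqrt(w' Σ w). The function name, argument names, and toy inputs are illustrative assumptions, not the published software's interface.

```python
import math


def twas_z(weights, gwas_z, ld):
    """Gene-level z-score from SNP-level GWAS z-scores.

    weights: expression-prediction weight for each cis SNP
    gwas_z:  GWAS z-score for each SNP, in the same order
    ld:      SNP-by-SNP correlation (LD) matrix as nested lists
    """
    n = len(weights)
    # Weighted sum of GWAS z-scores (the imputed-expression association).
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    # Variance of that weighted sum under the null, given LD between SNPs.
    variance = sum(
        weights[i] * ld[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    return numerator / math.sqrt(variance)
```

For example, with two uncorrelated SNPs, `twas_z([1.0, 1.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])` gives about 1.41, whereas two SNPs in perfect LD give exactly 1.0 because the second SNP carries no independent evidence.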

Heritability and Genomics of Gene Expression in Peripheral Blood
We assessed gene expression profiles in 2,752 twins, using a classic twin design and careful control of measurement variation to quantify expression heritability and quantitative trait loci (eQTL) in peripheral blood. The most highly heritable genes (~777) grouped into distinct expression clusters, tended to reside in gene-poor regions, were associated with specific gene function/ontology classes, and were strongly associated with disease designation from OMIM and the NHGRI GWAS catalog. The design enabled a direct comparison of classic twin-based heritability to estimates based on IBD sharing in dizygotic twins and to the analysis based on genetic relatedness of unrelated individuals. A consideration of sampling variation in expression heritability estimates suggests that previous estimates have been upwardly biased, providing an imprecise view of genetic transcription control. Genotyping of 2,494 of the twins enabled highly definitive identification of eQTLs, which were further examined in a replication set of 1,895 independent subjects. Of 7,200 highly significant local eQTLs from twins, 6,988 (97.1%) replicated, while of 348 distant eQTLs meeting strict quality control standards, 165 (47.4%) replicated. Data subsampling suggests that numerous weaker eQTLs, especially distant eQTLs, remain undiscovered, but our data did not support previous findings that distant eQTLs influence a large number of transcripts. Our results and eQTLs provide an important new resource toward understanding the genetic control of transcription. (Wright et al. Nature Genetics (2014))

Mixed Effects Models for GAW18 Longitudinal Blood Pressure Data
In this paper, we propose two mixed effects models for Genetic Analysis Workshop 18 (GAW18) longitudinal blood pressure data. The first method extends EMMA, an efficient mixed-model association mapping algorithm. EMMA corrects for population structure and genetic relatedness using a kinship similarity matrix. We replace the kinship similarity matrix in EMMA with an estimated correlation matrix for modeling the dependence structure of repeated measurements. Our second approach is a Bayesian multiple association mapping algorithm based on a mixed effects model with a built-in variable selection feature. It models multiple single-nucleotide polymorphisms (SNPs) simultaneously and allows for SNP-SNP interactions and SNP-environment interactions. We applied these two methods to GAW18 longitudinal systolic blood pressure (SBP) and diastolic blood pressure (DBP) data. The extended EMMA method identified a single SNP on Chr5:75506197 for SBP, and three SNPs on Chr3:23715851, Chr17:54834217 and Chr21:18744081 for DBP. The Bayesian method identified several additional SNPs on Chr1:17876090, Chr3:197469358, Chr15:87675666 and Chr19:41642807 for SBP. Furthermore, for SBP, we found a single SNP on Chr3:197469358 that has a strong interaction with age. We further evaluated the performances of the proposed methods by simulations. (Chung et al. BMC Proceedings (2014))

Bayesian Parametric and Nonparametric Methods for Multiple QTL Mapping and SNP-Set Analysis
Many complex traits and human diseases, such as blood pressure and body weight, are known to change over time. The genetic basis of such traits can be better understood by repeatedly collecting data over time. The resulting longitudinal data provide useful resources for studying the joint action of multiple time-dependent genetic factors. In the first part of the dissertation, we extend two existing Bayesian multiple quantitative trait loci (QTL) mapping methods from univariate traits to longitudinal traits. Our first approach focuses on mapping genes with main effects and two-way gene-gene and gene-environment interactions. Multiple QTL are selected by a variable selection procedure based on the composite model space framework. Our second approach presents a Bayesian Gaussian process method to map multiple QTL without restricting to pairwise interactions. Rather than modeling each main and interaction term explicitly, the nonparametric Bayesian method measures the importance of each QTL, regardless of whether it is mostly due to a main effect or some interaction effect(s), via an unspecified function. We assign a Gaussian process prior to this unknown function. For the unstructured covariance matrix, both approaches employ a modified Cholesky decomposition. For data where phenotype measurements are not collected at a fixed set of time points across all samples, we propose a grid-based approach which parsimoniously approximates the covariance matrix of each subject as a function of a covariance matrix defined on a set of pre-selected time points.
For most genome-wide association studies (GWAS), power to detect an association between a single genetic variant, such as a single nucleotide polymorphism (SNP), and a complex trait is extremely low. Alternative strategies, such as regional SNP-set analysis, have overcome some of the limitations of the standard single SNP analysis. Our third topic develops a Bayesian regional SNP-set analysis which extends the nonparametric Gaussian process model and simultaneously models multiple groups of rare and/or common SNP variants. Instead of assigning each SNP a hyperparameter, we assign a common hyperparameter to every SNP within each set to measure the cumulative effect of all SNPs in that set. (Chung et al. ProQuest (2013))
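The modified Cholesky decomposition mentioned in this abstract can be sketched as follows. For a covariance matrix Σ of repeated measurements, it finds a unit lower-triangular matrix T and a diagonal matrix D with T Σ T' = D. The below-diagonal entries of T are the negatives of the coefficients from regressing each measurement on its predecessors, and D holds the residual (innovation) variances. This is a minimal pure-Python sketch of the generic decomposition with hypothetical function names. How the dissertation places priors on these quantities is not described in the abstract and is not shown.

```python
def ldl_decompose(sigma):
    """Factor a symmetric positive-definite matrix as L * diag(d) * L'.

    Returns (L, d) where L is unit lower triangular (nested lists)
    and d is the list of diagonal entries.
    """
    n = len(sigma)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = sigma[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (sigma[i][j] - s) / d[j]
    return L, d


def modified_cholesky(sigma):
    """Return (T, d) with T unit lower triangular and T * sigma * T' = diag(d)."""
    L, d = ldl_decompose(sigma)
    n = len(sigma)
    # T is the inverse of the unit lower-triangular L (forward substitution).
    T = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i):
            T[i][j] = -sum(L[i][k] * T[k][j] for k in range(j, i))
    return T, d
```

One reason this parameterization is convenient for unstructured covariance matrices is that the entries of T are unconstrained and the entries of D only need to be positive, so any such values yield a valid covariance matrix.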