PQG Working Group Series


Each year, the PQG organizes a less formal PQG Working Group Series for all local students, postdocs, and faculty. The goal is to provide the opportunity to present and participate in the discussion of works-in-progress, and to focus on the methods and analysis of high-dimensional data in genetics and genomics.

2020/2021 Working Group Organizers: Corbin Quick & Wei Zhou

Please direct any logistical questions to Amanda King.

Upcoming Working Group


All PQG working group meetings for the semester will be held via Zoom. The link to each meeting will be posted along with the talk information.

Tuesday, May 11, 2021
1:00-2:00 PM
Join Zoom meeting:
https://harvard.zoom.us/meeting/register/tJ0pc-iopzwiGtxtoI-KPHc9PUL_c_prozVm

Brian Cleary

Broad Fellow
Broad Institute of MIT and Harvard

Using viral loads and within-host models to improve COVID-19 surveillance

Limited testing capacity has been an ongoing problem throughout the COVID-19 pandemic. Pooled testing is a faster and less expensive diagnostic approach compared to individual testing, but there are important tradeoffs in sensitivity, efficiency and logistics to consider. In this talk, we outline an approach combining within-host hierarchical models, compartmental models of pathogen spread, and viral load data to identify optimized pooling protocols that are robust to the changing state of an epidemic. This study highlights the importance of considering within-host biology and individual-level heterogeneity when evaluating epidemiological surveillance strategies. We will also introduce HYPER, a new pooled testing method based on hypergraph factorization. HYPER is designed to be easy to implement and adapt, while also producing pools that are balanced and efficient. We will discuss what hypergraph factorizations are and how they generate the pooling designs used in HYPER.

 

2020-2021 Dates


September 29, 2020 - Zilin Li, HSPH

Zilin Li

Postdoctoral Fellow
Harvard T. H. Chan School of Public Health

A Framework for Detecting Noncoding Associations in Large Whole Genome Sequencing Studies at Scale

Compared with GWAS and whole exome sequencing studies, large-scale whole genome sequencing studies have enabled the analysis of non-coding rare variants (RVs) associated with complex human traits. Common analytic strategies for RV association in non-coding regions have considered only limited choices of gene-centric masks and sliding windows of a fixed length, and have limited scope to leverage the functions of variants.

We propose a non-coding rare variant association detection framework, including gene-centric analysis and genetic region analysis. For gene-centric analysis, we consider various strategies for grouping non-coding variants based on functional annotations, including UTR, upstream, downstream, promoter, enhancer and long non-coding RNA genes. For genetic region analysis, we group non-coding RVs residing in a contiguous window, defined either by a pre-specified (fixed) window size or a flexible data-adaptive window size using SCANG (SCAN the Genome). The framework also applies the STAAR (variant-Set Test for Association using Annotation infoRmation) method, which increases the power of RV association tests by effectively incorporating multiple functional annotations.

We applied the proposed framework to analyze non-coding RV association with four quantitative lipid traits (LDL-C, HDL-C, TG and TC) in 21,015 discovery samples and 9,123 replication samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. Several novel non-coding RV-sets associated with lipids were discovered and replicated using the TOPMed WGS data.

October 13, 2020 - Masahiro Kanai, Harvard Medical School, Analytical and Translational Genetics Unit, Mass General Hospital and Broad Institute

Masahiro Kanai

PhD Candidate
Harvard Medical School, Analytical and Translational Genetics Unit, Mass General Hospital, Broad Institute

Insights into fine-mapping causal variants of complex traits from diverse populations

Identifying causal variants for complex traits is one of the major challenges in human genetics. The causal variants in most GWAS loci remain unknown due to lack of power and to high linkage disequilibrium (LD) in a locus. Moreover, little is known about how causal variants are shared across populations due to lack of large-scale GWAS from diverse populations.

Here, we present a cross-population analysis of fine-mapping results based on three large-scale biobanks. In parallel with our effort to fine-map complex traits in UK Biobank (n = 361,194; UKB) and expression QTL (eQTL) in GTEx (n = 838) (Ulirsch & Kanai et al), we fine-mapped hundreds of complex traits and diseases from Biobank Japan (n = 180,987; BBJ) and FinnGen (n = 183,694) using FINEMAP (Benner et al, 2016) and SuSiE (Wang et al, 2018). In total, 4,151 high-confidence putative causal variants for 124 traits were identified (posterior inclusion probability [PIP] > 0.9 in any population), including 46 and 66 population-enriched variants from BBJ and FinnGen, respectively. Distinct coding variants from each population often fine-mapped together in the same exons, and we found that coding putative causal variants are more deleterious (OR = 10.4 and 5.9 for pLoF and missense vs synonymous) and more pathogenic (OR = 28.3 for ClinVar variants) than other coding variants. Furthermore, we observed that non-coding putative causal variants are strongly enriched for promoters and cis-regulatory regions (accessible chromatin and H3K27ac) (OR = 10.8 and 11.3 vs non-genic) and colocalized with fine-mapped eQTL variants in GTEx, suggesting that the majority of putative causal variants could be explained via coding or regulatory mechanisms. Altogether, we demonstrate how studying diverse populations yields additional insights into disease biology with an expanded atlas of candidate causal variants.

Despite high trans-ethnic genetic correlation, we found most single-population fine-mapped variants are undiscoverable across populations; only 8% of the variants with PIP > 0.9 were identified in 95% credible sets in other populations. This inconsistency is mainly due to lack of power in other populations or to LD complexity, with a minority unexplained. For example, among 2,483 fine-mapped variants with PIP > 0.9 in UKB, 53% are missing in BBJ because they are rare or monomorphic, 35% have lower power for association due to MAF and sample size, and 2% have higher LD complexity based on empirical predicted PIP analysis. Overall, our analysis gives insights into how to interpret fine-mapping results from multiple populations and emphasizes the desperate need for more diversity in human genetics.

November 10, 2020 - Carles Boix, MIT

Carles Boix

PhD Candidate
MIT

Regulatory genomic circuitry of human disease loci by integrative epigenomics

Annotating the molecular basis of human disease remains an unsolved challenge, as 93% of disease loci are non-coding, and gene-regulatory annotations are highly incomplete. Here, we present EpiMap, a compendium of 10,000 epigenomic maps across 800 samples, which we use to define chromatin states, high-resolution enhancers, enhancer modules, upstream regulators, and downstream target genes. We use this resource to annotate 30,000 genetic loci associated with 540 traits, predicting trait-relevant tissues, putative causal nucleotide variants in enriched-tissue enhancers, and candidate tissue-specific target genes for each. We partition multifactorial traits into tissue-specific contributing factors with distinct functional enrichments and disease-comorbidity patterns, and reveal both single-factor monotropic and multi-factor pleiotropic loci. Top-scoring loci frequently have multiple predicted driver variants, converging through multiple common-target enhancers, multiple common-tissue genes, or multiple genes/tissues, indicating extensive pleiotropy. Our results demonstrate the importance of dense, rich, high-resolution epigenomic annotations for complex trait dissection.

December 15, 2020 - Xihao Li, HSPH

Xihao Li

PhD Candidate
Harvard T. H. Chan School of Public Health

Powerful and resource-efficient rare variant meta-analysis for large-scale whole genome sequencing studies using summary statistics and functional annotations, with application to TOPMed lipid data

Large-scale whole genome sequencing (WGS) studies have enabled the analysis of rare variants (RVs) associated with complex human traits. Existing RV meta-analysis approaches are not scalable when applied to WGS data. We propose MetaSTAAR (Meta-analysis of variant-Set Test for Association using Annotation infoRmation), a powerful and resource-efficient rare variant meta-analysis framework, for large-scale whole genome sequencing association studies. MetaSTAAR accounts for population structure and relatedness for both continuous and dichotomous traits by fitting generalized linear mixed models using sparse genetic relatedness matrices. By storing LD information of RVs in sparse matrix format, the proposed workflow is highly storage efficient and computationally scalable for analyzing large-scale WGS data. Furthermore, the proposed meta-analysis framework builds upon the STAAR method, which dynamically incorporates multiple functional annotations to empower rare variant association analysis and allows for RV-set analysis, including gene-centric analysis by grouping variants into functional categories for each gene and genetic region analysis using sliding windows. MetaSTAAR also enables conditional analyses to identify RV-set signals independent of nearby common variants. We applied MetaSTAAR to identify RV-sets associated with four quantitative lipid traits (LDL-C, HDL-C, TG and TC) in 30,138 related samples from the NHLBI Trans-Omics for Precision Medicine program Freeze 5 data, consisting of 14 ancestrally diverse study cohorts and 255 million variants in total. MetaSTAAR requires 520 GB to store the summary statistics and LD matrices across the whole genome, which is at least 100 times smaller than the storage required by the existing method RAREMETAL. In addition, computation is benchmarked to be 100 times faster than with RAREMETAL.
Compared to the joint analysis of pooled individual-level data using STAAR, the P-values from MetaSTAAR and STAAR are highly consistent, with correlation > 0.99 among significant regions in both unconditional and conditional analyses.

February 16, 2021 - Xuefang Zhao, HMS / MGH

Xuefang Zhao 

Postdoctoral Researcher
Harvard Medical School, Center for Genomic Medicine at Massachusetts General Hospital

March 9, 2021 - Kumar Veerapen, HMS, ATGU, MGH and Broad Institute

Kumar Veerapen

Research Fellow
Analytic and Translational Genetics Unit (ATGU) at Massachusetts General Hospital, the Broad Institute, and Harvard Medical School

Assessing the Genetic Contribution of Drug Response Variability in Selective Serotonin Reuptake Inhibitors From More than 2 million Purchases in the Finnish National Drug Registry   

Drug use and response are highly variable between individuals; for example, up to 65% of patients with major depressive disorder (MDD) do not respond to selective serotonin reuptake inhibitors (SSRI, ATC code: N06AB). Moreover, the molecular mechanisms underlying interindividual differences in SSRI response are poorly understood, but the variability is largely genetic. We hypothesize that large scale biobanks can be used to understand the genetic basis of SSRI response variability. We utilized the FinnGen integrated biobank data to interrogate SSRI use and response variability. The FinnGen biobank aims to recruit a total of 500,000 Finnish individuals who will be genotyped and linked with lifetime health and drug data from the national Finnish registries. Currently, 218,792 individuals have complete genetic and phenotypic data, of whom 51,519 have a longitudinal SSRI purchase history. We aim to understand genetic differences contributing to the variability in SSRI usage and response in FinnGen. Each genetic association was assessed using mixed models, and significant loci were determined using a p < 5×10⁻⁸ threshold. SSRI use and MDD are highly correlated in FinnGen (Rg = 0.73), pointing towards a genetic basis for SSRI response variability. From a total of 1,711,695 antidepressant purchases, 853,286 (49.9%) were SSRI purchases, made by 51,519 individuals. We first compared the SSRI users (N=51,519) with non-users (N=159,788) and identified 3 significant loci, including a signal in HLA-A (chr6:29978172:G:A) located near a locus previously reported in a recent MDD GWAS (r² = 0.39). An additional signal in the PHF21A (chr11:46035522:T:C) locus could point towards functional relevance: PHF21A has been shown to impair serotonin metabolism in mouse models.
After confirming that drug usage data identifies known MDD loci, we analyzed the genetic differences in SSRI adherence: 18,691 individuals consumed only SSRI and 13,325 individuals switched from an SSRI to any other antidepressant. We identified 1 significant locus in HLA-DQB1-AS1 (chr6:32658495:C:A), which has been associated with immune and neurological phenotypes (FinnGen PheWAS). Finally, we attempted to understand genetic differences contributing to longitudinal SSRI use by comparing individuals who had no change in dosage over time with those who had an increase. We identified a significant polygenic risk load in patients with increased dosage (p < 0.05) and 1 significant locus in PTDSS2 (chr11:476394:G:A) (p < 5×10⁻⁸). The protein PTDSS2 has been shown to have a high affinity for docosahexaenoic acid (DHA), which has been used clinically to augment SSRI efficacy.
Using FinnGen as a large scale biobank to understand the genetic contribution to SSRI usage and response variability has proven promising. Future studies will aim to further understand the polygenic risk contributing to longitudinal drug use, dosage patterns, and potential adverse outcomes. Results of this study can be used to improve prescription accuracy and treatment of MDD.

April 6, 2021 - Huwenbo Shi & Martin Zhang, HSPH

Huwenbo Shi and Martin Zhang

Postdoctoral Research Fellows, Department of Epidemiology
Harvard T.H. Chan School of Public Health

Transcriptome-wide association study and fine-mapping at cell-type resolution  

Transcriptome-wide association studies (TWAS) using cis predicted gene expression have identified thousands of genes associated with diseases and complex traits (Gamazon et al. 2015 Nature Genetics, Gusev et al. 2016 Nature Genetics, Zhu et al. 2016 Nature Genetics). However, existing TWAS are based on gene expression models for bulk tissue gene expression, and therefore cannot pinpoint the specific cell types through which a gene is associated with diseases and complex traits.

Here, we introduce an approach to perform TWAS at tissue-cell-type resolution. We apply computational methods to infer cell type proportions and cell-type specific gene expression for each GTEx sample, using the mouse reference single-cell gene expression from the Tabula Muris Consortium in matched tissues, and obtain gene expression models for every tissue-cell-type based on the inferred gene expression.

We performed tissue-cell-type TWAS and fine-mapping for 52 diseases and complex traits. Compared to TWAS at tissue resolution, our tissue-cell-type TWAS discovered on average 84% more unique gene-trait associations at FDR < 0.05. Our approach also pinpoints specific cell types for each gene-trait association. For example, we found that the SORT1-LDL association is highly significant (P < 10⁻³⁰) in liver hepatocytes with fine-mapped PIP = 0.68, as is the CACNA1C-schizophrenia association in brain cerebellum neuronal cells (P = 2.35×10⁻¹⁰, PIP = 1.0). In addition, in a systematic comparison with known disease-related genes, the fine-mapped tissue-cell-type TWAS results showed increased enrichment relative to tissue TWAS in 16/19 sets of comparisons (P = 0.003).

May 11, 2021 - Brian Cleary, Broad Institute of MIT and Harvard

PQG Working Group Archive