PQG Working Group Series

Each year, the PQG organizes a less formal PQG Working Group Series for all local students, postdocs, and faculty. The goal is to provide the opportunity to present and participate in the discussion of works-in-progress, and to focus on the methods and analysis of high-dimensional data in genetics and genomics.

2021/2022 Working Group Organizers: Karthik Jagadeesh & Eric Van Buren

Please direct any logistical questions to Amanda King.

Upcoming Working Group


All PQG working group meetings for the semester will be held via Zoom. The link to each meeting will be posted along with the talk information.

Tuesday, May 10, 2022
1:00-2:00 PM
Join Zoom meeting:
https://harvard.zoom.us/j/96354044407?pwd=aFNMcWVDUUtmUlBRRlJsZHNmWmhaQT09

Gang Li

Data Science Postdoctoral Fellow
University of Washington

Integration of Imaging and Sequencing Data in the Context of Visual Cell Sorting
Abstract: Visual cell sorting (VCS) is a single-cell co-assay that combines microscopy and high-throughput sequencing. The microscopy measures cell morphology and marks cells with phenotypes of interest, which enables sorting of cells based on visual phenotype. The subsequent sequencing step can be used to measure any one of a variety of cell characteristics, such as gene expression, chromatin accessibility, or chromatin 3D architecture. In the current VCS analysis pipeline, the imaging data is used primarily to generate discrete morphology labels. This approach does not fully exploit the rich information in the images. VCS can associate single-cell profiles with their morphological phenotypes, but the images and the single-cell profiles do not have a direct correspondence. To attempt to recover this correspondence, we developed a supervised version of the MMD-MA manifold alignment algorithm, with the goal of embedding the single-cell sequencing measurements and microscopy images into a shared manifold in such a way that two observations derived from the same cell are nearby in the embedded space. Such an embedding would be valuable because it would allow us to explicitly describe how changes in gene expression relate to specific changes in cell morphology. We utilized partial phenotype labels of cells to guide the embedding, and we used the labels of the remaining cells for evaluation. Our approach sheds light on how gene expression profiles interact with cell morphology.

 

2021-2022 Dates


September 28, 2021 - Rounak Dey, HSPH

Rounak Dey

Postdoctoral Research Fellow, Biostatistics
Harvard T.H. Chan School of Public Health

Scalable and accurate mixed effects model to account for relatedness and population structure in multi-ethnic PheWAS using sparse ancestry-adjusted genetic relatedness matrix

In genetic association studies, generalized linear mixed effects models (GLMMs) are commonly used to control for the relatedness among the samples by modelling the familial and cryptic relationships using a genetic relationship matrix (GRM). Existing GLMM methods use an empirical GRM to account for the sample-relatedness, which works well in the context of association analysis in studies with mostly homogeneous populations. However, they are not suitable to analyze recent multi-ethnic whole genome sequencing (WGS) studies with heterogeneous populations. A standard approach to adjust for the population stratification in multi-ethnic studies is to use the principal components (PCs) as fixed effects. Since the empirical GRM also contains the population structure information, using both the PCs and the empirical GRM in the model can lead to “double-fitting” of the population structure. Moreover, using the empirical GRM can lead to the mis-specification of the familial relationships due to the confounding effect of the population structure, which can potentially result in the loss of power and miscalibration of the type I error rates. Existing methods that rely on the sparsity of the empirical GRM also fail to work because of the lack of sparsity due to the population structure.
Here, we propose a scalable GLMM method for multi-ethnic studies that uses a sparse ancestry-adjusted GRM to model the sample-relatedness, and accounts for the population structure using the ancestry-informative principal components as fixed effects. By separating the distant ancestry and the familial relationships, our method provides a scalable and accurate solution to analyze large multi-ethnic studies, especially some of the recent WGS studies, which leads to accurate type I error control and improved power to detect associations. To facilitate the entire pipeline for the WGS data analysis, we further propose a scalable computation method to estimate the sparse ancestry-adjusted GRM using efficient distributed computation techniques, which can compute the sparse ancestry-adjusted GRM for the entire UK Biobank dataset of more than 450,000 subjects in less than nine hours using only 45 CPUs and 40 GB overall memory usage. Using numerical simulations and an application to the entire UK Biobank dataset, we demonstrate that our method scales to association analysis with more than 450,000 subjects, controls type I error, and improves power compared to the existing GLMM methods.

October 12, 2021 - Tiffany Amariuta, HSPH

Tiffany Amariuta

Postdoctoral Research Fellow, Genetic Epidemiology and Statistical Genetics
Harvard T.H. Chan School of Public Health

Modeling tissue co-regulation to infer tissue-specific contributions to disease

Despite abundant evidence of disease etiologies that span multiple tissues, quantifying tissue-specific contributions to disease heritability remains challenging. Previous work emphasized the potential of accounting for tissue co-regulation (Ongen et al. 2017 Nat Genet), but tissue-specific disease effects have not been formally modeled.

We introduce a new method, tissue co-regulation score regression (TCSC), that quantifies tissue-specific contributions to disease heritability by regressing transcriptome-wide association study (TWAS) gene-disease chi-square statistics on tissue co-regulation scores, across genes and tissues. TWAS statistics include direct effects of predicted cis-genetic components of gene expression on disease and tagging effects of co-regulated tissues (Wainberg et al. 2019 Nat Genet). TCSC distinguishes best proxy causal versus tagging gene-disease effects across tissues by modeling pairwise correlations of predicted gene expression between tissues (tissue co-regulation scores). In simulations, TCSC detects causal tissues with well-calibrated false positive rate across a broad range of parameter settings. At default settings, TCSC attained substantially higher power to detect causal tissues than the Ongen et al. method. TCSC also estimates the proportion of SNP-heritability explained by each tissue; estimates are conservative, as they exclude effects of genes with non-significant gene expression heritability at finite sample size.

We applied TCSC to 82 heritable complex traits and diseases from UK Biobank (average N = 299K), using gene expression prediction models constructed from GTEx data across 49 tissues to compute TWAS statistics and co-regulation scores. Below, we discuss tissues with non-zero heritability at 10% FDR for three representative traits: waist-hip-ratio (WHR), anorexia, and height. For WHR, TCSC correctly implicates adipose (29.9% of SNP-heritability explained), as WHR reflects the corporeal distribution of fat. For anorexia, TCSC specifically implicates the brain cortex (32.9% of SNP-heritability explained) out of five central nervous system regions, consistent with extensive previous work implicating this brain region (Haye et al. 2009 Nat Rev Neuro). For height, TCSC implicates five best proxy causal tissues, where the lead tissue, fibroblasts, explains 41.8% of SNP-heritability. This is consistent with the known role of connective tissue in growth regulation. Our method also reduced the number of trait-associated tissues within a tissue category by 67% compared to LDSC-SEG (Finucane et al. 2018 Nat Genet). In conclusion, TCSC is a powerful method for quantifying tissue-specific contributions to disease heritability.

November 9, 2021 - Tushar Kamath, Broad Institute

Tushar Kamath

PhD Candidate, Biophysics, Broad Institute
MD Student, Harvard Medical School

Vulnerabilities of midbrain dopaminergic neurons to Parkinson’s disease revealed by single-cell genomics

The loss of some dopamine (DA) neurons within the substantia nigra pars compacta (SNpc) is a defining pathological hallmark of Parkinson’s Disease (PD). Yet, the molecular features associated with DA neuron vulnerability have not yet been fully identified. To comprehensively characterize DA neuron types in the SNpc and their relative vulnerabilities to PD, we developed a protocol to enrich and transcriptionally profile thousands of midbrain DA neurons from PD patients and matched controls. We identified 10 populations and spatially localized each within the SNpc using Slide-seq, a high-resolution spatial transcriptomics technology. A single subtype, marked by the expression of the gene AGTR1 and spatially confined to the ventral tier of SNpc, was highly susceptible to loss and showed the strongest upregulation, relative to other DA types, of targets of TP53 and NR2F2, nominating molecular processes associated with degeneration in vivo. This same vulnerable population was specifically enriched for the heritable risk associated with sporadic PD. These analyses highlight the importance of cell-intrinsic pathways in determining the differential vulnerability of DA neurons to degeneration in PD.

December 14, 2021 - cancelled

February 15, 2022 - Martin Jinye Zhang, HSPH

Martin Jinye Zhang

Postdoctoral Researcher, Department of Epidemiology
Harvard T.H. Chan School of Public Health

Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data

Gene expression at the individual cell-level resolution, as quantified by single-cell RNA-sequencing (scRNA-seq), can provide unique insights into the pathology and cellular origin of diseases and complex traits. Here, we introduce single-cell Disease Relevance Score (scDRS), an approach that links scRNA-seq with polygenic risk of disease at individual cell resolution; scDRS identifies individual cells that show excess expression levels for genes in a disease-specific gene set constructed from GWAS data. We determined via simulations that scDRS is well-calibrated and powerful in identifying individual cells associated with disease. We applied scDRS to GWAS data from 74 diseases and complex traits (average N = 341K) in conjunction with 16 scRNA-seq data sets spanning 1.3 million cells from 31 tissues and organs. At the cell type level, scDRS broadly recapitulated known links between classical cell types and disease, and also produced novel biologically plausible findings. At the individual cell level, scDRS identified subpopulations of disease-associated cells that are not captured by existing cell type labels, including subpopulations of CD4+ T cells associated with inflammatory bowel disease, partially characterized by their effector-like states; subpopulations of hippocampal CA1 pyramidal neurons associated with schizophrenia, partially characterized by their spatial location at the proximal part of the hippocampal CA1 region; and subpopulations of hepatocytes associated with triglyceride levels, partially characterized by their higher ploidy levels. At the gene level, we determined that genes whose expression across individual cells was correlated with the scDRS score (thus reflecting co-expression with GWAS disease genes) were strongly enriched for gold-standard drug target and Mendelian disease genes.

March 8, 2022 - Haoyu Zhang, HSPH

Haoyu Zhang
Postdoctoral Fellow
Department of Biostatistics, Harvard T.H. Chan School of Public Health

Methods for risk prediction using integrative multi-ethnic genetic and genomic datasets

Polygenic risk scores (PRS) are becoming increasingly predictive of complex traits, but poorer performance in non-European populations raises concerns for clinical applications. We develop a powerful and scalable method for developing PRS using GWAS across diverse populations by combining multiple techniques, including LD-clumping, empirical-Bayes and machine learning. We evaluate the performance of the proposed method relative to a variety of alternatives using extensive simulation studies and 23andMe Inc. datasets for seven complex traits, including up to 800K individuals from non-European populations. Results show that the proposed method can substantially improve the performance of PRS in non-European populations relative to simple alternatives and can perform comparably to or better than more advanced methods that require a different order of computational time. Further, our simulation studies provide novel insight into sample size requirements and the effect of SNP density on multi-ethnic polygenic prediction.

April 5, 2022 - Ayshwarya Subramanian, Broad Institute

Ayshwarya Subramanian

Computational Biologist
Broad Institute

Comparative transcriptomics of mouse and human kidneys to study disease-altered cell states

Mouse models are a tool for studying the mechanisms underlying complex diseases; however, differences between species pose a significant challenge for translating findings to patients. Here, we used single-cell transcriptomics and orthogonal validation approaches to provide cross-species taxonomies, identifying shared broad cell classes and unique granular cellular states, between mouse and human kidney. We generated cell atlases of the diabetic and obese kidney using two different mouse models, a high-fat diet (HFD) model and a genetic model (BTBR ob/ob), at multiple time points along disease progression. Importantly, we identified a previously unrecognized, expanding Trem2high macrophage population in kidneys of HFD mice that matched human TREM2high macrophages in obese patients. Taken together, our cross-species comparison highlights shared immune and metabolic cell-state changes.

 

May 10, 2022 - Gang Li, University of Washington

Gang Li

Data Science Postdoctoral Fellow
University of Washington

Integration of Imaging and Sequencing Data in the Context of Visual Cell Sorting
Abstract: Visual cell sorting (VCS) is a single-cell co-assay that combines microscopy and high-throughput sequencing. The microscopy measures cell morphology and marks cells with phenotypes of interest, which enables sorting of cells based on visual phenotype. The subsequent sequencing step can be used to measure any one of a variety of cell characteristics, such as gene expression, chromatin accessibility, or chromatin 3D architecture. In the current VCS analysis pipeline, the imaging data is used primarily to generate discrete morphology labels. This approach does not fully exploit the rich information in the images. VCS can associate single-cell profiles with their morphological phenotypes, but the images and the single-cell profiles do not have a direct correspondence. To attempt to recover this correspondence, we developed a supervised version of the MMD-MA manifold alignment algorithm, with the goal of embedding the single-cell sequencing measurements and microscopy images into a shared manifold in such a way that two observations derived from the same cell are nearby in the embedded space. Such an embedding would be valuable because it would allow us to explicitly describe how changes in gene expression relate to specific changes in cell morphology. We utilized partial phenotype labels of cells to guide the embedding, and we used the labels of the remaining cells for evaluation. Our approach sheds light on how gene expression profiles interact with cell morphology.

PQG Working Group Archive