Working Group Archive

Each year, the PQG organizes a less formal PQG Working Group Series for all local students, postdocs, and faculty. The goal is to provide the opportunity to present and participate in the discussion of works-in-progress, and to focus on the methods and analysis of high-dimensional data in genetics and genomics.


May 9, 2023

Wenhan Lu
Analytic and Translational Genetics Unit
HMS, MGH, Broad Institute
Quantifying the extent of pleiotropy using rare variant association data in 394,841 human exomes
Genome-wide association studies (GWAS) and rare-variant association studies (RVAS) have identified thousands of genes and variants that affect multiple phenotypes, implicating potential pleiotropic effects. However, it still needs to be determined through what biological avenue such genetic components engender these cross-phenotype associations. Characterizing the underlying genetic architecture of complex phenotypes sharing associations with the same genetic component is crucial in interpreting the functions of genes and promoting our understanding of human biology. Previous studies on existing GWAS results have provided evidence of the pervasive existence of pleiotropy across the human genome, which is well-characterized for common variants but not systematically explored for rare variants yet.We use the rare variant association data across 4,529 phenotypes from the exome sequence data of 394,841 individuals in the UK Biobank to assess the extent of pleiotropy among rare variants. Utilizing both the variant-level and the aggregated gene-based test results, we observe a widespread pleiotropic structure across human exomes through a comprehensive survey of cross-phenotype associations at the individual phenotype level as well as phenotypic domain level. We also investigate pleiotropy as an allelic series, where variants from the same gene with different functional impacts may exhibit diverse phenotypic effects; more specifically, variants of the pLoF group and the missense group are observed to be associated with completely different phenotypic consequences. To better interpret the underlying genetic mechanisms of pleiotropy, we develop new statistical methods and frameworks to detect the allelic heterogeneity underneath the pleiotropic effects and distinguish between pleiotropy resulting from mediating factors and those implicating genuine diverse genetic pathways. Finally, we apply our novel methods to genes harboring a high potential of being pleiotropic and highlight the heterogeneous effects of variants within the genes impacting diseases from distinct domains. In this way, we demonstrate potential shared and diverse biological pathways among disease groups, providing insight into the genetic basis of pleiotropy in complex human traits.

April 4, 2023

Karl Tayeb

PhD Candidate
University of Chicago

Integrating functional annotations to estimate variant effect sizes from GWAS


The distribution of causal variants– both their distribution throughout the genome, as well as the magnitude of their effects– is a fundamental quantity that underpins many key questions in genetics. Knowledge of this distribution can be used to regularize estimated effects– providing a principled trade off between noisy experimental observations and an informative prior. While the distribution of causal effects is not known a-priori, there is an opportunity to learn this distribution by aggregating information in a genome-wide analysis.  Adaptive shrinkage (ASH) is a flexible Empirical Bayes approach to estimate the distribution of genetic effects. However, it assumes that variants are exchangeable. It is well understood that different regions of the genome have varying regulatory potential, that is, some genomic regions are more likely to contain causal varaints of variable effect size. We propose to extend the adaptive shrinkage framework to leverage functional annotations capturing this regulatory potential. Specifically, we propose to model the distribution of effects as a discrete scale-mixture of normals where the mixture weights vary as a function of annotations. Because the relevant annotations may not be known a-priori, we focus on an approach that can select relevant annotations from a large number of correlated annotations. This framework simultaneously provides an interpretable summary of genome wide association signal by identifying relevant annotations, and produces refined effect estimates.

March 28, 2023

Yu Chen
Analytic and Translational Genetics Unit
HMS, MGH, Broad Institute
Enhanced Interpretation of Schizophrenia GWAS of Diverse Ancestry with Brain Regulatory Architecture of Matched Population
Previous genetic studies of schizophrenia focused on disproportionate majority of European Population. GWAS of non-European populations highlighted the importance of generalizability of genetic studies. Yet, the scarcity of expression quantitative trait loci (eQTL) data of non-European population brains restricted our understanding of how population genetic diversity is involved in the risk of schizophrenia. In this study, we analyzed genotype and RNA-seq of African Americans (AA, n= 158) and Europeans (EUR, n= 408) from the PsychENCODE consortium and East Asians (EAS, n= 217) from the Chinese Human Brain Bank. In our study, 198,769 cis-eQTLs involving 1,276 genes (~10% of all eGenes) and 288,143 SNPs are uniquely captured in the non-EUR populations. We identified 534 eQTLs involving ten genes that have opposite eQTL effects across populations. TAS2R31 is a bitter taste receptor gene, and its opposite eQTLs in EAS and EUR were replicated using independent datasets. We next used the brain eQTLs to explain SCZ GWAS from diverse ancestries and observed that eQTLs better explained SCZ GWAS in the matched population. After prioritizing risk genes via transcriptome-wide association analysis and colocalization in the matched populations. Fourteen novel SCZ candidate genes were identified based on EAS population eQTL with EAS SCZ GWAS. Lastly, using deconvoluted eQTLs of the three populations, we fine-mapped more potential causal variants and risk genes at the major subtypes of brain cells. This characterization of brain regulatory architecture across diverse populations highlighted the importance of studying brain eQTL from non-EUR populations and provides a comprehensive resource for novel insights into schizophrenia.

February 14, 2023

Saori Sakaue

Postdoctoral Research Fellow

A scalable approach to use single-cell multimodal data to fine-map disease causal variants and links them to target genes

After 20 years of genome-wide association study (GWAS), rarely have we identified causal variants or genes. Translating GWAS loci into causal variants and genes requires accurate cell-type-specific enhancer-gene maps from disease-relevant tissues. Building enhancer-gene maps is essential but challenging with current experimental methods in primary human tissues. We developed a new non-parametric statistical method, SCENT (Single-Cell ENhancer Target gene mapping) which models association between enhancer chromatin accessibility and gene expression in single-cell multimodal RNA-seq and ATAC-seq data. We applied SCENT to 9 multimodal datasets including > 120,000 single cells and created 23 cell-type-specific enhancer-gene maps. These maps were highly enriched for causal variants in eQTLs and GWAS for 1,143 diseases and traits, which outperformed previous bulk-tissue based enhancer map (e.g., EpiMap) and single-cell based method (e.g., Signac). We identified novel likely causal genes for both common and rare diseases. In addition, we were able to link somatic mutation hotspots to target genes. We demonstrate that application of SCENT to multimodal data from disease-relevant human tissue enables the scalable construction of accurate cell-type-specific enhancer-gene maps, essential for defining variant function.

December 13, 2022

Joshua Popp
PhD Candidate
Johns Hopkins University


Investigating dynamic genetic regulation in diverse cell types and differentiation trajectories

Large-scale efforts to identify genetic variants associated with nearby gene expression levels (cis expression quantitative trait loci, or cis-eQTLs) have focused primarily on healthy adult tissues, potentially overlooking effects specific to transient cell states such as those arising during development or environmental exposures. The current limitations in identifying target genes and mechanisms underlying disease-associated genetic loci may arise partially due to the scarcity of data from developmental and dynamic contexts in disease-relevant cell types. By integrating established stem cell model systems with recent advances in single-cell RNA-sequencing technologies, we are now able to study these transient contexts at high resolution, bringing a clearer picture of gene regulatory dynamics into view. We first studied directed differentiation of induced pluripotent stem cells (iPSCs) into cardiomyocytes, where single-cell analysis enabled us to resolve bifurcating trajectories displaying distinct regulatory dynamics. Now, we are branching out to a wider array of trajectories in an efficient, unified experimental framework based on embryoid bodies. Embryoid bodies are three-dimensional aggregates of iPSCs that spontaneously differentiate into a wide variety of cell types, including derivatives of all three germ layers. We generated embryoid bodies from iPSCs from 53 donors, and collected single cell RNA-sequencing data for over 900,000 cells in total. We applied trajectory inference to characterize the complex, multifurcating differentiation landscape and annotated a wide array of cell types, many of which resemble those arising during human embryonic development. We found that different cellular lineages display dynamic expression of numerous genes, including gene modules relevant to complex traits such as schizophrenia, hypertension, and BMI. We identified hundreds of genetic regulatory effects including cell-type specific and dynamic eQTLs in various trajectories. Next, we used unsupervised machine learning to infer patterns of shared regulatory activity within and between differentiation trajectories to characterize how regulation varies between cellular contexts. This work will expand identification of genetic effects to transient cell states arising during differentiation in diverse lineages and improve attribution of regulatory mechanisms to disease risk loci.

November 8, 2022

Martin Zhang

Postdoctoral Research Fellow
Harvard T.H. Chan School of Public Health

 Quantifying and partitioning SNP effect correlation across UK Biobank traits

While traditional heritability estimation methods such as LD-score regression (Bulik-Sullivan et al. 2015 Nat Genet) assume independent SNP effects on the trait, recent works (Schoech et al. 2020 bioRxiv) determined that nearby SNPs (e.g., <100bp) have negatively correlated effects. Yet, it is unclear if the SNP effect correlation exists for other types of SNP pairs. We developed a new method, gene-level directional effect regression (GDREG), that analyzes summary association statistics and in-sample LD information to jointly estimate average SNP effect correlation and per-SNP heritability, stratified by MAF and functional category; we define SNP effect correlation based on standardized minor allele causal effect sizes of SNPs. We performed extensive simulations to confirm that GDREG obtains unbiased estimates of average SNP effect correlation under a range of different genetic architectures.

We applied GDREG to the UK Biobank imputed data across 32 independent diseases and complex traits (N=334K unrelated British-ancestry samples). We determined that SNP pairs within 100bp are significantly negatively correlated (non-significant for common pairs, -0.27±0.04 for low-freq SNP pairs), recapitulating results in previous works. Furthermore, we detected a significant negative correlation for gene-level SNP pairs, including SNP pairs on the same exon (non-significant for common SNP pairs, -0.04±0.01 for low-freq SNP pairs) and SNP pairs on the promoter of the same gene (-0.04±0.01 for common pairs, -0.21±0.03 for low-freq pairs). The negative correlation of gene-level SNP pairs cannot be fully explained by their proximity, providing additional information to the proximity annotation. We hypothesize that the negative correlation is due to linkage masking, in which negatively correlated effects of positively correlated minor alleles enable the alleles to cancel each other’s effects and thus survive the action of negative selection. In addition, we determined that low-freq SNP pairs are more negatively correlated than common SNP pairs across the annotations.


October 11, 2022
Ying Wang
Postdoctoral Research Fellow, Analytic and Translational Genetics Unit
HMS, MGH, Broad Institute


Challenges and opportunities for developing more generalizable polygenic risk scores

Polygenic risk scores (PRS) estimate an individual’s genetic likelihood of complex traits and diseases by aggregating information across multiple genetic variants identified from genome-wide association studies. PRS can predict a broad spectrum of diseases and have therefore been widely used in research settings. Some work has investigated their potential applications as biomarkers in preventative medicine, but significant work is still needed to definitively establish and communicate absolute risk to patients for genetic and modifiable risk factors across demographic groups. However, the biggest limitation of PRS currently is that they show poor generalizability across diverse ancestries and cohorts. Major efforts are underway through methodological development and data generation initiatives to improve their generalizability. We will discuss current progress on the development of PRS, the factors that affect their generalizability, and promising areas for improving their accuracy, portability, and implementation.

September 13, 2022
Marouen Ben Guebila
Research Fellow
Department of Biostatistics, HSPH


Multiclass network inference reveals distinct parameter regions in gene regulatory space

Gene expression is controlled by complex regulatory processes that allow the same genome to create and regulate the vast diversity of cellular phenotypes that exist in each individual. Although we cannot directly measure gene regulatory networks (GRNs), inferring these from available data is increasingly acknowledged as essential to understanding a wide range of biological processes. PANDA (Passing Attributes between Networks for Data Assimilation) is a method for inferring GRNs that has been used in a large number of studies to identify changes in patterns of gene regulation that correspond to changes in phenotypic state. BOA (Bayesian Optimization of Associations) extends PANDA by introducing a number of weight parameters and using a Bayesian optimization protocol to adjust the relative importance of various PANDA inputs by tuning against relevant experimental data sets. This allows BOA to infer not only GRNs (based on benchmarking against ChIP-seq transcription factor (TF) binding data), but also Transcriptional Control Networks (TCNs) that model effects of TF knock-down to provide insight into cascading regulatory responses, and Functional Binding Networks (FBNs; a combination of GRNs and TCNs) that are optimized against a combination of ChIP-seq and TF knock-down data. We demonstrate the utility of BOA by constructing GRNs, TCNs, and FBNs for cell lines treated with 259 drugs and analyze network structure to identify altered regulatory patterns and the effects of perturbative compounds. BOA is available at; the gene regulatory network database (GRAND; includes the cell line drug benchmark data and the networks generated by BOA.


May 10, 2022

Gang Li

Data Science Postdoctoral Fellow
University of Washington

Integration of Imaging and Sequencing Data in the Context of Visual Cell Sorting
Abstract: Visual cell sorting (VCS) is a single-cell co-assay that combines microscopy and high-throughput sequencing.  The microscopy measures cell morphology ​​and marks cells with phenotypes of interest, which enables sorting of cells based on visual phenotype. The subsequent sequencing step can be used to measure any one of a variety of cell characteristics, such as gene expression, chromatin accessibility, or chromatin 3D architecture. In the current VCS analysis pipeline, the imaging data is used primarily to generate discrete morphology labels. This approach does not fully exploit the rich information from images. VCS can associate single-cell profiles with their associated morphological phenotypes, but the images and the single-cell profiles do not have a direct correspondence. To attempt to recover this correspondence information, we developed a supervised version of the MMD-MA manifold alignment algorithm, with the goal of embedding the single-cell sequencing measurements and microscopy images into a shared manifold in such a way that two observations derived from the same cell are nearby in the embedded space. Clearly, successfully creating such an embedding would be valuable because it would allow us to explicitly describe how changes in gene expression relate to specific changes in cell morphology. We utilized partial phenotype labels of cells to guide the embedding, and we used the labels of the remaining cells for evaluation. Our approach sheds light on how gene expression profiles interact with cell morphology.

April 5, 2022

Ayshwarya Subramanian

Computational Biologist
Broad Institute

Comparative transcriptomics of mouse and human kidneys to study disease altered cell states

Mouse models are a tool for studying the mechanisms underlying complex diseases; however, differences between species pose a significant challenge for translating findings to patients. Here, we used single-cell transcriptomics and orthogonal validation approaches to provide cross-species taxonomies, identifying shared broad cell classes and unique granular cellular states, between mouse and human kidney. We generated cell atlases of the diabetic and obese kidney using two different mouse models, a high-fat diet (HFD) model and a genetic model (BTBR ob/ob), at multiple time points along disease progression. Importantly, we identified a previously unrecognized, expanding Trem2high macrophage population in kidneys of HFD mice that matched human TREM2high macrophages in obese patients. Taken together, our cross-species comparison highlights shared immune and metabolic cell-state changes.

March 8, 2022

Haoyu Zhang
Postdoctoral Fellow
Department of Biostatistics, Harvard T.H. Chan School of Public Health

Methods for risk prediction using integrative multi-ethnic genetic and genomic datasets

Polygenic risk scores (PRS) are becoming increasingly predictive of complex traits, but poorer performance in non-European populations raises concerns for clinical applications. We develop a powerful and scalable method for developing PRS using GWAS across diverse populations by combining multiple techniques, including LD-clumping, empirical-Bayes and machine learning. We evaluate the performance of the proposed method relative to a variety of alternatives using extensive simulation studies and 23andMe Inc. datasets for seven complex traits, including up to 800K individuals from non-European populations. Results show that the proposed method can substantially improve the performance of PRS in non-European populations relative to simple alternatives and can perform comparably or superior to more advanced methods that require a different order of computational time. Further, our simulation studies provide novel insight to sample size requirement and effect of SNP density on multi-ethnic polygenic prediction.

February 15, 2022

Martin Jinye Zhang

Postdoctoral Researcher, Department of Epidemiology
Harvard T.H. Chan School of Public Health

Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data

Gene expression at the individual cell-level resolution, as quantified by single-cell RNA-sequencing (scRNA-seq), can provide unique insights into the pathology and cellular origin of diseases and complex traits. Here, we introduce single-cell Disease Relevance Score (scDRS), an approach that links scRNA-seq with polygenic risk of disease at individual cell resolution; scDRS identifies individual cells that show excess expression levels for genes in a disease-specific gene set constructed from GWAS data. We determined via simulations that scDRS is well-calibrated and powerful in identifying individual cells associated to disease. We applied scDRS to GWAS data from 74 diseases and complex traits (average N =341K) in conjunction with 16 scRNA-seq data sets spanning 1.3 million cells from 31 tissues and organs. At the cell type level, scDRS broadly recapitulated known links between classical cell types and disease, and also produced novel biologically plausible findings. At the individual cell level, scDRS identified subpopulations of disease-associated cells that are not captured by existing cell type labels, including subpopulations of CD4+ T cells associated with inflammatory bowel disease, partially characterized by their effector-like states; subpopulations of hippocampal CA1 pyramidal neurons associated with schizophrenia, partially characterized by their spatial location at the proximal part of the hippocampal CA1 region; and subpopulations of hepatocytes associated with triglyceride levels, partially characterized by their higher ploidy levels. At the gene level, we determined that genes whose expression across individual cells was correlated with the scDRS score (thus reflecting co-expression with GWAS disease genes) were strongly enriched for gold-standard drug target and Mendelian disease genes.

December 14, 2021 – cancelled

November 9, 2021

Tushar Kamath

PhD Candidate, Biophysics, Broad Institute
MD Student, Harvard Medical School

Vulnerabilities of midbrain dopaminergic neurons to Parkinson’s disease revealed by single-cell genomics

The loss of some dopamine (DA) neurons within the substantia nigra pars compacta (SNpc) is a defining pathological hallmark of Parkinson’s Disease (PD). Yet, the molecular features associated with DA neuron vulnerability have not yet been fully identified. To comprehensively characterize DA neuron types in the SNpc and their relative vulnerabilities to PD, we developed a protocol to enrich and transcriptionally profile thousands of midbrain DA neurons from PD patients and matched controls. We identified 10 populations and spatially localized each within the SNpc using Slide-seq, a high-resolution spatial transcriptomics technology. A single subtype, marked by the expression of the gene AGTR1 and spatially confined to the ventral tier of SNpc, was highly susceptible to loss and showed the strongest upregulation, relative to other DA types, of targets of TP53 and NR2F2, nominating molecular processes associated with degeneration in vivo. This same vulnerable population was specifically enriched for the heritable risk associated with sporadic PD. These analyses highlight the importance of cell-intrinsic pathways in determining the differential vulnerability of DA neurons to degeneration in PD.

October 12, 2021

Tiffany Amariuta

Postdoctoral Research Fellow, Genetic Epidemiology and Statistical Genetics
Harvard T.H. Chan School of Public Health

Modeling tissue co-regulation to infer tissue-specific contributions to disease

Despite abundant evidence of disease etiologies that span multiple tissues, quantifying tissue- specific contributions to disease heritability remains challenging. Previous work emphasized the potential of accounting for tissue co-regulation (Ongen et al. 2017 Nat Genet), but tissue-specific disease effects have not been formally modeled.

We introduce a new method, tissue co-regulation score regression (TCSC), that quantifies tissue-specific contributions to disease heritability by regressing transcriptome-wide association study (TWAS) gene-disease chi-square statistics on tissue co-regulation scores, across genes and tissues. TWAS statistics include direct effects of predicted cis-genetic components of gene expression on disease and tagging effects of co-regulated tissues (Wainberg et al. 2019 Nat Genet). TCSC distinguishes best proxy causal versus tagging gene-disease effects across tissues by modeling pairwise correlations of predicted gene expression between tissues (tissue co-regulation scores). In simulations, TCSC detects causal tissues with well-calibrated false positive rate across a broad range of parameter settings. At default settings, TCSC attained substantially higher power to detect causal tissues than the Ongen et al. method. TCSC also estimates the proportion of SNP-heritability explained by each tissue; estimates are conservative, as they exclude effects of genes with non-significant gene expression heritability at finite sample size.

We applied TCSC to 82 heritable complex traits and diseases from UK Biobank (average N = 299K), using gene expression prediction models constructed from GTEx data across 49 tissues to compute TWAS statistics and co-regulation scores. Below, we discuss tissues with non-zero heritability at 10% FDR for three representative traits: waist-hip-ratio (WHR), anorexia, and height. For WHR, TCSC correctly implicates adipose (29.9% of SNP-heritability explained), as WHR reflects the corporeal distribution of fat. For anorexia, TCSC specifically implicates the brain cortex (32.9% of SNP-heritability explained) out of five central nervous system regions, consistent with extensive previous work implicating this brain region (Haye et al. 2009 Nat Rev Neuro). For height, TCSC implicates five best proxy causal tissues, where the lead tissue, fibroblasts, explains 41.8% of SNP-heritability. This is consistent with the known role of connective tissue in growth regulation. Our method also reduced the number of trait-associated tissues within a tissue category by 67% compared to LDSC-SEG (Finucane et al. 2018 Nat Genet). In conclusion, TCSC is a powerful method for quantifying tissue-specific contributions to disease heritability.

September 28, 2021

Rounak Dey

Postdoctoral Research Fellow, Biostatistics
Harvard T.H. Chan School of Public Health

Scalable and accurate mixed effects model to account for relatedness and populations structure in multi-ethnic PheWAS using sparse ancestry-adjusted genetic relatedness matrix

In genetic association studies, generalized linear mixed effects models (GLMMs) are commonly used to control for the relatedness among the samples by modelling the familial and cryptic relationships using a genetic relationship matrix (GRM). Existing GLMM methods use an empirical GRM to account for the sample-relatedness, which works well in the context of association analysis in studies with mostly homogeneous populations. However, they are not suitable to analyze recent multi-ethnic whole genome sequencing (WGS) studies with heterogeneous populations. A standard approach to adjust for the population stratification in multi-ethnic studies is to use the principal components (PCs) as fixed effects. Since the empirical GRM also contains the population structure information, using both the PCs and the empirical GRM in the model can lead to “double-fitting” of the population structure. Moreover, using the empirical GRM can lead to the mis-specification of the familial relationships due to the confounding effect of the population structure, which can potentially result in the loss of power and miscalibration of the type I error rates. Existing methods that rely on the sparsity of the empirical GRM also fail to work because of the lack of sparsity due to the population structure.
Here, we propose a scalable GLMM method for multi-ethnic studies that uses a sparse ancestry-adjusted GRM to model the sample-relatedness, and accounts for the population structure using the ancestry-informative principal components as fixed effects. By separating the distant ancestry and the familial relationships, our method provides a scalable and accurate solution to analyze large multi-ethnic studies, especially some of the recent WGS studies, which leads to accurate type I error control and improved power to detect associations. To facilitate the entire pipeline for the WGS data analysis, we further propose a scalable computation method to estimate the sparse ancestry-adjusted GRM using efficient distributed computation techniques, which can compute the sparse ancestry-adjusted GRM for the entire UK Biobank dataset of more than 450000 subjects in less than nine hours using only 45 CPUs and 40 GB overall memory usage. Using numerical simulations, and an application on the entire UK Biobank dataset, we demonstrate that our method is scalable to handle association analysis with more than 450000 subjects, and control type I error and improve power compared to the existing GLMM methods.


May 11, 2021

Brian Cleary

Broad Fellow
Broad Institute of MIT and Harvard

Using viral loads and within-host models to improve COVID-19 surveillance

Limited testing capacity has been an ongoing problem throughout the COVID-19 pandemic. Pooled testing is a faster and less expensive diagnostic approach compared to individual testing, but there are important tradeoffs in sensitivity, efficiency and logistics to consider. In this talk, we outline an approach combining within-host hierarchical models, compartmental models of pathogen spread, and viral load data to identify optimized pooling protocols that are robust to the changing state of an epidemic. This study highlights the importance of considering within-host biology and individual-level heterogeneity when evaluating epidemiological surveillance strategies. We will also introduce HYPER, a new pooled testing method based on hypergraph factorization. HYPER is designed to be easy to implement and adapt, while also producing pools that are balanced and efficient. We will discuss what hypergraph factorizations are and how they generate the pooling designs used in HYPER.

April 6, 2021

Huwenbo Shi and Martin Zhang

Postdoctoral Research Fellow, Department of Epidemiology
Harvard T.H. Chan School of Public Health

Transcriptome-wide association study and fine-mapping at cell-type resolution  

Transcriptome-wide association studies (TWAS) using cis predicted gene expression have identified thousands of genes associated with diseases and complex traits (Gamazon et al. 2015 Nature Genetics, Gusev et al. 2016 Nature Genetics, Zhu et al. 2016 Nature Genetics). However, existing TWAS are based on gene expression models for bulk tissue gene expression, and therefore cannot pinpoint the specific cell types through which a gene is associated with diseases and complex traits.

Here, we introduce an approach to perform TWAS at tissue-cell-type resolution. We apply computational methods to infer cell type proportions and cell-type specific gene expression for each GTEx sample, using the mouse reference single-cell gene expression from the Tabula Muris Consortium in matched tissues, and obtain gene expression models for every tissue-cell-types based on the inferred gene expression.

We performed tissue-cell-type TWAS and fine-mapping for 52 diseases and complex traits. Compared to TWAS at tissue resolution, our tissue-cell-type TWAS discovered on average 84% more unique gene-trait associations at FDR < 0.05. Our approach also pinpoints specific cell types for each gene-trait association. For example, we found that the SORT1-LDL association is highly significant (P<10-30) in liver hepatocyte with fine-mapped PIP=0.68, and the CACNA1C-schizophrenia association in brain cerebellum neuronal cells (P=2.35×10-10, PIP=1.0) . In addition, in a systematic comparison with known disease-related genes, the fine-mapped tissue-cell-type TWAS results showed increased enrichment relative to tissue TWAS in 16/19 sets of comparisons (P=0.003).

March 9, 2021

Kumar Veerapen

Research Fellow
Analytic and Translational Genetics Unit (ATGU) at Massachusetts General Hospital, the Broad Institute, and Harvard Medical School

Assessing the Genetic Contribution of Drug Response Variability in Selective Serotonin Reuptake Inhibitors From More than 2 million Purchases in the Finnish National Drug Registry   

Drug use and response is highly variable between individuals e.g. up to 65% of major depressive disorder (MDD) do not respond to selective serotonin reuptake inhibitors (SSRI, ATC code: N06AB). Moreover, molecular mechanisms of interindividual differences that cause SSRI response variability is poorly understood but is largely genetic. We hypothesize that large scale biobanks can be used to understand the genetic basis of SSRI response variability. We utilized the FinnGen integrated biobank data to interrogate SSRI use and response variability. The FinnGen biobank aims to recruit a total of 500,000 Finnish individuals who will be genotyped and linked with lifetime health and drug data from the national Finnish registries. Currently, 218,792 individuals have complete genetic and phenotypic data: 51,519 are individuals who have longitudinal SSRI purchase history. We aim to understand genetic differences contributing to the variability in SSRI usage and response in FinnGen. Each genetic association was assessed using mixed-models and significant loci were determined using a p<5×10-8  threshold. The use of SSRI and MDD is highly correlated in FinnGen (Rg = 0.73): pointing towards a genetic basis in response variability of SSRI. From a total of 1,711,695 antidepressant purchases, 853,286 (49.9%) were SSRI comprising 51,519 individuals. We first investigated the SSRI users (N=51,519) relative to non-users (N=159,788) and identified 3 significant loci where a signal in HLA-A (chr6:29978172:G:A) is located near to a previously reported locus in a recent MDD GWAS (r2 = 0.39). An additional signal in the PHF21A (chr11:46035522:T:C) locus could point towards functional relevance: PHF21A has been shown to impair serotonin metabolism in mouse models. After confirming that drug usage data identifies known MDD loci, we analyzed the  genetic differences in adherence of SSRI: 18,691 individuals consumed only SSRI and 13,325 individuals switched from an SSRI to any other antidepressant. We identified 1 significant locus in HLA-DQB1-AS1 (chr6:32658495:C:A) which has been associated with immune and neurological phenotypes (FinnGen phewas). Finally, we attempted to understand genetic differences contributing to longitudinal SSRI use over time between individuals who had no change vs an increase in dosage over time. We identified a significant polygenic risk load in patients with increased dosage (p<0.05); and 1 significant locus in PTDSS2 (chr11:476394:G:A) (p<5×10-8). The protein PTDSS2 has been shown to have a high affinity to docosahexaenoic acid (DHA) where DHA has been clinically used to augment SSRI efficacy.
The use of FinnGen as a large scale biobank to understand the genetic contribution of SSRI usage and response variability has shown to be promising. Future studies will include further understanding polygenic risk contributing to longitudinal drug use, dosage patterns, and potential adverse outcomes. Results of this study can be used to improve prescription accuracy and treatment of MDD.

February 16, 2021

Xuefang Zhao 

Postdoctoral Researcher
Harvard Medical School, Center for Genomic Medicine at Massachusetts General Hospital

Expectations and blind spots for structural variation detection from short-read alignment and long-read assembly

Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and emerging clinical diagnostic approaches utilize short-reads (srWGS), which present constraints for genome-wide discovery of structural variants (SVs). Alternative long-read single molecule technologies (lrWGS) offer significant advantages for genome assembly and SV detection, while these technologies are currently cost prohibitive for large-scale disease studies and clinical diagnostics (~5-12X higher cost than comparable coverage srWGS). Moreover, only dozens of such genomes are currently publicly accessible by comparison to millions of srWGS genomes that have been commissioned for international initiatives. Given this ubiquitous reliance on srWGS in human genetics and genomics, we sought to characterize and quantify the properties of SVs accessible to both srWGS and lrWGS to establish benchmarks and expectations in ongoing medical and population genetic studies, and to project the added value of SVs uniquely accessible to each technology. In analyses of three trios with matched srWGS and lrWGS from the Human Genome Structural Variation Consortium (HGSVC), srWGS captured ~11,000 SVs per genome using reference-based algorithms, while haplotype-resolved assembly from lrWGS identified ~25,000 SVs per genome. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplications (SD) and simple repeats (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of the human reference, we observed extremely high concordance (93.8%) for deletions discovered by srWGS and lrWGS after error correction using the raw lrWGS reads. Conversely, lrWGS was superior for detection of insertions across all genomic contexts. Given that the non-SD/SR sequences span 90.3% of the GRCh38 reference, and encompass 95.9% of coding exons in currently annotated disease associated genes, improved sensitivity from lrWGS to discover novel and interpretable pathogenic deletions not already accessible to srWGS is likely to be incremental. However, these analyses highlight the added value of assembly-based lrWGS to create new catalogues of functional insertions and transposable elements, as well as disease associated repeat expansions in genomic regions previously recalcitrant to routine assessment.

December 15, 2020

Xihao Li

PhD Candidate
Harvard T. H. Chan School of Public Health

Powerful and resource-efficient rare variant meta-analysis for large-scale whole genome sequencing studies using summary statistics and functional annotations, with application to TOPMed lipid data
Large-scale whole genome sequencing (WGS) studies have enabled the analysis of rare variants (RVs) associated with complex human traits. Existing RV meta-analysis approaches are not scalable when applied to WGS data. We propose MetaSTAAR (Meta-analysis of variant-Set Test for Association using Annotation infoRmation), a powerful and resource-efficient rare variant meta-analysis framework, for large-scale whole genome sequencing association studies. MetaSTAAR accounts for population structure and relatedness for both continuous and dichotomous traits by fitting the generalized linear mixed models using sparse genetic relatedness matrices. By storing LD information of RVs in sparse matrix format, the proposed workflow is highly storage efficient and computationally scalable for analyzing large-scale WGS data. Furthermore, the proposed meta-analysis framework builds upon the STAAR method, which dynamically incorporates multiple functional annotations to empower rare variant association analysis and allows for RV-set analysis including gene-centric analysis by grouping variants into functional categories for each gene and genetic region analysis using sliding windows. MetaSTAAR also enables conditional analyses to identify RV-set signals independent of nearby common variants. We applied MetaSTAAR to identify RV-sets associated with four quantitative lipid traits (LDL-C, HDL-C, TG and TC) in 30,138 related samples from the NHLBI Trans-Omics for Precision Medicine program Freeze 5 data, consisting of 14 ancestrally diverse study cohorts and 255 million variants in total. MetaSTAAR requires 520 GB to store the summary statistics and LD matrices across the whole genome, which is at least 100 times smaller than the existing method RAREMETAL. In addition, the computation time is benchmarked to be 100 times faster than RAREMETAL. Compared to the joint analysis of pooled individual-level data using STAAR, the P-values from MetaSTAAR and STAAR are highly consistent, with correlation > 0.99 among significant regions in both unconditional and conditional analyses.

November 10, 2020

Carles Boix

PhD Candidate

Regulatory genomic circuitry of human disease loci by integrative epigenomics
Annotating the molecular basis of human disease remains an unsolved challenge, as 93% of disease loci are non-coding, and gene-regulatory annotations highly incomplete. Here, we present EpiMap, a compendium of 10,000 epigenomic maps across 800 samples, which we use to define chromatin states, high-resolution enhancers, enhancer modules, upstream regulators, and downstream target genes. We use this resource to annotate 30,000 genetic loci associated with 540 traits, predicting trait-relevant tissues, putative causal nucleotide variants in enriched-tissue enhancers, and candidate tissue-specific target genes for each. We partition multifactorial traits into tissue-specific contributing factors with distinct functional enrichments and disease-comorbidity patterns, and reveal both single-factor monotropic and multi-factor pleiotropic loci. Top-scoring loci frequently have multiple predicted driver variants, converging through multiple common-target enhancers, multiple common-tissue genes, or multiple genes/tissues, indicating extensive pleiotropy. Our results demonstrate the importance of dense, rich, high-resolution epigenomic annotations for complex trait dissection.

October 13, 2020

Masahiro Kanai

PhD Candidate
Harvard Medical School, Analytical and Translational Genetics Unit, Mass General Hospital, Broad Institute

Insights into fine-mapping causal variants of complex traits from diverse populations
Identifying causal variants for complex traits is one of the major challenges in human genetics. The causal variants in most GWAS loci remain unknown due to lack of power and to high linkage disequilibrium (LD) in a locus. Moreover, little is known about how causal variants are shared across populations due to lack of large-scale GWAS from diverse populations.

Here, we present a cross-population analysis of fine-mapping results based on three large-scale biobanks. In parallel to our effort fine-mapping complex traits in UK Biobank (n = 361,194; UKB) and expression QTL (eQTL) in GTEx (n = 838) (Ulirsch & Kanai et al), we fine-mapped hundreds of complex traits and diseases from Biobank Japan (n = 180,987; BBJ) and FinnGen (n = 183,694) using FINEMAP (Benner et al, 2016) and SuSiE (Wang et al, 2018). In total, 4,151 high-confidence putative causal variants for 124 traits were identified (posterior inclusion probability [PIP] > 0.9 in any population), including 46 and 66 population-enriched variants from BBJ and FinnGen, respectively. Distinct coding variants from each population often fine-mapped together in the same exons, and we found that coding putative causal variants are more deleterious (OR = 10.4 and 5.9 for pLoF and missense vs synonymous) and more pathogenic (OR = 28.3 for ClinVar variants) than other coding variants. Furthermore, we observed that non-coding putative causal variants are strongly enriched for promoters and cis-regulatory regions (accessible chromatin and H3K27ac) (OR = 10.8 and 11.3 vs non-genic) and colocalized with fine-mapped eQTL variants in GTEx, suggesting that the majority of putative causal variants could be explained via coding or regulatory mechanisms. Altogether, we demonstrate how diverse populations gain additional insights into disease biology with an expanded atlas of candidate causal variants.

Despite high trans-ethnic genetic correlation, we found most single-population fine-mapped variants are undiscoverable across populations; only 8% of the variants with PIP > 0.9 were identified in 95% credible sets in other populations. This inconsistency is mainly due to lack of power in other populations or to LD complexity, with a minority unexplained. For example, among 2,483 fine-mapped variants with PIP > 0.9 in UKB, 53% are missing in BBJ because they are rare or monomorphic, 35% have lower power for association due to MAF and sample size, and 2% have higher LD complexity based on empirical predicted PIP analysis. Overall, our analysis gives insights into how to interpret fine-mapping results from multiple populations and emphasizes the desperate need of more diversity in human genetics.

September 29, 2020

Zilin Li

Postdoctoral Fellow
Harvard T. H. Chan School of Public Health

A Framework for Detecting Noncoding Associations in Large Whole Genome Sequencing Studies at Scale

Compared with GWAS and whole exome sequencing studies, large-scale whole genome sequencing studies have enabled the analysis of non-coding rare variants (RVs) associated with complex human traits. Common analytic strategies for RV association in non-coding region considered limited choices of gene-centric masks and sliding windows of a fixed length, and have limited scope to leverage the functions of variants.

We propose a non-coding rare variant association detection framework, including gene-centric analysis and genetic region analysis. For gene-centric analysis, we consider various strategies for grouping non-coding variants based on functional annotations, including UTR, upstream, downstream, promoter, enhancer and long non-coding RNA genes. For genetic region analysis, we group non-coding RVs residing in a contiguous window, defined either by a pre-specified (fixed) window size or a flexible data-adaptive window size using SCANG (SCAN the Genome). The STAAR (variant-Set Test for Association using Annotation infoRmation) method is also applied in the framework that increases the power of RV association tests by effectively incorporating multiple functional annotations.

We applied the proposed framework to analyze non-coding RV association with four quantitative lipid traits (LDL-C, HDL-C, TG and TC) in 21,015 discovery samples and 9,123 replication samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. Several novel non-coding RV-sets associated with lipids were discovered and replicated using the TOPMed WGS data.


May 12, 2020

Tiffany Amariuta

PhD Candidate
Harvard Medical School

Leveraging functional annotations to improve the trans-ethnic portability of polygenic risk scores

Polygenic risk scores (PRS) quantify the cumulative trait-increasing allelic effect of an individual’s genome and have the potential to transform clinical practice by identifying patients with higher genetic burdens. Poor trans-ethnic portability of PRS is a critical issue that may be partially due to limited knowledge of causal variants shared among populations. Hence, leveraging noncoding regulatory annotations that capture genetic variation across populations has the potential to enhance the trans-ethnic portability of PRS. Using our cell-type-specific regulatory element annotation strategy called IMPACT, we partitioned the common SNP heritability of diverse polygenic traits and diseases from 111 GWAS summary statistics of European (EUR, average N=180K) and East Asian (EAS, average N=157K) origin. Strikingly, we observed highly concordant polygenic trait regulation between populations: the same regulatory annotations captured statistically indistinguishable SNP heritability. Therefore, prioritizing variants in IMPACT regulatory elements may improve the trans-ethnic portability of PRS by selecting variants with population nonspecific effects. Indeed, we observed that EUR PRS models more accurately predicted 21 tested phenotypes of EAS individuals when variants were prioritized by key IMPACT tracks (49.9% mean relative increase in R ). Notably, the improvement afforded by IMPACT 2 was greater in the trans-ethnic EUR-to-EAS PRS application than in the EAS-to-EAS application (47.3% vs 20.9%, P < 1.7e-4). Overall, our study identifies a crucial role for functional annotations such as IMPACT to improve the trans-ethnic portability of genetic data, and this has important implications for future risk prediction models that work across populations.

April 21, 2020

Wei Zhou

Postdoctoral Research Fellow
Broad Institute

An efficient and accurate frailty model approach for genome-wide survival association analysis controlling for population structure and relatedness in large-scale biobanks

With decades of electronic health records linked to genomic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan.

We developed an efficient and accurate frailty (random effects) model approach for genome-wide survival association analysis of censored time-to-event phenotypes in large datasets (>400,000 individuals), which accounts for population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computation cost. The saddlepoint approximation is used to allow for analysis of heavily censored endpoints and low frequency variants. We have demonstrated analysis of ~400,000 samples and 77 million genetic markers in 15 hours using 32 threads and < 15 GByte of memory – with computing time scaling nearly linearly with the numbers of samples and markers. Using simulation studies, we have showed that type I error rates are well controlled, even for less frequent variants with frequency 0.5% and censoring rate 99%. We analyzed ~180,000 samples in FinnGen and ~400,000 UK Biobank participants with white British ancestry and identified the well-known locus APOE-e4 associated with lifespan. We also identified loci for secondary conditions following primary diseases, such as death following lung cancer and nephropathy following diabetes. Currently, we are applying our method to more event-to-time endpoints in the UK Biobank.

Genome-wide survival analysis of large biobanks will help understand the natural history and progression of diseases and pave a road for constructing polygenic risk scores to predict age-at-onset, disease progression and lifespan.

April 7, 2020

Hufeng Zhou

Instructor in Medicine
Harvard Medical School

Functional Annotation of Variants – Online Resource

March 10, 2020 – cancelled

Giulio Genovese
Senior Computational Biologist
Harvard Medical School
Chromosomal phasing improves aneuploidy determination in non-invasive prenatal testing at low fetal fractions
Non-invasive prenatal testing (NIPT) to detect fetal aneuploidy by sequencing cell-free DNA (cfDNA) in maternal plasma has been broadly adopted. To detect fetal aneuploidies from maternal plasma – where fetal DNA is mixed with far-larger amounts of maternal DNA – NIPT requires a minimum fraction of the circulating cfDNA to be of placental origin, a level which is usually attained beginning at 10 weeks gestational age. We present a framework to leverage chromosomal phase – the arrangement of alleles along homologous chromosomes – to make NIPT analyses more conclusive. We re-analyze data from a singleton pregnant mother who received an inconclusive aneuploidy determination through NIPT due to a fetal DNA fraction of 3.4%. We show that the same laboratory data can be used to conclusively infer the presence of a trisomy 18 fetus when chromosomal phase is taken into account. Key to the effectiveness of the approach we describe is the robustness of allelic fraction estimates to biological and laboratory-process driven noise and the ability of chromosomal phase to integrate quantitative signals across very many polymorphic markers. These results show that chromosomal phase increases the sensitivity of a common laboratory test, an idea that could also have broad application in cfDNA analyses for cancer detection.

February 25, 2020

Caitlin Carey

Post-Doctoral Fellow
Analytical and Translational Genetics Unit, Mass General Hospital and Broad Institute

Genetic architecture of phenome-wide latent factors in the UK Biobank

December 10, 2019 – cancelled

Duncan Palmer

Post-Doctoral Fellow
Analytical and Translational Genetics Unit
Mass General Hospital and Broad Institute

Dominance genetic effects across multiple traits

November 19, 2019

Tarjinder Singh

Postdoctoral fellow
Massachusetts General Hospital, Harvard Medical School

Exome sequencing identifies genes for schizophrenia

Ultra-rare variants from 25,000 schizophrenia cases implicates ten genes and provides mechanistic hypotheses related to the biology of disease

The Schizophrenia Exome Sequencing Meta­-Analysis (SCHEMA) Consortium is one of the largest efforts to analyze sequencing data to advance gene discovery. We have completed the analysis of 24,248 sequenced cases and 97,322 controls comprising of individuals from five continental populations. The scale of SCHEMA enables us, for the first time, to implicate URVs in ten genes as conferring substantial risk for SCZ at genome-­wide significance (odds ratios 4 ­ 50, P < 2e­6), and 34 genes at a FDR < 5%. Two of these, the NMDA receptor subunit GRIN2A and transcription factor SP4, reside in two loci implicated by SCZ GWAS. A second glutamate receptor subunit, GRIA3, is also implicated, providing support for the hypofunction of the glutamatergic system in the pathogenesis of schizophrenia.

Exploring the published results from severe neurodevelopmental delay (NDD) and autism (ASD) consortia, we find that top ten SCZ genes have no protein­-truncating variant signal in either ascertainment, though a significant overlap between ASD (n = 102, FDR < 10%) and SCZ risk genes (n = 34, FDR < 5%) was observed. Partitioning the 102 ASD genes into those disrupted more frequently 1) in ASD and 2) in intellectual disability (ID), we show that this signal was driven by the ASD­preferential, and not ID­ preferential genes. Thus, SCZ genes from exome sequencing have relevance for later-­onset psychiatric disorders rather than more severe NDDs.

Between genes identified by GWAS and exome sequencing, we find convergence in biological processes and tissue types, specifically in synaptic transmission and components of the post­synaptic density. After excluding associated genes, SCZ cases still carry a substantial excess of rare URVs, suggesting that many more remain to be discovered.

October 1, 2019

Julian Hecker

Research Associate
Brigham and Women’s Hospital

PolyGEE: A generalized estimating equation approach to the efficient estimation of polygenic effects in large-scale association studies

To quantify polygenic effects in large-scale association studies, we propose a generalized estimating equation (GEE) based estimation framework. We develop a marginal model for single-variant association test statistics of complex diseases that is applicable to population-based designs, to family-based designs or arbitrary combinations of both. We extend the standard GEE approach so that the parameters of the proposed marginal model can be estimated based on working-correlation/linkage-disequilibrium (LD) matrices from external reference panels. Our method achieves substantial efficiency gains over standard approaches, while it is robust against misspecification of the LD structure, i.e. the LD structure of the reference panel can differ substantially from the true LD structure in the study population. In simulation studies and applications to real data, we illustrate the features of the proposed GEE framework.

September 24, 2019

Raymond Walters

Post-Doctoral Research Fellow
Analytical and Translational Genetics Unit
Mass General Hospital and Broad Institute

Exploring sex differences in the genetics of UK Biobank phenotype

Sex differences are commonly observed for numerous human traits (e.g. sexual dimorphism of height and weight, sex bias in psychiatric diagnoses), likely reflecting both biological (e.g. sex hormones) and environmental (e.g. cultural norms) effects. Twin studies and genome-wide association studies (GWAS), have suggested that genetic effects are predominantly shared across sexes. Nevertheless, there is evidence that for certain phenotypes, such as disorders involving alcohol use and waist-hip ratio, genetic effects may differ by sex. Detecting these differences remains challenging, however, since the sex differences in genetic effects are likely smaller than the genetic main effects. We use the large-scale genotyping data (194,174 female and 167,020 male unrelated individuals of European ancestry) of UK Biobank (UKB) to evaluate whether sex differences in genetic effects can be detected across the breadth of UKB phenotypes. In preliminary results, we find that on average most autosomal genetic effects are shared between sexes (mean rg=.92), but significant differences are observed for reported substance use, body mass index, and insomnia, among other phenotypes. We compare these estimates to analyses of siblings in UKB, and identify individual loci showing significant differences in effects between the GWAS in each sex.


April 9, 2019

Keegan Korthauer

Postdoctoral Research Fellow

Accurate inference of DNA methylation data: statistical challenges lead to biological insights

DNA methylation is an epigenetic modification widely believed to act as a repressive signal of gene expression. Whether or not this signal is causal, however, is currently under debate. Recently, a groundbreaking experiment probed the influence of genome-wide promoter DNA methylation on transcription and concluded that it is generally insufficient to induce repression. However, the previous study did not make full use of statistical inference in identifying differentially methylated promoters. In this talk, I’ll detail the pressing statistical challenges in the area of DNA methylation sequencing analysis, as well as introduce a statistical method that overcomes these challenges to perform accurate inference. Using both Monte Carlo simulation and complementary experimental data, I’ll demonstrate that the inferential approach has improved sensitivity to detect regions enriched for downstream changes in gene expression while accurately controlling the False Discovery Rate. I will also highlight the utility of the method through a reanalysis of the landmark study of the causal role of DNA methylation. In contrast to the previous study, our results show that DNA methylation of thousands of promoters overwhelmingly represses gene expression.

March 12, 2019

Elizabeth Atkinson

Postdoctoral Research Fellow
MGH / Broad Institute

Association studies for all: A novel framework to allow for the well-calibrated genomic analysis of underrepresented admixed individuals
Currently, many genetics studies exclude individuals whose ancestry is not homogeneous but rather comprises components from several ancestral groups – such individuals are “admixed”.  Admixed individuals are removed due to the challenges of accurately accounting for their ancestry such that population substructure can infiltrate analyses and bias results. Here, we present a framework to account for this issue and allow admixed individuals to be studied alongside homogenous ones by inferring fine-scale population structure informed by local ancestry estimates. Incorporating local ancestry information in addition to principal components takes into account individuals’ subtle differences in admixture patterns that may differ among case and control cohorts even if their genome-wide ancestry fractions are
the same. We apply our framework to several admixed cohorts with high global diversity from the Psychiatric Genomics Consortium PTSD group, with a focus on African American and Latino individuals. The current lack of methods for analysis of admixed individuals is problematic for many traits, but especially for certain psychiatric disorders including PTSD, which both has a very large number of patients of admixed descent and is extremely polygenic, meaning that accounting for the local ancestral dosage is key to precisely understand the genetic contribution of the many component sites of small effect that
combine to result in the disorder. We show the extent to which including admixed individuals within ancestry-specific meta-analyses of the PGC-PTSD cohorts boosts signal to discover GWAS loci. We further demonstrate that this framework gives increased precision when looking at variants across ancestry groups and improves the resolution of association signals by leveraging differences in linkage disequilibrium patterns between populations to more narrowly map where signal is coming from. This framework could be applied to solve the statistical issues related to admixture across many medical and
population genetics activities, including association testing efforts and evolutionary studies such as genome-wide selection scans. In sum, this framework dramatically advances the existing methodologies for studying admixed individuals and allows for significantly better calibrated study of the genetics of complex disorders in underrepresented populations.

February 26, 2019

Dandi Qiao

Instructor in Medicine
Brigham and Women’s Hospital

Gene-based segregation method and its application in family-based exome sequencing studies of severe COPD

For Mendelian diseases or complex diseases with Mendelian subtypes, filtering-based approach has been used to identify rare coding variants using whole exome sequencing data of families. However, formal statistical approaches are needed to quantify the uncertainty of such filtering approach. We described a Gene-based Segregation test (GESE) that was constructed using the probability of segregation events under the null hypothesis of Mendelian transmission, which takes into account the degree of relatedness in families, and the number and minor allele frequencies of the functional rare variants in the genes. We showed via simulation that GESE performs better than region-based methods for such diseases. Chronic obstructive pulmonary disease (COPD) is strongly influenced by genetic factors including rare variants of large effects, as exemplified by alpha-1 antitrypsin deficiency. We applied our method to a large whole-exome sequencing data of 1554 subjects in families with severe COPD patients. We identified two genes significantly segregating in multiple families. The expression of 5 out of the top 10 genes was associated with forced expiratory volume in 1-second (FEV1) % predicted, with enrichment p-value of 0.024. Our proposed method shows great potential for identifying rare coding variants of large effect and high penetrance for family-based sequencing data. It is implemented in an R package that is publicly available on CRAN.

December 11, 2018

Andrea Byrnes

Postdoctoral Fellow
MGH / Broad Institute

Normalization of RNA-seq data from multiple tissue or cell types
As RNA sequencing (RNA-seq) technology matures for bulk and single
cell samples, experiments comparing gene expression across many tissue
and cell types will become possible. Such experiments promise to
illuminate cell types of interest for human disease phenotypes for
which there already exist genetic associations, and identify genes
with cell type-specific patterns of expression. However, human cell
types have great diversity in overall gene expression, both in
quantity and diversity. These differences are not adequately accounted
for in current data-processing pipelines for RNA-seq data. In order to
rigorously analyze data from the experiments described above, we must
adjust for the vast differences in transcriptional landscape present
in human cell types. We develop a normalization regime to take these
differences into account when quantifying expression from a single
gene using estimated library size as a metric for quality control and
proxy for the level of overall levels of gene expression. We
demonstrate a consistent estimator of library size and apply it to the
data generated by the GTEx Consortium. We show the impact the choice
of such a normalization has on common expression analyses such as
differential expression and tissue-specificity.

November 6, 2018

Peter Kharchenko

Associate Professor of Biomedical Informatics
Harvard Medical School

Analysis of transcriptional dynamics with single-cell transcriptomics

Single-cell RNA-seq measurements can reveal detailed composition of complex biological tissues, however generally capture a snapshot of the cellular states at a given point in time. However, in many contexts, such as development or stimulus response, it is the dynamics of the cellular state is of primary interest.  We find that residual signatures of mRNA processing, such as trace amounts of unspliced mRNA, can be used to estimate the first time derivative of the transcriptional state. These RNA velocity can be used to extrapolate the state of the cells into the future on the timescale of hours, facilitating interpretation of the cell dynamics.


October 2, 2018

Robert Maier

Analytic and Translational Genetics Unit
MGH, Broad Institute, HMS

Polygenic adaptation signals for height are confounded by population structure

with Mashaal Sohail, Andrea Ganna, Alex Bloemendal, Alicia Martin, Mark Daly, Nick Patterson, Benjamin Neale, Iain Mathieson, David Reich, Shamil Sunyaev

In the past six years, numerous publications have reported directional selection signals of polygenic selection on height. Here, we show that many of these results are severely confounded by uncontrolled population stratification. In particular, signals of polygenic adaptation based on summary statistics from the from the GIANT consortium meta-analysis are dramatically reduced in magnitude and, in many cases no longer statistically significant when using summary statistics derived from the UK Biobank. This polygenic adaptation signal was apparently independently confirmed by a tests based on the singleton density score statistic (SDS). However, we show that this signal too is only present when using GIANT but not UK Biobank summary statistics. Specifically, the Spearman correlation between p-value and SDS statistic is 2e-65 using GIANT statistics but only 0.077 using UK Biobank.

We further show that correlations between effect size estimates and allele frequency differences between North- and South- European populations underlie most of these discrepancies. The confounding with population stratification appears to be most severe for less significant SNPs; restricting the analyses to genome-wide significant SNPs results in a higher concordance between GIANT and UK Biobank data. While stratification-free within-family estimates suggest that the phenotypic north-south height gradient in Europe is indeed paralleled by genetically predicted height as reported before, the magnitude of this effect had been greatly overestimated.

Height is widely used as a model for polygenic traits, which raises the question whether other methods using similar summary statistics might suffer from the same kind of confounding. Existing methods that aim to identify the presence of population stratification in GWAS summary statistics are not always applicable, and we provide simple suggestions that can supplement these methods in order to detect uncontrolled stratification and avoid resulting biases.

September 25, 2018

Doga Gulhan

Research Fellow in Biomedical Informatics
Harvard Medical School


May 1, 2018

John Platig

Postdotoral Research Fellow, Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute

Exploring regulation in tissues with eQTL networks

Characterizing the collective regulatory impact of genetic variants on complex phenotypes is a major challenge in developing a genotype to phenotype map. Using expression quantitative trait locus (eQTL) analyses, we constructed bipartite networks in which edges represent significant associations between genetic variants and gene expression levels, and found that the network structure informs regulatory function. We show, in thirteen tissues, that these eQTL networks are organized into dense, highly modular communities grouping genes often involved in coherent biological processes. We find communities representing shared processes across tissues, as well as communities associated with tissue-specific processes that coalesce around variants in tissue-specific active chromatin regions. Node centrality is also highly informative, with the global and community hubs differing in regulatory potential and likelihood of being disease-associated.

April 10, 2018

Victoria Dean

Research Intern
Deep Genomics

Deep Learning for Branch Point Selection in RNA Splicing

Branch point selection is a key step in RNA splicing, yet many many popular splicing analysis tools do not model this mechanism. There were relatively few confirmed branch points until 2015, when a genome-wide map of experimental human branch points was released. This data facilitates, for the first time, modeling branch sites with more sophisticated methods. We used deep learning to model branch site selection, which improved significantly over position-weight matrix models. We show that our branch point model can be used to classify potential disease-causing variants, and can help to improve existing splicing models.

March 20, 2018

Marieke Kuijjer

Postdotoral Research Fellow,Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute

Understanding Tissue-specific Gene Regulation

Although all human tissues carry out common processes, tissues are distinguished by gene expression patterns, implying that distinct regulatory programs control tissue-specificity. In this study, we investigate gene expression and regulation across 38 tissues profiled in the Genotype-Tissue Expression project. We find that network edges (transcription factor to target gene connections) have higher tissue-specificity than network nodes (genes) and that regulating nodes (transcription factors) are less likely to be expressed in a tissue-specific manner as compared to their targets (genes). Gene set enrichment analysis of network targeting also indicates that regulation of tissue-specific function is largely independent of transcription factor expression. In addition, tissue-specific genes are not highly targeted in their corresponding tissue-network. However, they do assume bottleneck positions due to variability in transcription factor targeting and the influence of non-canonical regulatory interactions. These results suggest that tissue-specificity is driven by context-dependent regulatory paths, providing transcriptional control of tissue-specific processes.

February 27, 2018

Andrea Ganna

Research fellow, Analytical and Translational Genetic Unit
Massachusetts General Hospital


December 12, 2017

Camila Lopes-Ramos

Postdoc Fellow
Dana-Farber Cancer Institute

Using gene regulatory networks to understand sex differences

Sex differences plays an important role in development and progression of many diseases, and understanding how the disease differs between the sexes is essential for the advancement of prevention, diagnosis, and treatment. We conducted a comprehensive assessment of gene expression and gene regulatory network modeling in 29 healthy tissues from the GTEx data set. We observed sex-biased patterns of gene expression involving as many as 60% of the human genome, depending on the tissue. Interestingly, we could capture the sex-biased gene expression through differential network targeting. Next, we used network analysis to study colon cancer. Males have both a higher risk of developing colon cancer and a lower survival rate than women, yet the molecular features that drive these sex differences are poorly understood. We compared gene expression between tumors in men and women and found no significant sex differences except for X and Y chromosome genes. However, when we used PANDA and LIONESS to infer patient-specific gene regulatory networks, we identified genes and pathways that were targeted by transcription factors in a sex-specific manner. Our network analysis uncovered patterns of transcriptional regulation that differentiate male and female in healthy tissues, and also in colon cancer. Changes in gene regulatory networks may explain not only differences in tissue expression, but may also help us to understand the mechanism of sex differences in diseases and other complex traits.

November 7, 2017

Joshua Campbell

Assistant Professor, Division of Computational Biomedicine
Department of Medicine, Boston University

Characterizing transcriptional states and cellular populations in single-cell RNA-seq data using discrete Bayesian hierarchical modeling

We can try to understand complex biological systems by dividing them into hierarchies where each level of the hierarchy is composed of different parts that perform distinct biological functions. For example, organisms can be subdivided into a collection of complex tissues; each complex tissue is composed of different cell types; each cell type is comprised of a unique combination of transcriptional pathways (i.e. transcriptional states); and each transcriptional state is composed of groups of genes that are coordinately expressed to perform specific molecular processes. Single-cell RNA-seq can be used to explore these hierarchies by unbiasedly identifying cellular populations within a sample and to determining the unique combination of transcriptional states that define each subpopulation. However, this type of data is noisy and sparse due to the challenges inherent in amplifying small amounts of RNA from single cells.

With these goals and challenges in mind, we implemented a novel discrete Bayesian hierarchical model that reflects the hierarchy observed in biological systems. This model, which we call Cellular Latent Dirichlet Allocation (Celda), can be used to simultaneously cluster co-expressed genes into transcriptional states, cells into subpopulations, and quantify the proportion of each cellular subpopulation within individual samples. We used this model to identify transcriptional states and novel cellular subpopulations enriched in the airways of smokers versus nonsmokers. Overall, these models represent novel approaches to characterizing cellular and transcriptional heterogeneity in biological samples that have been profiled with single-cell RNA-seq.

October 3, 2017

Yang-Yu Liu
Assistant Professor of Medicine
Brigham and Women’s Hospital and Harvard Medical School

Controlling Human Microbiota

We coexist with a vast number of microbes—our microbiota—that live in and on our bodies, and play an important role in human physiology and diseases. Propelled by metagenomics and next-generation DNA sequencing technologies, many scientific advances have been made through the work of large-scale, consortium-driven metagenomic projects. Despite these advances, there are still many fundamental questions regarding the dynamics and control of microbiota to be addressed. Indeed, it is well established that human-associated microbes form a very complex and dynamic ecosystem, which can be altered by drastic diet change, medical interventions, and many other factors. The alterability of our microbiome offers opportunities for practical microbiome-based therapies, e.g., fecal microbiota transplantation and probiotic administration, to restore or maintain our healthy microbiota. Yet, the complex structure and dynamics of the underlying ecosystem render the quantitative study of microbiome-based therapies extremely difficult. In this talk, I will discuss our recent theoretical progress on controlling human microbiota [1-6].

September 26, 2017

Rui-Sheng Wang
Instructor of Medicine
Brigham and Women’s Hospital and Harvard Medical School

Network analysis of the genomic basis of the placebo effect

The placebo effect is a phenomenon in which patients who are given an inactive treatment (e.g., inert pill) show a perceived or actual improvement in a medical condition. Placebo effects in clinical trials have been investigated for many years especially because placebo treatments often serve as the control arm of randomized clinical designs. Recent observations suggest that placebo effects may be modified by genetics. This observation has given rise to the term “placebome,” which refers to a group of genome-related mediators that affect an individual’s response to placebo treatments. In this talk, I will present a network analysis of the placebome and identify a placebome module in the comprehensive human interactome using a seed-connector algorithm. I will also talk about the validation of the placebome module using a large cohort of the Women’s Genome Health Study (WGHS) and discuss the closeness of the placebome module to different disease modules and drug target modules.


Tuesday, May 9, 2017
1:00-2:00 PM
Biostats Conference Room, Building 2, Rm 426

Guo-Cheng Yuan
Associate Professor of Computational Biology and Bioinformatics
Harvard T.H. Chan School of Public Health

Systematic Mapping of Cellular States and Regulatory Circuits

Human and other multi-cellular organisms are composed of diverse cell types each has its own morphology and biological functions. However, efforts for comprehensive characterization of cell types have just begun recently, thanks to the rapid development of single-cell technologies and data analysis methods.  In this talk, I will discuss a few methods our lab developed to characterize cell states and lineages. I will also discuss our work in elucidating the gene regulatory networks underlying cell fate decision and maintenance.

Tuesday, April 11, 2017
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426

Luca Pinello
Assistant Professor
Massachusetts General Hospital and Harvard Medical School

Chromatin state variability & single-cell state and developmental trajectory modeling

In this talk I will present, first, an approach that exploits chromatin state variability across cell types to uncover non-coding functional genomic regions and potential regulated genes. This method recovers regions enriched for genetic variants, and provides additional power in explaining genetic heritability of traits or diseases, compared to well-characterized functional elements such as enhancers, promoters or DHS sites.

Then I will present a novel computational method called ARIADNE for single-cell analysis of transcriptomic data. This method can be used to disentangling complex cellular types and states in development, differentiation or in perturbation studies. It can accurately detect cellular hierarchies and recover complex developmental trajectories. In addition, it provides informative and intuitive visualization techniques to highlight important genes that can be used as markers to define sub-populations and rare cell types.

Tuesday, March 28, 2017
12:30-1:30 PM
*Kresge G2*

Samira Asgari
Brigham and Women’s Hospital

Genomics of severe paediatric infections

Infectious diseases are among the leading causes of human mortality and the burden of infectious disorders is the greatest in paediatric population. For any infectious disorder only a fraction of the exposed individuals develop the clinical symptoms and only a small fraction of the individuals with clinical symptoms develop life-threatening   conditions (e.g. extreme phenotypes). These individuals with extreme phenotypes are more likely to carry variants of profound effects. In this presentation I will talk about the role of rare human genetic loss-of-function (LoF) variants in life-threatening susceptibility to Pseudomonas aeruginosa (P. aeruginosa) and respiratory syncytial virus (RSV) infections.

Tuesday, March 7, 2017
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426

Wei Li

Postdoctoral Fellow
Harvard School of Public Health

CRISPR Screens: Algorithms and Applications to Functional Genomics   

High-throughput CRISPR/Cas9 knockout screens have shown great promise in the functional studies of coding genes and non-coding elements. In this talk, I will discuss my recent work to develop algorithms for modeling CRISPR screening data, and identify functional coding and non-coding elements in cancer cells. I present MAGeCK and MAGeCK-VISPR, two novel algorithms to analyze CRISPR screens. MAGeCK uses a ranked-based method to identify top hits in the screen, while MAGeCK-VISPR builds a maximum-likelihood framework to analyze screens involving multiple conditions. Based on these algorithms, we studied the mechanism of endocrine resistance, where cancer cells are resistant to the standard endocrine therapy in many breast cancer patients. We identified and validated genes whose loss confers endocrine resistance, as well as synthetic lethal vulnerabilities that are potential drug targets. These findings demonstrate the potential of CRISPR screens to tackle many emerging problems, including cancer and drug resistance.

Tuesday, February 14, 2017
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426

Jialiang Huang

Postdoctoral Fellow
Harvard School of Public Health

Dissection of super-enhancer hierarchy

Super-enhancers (SEs), groups of putative enhancers with unusually high enhancer activity, play prominent roles in determining cell identity in development and disease. However, it remains unclear to what extent SEs are structurally and functionally distinct from conventional enhancers. We hypothesize that many super-enhancers are composed of a hierarchy of constituent elements whose functional difference is associated with long-range chromatin interaction patterns. To test this hypothesis, we developed a systematic approach to classify SEs into hierarchical and non-hierarchical subtypes by integrating Hi-C and ChIPseq data. We find that, comparing with non-hierarchical SEs, hierarchical SEs enriched with cell-type specific regulators. Furthermore, within each hierarchical SE, we found that the hub enhancers are highly enriched for disease-associated variants, which might suggest their functional potency. In a few cases, the hierarchy within a SE has been previously mapped out by using CRISPR/Cas9-mediated. Comparison between this functional hierarchy and that predicted from chromatin interactions shows strong agreement. Taken together, these results strongly support our hypothesis that many SEs are composed of a hierarchical structure that is associated with chromatin interaction patterns.


Tuesday, December 13, 2016
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426

Bo Li
Harvard T.H. Chan School of Public Health

“Computational approaches for understanding tumor-immune interaction with clinical implications”

Characterization of the interaction between cancer and immune system is critical to developing novel immunotherapies. Here in this talk, I will present two computational methods we have developed. The first one is Tumor IMmune Estimation Resource, or TIMER, which is a statistical tool for deconvolving different immune cell components in the tumor microenvironment using gene expression data. The second one is called T-cell receptor Repertoire Utilities for Solid Tumor, or TRUST, which is a de novo assembler for analyzing the TCR hypervariable sequences using unselected RNA-seq data. Application of both methods to large cancer cohort lead to biological findings with potential clinical applications.

Tuesday, Nov 8, 2016
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426

Yue Li

Computer Science and Artificial Intelligence Laboratory (CSAIL)

“Joint Bayesian inference of risk variants using summary statistics and epigenomic annotations across multiple traits”

Fine-mapping causal variants is challenging due to linkage disequilibrium and the lack of interpretation of noncoding mutations, which contribute to ~70-90% of GWAS SNPs. Most of the existing fine-mapping methods do not scale well on inferring multiple causal variants per locus and causal variants across multiple related diseases. Moreover, Many complex traits are genetically related and potentially share causal mechanisms. Thus, we hypothesize that exploiting the correlation between traits at the epigenomic annotation level may prove useful in fine-mapping for shared causal mechanisms that go beyond the level of individual variants. We develop a new Bayesian fine-mapping model named RiVIERA. The key features of RiVIERA include 1) require only summary statistics; 2) ability to model epigenomic covariance of multiple related traits; 3) efficient posterior inference of causal configuration; 4) efficient Bayesian inference of causal annotations.

We conducted a comprehensive simulation studies using 1000 Genome and Roadmap epigenomic data to demonstrate that RiVIERA compares quite favorably with existing methods yet having unique advantages of jointly inferring and leveraging the underlying tissue-specific functional co-enrichment among traits that are not necessarily associated with the same set of risk loci. In particular, the efficient inference of multiple causal variants per locus from our RiVIERA model led to significantly improved estimation of the causal posterior and functional enrichments compared to the state-of-the-art fine-mapping methods especially for the cases where more than 3 causal variants co-harboured within the same loci. Furthermore, joint modelling multiple traits confers further improvement over the single-trait mode of the same model, which attributes to the more robust estimation of the enrichment parameters. Finally, we demonstrated the application of RiVIERA on jointly modelling several well-powered GWAS datasets of genetically related traits. Our analyses revealed several putative pleiotropic pathways that are disrupted by separate mutations in those traits, which do not share the same risk loci. RiVIERA is freely available from Github.

Tuesday, October 4, 2016
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Curtis Huttenhower

Associate Professor of Computational Biology and Bioinformatics
Harvard T.H. Chan School of Public Health

Methods for population studies of the human microbiome

Epidemiology-scale studies of the human microbiome pose unique methodological challenges: assay technologies are still new and rapidly changing, inter-individual biological and environmental variation is high, and relevant molecular, translational, and public health endpoints are not yet fully defined. I will discuss some of the lab’s efforts to close these gaps, beginning with an overview of three phases of the NIH Human Microbiome Project. These include, first, previously published work to define a “healthy” microbiome. The project has continued with ongoing work to study its strain-level composition and molecular functionality in over 2,500 shotgun metagenomes, the largest body-wide survey to date. Finally, new work in the “HMP2” characterizes microbiome dynamics and metabolism during inflammatory bowel disease. I will conclude with results from the Microbiome Quality Control Project (MBQC), the first large-scale effort to identify and mitigate sources of variation and non-reproducibility in human microbiome studies.

Tuesday, September 27, 2016
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Peng Jiang
Research Fellow
Harvard T.H. Chan School of Public Health

“Inference of genes associated with resistance to targeted cancer therapy”

Drug resistance is a major challenge of targeted therapy in cancer. We present CARE (Comprehensive Analysis of REsistance), a computational method for identification of genes associated with targeted therapy resistance from pharmacological screen data. The methodology of CARE is based on test of interaction effect in multivariate regression. When applied to CCLE, CGP and CTRP datasets, CARE significantly outperformed other computational methods, such as Elastic Net and ANOVA, in predicting drug resistance-associated genes. Moreover, the gene signature inferred by CARE can better predict therapy clinical outcome than the signatures from other experimental technologies such as CRISPR and shRNA screens.  When finding genes associated with Lapatinib resistance, CARE identified PRKD3 as the top candidate. Experimental validation by both siRNA pool and small molecule compounds confirmed that inhibition of PRKD3 significantly sensitizes HER2+ breast cancer cell to Lapatinib.


Working Group Organizers: Zhonhua Liu & Pier Palamara

Tuesday, April 5, 2016
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Kaitlin Samocha

Kaitlin Samocha is a graduate student in the Biological and Biomedical Sciences (BBS) program at Harvard Medical School. She currently works with Dr. Mark Daly on a number of projects revolving around the application of statistical rigor to the results of sequencing studies. Her academic interests, more broadly, focus on the genetic basis underlying complex traits.

“Using large reference-population call-sets to prioritize observed variation in patients with neurodevelopmental disorders

It is a primary challenge of human genetics to distinguish disease-causing rare variation from the multitude of more benign, low-frequency variants found in any genome. As a complement to methods that predict the deleteriousness of individual variants, we target variants found in genes that show unusually low levels of variation in healthy populations. Leveraging the large number of reference individuals in the call-set released by the Exome Aggregation Consortium (ExAC; n=60,706), our model of mutation predicts the number of expected rare (minor allele frequency <0.001) variants for each gene and identifies genes that are significantly depleted of loss-of-function and/or missense variation. Further, the vast number of observable rare variants in ExAC allows us the statistical power to use our method to search for missense-constrained regions within genes.

We applied both gene-level and region-level metrics of mutation intolerance (constraint) to de novo variants found in patients with neurodevelopmental disorders, such as autism spectrum disorders and intellectual disability, as well as to the variants found in controls. The previously established excess de novo loss-of-function variation found in these cases manifests primarily in genes we identify as highly loss-of-function constrained. We also find that the regions with the most severe missense constraint have a greater enrichment of established pathogenic variants, in addition to a higher rate of de novo missense variants found in cases when compared to the rate seen in controls. By contrast, those regions under no constraint show no difference between case and control de novo rates. These analyses highlight the use of evaluating genic intolerance to mutations when interpreting the variation found in patients.

Tuesday, March 1, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Wouter Meuleman
Postdoctoral Associate
MIT CSAIL & Broad Institute

Interpreting the non-coding genome using computational epigenomics

Chromatin state models and annotations are valuable tools for interpreting epigenomic data across potentially large numbers of different epigenomic marks. Summarizing the combinatorial occurrence of marks into a discrete number of states, they provide insight into the functional characteristics of genomic regions. Although useful for the interpretation of particular genomic regions of interest, it remains a challenge to extract meaningful patterns at the genome-wide level. The growing amount of epigenomic data across multiple tissues, cell types, conditions, and individuals, calls for scalable quantitative methods allowing for interpretation and mining of chromatin state models.

To address this challenge, we developed epilogos, a novel approach for analyzing chromatin state models across an arbitrarily large number of epigenomes. At the core of the approach, chromatin states are modelled and visualized in a manner analogous to “motif logos”, as traditionally learned from amino acid or nucleotide biological sequences. This allows for an immediate and intuitive readout and interpretation of results.
The epilogos approach greatly improves the interpretability of epigenomic reference maps such as those generated by the Roadmap Epigenomics Consortium, leading to a better interpretation of genetic maps as well. It has already been shown to easily scale up to hundreds of epigenomes, representing thousands of underlying genome-wide signal tracks. It allows for the quantification of epigenomic variation between cellular conditions, and opens up possibilities for the characterization of spatial as well as temporal epigenomic patterns.

Tuesday, February 2, 2015

12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Giulio Genovese
Stanley Center for Psychiatric Research
Broad Institute

Detection of copy-number-neutral-loss-of-heterozygosity (CNNLOH) events in whole genome sequencing (WGS) experiments

Measurements of allelic fractions at heterozygous sites for the purpose of detection of copy number alterations have initially focused on detection of two separate Gaussian mixtures around 0.5 related to, respectively, alleles that were gained and alleles that were lost in the event. However, when mosaic alterations are present at low allelic fractions, it becomes more challenging to discern signals from two separate Gaussian mixtures rather than a single Gaussian distribution generated by the allelic ratios at heterozygous sites. Phasing information, allowing to not have to guess which Gaussian mixture to observe in the case of a mosaic copy number alteration, has been successfully used to aid in the detection of subtle allelic imbalance in data from genotyping arrays.

In WGS experiments at 30x the standard deviation for the measured allelic fraction is ~9% while the standard deviation for the measured allelic fraction in a genotyping experiment is ~4%. While the lower information at a single heterozygous site can be made up for by the much larger number of heterozygous sites assayed across the whole genome, due to this disparity phasing information is even more important in WGS experiments than it is in genotyping array experiments.

We expanded on existing methodologies and developed a new algorithm to incorporate phasing information into detection of CNNLOH events from WGS data. Such a method incorporates state-of-the-art phasing from population genetics models such as Beagle and a Markov model to allow for errors in the phasing. Combination of phasing and WGS allows detection of large CNNLOH mosaic events present in as little as 1-2% of cells as opposed to published methods requiring relatively high frequency of cells with the same abnormal karyotype (>5-10%).

Tuesday, December 1, 2015

12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Tune Pers
Children’s Hospital Boston and Broad Institute
Research fellow in Joel Hirschhorn’s Lab

Towards schizophrenia genes and pathways based on analysis of GWAS data

Over 100 associated genetic loci have been robustly associated with schizophrenia. Gene prioritization and pathway analysis have focused on a priori hypotheses and thus may have been unduly influenced by prior assumptions and missed important causal genes and pathways. Using a data-driven approach, we show that genes in associated loci: (1) are highly expressed in cortical brain areas; (2) are enriched for 104 gene sets including dendritic spine development (P=3.15×10-6), processes related to the postsynaptic density (P<1.29×10-6) and behavioral mouse phenotypes (P<5×10-4); and (3) contain 62 genes that are functionally related to each other and hence represent promising candidates for experimental follow up. We validate the relevance of the prioritized genes by showing that they are enriched for rare disruptive variants from schizophrenia sequencing studies (odds ratio 1.74, P=0.04), and are enriched for genes encoding members of mouse and human postsynaptic density proteomes (odds ratio 4.56, P=5.00×10-4; odds ratio 2.60, P=0.049).

Tuesday, November 3, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Rodolphe Thiébaut
Professor at the Bordeaux University, Bordeaux, France

Statistical analysis of high-dimensional data in clinical trial: the example of an HIV vaccine phase I/II trial

The size and the complexity of data collected for the evaluation of interventions in clinical trials has considerably changed. This is due to the availability and the improvement of new technologies including gene sequencing. Today, vaccine immunogenicity is assessed by various measurements including cell populations characterization (up to 216 with flow cytometry to 240 with CYTOF), proteins productions (cytokines at a single cell level with intracellular staining, in supernatants with multiplex assays), gene expression and viral adaptation using microarray and sequencing technologies. This situation gives the opportunity to study the mechanistic effect of the vaccine and also to try to predict the future response with baseline or early responses thank to the broad spectrum of the measurements performed.

In this talk, I will use the example of the DALIA 1 vaccine trial evaluating the response to an ex vivo generated DC loaded with HIV-lipopeptides in HIV patients treated by antiretroviral therapy (ART). Gene expression in whole blood was measured by microarrays (Illumina HumanHT-12) at 14 time points. Post vaccination immune responses were evaluated using various assays. In step 1, a Time-course Gene Sets Analysis (TcGSA) was performed using hierarchical models allowing heterogeneity in predefined gene sets and implemented in an R package (Plos Comp Biol, 2015,11:e1004310). Statistical properties of this approach have been studied through simulations. Association between abundance of genes selected in step 1, immune responses at w16 and viral replication after ART interruption was analysed in step 2 using an extension of the sparse-Partial Least Square approach to take into account the structural or group effects due to the relationship between markers among biological pathways.

This example illustrates new opportunities as well as the complexity of the analysis of high dimensional datasets and the need of adapted statistical approach.

Tuesday, October 6, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room

Peng Jiang
Postdoc at Dana-Farber Cancer Institute and Harvard School of Public Health

Inference of regulatory abnormality in cancers from heterogeneous public data

Recent years have seen the rapid growth of large-scale tumor-profiling data, which provides numerous opportunities for novel cancer biology insights. However, the experimental condition of most public data may not match the physiological condition of a cancer type of interest. Meanwhile, the cancer genome is highly unstable. Many alterations could arise from somatic events unrelated with biological regulation.
In this study, we focused on finding tumorigenesis regulators, including transcription factors (TF) and RNA binding proteins (RBP), through integrating a large number and variety of datasets. For efficient and accurate inference on gene regulatory rules, we developed an algorithm, RABIT (regression analysis with background integration). In each tumor sample, RABIT tests the regulatory impact of TF (or RBP), controlling for background effect from copy number alteration and DNA methylation. In each cancer type, RABIT further tests whether public repository data have reasonable correlation with tumor type specific data. Our predicted regulator impact on tumor gene expression is highly consistent with the knowledge from cancer-related gene databases and reveals many novel aspects of regulation in tumor progression.

Brief Research Profile
Peng Jiang is a postdoc at Dana Farber Cancer Institute and Harvard School of Public Health. His research focuses on building up computational methods to integrate big biological data from public domain and discover driver events in diverse cancer types. Peng received his Ph.D. in Computer Science from Princeton University in 2013.


Working Group Organizer: Po-Ru Loh

Tuesday, May 12, 2015
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room

Jae-Hoon Sul
Sunyaev Lab at Brigham & Women’s Hospital

Large-scale genetic studies of human complex traits

In the past decade, genotyping and next-generation sequencing (NGS) technologies have generated an enormous amount of data to discover genetic variants present in human genomes and to find the genetic basis of diseases. The technologies have shifted the paradigm of genetic studies from studies that analyzed fewer than a hundred individuals at hundreds of markers to studies that analyze more than tens of thousand individuals at millions of genetic variants. With rapid decrease in sequencing costs and emphasis on genomic medicine, studies will sequence hundreds of thousands of individuals in the near future. A major challenge in these large-scale genetic studies is developing computational methods that can utilize this big data efficiently.

In this talk, I will describe my work to address this challenge. As genome-wide association studies have discovered numerous non-coding genetic variants associated with traits, there has been increasing focus on interpreting these variants using functional genomics. Expression quantitative trait loci (eQTL) studies that attempt to detect genetic variants associated with gene expression may provide clues as to which variants are functional. I will discuss a method to perform multiple testing correction accurately and rapidly in eQTL studies for identification of genes whose expression is influenced by genetic variants. As eQTL studies have grown larger in sample size, multiple testing correction using the permutation test has become a major computational bottleneck. I developed a multivariate normal sampling approach (MVN), and MVN is more than 100 times faster than the permutation test for the sample size of 2,000 while generating almost the same results. Next, I will present a novel approach to detect rare variants associated with a disease in large families. NGS enables studies to evaluate effects of rare variants on complex traits, and family-based studies have attracted great attention recently because of their higher power for rare variant testing than case-control studies. I developed a method called RareIBD that can be applied to large pedigrees, both binary and quantitative traits, and affected-only pedigrees. Using simulations, I will show my method achieves higher power than previous approaches

Tuesday, April 21, 2015
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room

Rany Salem
Hirschhorn Lab at Boston Children’s Hospital

Complex Trait Genetics: GWAS, Phenotypic Heterogeneity and Future Directions

The emergence of Genome-wide association studies (GWAS) has given researchers a powerful tool to dissect the genetic architecture of many complex traits. A major challenge of performing a GWAS is phenotype selection, and dealing with phenotypic heterogeneity. The first part of the talk will focus on results of the first large scale GWAS of diabetic kidney disease and issues related to identifying and dealing with phenotypic heterogeneity. For the second part of the talk, I will present a framework to deal with the limitations of the GWAS consortium model and briefly explore statistical and methodological issues to take advantage of this opportunity. Finally, I will highlight results of a novel association between a structural variant and lipid levels.

Tuesday, January 20, 2015

12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
a pizza lunch will be provided

Manuel A. Rivas
Wellcome Trust Centre for Human Genetics, University of Oxford and the Broad Institute

Exploiting Correlation of Genetic Effects in Rare Variant Association Studies

I consider the problem of assessing association between rare variants and their effects on multiple, possibly correlated phenotypes. This problem arises in population biobanks, where a large number of health related measurements are made, and sequencing association study designs of related disorders. In this talk I will present a statistical framework that employs estimates of the correlation of genetic effects to improve power to discover associations.  The approach exploits knowledge about the correlation in both genetic effects across a group of genetic variants and across a group of phenotypes. Properties of the approach as implemented include that it is computationally efficient, making the analysis of large study designs practical; it is flexible and extensible, making the analysis of gene-sets, pathways, and networks feasible; and it includes standard univariate and multivariate gene-based tests as a special case, thus it can be employed in settings where estimates of the expected correlation of genetic effects are not available.  I will present results from a simulation study and from the analysis of two independent rare variant association studies: 1) blood metabolite measurements (multiple quantitative traits) from subjects in the Oxford Biobank, a collection of 30-50 year old healthy men and women living in Oxford recruited from primary care who underwent a detailed examination at a screening visit; and 2) a targeted sequencing data set of six autoimmune diseases. The statistical and computational framework is implemented in software package MAMBA.

Tuesday, December 2, 2014
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room

Sriram Sankararaman
Reich Lab at Harvard Medical School

Characterizing the structure and impact of archaic admixture in humans

Large-scale studies of human genetic variation, in conjunction with advances in ancient DNA technology, have revealed a number of mixture events between highly-diverged human populations that occurred tens of thousands of years ago (archaic admixture), e.g., between the ancestors of Neandertals and modern non-Africans. While several subsequent studies have found examples of phenotypically-relevant genetic variants introduced by archaic admixture, the biological impact of archaic admixture on human populations is not systematically understood. To understand the biological impact, we need maps of archaic local ancestry i.e., a labeling of the archaic ancestry along an individual genome.

We developed a statistical method based on the framework of Conditional Random Fields (CRF) and applied it to infer Neandertal local ancestry. The CRF provides inferences that are robust and sensitive both on simulated and real data. We applied the CRF to data from the 1000 Genomes project combined with a high-coverage Neandertal genome. The resulting map of Neandertal local ancestry reveals likely positive selection of Neandertal ancestry in specific loci and sets of genes (e.g. genes involved in keratin filament formation), associations between Neandertal-derived alleles and medical phenotypes (e.g. Crohn’s disease), as well as strong purifying selection against Neandertal alleles in part due to the effects of hybrid male sterility.

Finally, I will present ongoing work to study the impact of another archaic admixture between a population known as Denisovans and ancestors of modern Melanesian populations. I will discuss generalizations of the CRF to multi-way and de-novo admixtures as well as unphased data, efforts to study phenotypic associations of Neandertal variants and some open statistical questions ( how do we learn and apply discriminative models to population genetic problems?).

Tuesday, October 28, 2014
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room

Yun Li
Assistant Professor of Genetics & Biostatistics
University North Carolina at Chapel Hill

A Bayesian method for the detection of long-range chromosomal interactions in Hi-C Data

Advances in chromosome conformation capture and next-generation sequencing technologies are enabling genome-wide investigation of dynamic chromatin interactions. For example, Hi-C experiments generate genome-wide contact frequencies between pairs of loci by sequencing DNA segments ligated from loci in close spatial proximity. One essential task in such studies is peak calling, that is, the identification of non-random interactions between loci from the two-dimensional contact frequency matrix. Successful fulfillment of this task has many important implications including identifying long-range interactions that assist in interpreting a sizable fraction of the results from genome-wide association studies (GWAS). The task – distinguishing biologically meaningful chromatin interactions from massive numbers of random interactions – poses great challenges both statistically and computationally. Model based methods to address this challenge are still lacking. In particular, no statistical model exists that takes the underlying dependency structure into consideration. We propose a hidden Markov random field (HMRF) based Bayesian method to rigorously model interaction probabilities in the two-dimensional space based on the contact frequency matrix. By borrowing information from neighboring loci pairs, our method demonstrates superior reproducibility and statistical power in both simulations and real data.

Tuesday, October 7, 2014
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room

Elinor Karlsson
Postdoctoral Fellow and soon-to-be Assistant Professor
Broad Institute
Harvard University FAS Center for Systems Biology
University of Massachusetts Medical School
Natural Selection and Cholera Resistance in Bangladesh

Infectious pathogens are among the strongest selective forces driving recent human evolution, as migration and cultural changes over the last 100,000 years exposed populations to dangerous new diseases. One such pathogen is the causal agent of cholera, the bacterium Vibrio cholerae, which is endemic in the Ganges River Delta. By combining signals of association and natural selection, we have identified genes and pathways implicated in cholera susceptibility in Bangladesh, and developed a model of the innate immune signaling pathways that respond to V. cholerae infection. In this model, inflammasome activation and NF-kB signaling play an integrated role in TLR4-mediated sensing of V. cholerae – which is consistent with our in vitro data. Our approach is broadly applicable to other historically prevalent infectious diseases, such as Lassa fever, tuberculosis, leishmaniasis, dengue fever and malaria, and to complex, common diseases, such as inflammatory bowel disease, for which the associated genes may have been historically selected. Thus, ancient natural selection can give new insights into the function – and dysfunction – of human biology, with important implications for medical genomics.


Working Group Co-organizers: Bjarni Vilhjalmsson & Sasha Gusev


Tuesday, April 22, 2014
12:30-2:30, Building 2-Room 426

Kaitlin Samocha
Analytic and Translational Genetics Unit | MGH

Identification of a Set of Highly Constrained Genes from Exome Sequencing Data

A major challenge of medical genetics is to determine which variants, if any, contribute to disease in a patient. While variants can be prioritized based on their predicted deleteriousness, information about the disrupted gene can also be used to highlight those variants that are more likely to contribute to disease. For example, damaging variants in genes expressed in the relevant tissue might be prioritized over variants in genes that are not expressed in the tissue. Another potential way to prioritize variants is by the evolutionary constraint of the gene.

We developed a sequence context based model of de novo variation to create per-gene probabilities of synonymous, missense, and loss-of-function mutations. We noticed a high correlation (0.94) between the probability of a synonymous mutation in a gene and the number of rare, synonymous variants identified in that same gene using the NHLBI’s Exome Sequencing Project data ( We predicted the number of variants that we would expect to see in the dataset and – in order to quantify deviations from those expected values – created a Z score of the chi-squared difference between the observed and expected variation. While the distribution of these Z scores for the synonymous variants was normal, there is a marked shift in the missense distribution towards having fewer variants than predicted.

We identified a list of excessively constrained genes representing roughly 5% of all genes. This set of genes identified as excessively constrained showed enrichment for entries in the Online Mendelian Inheritance in Man (OMIM) database and, in particular, for those with a dominant inheritance pattern. Using published data, we found that de novo loss-of-function variants identified in patients with autism and intellectual disability were in a constrained gene more often than expected (p < 0.0001 for both). This trend did not hold for those genes with a de novo loss-of-function variant in a control (p = 0.66), indicating that this approach can effectively prioritize genes in which mutations can strongly predispose to disease.

Tuesday, February 25, 2014
12:30-2:30, Building 2-Room 426

Stephan Ripke
Analytic and Translational Genetics Unit | MGH

Reality at Last: Psychiatric Genomics Consortium Quadruples Schizophrenia GWAS Sample Size

Genome wide association studies have been a principal study design in human genetics for almost ten years now. They have been performed on almost all known heritable human traits and diseases with the hope of shedding light on the biology of so called complex genetic diseases like diabetes mellitus, hypertension and hyperlipidemia. The psychiatric field, which has traditionally lacked biologic instruments for medical research was especially enthusiastic about this new technology. Still, early analyses did not bear fruit causing many to question or abandon this approach.

Genome wide association studies are now clearly one of the most successful genetic analysis methods in medical research. Results from these studies have led to many known and possibly novel drug targets and have deciphered many new biological pathways. Interestingly, psychiatric phenotypes are in no way inferior to somatic phenotypes. Yet the path to this success was more arduous than expected.

In this talk I will present lessons learned during the last ten years of GWAS studies. I will give an overview over this journey, highlighting early successes from somatic diseases (e.g. Crohn’s disease) and later but no less striking success from psychiatric diseases, schizophrenia in particular.

I will then attempt to translate these experiences into expectations and recommendations for future work in the field of genetics and in the field of psychiatric genetics specifically.

Tuesday, November 19, 2013
12:30-2:30, Building 2-Room 426

Gosia Trynka
The Raychaudhuri Lab | BWH | HMS

Analyses assessing enrichment of GWAS variants for non-coding annotations in the genome are upwardly biased
Interpreting the function of variants associated to complex traits is challenging, particularly because only a small proportion of the variants maps to the gene coding regions. The activity of the genomic regions can be inferred from whole genome assays for histone modifications, chromatin accessibility or specific transcription factors. Now such maps are becoming available for hundreds of cell-types and tissues.
A test for enrichment within chromatin annotations has become a widely applied approach to infer functions of trait-associated variants. However, currently there are no standards to correctly carry out such analysis. Typically, enrichment analysis takes associated variants and quantifies the overlap with chromatin annotations. The observed enrichment is then compared to the null distribution of SNPs sampled from the whole genome based on different parameters. Most frequently the null is defined by random sets of SNPs or matched for proximity to TSS and minor allele frequency. Different definitions of null distribution highly influence the significance of observed results.
We tested different sets of matching parameters and defined those that are essential to sufficiently control for type 1 error. With simulations we show that the currently employed matching methods are strongly biased towards false positive results. Additionally, we develop an independent method that does not rely on matching and instead takes into account local complex correlation of genomic annotations present at the associated loci.

Tuesday, November 5, 2013
12:30-2:30, Building 2-Room 426

David Golan
Ph.D candidate Rosset lab Tel-Aviv University

Accurate Estimation of Heritability in Case-Control GWAS
Linear mixed effects models have recently gained popularity as the method of choice for estimating heritability from GWAS data, i.e. quantifying how much of the variability of a phenotype can be explained by the genotyped SNPs.
However, most of the interesting diseases and disorders studied are rare (typically affecting <1% of the population), and so the proportion of cases in a study is usually considerably higher than the proportion of cases in the population. This over-representation of cases invalidates several key assumptions of linear mixed models, e.g. the normality and independence of the random effects. Ignoring these problems results in shrunken estimates of heritability. We propose an alternative approach for estimating heritability, related to the well known method of Haseman and Elston. We derive the relationship between the genetic similarity and the phenotypic similarity of any two individuals as a function of the heritability, while explicitly conditioning on the fact that both individuals were selected for the study. Our method then entials regressing the pairwise phenotypic similarities on the pairwise genetic similarities and using the slope to obtain an estimate of the heritability. We show, using simulations, that our method yields unbiased estimates which are considerably more accurate than the current state-of-the-art methodology. Applying our method to several well-studied GWAS yields heritability estimates which are considerably higher than previously published results.

We carry out extensive simulations under a wide spectrum of genetic models and probability distributions of the multivariate phenotype vector to evaluate the powers of our test procedures. We apply the proposed population-based method to analyze a multivariate phenotype comprising homocysteine levels, Vitamin B12 levels and affection status in a study on Coronary Artery Disease and the family-based method to analyze a vector of four endophenotypes associated with alcoholism: the maximum number of drinks in a 24 hour period, Beta 2 EEG Waves, externalizing symptoms and the COGA diagnosis trait in the Collaborative Study on the Genetics of Alcoholism (COGA) project.

Tuesday, October 22, 2013
12:30-2:30, Building 2-Room 426

Saurabh Ghosh
Human Genetics Unit
Indian Statistical Institute, Kolkata

Integrating Multiple Phenotypes For Association Mapping
Most clinical end-point traits are governed by a set of quantitative and qualitative precursors and a single precursor is unlikely to explain the variation in the end-point trait completely. Thus, it may be a prudent strategy to analyze a multivariate phenotype vector possibly comprising both quantitative as well as qualitative precursors for association mapping of a clinical end-point trait. The major statistical challenge in the analyses of multivariate phenotypes lies in the modelling of the vector of phenotypes, particularly in the presence of both quantitative and binary traits in the multivariate phenotype vector.

For population-based data, we propose a novel Binomial regression approach that models the likelihood of the number of minor alleles at a SNP conditional on the vector of multivariate phenotype using a logistic link function. For family-based data comprising informative trios, we propose a logistic regression method that models the transmission probability of a marker allele from a heterozygous parent conditioned on the multivariate phenotype vector and the allele transmitted by the other parent. In both the approaches, the test for association is based on all the regression coefficients.

We carry out extensive simulations under a wide spectrum of genetic models and probability distributions of the multivariate phenotype vector to evaluate the powers of our test procedures. We apply the proposed population-based method to analyze a multivariate phenotype comprising homocysteine levels, Vitamin B12 levels and affection status in a study on Coronary Artery Disease and the family-based method to analyze a vector of four endophenotypes associated with alcoholism: the maximum number of drinks in a 24 hour period, Beta 2 EEG Waves, externalizing symptoms and the COGA diagnosis trait in the Collaborative Study on the Genetics of Alcoholism (COGA) project.

Tuesday, September 24, 2013
12:30-2:30, Building 2-Room 426

Ron Do
Research Fellow in Genetics
Harvard Medical School

Human Genetics Approaches to Understand if Triglyceride-Rich Lipoproteins Cause Coronary Artery Disease
Plasma triglycerides are transported in specific lipoproteins; in observational epidemiologic studies, increased triglyceride levels correlate with higher risk for coronary artery disease (CAD). However, it is unclear whether this association reflects causal processes. Genetic information can help assess causality, and can be useful in dissecting the influences of correlated measures such as triglycerides, low-density lipoprotein cholesterol (LDL-C), and high-density lipoprotein cholesterol (HDL-C). Here, we present results from two studies that help us understand if triglyceride-rich lipoproteins cause coronary artery disease. In the first study, we used 185 common polymorphisms recently mapped for plasma lipid traits (P < 5 x 10-8 for each) to examine the role of triglycerides on risk for CAD. First, we highlight loci associated with both LDL-C and triglycerides, and show that the direction and magnitude of both are factors in determining risk for CAD. Second, we consider loci with a strong magnitude of association with triglycerides but a minimal one with LDL-C, and show that these loci are also associated with CAD. Finally, in a model accounting for effects on LDL-C and/or HDL-C, a polymorphism’s strength of effect on triglycerides is correlated with the magnitude of its effect on CAD risk. In the second study, we sequenced the exomes of 1,027 individuals with myocardial infarction (MI) at an early age, compared their sequences with 946 older individuals without MI, and validated initial associations using statistical imputation, genotyping, and re-sequencing. In follow-up sequencing of > 10,000 individuals, we observed rare non-synonymous mutations in the APOA5 gene were more frequent in early-onset MI patients versus MI-free controls at an exome-wide significance threshold (APOA5, 1.4% versus 0.6%, odds ratio = 2.2, P =5 x 10-7). Carriers had higher plasma triglyceride concentrations compared to non-carriers (median in carriers was 167 mg/dl versus 104 mg/dl for non-carriers, P=0.007). These results suggest that triglyceride-rich lipoproteins may causally influence risk for CAD and that novel therapeutic approaches targeted to TG-rich lipoproteins might be expected to reduce risk of CAD.


Working Group Organizer: Hugues Aschard
Postdoc Committee: Bjarni Vilhjalmsson, Jin Zhou, Megha Padi

Tuesday, April 2, 2013
12:30-2:30, Building 2, Room 426

Nikolaos A. Patsopoulos, MD, PhD
Instructor in Neurology
Brigham & Women’s Hospital
Harvard Medical School

The genetics of common variation in Multiple Sclerosis

Multiple Sclerosis is a neurodegenerative disease with strong evidence of genetic predisposition. In this PQG I’ll present current and on-going efforts of the International Multiple Sclerosis Genetics Consortium (IMSGC) to identify and fine-map genetic effects in large-scale data sets. The focus will be in design, addressing technical issues, e.g. complicated sample structure, and application of methods, e.g. control for population stratification.

Tuesday, March 5, 2013
12:30-2:30, Building 2, Room 426

Tune H. Pers
Research Fellow in Pediatrics
Children’s Hospital Boston
Harvard Medical School

Predicting pathways and selecting likely causal genes from a height genome-wide association study by integrative network-based analysis

In genome-wide association (GWA) studies many loci have no single obviously causal gene, therefore the challenge for moving from association to novel biological insight is to identify which gene at each locus most likely explains the association. Another challenge is to identify whether associated loci coalesce onto biological pathways. A key first step is to use computational approaches to prioritize genes within loci as most likely to be biologically relevant, and to assess whether genes in associated loci enrich for particular gene sets. Our approach is based on expression patterns derived from a heterogenous panel of 80,000 expression arrays that predict functions for genes. The predicted functions include predicted protein-protein interactions, predicted phenotypic consequences, predicted Reactome pathway members, predicted KEGG pathway members, and predicted Gene Ontology term membership. We subsequently used these predicted functions to systematically identify the most likely causal gene(s) at a given locus. We evaluate our method using the GIANT consortium’s GWA meta-analyses for human height. For human height GWA loci, our integrative approach performs consistently better in predicting likely causal genes than approaches based on one or two data types only. As unbiased benchmarks, we tested for enrichment of genes that recently have been shown to be differentially expressed in rodent growth plates, and for genes that have associated missense variants. Based on these and other benchmarks, this approach outperforms methods using fewer data types. We have developed an unbiased computational approach that integrates a variety of data types and performs well in prioritizing potentially causal genes from GWA data.

Tuesday, December 4, 2012
12:30-1:30, Building 2, Room 426

Paz Polak
The Broad Institute
BWH/HMS, Sunyaev Lab

Revealing the factors that shape the regional mutation rates in melanoma

Recent advances in high-throughput sequencing technology enable us to identify cancer-specific mutations at any position in a patient’s genome, allowing dozens of studies to reveal a rich world of mutation patterns. We have been studying how the density of mutations changes along the genome in different cancer types. We used epigenetic data from several normal cell types, recently released by the ENCODE project consortium, to predict the variation in cancer-specific mutation densities along human chromosomes. For prediction, we used three different methods: Random Forest; Poisson regression with elastic net regularization, which is a compromise between lasso and ridge regression; and Generalized Additive Models (GAM), which smooth the predictors and deal with the need to linearize their relations. Overall, we were able to explain up to 80% of the variation in melanoma genomes, showing that chromatin structure is involved in establishing mutation patterns in cancer.

Tuesday, November 27, 2012
12:30-1:30, Building 2, Room 426

Sara Lindstrom
Research Scientist
Department of Epidemiology

Deep targeted sequencing of 12 breast cancer loci in 4,700 women across four different ethnicities

Genome-wide association studies (GWAS) have identified multiple genetic loci associated with breast cancer risk. However, the underlying genetic structure in these regions is not fully understood and it is likely that the index GWAS signal originates from one or more yet unidentified causal variants within the region. We used next-generation sequencing to characterize 12 GWAS-discovered breast cancer loci in a total of 2,300 breast cancer cases and 2,300 controls across four ethnic populations. Region intervals spanned between 46kb and 973kb. In total we hybrid-captured and sequenced 5.5Mb. On average, we were able to capture 82% of the non-repetitive sequence in the targeted regions, and the average fraction of captured bases sequenced with a depth >20x was more than 95%. After single nucleotide variant (SNV) calling and quality control, a total of 138,898 SNVs remained for analysis. 87.4% of those had a minor allele frequency less than 0.5% and 50.2% were private mutations. Across all regions, a total of 81 SNVs showed evidence for association with breast cancer (P<0.001). However, we did not find unequivocal association evidence supporting a causal role for any individual SNV. These results illustrate the challenges facing post-GWAS fine-mapping studies.

Tuesday, October 2, 2012
12:30-2:00, Building 2, Room 426

Hugues Aschard
Postdoc of Peter Kraft

Exploring the effect of gene-gene and gene-environment interactions in breast-cancer risk prediction

Genome-wide association scans have identified scores of common genetic variants associated with the risk of complex diseases in the last years. However their aggregate effects on risk beyond traditional factors remain uncertain. Recent papers reported that the addition of information on these genetic variants to existing risk models for common diseases can moderately improve discrimination and prediction in estimating that risk. Here we explore the extent to which the consideration of interaction between genetic variants (GxG) and interaction between genetic variants and established clinical factors (GxE) can add to these risk-assessment models. Using data from the Nurses’ Health Study, we derive the predictive ability of 15 independent published risk alleles in 1145 breast cancer cases and 1142 matched controls. We evaluate the extent to which single nucleotide polymorphisms (SNPs) can discriminate breast cancer cases from controls in the whole sample and in strata defined by non-genetic risk factors. We conduct then a simulation study to explore the potential improvement of discrimination if complex GxG and GxE interactions exist and we know them.

2011-2012 Working Groups

Working Group Organizers: Elizabeth Schifano & Monica Ter-Minassian

Tuesday, March 27, 2012
12:30-2:00, Building 2-Room 426

Aedin Culhane, Ph.D.
Research Scientist
Department of Biostatistic
Dana-Farber Cancer Institute

New biclustering approaches for pan-cancer analysis of a large compendium of genomics data

Though it is widely recognized that molecular traits such as p53 mutation and deregulation of the cell cycle span many types of cancer. Conventionally classification of cancer is by anatomical site and studies of cancer generally analyze the disease within these boundaries. There is a growing recognition that histological and molecular heterogeneity cross anatomical boundaries, for example clear cell ovarian cancer is closer to clear cell renal cancer than it is to other ovarian cancers. New pan-cancer analysis approaches are required to discover the molecular taxonomy of cancer upon which personalized cancer medicine can be built. I will describe a new biclustering approach that we have developed to enable analysis of large compendium of cancer genomics data and the pan-cancer molecular traits we have discovered.

Tuesday, February 28, 2012
12:30-2:00, Building 2-Room 426

Peggy Lai, M.D.
Post-doctoral research fellow in Environmental Health
(Mentors: David Christiani and Winston Hide)

Using a gene expression signature to understand endotoxin related chronic obstructive lung disease

A quarter of patients with chronic obstructive pulmonary disease (COPD) in the United States have never smoked. Endotoxin is a common and poorly recognized environmental exposure that may cause non-tobacco related COPD. In recent studies, chronic inhalational endotoxin has been associated with the development of obstructive lung disease in both murine models and human epidemiologic studies of occupationally exposed populations. Despite this, endotoxin related lung disease has not been well studied. Three distinct murine models of chronic endotoxin exposure have been developed in different laboratories with use of microarrays to characterize global gene expression. We identified a common 101 gene signature for recurrent endotoxin exposure that is both biologically interesting at the gene level, and at the pathway level demonstrated increasing numbers of inflammation related genes with longer periods of endotoxin exposure; this is surprising in light of the well-known phenomenon of endotoxin tolerance, suggesting that dysregulated inflammation may play a role in endotoxin-related lung disease. The focus of this talk is on the application of biostatical and bioinformatics methods to a clinically relevant but poorly understood disease.

Tuesday, January 31, 2012
12:30-2:00, Building 2-Room 426

Lin Li, Ph.D.
Postdoctoral Research Fellow
Department of Biostatistics
Harvard School of Public Health

Incorporating gene expression information in SNP set association testing

Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are likely to be expression quantitative trait loci (eQTLs). It is of interest to utilize eQTL information for understanding the genetic basis underlying complex traits. On the other hand, SNP set association testing, especially region-based, can harness information of SNPs in linkage disequilibrium and lead to increased power. We are interested in incorporating gene expression information in SNP set association testing. Specifically, we employ kernel machine tests using weighted kernels, where the predefined weights are based on eQTL signals. We also consider an adaptive test that is robustly powerful regardless of whether the expression signals are relevant or not to the trait of interest. As an application, we analyze an asthma study data using our proposed methods.

(Joint work with Prof. Xihong Lin and Prof. Liming Liang)

Tuesday, December 13, 2011
12:30-2:00, Building 2-Room 426

Chen-yu Liu
Research Fellow, Department of Environmental Health
Harvard School of Public Health

A pilot study of residential petrochemical exposures and DNA methylation changes for leukemia risk

We conducted a pilot study to assess the associations of genome-wide methylation pattern changes in peripheral blood and residential petrochemical exposure on 30 childhood acute lymphoblastic leukemia (ALL) cases and controls, as an exploratory extension of a population-based case-control study [1-3]. Higher concentrations of selected PAHs and VOCs in the vicinity of petrochemical industries in the study area have been reported, compared to those in industrialized communities of the U.S. [4, 5]. Hypermethylation /hypomethylation of specific genes involved in carcinogens metabolisms and exposure to several carcinogens of PAH and VOC may increase the childhood ALL risk. We used geographic information system tools to estimate individual-level exposure by accounting for subjects’ mobility, length of stay at each residence, distance to petrochemical plant(s), and monthly prevailing wind direction. DNA methylation profiles were obtained by using the Illumina Infinium HumanMethylation27 BeadArray. Differential methylation levels between ALL cases and controls were found in genes involved in inflammatory response, lymphocyte differentiation/ activation, and apoptosis. Similar methylation pattern changes were observed when comparing petrochemical exposure correlated effects. Seventy prevent genes were overlapping compared the petrochemical exposure correlated changes and leukemia associated changes. The results do not remain significant after adjusting for multiple comparisons.


  1. Liu, C.Y., et al., Maternal and offspring genetic variants of AKR1C3 and the risk of childhood leukemia. Carcinogenesis, 2008. 29(5): p. 984-90.
  2. Liu, C.Y., et al., Cured meat, vegetables, and bean-curd foods in relation to childhood acute leukemia risk: A population based case-control study. BMC Cancer, 2009. 9(1): p. 15.
  3. Yu, C.L., et al., Residential exposure to petrochemicals and the risk of leukemia: using geographic information system tools to estimate individual-level residential exposure. Am J Epidemiol, 2006. 164(3): p. 200-7.
  4. Lee, C., The characteristics of ambient volatile organic compounds and their carcinogenesis effects in the vicinity of one petrochemical industry in Taiwan. (In Chinese). Taipei, Taiwan, Republic of China: National Science Council. 1995.
  5. Lee, C. and G. Tsai, The characteristics of ambient polycyclic aromatic hydrocarbons and their carcinogenesis effects in the vicinity of one petrochemical industry in Taiwan. (In Chinese).Taipei, Taiwan, Republic of China: National Science Council,. 1994.

Tuesday, November 22, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided

Cristian Tomasetti
Postdoc with Giovanni Parmigiani
Dana-Farber Cancer Institute and Harvard School of Public Health

Mathematical Modeling of Random Genetic Mutations for a General Class of Tumor Growth Curves with Applications

Various attempts to study random genetic mutations have been made
in the mathematical modeling literature. An important goal has been the
estimation of the probability for a specic mutation to be present in a tumor
of a given size and the related estimate for the expected number of mutants,
if mutants are indeed present. Previous mathematical models addressing
these questions have shown however two potential limits: the tumor was
assumed to be homogeneous and, on average, growing exponentially. In this
talk a few recent results will be presented where the heterogeneity of the
tumor cell population is taken into account, and the exponential growth
of cancer has been replaced by other, arguably more realistic types of tu-
mor growth. Various application of this mathematical framework to chronic
myeloid leukemia and gastrointestinal stromal tumor will be introduced.

Tuesday, October 25, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided

Noah Zaitlen
Postdoc with Alkes Price

Components of Heritability in An Icelandic Cohort

The combined genotypes and genealogy of 38,167 Icelanders provides the
opportunity to examine the contribution of different components of
genetic variation to the heritability of complex phenotypes. In this
work we focus on both parent-of-origin effects, and the contribution
of typed versus untyped variants. Genetic variation can alter
phenotype in a parent-of-origin specific manner, for example, via
imprinting. Kong et al [1] identified three Type 2 Diabetes (T2D)
variants including rs2334499, which is protective for T2D when
inherited maternally, confers risk when inherited paternally, and lies
in an imprinted region of the genome. The genealogy provides a means
of resolving not only highly accurate phasing, but also the
parent-of-origin of each typed variant. Using this information we
develop methods to examine the total contribution of parent-of origin
effects, as well as the differences between paternal and maternal
contributions to the heritability of complex phenotypes. We find that
variation of height shows little evidence of contribution from
parent-of-origin effects, while T2D shows highly significant evidence
(p-value < 1.9×10<sup>-5</sup>), suggesting that many more variants
such as rs2334499 remain to be found. In addition to parent-of-origin
effects, the unique long range phasing the deCode data provide a means
of efficiently estimating the fraction of the genome shared identical
by descent (IBD) as well as identical by state (IBS). As has been
recently demonstrated [2], IBS sharing can be used to estimate the
contribution to phenotype of genotyped SNPs and SNPs in linkage
disequilibrium (LD) to genotyped SNPs. IBD sharing estimates the total
narrow sense heritability of the phenotype, that is, the additive
contribution of all SNPs. Since both are readily available in these
data we are able to compare the IBD based estimates to the IBS based
estimates and show that for all phenotypes examined the majority of
heritability is well captured by the genotyped SNPs. We conclude with
a discussion of confounding effects in mixed model estimates of
heritability such as those provided by GCTA, showing that population
structure can lead to biased estimates of heritability even when
corrected for by principal component adjustments. 1 Kong et al.
Parental origin of sequence variants associated with complex diseases.
Nature 2009. 2 Yang el al. Genome partitioning of genetic variation
for complex traits using common SNPs. Nat Genet 2011.
Tuesday, September 27, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided

Kimberly Glass
Postdoc with GC Yuan & John Quackenbush

Passing Messages Between Data Types to Refine Predicted Network Interactions

Gene regulatory network reconstruction is a fundamental question in computational biology. Although it is recognized that there are significant limitations when using individual datasets to reconstruct a regulatory network, it remains a major challenge to integrate multiple data sources effectively. We propose a message-passing model that utilizes multiple sources of regulatory information to predict regulatory relationships. A major advantage of our method is that the local connections are iteratively refined by exchanging information within each gene’s functional neighborhood, thus increasing both numerical efficiency and biological accuracy. Using the yeast as a model system, we are able to demonstrate that this integrative method is able to more accurately reconstruct the regulatory network than other widely-accepted network reconstruction algorithms. In addition, our method is amenable to the future inclusion of data reflecting many different types of regulatory mechanisms, including protein complexes and protein-protein interactions, the binding of transcription factors to promoters, and epigenetic marks residing in the control regions of genes.

2010-2011 Working Groups

Tuesday, May 3, 2011
12:30-2:00, Building 2-Room 426

Sarah Fortune, MD
Assistant Professor of Immunology and Infectious Diseases
Department of Immunology and Infectious Diseases

Time for change: Using whole genome sequencing data to define the mutation rate of Mycobacterium tuberculosis during the course of infection

Mycobacterium tuberculosis (Mtb) poses a global health catastrophe that has been compounded by the emergence of highly drug resistant Mtb strains. In Mtb, all drug resistances are the result of chromosomal mutations and depend on the bacterium’s capacity for mutation during the course of infection. We have used whole genome sequencing of bacterial isolates to derive estimates of the mutation rate of Mtb in the infected host in order to better understand the occurrence of drug resistance. Our data suggest that Mtb acquires mutations at roughly the same rate over time, irrespective of the organism’s replicative state. However, the mutation spectrum in Mtb strains from animals with active and latent disease reflects different mutational pressures in different disease states. From these data, we are developing a mutational clock for Mtb in the infected host, which we are validating through WGS of human isolates from defined transmission chains.

Tuesday, April 5, 2011
12:30-2:00, Building 2-Room 426

Melissa Merritt, Ph.D.
Postdoctoral Research Fellow
Department of Epidemiology, Harvard School of Public Health
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute

Cell-of-origin effect on tubo-ovarian tumor phenotype

The examination of gynecologic tissues with putative precursor lesions for ovarian tumors in clinical samples has identified a variety of suggested precursors for ovarian cancer. These observational studies have guided us to design experiments to address whether cells-of-origin influence the phenotype of tubo-ovarian cancer. To do this, we established paired normal human ovarian and fallopian tube epithelial cells in culture from two donors without cancer, and induced carcinogenic transformation of cells by the sequential introduction of hTERT, SV40 and H-Ras. The gene expression profiles of the normal and transformed cells were examined using HG-U133 Plus 2 microarrays (Affymetrix), which revealed a ‘cell-of-origin’ gene signature that distinguished ovarian (OV) vs fallopian tube (FT) epithelial cells from the same patient. Among the most significant genes we selected a subset that were over-expressed in either OV or FT origin cultured cells and used these to identify two subpopulations (OV-like versus FT-like) among human ovarian tumors in two publically available gene expression datasets. In this analysis FT-like tumors were predominantly composed of serous high grade cancers and were associated with significantly shorter disease-free survival. In contrast, OV-like tumors were associated with a better prognosis. We have previously demonstrated that cell-of-origin plays a role in determining the phenotype of breast tumors. These results provide further support for the hypothesis that cell-of-origin strongly influences tumor phenotype. We suggest that the most aggressive subtype of tubo-ovarian cancers may originate in the fallopian tube. The results from these studies may allow a cell-of-origin based improved molecular classification of ovarian cancers and a better understanding of the pathogenesis of mullerian tumors.

Tuesday, March 1, 2011
12:30-2:00, Building 2-Room 426

Monica Ter-Minassian, Sc.D.
Post doctoral Fellow at HSPH (Environmental Health Dept.) and DFCI (Medical Oncology Dept.)

Genetic variability in tobacco-specific nitrosamine NNK to NNAL metabolism

The nitrosamine NNK, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone, is known to be one of the most potent tobacco carcinogens, particularly for lung adenocarcinoma. Recently, NNK urinary metabolites, total NNAL, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol and its glucuronides, have been shown to be good predictors of lung cancer risk, years prior to diagnosis. We sought to determine if several genetic polymorphisms significantly contributed to the wide range of inter-individual variability observed in total NNAL output. The study subjects were derived from the Harvard/ Massachusetts General Hospital Lung cancer case-control study. We analyzed 87 self-described smokers (35 lung cancer cases and 52 controls), with urine samples collected at time of diagnosis (1992-1996). We tested 77 tagging SNPs in 16 genes related to the metabolism of NNK to total NNAL derived from prior GWAS genotyping on these subjects. Using a weighted case status least squares regression, we tested for the association of each SNP with square-root (sqrt) transformed total NNAL (pmol per mg creatinine), controlling for age, sex, sqrt packyears and sqrt nicotine (ng per mg creatinine). Three HSD11B1 SNPs and AKR1C4 rs7083869 were significantly associated with decreasing total NNAL levels. HSD11B1 and AKR1C4 enzymes are carbonyl reductases directly involved in the single step reduction of NNK to NNAL. The HSD11B1 SNPs may be correlated with the functional variant rs13306401 and the AKR1C4 SNP is correlated with the enzyme activity reducing variant rs17134592, L311V.

Nicola Segata, Ph.D.
Post Doctoral Fellow
Department of Biostatistics , HSPH

Metagenomic Biomarker Discovery and the Human Microbiome

Metagenomics has provided a new avenue for biomarker discovery, since changes in the composition and functional activity of microbial communities can provide insight into the ecological differences among communities or provide diagnostic or prognostic power when applied to the human microbiome. We propose the LDA Effect Size (LEfSe) algorithm to discover and explain microbial and functional biomarkers in the human microbiota and other microbiomes. LEfSe determines the features (organisms, clades, OTUs, genes, or functions) most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and biological relevance. We demonstrate this method to be effective in mining human microbiomes for metagenomic biomarkers associated with mucosal tissues and with different levels of aerobic metabolism. Similarly, when applied to 16S rRNA gene data describing a murine ulcerative colitis gut community, LEfSe confirms the key role played by Bifidobacterium and suggests the involvement of additional clades including Metascardovia. Finally, we provide characterizations of microbial functional activity from metagenomic community sequencing, comparing environmental bacterial and viral microbiomes. A comparison of LEfSe with existing microbial biomarker discovery methods and with standard statistical approaches (including an evaluation incorporating synthetic data) highlights a lower false discovery rate, consistent ranking of biomarkers relevance, and concise representations of taxonomic and functional shifts in microbial communities associated with environmental conditions or disease phenotypes.

Tuesday, February 1, 2011
12:30-2:00, Building 2-Room 426

Gabriel Altshuler
Post Doctoral Associate, HSPH

Pathway Fingerprinting; A Functional Framework for Multi-platform Integration

No objective, broadly applicable approach exists to compare molecular profiles, yet public repositories of gene expression arrays, RNASeq, GWAS, CAGE, epigenetic marks and ChIPseq data represent major resources for discovery. There is a pressing need for an objective, biologically-interpretable, functional abstraction of the comprehensive sample space of cellular genome, transcriptome and regulome data to compare these experiments within a consistent global framework. Pathway profiling significantly out-performs gene-based comparisons for cross-platform analysis by exploiting the fact that molecular profiles reflect activity of genes acting in concert across pathways. We have developed the ‘pathway fingerprint’, a mapping of any molecular profile to a fixed set of known pathways. Each fingerprint generated is standardized relative to the full public data corpus so they are directly comparable to any other, ensuring broad applicability. The method has been successfully piloted for the analysis of expression data. We performed a pathway fingerprint meta-analysis to establish and verify the stem cell pluripotency pathway signature. This was used to classify cell types and successfully identify additional pluripotent arrays from the GEO expression database. Moving away from small-scale comparisons and towards a literature-wide study of pluripotency has offered a means to resolve an ongoing debate over the pluripotent potential of testis-derived stem cells. We are now in the process of expanding the utility of the pathway fingerprint to GWAS, RNASeq, and eipgenetic data to provide a unified method for functional integration within and between these experimental platforms.

Levi Waldron
Research Fellow, HSPH & DFCI

Whole-genome expression profiling of degraded tumor RNA from the NHS/HPFS colorectal cancer cohorts

Over 20 million formalin-fixed, paraffin-embedded (FFPE) tissue blocks per year are routinely stored for cancer patients in the United States, and in particular, are available for cancer patients from long-term epidemiological studies including the Nurses Health Study and Health Professionals Follow-up Study (NHS/HPFS). Traditional microarray technology requires high-quality RNA from fresh-frozen tissues, and are not appropriate for gene expression profiling of these tissues. The recently introduced cDNA-mediated Annealing, Selection, extension, and Ligation (DASL) microarray assay by Illumina enables whole-genome expression profiling from degraded, clinical FFPE tissues. We present a case study involving more than 1,000 patients from the NHS/HPFS colorectal cancer cohort, demonstrating an end-to-end pipeline for quality control, normalization, and application of the data to tumor subtyping and differential gene expression. We conclude that this technology differs in important ways from the traditional microarray, discuss some of the challenges and opportunities associated with its adoption, and present initial results for the subtyping of colorectal cancer.

Tuesday, December 7, 2010
12:30-2:00, Building 2-Room 426

Elaine Hoffman, Ph.D.
Research Scientist
Department of Biostatistics
Harvard School of Public Health

Applications of Path Analysis & Structural Equation Models in Environmental Health Studies

Path analysis and structural equation models will be presented in the context of environmental health. The Bangladesh-Arsenic project, Italian-Manganese project, and some of my other projects will be used as examples. The statistical application of path analysis and structural equation models and some of the obstacles I have encountered also will be discussed. Path analysis is a statistical model that can account for complex relationships among variables and correlated variables. It is often used when there are suspected causal relationships. Structural equation models are a class of covariance structure models that simultaneously model multiple surrogates of both exposure and outcome. Both path and structural equation models are often shown as path diagrams. These models have been extensively used in the social sciences, and more recently are beginning to be used in environmental epidemiology. This presentation is a Work-in-Progress, with many of the statistical ideas discussed not having been published or completed yet. Software packages will be discussed.

Tuesday, November 2, 2010
12:30-2:00, Building 2-Room 426

Benjamin Haibe-Kains, Ph.D.
Research Fellow
Department of Biostatistics & Computational Biology
Dana-Farber Cancer Institute

Breast Cancer Molecular Subtypes: A Three-gene Model for a Translation Into Clinic

Background: Gene expression studies have well established that breast cancer (BC), in addition to being clinically diverse, is also a molecular heterogeneous disease. The early studies classified BC into at least three clinically relevant molecular subtypes: basal-like, HER2-enriched, and luminal tumors, with each subtype exhibiting different prognosis and response to therapies. Demonstration of the molecular heterogeneity within BC has changed the way clinicians perceive the disease and has a dramatic impact on the design of new clinical trials.

During the last decade, several methods have been proposed to classify breast cancer into their corresponding molecular subtype using gene expression. Three versions of a “Single Sample Predicton” (SSP), based on a hierarchical clustering and a nearest centroid classifier using different sets of “intrinsic” genes, have been proposed; however this method of classification has been shown to be highly unreliable by many investigators. We have previously reported a “Subtype Clustering Model” (SCM), which uses a mixture of three Gaussians with ER, HER2 and proliferation-related gene modules, to estimate probabilities of belonging to each BC molecular subtype. This model has improved classification stability; however such a method is not so easily clinically implemented.

In this work we developed a novel SCM-based 3-gene classifier for molecular subtyping and we evaluated its concordance, robustness and prognostic value compared with five existing classifiers.

Materials and Methods: We refined the SCM to its simplest form, a classification model that uses only the 3 genes reported to be the main discriminators of the molecular subtypes: ER, HER2 and AURKA (proliferation). We evaluated the concordance and robustness of five previously described classifiers- three SSPs and two SCMs- and the new 3-gene SCM, using gene expression and clinical data from a large compendium of publicly available BC datasets comprising 5,113 primary breast tumors. Clinical relevance was determined from survival analysis of a subset consisting of 1,318 untreated node-negative patients.

Results: SCM-based classifiers, including our 3-gene model, were significantly more robust than all the SSPs (prediction strength > 0.8, p < 0.001); notably, the 3-gene SCM was the most robust classifier for molecular subtypes. Although all models were concordant (Cramer’s V = 0.54-0.81, p < 0.001), with basal-like subtype being particularly well defined (median Cramer’s V of 0.8), SCMs yielded stronger concordance than SSP models. Overall, SCMs were also more consistent with traditional clinical variables (ER, HER2 status by IHC/FISH and histological grade). All classifications yielded significant and independent prognostic value.

Conclusions: We found significant disparities in robustness of BC molecular subtype classification models. SCMs outperformed SSPs and consistently identified molecular subtypes in numerous datasets derived using various microarray technologies and conducted by different laboratories. Compared with existing models, we propose that a 3-gene SCM-based model is the most reliable and its simplicity could be a significant step towards translation of BC molecular subtyping into the clinic.

Keywords: molecular subtypes, gene expression, clustering, classification, robustness, concordance, prognosis

Tuesday, October 5, 2010
12:30-2:00, Building 2-Room 426

Pinaki Sarder, Ph.D
Research Fellow
Department of Biostatistics
Harvard School of Public Health

Functional understanding of microbial communities using experimental data integration

To understand the functional and metabolic activities of microbes and microbial communities, it is critical to link genes and proteins to their biological roles. This encompasses both their biochemical activities and the processes and pathways in which they are used by the cell. This problem is typically approached by transferring knowledge to newly sequenced genomes by relying on sequence similarity. This can be a difficult process involving sparse knowledge and the propagation of error, and even the best-studied organisms’ genomes are only partially characterized. To mitigate these issues, we have developed a data integration method TafTan leveraging all experimental results available from multiple model systems to identify potential functional roles for genes in a new organism. The performance of TaFTan’s genome-wide functional network prediction was evaluated using ~300 experimental datasets from 20 model organisms. This evaluation study demonstrated that TaFTan is able to significantly improve individual organisms’ inferred functional networks by transferring knowledge from other experimentally characterized systems.

September 21, 2010
12:30-2:00, Building 2-Room 426 (Biostats Conference Room)

X. Shirley Liu, Ph.D.
Associate Professor of Biostatistics
Departments of Biostatistics and Computational Biology
Harvard School of Public Health & Dana-Farber Cancer Institute

Computational Genomics of Gene Regulation

High throughput genomics technologies brought a paradigm shift to gene regulation studies, but they also created challenges on data analysis. In this talk, I will highlight two studies conducted in my lab to show how computational and statistical algorithms could help remove the noise in the data, provide informative results, and help design efficient experiments. One study is a model-based analysis of tiling arrays for ChIP-chip peak calling, and the other is using dynamics of H3K4me2 nucleosomes to infer the in vivo transcription factors and their binding sites driving a biological process.

2009-2010 Working Groups

Tuesday, April 27, 2010
12:30-2:00 PM

Robert Wright, M.D., M.P.H.
Associate Professor
Department of Environmental Health, HSPH
Department of Pediatrics, HMS

A Framework for Measuring Gene-Environment Interactions in Children

Will discuss issues of development unique to children that will improve estimates of gene environment interaction including child-specific pitfalls to case control and family-based association designs.
Tuesday, March 23, 2010
12:30-2:00 PM
Kresge 201

Marianne Wessling-Resnick, Ph.D.
Professor of Nutritional Biochemistry
Dept of Molecular Metabolism
Director of the PhD Program in Biological Sciences in Public Health
Harvard School of Public Health

Chemical Genetics of Iron Transport

Chemical genetics is an emerging field that takes advantage of combinatorial chemical and small molecule libraries to dissect complex biological processes. Small molecules can act very fast, can be very specific, and can help to distinguish the temporal order of molecular steps and the hierarchical regulation of biological processes. Because small molecules can alter the function of a specific gene product, they can be used in a manner analogous to the use of inducible dominant or homozygous recessive genetic mutations. A large body of biochemical literature is based on the past use of small molecule antagonists that were employed in “reverse chemical genetics” approaches to conditionally eliminate protein function, and on that basis to subsequently identify the target, its mechanism of action, and its regulation. Thus, ouabain helped to define the catalytic cycle of the NaK-ATPase, cytochalasin B was instrumental in defining the molecular basis for insulin’s action to stimulate glucose uptake, and analogs of amiloride were used to purify and define the epithelial Na channel. There is a need to develop “forward chemical genetics” in order to discover small molecules that partner with key elements in a pathway of interest. Our goals are to discover small molecule inhibitors of iron transport using chemical genetics and to use these reagents to advance our understanding of the factors, mechanisms, and regulation of different pathways of iron metabolism.

Patrick Loerch
Research Fellow, Computational Biology

Using Networks to Integrate Diverse -omics Datasets and Identify Disease Pathways

The dramatic increase in the application of omics-based platforms (microarrays, GWAS, proteomics, deep sequencing, metabolomics, etc) in biological research has resulted in the generation of vast public and
private data repositories spanning a wide array of diseases, tissues and species. The challenge facing researchers today, commonly referred to as integrative genomics, is figuring out how to integrate, analyze and interpret all of this information within the context of a well-defined biological question. One biological question of particular importance is the identification of genes/proteins that contribute to, and/or are altered as a result of, the onset of a specific disease. As opposed to the gene-centric approach to integrative genomics, which looks for genes that are associated with a disease across multiple omics platforms, we have developed a pathway-centric approach. We hypothesize that the onset of disease involves the altered regulation of specific pathways, which can be triggered by any number of genes or proteins, so long as the end result is the same. Working under this hypothesis, we have developed a network-based approach to integrating various omics datasets with the aim of identifying disease pathways. This approach also allows us to take into account a number of biological realities, such as the regulation of pathway members at various states (from transcription to translation) and the fact that the regulation of some pathway members will simply not be observable through omics technologies. Here we will present the development of this methodology within the context of identifying disease pathways, discuss a specific application/validation of the method, and describe ongoing efforts to further refine this approach.

Tuesday, February 23, 2010
12:30-2:00 PM
Kresge 201

Curtis Huttenhower, Ph.D
Assistant Professor of Computational Biology and Bioinformatics
Department of Biostatistics
Harvard School of Public Health

Scalable Data Mining for Functional Genomics and Metagenomics

The average human body contains over ten times as many microbial cells as “human” cells. These microbial communities are usually beneficial, but their dysfunction has been linked to conditions ranging from obesity to antibiotic resistant infections. The recent dramatic reduction in the cost of DNA sequencing has opened up several exciting new ways in which we can explore how the human microflora vary across populations and how they can be manipulated to improve human health.

Biological network integration and mining algorithms provide a means of assembling the entire body of cultured microbial genomic data, understanding it from a systems level, and applying it to the study of uncharacterized species and communities. We compare unsupervised and supervised Bayesian approaches to biological network integration; this process provides maps of functional activity and genomewide interactomes in over 100 areas of cellular biology, using information from ~5,000 genome-scale experiments pertaining to 13 microbial species. In combination with graph alignment, these network manipulation tools provide a means for analyzing the functional activity unique to particular pathogens, transferring putative functional annotations to uncharacterized organisms, and potentially inferring interactomes using weighted network integration for metagenomic communities.

Ed Silverman, M.D., Ph.D.

Associate Professor of Medicine
Harvard Medical School
Associate Physician, Brigham and Women’s Hospital

Genetic Epidemiology of Chronic Obstructive Pulmonary Disease

Genetic factors are likely important determinants of chronic obstructive pulmonary disease (COPD) susceptibility. Genome-wide association analysis has recently been performed in COPD, and several regions of highly significant association on chromosomes 15 and 4 have been identified. Because COPD is a heterogeneous disease, genetic studies have focused on the identification of distinct subgroups of COPD subjects (COPD subtypes) as well as disease-related conditions (COPD-related phenotypes). Genetic association studies combined with chest CT scan analysis have the potential to lead to substantial new insights into COPD pathogenesis, which could provide important pathways to develop new treatments for COPD.

Tuesday, January 26, 2010
12:30-2:00 PM
Kresge 201

John Quackenbush
Professor of Computational Biology and Bioinformatics

Network and State Space Models: Science and Science Fiction Approaches to Cell Fate Predictions

Two trends are driving innovation and discovery in biological sciences: technologies that allow holistic surveys of genes, proteins, and metabolites and a realization that biological processes are driven by complex networks of interacting biological molecules. However, there is a gap between the gene lists emerging from genome sequencing projects and the network diagrams that are essential if we are to understand the link between genotype and phenotype. ‘Omic technologies such as DNA microarrays were once heralded as providing a window into those networks, but so far their success has been limited, in large part because the high-dimensional they produce cannot be fully constrained by the limited number of measurements and in part because the data themselves represent only a small part of the complete story. To circumvent these limitations, we have developed methods that combine ‘omic data with other sources of information in an effort to leverage, more completely, the compendium of information that we have been able to amass. Here we will present a number of approaches we have developed, including an integrated database that collects clinical, research, and public domain data and synthesizes it to drive discovery and an application of seeded Bayesian Network analysis applied to gene expression data that deduces predictive models of network response. Looking forward, we will examine more abstract state-space models that may have potential to lead us to a more general predictive, theoretical biology.

Miguel Camargo
Merck Reserach Laboratories

“Pathway based analysis of whole genome siRNA screens and de novo pathway identification”

A prevailing model of Alzheimer’s Disease etiology is the progressive aggregation of toxic Ab (Abeta) peptides in the brain. Ab is produced as a result of proteolytic processing of the amyloid precursor protein (APP). Cleavage of APP by beta- and gamma-secretases result in the production of Ab and hence several drug discovery efforts are aimed at
finding either beta- or gamma-secretase inhibitors. However, development of small molecules to either of these enzymes has proven to be challenging. In order to overcome limitations of developing b or g-secretase inhibitors, we performed large scale siRNA screens in order to identify novel regulators of APP processing that could represent more tractable drug targets. The screen measures the production of four APP proteolytic products: the non-amyloidgenic peptide (sAPPa) or the amyloidgenic pathway peptides (Ab40, Ab42, and sAPPa). We introduce a
novel analysis method that scores the overall effect of individual pathways on the processing of the APP protein. This method takes into account all genes in the pathway, thus allowing for small effects to be considered, and introduces the concept of scoring ‘pathways’ as opposed to individual genes as a way of mitigating against false positive hits.
Using this method, we identified novel and distinct pathways that regulate processing of APP into either amyloidgenic peptides or non-amyloidgeneic peptides respectively. We will also highlight how to leverage biological network data in combination with siRNA screens to identify novel pathways.

Tuesday, December 15, 2009
12:30-2:00 PM
Biostatistics Conference Room (Building 2, Room 426)

Oliver Hofmann, Ph.D.
Research Associate in Biostatistics

Combining curated and in silico interaction data in network analysis

Integrating biological data obtained from multiple high throughput platforms is an area of active research. While it is possible to normalize for technical differences, laboratory effects and other artifacts the problem of merging data from different biological samples is still mostly unsolved. Standard methods include rank-based analysis of biological features (mRNA abundance, peptide counts etc.) rather than absolute measurements as well as higher levels of abstraction such as Gene Set Enrichment Analysis or the identification of Metagenes. By comparing biological systems at the network level an additional layer of data abstraction is added, allowing for the identificaton of connected network areas contributing to a phenotype of interest in heterogenous samples or between studies. Three different examples highlight current limitations of data availability and quality, identifying possible future methods for even higher levels of data abstraction to analyze complex systems.

Stalo Karageorgi, M.S.
Doctoral Student in Environmental Health

Polymorphisms in HSD17b2 and HSD17b4 and endometrial cancer risk

Hydroxysteroid dehydogenase 17b (HSD17b) genes encode for enzymes that control the last step in estrogen biosynthesis. The isoenzymes HSD17b2 and HSD17b4 in the uterus preferentially catalyze the conversion of estradiol, the most potent and active form of estrogen, to estrone, the inactive form of estrogen. Endometrial carcinoma is a disease strongly linked to the imbalance between the hormones estrogen and progesterone. We hypothesized that variation in single nucleotide polymorphisms (SNPs) in genes HSD17b2 and HSD17b4 may alter the enzyme activity, estradiol levels and risk of disease. Pairwise tagging SNPs were selected from the HapMap CEU database to capture all known common (MAF >0.05) genetic variation in the gene region with an r2 of at least 0.8. SNPs were genotyped in participants in the nested case-control studies in the Nurses’ Health Study (NHS) (cases=544, controls=1296) and the Womens’ Health Study (WHS) (cases=130, controls=389), who provided a blood or cheek cell sample. The association between SNPs and endometrial cancer was examined using conditional logistic regression to estimate odds ratio and 95% confidence intervals adjusted for known risk factors. We additionally investigated whether SNPs are predictive of plasma estradiol and estrone levels in the NHS using linear regression. This is the first study to report on genetic variation in HSD17b2 and HSD17b4 in relation to endometrial cancer.

Tuesday, November 17, 2009
12:30-2:00 PM
Biostatistics Conference Room (Building 2, Room 426)

Zhaoxi (Michael) Wang, M.D., Ph.D.
Research Scientist, Environmental & Occupational Medicine and Epidemiology Program

“Mitochondrial Variations in NSCLC by Microarray-based Resequencing”

Mutations in human mitochondrial genome (mtDNA) genome have long been suspected to play an important role in the development of cancer. Although most cancer cells harbor mtDNA mutations, the question of whether such mutations are associated with clinical prognosis of cancer remains unclear. In this study, we resequenced the entire mitochondrial genomes of tumor tissue from a population of 249 Korean non-small cell lung cancer (NSCLC) patients using the Affymetrix GeneChips Human Mitochondrial Resequencing Array 2.0 (Santa Clara, CA). In early stage (stage I/ II) NSCLC, patients with the haplogroup D4 had the worst clinical prognosis. Interestingly, haplogroup D4 was previous reported as a marker for extreme longevity in Japanese.

Alkes Price
Assistant Professor of Statistical Genetics, HSPH

“Effects of Cis and Trans Family Heritability on Single-tissue and Cross-tissue Gene Expression Regulation”

Family heritability is a useful approach for understanding the genetic basis of gene expression. Heritability analyses can evaluate the contribution of both cis and trans regulation by considering genetic relatedness either genome-wide (trans), or at the genomic location proximal to the expressed gene (cis). We used gene expression data from blood and adipose tissue cohorts to estimate the contribution of cis and trans regulation to heritable variation in gene expression. We estimate that cis regulation contributes 45±7% of heritability in blood expression and 30±3% of heritability in adipose tissue expression, with the difference entirely attributable to greater trans effects in adipose tissue. We also conducted a cross-tissue analysis to investigate regulation that is shared across tissues. Strikingly, we observed that cross-tissue regulation is dominated by cis effects. These analyses point to a greater contribution for cis regulation than previous admixture-based analyses. This divergence would be consistent with a substantial role for epigenetic regulation, whose effects are included in heritability analyses but excluded in admixture analyses. Our results have implications for understanding the causes of “missing heritability” in genetic association studies.

Tuesday, September 22, 2009 1 2:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room

Meet & Greet, Monica Ter-Minassian practice talk on Genetic Risk Factors of Neuroendocrine Tumor