Each year, the PQG organizes a less formal PQG Working Group Series for all local students, postdocs, and faculty. The goal is to provide the opportunity to present and participate in the discussion of works-in-progress, and to focus on the methods and analysis of high-dimensional data in genetics and genomics.
2023-2024
Dana-Farber Cancer Institute
Feb 13, 2024
Jennifer Chen
Research Associate, Department of Molecular
and Cellular Biology
Harvard University
The cellular evolution underlying variation in innate social behavior across Peromyscus deer mice
Understanding how genes can ultimately code for social behavior is an open question in biology. Variation in innate social behaviors across species offers an opportunity to use a comparative approach to identify the genetic building blocks of behavioral traits. Here, I investigate the variation of a sexually dimorphic social behavior – parental care – across Peromyscus wild mice. Parental care is carried out by only female animals in promiscuous species, but is shared across the sexes in monogamous species of Peromyscus mice. Using single-cell gene expression sequencing, I identify how differences in the cellular and molecular composition of the hypothalamus contribute to vast differences in social behavior of two Peromyscus species. I identify multiple cell types with abundance differences across species, including two cell types previously implicated in parental care. Furthermore, I find that the monogamous species has significantly fewer sexually dimorphic genes compared to promiscuous species, suggesting that changes in sex-biased expression may underlie evolution of sexually dimorphic behaviors. Finally, I find the changes across species are highly enriched to involve neuropeptidergic genes. Together, my data suggest the evolution of social behaviors involve changes in both cell abundance and sex-biased expression, with neuropeptide regulation likely playing a crucial role.
December 12, 2023
Kaia Mattioli
Postdoctoral Research Fellow
Brigham and Women’s Hospital
Widespread variation in molecular interactions and regulatory properties among transcription factor isoforms
Transcription factors (TFs) control gene expression by interacting with DNA and cofactors to regulate transcription. Human TF genes produce multiple protein isoforms with altered DNA binding domains, effector domains, and other protein regions. The global extent to which this results in functional differences between TF isoforms of the same gene remains unknown. Here, we systematically tested 756 isoforms of 309 TF genes, assessing DNA binding, protein binding, transcriptional activation, subcellular localization, and condensate formation. Compared to the reference isoform, two-thirds of alternative TF isoforms exhibit differences in their molecular activities, which often cannot be predicted from sequence alone. We observed two primary categories of alternative TF isoforms: “rewirers” and “negative regulators”, both of which are associated with differentiation and cancer. Our results support a model wherein the relative expression levels of and interactions between TF isoforms add an understudied layer of complexity to gene regulatory networks, demonstrating the importance of isoform-aware characterization of TF functions and providing a rich resource for further studies.
November 14, 2023
Eric Van Buren
Postdoctoral Fellow in Biostatistics & Statistical Genetics
Harvard T.H. Chan School of Public Health
cellSTAAR: Incorporating single-cell-sequencing-based functional data to boost power in rare variant association testing of non-coding regions
Whole genome sequencing studies have identified hundreds of millions of rare variants, the majority of which are in non-coding regions and of unknown function. Because the regulatory landscape of many candidate Cis-Regulatory Elements (cCREs) varies across cell types, it is of substantial interest to incorporate single-cell sequencing data into RVATs of cCREs to capture the functional variability that exists across cell types in the non-coding genome and boost statistical power in the process. We propose cellSTAAR to address two opportunities to improve existing gene-centric RVAT methods as applied to genetic variants in cCREs. First, cellSTAAR integrates single-cell ATAC-seq data to capture variability in chromatin accessibility across cell types via the construction of cell-type-specific variant sets and cell-type-specific functional annotations. Second, cellSTAAR links cCREs to their target genes using an omnibus framework that aggregates results from a variety of linking approaches to reflect the uncertainty in element-gene linking. We applied cellSTAAR on Freeze 8 (N = 60,000) of the NHLBI Trans-Omics for Precision Medicine (TOPMed) consortium whole genome sequencing data to three quantitative lipids traits: LDL, HDL, and TG. In at least one cell type, genome-wide significant promoter and enhancer associations were found in several known lipids loci, including APOE, APOA1, and CETP. Unlike existing methods, cellSTAAR reveals variability in the significance at these loci across a variety of cell types from diverse tissues and uncertainty in the target gene for significant enhancers. Using a weakened genome-wide significance threshold, the most discoveries using cellSTAAR are found in cell types that are the most relevant to lipids such hepatocytes. We replicate our results using UK Biobank whole genome sequcing data (N = 190,000).
October 10, 2023
Rongbin Zheng
Research Fellow in Pediatrics
Boston Children’s Hospital
Computational Models on Metabolite-mediated Intercellular Communication
2022-2023
May 9, 2023
Analytic and Translational Genetics Unit
HMS, MGH, Broad Institute
April 4, 2023
Karl Tayeb
PhD Candidate
University of Chicago
March 28, 2023
Analytic and Translational Genetics Unit
HMS, MGH, Broad Institute
February 14, 2023
Saori Sakaue
Postdoctoral Research Fellow
HMS
After 20 years of genome-wide association study (GWAS), rarely have we identified causal variants or genes. Translating GWAS loci into causal variants and genes requires accurate cell-type-specific enhancer-gene maps from disease-relevant tissues. Building enhancer-gene maps is essential but challenging with current experimental methods in primary human tissues. We developed a new non-parametric statistical method, SCENT (Single-Cell ENhancer Target gene mapping) which models association between enhancer chromatin accessibility and gene expression in single-cell multimodal RNA-seq and ATAC-seq data. We applied SCENT to 9 multimodal datasets including > 120,000 single cells and created 23 cell-type-specific enhancer-gene maps. These maps were highly enriched for causal variants in eQTLs and GWAS for 1,143 diseases and traits, which outperformed previous bulk-tissue based enhancer map (e.g., EpiMap) and single-cell based method (e.g., Signac). We identified novel likely causal genes for both common and rare diseases. In addition, we were able to link somatic mutation hotspots to target genes. We demonstrate that application of SCENT to multimodal data from disease-relevant human tissue enables the scalable construction of accurate cell-type-specific enhancer-gene maps, essential for defining variant function.
December 13, 2022
Johns Hopkins University
Investigating dynamic genetic regulation in diverse cell types and differentiation trajectories
Large-scale efforts to identify genetic variants associated with nearby gene expression levels (cis expression quantitative trait loci, or cis-eQTLs) have focused primarily on healthy adult tissues, potentially overlooking effects specific to transient cell states such as those arising during development or environmental exposures. The current limitations in identifying target genes and mechanisms underlying disease-associated genetic loci may arise partially due to the scarcity of data from developmental and dynamic contexts in disease-relevant cell types. By integrating established stem cell model systems with recent advances in single-cell RNA-sequencing technologies, we are now able to study these transient contexts at high resolution, bringing a clearer picture of gene regulatory dynamics into view. We first studied directed differentiation of induced pluripotent stem cells (iPSCs) into cardiomyocytes, where single-cell analysis enabled us to resolve bifurcating trajectories displaying distinct regulatory dynamics. Now, we are branching out to a wider array of trajectories in an efficient, unified experimental framework based on embryoid bodies. Embryoid bodies are three-dimensional aggregates of iPSCs that spontaneously differentiate into a wide variety of cell types, including derivatives of all three germ layers. We generated embryoid bodies from iPSCs from 53 donors, and collected single cell RNA-sequencing data for over 900,000 cells in total. We applied trajectory inference to characterize the complex, multifurcating differentiation landscape and annotated a wide array of cell types, many of which resemble those arising during human embryonic development. We found that different cellular lineages display dynamic expression of numerous genes, including gene modules relevant to complex traits such as schizophrenia, hypertension, and BMI. We identified hundreds of genetic regulatory effects including cell-type specific and dynamic eQTLs in various trajectories. Next, we used unsupervised machine learning to infer patterns of shared regulatory activity within and between differentiation trajectories to characterize how regulation varies between cellular contexts. This work will expand identification of genetic effects to transient cell states arising during differentiation in diverse lineages and improve attribution of regulatory mechanisms to disease risk loci.
November 8, 2022
Martin Zhang
Postdoctoral Research Fellow
Harvard T.H. Chan School of Public Health
Quantifying and partitioning SNP effect correlation across UK Biobank traits
While traditional heritability estimation methods such as LD-score regression (Bulik-Sullivan et al. 2015 Nat Genet) assume independent SNP effects on the trait, recent works (Schoech et al. 2020 bioRxiv) determined that nearby SNPs (e.g., <100bp) have negatively correlated effects. Yet, it is unclear if the SNP effect correlation exists for other types of SNP pairs. We developed a new method, gene-level directional effect regression (GDREG), that analyzes summary association statistics and in-sample LD information to jointly estimate average SNP effect correlation and per-SNP heritability, stratified by MAF and functional category; we define SNP effect correlation based on standardized minor allele causal effect sizes of SNPs. We performed extensive simulations to confirm that GDREG obtains unbiased estimates of average SNP effect correlation under a range of different genetic architectures.
We applied GDREG to the UK Biobank imputed data across 32 independent diseases and complex traits (N=334K unrelated British-ancestry samples). We determined that SNP pairs within 100bp are significantly negatively correlated (non-significant for common pairs, -0.27±0.04 for low-freq SNP pairs), recapitulating results in previous works. Furthermore, we detected a significant negative correlation for gene-level SNP pairs, including SNP pairs on the same exon (non-significant for common SNP pairs, -0.04±0.01 for low-freq SNP pairs) and SNP pairs on the promoter of the same gene (-0.04±0.01 for common pairs, -0.21±0.03 for low-freq pairs). The negative correlation of gene-level SNP pairs cannot be fully explained by their proximity, providing additional information to the proximity annotation. We hypothesize that the negative correlation is due to linkage masking, in which negatively correlated effects of positively correlated minor alleles enable the alleles to cancel each other’s effects and thus survive the action of negative selection. In addition, we determined that low-freq SNP pairs are more negatively correlated than common SNP pairs across the annotations.
HMS, MGH, Broad Institute
Challenges and opportunities for developing more generalizable polygenic risk scores
Polygenic risk scores (PRS) estimate an individual’s genetic likelihood of complex traits and diseases by aggregating information across multiple genetic variants identified from genome-wide association studies. PRS can predict a broad spectrum of diseases and have therefore been widely used in research settings. Some work has investigated their potential applications as biomarkers in preventative medicine, but significant work is still needed to definitively establish and communicate absolute risk to patients for genetic and modifiable risk factors across demographic groups. However, the biggest limitation of PRS currently is that they show poor generalizability across diverse ancestries and cohorts. Major efforts are underway through methodological development and data generation initiatives to improve their generalizability. We will discuss current progress on the development of PRS, the factors that affect their generalizability, and promising areas for improving their accuracy, portability, and implementation.
Multiclass network inference reveals distinct parameter regions in gene regulatory space
Gene expression is controlled by complex regulatory processes that allow the same genome to create and regulate the vast diversity of cellular phenotypes that exist in each individual. Although we cannot directly measure gene regulatory networks (GRNs), inferring these from available data is increasingly acknowledged as essential to understanding a wide range of biological processes. PANDA (Passing Attributes between Networks for Data Assimilation) is a method for inferring GRNs that has been used in a large number of studies to identify changes in patterns of gene regulation that correspond to changes in phenotypic state. BOA (Bayesian Optimization of Associations) extends PANDA by introducing a number of weight parameters and using a Bayesian optimization protocol to adjust the relative importance of various PANDA inputs by tuning against relevant experimental data sets. This allows BOA to infer not only GRNs (based on benchmarking against ChIP-seq transcription factor (TF) binding data), but also Transcriptional Control Networks (TCNs) that model effects of TF knock-down to provide insight into cascading regulatory responses, and Functional Binding Networks (FBNs; a combination of GRNs and TCNs) that are optimized against a combination of ChIP-seq and TF knock-down data. We demonstrate the utility of BOA by constructing GRNs, TCNs, and FBNs for cell lines treated with 259 drugs and analyze network structure to identify altered regulatory patterns and the effects of perturbative compounds. BOA is available at https://github.com/netZoo/netZooM; the gene regulatory network database (GRAND; https://grand.networkmedicine.org) includes the cell line drug benchmark data and the networks generated by BOA.
2021-2022
May 10, 2022
Gang Li
Data Science Postdoctoral Fellow
University of Washington
Ayshwarya Subramanian
Computational Biologist
Broad Institute
Comparative transcriptomics of mouse and human kidneys to study disease altered cell states
Mouse models are a tool for studying the mechanisms underlying complex diseases; however, differences between species pose a significant challenge for translating findings to patients. Here, we used single-cell transcriptomics and orthogonal validation approaches to provide cross-species taxonomies, identifying shared broad cell classes and unique granular cellular states, between mouse and human kidney. We generated cell atlases of the diabetic and obese kidney using two different mouse models, a high-fat diet (HFD) model and a genetic model (BTBR ob/ob), at multiple time points along disease progression. Importantly, we identified a previously unrecognized, expanding Trem2high macrophage population in kidneys of HFD mice that matched human TREM2high macrophages in obese patients. Taken together, our cross-species comparison highlights shared immune and metabolic cell-state changes.
March 8, 2022
Department of Biostatistics, Harvard T.H. Chan School of Public Health
Methods for risk prediction using integrative multi-ethnic genetic and genomic datasets
Polygenic risk scores (PRS) are becoming increasingly predictive of complex traits, but poorer performance in non-European populations raises concerns for clinical applications. We develop a powerful and scalable method for developing PRS using GWAS across diverse populations by combining multiple techniques, including LD-clumping, empirical-Bayes and machine learning. We evaluate the performance of the proposed method relative to a variety of alternatives using extensive simulation studies and 23andMe Inc. datasets for seven complex traits, including up to 800K individuals from non-European populations. Results show that the proposed method can substantially improve the performance of PRS in non-European populations relative to simple alternatives and can perform comparably or superior to more advanced methods that require a different order of computational time. Further, our simulation studies provide novel insight to sample size requirement and effect of SNP density on multi-ethnic polygenic prediction.
February 15, 2022
Martin Jinye Zhang
Postdoctoral Researcher, Department of Epidemiology
Harvard T.H. Chan School of Public Health
Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data
Gene expression at the individual cell-level resolution, as quantified by single-cell RNA-sequencing (scRNA-seq), can provide unique insights into the pathology and cellular origin of diseases and complex traits. Here, we introduce single-cell Disease Relevance Score (scDRS), an approach that links scRNA-seq with polygenic risk of disease at individual cell resolution; scDRS identifies individual cells that show excess expression levels for genes in a disease-specific gene set constructed from GWAS data. We determined via simulations that scDRS is well-calibrated and powerful in identifying individual cells associated to disease. We applied scDRS to GWAS data from 74 diseases and complex traits (average N =341K) in conjunction with 16 scRNA-seq data sets spanning 1.3 million cells from 31 tissues and organs. At the cell type level, scDRS broadly recapitulated known links between classical cell types and disease, and also produced novel biologically plausible findings. At the individual cell level, scDRS identified subpopulations of disease-associated cells that are not captured by existing cell type labels, including subpopulations of CD4+ T cells associated with inflammatory bowel disease, partially characterized by their effector-like states; subpopulations of hippocampal CA1 pyramidal neurons associated with schizophrenia, partially characterized by their spatial location at the proximal part of the hippocampal CA1 region; and subpopulations of hepatocytes associated with triglyceride levels, partially characterized by their higher ploidy levels. At the gene level, we determined that genes whose expression across individual cells was correlated with the scDRS score (thus reflecting co-expression with GWAS disease genes) were strongly enriched for gold-standard drug target and Mendelian disease genes.
December 14, 2021 – cancelled
November 9, 2021
Tushar Kamath
PhD Candidate, Biophysics, Broad Institute
MD Student, Harvard Medical School
Vulnerabilities of midbrain dopaminergic neurons to Parkinson’s disease revealed by single-cell genomics
The loss of some dopamine (DA) neurons within the substantia nigra pars compacta (SNpc) is a defining pathological hallmark of Parkinson’s Disease (PD). Yet, the molecular features associated with DA neuron vulnerability have not yet been fully identified. To comprehensively characterize DA neuron types in the SNpc and their relative vulnerabilities to PD, we developed a protocol to enrich and transcriptionally profile thousands of midbrain DA neurons from PD patients and matched controls. We identified 10 populations and spatially localized each within the SNpc using Slide-seq, a high-resolution spatial transcriptomics technology. A single subtype, marked by the expression of the gene AGTR1 and spatially confined to the ventral tier of SNpc, was highly susceptible to loss and showed the strongest upregulation, relative to other DA types, of targets of TP53 and NR2F2, nominating molecular processes associated with degeneration in vivo. This same vulnerable population was specifically enriched for the heritable risk associated with sporadic PD. These analyses highlight the importance of cell-intrinsic pathways in determining the differential vulnerability of DA neurons to degeneration in PD.
October 12, 2021
Tiffany Amariuta
Postdoctoral Research Fellow, Genetic Epidemiology and Statistical Genetics
Harvard T.H. Chan School of Public Health
Modeling tissue co-regulation to infer tissue-specific contributions to disease
Despite abundant evidence of disease etiologies that span multiple tissues, quantifying tissue- specific contributions to disease heritability remains challenging. Previous work emphasized the potential of accounting for tissue co-regulation (Ongen et al. 2017 Nat Genet), but tissue-specific disease effects have not been formally modeled.
We introduce a new method, tissue co-regulation score regression (TCSC), that quantifies tissue-specific contributions to disease heritability by regressing transcriptome-wide association study (TWAS) gene-disease chi-square statistics on tissue co-regulation scores, across genes and tissues. TWAS statistics include direct effects of predicted cis-genetic components of gene expression on disease and tagging effects of co-regulated tissues (Wainberg et al. 2019 Nat Genet). TCSC distinguishes best proxy causal versus tagging gene-disease effects across tissues by modeling pairwise correlations of predicted gene expression between tissues (tissue co-regulation scores). In simulations, TCSC detects causal tissues with well-calibrated false positive rate across a broad range of parameter settings. At default settings, TCSC attained substantially higher power to detect causal tissues than the Ongen et al. method. TCSC also estimates the proportion of SNP-heritability explained by each tissue; estimates are conservative, as they exclude effects of genes with non-significant gene expression heritability at finite sample size.
We applied TCSC to 82 heritable complex traits and diseases from UK Biobank (average N = 299K), using gene expression prediction models constructed from GTEx data across 49 tissues to compute TWAS statistics and co-regulation scores. Below, we discuss tissues with non-zero heritability at 10% FDR for three representative traits: waist-hip-ratio (WHR), anorexia, and height. For WHR, TCSC correctly implicates adipose (29.9% of SNP-heritability explained), as WHR reflects the corporeal distribution of fat. For anorexia, TCSC specifically implicates the brain cortex (32.9% of SNP-heritability explained) out of five central nervous system regions, consistent with extensive previous work implicating this brain region (Haye et al. 2009 Nat Rev Neuro). For height, TCSC implicates five best proxy causal tissues, where the lead tissue, fibroblasts, explains 41.8% of SNP-heritability. This is consistent with the known role of connective tissue in growth regulation. Our method also reduced the number of trait-associated tissues within a tissue category by 67% compared to LDSC-SEG (Finucane et al. 2018 Nat Genet). In conclusion, TCSC is a powerful method for quantifying tissue-specific contributions to disease heritability.
September 28, 2021
Rounak Dey
Postdoctoral Research Fellow, Biostatistics
Harvard T.H. Chan School of Public Health
Scalable and accurate mixed effects model to account for relatedness and populations structure in multi-ethnic PheWAS using sparse ancestry-adjusted genetic relatedness matrix
In genetic association studies, generalized linear mixed effects models (GLMMs) are commonly used to control for the relatedness among the samples by modelling the familial and cryptic relationships using a genetic relationship matrix (GRM). Existing GLMM methods use an empirical GRM to account for the sample-relatedness, which works well in the context of association analysis in studies with mostly homogeneous populations. However, they are not suitable to analyze recent multi-ethnic whole genome sequencing (WGS) studies with heterogeneous populations. A standard approach to adjust for the population stratification in multi-ethnic studies is to use the principal components (PCs) as fixed effects. Since the empirical GRM also contains the population structure information, using both the PCs and the empirical GRM in the model can lead to “double-fitting” of the population structure. Moreover, using the empirical GRM can lead to the mis-specification of the familial relationships due to the confounding effect of the population structure, which can potentially result in the loss of power and miscalibration of the type I error rates. Existing methods that rely on the sparsity of the empirical GRM also fail to work because of the lack of sparsity due to the population structure.
Here, we propose a scalable GLMM method for multi-ethnic studies that uses a sparse ancestry-adjusted GRM to model the sample-relatedness, and accounts for the population structure using the ancestry-informative principal components as fixed effects. By separating the distant ancestry and the familial relationships, our method provides a scalable and accurate solution to analyze large multi-ethnic studies, especially some of the recent WGS studies, which leads to accurate type I error control and improved power to detect associations. To facilitate the entire pipeline for the WGS data analysis, we further propose a scalable computation method to estimate the sparse ancestry-adjusted GRM using efficient distributed computation techniques, which can compute the sparse ancestry-adjusted GRM for the entire UK Biobank dataset of more than 450000 subjects in less than nine hours using only 45 CPUs and 40 GB overall memory usage. Using numerical simulations, and an application on the entire UK Biobank dataset, we demonstrate that our method is scalable to handle association analysis with more than 450000 subjects, and control type I error and improve power compared to the existing GLMM methods.
2020-2021
May 11, 2021
Brian Cleary
Broad Fellow
Broad Institute of MIT and Harvard
Using viral loads and within-host models to improve COVID-19 surveillance
Limited testing capacity has been an ongoing problem throughout the COVID-19 pandemic. Pooled testing is a faster and less expensive diagnostic approach compared to individual testing, but there are important tradeoffs in sensitivity, efficiency and logistics to consider. In this talk, we outline an approach combining within-host hierarchical models, compartmental models of pathogen spread, and viral load data to identify optimized pooling protocols that are robust to the changing state of an epidemic. This study highlights the importance of considering within-host biology and individual-level heterogeneity when evaluating epidemiological surveillance strategies. We will also introduce HYPER, a new pooled testing method based on hypergraph factorization. HYPER is designed to be easy to implement and adapt, while also producing pools that are balanced and efficient. We will discuss what hypergraph factorizations are and how they generate the pooling designs used in HYPER.
April 6, 2021
Huwenbo Shi and Martin Zhang
Postdoctoral Research Fellow, Department of Epidemiology
Harvard T.H. Chan School of Public Health
Transcriptome-wide association study and fine-mapping at cell-type resolution
Transcriptome-wide association studies (TWAS) using cis predicted gene expression have identified thousands of genes associated with diseases and complex traits (Gamazon et al. 2015 Nature Genetics, Gusev et al. 2016 Nature Genetics, Zhu et al. 2016 Nature Genetics). However, existing TWAS are based on gene expression models for bulk tissue gene expression, and therefore cannot pinpoint the specific cell types through which a gene is associated with diseases and complex traits.
Here, we introduce an approach to perform TWAS at tissue-cell-type resolution. We apply computational methods to infer cell type proportions and cell-type specific gene expression for each GTEx sample, using the mouse reference single-cell gene expression from the Tabula Muris Consortium in matched tissues, and obtain gene expression models for every tissue-cell-types based on the inferred gene expression.
We performed tissue-cell-type TWAS and fine-mapping for 52 diseases and complex traits. Compared to TWAS at tissue resolution, our tissue-cell-type TWAS discovered on average 84% more unique gene-trait associations at FDR < 0.05. Our approach also pinpoints specific cell types for each gene-trait association. For example, we found that the SORT1-LDL association is highly significant (P<10-30) in liver hepatocyte with fine-mapped PIP=0.68, and the CACNA1C-schizophrenia association in brain cerebellum neuronal cells (P=2.35×10-10, PIP=1.0) . In addition, in a systematic comparison with known disease-related genes, the fine-mapped tissue-cell-type TWAS results showed increased enrichment relative to tissue TWAS in 16/19 sets of comparisons (P=0.003).
March 9, 2021
Research Fellow
Analytic and Translational Genetics Unit (ATGU) at Massachusetts General Hospital, the Broad Institute, and Harvard Medical School
Assessing the Genetic Contribution of Drug Response Variability in Selective Serotonin Reuptake Inhibitors From More than 2 million Purchases in the Finnish National Drug Registry
February 16, 2021
Xuefang Zhao
Postdoctoral Researcher
Harvard Medical School, Center for Genomic Medicine at Massachusetts General Hospital
Expectations and blind spots for structural variation detection from short-read alignment and long-read assembly
Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and emerging clinical diagnostic approaches utilize short-reads (srWGS), which present constraints for genome-wide discovery of structural variants (SVs). Alternative long-read single molecule technologies (lrWGS) offer significant advantages for genome assembly and SV detection, while these technologies are currently cost prohibitive for large-scale disease studies and clinical diagnostics (~5-12X higher cost than comparable coverage srWGS). Moreover, only dozens of such genomes are currently publicly accessible by comparison to millions of srWGS genomes that have been commissioned for international initiatives. Given this ubiquitous reliance on srWGS in human genetics and genomics, we sought to characterize and quantify the properties of SVs accessible to both srWGS and lrWGS to establish benchmarks and expectations in ongoing medical and population genetic studies, and to project the added value of SVs uniquely accessible to each technology. In analyses of three trios with matched srWGS and lrWGS from the Human Genome Structural Variation Consortium (HGSVC), srWGS captured ~11,000 SVs per genome using reference-based algorithms, while haplotype-resolved assembly from lrWGS identified ~25,000 SVs per genome. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplications (SD) and simple repeats (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of the human reference, we observed extremely high concordance (93.8%) for deletions discovered by srWGS and lrWGS after error correction using the raw lrWGS reads. Conversely, lrWGS was superior for detection of insertions across all genomic contexts. Given that the non-SD/SR sequences span 90.3% of the GRCh38 reference, and encompass 95.9% of coding exons in currently annotated disease associated genes, improved sensitivity from lrWGS to discover novel and interpretable pathogenic deletions not already accessible to srWGS is likely to be incremental. However, these analyses highlight the added value of assembly-based lrWGS to create new catalogues of functional insertions and transposable elements, as well as disease associated repeat expansions in genomic regions previously recalcitrant to routine assessment.
December 15, 2020
Xihao Li
PhD Candidate
Harvard T. H. Chan School of Public Health
November 10, 2020
Carles Boix
PhD Candidate
MIT
October 13, 2020
Masahiro Kanai
PhD Candidate
Harvard Medical School, Analytical and Translational Genetics Unit, Mass General Hospital, Broad Institute
Here, we present a cross-population analysis of fine-mapping results based on three large-scale biobanks. In parallel to our effort fine-mapping complex traits in UK Biobank (n = 361,194; UKB) and expression QTL (eQTL) in GTEx (n = 838) (Ulirsch & Kanai et al), we fine-mapped hundreds of complex traits and diseases from Biobank Japan (n = 180,987; BBJ) and FinnGen (n = 183,694) using FINEMAP (Benner et al, 2016) and SuSiE (Wang et al, 2018). In total, 4,151 high-confidence putative causal variants for 124 traits were identified (posterior inclusion probability [PIP] > 0.9 in any population), including 46 and 66 population-enriched variants from BBJ and FinnGen, respectively. Distinct coding variants from each population often fine-mapped together in the same exons, and we found that coding putative causal variants are more deleterious (OR = 10.4 and 5.9 for pLoF and missense vs synonymous) and more pathogenic (OR = 28.3 for ClinVar variants) than other coding variants. Furthermore, we observed that non-coding putative causal variants are strongly enriched for promoters and cis-regulatory regions (accessible chromatin and H3K27ac) (OR = 10.8 and 11.3 vs non-genic) and colocalized with fine-mapped eQTL variants in GTEx, suggesting that the majority of putative causal variants could be explained via coding or regulatory mechanisms. Altogether, we demonstrate how diverse populations gain additional insights into disease biology with an expanded atlas of candidate causal variants.
September 29, 2020
Zilin Li
Postdoctoral Fellow
Harvard T. H. Chan School of Public Health
A Framework for Detecting Noncoding Associations in Large Whole Genome Sequencing Studies at Scale
Compared with GWAS and whole exome sequencing studies, large-scale whole genome sequencing studies have enabled the analysis of non-coding rare variants (RVs) associated with complex human traits. Common analytic strategies for RV association in non-coding region considered limited choices of gene-centric masks and sliding windows of a fixed length, and have limited scope to leverage the functions of variants.
We propose a non-coding rare variant association detection framework, including gene-centric analysis and genetic region analysis. For gene-centric analysis, we consider various strategies for grouping non-coding variants based on functional annotations, including UTR, upstream, downstream, promoter, enhancer and long non-coding RNA genes. For genetic region analysis, we group non-coding RVs residing in a contiguous window, defined either by a pre-specified (fixed) window size or a flexible data-adaptive window size using SCANG (SCAN the Genome). The STAAR (variant-Set Test for Association using Annotation infoRmation) method is also applied in the framework that increases the power of RV association tests by effectively incorporating multiple functional annotations.
We applied the proposed framework to analyze non-coding RV association with four quantitative lipid traits (LDL-C, HDL-C, TG and TC) in 21,015 discovery samples and 9,123 replication samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. Several novel non-coding RV-sets associated with lipids were discovered and replicated using the TOPMed WGS data.
2019-2020
May 12, 2020
Tiffany Amariuta
PhD Candidate
Harvard Medical School
Leveraging functional annotations to improve the trans-ethnic portability of polygenic risk scores
Polygenic risk scores (PRS) quantify the cumulative trait-increasing allelic effect of an individual’s genome and have the potential to transform clinical practice by identifying patients with higher genetic burdens. Poor trans-ethnic portability of PRS is a critical issue that may be partially due to limited knowledge of causal variants shared among populations. Hence, leveraging noncoding regulatory annotations that capture genetic variation across populations has the potential to enhance the trans-ethnic portability of PRS. Using our cell-type-specific regulatory element annotation strategy called IMPACT, we partitioned the common SNP heritability of diverse polygenic traits and diseases from 111 GWAS summary statistics of European (EUR, average N=180K) and East Asian (EAS, average N=157K) origin. Strikingly, we observed highly concordant polygenic trait regulation between populations: the same regulatory annotations captured statistically indistinguishable SNP heritability. Therefore, prioritizing variants in IMPACT regulatory elements may improve the trans-ethnic portability of PRS by selecting variants with population nonspecific effects. Indeed, we observed that EUR PRS models more accurately predicted 21 tested phenotypes of EAS individuals when variants were prioritized by key IMPACT tracks (49.9% mean relative increase in R ). Notably, the improvement afforded by IMPACT 2 was greater in the trans-ethnic EUR-to-EAS PRS application than in the EAS-to-EAS application (47.3% vs 20.9%, P < 1.7e-4). Overall, our study identifies a crucial role for functional annotations such as IMPACT to improve the trans-ethnic portability of genetic data, and this has important implications for future risk prediction models that work across populations.
April 21, 2020
Wei Zhou
Postdoctoral Research Fellow
Broad Institute
An efficient and accurate frailty model approach for genome-wide survival association analysis controlling for population structure and relatedness in large-scale biobanks
With decades of electronic health records linked to genomic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan.
We developed an efficient and accurate frailty (random effects) model approach for genome-wide survival association analysis of censored time-to-event phenotypes in large datasets (>400,000 individuals), which accounts for population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computation cost. The saddlepoint approximation is used to allow for analysis of heavily censored endpoints and low frequency variants. We have demonstrated analysis of ~400,000 samples and 77 million genetic markers in 15 hours using 32 threads and < 15 GByte of memory – with computing time scaling nearly linearly with the numbers of samples and markers. Using simulation studies, we have showed that type I error rates are well controlled, even for less frequent variants with frequency 0.5% and censoring rate 99%. We analyzed ~180,000 samples in FinnGen and ~400,000 UK Biobank participants with white British ancestry and identified the well-known locus APOE-e4 associated with lifespan. We also identified loci for secondary conditions following primary diseases, such as death following lung cancer and nephropathy following diabetes. Currently, we are applying our method to more event-to-time endpoints in the UK Biobank.
Genome-wide survival analysis of large biobanks will help understand the natural history and progression of diseases and pave a road for constructing polygenic risk scores to predict age-at-onset, disease progression and lifespan.
April 7, 2020
Hufeng Zhou
Instructor in Medicine
Harvard Medical School
Functional Annotation of Variants – Online Resource
March 10, 2020 – cancelled
Harvard Medical School
February 25, 2020
Caitlin Carey
Post-Doctoral Fellow
Analytical and Translational Genetics Unit, Mass General Hospital and Broad Institute
Genetic architecture of phenome-wide latent factors in the UK Biobank
December 10, 2019 – cancelled
Duncan Palmer
Post-Doctoral Fellow
Analytical and Translational Genetics Unit
Mass General Hospital and Broad Institute
November 19, 2019
Tarjinder Singh
Postdoctoral fellow
Massachusetts General Hospital, Harvard Medical School
Exome sequencing identifies genes for schizophrenia
Ultra-rare variants from 25,000 schizophrenia cases implicates ten genes and provides mechanistic hypotheses related to the biology of disease
The Schizophrenia Exome Sequencing Meta-Analysis (SCHEMA) Consortium is one of the largest efforts to analyze sequencing data to advance gene discovery. We have completed the analysis of 24,248 sequenced cases and 97,322 controls comprising of individuals from five continental populations. The scale of SCHEMA enables us, for the first time, to implicate URVs in ten genes as conferring substantial risk for SCZ at genome-wide significance (odds ratios 4 50, P < 2e6), and 34 genes at a FDR < 5%. Two of these, the NMDA receptor subunit GRIN2A and transcription factor SP4, reside in two loci implicated by SCZ GWAS. A second glutamate receptor subunit, GRIA3, is also implicated, providing support for the hypofunction of the glutamatergic system in the pathogenesis of schizophrenia.
Exploring the published results from severe neurodevelopmental delay (NDD) and autism (ASD) consortia, we find that top ten SCZ genes have no protein-truncating variant signal in either ascertainment, though a significant overlap between ASD (n = 102, FDR < 10%) and SCZ risk genes (n = 34, FDR < 5%) was observed. Partitioning the 102 ASD genes into those disrupted more frequently 1) in ASD and 2) in intellectual disability (ID), we show that this signal was driven by the ASDpreferential, and not ID preferential genes. Thus, SCZ genes from exome sequencing have relevance for later-onset psychiatric disorders rather than more severe NDDs.
Between genes identified by GWAS and exome sequencing, we find convergence in biological processes and tissue types, specifically in synaptic transmission and components of the postsynaptic density. After excluding associated genes, SCZ cases still carry a substantial excess of rare URVs, suggesting that many more remain to be discovered.
October 1, 2019
Julian Hecker
Research Associate
Brigham and Women’s Hospital
PolyGEE: A generalized estimating equation approach to the efficient estimation of polygenic effects in large-scale association studies
To quantify polygenic effects in large-scale association studies, we propose a generalized estimating equation (GEE) based estimation framework. We develop a marginal model for single-variant association test statistics of complex diseases that is applicable to population-based designs, to family-based designs or arbitrary combinations of both. We extend the standard GEE approach so that the parameters of the proposed marginal model can be estimated based on working-correlation/linkage-disequilibrium (LD) matrices from external reference panels. Our method achieves substantial efficiency gains over standard approaches, while it is robust against misspecification of the LD structure, i.e. the LD structure of the reference panel can differ substantially from the true LD structure in the study population. In simulation studies and applications to real data, we illustrate the features of the proposed GEE framework.
September 24, 2019
Raymond Walters
Post-Doctoral Research Fellow
Analytical and Translational Genetics Unit
Mass General Hospital and Broad Institute
Exploring sex differences in the genetics of UK Biobank phenotype
Sex differences are commonly observed for numerous human traits (e.g. sexual dimorphism of height and weight, sex bias in psychiatric diagnoses), likely reflecting both biological (e.g. sex hormones) and environmental (e.g. cultural norms) effects. Twin studies and genome-wide association studies (GWAS), have suggested that genetic effects are predominantly shared across sexes. Nevertheless, there is evidence that for certain phenotypes, such as disorders involving alcohol use and waist-hip ratio, genetic effects may differ by sex. Detecting these differences remains challenging, however, since the sex differences in genetic effects are likely smaller than the genetic main effects. We use the large-scale genotyping data (194,174 female and 167,020 male unrelated individuals of European ancestry) of UK Biobank (UKB) to evaluate whether sex differences in genetic effects can be detected across the breadth of UKB phenotypes. In preliminary results, we find that on average most autosomal genetic effects are shared between sexes (mean rg=.92), but significant differences are observed for reported substance use, body mass index, and insomnia, among other phenotypes. We compare these estimates to analyses of siblings in UKB, and identify individual loci showing significant differences in effects between the GWAS in each sex.
2018-2019
April 9, 2019
Keegan Korthauer
Postdoctoral Research Fellow
DFCI and HSPH
Accurate inference of DNA methylation data: statistical challenges lead to biological insights
DNA methylation is an epigenetic modification widely believed to act as a repressive signal of gene expression. Whether or not this signal is causal, however, is currently under debate. Recently, a groundbreaking experiment probed the influence of genome-wide promoter DNA methylation on transcription and concluded that it is generally insufficient to induce repression. However, the previous study did not make full use of statistical inference in identifying differentially methylated promoters. In this talk, I’ll detail the pressing statistical challenges in the area of DNA methylation sequencing analysis, as well as introduce a statistical method that overcomes these challenges to perform accurate inference. Using both Monte Carlo simulation and complementary experimental data, I’ll demonstrate that the inferential approach has improved sensitivity to detect regions enriched for downstream changes in gene expression while accurately controlling the False Discovery Rate. I will also highlight the utility of the method through a reanalysis of the landmark study of the causal role of DNA methylation. In contrast to the previous study, our results show that DNA methylation of thousands of promoters overwhelmingly represses gene expression.
March 12, 2019
Elizabeth Atkinson
Postdoctoral Research Fellow
MGH / Broad Institute
February 26, 2019
Dandi Qiao
Instructor in Medicine
Brigham and Women’s Hospital
Gene-based segregation method and its application in family-based exome sequencing studies of severe COPD
For Mendelian diseases or complex diseases with Mendelian subtypes, filtering-based approach has been used to identify rare coding variants using whole exome sequencing data of families. However, formal statistical approaches are needed to quantify the uncertainty of such filtering approach. We described a Gene-based Segregation test (GESE) that was constructed using the probability of segregation events under the null hypothesis of Mendelian transmission, which takes into account the degree of relatedness in families, and the number and minor allele frequencies of the functional rare variants in the genes. We showed via simulation that GESE performs better than region-based methods for such diseases. Chronic obstructive pulmonary disease (COPD) is strongly influenced by genetic factors including rare variants of large effects, as exemplified by alpha-1 antitrypsin deficiency. We applied our method to a large whole-exome sequencing data of 1554 subjects in families with severe COPD patients. We identified two genes significantly segregating in multiple families. The expression of 5 out of the top 10 genes was associated with forced expiratory volume in 1-second (FEV1) % predicted, with enrichment p-value of 0.024. Our proposed method shows great potential for identifying rare coding variants of large effect and high penetrance for family-based sequencing data. It is implemented in an R package that is publicly available on CRAN.
December 11, 2018
Andrea Byrnes
Postdoctoral Fellow
MGH / Broad Institute
November 6, 2018
Peter Kharchenko
Associate Professor of Biomedical Informatics
Harvard Medical School
Analysis of transcriptional dynamics with single-cell transcriptomics
Single-cell RNA-seq measurements can reveal detailed composition of complex biological tissues, however generally capture a snapshot of the cellular states at a given point in time. However, in many contexts, such as development or stimulus response, it is the dynamics of the cellular state is of primary interest. We find that residual signatures of mRNA processing, such as trace amounts of unspliced mRNA, can be used to estimate the first time derivative of the transcriptional state. These RNA velocity can be used to extrapolate the state of the cells into the future on the timescale of hours, facilitating interpretation of the cell dynamics.
October 2, 2018
Robert Maier
Analytic and Translational Genetics Unit
MGH, Broad Institute, HMS
Polygenic adaptation signals for height are confounded by population structure
with Mashaal Sohail, Andrea Ganna, Alex Bloemendal, Alicia Martin, Mark Daly, Nick Patterson, Benjamin Neale, Iain Mathieson, David Reich, Shamil Sunyaev
In the past six years, numerous publications have reported directional selection signals of polygenic selection on height. Here, we show that many of these results are severely confounded by uncontrolled population stratification. In particular, signals of polygenic adaptation based on summary statistics from the from the GIANT consortium meta-analysis are dramatically reduced in magnitude and, in many cases no longer statistically significant when using summary statistics derived from the UK Biobank. This polygenic adaptation signal was apparently independently confirmed by a tests based on the singleton density score statistic (SDS). However, we show that this signal too is only present when using GIANT but not UK Biobank summary statistics. Specifically, the Spearman correlation between p-value and SDS statistic is 2e-65 using GIANT statistics but only 0.077 using UK Biobank.
We further show that correlations between effect size estimates and allele frequency differences between North- and South- European populations underlie most of these discrepancies. The confounding with population stratification appears to be most severe for less significant SNPs; restricting the analyses to genome-wide significant SNPs results in a higher concordance between GIANT and UK Biobank data. While stratification-free within-family estimates suggest that the phenotypic north-south height gradient in Europe is indeed paralleled by genetically predicted height as reported before, the magnitude of this effect had been greatly overestimated.
Height is widely used as a model for polygenic traits, which raises the question whether other methods using similar summary statistics might suffer from the same kind of confounding. Existing methods that aim to identify the presence of population stratification in GWAS summary statistics are not always applicable, and we provide simple suggestions that can supplement these methods in order to detect uncontrolled stratification and avoid resulting biases.
September 25, 2018
Doga Gulhan
Research Fellow in Biomedical Informatics
Harvard Medical School
2017-2018
May 1, 2018
John Platig
Postdotoral Research Fellow, Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute
Exploring regulation in tissues with eQTL networks
Characterizing the collective regulatory impact of genetic variants on complex phenotypes is a major challenge in developing a genotype to phenotype map. Using expression quantitative trait locus (eQTL) analyses, we constructed bipartite networks in which edges represent significant associations between genetic variants and gene expression levels, and found that the network structure informs regulatory function. We show, in thirteen tissues, that these eQTL networks are organized into dense, highly modular communities grouping genes often involved in coherent biological processes. We find communities representing shared processes across tissues, as well as communities associated with tissue-specific processes that coalesce around variants in tissue-specific active chromatin regions. Node centrality is also highly informative, with the global and community hubs differing in regulatory potential and likelihood of being disease-associated.
April 10, 2018
Victoria Dean
Research Intern
Deep Genomics
Deep Learning for Branch Point Selection in RNA Splicing
Branch point selection is a key step in RNA splicing, yet many many popular splicing analysis tools do not model this mechanism. There were relatively few confirmed branch points until 2015, when a genome-wide map of experimental human branch points was released. This data facilitates, for the first time, modeling branch sites with more sophisticated methods. We used deep learning to model branch site selection, which improved significantly over position-weight matrix models. We show that our branch point model can be used to classify potential disease-causing variants, and can help to improve existing splicing models.
March 20, 2018
Marieke Kuijjer
Postdotoral Research Fellow,Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute
Understanding Tissue-specific Gene Regulation
Although all human tissues carry out common processes, tissues are distinguished by gene expression patterns, implying that distinct regulatory programs control tissue-specificity. In this study, we investigate gene expression and regulation across 38 tissues profiled in the Genotype-Tissue Expression project. We find that network edges (transcription factor to target gene connections) have higher tissue-specificity than network nodes (genes) and that regulating nodes (transcription factors) are less likely to be expressed in a tissue-specific manner as compared to their targets (genes). Gene set enrichment analysis of network targeting also indicates that regulation of tissue-specific function is largely independent of transcription factor expression. In addition, tissue-specific genes are not highly targeted in their corresponding tissue-network. However, they do assume bottleneck positions due to variability in transcription factor targeting and the influence of non-canonical regulatory interactions. These results suggest that tissue-specificity is driven by context-dependent regulatory paths, providing transcriptional control of tissue-specific processes.
February 27, 2018
Andrea Ganna
Research fellow, Analytical and Translational Genetic Unit
Massachusetts General Hospital
December 12, 2017
Camila Lopes-Ramos
Postdoc Fellow
Dana-Farber Cancer Institute
Using gene regulatory networks to understand sex differences
Sex differences plays an important role in development and progression of many diseases, and understanding how the disease differs between the sexes is essential for the advancement of prevention, diagnosis, and treatment. We conducted a comprehensive assessment of gene expression and gene regulatory network modeling in 29 healthy tissues from the GTEx data set. We observed sex-biased patterns of gene expression involving as many as 60% of the human genome, depending on the tissue. Interestingly, we could capture the sex-biased gene expression through differential network targeting. Next, we used network analysis to study colon cancer. Males have both a higher risk of developing colon cancer and a lower survival rate than women, yet the molecular features that drive these sex differences are poorly understood. We compared gene expression between tumors in men and women and found no significant sex differences except for X and Y chromosome genes. However, when we used PANDA and LIONESS to infer patient-specific gene regulatory networks, we identified genes and pathways that were targeted by transcription factors in a sex-specific manner. Our network analysis uncovered patterns of transcriptional regulation that differentiate male and female in healthy tissues, and also in colon cancer. Changes in gene regulatory networks may explain not only differences in tissue expression, but may also help us to understand the mechanism of sex differences in diseases and other complex traits.
November 7, 2017
Joshua Campbell
Assistant Professor, Division of Computational Biomedicine
Department of Medicine, Boston University
Characterizing transcriptional states and cellular populations in single-cell RNA-seq data using discrete Bayesian hierarchical modeling
We can try to understand complex biological systems by dividing them into hierarchies where each level of the hierarchy is composed of different parts that perform distinct biological functions. For example, organisms can be subdivided into a collection of complex tissues; each complex tissue is composed of different cell types; each cell type is comprised of a unique combination of transcriptional pathways (i.e. transcriptional states); and each transcriptional state is composed of groups of genes that are coordinately expressed to perform specific molecular processes. Single-cell RNA-seq can be used to explore these hierarchies by unbiasedly identifying cellular populations within a sample and to determining the unique combination of transcriptional states that define each subpopulation. However, this type of data is noisy and sparse due to the challenges inherent in amplifying small amounts of RNA from single cells.
With these goals and challenges in mind, we implemented a novel discrete Bayesian hierarchical model that reflects the hierarchy observed in biological systems. This model, which we call Cellular Latent Dirichlet Allocation (Celda), can be used to simultaneously cluster co-expressed genes into transcriptional states, cells into subpopulations, and quantify the proportion of each cellular subpopulation within individual samples. We used this model to identify transcriptional states and novel cellular subpopulations enriched in the airways of smokers versus nonsmokers. Overall, these models represent novel approaches to characterizing cellular and transcriptional heterogeneity in biological samples that have been profiled with single-cell RNA-seq.
October 3, 2017
Yang-Yu Liu
Assistant Professor of Medicine
Brigham and Women’s Hospital and Harvard Medical School
Controlling Human Microbiota
We coexist with a vast number of microbes—our microbiota—that live in and on our bodies, and play an important role in human physiology and diseases. Propelled by metagenomics and next-generation DNA sequencing technologies, many scientific advances have been made through the work of large-scale, consortium-driven metagenomic projects. Despite these advances, there are still many fundamental questions regarding the dynamics and control of microbiota to be addressed. Indeed, it is well established that human-associated microbes form a very complex and dynamic ecosystem, which can be altered by drastic diet change, medical interventions, and many other factors. The alterability of our microbiome offers opportunities for practical microbiome-based therapies, e.g., fecal microbiota transplantation and probiotic administration, to restore or maintain our healthy microbiota. Yet, the complex structure and dynamics of the underlying ecosystem render the quantitative study of microbiome-based therapies extremely difficult. In this talk, I will discuss our recent theoretical progress on controlling human microbiota [1-6].
September 26, 2017
Rui-Sheng Wang
Instructor of Medicine
Brigham and Women’s Hospital and Harvard Medical School
Network analysis of the genomic basis of the placebo effect
The placebo effect is a phenomenon in which patients who are given an inactive treatment (e.g., inert pill) show a perceived or actual improvement in a medical condition. Placebo effects in clinical trials have been investigated for many years especially because placebo treatments often serve as the control arm of randomized clinical designs. Recent observations suggest that placebo effects may be modified by genetics. This observation has given rise to the term “placebome,” which refers to a group of genome-related mediators that affect an individual’s response to placebo treatments. In this talk, I will present a network analysis of the placebome and identify a placebome module in the comprehensive human interactome using a seed-connector algorithm. I will also talk about the validation of the placebome module using a large cohort of the Women’s Genome Health Study (WGHS) and discuss the closeness of the placebome module to different disease modules and drug target modules.
2016-2017
Tuesday, May 9, 2017
1:00-2:00 PM
Biostats Conference Room, Building 2, Rm 426
Guo-Cheng Yuan
Associate Professor of Computational Biology and Bioinformatics
Harvard T.H. Chan School of Public Health
Systematic Mapping of Cellular States and Regulatory Circuits
Human and other multi-cellular organisms are composed of diverse cell types each has its own morphology and biological functions. However, efforts for comprehensive characterization of cell types have just begun recently, thanks to the rapid development of single-cell technologies and data analysis methods. In this talk, I will discuss a few methods our lab developed to characterize cell states and lineages. I will also discuss our work in elucidating the gene regulatory networks underlying cell fate decision and maintenance.
Tuesday, April 11, 2017
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426
Luca Pinello
Assistant Professor
Massachusetts General Hospital and Harvard Medical School
Chromatin state variability & single-cell state and developmental trajectory modeling
In this talk I will present, first, an approach that exploits chromatin state variability across cell types to uncover non-coding functional genomic regions and potential regulated genes. This method recovers regions enriched for genetic variants, and provides additional power in explaining genetic heritability of traits or diseases, compared to well-characterized functional elements such as enhancers, promoters or DHS sites.
Then I will present a novel computational method called ARIADNE for single-cell analysis of transcriptomic data. This method can be used to disentangling complex cellular types and states in development, differentiation or in perturbation studies. It can accurately detect cellular hierarchies and recover complex developmental trajectories. In addition, it provides informative and intuitive visualization techniques to highlight important genes that can be used as markers to define sub-populations and rare cell types.
Tuesday, March 28, 2017
12:30-1:30 PM
*Kresge G2*
Samira Asgari
Brigham and Women’s Hospital
Genomics of severe paediatric infections
Infectious diseases are among the leading causes of human mortality and the burden of infectious disorders is the greatest in paediatric population. For any infectious disorder only a fraction of the exposed individuals develop the clinical symptoms and only a small fraction of the individuals with clinical symptoms develop life-threatening conditions (e.g. extreme phenotypes). These individuals with extreme phenotypes are more likely to carry variants of profound effects. In this presentation I will talk about the role of rare human genetic loss-of-function (LoF) variants in life-threatening susceptibility to Pseudomonas aeruginosa (P. aeruginosa) and respiratory syncytial virus (RSV) infections.
Tuesday, March 7, 2017
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426
Wei Li
Postdoctoral Fellow
Harvard School of Public Health
CRISPR Screens: Algorithms and Applications to Functional Genomics
High-throughput CRISPR/Cas9 knockout screens have shown great promise in the functional studies of coding genes and non-coding elements. In this talk, I will discuss my recent work to develop algorithms for modeling CRISPR screening data, and identify functional coding and non-coding elements in cancer cells. I present MAGeCK and MAGeCK-VISPR, two novel algorithms to analyze CRISPR screens. MAGeCK uses a ranked-based method to identify top hits in the screen, while MAGeCK-VISPR builds a maximum-likelihood framework to analyze screens involving multiple conditions. Based on these algorithms, we studied the mechanism of endocrine resistance, where cancer cells are resistant to the standard endocrine therapy in many breast cancer patients. We identified and validated genes whose loss confers endocrine resistance, as well as synthetic lethal vulnerabilities that are potential drug targets. These findings demonstrate the potential of CRISPR screens to tackle many emerging problems, including cancer and drug resistance.
Tuesday, February 14, 2017
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426
Jialiang Huang
Postdoctoral Fellow
Harvard School of Public Health
Dissection of super-enhancer hierarchy
Super-enhancers (SEs), groups of putative enhancers with unusually high enhancer activity, play prominent roles in determining cell identity in development and disease. However, it remains unclear to what extent SEs are structurally and functionally distinct from conventional enhancers. We hypothesize that many super-enhancers are composed of a hierarchy of constituent elements whose functional difference is associated with long-range chromatin interaction patterns. To test this hypothesis, we developed a systematic approach to classify SEs into hierarchical and non-hierarchical subtypes by integrating Hi-C and ChIPseq data. We find that, comparing with non-hierarchical SEs, hierarchical SEs enriched with cell-type specific regulators. Furthermore, within each hierarchical SE, we found that the hub enhancers are highly enriched for disease-associated variants, which might suggest their functional potency. In a few cases, the hierarchy within a SE has been previously mapped out by using CRISPR/Cas9-mediated. Comparison between this functional hierarchy and that predicted from chromatin interactions shows strong agreement. Taken together, these results strongly support our hypothesis that many SEs are composed of a hierarchical structure that is associated with chromatin interaction patterns.
Tuesday, December 13, 2016
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426
Bo Li
Harvard T.H. Chan School of Public Health
“Computational approaches for understanding tumor-immune interaction with clinical implications”
Characterization of the interaction between cancer and immune system is critical to developing novel immunotherapies. Here in this talk, I will present two computational methods we have developed. The first one is Tumor IMmune Estimation Resource, or TIMER, which is a statistical tool for deconvolving different immune cell components in the tumor microenvironment using gene expression data. The second one is called T-cell receptor Repertoire Utilities for Solid Tumor, or TRUST, which is a de novo assembler for analyzing the TCR hypervariable sequences using unselected RNA-seq data. Application of both methods to large cancer cohort lead to biological findings with potential clinical applications.
Tuesday, Nov 8, 2016
12:30-1:30 PM
Biostats Conference Room, Building 2, Rm 426
Yue Li
Computer Science and Artificial Intelligence Laboratory (CSAIL)
MIT
“Joint Bayesian inference of risk variants using summary statistics and epigenomic annotations across multiple traits”
Fine-mapping causal variants is challenging due to linkage disequilibrium and the lack of interpretation of noncoding mutations, which contribute to ~70-90% of GWAS SNPs. Most of the existing fine-mapping methods do not scale well on inferring multiple causal variants per locus and causal variants across multiple related diseases. Moreover, Many complex traits are genetically related and potentially share causal mechanisms. Thus, we hypothesize that exploiting the correlation between traits at the epigenomic annotation level may prove useful in fine-mapping for shared causal mechanisms that go beyond the level of individual variants. We develop a new Bayesian fine-mapping model named RiVIERA. The key features of RiVIERA include 1) require only summary statistics; 2) ability to model epigenomic covariance of multiple related traits; 3) efficient posterior inference of causal configuration; 4) efficient Bayesian inference of causal annotations.
We conducted a comprehensive simulation studies using 1000 Genome and Roadmap epigenomic data to demonstrate that RiVIERA compares quite favorably with existing methods yet having unique advantages of jointly inferring and leveraging the underlying tissue-specific functional co-enrichment among traits that are not necessarily associated with the same set of risk loci. In particular, the efficient inference of multiple causal variants per locus from our RiVIERA model led to significantly improved estimation of the causal posterior and functional enrichments compared to the state-of-the-art fine-mapping methods especially for the cases where more than 3 causal variants co-harboured within the same loci. Furthermore, joint modelling multiple traits confers further improvement over the single-trait mode of the same model, which attributes to the more robust estimation of the enrichment parameters. Finally, we demonstrated the application of RiVIERA on jointly modelling several well-powered GWAS datasets of genetically related traits. Our analyses revealed several putative pleiotropic pathways that are disrupted by separate mutations in those traits, which do not share the same risk loci. RiVIERA is freely available from Github.
Tuesday, October 4, 2016
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Curtis Huttenhower
Associate Professor of Computational Biology and Bioinformatics
Harvard T.H. Chan School of Public Health
“Methods for population studies of the human microbiome”
Epidemiology-scale studies of the human microbiome pose unique methodological challenges: assay technologies are still new and rapidly changing, inter-individual biological and environmental variation is high, and relevant molecular, translational, and public health endpoints are not yet fully defined. I will discuss some of the lab’s efforts to close these gaps, beginning with an overview of three phases of the NIH Human Microbiome Project. These include, first, previously published work to define a “healthy” microbiome. The project has continued with ongoing work to study its strain-level composition and molecular functionality in over 2,500 shotgun metagenomes, the largest body-wide survey to date. Finally, new work in the “HMP2” characterizes microbiome dynamics and metabolism during inflammatory bowel disease. I will conclude with results from the Microbiome Quality Control Project (MBQC), the first large-scale effort to identify and mitigate sources of variation and non-reproducibility in human microbiome studies.
Tuesday, September 27, 2016
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Peng Jiang
Research Fellow
Harvard T.H. Chan School of Public Health
“Inference of genes associated with resistance to targeted cancer therapy”
Drug resistance is a major challenge of targeted therapy in cancer. We present CARE (Comprehensive Analysis of REsistance), a computational method for identification of genes associated with targeted therapy resistance from pharmacological screen data. The methodology of CARE is based on test of interaction effect in multivariate regression. When applied to CCLE, CGP and CTRP datasets, CARE significantly outperformed other computational methods, such as Elastic Net and ANOVA, in predicting drug resistance-associated genes. Moreover, the gene signature inferred by CARE can better predict therapy clinical outcome than the signatures from other experimental technologies such as CRISPR and shRNA screens. When finding genes associated with Lapatinib resistance, CARE identified PRKD3 as the top candidate. Experimental validation by both siRNA pool and small molecule compounds confirmed that inhibition of PRKD3 significantly sensitizes HER2+ breast cancer cell to Lapatinib.
2015-2016
Working Group Organizers: Zhonhua Liu & Pier Palamara
Tuesday, April 5, 2016
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Kaitlin Samocha
Kaitlin Samocha is a graduate student in the Biological and Biomedical Sciences (BBS) program at Harvard Medical School. She currently works with Dr. Mark Daly on a number of projects revolving around the application of statistical rigor to the results of sequencing studies. Her academic interests, more broadly, focus on the genetic basis underlying complex traits.
“Using large reference-population call-sets to prioritize observed variation in patients with neurodevelopmental disorders”
It is a primary challenge of human genetics to distinguish disease-causing rare variation from the multitude of more benign, low-frequency variants found in any genome. As a complement to methods that predict the deleteriousness of individual variants, we target variants found in genes that show unusually low levels of variation in healthy populations. Leveraging the large number of reference individuals in the call-set released by the Exome Aggregation Consortium (ExAC; n=60,706), our model of mutation predicts the number of expected rare (minor allele frequency <0.001) variants for each gene and identifies genes that are significantly depleted of loss-of-function and/or missense variation. Further, the vast number of observable rare variants in ExAC allows us the statistical power to use our method to search for missense-constrained regions within genes.
We applied both gene-level and region-level metrics of mutation intolerance (constraint) to de novo variants found in patients with neurodevelopmental disorders, such as autism spectrum disorders and intellectual disability, as well as to the variants found in controls. The previously established excess de novo loss-of-function variation found in these cases manifests primarily in genes we identify as highly loss-of-function constrained. We also find that the regions with the most severe missense constraint have a greater enrichment of established pathogenic variants, in addition to a higher rate of de novo missense variants found in cases when compared to the rate seen in controls. By contrast, those regions under no constraint show no difference between case and control de novo rates. These analyses highlight the use of evaluating genic intolerance to mutations when interpreting the variation found in patients.
Tuesday, March 1, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Wouter Meuleman
Postdoctoral Associate
MIT CSAIL & Broad Institute
Interpreting the non-coding genome using computational epigenomics
Chromatin state models and annotations are valuable tools for interpreting epigenomic data across potentially large numbers of different epigenomic marks. Summarizing the combinatorial occurrence of marks into a discrete number of states, they provide insight into the functional characteristics of genomic regions. Although useful for the interpretation of particular genomic regions of interest, it remains a challenge to extract meaningful patterns at the genome-wide level. The growing amount of epigenomic data across multiple tissues, cell types, conditions, and individuals, calls for scalable quantitative methods allowing for interpretation and mining of chromatin state models.
To address this challenge, we developed epilogos, a novel approach for analyzing chromatin state models across an arbitrarily large number of epigenomes. At the core of the approach, chromatin states are modelled and visualized in a manner analogous to “motif logos”, as traditionally learned from amino acid or nucleotide biological sequences. This allows for an immediate and intuitive readout and interpretation of results.
The epilogos approach greatly improves the interpretability of epigenomic reference maps such as those generated by the Roadmap Epigenomics Consortium, leading to a better interpretation of genetic maps as well. It has already been shown to easily scale up to hundreds of epigenomes, representing thousands of underlying genome-wide signal tracks. It allows for the quantification of epigenomic variation between cellular conditions, and opens up possibilities for the characterization of spatial as well as temporal epigenomic patterns.
Tuesday, February 2, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Giulio Genovese
Stanley Center for Psychiatric Research
Broad Institute
Detection of copy-number-neutral-loss-of-heterozygosity (CNNLOH) events in whole genome sequencing (WGS) experiments
Measurements of allelic fractions at heterozygous sites for the purpose of detection of copy number alterations have initially focused on detection of two separate Gaussian mixtures around 0.5 related to, respectively, alleles that were gained and alleles that were lost in the event. However, when mosaic alterations are present at low allelic fractions, it becomes more challenging to discern signals from two separate Gaussian mixtures rather than a single Gaussian distribution generated by the allelic ratios at heterozygous sites. Phasing information, allowing to not have to guess which Gaussian mixture to observe in the case of a mosaic copy number alteration, has been successfully used to aid in the detection of subtle allelic imbalance in data from genotyping arrays.
In WGS experiments at 30x the standard deviation for the measured allelic fraction is ~9% while the standard deviation for the measured allelic fraction in a genotyping experiment is ~4%. While the lower information at a single heterozygous site can be made up for by the much larger number of heterozygous sites assayed across the whole genome, due to this disparity phasing information is even more important in WGS experiments than it is in genotyping array experiments.
We expanded on existing methodologies and developed a new algorithm to incorporate phasing information into detection of CNNLOH events from WGS data. Such a method incorporates state-of-the-art phasing from population genetics models such as Beagle and a Markov model to allow for errors in the phasing. Combination of phasing and WGS allows detection of large CNNLOH mosaic events present in as little as 1-2% of cells as opposed to published methods requiring relatively high frequency of cells with the same abnormal karyotype (>5-10%).
Tuesday, December 1, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Tune Pers
Children’s Hospital Boston and Broad Institute
Research fellow in Joel Hirschhorn’s Lab
Towards schizophrenia genes and pathways based on analysis of GWAS data
Over 100 associated genetic loci have been robustly associated with schizophrenia. Gene prioritization and pathway analysis have focused on a priori hypotheses and thus may have been unduly influenced by prior assumptions and missed important causal genes and pathways. Using a data-driven approach, we show that genes in associated loci: (1) are highly expressed in cortical brain areas; (2) are enriched for 104 gene sets including dendritic spine development (P=3.15×10-6), processes related to the postsynaptic density (P<1.29×10-6) and behavioral mouse phenotypes (P<5×10-4); and (3) contain 62 genes that are functionally related to each other and hence represent promising candidates for experimental follow up. We validate the relevance of the prioritized genes by showing that they are enriched for rare disruptive variants from schizophrenia sequencing studies (odds ratio 1.74, P=0.04), and are enriched for genes encoding members of mouse and human postsynaptic density proteomes (odds ratio 4.56, P=5.00×10-4; odds ratio 2.60, P=0.049).
Tuesday, November 3, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Rodolphe Thiébaut
Professor at the Bordeaux University, Bordeaux, France
Team leader INSERM/INRIA SISTM
Statistical analysis of high-dimensional data in clinical trial: the example of an HIV vaccine phase I/II trial
The size and the complexity of data collected for the evaluation of interventions in clinical trials has considerably changed. This is due to the availability and the improvement of new technologies including gene sequencing. Today, vaccine immunogenicity is assessed by various measurements including cell populations characterization (up to 216 with flow cytometry to 240 with CYTOF), proteins productions (cytokines at a single cell level with intracellular staining, in supernatants with multiplex assays), gene expression and viral adaptation using microarray and sequencing technologies. This situation gives the opportunity to study the mechanistic effect of the vaccine and also to try to predict the future response with baseline or early responses thank to the broad spectrum of the measurements performed.
In this talk, I will use the example of the DALIA 1 vaccine trial evaluating the response to an ex vivo generated DC loaded with HIV-lipopeptides in HIV patients treated by antiretroviral therapy (ART). Gene expression in whole blood was measured by microarrays (Illumina HumanHT-12) at 14 time points. Post vaccination immune responses were evaluated using various assays. In step 1, a Time-course Gene Sets Analysis (TcGSA) was performed using hierarchical models allowing heterogeneity in predefined gene sets and implemented in an R package (Plos Comp Biol, 2015,11:e1004310). Statistical properties of this approach have been studied through simulations. Association between abundance of genes selected in step 1, immune responses at w16 and viral replication after ART interruption was analysed in step 2 using an extension of the sparse-Partial Least Square approach to take into account the structural or group effects due to the relationship between markers among biological pathways.
This example illustrates new opportunities as well as the complexity of the analysis of high dimensional datasets and the need of adapted statistical approach.
Tuesday, October 6, 2015
12:30-1:30 PM
Building 2, Room 426 – Biostatistics Conference Room
Peng Jiang
Postdoc at Dana-Farber Cancer Institute and Harvard School of Public Health
Inference of regulatory abnormality in cancers from heterogeneous public data
Recent years have seen the rapid growth of large-scale tumor-profiling data, which provides numerous opportunities for novel cancer biology insights. However, the experimental condition of most public data may not match the physiological condition of a cancer type of interest. Meanwhile, the cancer genome is highly unstable. Many alterations could arise from somatic events unrelated with biological regulation.
In this study, we focused on finding tumorigenesis regulators, including transcription factors (TF) and RNA binding proteins (RBP), through integrating a large number and variety of datasets. For efficient and accurate inference on gene regulatory rules, we developed an algorithm, RABIT (regression analysis with background integration). In each tumor sample, RABIT tests the regulatory impact of TF (or RBP), controlling for background effect from copy number alteration and DNA methylation. In each cancer type, RABIT further tests whether public repository data have reasonable correlation with tumor type specific data. Our predicted regulator impact on tumor gene expression is highly consistent with the knowledge from cancer-related gene databases and reveals many novel aspects of regulation in tumor progression.
Brief Research Profile
Peng Jiang is a postdoc at Dana Farber Cancer Institute and Harvard School of Public Health. His research focuses on building up computational methods to integrate big biological data from public domain and discover driver events in diverse cancer types. Peng received his Ph.D. in Computer Science from Princeton University in 2013.
2014-2015
Working Group Organizer: Po-Ru Loh
Tuesday, May 12, 2015
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
Jae-Hoon Sul
Sunyaev Lab at Brigham & Women’s Hospital
Large-scale genetic studies of human complex traits
In the past decade, genotyping and next-generation sequencing (NGS) technologies have generated an enormous amount of data to discover genetic variants present in human genomes and to find the genetic basis of diseases. The technologies have shifted the paradigm of genetic studies from studies that analyzed fewer than a hundred individuals at hundreds of markers to studies that analyze more than tens of thousand individuals at millions of genetic variants. With rapid decrease in sequencing costs and emphasis on genomic medicine, studies will sequence hundreds of thousands of individuals in the near future. A major challenge in these large-scale genetic studies is developing computational methods that can utilize this big data efficiently.
In this talk, I will describe my work to address this challenge. As genome-wide association studies have discovered numerous non-coding genetic variants associated with traits, there has been increasing focus on interpreting these variants using functional genomics. Expression quantitative trait loci (eQTL) studies that attempt to detect genetic variants associated with gene expression may provide clues as to which variants are functional. I will discuss a method to perform multiple testing correction accurately and rapidly in eQTL studies for identification of genes whose expression is influenced by genetic variants. As eQTL studies have grown larger in sample size, multiple testing correction using the permutation test has become a major computational bottleneck. I developed a multivariate normal sampling approach (MVN), and MVN is more than 100 times faster than the permutation test for the sample size of 2,000 while generating almost the same results. Next, I will present a novel approach to detect rare variants associated with a disease in large families. NGS enables studies to evaluate effects of rare variants on complex traits, and family-based studies have attracted great attention recently because of their higher power for rare variant testing than case-control studies. I developed a method called RareIBD that can be applied to large pedigrees, both binary and quantitative traits, and affected-only pedigrees. Using simulations, I will show my method achieves higher power than previous approaches
Tuesday, April 21, 2015
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
Rany Salem
Hirschhorn Lab at Boston Children’s Hospital
Complex Trait Genetics: GWAS, Phenotypic Heterogeneity and Future Directions
The emergence of Genome-wide association studies (GWAS) has given researchers a powerful tool to dissect the genetic architecture of many complex traits. A major challenge of performing a GWAS is phenotype selection, and dealing with phenotypic heterogeneity. The first part of the talk will focus on results of the first large scale GWAS of diabetic kidney disease and issues related to identifying and dealing with phenotypic heterogeneity. For the second part of the talk, I will present a framework to deal with the limitations of the GWAS consortium model and briefly explore statistical and methodological issues to take advantage of this opportunity. Finally, I will highlight results of a novel association between a structural variant and lipid levels.
Tuesday, January 20, 2015
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
a pizza lunch will be provided
Manuel A. Rivas
Wellcome Trust Centre for Human Genetics, University of Oxford and the Broad Institute
Exploiting Correlation of Genetic Effects in Rare Variant Association Studies
I consider the problem of assessing association between rare variants and their effects on multiple, possibly correlated phenotypes. This problem arises in population biobanks, where a large number of health related measurements are made, and sequencing association study designs of related disorders. In this talk I will present a statistical framework that employs estimates of the correlation of genetic effects to improve power to discover associations. The approach exploits knowledge about the correlation in both genetic effects across a group of genetic variants and across a group of phenotypes. Properties of the approach as implemented include that it is computationally efficient, making the analysis of large study designs practical; it is flexible and extensible, making the analysis of gene-sets, pathways, and networks feasible; and it includes standard univariate and multivariate gene-based tests as a special case, thus it can be employed in settings where estimates of the expected correlation of genetic effects are not available. I will present results from a simulation study and from the analysis of two independent rare variant association studies: 1) blood metabolite measurements (multiple quantitative traits) from subjects in the Oxford Biobank, a collection of 30-50 year old healthy men and women living in Oxford recruited from primary care who underwent a detailed examination at a screening visit; and 2) a targeted sequencing data set of six autoimmune diseases. The statistical and computational framework is implemented in software package MAMBA.
Tuesday, December 2, 2014
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
Sriram Sankararaman
Postdoc
Reich Lab at Harvard Medical School
Characterizing the structure and impact of archaic admixture in humans
Large-scale studies of human genetic variation, in conjunction with advances in ancient DNA technology, have revealed a number of mixture events between highly-diverged human populations that occurred tens of thousands of years ago (archaic admixture), e.g., between the ancestors of Neandertals and modern non-Africans. While several subsequent studies have found examples of phenotypically-relevant genetic variants introduced by archaic admixture, the biological impact of archaic admixture on human populations is not systematically understood. To understand the biological impact, we need maps of archaic local ancestry i.e., a labeling of the archaic ancestry along an individual genome.
We developed a statistical method based on the framework of Conditional Random Fields (CRF) and applied it to infer Neandertal local ancestry. The CRF provides inferences that are robust and sensitive both on simulated and real data. We applied the CRF to data from the 1000 Genomes project combined with a high-coverage Neandertal genome. The resulting map of Neandertal local ancestry reveals likely positive selection of Neandertal ancestry in specific loci and sets of genes (e.g. genes involved in keratin filament formation), associations between Neandertal-derived alleles and medical phenotypes (e.g. Crohn’s disease), as well as strong purifying selection against Neandertal alleles in part due to the effects of hybrid male sterility.
Finally, I will present ongoing work to study the impact of another archaic admixture between a population known as Denisovans and ancestors of modern Melanesian populations. I will discuss generalizations of the CRF to multi-way and de-novo admixtures as well as unphased data, efforts to study phenotypic associations of Neandertal variants and some open statistical questions ( how do we learn and apply discriminative models to population genetic problems?).
Tuesday, October 28, 2014
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
Yun Li
Assistant Professor of Genetics & Biostatistics
University North Carolina at Chapel Hill
A Bayesian method for the detection of long-range chromosomal interactions in Hi-C Data
Advances in chromosome conformation capture and next-generation sequencing technologies are enabling genome-wide investigation of dynamic chromatin interactions. For example, Hi-C experiments generate genome-wide contact frequencies between pairs of loci by sequencing DNA segments ligated from loci in close spatial proximity. One essential task in such studies is peak calling, that is, the identification of non-random interactions between loci from the two-dimensional contact frequency matrix. Successful fulfillment of this task has many important implications including identifying long-range interactions that assist in interpreting a sizable fraction of the results from genome-wide association studies (GWAS). The task – distinguishing biologically meaningful chromatin interactions from massive numbers of random interactions – poses great challenges both statistically and computationally. Model based methods to address this challenge are still lacking. In particular, no statistical model exists that takes the underlying dependency structure into consideration. We propose a hidden Markov random field (HMRF) based Bayesian method to rigorously model interaction probabilities in the two-dimensional space based on the contact frequency matrix. By borrowing information from neighboring loci pairs, our method demonstrates superior reproducibility and statistical power in both simulations and real data.
Tuesday, October 7, 2014
12:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
Elinor Karlsson
Postdoctoral Fellow and soon-to-be Assistant Professor
Broad Institute
Harvard University FAS Center for Systems Biology
University of Massachusetts Medical School
Natural Selection and Cholera Resistance in Bangladesh
Infectious pathogens are among the strongest selective forces driving recent human evolution, as migration and cultural changes over the last 100,000 years exposed populations to dangerous new diseases. One such pathogen is the causal agent of cholera, the bacterium Vibrio cholerae, which is endemic in the Ganges River Delta. By combining signals of association and natural selection, we have identified genes and pathways implicated in cholera susceptibility in Bangladesh, and developed a model of the innate immune signaling pathways that respond to V. cholerae infection. In this model, inflammasome activation and NF-kB signaling play an integrated role in TLR4-mediated sensing of V. cholerae – which is consistent with our in vitro data. Our approach is broadly applicable to other historically prevalent infectious diseases, such as Lassa fever, tuberculosis, leishmaniasis, dengue fever and malaria, and to complex, common diseases, such as inflammatory bowel disease, for which the associated genes may have been historically selected. Thus, ancient natural selection can give new insights into the function – and dysfunction – of human biology, with important implications for medical genomics.
2013-2014
Working Group Co-organizers: Bjarni Vilhjalmsson & Sasha Gusev
Tuesday, April 22, 2014
12:30-2:30, Building 2-Room 426
Kaitlin Samocha
Analytic and Translational Genetics Unit | MGH
Identification of a Set of Highly Constrained Genes from Exome Sequencing Data
A major challenge of medical genetics is to determine which variants, if any, contribute to disease in a patient. While variants can be prioritized based on their predicted deleteriousness, information about the disrupted gene can also be used to highlight those variants that are more likely to contribute to disease. For example, damaging variants in genes expressed in the relevant tissue might be prioritized over variants in genes that are not expressed in the tissue. Another potential way to prioritize variants is by the evolutionary constraint of the gene.
We developed a sequence context based model of de novo variation to create per-gene probabilities of synonymous, missense, and loss-of-function mutations. We noticed a high correlation (0.94) between the probability of a synonymous mutation in a gene and the number of rare, synonymous variants identified in that same gene using the NHLBI’s Exome Sequencing Project data (evs.gs.washington.edu). We predicted the number of variants that we would expect to see in the dataset and – in order to quantify deviations from those expected values – created a Z score of the chi-squared difference between the observed and expected variation. While the distribution of these Z scores for the synonymous variants was normal, there is a marked shift in the missense distribution towards having fewer variants than predicted.
We identified a list of excessively constrained genes representing roughly 5% of all genes. This set of genes identified as excessively constrained showed enrichment for entries in the Online Mendelian Inheritance in Man (OMIM) database and, in particular, for those with a dominant inheritance pattern. Using published data, we found that de novo loss-of-function variants identified in patients with autism and intellectual disability were in a constrained gene more often than expected (p < 0.0001 for both). This trend did not hold for those genes with a de novo loss-of-function variant in a control (p = 0.66), indicating that this approach can effectively prioritize genes in which mutations can strongly predispose to disease.
Tuesday, February 25, 2014
12:30-2:30, Building 2-Room 426
Stephan Ripke
Analytic and Translational Genetics Unit | MGH
Reality at Last: Psychiatric Genomics Consortium Quadruples Schizophrenia GWAS Sample Size
Genome wide association studies have been a principal study design in human genetics for almost ten years now. They have been performed on almost all known heritable human traits and diseases with the hope of shedding light on the biology of so called complex genetic diseases like diabetes mellitus, hypertension and hyperlipidemia. The psychiatric field, which has traditionally lacked biologic instruments for medical research was especially enthusiastic about this new technology. Still, early analyses did not bear fruit causing many to question or abandon this approach.
Genome wide association studies are now clearly one of the most successful genetic analysis methods in medical research. Results from these studies have led to many known and possibly novel drug targets and have deciphered many new biological pathways. Interestingly, psychiatric phenotypes are in no way inferior to somatic phenotypes. Yet the path to this success was more arduous than expected.
In this talk I will present lessons learned during the last ten years of GWAS studies. I will give an overview over this journey, highlighting early successes from somatic diseases (e.g. Crohn’s disease) and later but no less striking success from psychiatric diseases, schizophrenia in particular.
I will then attempt to translate these experiences into expectations and recommendations for future work in the field of genetics and in the field of psychiatric genetics specifically.
Tuesday, November 19, 2013
12:30-2:30, Building 2-Room 426
Gosia Trynka
The Raychaudhuri Lab | BWH | HMS
Analyses assessing enrichment of GWAS variants for non-coding annotations in the genome are upwardly biased
Interpreting the function of variants associated to complex traits is challenging, particularly because only a small proportion of the variants maps to the gene coding regions. The activity of the genomic regions can be inferred from whole genome assays for histone modifications, chromatin accessibility or specific transcription factors. Now such maps are becoming available for hundreds of cell-types and tissues.
A test for enrichment within chromatin annotations has become a widely applied approach to infer functions of trait-associated variants. However, currently there are no standards to correctly carry out such analysis. Typically, enrichment analysis takes associated variants and quantifies the overlap with chromatin annotations. The observed enrichment is then compared to the null distribution of SNPs sampled from the whole genome based on different parameters. Most frequently the null is defined by random sets of SNPs or matched for proximity to TSS and minor allele frequency. Different definitions of null distribution highly influence the significance of observed results.
We tested different sets of matching parameters and defined those that are essential to sufficiently control for type 1 error. With simulations we show that the currently employed matching methods are strongly biased towards false positive results. Additionally, we develop an independent method that does not rely on matching and instead takes into account local complex correlation of genomic annotations present at the associated loci.
Tuesday, November 5, 2013
12:30-2:30, Building 2-Room 426
David Golan
Ph.D candidate Rosset lab Tel-Aviv University
Accurate Estimation of Heritability in Case-Control GWAS
Linear mixed effects models have recently gained popularity as the method of choice for estimating heritability from GWAS data, i.e. quantifying how much of the variability of a phenotype can be explained by the genotyped SNPs.
However, most of the interesting diseases and disorders studied are rare (typically affecting <1% of the population), and so the proportion of cases in a study is usually considerably higher than the proportion of cases in the population. This over-representation of cases invalidates several key assumptions of linear mixed models, e.g. the normality and independence of the random effects. Ignoring these problems results in shrunken estimates of heritability. We propose an alternative approach for estimating heritability, related to the well known method of Haseman and Elston. We derive the relationship between the genetic similarity and the phenotypic similarity of any two individuals as a function of the heritability, while explicitly conditioning on the fact that both individuals were selected for the study. Our method then entials regressing the pairwise phenotypic similarities on the pairwise genetic similarities and using the slope to obtain an estimate of the heritability. We show, using simulations, that our method yields unbiased estimates which are considerably more accurate than the current state-of-the-art methodology. Applying our method to several well-studied GWAS yields heritability estimates which are considerably higher than previously published results.
We carry out extensive simulations under a wide spectrum of genetic models and probability distributions of the multivariate phenotype vector to evaluate the powers of our test procedures. We apply the proposed population-based method to analyze a multivariate phenotype comprising homocysteine levels, Vitamin B12 levels and affection status in a study on Coronary Artery Disease and the family-based method to analyze a vector of four endophenotypes associated with alcoholism: the maximum number of drinks in a 24 hour period, Beta 2 EEG Waves, externalizing symptoms and the COGA diagnosis trait in the Collaborative Study on the Genetics of Alcoholism (COGA) project.
Tuesday, October 22, 2013
12:30-2:30, Building 2-Room 426
Saurabh Ghosh
Professor
Human Genetics Unit
Indian Statistical Institute, Kolkata
Integrating Multiple Phenotypes For Association Mapping
Most clinical end-point traits are governed by a set of quantitative and qualitative precursors and a single precursor is unlikely to explain the variation in the end-point trait completely. Thus, it may be a prudent strategy to analyze a multivariate phenotype vector possibly comprising both quantitative as well as qualitative precursors for association mapping of a clinical end-point trait. The major statistical challenge in the analyses of multivariate phenotypes lies in the modelling of the vector of phenotypes, particularly in the presence of both quantitative and binary traits in the multivariate phenotype vector.
For population-based data, we propose a novel Binomial regression approach that models the likelihood of the number of minor alleles at a SNP conditional on the vector of multivariate phenotype using a logistic link function. For family-based data comprising informative trios, we propose a logistic regression method that models the transmission probability of a marker allele from a heterozygous parent conditioned on the multivariate phenotype vector and the allele transmitted by the other parent. In both the approaches, the test for association is based on all the regression coefficients.
We carry out extensive simulations under a wide spectrum of genetic models and probability distributions of the multivariate phenotype vector to evaluate the powers of our test procedures. We apply the proposed population-based method to analyze a multivariate phenotype comprising homocysteine levels, Vitamin B12 levels and affection status in a study on Coronary Artery Disease and the family-based method to analyze a vector of four endophenotypes associated with alcoholism: the maximum number of drinks in a 24 hour period, Beta 2 EEG Waves, externalizing symptoms and the COGA diagnosis trait in the Collaborative Study on the Genetics of Alcoholism (COGA) project.
Tuesday, September 24, 2013
12:30-2:30, Building 2-Room 426
Ron Do
Research Fellow in Genetics
Harvard Medical School
Human Genetics Approaches to Understand if Triglyceride-Rich Lipoproteins Cause Coronary Artery Disease
Plasma triglycerides are transported in specific lipoproteins; in observational epidemiologic studies, increased triglyceride levels correlate with higher risk for coronary artery disease (CAD). However, it is unclear whether this association reflects causal processes. Genetic information can help assess causality, and can be useful in dissecting the influences of correlated measures such as triglycerides, low-density lipoprotein cholesterol (LDL-C), and high-density lipoprotein cholesterol (HDL-C). Here, we present results from two studies that help us understand if triglyceride-rich lipoproteins cause coronary artery disease. In the first study, we used 185 common polymorphisms recently mapped for plasma lipid traits (P < 5 x 10-8 for each) to examine the role of triglycerides on risk for CAD. First, we highlight loci associated with both LDL-C and triglycerides, and show that the direction and magnitude of both are factors in determining risk for CAD. Second, we consider loci with a strong magnitude of association with triglycerides but a minimal one with LDL-C, and show that these loci are also associated with CAD. Finally, in a model accounting for effects on LDL-C and/or HDL-C, a polymorphism’s strength of effect on triglycerides is correlated with the magnitude of its effect on CAD risk. In the second study, we sequenced the exomes of 1,027 individuals with myocardial infarction (MI) at an early age, compared their sequences with 946 older individuals without MI, and validated initial associations using statistical imputation, genotyping, and re-sequencing. In follow-up sequencing of > 10,000 individuals, we observed rare non-synonymous mutations in the APOA5 gene were more frequent in early-onset MI patients versus MI-free controls at an exome-wide significance threshold (APOA5, 1.4% versus 0.6%, odds ratio = 2.2, P =5 x 10-7). Carriers had higher plasma triglyceride concentrations compared to non-carriers (median in carriers was 167 mg/dl versus 104 mg/dl for non-carriers, P=0.007). These results suggest that triglyceride-rich lipoproteins may causally influence risk for CAD and that novel therapeutic approaches targeted to TG-rich lipoproteins might be expected to reduce risk of CAD.
2012-2013
Working Group Organizer: Hugues Aschard
Postdoc Committee: Bjarni Vilhjalmsson, Jin Zhou, Megha Padi
Tuesday, April 2, 2013
12:30-2:30, Building 2, Room 426
Nikolaos A. Patsopoulos, MD, PhD
Instructor in Neurology
Brigham & Women’s Hospital
Harvard Medical School
The genetics of common variation in Multiple Sclerosis
Multiple Sclerosis is a neurodegenerative disease with strong evidence of genetic predisposition. In this PQG I’ll present current and on-going efforts of the International Multiple Sclerosis Genetics Consortium (IMSGC) to identify and fine-map genetic effects in large-scale data sets. The focus will be in design, addressing technical issues, e.g. complicated sample structure, and application of methods, e.g. control for population stratification.
Tuesday, March 5, 2013
12:30-2:30, Building 2, Room 426
Tune H. Pers
Research Fellow in Pediatrics
Children’s Hospital Boston
Harvard Medical School
Predicting pathways and selecting likely causal genes from a height genome-wide association study by integrative network-based analysis
In genome-wide association (GWA) studies many loci have no single obviously causal gene, therefore the challenge for moving from association to novel biological insight is to identify which gene at each locus most likely explains the association. Another challenge is to identify whether associated loci coalesce onto biological pathways. A key first step is to use computational approaches to prioritize genes within loci as most likely to be biologically relevant, and to assess whether genes in associated loci enrich for particular gene sets. Our approach is based on expression patterns derived from a heterogenous panel of 80,000 expression arrays that predict functions for genes. The predicted functions include predicted protein-protein interactions, predicted phenotypic consequences, predicted Reactome pathway members, predicted KEGG pathway members, and predicted Gene Ontology term membership. We subsequently used these predicted functions to systematically identify the most likely causal gene(s) at a given locus. We evaluate our method using the GIANT consortium’s GWA meta-analyses for human height. For human height GWA loci, our integrative approach performs consistently better in predicting likely causal genes than approaches based on one or two data types only. As unbiased benchmarks, we tested for enrichment of genes that recently have been shown to be differentially expressed in rodent growth plates, and for genes that have associated missense variants. Based on these and other benchmarks, this approach outperforms methods using fewer data types. We have developed an unbiased computational approach that integrates a variety of data types and performs well in prioritizing potentially causal genes from GWA data.
Tuesday, December 4, 2012
12:30-1:30, Building 2, Room 426
Paz Polak
The Broad Institute
BWH/HMS, Sunyaev Lab
Revealing the factors that shape the regional mutation rates in melanoma
Recent advances in high-throughput sequencing technology enable us to identify cancer-specific mutations at any position in a patient’s genome, allowing dozens of studies to reveal a rich world of mutation patterns. We have been studying how the density of mutations changes along the genome in different cancer types. We used epigenetic data from several normal cell types, recently released by the ENCODE project consortium, to predict the variation in cancer-specific mutation densities along human chromosomes. For prediction, we used three different methods: Random Forest; Poisson regression with elastic net regularization, which is a compromise between lasso and ridge regression; and Generalized Additive Models (GAM), which smooth the predictors and deal with the need to linearize their relations. Overall, we were able to explain up to 80% of the variation in melanoma genomes, showing that chromatin structure is involved in establishing mutation patterns in cancer.
Tuesday, November 27, 2012
12:30-1:30, Building 2, Room 426
Sara Lindstrom
Research Scientist
Department of Epidemiology
Deep targeted sequencing of 12 breast cancer loci in 4,700 women across four different ethnicities
Genome-wide association studies (GWAS) have identified multiple genetic loci associated with breast cancer risk. However, the underlying genetic structure in these regions is not fully understood and it is likely that the index GWAS signal originates from one or more yet unidentified causal variants within the region. We used next-generation sequencing to characterize 12 GWAS-discovered breast cancer loci in a total of 2,300 breast cancer cases and 2,300 controls across four ethnic populations. Region intervals spanned between 46kb and 973kb. In total we hybrid-captured and sequenced 5.5Mb. On average, we were able to capture 82% of the non-repetitive sequence in the targeted regions, and the average fraction of captured bases sequenced with a depth >20x was more than 95%. After single nucleotide variant (SNV) calling and quality control, a total of 138,898 SNVs remained for analysis. 87.4% of those had a minor allele frequency less than 0.5% and 50.2% were private mutations. Across all regions, a total of 81 SNVs showed evidence for association with breast cancer (P<0.001). However, we did not find unequivocal association evidence supporting a causal role for any individual SNV. These results illustrate the challenges facing post-GWAS fine-mapping studies.
Tuesday, October 2, 2012
12:30-2:00, Building 2, Room 426
Hugues Aschard
Postdoc of Peter Kraft
Exploring the effect of gene-gene and gene-environment interactions in breast-cancer risk prediction
Genome-wide association scans have identified scores of common genetic variants associated with the risk of complex diseases in the last years. However their aggregate effects on risk beyond traditional factors remain uncertain. Recent papers reported that the addition of information on these genetic variants to existing risk models for common diseases can moderately improve discrimination and prediction in estimating that risk. Here we explore the extent to which the consideration of interaction between genetic variants (GxG) and interaction between genetic variants and established clinical factors (GxE) can add to these risk-assessment models. Using data from the Nurses’ Health Study, we derive the predictive ability of 15 independent published risk alleles in 1145 breast cancer cases and 1142 matched controls. We evaluate the extent to which single nucleotide polymorphisms (SNPs) can discriminate breast cancer cases from controls in the whole sample and in strata defined by non-genetic risk factors. We conduct then a simulation study to explore the potential improvement of discrimination if complex GxG and GxE interactions exist and we know them.
2011-2012 Working Groups
Working Group Organizers: Elizabeth Schifano & Monica Ter-Minassian
Tuesday, March 27, 2012
12:30-2:00, Building 2-Room 426
Aedin Culhane, Ph.D.
Research Scientist
Department of Biostatistic
Dana-Farber Cancer Institute
New biclustering approaches for pan-cancer analysis of a large compendium of genomics data
Though it is widely recognized that molecular traits such as p53 mutation and deregulation of the cell cycle span many types of cancer. Conventionally classification of cancer is by anatomical site and studies of cancer generally analyze the disease within these boundaries. There is a growing recognition that histological and molecular heterogeneity cross anatomical boundaries, for example clear cell ovarian cancer is closer to clear cell renal cancer than it is to other ovarian cancers. New pan-cancer analysis approaches are required to discover the molecular taxonomy of cancer upon which personalized cancer medicine can be built. I will describe a new biclustering approach that we have developed to enable analysis of large compendium of cancer genomics data and the pan-cancer molecular traits we have discovered.
Tuesday, February 28, 2012
12:30-2:00, Building 2-Room 426
Peggy Lai, M.D.
Post-doctoral research fellow in Environmental Health
(Mentors: David Christiani and Winston Hide)
Using a gene expression signature to understand endotoxin related chronic obstructive lung disease
A quarter of patients with chronic obstructive pulmonary disease (COPD) in the United States have never smoked. Endotoxin is a common and poorly recognized environmental exposure that may cause non-tobacco related COPD. In recent studies, chronic inhalational endotoxin has been associated with the development of obstructive lung disease in both murine models and human epidemiologic studies of occupationally exposed populations. Despite this, endotoxin related lung disease has not been well studied. Three distinct murine models of chronic endotoxin exposure have been developed in different laboratories with use of microarrays to characterize global gene expression. We identified a common 101 gene signature for recurrent endotoxin exposure that is both biologically interesting at the gene level, and at the pathway level demonstrated increasing numbers of inflammation related genes with longer periods of endotoxin exposure; this is surprising in light of the well-known phenomenon of endotoxin tolerance, suggesting that dysregulated inflammation may play a role in endotoxin-related lung disease. The focus of this talk is on the application of biostatical and bioinformatics methods to a clinically relevant but poorly understood disease.
Tuesday, January 31, 2012
12:30-2:00, Building 2-Room 426
Lin Li, Ph.D.
Postdoctoral Research Fellow
Department of Biostatistics
Harvard School of Public Health
Incorporating gene expression information in SNP set association testing
Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are likely to be expression quantitative trait loci (eQTLs). It is of interest to utilize eQTL information for understanding the genetic basis underlying complex traits. On the other hand, SNP set association testing, especially region-based, can harness information of SNPs in linkage disequilibrium and lead to increased power. We are interested in incorporating gene expression information in SNP set association testing. Specifically, we employ kernel machine tests using weighted kernels, where the predefined weights are based on eQTL signals. We also consider an adaptive test that is robustly powerful regardless of whether the expression signals are relevant or not to the trait of interest. As an application, we analyze an asthma study data using our proposed methods.
(Joint work with Prof. Xihong Lin and Prof. Liming Liang)
Tuesday, December 13, 2011
12:30-2:00, Building 2-Room 426
Chen-yu Liu
Research Fellow, Department of Environmental Health
Harvard School of Public Health
A pilot study of residential petrochemical exposures and DNA methylation changes for leukemia risk
We conducted a pilot study to assess the associations of genome-wide methylation pattern changes in peripheral blood and residential petrochemical exposure on 30 childhood acute lymphoblastic leukemia (ALL) cases and controls, as an exploratory extension of a population-based case-control study [1-3]. Higher concentrations of selected PAHs and VOCs in the vicinity of petrochemical industries in the study area have been reported, compared to those in industrialized communities of the U.S. [4, 5]. Hypermethylation /hypomethylation of specific genes involved in carcinogens metabolisms and exposure to several carcinogens of PAH and VOC may increase the childhood ALL risk. We used geographic information system tools to estimate individual-level exposure by accounting for subjects’ mobility, length of stay at each residence, distance to petrochemical plant(s), and monthly prevailing wind direction. DNA methylation profiles were obtained by using the Illumina Infinium HumanMethylation27 BeadArray. Differential methylation levels between ALL cases and controls were found in genes involved in inflammatory response, lymphocyte differentiation/ activation, and apoptosis. Similar methylation pattern changes were observed when comparing petrochemical exposure correlated effects. Seventy prevent genes were overlapping compared the petrochemical exposure correlated changes and leukemia associated changes. The results do not remain significant after adjusting for multiple comparisons.
Reference:
- Liu, C.Y., et al., Maternal and offspring genetic variants of AKR1C3 and the risk of childhood leukemia. Carcinogenesis, 2008. 29(5): p. 984-90.
- Liu, C.Y., et al., Cured meat, vegetables, and bean-curd foods in relation to childhood acute leukemia risk: A population based case-control study. BMC Cancer, 2009. 9(1): p. 15.
- Yu, C.L., et al., Residential exposure to petrochemicals and the risk of leukemia: using geographic information system tools to estimate individual-level residential exposure. Am J Epidemiol, 2006. 164(3): p. 200-7.
- Lee, C., The characteristics of ambient volatile organic compounds and their carcinogenesis effects in the vicinity of one petrochemical industry in Taiwan. (In Chinese). Taipei, Taiwan, Republic of China: National Science Council. 1995.
- Lee, C. and G. Tsai, The characteristics of ambient polycyclic aromatic hydrocarbons and their carcinogenesis effects in the vicinity of one petrochemical industry in Taiwan. (In Chinese).Taipei, Taiwan, Republic of China: National Science Council,. 1994.
Tuesday, November 22, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided
Cristian Tomasetti
Postdoc with Giovanni Parmigiani
Dana-Farber Cancer Institute and Harvard School of Public Health
Mathematical Modeling of Random Genetic Mutations for a General Class of Tumor Growth Curves with Applications
Various attempts to study random genetic mutations have been made
in the mathematical modeling literature. An important goal has been the
estimation of the probability for a specic mutation to be present in a tumor
of a given size and the related estimate for the expected number of mutants,
if mutants are indeed present. Previous mathematical models addressing
these questions have shown however two potential limits: the tumor was
assumed to be homogeneous and, on average, growing exponentially. In this
talk a few recent results will be presented where the heterogeneity of the
tumor cell population is taken into account, and the exponential growth
of cancer has been replaced by other, arguably more realistic types of tu-
mor growth. Various application of this mathematical framework to chronic
myeloid leukemia and gastrointestinal stromal tumor will be introduced.
Tuesday, October 25, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided
Noah Zaitlen
Postdoc with Alkes Price
Components of Heritability in An Icelandic Cohort
The combined genotypes and genealogy of 38,167 Icelanders provides the
opportunity to examine the contribution of different components of
genetic variation to the heritability of complex phenotypes. In this
work we focus on both parent-of-origin effects, and the contribution
of typed versus untyped variants. Genetic variation can alter
phenotype in a parent-of-origin specific manner, for example, via
imprinting. Kong et al [1] identified three Type 2 Diabetes (T2D)
variants including rs2334499, which is protective for T2D when
inherited maternally, confers risk when inherited paternally, and lies
in an imprinted region of the genome. The genealogy provides a means
of resolving not only highly accurate phasing, but also the
parent-of-origin of each typed variant. Using this information we
develop methods to examine the total contribution of parent-of origin
effects, as well as the differences between paternal and maternal
contributions to the heritability of complex phenotypes. We find that
variation of height shows little evidence of contribution from
parent-of-origin effects, while T2D shows highly significant evidence
(p-value < 1.9×10<sup>-5</sup>), suggesting that many more variants
such as rs2334499 remain to be found. In addition to parent-of-origin
effects, the unique long range phasing the deCode data provide a means
of efficiently estimating the fraction of the genome shared identical
by descent (IBD) as well as identical by state (IBS). As has been
recently demonstrated [2], IBS sharing can be used to estimate the
contribution to phenotype of genotyped SNPs and SNPs in linkage
disequilibrium (LD) to genotyped SNPs. IBD sharing estimates the total
narrow sense heritability of the phenotype, that is, the additive
contribution of all SNPs. Since both are readily available in these
data we are able to compare the IBD based estimates to the IBS based
estimates and show that for all phenotypes examined the majority of
heritability is well captured by the genotyped SNPs. We conclude with
a discussion of confounding effects in mixed model estimates of
heritability such as those provided by GCTA, showing that population
structure can lead to biased estimates of heritability even when
corrected for by principal component adjustments. 1 Kong et al.
Parental origin of sequence variants associated with complex diseases.
Nature 2009. 2 Yang el al. Genome partitioning of genetic variation
for complex traits using common SNPs. Nat Genet 2011.
Tuesday, September 27, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided
Kimberly Glass
Postdoc with GC Yuan & John Quackenbush
Passing Messages Between Data Types to Refine Predicted Network Interactions
Gene regulatory network reconstruction is a fundamental question in computational biology. Although it is recognized that there are significant limitations when using individual datasets to reconstruct a regulatory network, it remains a major challenge to integrate multiple data sources effectively. We propose a message-passing model that utilizes multiple sources of regulatory information to predict regulatory relationships. A major advantage of our method is that the local connections are iteratively refined by exchanging information within each gene’s functional neighborhood, thus increasing both numerical efficiency and biological accuracy. Using the yeast as a model system, we are able to demonstrate that this integrative method is able to more accurately reconstruct the regulatory network than other widely-accepted network reconstruction algorithms. In addition, our method is amenable to the future inclusion of data reflecting many different types of regulatory mechanisms, including protein complexes and protein-protein interactions, the binding of transcription factors to promoters, and epigenetic marks residing in the control regions of genes.
2010-2011 Working Groups
Tuesday, May 3, 2011
12:30-2:00, Building 2-Room 426
Sarah Fortune, MD
Assistant Professor of Immunology and Infectious Diseases
Department of Immunology and Infectious Diseases
Time for change: Using whole genome sequencing data to define the mutation rate of Mycobacterium tuberculosis during the course of infection
Mycobacterium tuberculosis (Mtb) poses a global health catastrophe that has been compounded by the emergence of highly drug resistant Mtb strains. In Mtb, all drug resistances are the result of chromosomal mutations and depend on the bacterium’s capacity for mutation during the course of infection. We have used whole genome sequencing of bacterial isolates to derive estimates of the mutation rate of Mtb in the infected host in order to better understand the occurrence of drug resistance. Our data suggest that Mtb acquires mutations at roughly the same rate over time, irrespective of the organism’s replicative state. However, the mutation spectrum in Mtb strains from animals with active and latent disease reflects different mutational pressures in different disease states. From these data, we are developing a mutational clock for Mtb in the infected host, which we are validating through WGS of human isolates from defined transmission chains.
Tuesday, April 5, 2011
12:30-2:00, Building 2-Room 426
Melissa Merritt, Ph.D.
Postdoctoral Research Fellow
Department of Epidemiology, Harvard School of Public Health
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
Cell-of-origin effect on tubo-ovarian tumor phenotype
The examination of gynecologic tissues with putative precursor lesions for ovarian tumors in clinical samples has identified a variety of suggested precursors for ovarian cancer. These observational studies have guided us to design experiments to address whether cells-of-origin influence the phenotype of tubo-ovarian cancer. To do this, we established paired normal human ovarian and fallopian tube epithelial cells in culture from two donors without cancer, and induced carcinogenic transformation of cells by the sequential introduction of hTERT, SV40 and H-Ras. The gene expression profiles of the normal and transformed cells were examined using HG-U133 Plus 2 microarrays (Affymetrix), which revealed a ‘cell-of-origin’ gene signature that distinguished ovarian (OV) vs fallopian tube (FT) epithelial cells from the same patient. Among the most significant genes we selected a subset that were over-expressed in either OV or FT origin cultured cells and used these to identify two subpopulations (OV-like versus FT-like) among human ovarian tumors in two publically available gene expression datasets. In this analysis FT-like tumors were predominantly composed of serous high grade cancers and were associated with significantly shorter disease-free survival. In contrast, OV-like tumors were associated with a better prognosis. We have previously demonstrated that cell-of-origin plays a role in determining the phenotype of breast tumors. These results provide further support for the hypothesis that cell-of-origin strongly influences tumor phenotype. We suggest that the most aggressive subtype of tubo-ovarian cancers may originate in the fallopian tube. The results from these studies may allow a cell-of-origin based improved molecular classification of ovarian cancers and a better understanding of the pathogenesis of mullerian tumors.
Tuesday, March 1, 2011
12:30-2:00, Building 2-Room 426
Monica Ter-Minassian, Sc.D.
Post doctoral Fellow at HSPH (Environmental Health Dept.) and DFCI (Medical Oncology Dept.)
Genetic variability in tobacco-specific nitrosamine NNK to NNAL metabolism
The nitrosamine NNK, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone, is known to be one of the most potent tobacco carcinogens, particularly for lung adenocarcinoma. Recently, NNK urinary metabolites, total NNAL, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol and its glucuronides, have been shown to be good predictors of lung cancer risk, years prior to diagnosis. We sought to determine if several genetic polymorphisms significantly contributed to the wide range of inter-individual variability observed in total NNAL output. The study subjects were derived from the Harvard/ Massachusetts General Hospital Lung cancer case-control study. We analyzed 87 self-described smokers (35 lung cancer cases and 52 controls), with urine samples collected at time of diagnosis (1992-1996). We tested 77 tagging SNPs in 16 genes related to the metabolism of NNK to total NNAL derived from prior GWAS genotyping on these subjects. Using a weighted case status least squares regression, we tested for the association of each SNP with square-root (sqrt) transformed total NNAL (pmol per mg creatinine), controlling for age, sex, sqrt packyears and sqrt nicotine (ng per mg creatinine). Three HSD11B1 SNPs and AKR1C4 rs7083869 were significantly associated with decreasing total NNAL levels. HSD11B1 and AKR1C4 enzymes are carbonyl reductases directly involved in the single step reduction of NNK to NNAL. The HSD11B1 SNPs may be correlated with the functional variant rs13306401 and the AKR1C4 SNP is correlated with the enzyme activity reducing variant rs17134592, L311V.
Nicola Segata, Ph.D.
Post Doctoral Fellow
Department of Biostatistics , HSPH
Metagenomic Biomarker Discovery and the Human Microbiome
Metagenomics has provided a new avenue for biomarker discovery, since changes in the composition and functional activity of microbial communities can provide insight into the ecological differences among communities or provide diagnostic or prognostic power when applied to the human microbiome. We propose the LDA Effect Size (LEfSe) algorithm to discover and explain microbial and functional biomarkers in the human microbiota and other microbiomes. LEfSe determines the features (organisms, clades, OTUs, genes, or functions) most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and biological relevance. We demonstrate this method to be effective in mining human microbiomes for metagenomic biomarkers associated with mucosal tissues and with different levels of aerobic metabolism. Similarly, when applied to 16S rRNA gene data describing a murine ulcerative colitis gut community, LEfSe confirms the key role played by Bifidobacterium and suggests the involvement of additional clades including Metascardovia. Finally, we provide characterizations of microbial functional activity from metagenomic community sequencing, comparing environmental bacterial and viral microbiomes. A comparison of LEfSe with existing microbial biomarker discovery methods and with standard statistical approaches (including an evaluation incorporating synthetic data) highlights a lower false discovery rate, consistent ranking of biomarkers relevance, and concise representations of taxonomic and functional shifts in microbial communities associated with environmental conditions or disease phenotypes.
Tuesday, February 1, 2011
12:30-2:00, Building 2-Room 426
Gabriel Altshuler
Post Doctoral Associate, HSPH
Pathway Fingerprinting; A Functional Framework for Multi-platform Integration
No objective, broadly applicable approach exists to compare molecular profiles, yet public repositories of gene expression arrays, RNASeq, GWAS, CAGE, epigenetic marks and ChIPseq data represent major resources for discovery. There is a pressing need for an objective, biologically-interpretable, functional abstraction of the comprehensive sample space of cellular genome, transcriptome and regulome data to compare these experiments within a consistent global framework. Pathway profiling significantly out-performs gene-based comparisons for cross-platform analysis by exploiting the fact that molecular profiles reflect activity of genes acting in concert across pathways. We have developed the ‘pathway fingerprint’, a mapping of any molecular profile to a fixed set of known pathways. Each fingerprint generated is standardized relative to the full public data corpus so they are directly comparable to any other, ensuring broad applicability. The method has been successfully piloted for the analysis of expression data. We performed a pathway fingerprint meta-analysis to establish and verify the stem cell pluripotency pathway signature. This was used to classify cell types and successfully identify additional pluripotent arrays from the GEO expression database. Moving away from small-scale comparisons and towards a literature-wide study of pluripotency has offered a means to resolve an ongoing debate over the pluripotent potential of testis-derived stem cells. We are now in the process of expanding the utility of the pathway fingerprint to GWAS, RNASeq, and eipgenetic data to provide a unified method for functional integration within and between these experimental platforms.
Levi Waldron
Research Fellow, HSPH & DFCI
Whole-genome expression profiling of degraded tumor RNA from the NHS/HPFS colorectal cancer cohorts
Over 20 million formalin-fixed, paraffin-embedded (FFPE) tissue blocks per year are routinely stored for cancer patients in the United States, and in particular, are available for cancer patients from long-term epidemiological studies including the Nurses Health Study and Health Professionals Follow-up Study (NHS/HPFS). Traditional microarray technology requires high-quality RNA from fresh-frozen tissues, and are not appropriate for gene expression profiling of these tissues. The recently introduced cDNA-mediated Annealing, Selection, extension, and Ligation (DASL) microarray assay by Illumina enables whole-genome expression profiling from degraded, clinical FFPE tissues. We present a case study involving more than 1,000 patients from the NHS/HPFS colorectal cancer cohort, demonstrating an end-to-end pipeline for quality control, normalization, and application of the data to tumor subtyping and differential gene expression. We conclude that this technology differs in important ways from the traditional microarray, discuss some of the challenges and opportunities associated with its adoption, and present initial results for the subtyping of colorectal cancer.
Tuesday, December 7, 2010
12:30-2:00, Building 2-Room 426
Elaine Hoffman, Ph.D.
Research Scientist
Department of Biostatistics
Harvard School of Public Health
Applications of Path Analysis & Structural Equation Models in Environmental Health Studies
Path analysis and structural equation models will be presented in the context of environmental health. The Bangladesh-Arsenic project, Italian-Manganese project, and some of my other projects will be used as examples. The statistical application of path analysis and structural equation models and some of the obstacles I have encountered also will be discussed. Path analysis is a statistical model that can account for complex relationships among variables and correlated variables. It is often used when there are suspected causal relationships. Structural equation models are a class of covariance structure models that simultaneously model multiple surrogates of both exposure and outcome. Both path and structural equation models are often shown as path diagrams. These models have been extensively used in the social sciences, and more recently are beginning to be used in environmental epidemiology. This presentation is a Work-in-Progress, with many of the statistical ideas discussed not having been published or completed yet. Software packages will be discussed.
Tuesday, November 2, 2010
12:30-2:00, Building 2-Room 426
Benjamin Haibe-Kains, Ph.D.
Research Fellow
Department of Biostatistics & Computational Biology
Dana-Farber Cancer Institute
Breast Cancer Molecular Subtypes: A Three-gene Model for a Translation Into Clinic
Background: Gene expression studies have well established that breast cancer (BC), in addition to being clinically diverse, is also a molecular heterogeneous disease. The early studies classified BC into at least three clinically relevant molecular subtypes: basal-like, HER2-enriched, and luminal tumors, with each subtype exhibiting different prognosis and response to therapies. Demonstration of the molecular heterogeneity within BC has changed the way clinicians perceive the disease and has a dramatic impact on the design of new clinical trials.
During the last decade, several methods have been proposed to classify breast cancer into their corresponding molecular subtype using gene expression. Three versions of a “Single Sample Predicton” (SSP), based on a hierarchical clustering and a nearest centroid classifier using different sets of “intrinsic” genes, have been proposed; however this method of classification has been shown to be highly unreliable by many investigators. We have previously reported a “Subtype Clustering Model” (SCM), which uses a mixture of three Gaussians with ER, HER2 and proliferation-related gene modules, to estimate probabilities of belonging to each BC molecular subtype. This model has improved classification stability; however such a method is not so easily clinically implemented.
In this work we developed a novel SCM-based 3-gene classifier for molecular subtyping and we evaluated its concordance, robustness and prognostic value compared with five existing classifiers.
Materials and Methods: We refined the SCM to its simplest form, a classification model that uses only the 3 genes reported to be the main discriminators of the molecular subtypes: ER, HER2 and AURKA (proliferation). We evaluated the concordance and robustness of five previously described classifiers- three SSPs and two SCMs- and the new 3-gene SCM, using gene expression and clinical data from a large compendium of publicly available BC datasets comprising 5,113 primary breast tumors. Clinical relevance was determined from survival analysis of a subset consisting of 1,318 untreated node-negative patients.
Results: SCM-based classifiers, including our 3-gene model, were significantly more robust than all the SSPs (prediction strength > 0.8, p < 0.001); notably, the 3-gene SCM was the most robust classifier for molecular subtypes. Although all models were concordant (Cramer’s V = 0.54-0.81, p < 0.001), with basal-like subtype being particularly well defined (median Cramer’s V of 0.8), SCMs yielded stronger concordance than SSP models. Overall, SCMs were also more consistent with traditional clinical variables (ER, HER2 status by IHC/FISH and histological grade). All classifications yielded significant and independent prognostic value.
Conclusions: We found significant disparities in robustness of BC molecular subtype classification models. SCMs outperformed SSPs and consistently identified molecular subtypes in numerous datasets derived using various microarray technologies and conducted by different laboratories. Compared with existing models, we propose that a 3-gene SCM-based model is the most reliable and its simplicity could be a significant step towards translation of BC molecular subtyping into the clinic.
Keywords: molecular subtypes, gene expression, clustering, classification, robustness, concordance, prognosis
Tuesday, October 5, 2010
12:30-2:00, Building 2-Room 426
Pinaki Sarder, Ph.D
Research Fellow
Department of Biostatistics
Harvard School of Public Health
Functional understanding of microbial communities using experimental data integration
To understand the functional and metabolic activities of microbes and microbial communities, it is critical to link genes and proteins to their biological roles. This encompasses both their biochemical activities and the processes and pathways in which they are used by the cell. This problem is typically approached by transferring knowledge to newly sequenced genomes by relying on sequence similarity. This can be a difficult process involving sparse knowledge and the propagation of error, and even the best-studied organisms’ genomes are only partially characterized. To mitigate these issues, we have developed a data integration method TafTan leveraging all experimental results available from multiple model systems to identify potential functional roles for genes in a new organism. The performance of TaFTan’s genome-wide functional network prediction was evaluated using ~300 experimental datasets from 20 model organisms. This evaluation study demonstrated that TaFTan is able to significantly improve individual organisms’ inferred functional networks by transferring knowledge from other experimentally characterized systems.
September 21, 2010
12:30-2:00, Building 2-Room 426 (Biostats Conference Room)
X. Shirley Liu, Ph.D.
Associate Professor of Biostatistics
Departments of Biostatistics and Computational Biology
Harvard School of Public Health & Dana-Farber Cancer Institute
Computational Genomics of Gene Regulation
High throughput genomics technologies brought a paradigm shift to gene regulation studies, but they also created challenges on data analysis. In this talk, I will highlight two studies conducted in my lab to show how computational and statistical algorithms could help remove the noise in the data, provide informative results, and help design efficient experiments. One study is a model-based analysis of tiling arrays for ChIP-chip peak calling, and the other is using dynamics of H3K4me2 nucleosomes to infer the in vivo transcription factors and their binding sites driving a biological process.
2009-2010 Working Groups
Tuesday, April 27, 2010
12:30-2:00 PM
KRESGE G2
Robert Wright, M.D., M.P.H.
Associate Professor
Department of Environmental Health, HSPH
Department of Pediatrics, HMS
A Framework for Measuring Gene-Environment Interactions in Children
Will discuss issues of development unique to children that will improve estimates of gene environment interaction including child-specific pitfalls to case control and family-based association designs.
Tuesday, March 23, 2010
12:30-2:00 PM
Kresge 201
Marianne Wessling-Resnick, Ph.D.
Professor of Nutritional Biochemistry
Dept of Molecular Metabolism
Director of the PhD Program in Biological Sciences in Public Health
Harvard School of Public Health
Chemical Genetics of Iron Transport
Chemical genetics is an emerging field that takes advantage of combinatorial chemical and small molecule libraries to dissect complex biological processes. Small molecules can act very fast, can be very specific, and can help to distinguish the temporal order of molecular steps and the hierarchical regulation of biological processes. Because small molecules can alter the function of a specific gene product, they can be used in a manner analogous to the use of inducible dominant or homozygous recessive genetic mutations. A large body of biochemical literature is based on the past use of small molecule antagonists that were employed in “reverse chemical genetics” approaches to conditionally eliminate protein function, and on that basis to subsequently identify the target, its mechanism of action, and its regulation. Thus, ouabain helped to define the catalytic cycle of the NaK-ATPase, cytochalasin B was instrumental in defining the molecular basis for insulin’s action to stimulate glucose uptake, and analogs of amiloride were used to purify and define the epithelial Na channel. There is a need to develop “forward chemical genetics” in order to discover small molecules that partner with key elements in a pathway of interest. Our goals are to discover small molecule inhibitors of iron transport using chemical genetics and to use these reagents to advance our understanding of the factors, mechanisms, and regulation of different pathways of iron metabolism.
Patrick Loerch
Research Fellow, Computational Biology
Merck
Using Networks to Integrate Diverse -omics Datasets and Identify Disease Pathways
The dramatic increase in the application of omics-based platforms (microarrays, GWAS, proteomics, deep sequencing, metabolomics, etc) in biological research has resulted in the generation of vast public and
private data repositories spanning a wide array of diseases, tissues and species. The challenge facing researchers today, commonly referred to as integrative genomics, is figuring out how to integrate, analyze and interpret all of this information within the context of a well-defined biological question. One biological question of particular importance is the identification of genes/proteins that contribute to, and/or are altered as a result of, the onset of a specific disease. As opposed to the gene-centric approach to integrative genomics, which looks for genes that are associated with a disease across multiple omics platforms, we have developed a pathway-centric approach. We hypothesize that the onset of disease involves the altered regulation of specific pathways, which can be triggered by any number of genes or proteins, so long as the end result is the same. Working under this hypothesis, we have developed a network-based approach to integrating various omics datasets with the aim of identifying disease pathways. This approach also allows us to take into account a number of biological realities, such as the regulation of pathway members at various states (from transcription to translation) and the fact that the regulation of some pathway members will simply not be observable through omics technologies. Here we will present the development of this methodology within the context of identifying disease pathways, discuss a specific application/validation of the method, and describe ongoing efforts to further refine this approach.
Tuesday, February 23, 2010
12:30-2:00 PM
Kresge 201
Curtis Huttenhower, Ph.D
Assistant Professor of Computational Biology and Bioinformatics
Department of Biostatistics
Harvard School of Public Health
Scalable Data Mining for Functional Genomics and Metagenomics
The average human body contains over ten times as many microbial cells as “human” cells. These microbial communities are usually beneficial, but their dysfunction has been linked to conditions ranging from obesity to antibiotic resistant infections. The recent dramatic reduction in the cost of DNA sequencing has opened up several exciting new ways in which we can explore how the human microflora vary across populations and how they can be manipulated to improve human health.
Biological network integration and mining algorithms provide a means of assembling the entire body of cultured microbial genomic data, understanding it from a systems level, and applying it to the study of uncharacterized species and communities. We compare unsupervised and supervised Bayesian approaches to biological network integration; this process provides maps of functional activity and genomewide interactomes in over 100 areas of cellular biology, using information from ~5,000 genome-scale experiments pertaining to 13 microbial species. In combination with graph alignment, these network manipulation tools provide a means for analyzing the functional activity unique to particular pathogens, transferring putative functional annotations to uncharacterized organisms, and potentially inferring interactomes using weighted network integration for metagenomic communities.
Ed Silverman, M.D., Ph.D.
Associate Professor of Medicine
Harvard Medical School
Associate Physician, Brigham and Women’s Hospital
Genetic Epidemiology of Chronic Obstructive Pulmonary Disease
Genetic factors are likely important determinants of chronic obstructive pulmonary disease (COPD) susceptibility. Genome-wide association analysis has recently been performed in COPD, and several regions of highly significant association on chromosomes 15 and 4 have been identified. Because COPD is a heterogeneous disease, genetic studies have focused on the identification of distinct subgroups of COPD subjects (COPD subtypes) as well as disease-related conditions (COPD-related phenotypes). Genetic association studies combined with chest CT scan analysis have the potential to lead to substantial new insights into COPD pathogenesis, which could provide important pathways to develop new treatments for COPD.
Tuesday, January 26, 2010
12:30-2:00 PM
Kresge 201
John Quackenbush
Professor of Computational Biology and Bioinformatics
HSPH & DFCI
Network and State Space Models: Science and Science Fiction Approaches to Cell Fate Predictions
Two trends are driving innovation and discovery in biological sciences: technologies that allow holistic surveys of genes, proteins, and metabolites and a realization that biological processes are driven by complex networks of interacting biological molecules. However, there is a gap between the gene lists emerging from genome sequencing projects and the network diagrams that are essential if we are to understand the link between genotype and phenotype. ‘Omic technologies such as DNA microarrays were once heralded as providing a window into those networks, but so far their success has been limited, in large part because the high-dimensional they produce cannot be fully constrained by the limited number of measurements and in part because the data themselves represent only a small part of the complete story. To circumvent these limitations, we have developed methods that combine ‘omic data with other sources of information in an effort to leverage, more completely, the compendium of information that we have been able to amass. Here we will present a number of approaches we have developed, including an integrated database that collects clinical, research, and public domain data and synthesizes it to drive discovery and an application of seeded Bayesian Network analysis applied to gene expression data that deduces predictive models of network response. Looking forward, we will examine more abstract state-space models that may have potential to lead us to a more general predictive, theoretical biology.
Miguel Camargo
Merck Reserach Laboratories
“Pathway based analysis of whole genome siRNA screens and de novo pathway identification”
A prevailing model of Alzheimer’s Disease etiology is the progressive aggregation of toxic Ab (Abeta) peptides in the brain. Ab is produced as a result of proteolytic processing of the amyloid precursor protein (APP). Cleavage of APP by beta- and gamma-secretases result in the production of Ab and hence several drug discovery efforts are aimed at
finding either beta- or gamma-secretase inhibitors. However, development of small molecules to either of these enzymes has proven to be challenging. In order to overcome limitations of developing b or g-secretase inhibitors, we performed large scale siRNA screens in order to identify novel regulators of APP processing that could represent more tractable drug targets. The screen measures the production of four APP proteolytic products: the non-amyloidgenic peptide (sAPPa) or the amyloidgenic pathway peptides (Ab40, Ab42, and sAPPa). We introduce a
novel analysis method that scores the overall effect of individual pathways on the processing of the APP protein. This method takes into account all genes in the pathway, thus allowing for small effects to be considered, and introduces the concept of scoring ‘pathways’ as opposed to individual genes as a way of mitigating against false positive hits.
Using this method, we identified novel and distinct pathways that regulate processing of APP into either amyloidgenic peptides or non-amyloidgeneic peptides respectively. We will also highlight how to leverage biological network data in combination with siRNA screens to identify novel pathways.
Tuesday, December 15, 2009
12:30-2:00 PM
Biostatistics Conference Room (Building 2, Room 426)
Oliver Hofmann, Ph.D.
Research Associate in Biostatistics
Combining curated and in silico interaction data in network analysis
Integrating biological data obtained from multiple high throughput platforms is an area of active research. While it is possible to normalize for technical differences, laboratory effects and other artifacts the problem of merging data from different biological samples is still mostly unsolved. Standard methods include rank-based analysis of biological features (mRNA abundance, peptide counts etc.) rather than absolute measurements as well as higher levels of abstraction such as Gene Set Enrichment Analysis or the identification of Metagenes. By comparing biological systems at the network level an additional layer of data abstraction is added, allowing for the identificaton of connected network areas contributing to a phenotype of interest in heterogenous samples or between studies. Three different examples highlight current limitations of data availability and quality, identifying possible future methods for even higher levels of data abstraction to analyze complex systems.
Stalo Karageorgi, M.S.
Doctoral Student in Environmental Health
Polymorphisms in HSD17b2 and HSD17b4 and endometrial cancer risk
Hydroxysteroid dehydogenase 17b (HSD17b) genes encode for enzymes that control the last step in estrogen biosynthesis. The isoenzymes HSD17b2 and HSD17b4 in the uterus preferentially catalyze the conversion of estradiol, the most potent and active form of estrogen, to estrone, the inactive form of estrogen. Endometrial carcinoma is a disease strongly linked to the imbalance between the hormones estrogen and progesterone. We hypothesized that variation in single nucleotide polymorphisms (SNPs) in genes HSD17b2 and HSD17b4 may alter the enzyme activity, estradiol levels and risk of disease. Pairwise tagging SNPs were selected from the HapMap CEU database to capture all known common (MAF >0.05) genetic variation in the gene region with an r2 of at least 0.8. SNPs were genotyped in participants in the nested case-control studies in the Nurses’ Health Study (NHS) (cases=544, controls=1296) and the Womens’ Health Study (WHS) (cases=130, controls=389), who provided a blood or cheek cell sample. The association between SNPs and endometrial cancer was examined using conditional logistic regression to estimate odds ratio and 95% confidence intervals adjusted for known risk factors. We additionally investigated whether SNPs are predictive of plasma estradiol and estrone levels in the NHS using linear regression. This is the first study to report on genetic variation in HSD17b2 and HSD17b4 in relation to endometrial cancer.
Tuesday, November 17, 2009
12:30-2:00 PM
Biostatistics Conference Room (Building 2, Room 426)
Zhaoxi (Michael) Wang, M.D., Ph.D.
Research Scientist, Environmental & Occupational Medicine and Epidemiology Program
“Mitochondrial Variations in NSCLC by Microarray-based Resequencing”
Mutations in human mitochondrial genome (mtDNA) genome have long been suspected to play an important role in the development of cancer. Although most cancer cells harbor mtDNA mutations, the question of whether such mutations are associated with clinical prognosis of cancer remains unclear. In this study, we resequenced the entire mitochondrial genomes of tumor tissue from a population of 249 Korean non-small cell lung cancer (NSCLC) patients using the Affymetrix GeneChips Human Mitochondrial Resequencing Array 2.0 (Santa Clara, CA). In early stage (stage I/ II) NSCLC, patients with the haplogroup D4 had the worst clinical prognosis. Interestingly, haplogroup D4 was previous reported as a marker for extreme longevity in Japanese.
Alkes Price
Assistant Professor of Statistical Genetics, HSPH
“Effects of Cis and Trans Family Heritability on Single-tissue and Cross-tissue Gene Expression Regulation”
Family heritability is a useful approach for understanding the genetic basis of gene expression. Heritability analyses can evaluate the contribution of both cis and trans regulation by considering genetic relatedness either genome-wide (trans), or at the genomic location proximal to the expressed gene (cis). We used gene expression data from blood and adipose tissue cohorts to estimate the contribution of cis and trans regulation to heritable variation in gene expression. We estimate that cis regulation contributes 45±7% of heritability in blood expression and 30±3% of heritability in adipose tissue expression, with the difference entirely attributable to greater trans effects in adipose tissue. We also conducted a cross-tissue analysis to investigate regulation that is shared across tissues. Strikingly, we observed that cross-tissue regulation is dominated by cis effects. These analyses point to a greater contribution for cis regulation than previous admixture-based analyses. This divergence would be consistent with a substantial role for epigenetic regulation, whose effects are included in heritability analyses but excluded in admixture analyses. Our results have implications for understanding the causes of “missing heritability” in genetic association studies.
Tuesday, September 22, 2009 1 2:30-2:00 PM
Building 2, Room 426 – Biostatistics Conference Room
Meet & Greet, Monica Ter-Minassian practice talk on Genetic Risk Factors of Neuroendocrine Tumor