Past Working Group Series



Working Group Organizers: Zhonghua Liu & Pier Palamara

For all local students, postdocs, and faculty - The PQG is continuing a less formal working group seminar to provide the opportunity to present and participate in the discussion of works in progress focused on the methods and analysis of high-dimensional data in genetics and genomics. This year we have a renewed focus on the work of young investigators. With the current enthusiastic speakers who have already volunteered, we look forward to an exciting seminar series for 2012-2013.


Tuesday, November 3, 2015
12:30-1:30 PM
Building 2, Room 426 - Biostatistics Conference Room
a pizza lunch will be provided

Rodolphe Thiébaut
Professor at the University of Bordeaux, Bordeaux, France

Statistical analysis of high-dimensional data in clinical trial: the example of an HIV vaccine phase I/II trial

The size and complexity of the data collected to evaluate interventions in clinical trials have changed considerably. This is due to the availability and improvement of new technologies, including gene sequencing. Today, vaccine immunogenicity is assessed by various measurements, including characterization of cell populations (up to 2^16 with flow cytometry and up to 2^40 with CyTOF), protein production (cytokines at the single-cell level with intracellular staining, and in supernatants with multiplex assays), and gene expression and viral adaptation using microarray and sequencing technologies. This situation offers the opportunity to study the mechanistic effect of the vaccine and also to try to predict the future response from baseline or early responses, thanks to the broad spectrum of measurements performed.

In this talk, I will use the example of the DALIA 1 vaccine trial, which evaluated the response to ex vivo generated dendritic cells (DC) loaded with HIV lipopeptides in HIV patients treated with antiretroviral therapy (ART). Gene expression in whole blood was measured by microarrays (Illumina HumanHT-12) at 14 time points. Post-vaccination immune responses were evaluated using various assays. In step 1, a Time-course Gene Set Analysis (TcGSA) was performed using hierarchical models that allow heterogeneity within predefined gene sets, implemented in an R package (PLoS Comput Biol, 2015, 11:e1004310). The statistical properties of this approach have been studied through simulations. In step 2, the association between the abundance of genes selected in step 1, immune responses at week 16, and viral replication after ART interruption was analysed using an extension of the sparse Partial Least Squares approach that takes into account the structural or group effects due to the relationships between markers within biological pathways.

This example illustrates the new opportunities as well as the complexity of analysing high-dimensional datasets and the need for adapted statistical approaches.



Tuesday, October 6, 2015
12:30-1:30 PM
Building 2, Room 426 - Biostatistics Conference Room
a pizza lunch will be provided

Peng Jiang
Postdoc at Dana-Farber Cancer Institute and Harvard School of Public Health

Inference of regulatory abnormality in cancers from heterogeneous public data

Recent years have seen the rapid growth of large-scale tumor-profiling data, which provides numerous opportunities for novel cancer biology insights. However, the experimental conditions of most public data may not match the physiological conditions of a cancer type of interest. Meanwhile, the cancer genome is highly unstable, and many alterations could arise from somatic events unrelated to biological regulation.
In this study, we focused on finding tumorigenesis regulators, including transcription factors (TF) and RNA binding proteins (RBP), by integrating a large number and variety of datasets. For efficient and accurate inference of gene regulatory rules, we developed an algorithm, RABIT (regression analysis with background integration). In each tumor sample, RABIT tests the regulatory impact of each TF (or RBP), controlling for background effects from copy number alteration and DNA methylation. In each cancer type, RABIT further tests whether public repository data have reasonable correlation with tumor-type-specific data. Our predicted regulator impact on tumor gene expression is highly consistent with the knowledge from cancer-related gene databases and reveals many novel aspects of regulation in tumor progression.


Brief Research Profile
Peng Jiang is a postdoc at Dana-Farber Cancer Institute and Harvard School of Public Health. His research focuses on building computational methods to integrate big biological data from the public domain and discover driver events in diverse cancer types. Peng received his Ph.D. in Computer Science from Princeton University in 2013.



Working Group Organizer: Po-Ru Loh


Tuesday, May 12, 2015
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Jae-Hoon Sul
Sunyaev Lab at Brigham & Women's Hospital

Large-scale genetic studies of human complex traits

In the past decade, genotyping and next-generation sequencing (NGS) technologies have generated an enormous amount of data to discover genetic variants present in human genomes and to find the genetic basis of diseases. The technologies have shifted the paradigm of genetic studies from studies that analyzed fewer than a hundred individuals at hundreds of markers to studies that analyze tens of thousands of individuals at millions of genetic variants. With the rapid decrease in sequencing costs and the emphasis on genomic medicine, studies will sequence hundreds of thousands of individuals in the near future. A major challenge in these large-scale genetic studies is developing computational methods that can utilize this big data efficiently.

In this talk, I will describe my work to address this challenge. As genome-wide association studies have discovered numerous non-coding genetic variants associated with traits, there has been increasing focus on interpreting these variants using functional genomics. Expression quantitative trait loci (eQTL) studies that attempt to detect genetic variants associated with gene expression may provide clues as to which variants are functional. I will discuss a method to perform multiple testing correction accurately and rapidly in eQTL studies for identification of genes whose expression is influenced by genetic variants. As eQTL studies have grown larger in sample size, multiple testing correction using the permutation test has become a major computational bottleneck. I developed a multivariate normal sampling approach (MVN), which is more than 100 times faster than the permutation test for a sample size of 2,000 while generating almost the same results. Next, I will present a novel approach to detect rare variants associated with a disease in large families. NGS enables studies to evaluate the effects of rare variants on complex traits, and family-based studies have attracted great attention recently because of their higher power for rare variant testing than case-control studies. I developed a method called RareIBD that can be applied to large pedigrees, both binary and quantitative traits, and affected-only pedigrees. Using simulations, I will show that my method achieves higher power than previous approaches.
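The idea of replacing permutations with sampling from a multivariate normal distribution can be sketched as follows. This is only a minimal illustration under stated assumptions, not the speaker's MVN software: it assumes that null z-scores for a gene's variants are jointly normal with a covariance equal to a known variant correlation matrix, and the function names (`cholesky`, `mvn_adjusted_p`) and the toy matrix `ld` are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Return the lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def mvn_adjusted_p(observed_max_abs_z, correlation, n_samples=10000, seed=0):
    """Estimate a multiple-testing-adjusted p-value by sampling null z-scores."""
    rng = random.Random(seed)
    lower = cholesky(correlation)
    n = len(correlation)
    exceed = 0
    for _ in range(n_samples):
        independent = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlated null z-scores: z = L e, where L L^T equals the correlation matrix.
        z = [sum(lower[i][k] * independent[k] for k in range(i + 1)) for i in range(n)]
        if max(abs(value) for value in z) >= observed_max_abs_z:
            exceed += 1
    # The add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_samples + 1)


# Toy correlation matrix for three variants (assumed values, for illustration only).
ld = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]]
p_adj = mvn_adjusted_p(2.5, ld)
```

Each draw costs one matrix-vector product over the variants, independent of the number of individuals, which is one plausible reason such sampling scales better than permuting phenotypes as sample sizes grow.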

Tuesday, April 21, 2015
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Rany Salem
Hirschhorn Lab at Boston Children's Hospital

Complex Trait Genetics: GWAS, Phenotypic Heterogeneity and Future Directions

The emergence of genome-wide association studies (GWAS) has given researchers a powerful tool to dissect the genetic architecture of many complex traits. A major challenge of performing a GWAS is phenotype selection and dealing with phenotypic heterogeneity. The first part of the talk will focus on results of the first large-scale GWAS of diabetic kidney disease and issues related to identifying and dealing with phenotypic heterogeneity. For the second part of the talk, I will present a framework to deal with the limitations of the GWAS consortium model and briefly explore statistical and methodological issues to take advantage of this opportunity. Finally, I will highlight results of a novel association between a structural variant and lipid levels.


Tuesday, January 20, 2015
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room
a pizza lunch will be provided

Manuel A. Rivas
Wellcome Trust Centre for Human Genetics, University of Oxford and the Broad Institute

Exploiting Correlation of Genetic Effects in Rare Variant Association Studies

I consider the problem of assessing association between rare variants and their effects on multiple, possibly correlated phenotypes. This problem arises in population biobanks, where a large number of health-related measurements are made, and in sequencing association study designs of related disorders. In this talk I will present a statistical framework that employs estimates of the correlation of genetic effects to improve power to discover associations. The approach exploits knowledge about the correlation of genetic effects both across a group of genetic variants and across a group of phenotypes. Properties of the approach as implemented include that it is computationally efficient, making the analysis of large study designs practical; it is flexible and extensible, making the analysis of gene sets, pathways, and networks feasible; and it includes standard univariate and multivariate gene-based tests as a special case, so it can be employed in settings where estimates of the expected correlation of genetic effects are not available. I will present results from a simulation study and from the analysis of two independent rare variant association studies: 1) blood metabolite measurements (multiple quantitative traits) from subjects in the Oxford Biobank, a collection of 30- to 50-year-old healthy men and women living in Oxford, recruited from primary care, who underwent a detailed examination at a screening visit; and 2) a targeted sequencing data set of six autoimmune diseases. The statistical and computational framework is implemented in the software package MAMBA.

Tuesday, December 2, 2014
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Sriram Sankararaman
Reich Lab at Harvard Medical School

Characterizing the structure and impact of archaic admixture in humans

Large-scale studies of human genetic variation, in conjunction with advances in ancient DNA technology, have revealed a number of mixture events between highly diverged human populations that occurred tens of thousands of years ago (archaic admixture), e.g., between the ancestors of Neandertals and modern non-Africans. While several subsequent studies have found examples of phenotypically relevant genetic variants introduced by archaic admixture, the biological impact of archaic admixture on human populations is not systematically understood. To understand the biological impact, we need maps of archaic local ancestry, i.e., a labeling of the archaic ancestry along an individual genome.

We developed a statistical method based on the framework of Conditional Random Fields (CRF) and applied it to infer Neandertal local ancestry. The CRF provides inferences that are robust and sensitive both on simulated and real data. We applied the CRF to data from the 1000 Genomes project combined with a high-coverage Neandertal genome. The resulting map of Neandertal local ancestry reveals likely positive selection of Neandertal ancestry in specific loci and sets of genes (e.g. genes involved in keratin filament formation), associations between Neandertal-derived alleles and medical phenotypes (e.g. Crohn's disease), as well as strong purifying selection against Neandertal alleles in part due to the effects of hybrid male sterility.

Finally, I will present ongoing work to study the impact of another archaic admixture, between a population known as Denisovans and ancestors of modern Melanesian populations. I will discuss generalizations of the CRF to multi-way and de novo admixtures as well as unphased data, efforts to study phenotypic associations of Neandertal variants, and some open statistical questions (how do we learn and apply discriminative models to population genetic problems?).

Tuesday, October 28, 2014
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Yun Li
Assistant Professor of Genetics & Biostatistics
University of North Carolina at Chapel Hill

A Bayesian method for the detection of long-range chromosomal interactions in Hi-C Data

Advances in chromosome conformation capture and next-generation sequencing technologies are enabling genome-wide investigation of dynamic chromatin interactions. For example, Hi-C experiments generate genome-wide contact frequencies between pairs of loci by sequencing DNA segments ligated from loci in close spatial proximity. One essential task in such studies is peak calling, that is, the identification of non-random interactions between loci from the two-dimensional contact frequency matrix. Successful fulfillment of this task has many important implications, including identifying long-range interactions that assist in interpreting a sizable fraction of the results from genome-wide association studies (GWAS). The task - distinguishing biologically meaningful chromatin interactions from massive numbers of random interactions - poses great challenges both statistically and computationally. Model-based methods to address this challenge are still lacking. In particular, no statistical model exists that takes the underlying dependency structure into consideration. We propose a hidden Markov random field (HMRF) based Bayesian method to rigorously model interaction probabilities in the two-dimensional space based on the contact frequency matrix. By borrowing information from neighboring locus pairs, our method demonstrates superior reproducibility and statistical power in both simulations and real data.


Tuesday, October 7, 2014
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Elinor Karlsson
Postdoctoral Fellow and soon-to-be Assistant Professor
Broad Institute
Harvard University FAS Center for Systems Biology
University of Massachusetts Medical School

Natural Selection and Cholera Resistance in Bangladesh

Infectious pathogens are among the strongest selective forces driving recent human evolution, as migration and cultural changes over the last 100,000 years exposed populations to dangerous new diseases. One such pathogen is the causal agent of cholera, the bacterium Vibrio cholerae, which is endemic in the Ganges River Delta. By combining signals of association and natural selection, we have identified genes and pathways implicated in cholera susceptibility in Bangladesh, and developed a model of the innate immune signaling pathways that respond to V. cholerae infection. In this model, inflammasome activation and NF-kB signaling play an integrated role in TLR4-mediated sensing of V. cholerae – which is consistent with our in vitro data. Our approach is broadly applicable to other historically prevalent infectious diseases, such as Lassa fever, tuberculosis, leishmaniasis, dengue fever and malaria, and to complex, common diseases, such as inflammatory bowel disease, for which the associated genes may have been historically selected. Thus, ancient natural selection can give new insights into the function – and dysfunction – of human biology, with important implications for medical genomics.


Working Group Co-organizers: Bjarni Vilhjalmsson & Sasha Gusev



Tuesday, April 22, 2014
12:30-2:30, Building 2-Room 426

Kaitlin Samocha
Analytic and Translational Genetics Unit | MGH

Identification of a Set of Highly Constrained Genes from Exome Sequencing Data

A major challenge of medical genetics is to determine which variants, if any, contribute to disease in a patient. While variants can be prioritized based on their predicted deleteriousness, information about the disrupted gene can also be used to highlight those variants that are more likely to contribute to disease. For example, damaging variants in genes expressed in the relevant tissue might be prioritized over variants in genes that are not expressed in the tissue. Another potential way to prioritize variants is by the evolutionary constraint of the gene.

We developed a sequence-context-based model of de novo variation to create per-gene probabilities of synonymous, missense, and loss-of-function mutations. We noticed a high correlation (0.94) between the probability of a synonymous mutation in a gene and the number of rare synonymous variants identified in that same gene using the NHLBI’s Exome Sequencing Project data. We predicted the number of variants that we would expect to see in the dataset and, in order to quantify deviations from those expected values, created a Z score of the chi-squared difference between the observed and expected variation. While the distribution of these Z scores for the synonymous variants was normal, there was a marked shift in the missense distribution towards having fewer variants than predicted.
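One way to read "a Z score of the chi-squared difference" is as the signed square root of the chi-squared term (observed - expected)^2 / expected. The sketch below assumes that reading; the sign convention (negative meaning fewer variants than expected), the function name `signed_z`, and the gene names and counts are all hypothetical and may differ from the published method.

```python
import math


def signed_z(observed, expected):
    """Signed square root of the chi-squared term (observed - expected)^2 / expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    chi_squared = (observed - expected) ** 2 / expected
    # Assumed convention: negative values mean fewer variants than expected.
    sign = 1.0 if observed >= expected else -1.0
    return sign * math.sqrt(chi_squared)


# Hypothetical per-gene counts: (observed, expected) missense variants.
genes = {"GENE_A": (50, 100.0), "GENE_B": (98, 100.0)}
z_scores = {gene: signed_z(obs, exp) for gene, (obs, exp) in genes.items()}
```

Under this convention a gene such as the hypothetical GENE_A, with half the expected number of missense variants, receives a strongly negative score, while GENE_B sits close to zero.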

We identified a list of excessively constrained genes representing roughly 5% of all genes. This set of genes identified as excessively constrained showed enrichment for entries in the Online Mendelian Inheritance in Man (OMIM) database and, in particular, for those with a dominant inheritance pattern. Using published data, we found that de novo loss-of-function variants identified in patients with autism and intellectual disability were in a constrained gene more often than expected (p < 0.0001 for both). This trend did not hold for those genes with a de novo loss-of-function variant in a control (p = 0.66), indicating that this approach can effectively prioritize genes in which mutations can strongly predispose to disease.

Tuesday, February 25, 2014
12:30-2:30, Building 2-Room 426

Stephan Ripke
Analytic and Translational Genetics Unit | MGH

Reality at Last: Psychiatric Genomics Consortium Quadruples Schizophrenia GWAS Sample Size

Genome-wide association studies have been a principal study design in human genetics for almost ten years now. They have been performed on almost all known heritable human traits and diseases with the hope of shedding light on the biology of so-called complex genetic diseases like diabetes mellitus, hypertension and hyperlipidemia. The psychiatric field, which has traditionally lacked biologic instruments for medical research, was especially enthusiastic about this new technology. Still, early analyses did not bear fruit, causing many to question or abandon this approach.

Genome-wide association studies are now clearly one of the most successful genetic analysis methods in medical research. Results from these studies have led to many known and possibly novel drug targets and have deciphered many new biological pathways. Interestingly, psychiatric phenotypes are in no way inferior to somatic phenotypes. Yet the path to this success was more arduous than expected.

In this talk I will present lessons learned during the last ten years of GWAS. I will give an overview of this journey, highlighting early successes from somatic diseases (e.g. Crohn’s disease) and later but no less striking successes from psychiatric diseases, schizophrenia in particular.

I will then attempt to translate these experiences into expectations and recommendations for future work in the field of genetics and in the field of psychiatric genetics specifically.


Tuesday, November 19, 2013
12:30-2:30, Building 2-Room 426

Gosia Trynka
The Raychaudhuri Lab | BWH | HMS

Analyses assessing enrichment of GWAS variants for non-coding annotations in the genome are upwardly biased

Interpreting the function of variants associated with complex traits is challenging, particularly because only a small proportion of the variants map to gene-coding regions. The activity of genomic regions can be inferred from whole-genome assays for histone modifications, chromatin accessibility or specific transcription factors. Such maps are now becoming available for hundreds of cell types and tissues.

A test for enrichment within chromatin annotations has become a widely applied approach to infer the functions of trait-associated variants. However, there are currently no standards for carrying out such an analysis correctly. Typically, enrichment analysis takes associated variants and quantifies their overlap with chromatin annotations. The observed enrichment is then compared to a null distribution of SNPs sampled from the whole genome based on different parameters. Most frequently the null is defined by random sets of SNPs, or by sets matched for proximity to the TSS and minor allele frequency. Different definitions of the null distribution strongly influence the significance of the observed results.
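The typical enrichment procedure described above can be sketched in a few lines. This is a toy illustration under stated assumptions: positions are plain integers on a single hypothetical chromosome of length 100, the helper names (`overlap_count`, `empirical_p`) are made up, and the null sets are drawn uniformly at random, which is exactly the kind of naive null whose definition the abstract says drives the result.

```python
import random


def overlap_count(positions, intervals):
    """Count positions that fall inside at least one annotated interval."""
    return sum(
        any(start <= pos <= end for start, end in intervals) for pos in positions
    )


def empirical_p(observed, null_counts):
    """Empirical p-value: share of null sets with overlap at least as large."""
    exceed = sum(1 for count in null_counts if count >= observed)
    # The add-one correction avoids reporting a p-value of exactly zero.
    return (exceed + 1) / (len(null_counts) + 1)


annotation = [(10, 20), (30, 40)]  # hypothetical chromatin marks
associated = [5, 15, 35]           # hypothetical trait-associated SNP positions
observed = overlap_count(associated, annotation)

# Naive null: random positions, unmatched for any SNP property (an assumption).
rng = random.Random(0)
null_counts = [
    overlap_count([rng.randint(1, 100) for _ in associated], annotation)
    for _ in range(999)
]
p_value = empirical_p(observed, null_counts)
```

Swapping the line that builds `null_counts` for a matched sampler (for example, one matched on distance to the TSS or allele frequency) changes only the null, which makes it easy to see how sensitive the reported significance is to that choice.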

We tested different sets of matching parameters and defined those that are essential to sufficiently control for type 1 error. With simulations we show that the currently employed matching methods are strongly biased towards false positive results. Additionally, we develop an independent method that does not rely on matching and instead takes into account local complex correlation of genomic annotations present at the associated loci.


Tuesday, November 5, 2013
12:30-2:30, Building 2-Room 426

David Golan
Ph.D. candidate, Rosset Lab, Tel Aviv University

Accurate Estimation of Heritability in Case-Control GWAS

Linear mixed effects models have recently gained popularity as the method of choice for estimating heritability from GWAS data, i.e. quantifying how much of the variability of a phenotype can be explained by the genotyped SNPs.
However, most of the interesting diseases and disorders studied are rare (typically affecting <1% of the population), and so the proportion of cases in a study is usually considerably higher than the proportion of cases in the population. This over-representation of cases invalidates several key assumptions of linear mixed models, e.g. the normality and independence of the random effects. Ignoring these problems results in shrunken estimates of heritability. We propose an alternative approach for estimating heritability, related to the well-known method of Haseman and Elston. We derive the relationship between the genetic similarity and the phenotypic similarity of any two individuals as a function of the heritability, while explicitly conditioning on the fact that both individuals were selected for the study. Our method then entails regressing the pairwise phenotypic similarities on the pairwise genetic similarities and using the slope to obtain an estimate of the heritability. We show, using simulations, that our method yields unbiased estimates which are considerably more accurate than the current state-of-the-art methodology. Applying our method to several well-studied GWAS yields heritability estimates which are considerably higher than previously published results.
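The regression step can be illustrated with a plain Haseman-Elston-style sketch for a standardized quantitative phenotype. It deliberately omits the conditioning on case-control ascertainment that is the contribution of the talk, so it should not be read as the proposed method. The function names and the toy inputs are hypothetical, and the numbers are chosen only so that the answer is easy to check by hand.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x, with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def he_regression(z, grm):
    """Regress pairwise phenotype products on pairwise genetic similarity.

    z   : phenotypes, assumed already standardized (mean 0, variance 1)
    grm : genetic relationship matrix; only off-diagonal entries are used
    """
    genetic, phenotypic = [], []
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            genetic.append(grm[i][j])
            phenotypic.append(z[i] * z[j])
    # The slope is the heritability estimate in this simplified setting.
    return ols_slope(genetic, phenotypic)


# Toy inputs built so that each phenotype product is exactly half the
# corresponding genetic similarity; these are not realistic values.
z = [1.0, -1.0, 2.0]
grm = [[1.0, -2.0, 4.0], [-2.0, 1.0, -4.0], [4.0, -4.0, 1.0]]
slope = he_regression(z, grm)
```

With these toy inputs the slope comes out at one half, mirroring how the slope would be read as a heritability estimate once real similarities are plugged in.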



Tuesday, October 22, 2013
12:30-2:30, Building 2-Room 426

Saurabh Ghosh
Human Genetics Unit
Indian Statistical Institute, Kolkata

Integrating Multiple Phenotypes For Association Mapping

Most clinical end-point traits are governed by a set of quantitative and qualitative precursors and a single precursor is unlikely to explain the variation in the end-point trait completely. Thus, it may be a prudent strategy to analyze a multivariate phenotype vector possibly comprising both quantitative as well as qualitative precursors for association mapping of a clinical end-point trait. The major statistical challenge in the analyses of multivariate phenotypes lies in the modelling of the vector of phenotypes, particularly in the presence of both quantitative and binary traits in the multivariate phenotype vector.

For population-based data, we propose a novel Binomial regression approach that models the likelihood of the number of minor alleles at a SNP conditional on the multivariate phenotype vector using a logistic link function. For family-based data comprising informative trios, we propose a logistic regression method that models the transmission probability of a marker allele from a heterozygous parent conditioned on the multivariate phenotype vector and the allele transmitted by the other parent. In both approaches, the test for association is based on all the regression coefficients.

We carry out extensive simulations under a wide spectrum of genetic models and probability distributions of the multivariate phenotype vector to evaluate the power of our test procedures. We apply the proposed population-based method to analyze a multivariate phenotype comprising homocysteine levels, Vitamin B12 levels and affection status in a study on Coronary Artery Disease, and the family-based method to analyze a vector of four endophenotypes associated with alcoholism: the maximum number of drinks in a 24-hour period, Beta 2 EEG Waves, externalizing symptoms and the COGA diagnosis trait in the Collaborative Study on the Genetics of Alcoholism (COGA) project.


Tuesday, September 24, 2013
12:30-2:30, Building 2-Room 426

Ron Do
Research Fellow in Genetics
Harvard Medical School

Human Genetics Approaches to Understand if Triglyceride-Rich Lipoproteins Cause Coronary Artery Disease

Plasma triglycerides are transported in specific lipoproteins; in observational epidemiologic studies, increased triglyceride levels correlate with higher risk for coronary artery disease (CAD). However, it is unclear whether this association reflects causal processes. Genetic information can help assess causality, and can be useful in dissecting the influences of correlated measures such as triglycerides, low-density lipoprotein cholesterol (LDL-C), and high-density lipoprotein cholesterol (HDL-C). Here, we present results from two studies that help us understand if triglyceride-rich lipoproteins cause coronary artery disease. In the first study, we used 185 common polymorphisms recently mapped for plasma lipid traits (P < 5 x 10^-8 for each) to examine the role of triglycerides in risk for CAD. First, we highlight loci associated with both LDL-C and triglycerides, and show that the direction and magnitude of both are factors in determining risk for CAD. Second, we consider loci with a strong magnitude of association with triglycerides but a minimal one with LDL-C, and show that these loci are also associated with CAD. Finally, in a model accounting for effects on LDL-C and/or HDL-C, a polymorphism’s strength of effect on triglycerides is correlated with the magnitude of its effect on CAD risk. In the second study, we sequenced the exomes of 1,027 individuals with myocardial infarction (MI) at an early age, compared their sequences with those of 946 older individuals without MI, and validated initial associations using statistical imputation, genotyping, and re-sequencing. In follow-up sequencing of > 10,000 individuals, we observed that rare non-synonymous mutations in the APOA5 gene were more frequent in early-onset MI patients than in MI-free controls at an exome-wide significance threshold (APOA5, 1.4% versus 0.6%, odds ratio = 2.2, P = 5 x 10^-7). Carriers had higher plasma triglyceride concentrations than non-carriers (median in carriers was 167 mg/dl versus 104 mg/dl for non-carriers, P = 0.007). These results suggest that triglyceride-rich lipoproteins may causally influence risk for CAD and that novel therapeutic approaches targeted to TG-rich lipoproteins might be expected to reduce risk of CAD.



Working Group Organizer: Hugues Aschard
Postdoc Committee: Bjarni Vilhjalmsson, Jin Zhou, Megha Padi

Tuesday, April 2, 2013
12:30-2:30, Building 2, Room 426

Nikolaos A. Patsopoulos, MD, PhD
Instructor in Neurology
Brigham & Women's Hospital
Harvard Medical School

The genetics of common variation in Multiple Sclerosis

Multiple Sclerosis is a neurodegenerative disease with strong evidence of genetic predisposition. In this PQG talk I'll present current and ongoing efforts of the International Multiple Sclerosis Genetics Consortium (IMSGC) to identify and fine-map genetic effects in large-scale data sets. The focus will be on design, on addressing technical issues, e.g. complicated sample structure, and on the application of methods, e.g. control for population stratification.

Tuesday, March 5, 2013
12:30-2:30, Building 2, Room 426

Tune H. Pers
Research Fellow in Pediatrics
Children's Hospital Boston
Harvard Medical School

Predicting pathways and selecting likely causal genes from a height genome-wide association study by integrative network-based analysis

In genome-wide association (GWA) studies many loci have no single obviously causal gene; therefore, the challenge in moving from association to novel biological insight is to identify which gene at each locus most likely explains the association. Another challenge is to identify whether associated loci coalesce onto biological pathways. A key first step is to use computational approaches to prioritize genes within loci as most likely to be biologically relevant, and to assess whether genes in associated loci are enriched for particular gene sets. Our approach is based on expression patterns derived from a heterogeneous panel of 80,000 expression arrays that predict functions for genes. The predicted functions include predicted protein-protein interactions, predicted phenotypic consequences, predicted Reactome pathway members, predicted KEGG pathway members, and predicted Gene Ontology term membership. We subsequently used these predicted functions to systematically identify the most likely causal gene(s) at a given locus. We evaluate our method using the GIANT consortium's GWA meta-analyses for human height. For human height GWA loci, our integrative approach performs consistently better in predicting likely causal genes than approaches based on one or two data types only. As unbiased benchmarks, we tested for enrichment of genes that have recently been shown to be differentially expressed in rodent growth plates, and for genes that have associated missense variants. Based on these and other benchmarks, this approach outperforms methods using fewer data types. We have developed an unbiased computational approach that integrates a variety of data types and performs well in prioritizing potentially causal genes from GWA data.


Tuesday, December 4, 2012
12:30-1:30, Building 2, Room 426

Paz Polak
The Broad Institute
BWH/HMS, Sunyaev Lab

Revealing the factors that shape the regional mutation rates in melanoma

Recent advances in high-throughput sequencing technology enable us to identify cancer-specific mutations at any position in a patient’s genome, allowing dozens of studies to reveal a rich world of mutation patterns. We have been studying how the density of mutations changes along the genome in different cancer types. We used epigenetic data from several normal cell types, recently released by the ENCODE project consortium, to predict the variation in cancer-specific mutation densities along human chromosomes. For prediction, we used three different methods: Random Forest; Poisson regression with elastic net regularization, which is a compromise between lasso and ridge regression; and Generalized Additive Models (GAM), which smooth the predictors and deal with the need to linearize their relations. Overall, we were able to explain up to 80% of the variation in melanoma genomes, showing that chromatin structure is involved in establishing mutation patterns in cancer.

Tuesday, November 27, 2012
12:30-1:30, Building 2, Room 426

Sara Lindstrom
Research Scientist
Department of Epidemiology

Deep targeted sequencing of 12 breast cancer loci in 4,700 women across four different ethnicities

Genome-wide association studies (GWAS) have identified multiple genetic loci associated with breast cancer risk. However, the underlying genetic structure in these regions is not fully understood and it is likely that the index GWAS signal originates from one or more yet unidentified causal variants within the region. We used next-generation sequencing to characterize 12 GWAS-discovered breast cancer loci in a total of 2,300 breast cancer cases and 2,300 controls across four ethnic populations. Region intervals spanned between 46kb and 973kb. In total we hybrid-captured and sequenced 5.5Mb. On average, we were able to capture 82% of the non-repetitive sequence in the targeted regions, and the average fraction of captured bases sequenced with a depth >20x was more than 95%. After single nucleotide variant (SNV) calling and quality control, a total of 138,898 SNVs remained for analysis. 87.4% of those had a minor allele frequency less than 0.5% and 50.2% were private mutations. Across all regions, a total of 81 SNVs showed evidence for association with breast cancer (P<0.001). However, we did not find unequivocal association evidence supporting a causal role for any individual SNV. These results illustrate the challenges facing post-GWAS fine-mapping studies.


Tuesday, October 2, 2012
12:30-2:00, Building 2, Room 426

Hugues Aschard
Postdoc of Peter Kraft

Exploring the effect of gene-gene and gene-environment interactions in breast-cancer risk prediction

Genome-wide association scans have identified scores of common genetic variants associated with the risk of complex diseases in recent years. However, their aggregate effects on risk beyond traditional factors remain uncertain. Recent papers reported that the addition of information on these genetic variants to existing risk models for common diseases can moderately improve discrimination and prediction in estimating that risk. Here we explore the extent to which the consideration of interaction between genetic variants (GxG) and interaction between genetic variants and established clinical factors (GxE) can add to these risk-assessment models. Using data from the Nurses’ Health Study, we derive the predictive ability of 15 independent published risk alleles in 1145 breast cancer cases and 1142 matched controls. We evaluate the extent to which single nucleotide polymorphisms (SNPs) can discriminate breast cancer cases from controls in the whole sample and in strata defined by non-genetic risk factors. We then conduct a simulation study to explore the potential improvement in discrimination if complex GxG and GxE interactions exist and are known.


2011-2012 Working Groups

Working Group Organizers: Elizabeth Schifano & Monica Ter-Minassian

For all local students, postdocs, and faculty - The PQG is continuing a less formal working group seminar to provide the opportunity to present and participate in the discussion of works in progress focused on the methods and analysis of high-dimensional data in genetics and genomics. This year we have a renewed focus on the work of young investigators. With the current enthusiastic speakers who have already volunteered, we look forward to an exciting seminar series for 2011-2012.

Tuesday, March 27, 2012
12:30-2:00, Building 2-Room 426

Aedin Culhane, Ph.D.
Research Scientist
Department of Biostatistics
Dana-Farber Cancer Institute

New biclustering approaches for pan-cancer analysis of a large compendium of genomics data

Though it is widely recognized that molecular traits such as p53 mutation and deregulation of the cell cycle span many types of cancer, cancer is conventionally classified by anatomical site, and studies of cancer generally analyze the disease within these boundaries. There is a growing recognition that histological and molecular heterogeneity cross anatomical boundaries; for example, clear cell ovarian cancer is closer to clear cell renal cancer than it is to other ovarian cancers. New pan-cancer analysis approaches are required to discover the molecular taxonomy of cancer upon which personalized cancer medicine can be built. I will describe a new biclustering approach that we have developed to enable analysis of a large compendium of cancer genomics data, and the pan-cancer molecular traits we have discovered.



Tuesday, February 28, 2012
12:30-2:00, Building 2-Room 426

Peggy Lai, M.D.
Post-doctoral research fellow in Environmental Health
(Mentors: David Christiani and Winston Hide)

Using a gene expression signature to understand endotoxin related chronic obstructive lung disease

A quarter of patients with chronic obstructive pulmonary disease (COPD) in the United States have never smoked. Endotoxin is a common and poorly recognized environmental exposure that may cause non-tobacco related COPD. In recent studies, chronic inhalational endotoxin has been associated with the development of obstructive lung disease in both murine models and human epidemiologic studies of occupationally exposed populations. Despite this, endotoxin related lung disease has not been well studied. Three distinct murine models of chronic endotoxin exposure have been developed in different laboratories with use of microarrays to characterize global gene expression. We identified a common 101 gene signature for recurrent endotoxin exposure that is both biologically interesting at the gene level, and at the pathway level demonstrated increasing numbers of inflammation related genes with longer periods of endotoxin exposure; this is surprising in light of the well-known phenomenon of endotoxin tolerance, suggesting that dysregulated inflammation may play a role in endotoxin-related lung disease. The focus of this talk is on the application of biostatistical and bioinformatics methods to a clinically relevant but poorly understood disease.


Tuesday, January 31, 2012
12:30-2:00, Building 2-Room 426

Lin Li, Ph.D.
Postdoctoral Research Fellow
Department of Biostatistics
Harvard School of Public Health

Incorporating gene expression information in SNP set association testing

Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are likely to be expression quantitative trait loci (eQTLs). It is of interest to utilize eQTL information for understanding the genetic basis underlying complex traits. On the other hand, SNP set association testing, especially region-based, can harness information of SNPs in linkage disequilibrium and lead to increased power. We are interested in incorporating gene expression information in SNP set association testing. Specifically, we employ kernel machine tests using weighted kernels, where the predefined weights are based on eQTL signals. We also consider an adaptive test that is robustly powerful regardless of whether the expression signals are relevant or not to the trait of interest. As an application, we analyze data from an asthma study using our proposed methods.

(Joint work with Prof. Xihong Lin and Prof. Liming Liang)
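
As a rough illustration of a kernel machine statistic with a weighted kernel, the sketch below computes Q = r' K r for a weighted linear kernel K = G diag(w) G'. The kernel form, the intercept-only null model, and the function name are assumptions made for illustration; the abstract does not specify the kernel or how the eQTL-based weights are constructed.

```python
def weighted_kernel_statistic(genotypes, phenotype, weights):
    """Score-type statistic Q = r' K r with K = G diag(w) G'.

    genotypes: n rows, each a list of p allele counts (one per SNP)
    phenotype: n quantitative trait values
    weights:   p non-negative SNP weights (hypothetically from eQTL evidence)
    The null model is intercept-only, so the residual is r = y - mean(y).
    """
    n = len(phenotype)
    if len(genotypes) != n:
        raise ValueError("genotypes and phenotype must have the same length")
    mean_y = sum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]
    q = 0.0
    for k, w in enumerate(weights):
        # Per-SNP score: covariance-like sum of residual times genotype.
        score_k = sum(r * row[k] for r, row in zip(residuals, genotypes))
        q += w * score_k ** 2
    return q


if __name__ == "__main__":
    G = [[0, 1], [1, 0], [2, 1]]
    y = [1.0, 2.0, 3.0]
    print(weighted_kernel_statistic(G, y, weights=[1.0, 0.5]))
```

Up-weighting a SNP increases its contribution to Q in proportion to its weight. Obtaining a p-value requires the null distribution of Q, which is not computed in this sketch.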


Tuesday, December 13, 2011
12:30-2:00, Building 2-Room 426

Chen-yu Liu
Research Fellow, Department of Environmental Health
Harvard School of Public Health

A pilot study of residential petrochemical exposures and DNA methylation changes for leukemia risk

We conducted a pilot study to assess the associations of genome-wide methylation pattern changes in peripheral blood and residential petrochemical exposure on 30 childhood acute lymphoblastic leukemia (ALL) cases and controls, as an exploratory extension of a population-based case-control study [1-3]. Higher concentrations of selected PAHs and VOCs in the vicinity of petrochemical industries in the study area have been reported, compared to those in industrialized communities of the U.S. [4, 5]. Hypermethylation/hypomethylation of specific genes involved in carcinogen metabolism and exposure to several PAH and VOC carcinogens may increase childhood ALL risk. We used geographic information system tools to estimate individual-level exposure by accounting for subjects' mobility, length of stay at each residence, distance to petrochemical plant(s), and monthly prevailing wind direction. DNA methylation profiles were obtained by using the Illumina Infinium HumanMethylation27 BeadArray. Differential methylation levels between ALL cases and controls were found in genes involved in inflammatory response, lymphocyte differentiation/activation, and apoptosis. Similar methylation pattern changes were observed when comparing petrochemical exposure correlated effects. Seventy percent of the genes overlapped between the petrochemical exposure correlated changes and the leukemia associated changes. The results do not remain significant after adjusting for multiple comparisons.


  1. Liu, C.Y., et al., Maternal and offspring genetic variants of AKR1C3 and the risk of childhood leukemia. Carcinogenesis, 2008. 29(5): p. 984-90.
  2. Liu, C.Y., et al., Cured meat, vegetables, and bean-curd foods in relation to childhood acute leukemia risk: A population based case-control study. BMC Cancer, 2009. 9(1): p. 15.
  3. Yu, C.L., et al., Residential exposure to petrochemicals and the risk of leukemia: using geographic information system tools to estimate individual-level residential exposure. Am J Epidemiol, 2006. 164(3): p. 200-7.
  4. Lee, C., The characteristics of ambient volatile organic compounds and their carcinogenesis effects in the vicinity of one petrochemical industry in Taiwan. (In Chinese). Taipei, Taiwan, Republic of China: National Science Council. 1995.
  5. Lee, C. and G. Tsai, The characteristics of ambient polycyclic aromatic hydrocarbons and their carcinogenesis effects in the vicinity of one petrochemical industry in Taiwan. (In Chinese). Taipei, Taiwan, Republic of China: National Science Council. 1994.

Tuesday, November 22, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided

Cristian Tomasetti
Postdoc with Giovanni Parmigiani
Dana-Farber Cancer Institute and Harvard School of Public Health

Mathematical Modeling of Random Genetic Mutations for a General Class of Tumor Growth Curves with Applications

Various attempts to study random genetic mutations have been made in the mathematical modeling literature. An important goal has been the estimation of the probability for a specific mutation to be present in a tumor of a given size and the related estimate for the expected number of mutants, if mutants are indeed present. Previous mathematical models addressing these questions have, however, shown two potential limitations: the tumor was assumed to be homogeneous and, on average, growing exponentially. In this talk a few recent results will be presented where the heterogeneity of the tumor cell population is taken into account, and the exponential growth of cancer has been replaced by other, arguably more realistic types of tumor growth. Various applications of this mathematical framework to chronic myeloid leukemia and gastrointestinal stromal tumor will be introduced.

Tuesday, October 25, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided

Noah Zaitlen
Postdoc with Alkes Price

Components of Heritability in An Icelandic Cohort

The combined genotypes and genealogy of 38,167 Icelanders provide the opportunity to examine the contribution of different components of genetic variation to the heritability of complex phenotypes. In this work we focus on both parent-of-origin effects and the contribution of typed versus untyped variants. Genetic variation can alter phenotype in a parent-of-origin specific manner, for example, via imprinting. Kong et al. [1] identified three Type 2 Diabetes (T2D) variants including rs2334499, which is protective for T2D when inherited maternally, confers risk when inherited paternally, and lies in an imprinted region of the genome. The genealogy provides a means of resolving not only highly accurate phasing, but also the parent-of-origin of each typed variant. Using this information we develop methods to examine the total contribution of parent-of-origin effects, as well as the differences between paternal and maternal contributions to the heritability of complex phenotypes. We find that variation of height shows little evidence of contribution from parent-of-origin effects, while T2D shows highly significant evidence (p-value < 1.9x10^-5), suggesting that many more variants such as rs2334499 remain to be found. In addition to parent-of-origin effects, the unique long-range phasing of the deCODE data provides a means of efficiently estimating the fraction of the genome shared identical by descent (IBD) as well as identical by state (IBS). As has been recently demonstrated [2], IBS sharing can be used to estimate the contribution to phenotype of genotyped SNPs and SNPs in linkage disequilibrium (LD) with genotyped SNPs. IBD sharing estimates the total narrow sense heritability of the phenotype, that is, the additive contribution of all SNPs. Since both are readily available in these data we are able to compare the IBD based estimates to the IBS based estimates and show that for all phenotypes examined the majority of heritability is well captured by the genotyped SNPs. We conclude with a discussion of confounding effects in mixed model estimates of heritability such as those provided by GCTA, showing that population structure can lead to biased estimates of heritability even when corrected for by principal component adjustments.

  1. Kong et al., Parental origin of sequence variants associated with complex diseases. Nature, 2009.
  2. Yang et al., Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet, 2011.


Tuesday, September 27, 2011
12:30-2:00, Building 2-Room 426
a pizza lunch will be provided

Kimberly Glass
Postdoc with GC Yuan & John Quackenbush

Passing Messages Between Data Types to Refine Predicted Network Interactions

Gene regulatory network reconstruction is a fundamental question in computational biology. Although it is recognized that there are significant limitations when using individual datasets to reconstruct a regulatory network, it remains a major challenge to integrate multiple data sources effectively. We propose a message-passing model that utilizes multiple sources of regulatory information to predict regulatory relationships. A major advantage of our method is that the local connections are iteratively refined by exchanging information within each gene's functional neighborhood, thus increasing both numerical efficiency and biological accuracy. Using yeast as a model system, we demonstrate that this integrative method reconstructs the regulatory network more accurately than other widely accepted network reconstruction algorithms. In addition, our method is amenable to the future inclusion of data reflecting many different types of regulatory mechanisms, including protein complexes and protein-protein interactions, the binding of transcription factors to promoters, and epigenetic marks residing in the control regions of genes.


2010-2011 Working Groups

Tuesday, May 3, 2011
12:30-2:00, Building 2-Room 426

Sarah Fortune, MD
Assistant Professor of Immunology and Infectious Diseases
Department of Immunology and Infectious Diseases

Time for change: Using whole genome sequencing data to define the mutation rate of Mycobacterium tuberculosis during the course of infection

Mycobacterium tuberculosis (Mtb) poses a global health catastrophe that has been compounded by the emergence of highly drug resistant Mtb strains. In Mtb, all drug resistances are the result of chromosomal mutations and depend on the bacterium's capacity for mutation during the course of infection. We have used whole genome sequencing of bacterial isolates to derive estimates of the mutation rate of Mtb in the infected host in order to better understand the occurrence of drug resistance. Our data suggest that Mtb acquires mutations at roughly the same rate over time, irrespective of the organism's replicative state. However, the mutation spectrum in Mtb strains from animals with active and latent disease reflects different mutational pressures in different disease states. From these data, we are developing a mutational clock for Mtb in the infected host, which we are validating through WGS of human isolates from defined transmission chains.

Tuesday, April 5, 2011
12:30-2:00, Building 2-Room 426

Melissa Merritt, Ph.D.
Postdoctoral Research Fellow
Department of Epidemiology, Harvard School of Public Health
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute

Cell-of-origin effect on tubo-ovarian tumor phenotype

The examination of gynecologic tissues with putative precursor lesions for ovarian tumors in clinical samples has identified a variety of suggested precursors for ovarian cancer. These observational studies have guided us to design experiments to address whether cells-of-origin influence the phenotype of tubo-ovarian cancer. To do this, we established paired normal human ovarian and fallopian tube epithelial cells in culture from two donors without cancer, and induced carcinogenic transformation of cells by the sequential introduction of hTERT, SV40 and H-Ras. The gene expression profiles of the normal and transformed cells were examined using HG-U133 Plus 2 microarrays (Affymetrix), which revealed a 'cell-of-origin' gene signature that distinguished ovarian (OV) vs fallopian tube (FT) epithelial cells from the same patient. Among the most significant genes we selected a subset that were over-expressed in either OV or FT origin cultured cells and used these to identify two subpopulations (OV-like versus FT-like) among human ovarian tumors in two publicly available gene expression datasets. In this analysis FT-like tumors were predominantly composed of serous high grade cancers and were associated with significantly shorter disease-free survival. In contrast, OV-like tumors were associated with a better prognosis. We have previously demonstrated that cell-of-origin plays a role in determining the phenotype of breast tumors. These results provide further support for the hypothesis that cell-of-origin strongly influences tumor phenotype. We suggest that the most aggressive subtype of tubo-ovarian cancers may originate in the fallopian tube. The results from these studies may allow a cell-of-origin based improved molecular classification of ovarian cancers and a better understanding of the pathogenesis of Müllerian tumors.

Tuesday, March 1, 2011
12:30-2:00, Building 2-Room 426

Monica Ter-Minassian, Sc.D.
Postdoctoral Fellow at HSPH (Environmental Health Dept.) and DFCI (Medical Oncology Dept.)

Genetic variability in tobacco-specific nitrosamine NNK to NNAL metabolism

The nitrosamine NNK, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone, is known to be one of the most potent tobacco carcinogens, particularly for lung adenocarcinoma. Recently, NNK urinary metabolites, total NNAL, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol and its glucuronides, have been shown to be good predictors of lung cancer risk, years prior to diagnosis. We sought to determine if several genetic polymorphisms significantly contributed to the wide range of inter-individual variability observed in total NNAL output. The study subjects were derived from the Harvard/Massachusetts General Hospital Lung cancer case-control study. We analyzed 87 self-described smokers (35 lung cancer cases and 52 controls), with urine samples collected at time of diagnosis (1992-1996). We tested 77 tagging SNPs in 16 genes related to the metabolism of NNK to total NNAL derived from prior GWAS genotyping on these subjects. Using a weighted case status least squares regression, we tested for the association of each SNP with square-root (sqrt) transformed total NNAL (pmol per mg creatinine), controlling for age, sex, sqrt packyears and sqrt nicotine (ng per mg creatinine). Three HSD11B1 SNPs and AKR1C4 rs7083869 were significantly associated with decreasing total NNAL levels. HSD11B1 and AKR1C4 enzymes are carbonyl reductases directly involved in the single step reduction of NNK to NNAL. The HSD11B1 SNPs may be correlated with the functional variant rs13306401 and the AKR1C4 SNP is correlated with the enzyme activity reducing variant rs17134592, L311V.


Nicola Segata, Ph.D.
Postdoctoral Fellow
Department of Biostatistics, HSPH

Metagenomic Biomarker Discovery and the Human Microbiome

Metagenomics has provided a new avenue for biomarker discovery, since changes in the composition and functional activity of microbial communities can provide insight into the ecological differences among communities or provide diagnostic or prognostic power when applied to the human microbiome. We propose the LDA Effect Size (LEfSe) algorithm to discover and explain microbial and functional biomarkers in the human microbiota and other microbiomes. LEfSe determines the features (organisms, clades, OTUs, genes, or functions) most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and biological relevance. We demonstrate this method to be effective in mining human microbiomes for metagenomic biomarkers associated with mucosal tissues and with different levels of aerobic metabolism. Similarly, when applied to 16S rRNA gene data describing a murine ulcerative colitis gut community, LEfSe confirms the key role played by Bifidobacterium and suggests the involvement of additional clades including Metascardovia. Finally, we provide characterizations of microbial functional activity from metagenomic community sequencing, comparing environmental bacterial and viral microbiomes. A comparison of LEfSe with existing microbial biomarker discovery methods and with standard statistical approaches (including an evaluation incorporating synthetic data) highlights a lower false discovery rate, consistent ranking of biomarker relevance, and concise representations of taxonomic and functional shifts in microbial communities associated with environmental conditions or disease phenotypes.



Tuesday, February 1, 2011
12:30-2:00, Building 2-Room 426

Gabriel Altshuler
Postdoctoral Associate, HSPH

Pathway Fingerprinting; A Functional Framework for Multi-platform Integration

No objective, broadly applicable approach exists to compare molecular profiles, yet public repositories of gene expression arrays, RNASeq, GWAS, CAGE, epigenetic marks and ChIPseq data represent major resources for discovery. There is a pressing need for an objective, biologically-interpretable, functional abstraction of the comprehensive sample space of cellular genome, transcriptome and regulome data to compare these experiments within a consistent global framework. Pathway profiling significantly out-performs gene-based comparisons for cross-platform analysis by exploiting the fact that molecular profiles reflect activity of genes acting in concert across pathways. We have developed the 'pathway fingerprint', a mapping of any molecular profile to a fixed set of known pathways. Each fingerprint generated is standardized relative to the full public data corpus so they are directly comparable to any other, ensuring broad applicability. The method has been successfully piloted for the analysis of expression data. We performed a pathway fingerprint meta-analysis to establish and verify the stem cell pluripotency pathway signature. This was used to classify cell types and successfully identify additional pluripotent arrays from the GEO expression database. Moving away from small-scale comparisons and towards a literature-wide study of pluripotency has offered a means to resolve an ongoing debate over the pluripotent potential of testis-derived stem cells. We are now in the process of expanding the utility of the pathway fingerprint to GWAS, RNASeq, and epigenetic data to provide a unified method for functional integration within and between these experimental platforms.

Levi Waldron
Research Fellow, HSPH & DFCI

Whole-genome expression profiling of degraded tumor RNA from the NHS/HPFS colorectal cancer cohorts

Over 20 million formalin-fixed, paraffin-embedded (FFPE) tissue blocks per year are routinely stored for cancer patients in the United States, and in particular, are available for cancer patients from long-term epidemiological studies including the Nurses' Health Study and Health Professionals Follow-up Study (NHS/HPFS). Traditional microarray technology requires high-quality RNA from fresh-frozen tissues, and is not appropriate for gene expression profiling of these tissues. The recently introduced cDNA-mediated Annealing, Selection, extension, and Ligation (DASL) microarray assay by Illumina enables whole-genome expression profiling from degraded, clinical FFPE tissues. We present a case study involving more than 1,000 patients from the NHS/HPFS colorectal cancer cohort, demonstrating an end-to-end pipeline for quality control, normalization, and application of the data to tumor subtyping and differential gene expression. We conclude that this technology differs in important ways from the traditional microarray, discuss some of the challenges and opportunities associated with its adoption, and present initial results for the subtyping of colorectal cancer.


Tuesday, December 7, 2010
12:30-2:00, Building 2-Room 426

Elaine Hoffman, Ph.D.
Research Scientist
Department of Biostatistics
Harvard School of Public Health

Applications of Path Analysis & Structural Equation Models in Environmental Health Studies

Path analysis and structural equation models will be presented in the context of environmental health. The Bangladesh-Arsenic project, Italian-Manganese project, and some of my other projects will be used as examples. The statistical application of path analysis and structural equation models and some of the obstacles I have encountered also will be discussed. Path analysis is a statistical model that can account for complex relationships among variables and correlated variables. It is often used when there are suspected causal relationships. Structural equation models are a class of covariance structure models that simultaneously model multiple surrogates of both exposure and outcome. Both path and structural equation models are often shown as path diagrams. These models have been extensively used in the social sciences, and more recently are beginning to be used in environmental epidemiology. This presentation is a Work-in-Progress, with many of the statistical ideas discussed not having been published or completed yet. Software packages will be discussed.

Tuesday, November 2, 2010
12:30-2:00, Building 2-Room 426

Benjamin Haibe-Kains, Ph.D.
Research Fellow
Department of Biostatistics & Computational Biology
Dana-Farber Cancer Institute

Breast Cancer Molecular Subtypes: A Three-gene Model for a Translation Into Clinic

Background: Gene expression studies have well established that breast cancer (BC), in addition to being clinically diverse, is also a molecularly heterogeneous disease. The early studies classified BC into at least three clinically relevant molecular subtypes: basal-like, HER2-enriched, and luminal tumors, with each subtype exhibiting different prognosis and response to therapies. Demonstration of the molecular heterogeneity within BC has changed the way clinicians perceive the disease and has a dramatic impact on the design of new clinical trials.

During the last decade, several methods have been proposed to classify breast cancers into their corresponding molecular subtypes using gene expression. Three versions of a "Single Sample Predictor" (SSP), based on a hierarchical clustering and a nearest centroid classifier using different sets of "intrinsic" genes, have been proposed; however, this method of classification has been shown to be highly unreliable by many investigators. We have previously reported a "Subtype Clustering Model" (SCM), which uses a mixture of three Gaussians with ER, HER2 and proliferation-related gene modules, to estimate probabilities of belonging to each BC molecular subtype. This model has improved classification stability; however, such a method is not so easily implemented clinically.

In this work we developed a novel SCM-based 3-gene classifier for molecular subtyping and we evaluated its concordance, robustness and prognostic value compared with five existing classifiers.

Materials and Methods: We refined the SCM to its simplest form, a classification model that uses only the 3 genes reported to be the main discriminators of the molecular subtypes: ER, HER2 and AURKA (proliferation). We evaluated the concordance and robustness of five previously described classifiers (three SSPs and two SCMs) and the new 3-gene SCM, using gene expression and clinical data from a large compendium of publicly available BC datasets comprising 5,113 primary breast tumors. Clinical relevance was determined from survival analysis of a subset consisting of 1,318 untreated node-negative patients.

Results: SCM-based classifiers, including our 3-gene model, were significantly more robust than all the SSPs (prediction strength > 0.8, p < 0.001); notably, the 3-gene SCM was the most robust classifier for molecular subtypes. Although all models were concordant (Cramer's V = 0.54-0.81, p < 0.001), with basal-like subtype being particularly well defined (median Cramer's V of 0.8), SCMs yielded stronger concordance than SSP models. Overall, SCMs were also more consistent with traditional clinical variables (ER, HER2 status by IHC/FISH and histological grade). All classifications yielded significant and independent prognostic value.

Conclusions: We found significant disparities in robustness of BC molecular subtype classification models. SCMs outperformed SSPs and consistently identified molecular subtypes in numerous datasets derived using various microarray technologies and conducted by different laboratories. Compared with existing models, we propose that a 3-gene SCM-based model is the most reliable and its simplicity could be a significant step towards translation of BC molecular subtyping into the clinic.

Keywords: molecular subtypes, gene expression, clustering, classification, robustness, concordance, prognosis
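
The concordance between classifiers in the Results above is reported as Cramer's V. As a brief illustration, the sketch below computes the standard (uncorrected) Cramer's V from a contingency table of subtype calls made by two classifiers. The function name and the toy tables are hypothetical; the abstract does not say whether a bias correction was applied.

```python
import math


def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows.

    V = sqrt(chi2 / (n * (min(r, c) - 1))), ranging from 0 (no
    association) to 1 (perfect association). No bias correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k < 1:
        raise ValueError("need a non-empty table with at least 2 rows and 2 columns")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Rows: subtype calls from classifier A; columns: calls from classifier B.
    print(cramers_v([[10, 0], [0, 10]]))  # two classifiers in full agreement
    print(cramers_v([[5, 5], [5, 5]]))    # calls unrelated to each other
```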


Tuesday, October 5, 2010
12:30-2:00, Building 2-Room 426

Pinaki Sarder, Ph.D.
Research Fellow
Department of Biostatistics
Harvard School of Public Health

Functional understanding of microbial communities using experimental data integration

To understand the functional and metabolic activities of microbes and microbial communities, it is critical to link genes and proteins to their biological roles. This encompasses both their biochemical activities and the processes and pathways in which they are used by the cell. This problem is typically approached by transferring knowledge to newly sequenced genomes by relying on sequence similarity. This can be a difficult process involving sparse knowledge and the propagation of error, and even the best-studied organisms' genomes are only partially characterized. To mitigate these issues, we have developed a data integration method, TaFTan, that leverages all experimental results available from multiple model systems to identify potential functional roles for genes in a new organism. The performance of TaFTan's genome-wide functional network prediction was evaluated using ~300 experimental datasets from 20 model organisms. This evaluation study demonstrated that TaFTan is able to significantly improve individual organisms' inferred functional networks by transferring knowledge from other experimentally characterized systems.

Tuesday, September 21, 2010
12:30-2:00, Building 2-Room 426 (Biostats Conference Room)

X. Shirley Liu, Ph.D.
Associate Professor of Biostatistics
Departments of Biostatistics and Computational Biology
Harvard School of Public Health & Dana-Farber Cancer Institute

Computational Genomics of Gene Regulation

High-throughput genomics technologies brought a paradigm shift to gene regulation studies, but they also created challenges for data analysis. In this talk, I will highlight two studies conducted in my lab to show how computational and statistical algorithms can help remove the noise in the data, provide informative results, and help design efficient experiments. One study is a model-based analysis of tiling arrays for ChIP-chip peak calling, and the other uses the dynamics of H3K4me2 nucleosomes to infer the in vivo transcription factors and their binding sites driving a biological process.


2009-2010 Working Groups

Tuesday, April 27, 2010
12:30-2:00 PM

Robert Wright, M.D., M.P.H.
Associate Professor
Department of Environmental Health, HSPH
Department of Pediatrics, HMS

A Framework for Measuring Gene-Environment Interactions in Children

This talk will discuss developmental issues unique to children that, when accounted for, will improve estimates of gene-environment interaction, including child-specific pitfalls of case-control and family-based association designs.


Tuesday, March 23, 2010
12:30-2:00 PM
Kresge 201

Marianne Wessling-Resnick, Ph.D.
Professor of Nutritional Biochemistry
Dept of Genetics and Complex Diseases
Director of the PhD Program in Biological Sciences in Public Health
Harvard School of Public Health

Chemical Genetics of Iron Transport

Chemical genetics is an emerging field that takes advantage of combinatorial chemical and small molecule libraries to dissect complex biological processes. Small molecules can act very fast, can be very specific, and can help to distinguish the temporal order of molecular steps and the hierarchical regulation of biological processes. Because small molecules can alter the function of a specific gene product, they can be used in a manner analogous to the use of inducible dominant or homozygous recessive genetic mutations. A large body of biochemical literature is based on the past use of small molecule antagonists that were employed in "reverse chemical genetics" approaches to conditionally eliminate protein function, and on that basis to subsequently identify the target, its mechanism of action, and its regulation. Thus, ouabain helped to define the catalytic cycle of the NaK-ATPase, cytochalasin B was instrumental in defining the molecular basis for insulin's action to stimulate glucose uptake, and analogs of amiloride were used to purify and define the epithelial Na channel. There is a need to develop "forward chemical genetics" in order to discover small molecules that partner with key elements in a pathway of interest. Our goals are to discover small molecule inhibitors of iron transport using chemical genetics and to use these reagents to advance our understanding of the factors, mechanisms, and regulation of different pathways of iron metabolism.

Patrick Loerch
Research Fellow, Computational Biology

Using Networks to Integrate Diverse -omics Datasets and Identify Disease Pathways

The dramatic increase in the application of omics-based platforms (microarrays, GWAS, proteomics, deep sequencing, metabolomics, etc.) in biological research has resulted in the generation of vast public and private data repositories spanning a wide array of diseases, tissues and species. The challenge facing researchers today, commonly referred to as integrative genomics, is figuring out how to integrate, analyze and interpret all of this information within the context of a well-defined biological question. One biological question of particular importance is the identification of genes/proteins that contribute to, and/or are altered as a result of, the onset of a specific disease. As opposed to the gene-centric approach to integrative genomics, which looks for genes that are associated with a disease across multiple omics platforms, we have developed a pathway-centric approach. We hypothesize that the onset of disease involves the altered regulation of specific pathways, which can be triggered by any number of genes or proteins, so long as the end result is the same. Working under this hypothesis, we have developed a network-based approach to integrating various omics datasets with the aim of identifying disease pathways. This approach also allows us to take into account a number of biological realities, such as the regulation of pathway members at various states (from transcription to translation) and the fact that the regulation of some pathway members will simply not be observable through omics technologies. Here we will present the development of this methodology within the context of identifying disease pathways, discuss a specific application/validation of the method, and describe ongoing efforts to further refine this approach.

Tuesday, February 23, 2010
12:30-2:00 PM
Kresge 201

Curtis Huttenhower, Ph.D.
Assistant Professor of Computational Biology and Bioinformatics
Department of Biostatistics
Harvard School of Public Health

Scalable Data Mining for Functional Genomics and Metagenomics

The average human body contains over ten times as many microbial cells as "human" cells. These microbial communities are usually beneficial, but their dysfunction has been linked to conditions ranging from obesity to antibiotic resistant infections. The recent dramatic reduction in the cost of DNA sequencing has opened up several exciting new ways in which we can explore how the human microflora vary across populations and how they can be manipulated to improve human health.

Biological network integration and mining algorithms provide a means of assembling the entire body of cultured microbial genomic data, understanding it from a systems level, and applying it to the study of uncharacterized species and communities. We compare unsupervised and supervised Bayesian approaches to biological network integration; this process provides maps of functional activity and genomewide interactomes in over 100 areas of cellular biology, using information from ~5,000 genome-scale experiments pertaining to 13 microbial species. In combination with graph alignment, these network manipulation tools provide a means for analyzing the functional activity unique to particular pathogens, transferring putative functional annotations to uncharacterized organisms, and potentially inferring interactomes using weighted network integration for metagenomic communities.


Ed Silverman, M.D., Ph.D.
Associate Professor of Medicine
Harvard Medical School
Associate Physician, Brigham and Women's Hospital

Genetic Epidemiology of Chronic Obstructive Pulmonary Disease

Genetic factors are likely important determinants of chronic obstructive pulmonary disease (COPD) susceptibility. Genome-wide association analysis has recently been performed in COPD, and several regions of highly significant association on chromosomes 15 and 4 have been identified. Because COPD is a heterogeneous disease, genetic studies have focused on the identification of distinct subgroups of COPD subjects (COPD subtypes) as well as disease-related conditions (COPD-related phenotypes). Genetic association studies combined with chest CT scan analysis have the potential to lead to substantial new insights into COPD pathogenesis, which could provide important pathways to develop new treatments for COPD.


Tuesday, January 26, 2010
12:30-2:00 PM
Kresge 201

John Quackenbush
Professor of Computational Biology and Bioinformatics

Network and State Space Models: Science and Science Fiction Approaches to Cell Fate Predictions

Two trends are driving innovation and discovery in biological sciences: technologies that allow holistic surveys of genes, proteins, and metabolites and a realization that biological processes are driven by complex networks of interacting biological molecules. However, there is a gap between the gene lists emerging from genome sequencing projects and the network diagrams that are essential if we are to understand the link between genotype and phenotype. ‘Omic technologies such as DNA microarrays were once heralded as providing a window into those networks, but so far their success has been limited, in large part because the high-dimensional data they produce cannot be fully constrained by the limited number of measurements and in part because the data themselves represent only a small part of the complete story. To circumvent these limitations, we have developed methods that combine ‘omic data with other sources of information in an effort to leverage, more completely, the compendium of information that we have been able to amass. Here we will present a number of approaches we have developed, including an integrated database that collects clinical, research, and public domain data and synthesizes it to drive discovery and an application of seeded Bayesian Network analysis applied to gene expression data that deduces predictive models of network response. Looking forward, we will examine more abstract state-space models that may have potential to lead us to a more general predictive, theoretical biology.

Miguel Camargo
Merck Research Laboratories

"Pathway based analysis of whole genome siRNA screens and de novo pathway identification"

A prevailing model of Alzheimer's Disease etiology is the progressive aggregation of toxic Aβ (Abeta) peptides in the brain. Aβ is produced as a result of proteolytic processing of the amyloid precursor protein (APP). Cleavage of APP by beta- and gamma-secretases results in the production of Aβ, and hence several drug discovery efforts are aimed at finding either beta- or gamma-secretase inhibitors. However, development of small molecules against either of these enzymes has proven to be challenging. To overcome the limitations of developing beta- or gamma-secretase inhibitors, we performed large-scale siRNA screens to identify novel regulators of APP processing that could represent more tractable drug targets. The screen measures the production of four APP proteolytic products: the non-amyloidogenic peptide (sAPPα) and the amyloidogenic pathway peptides (Aβ40, Aβ42, and sAPPβ). We introduce a novel analysis method that scores the overall effect of individual pathways on the processing of the APP protein. This method takes into account all genes in the pathway, thus allowing for small effects to be considered, and introduces the concept of scoring 'pathways' as opposed to individual genes as a way of mitigating false positive hits. Using this method, we identified novel and distinct pathways that regulate processing of APP into either amyloidogenic or non-amyloidogenic peptides, respectively. We will also highlight how to leverage biological network data in combination with siRNA screens to identify novel pathways.

Tuesday, December 15, 2009
12:30-2:00 PM
Biostatistics Conference Room (Building 2, Room 426)

Oliver Hofmann, Ph.D.
Research Associate in Biostatistics

Combining curated and in silico interaction data in network analysis

Integrating biological data obtained from multiple high-throughput platforms is an area of active research. While it is possible to normalize for technical differences, laboratory effects and other artifacts, the problem of merging data from different biological samples is still mostly unsolved. Standard methods include rank-based analysis of biological features (mRNA abundance, peptide counts, etc.) rather than absolute measurements, as well as higher levels of abstraction such as Gene Set Enrichment Analysis or the identification of Metagenes. By comparing biological systems at the network level, an additional layer of data abstraction is added, allowing for the identification of connected network areas contributing to a phenotype of interest in heterogeneous samples or between studies. Three different examples highlight current limitations of data availability and quality, identifying possible future methods for even higher levels of data abstraction to analyze complex systems.

Stalo Karageorgi, M.S.
Doctoral Student in Environmental Health

Polymorphisms in HSD17b2 and HSD17b4 and endometrial cancer risk

Hydroxysteroid dehydrogenase 17b (HSD17b) genes encode enzymes that control the last step in estrogen biosynthesis. The isoenzymes HSD17b2 and HSD17b4 in the uterus preferentially catalyze the conversion of estradiol, the most potent and active form of estrogen, to estrone, the inactive form of estrogen. Endometrial carcinoma is a disease strongly linked to the imbalance between the hormones estrogen and progesterone. We hypothesized that variation in single nucleotide polymorphisms (SNPs) in genes HSD17b2 and HSD17b4 may alter the enzyme activity, estradiol levels and risk of disease. Pairwise tagging SNPs were selected from the HapMap CEU database to capture all known common (MAF >0.05) genetic variation in the gene region with an r2 of at least 0.8. SNPs were genotyped in participants in the nested case-control studies in the Nurses' Health Study (NHS) (cases=544, controls=1296) and the Women's Health Study (WHS) (cases=130, controls=389), who provided a blood or cheek cell sample. The association between SNPs and endometrial cancer was examined using conditional logistic regression to estimate odds ratios and 95% confidence intervals adjusted for known risk factors. We additionally investigated whether SNPs are predictive of plasma estradiol and estrone levels in the NHS using linear regression. This is the first study to report on genetic variation in HSD17b2 and HSD17b4 in relation to endometrial cancer.
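
The tagging criterion in this abstract (common variants with MAF > 0.05, captured at r2 of at least 0.8) rests on the pairwise linkage disequilibrium statistic r2. As a minimal sketch, assuming phased two-SNP haplotype counts are available, r2 can be computed as D^2 / (pA(1-pA) pB(1-pB)). The function names and the counts below are hypothetical and are not taken from the study or from HapMap.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise LD r^2 from phased haplotype counts for two biallelic SNPs.

    A/a are the alleles at the first SNP and B/b at the second.
    Hypothetical helper for illustration only.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / n           # frequency of allele A
    p_B = (n_AB + n_aB) / n           # frequency of allele B
    d = n_AB / n - p_A * p_B          # disequilibrium coefficient D
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d * d / denom


def is_tagged(r2, threshold=0.8):
    """True if a candidate tag SNP captures the target at the given r^2."""
    return r2 >= threshold


# Made-up haplotype counts: strong but imperfect correlation.
r2_example = ld_r2(n_AB=45, n_Ab=5, n_aB=5, n_ab=45)
captured = is_tagged(r2_example)
```

With these made-up counts r2 is about 0.64, so the pair would fall short of the 0.8 threshold and the second SNP would need its own tag.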

Tuesday, November 17, 2009
12:30-2:00 PM
Biostatistics Conference Room (Building 2, Room 426)

Zhaoxi (Michael) Wang, M.D., Ph.D.
Research Scientist, Environmental & Occupational Medicine and Epidemiology Program

"Mitochondrial Variations in NSCLC by Microarray-based Resequencing"

Mutations in the human mitochondrial genome (mtDNA) have long been suspected to play an important role in the development of cancer. Although most cancer cells harbor mtDNA mutations, whether such mutations are associated with the clinical prognosis of cancer remains unclear. In this study, we resequenced the entire mitochondrial genomes of tumor tissue from a population of 249 Korean non-small cell lung cancer (NSCLC) patients using the Affymetrix GeneChip Human Mitochondrial Resequencing Array 2.0 (Santa Clara, CA). In early stage (stage I/II) NSCLC, patients with the haplogroup D4 had the worst clinical prognosis. Interestingly, haplogroup D4 was previously reported as a marker for extreme longevity in Japanese.

Alkes Price
Assistant Professor of Statistical Genetics, HSPH

"Effects of Cis and Trans Family Heritability on Single-tissue and Cross-tissue Gene Expression Regulation"

Family heritability is a useful approach for understanding the genetic basis of gene expression. Heritability analyses can evaluate the contribution of both cis and trans regulation by considering genetic relatedness either genome-wide (trans), or at the genomic location proximal to the expressed gene (cis). We used gene expression data from blood and adipose tissue cohorts to estimate the contribution of cis and trans regulation to heritable variation in gene expression. We estimate that cis regulation contributes 45±7% of heritability in blood expression and 30±3% of heritability in adipose tissue expression, with the difference entirely attributable to greater trans effects in adipose tissue. We also conducted a cross-tissue analysis to investigate regulation that is shared across tissues. Strikingly, we observed that cross-tissue regulation is dominated by cis effects. These analyses point to a greater contribution for cis regulation than previous admixture-based analyses. This divergence would be consistent with a substantial role for epigenetic regulation, whose effects are included in heritability analyses but excluded in admixture analyses. Our results have implications for understanding the causes of "missing heritability" in genetic association studies.


Tuesday, September 22, 2009
12:30-2:00 PM
Building 2, Room 426 - Biostatistics Conference Room

Meet & Greet, Monica Ter-Minassian practice talk on Genetic Risk Factors of Neuroendocrine Tumor


Please feel free to contact us with any comments or questions at: