Past PQG Seminar Series


2015-2016 Seminars

The aim of the PQG Seminar Series is to encourage the exchange of ideas and to promote interaction, collaboration, and research in quantitative genomics. It also aims to promote the mission of the PQG, which is to improve health through the interdisciplinary study of genetics, behavior, environment, and medicine. The seminar series covers the development and application of quantitative methods, especially for high-dimensional data, and the training of quantitative genomic scientists.


Tuesday, February 2, 2016
2:00-3:00 PM
Kresge 502

Ivana Bozic
Research Associate
Department of Mathematics
Harvard University

Stochastic Evolutionary Modeling of Cancer Development and Resistance to Treatment

Cancer is the result of a stochastic evolutionary process characterized by the accumulation of mutations that are responsible for tumor growth, immune escape, and drug resistance, as well as mutations with no effect on the phenotype. Stochastic modeling can be used to describe the dynamics of tumor cell populations and obtain insights into the hidden evolutionary processes leading to cancer. I will present recent approaches that use branching process models of cancer evolution to quantify intra-tumor heterogeneity and the development of drug resistance, and their implications for interpretation of cancer sequencing data and the design of optimal treatment strategies.



Monday, February 1, 2016
12:30-2:00 PM

Hector Corrada Bravo
Assistant Professor
Department of Computer Science
University of Maryland

Visualization, Statistical Modeling and Discovery in Computational Epigenomics

The use of epigenomics to study mechanisms in development and disease using high-throughput techniques has been one of the most active areas in life and clinical sciences in the last five years. In this talk, I will present advances in statistical learning methods and data visualization for computational epigenomics and fundamental discoveries of molecular mechanisms in cancer facilitated by these tools.



Thursday, January 28, 2016
2:00-3:00 PM
Building 2, Room 426

a pizza lunch will be provided

Christine Peterson
Postdoctoral Scholar
Department of Health Research and Policy
Stanford University

Statistical Approaches for Making Sense of High-throughput Biological Data

In this talk, I will discuss statistical approaches I have developed to gain insight into the complex networks of regulation and interaction that govern biological systems. Understanding these networks and how they are disrupted by disease is an important step in identifying potential targets for the treatment of disease. Firstly, I will describe my work on the inference of biological networks such as metabolic or protein interaction networks from high-throughput data. In particular, I will address graphical modeling methods I have proposed in the Bayesian framework for inferring such networks based on limited sample sizes, and illustrate the application of these approaches to highlight mechanisms underlying cancer progression. Secondly, I will address the problem of establishing the genetic basis of multivariate traits such as gene expression or other molecular profiling data. Here I propose a multi-stage multiple testing procedure which controls important error rates regarding the discovery of regulatory variants and the association of these variants to traits.



Monday, January 25, 2016
2:00-3:30 PM

a pizza lunch will be provided

Michael I. Love
Postdoctoral Research Fellow, Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute and Department of Biostatistics, Harvard T.H. Chan School of Public Health

Statistical Methods for RNA-seq Data

Quantification of gene expression from RNA sequencing data is a fundamental task in computational biology, critical for projects across biological and biomedical sciences. Statistical analysis of RNA-seq data, such as identification of differentially expressed genes across samples or estimation of isoform abundances, presents new challenges: non-normality of count data, dependence of the variance on the mean, as well as technical artifacts in measurements. In this talk, I will discuss statistical methods I have developed for RNA-seq data, including robust estimators for inference of differential expression and an approach to remove systematic errors in isoform abundance estimates arising from variations in sample preparation.


Thursday, January 21, 2016
12:30-2:00 PM

a pizza lunch will be provided

Manuel Rivas
Broad Institute and Massachusetts General Hospital
Wellcome Trust Centre for Human Genetics Research, University of Oxford

Finding signals in human genome sequencing studies: Data, Models, and Inference

Genome sequencing studies applied to large case-control series, populations or biobanks with extensive phenotyping raise novel analytical challenges and present new opportunities to interrogate the human genome to better understand disease. In this talk I will focus on a special class of genetic variants that are increasingly found in sequencing studies, protein-truncating variants (PTVs), which are typically expected to have a large effect on gene function, are enriched for disease-causing mutations, and in the past few years some have been found to be protective against disease. PTVs, while not the only ones relevant to disease, offer unique insights into likely benefits and risks from therapeutic inhibition of the gene. I will consider recent sequencing efforts in autoimmune diseases and cardiometabolic traits to identify protective PTVs, and discuss the importance of improving our understanding of their functional consequences. Finally, I will introduce a statistical framework, named MRP, for rare variant association studies, that considers correlation, scale, and directionality of genetic effects across a group of 1) genetic variants, 2) phenotypes, and 3) studies. In so doing I am able to present formulations of the framework that consider the use of summary statistic data, the standard univariate and multivariate gene-based models, models for identifying protective protein-truncating variants, or computational algorithms to estimate the underlying mixture of neutral and functional variants from the distribution of rare variants and phenotype, which may provide opportunities for discovery and inference that are not addressed by the traditional one variant-one phenotype association study.
These extensions are critical and poised to take advantage of major biobanking and precision medicine initiatives since we need to understand the full range of medical consequences – good and bad – of variation in a gene in order to confidently generate effective therapeutic hypotheses while recognizing unintended consequences up front rather than after tremendous investment is made.


Tuesday, January 19, 2016
12:30-2:00 PM

James Zou
Postdoctoral Fellow
Microsoft Research New England

Harnessing the unseen for next generation population genomics and epigenomics

Sequencing of large human populations has the potential to transform disease diagnosis and treatment. In order to harness the power of this data avalanche, it is crucial to model and leverage the data and covariates that we do not see. I will illustrate this concept with two examples in genomics and epigenomics, where I developed scalable statistical algorithms with strong mathematical guarantees. I will first discuss my close collaboration with the largest exome sequencing consortium (ExAC) to infer statistical properties of rare and unseen human genetic variations. This work provides a unified framework to quantify the natural selection acting on our genome, annotate functional constraints, and predict the discovery rate of future sequencing projects. In the second part, I will describe complementary work to identify changes in the packing and chemical modifications of DNA—i.e., epigenomic variation—that are associated with diseases. This work requires flexible models of unseen covariates, especially cell-type composition. I will conclude by discussing the general statistical lessons we have learned and new research directions.


Tuesday, December 15, 2015
12:30-2:00 PM

Elin Grundberg
Assistant Professor
Human Genetics
McGill University

Capture the functional (adipose) epigenome for insight into metabolic disease risk

Common diseases such as obesity affect an alarmingly large number of individuals worldwide. Obesity is complex in nature, meaning the disease is caused by multiple underlying factors, of which only 30-40% are believed to be genetic effects. In past years most of these gene regions have been characterized, but how they manifest themselves, interact with the environment or differ from pure environmental effects is still not known. With the development of novel high-throughput DNA sequencing technologies we are now able to screen the millions of sites in our genome that are susceptible to not only genetic but also environmental (‘epigenetic’) modulation. I will discuss our efforts in implementing novel next-generation sequencing-based epigenomic tools and how these technology breakthroughs allow us to extend our knowledge of biological processes associated with common diseases.



Tuesday, November 17, 2015
12:30-2:00 PM

Gad Getz
The Cancer Genome Computational Analysis Group
The Broad Institute of MIT and Harvard

Cancer Genomics and Evolution


Tuesday, October 20, 2015
12:30-2:00 PM
Kresge 502

Scott Carter
Assistant Professor of Computational Biology
Department of Biostatistics
Dana-Farber Cancer Institute
Harvard T.H. Chan School of Public Health

Computational dissection of intra-tumor genetic heterogeneity and applications to the study of cancer treatment, evolution, and metastasis



Tuesday, September 29, 2015
12:30-2:00 PM
Kresge 502

Timothy Rebbeck
Professor of Cancer Epidemiology
Department of Epidemiology
Harvard T.H. Chan School of Public Health

Prediction and Modification of Cancer Risk in BRCA1/2 Mutation Carriers

Inherited mutations in BRCA1 and BRCA2 confer breast and ovarian cancer risks that are substantially higher than in the general population. In the two decades since these genes were identified, we have learned that risks associated with these mutations may be modified by other factors and exhibit substantial genotype-phenotype heterogeneity. While BRCA1/2 are not likely to be representative of all disease predisposing genes in the population, they serve as a paradigm for our understanding of genomic etiology and precision prevention.


Friday, May 15, 2015
12:30-2:00 PM
Kresge G2

Christopher Amos
Professor of Community and Family Medicine
Professor of Genetics
Associate Director for Population Sciences, Norris Cotton Cancer Center
Dartmouth Geisel School of Medicine

Modeling Genome Wide Effects on Cancer Risk

Genome-wide association studies (GWAS) have been a highly effective tool for exploring genetic contributions to complex diseases, but they have failed to explain much of the heritability or total genetic risk associated with cancer development. In this talk I describe ongoing efforts to characterize the features of single nucleotide polymorphisms that are associated with successful replication of GWAS findings. These features indicate which attributes should be considered when designing studies, and in particular when assigning weights for evaluating findings from GWAS and large-scale sequencing studies. Finally, I describe new methods that seek to identify further components of missing heritability that reflect gene-gene and gene-environment interactions. These interactions are modeled using a Bayesian approach. Results from simulation studies and from application to data from association studies of lung cancer show that a model that weakly constrains the prior probabilities that interactions are included generally outperformed, as reflected by root mean squared error and posterior probabilities, both models with stricter constraints and models that imposed no constraints.


Tuesday, May 5, 2015
12:30-2:00 PM

Steve Horvath
Professor, Human Genetics & Biostatistics
University of California, Los Angeles

The Epigenetic Clock and Biological Age

I recently developed a DNA methylation based biomarker of aging known as the "epigenetic clock", which can be used to measure the DNA methylation (DNAm) age of any human (or chimpanzee) tissue, cell type, or fluid that contains DNA (with the exception of sperm). DNA methylation age of blood has been shown to predict all-cause mortality in later life, even after adjusting for known risk factors, which suggests that it relates to the biological aging process. Similarly, markers of physical and mental fitness are also found to be associated with the epigenetic clock (lower abilities associated with age acceleration). These results suggest that we may be close to achieving a long standing milestone in aging research: the development of an accurate measure of tissue age or even biological age. I will present several applications of this measure of tissue age, e.g. obesity and trisomy 21.

1) Horvath S (2013) DNA methylation age of human tissues and cell types. Genome Biology 14:R115. doi:10.1186/gb-2013-14-10-r115 PMID: 24138928
2) Marioni R, Shah S, McRae A, Chen B, Colicino E, Harris S, Gibson J, Henders A, Redmond P, Cox S, Pattie A, Corley J, Murphy L, Martin N, Montgomery G, Feinberg A, Fallin M, Multhaup M, Jaffe A, Joehanes R, Schwartz J, Just A, Lunetta K, Murabito JM, Starr J, Horvath S, Baccarelli A, Levy D, Visscher P, Wray N, Deary I (2015) DNA methylation age of blood predicts all-cause mortality in later life. Genome Biology 16:25. doi:10.1186/s13059-015-0584-6
3) Horvath S, Erhart W, Brosch M, Ammerpohl O, von Schoenfels W, Ahrens M, Heits N, Bell JT, Tsai PC, Spector TD, Deloukas P, Siebert R, Sipos B, Becker T, Roecken C, Schafmayer C, Hampe J (2014) Obesity accelerates epigenetic aging of human liver. Proc Natl Acad Sci U S A. pii: 201412759. doi:10.1073/pnas.1412759111 PMID: 25313081
4) Horvath S, Garagnani P, Bacalini MG, Pirazzini C, Salvioli S, Gentilini D, DiBlasio AM, Giuliani C, Tung S, Vinters HV, Franceschi C (2015) Accelerated Epigenetic Aging in Down Syndrome. Aging Cell. 9 Feb 2015. doi:10.1111/acel.12325 PMID: 25678027

Tuesday, April 14, 2015
12:30-2:00 PM

Hua Tang
Associate Professor, Genetics
Associate Professor (By courtesy), Statistics
Member, Bio-X
Member, Stanford Cancer Institute
Stanford School of Medicine

Learning about the Genetic Architecture of Complex Traits Across Populations

Genome-wide association studies (GWAS) have become a standard approach for identifying loci influencing complex traits. However, GWAS in non-European populations are hampered by limited sample sizes and are thus underpowered. Can GWAS results in one population be exploited to boost the power of mapping loci relevant in another population? The first part of this talk will describe a set of analyses that address the question, “to what extent does the genetic architecture of a complex trait overlap between human populations?” The second part of the talk will introduce an empirical Bayes approach that improves the power of mapping trait loci relevant in a specific minority population by adaptively leveraging multi-ethnic evidence. A case study on plasma lipid concentration will be presented.


Tuesday, December 9, 2014
12:30-2:00 PM
Kresge G2

Brian Browning
Associate Professor
Department of Medicine, Division of Medical Genetics
University of Washington

Haplotype frequency models: what they are and why they matter

Haplotype frequency models are used to estimate the population frequency of a sequence of alleles at tightly-linked loci. These models are used in a wide variety of genetic analyses because they enable analyses to use information from correlated, closely-spaced variants. This talk will describe a new haplotype frequency model that uses relatedness to reduce computational complexity. The new model uses the same graphical model as the Beagle haplotype frequency model, but unlike the original Beagle model, the new model incorporates genetic recombination, genotype error, and identity by descent. We can use this model to estimate haplotypes from unphased genotype data. The new model produces more accurate haplotypes than existing methods, and it requires substantially less computation time than the most accurate existing method.

Tuesday, November 4, 2014
12:30-2:00 PM
Kresge G2

Molly Przeworski
Department of Biological Sciences &
Department of Systems Biology
Columbia University

An Evolutionary Perspective on Human Germline Mutation

The revolution in sequencing technologies has made it feasible to identify de novo mutations in transmissions from parents to offspring, providing an unprecedented opportunity to learn about the genesis and properties of germline mutations. As we show, however, when recent pedigree studies are considered jointly and alongside results from other methodologies, it becomes clear that the pieces of the puzzle do not fit together. We discuss these gaps in our understanding in terms of three sets of interwoven questions: (i) On a mechanistic level, what proportion of mutations is introduced through mistakes in the replication process versus non-replicative, “spontaneous” errors? (ii) In terms of variation among individuals, why do mutation rates depend so strongly on sex and age? (iii) From an evolutionary perspective, how do mating systems and life history traits shape the mutation rate of a species? We present simple mathematical models for the behavior of replication-driven and spontaneous errors over ontogenesis and discuss implications for human genetics and evolutionary biology.


Tuesday, October 14, 2014
12:30-2:00 PM
Kresge G2

Hongzhe Li
Departments of Biostatistics & Epidemiology
University of Pennsylvania - Perelman School of Medicine

Microbiome, Metagenomics and High-Dimensional Compositional Data Analysis

Next-generation sequencing technologies allow 16S ribosomal RNA gene surveys or whole metagenome shotgun sequencing in order to characterize taxonomic and functional compositions of gut microbiomes. The outputs from such studies are short sequence reads derived from a mixture of genomes of different species in a given microbial community. We first present a brief overview of the statistical methods we used for 16S rRNA data analysis. We then introduce a multi-sample model-based method to quantify the bacterial compositions based on shotgun metagenomics data using species-specific marker genes. The resulting data are high-dimensional compositional data, which complicate many of the downstream analyses. We introduce GLMs with a linear constraint on the regression parameters in order to identify the bacterial taxa that are associated with clinical outcomes, and a composition-adjusted thresholding procedure to estimate the correlation network from compositional data. We demonstrate the methods using two ongoing gut microbiome studies at the University of Pennsylvania.


Tuesday, September 16, 2014
12:30-2:00 PM
Kresge G2

Soumya Raychaudhuri

Assistant Professor of Medicine, Harvard Medical School
Divisions of Genetics & Rheumatology
Department of Medicine, Brigham and Women's Hospital

Disentangling effects of colocalizing genomic annotations to functionally prioritize non-coding variants within complex trait loci



2013-2014 Seminars

Thursday, May 6, 2014
12:30-2:00, FXB G13

Steven McCarroll
Professor in the Genetics Department
Harvard Medical School

Where is the rest of the human genome?

Whole-genome sequencing is increasingly used to search for genetic variants underlying human disease. In this seminar, I want to describe ways in which every sequencing experiment can also be used to teach us surprising things about how genomes work in everyone. First, there are large amounts of human genome sequence that are missing from maps of the human genome – but using a combination of mathematics and historical mixtures of human populations, we can figure out where these genes have been hiding and how they have remained hidden from view. Second, some regions of the human genome segregate in many different structural forms within human populations, and appear to contribute to biological variation among humans. Third, we can use whole genome sequence data to study active processes of DNA replication in human cells, with surprising findings about how DNA replication varies from person to person.

Thursday, April 17, 2014
12:30-2:00, FXB G12

Ben Yung
Head of Department of Health Technology and Informatics
Chair Professor of Biomedical Science
The Hong Kong Polytechnic University

A multidisciplinary research on cancer and its metabolic risk factor: from computational characterization, functional discovery to clinical diagnostics development

Nucleophosmin 1 (NPM1) was first identified as a nucleolar phosphoprotein and subsequently shown to be highly expressed in the granular region of the nucleolus. NPM1 increases rapidly in response to mitogenic stimuli and elevated expression of NPM protein is detected in highly proliferating and malignant cells. In addition, NPM has proven to be a multifunctional protein involved in many cellular activities including ribosomal biogenesis, centrosome duplication and transcription regulation. As our understanding of NPM1 has increased, more complex mechanisms in cancer cells will be revealed.

Distributions of expressional correlations over neoplastic and normal states reveal structural difference at a threshold, which defines a strongly co-expressed gene network with the best coherence with neoplasm. By such novel structural co-expression analysis, genome-wide co-expression in normal state was found to be stronger than that in chronic myelogenous leukemia (CML). Conversely, more links between NPM1 and BCR-ABL-related pathway were noted in CML. Normal-specific network showed dissociation of NPM1 with ribosomal proteins (RP) while CML-specific co-expressions rendered a large network connecting NPM1 to RP genes through RPL10A, RPL31 and RPL36A. Our results implicated a critical role of NPM1 in joining a cascade of ribosomal biogenesis, protein synthesis, cell proliferative and anti-apoptotic events in CML. Furthermore, we speculated that NPM1 and its co-expressed genes may be illegitimately activated in CML, as inferred by their positive expressional correlations and targeted by the same transcription factor set. This novel network analysis platform can also be applied to other cancers in order to discover promising markers for diagnostic, prognostic and therapeutic applications. Furthermore, we postulate that for gene networks that are transcriptionally dis-coordinated in cancer, their methylation states may have already been altered before the disease onset. To reveal such molecular association, we aim to study the differential methylation patterns of metabolic syndromes for cancer gene co-expression networks that will be identified in the project.

Tuesday, April 15, 2014
12:30-2:00, FXB G13

Eli Stahl

Assistant Professor, Psychiatry
Assistant Professor, Genetics and Genomic Sciences
Mount Sinai Icahn School of Medicine

Rare Variant Genetic Architecture of Schizophrenia and Bipolar Disorder

Tuesday, March 11, 2014
12:30-2:00, Kresge G2

Lior Pachter
Raymond and Beverly Sackler Chair in Computational Biology
Director of the Center for Computational Biology
Professor of Molecular and Cell Biology, Mathematics, and Computer Science
University of California, Berkeley

Making sense of RNA-Seq

RNA-Seq has become one of the primary applications for high-throughput sequencing, providing an unprecedented view of the dynamics of transcriptomes in a wide variety of organisms, tissues and settings. I will discuss recent technological developments in RNA-Seq and the implications they have for analysis and interpretation. Using examples from my own research, I will show how transcriptomics is now tractable at the isoform level, and how it is shedding light on previously unsolved problems in developmental biology and population genomics.

Tuesday, February 18, 2014
12:30-2:00, Kresge G2

Laura Lazzeroni
Associate Professor (Research) of Psychiatry and Behavioral Sciences
Stanford University School of Medicine

Interpretation of P-values: Uncertainty, Estimation and Replication

Scientists often use the p-value as a measure of evidence in high-dimensional analyses such as genome wide association studies (GWAS). I will present some ideas about interpreting p-values and utilizing the information they provide. Topics to be discussed include:

· The p-value as an estimator.
· P-values and replication.
· Designing GWAS follow-up studies.
· Selection bias corrections to offset the “winner’s curse”.
· Comparing evidence from independent SNPs.

My collaborators on this research are Ilana Belitskaya-Levy and Ying Lu.

Tuesday, January 28, 2014
12:30-2:00, Kresge G2

Jun Liu, Ph.D.
Professor of Statistics, Department of Statistics, Harvard University
Professor in the Department of Biostatistics, HSPH

Detection and Expansion of Gene Modules Based on Evolutionary History

Availability of genome sequences from diverse organisms provides a special opportunity to chart the evolutionary history of genes of interest. Such analyses provide insights into evolutionary pressures driving gene retention or loss and help to predict gene function based on correlated evolution. A major challenge in defining the phylogeny of a pathway, however, lies in the fact that its members typically do not exhibit a single, coherent ancestry, but rather, comprise a mosaic of evolutionary gene modules, each with a distinct history. We introduce a new computational method for automated detection and expansion of such modules in eukaryotes. Our method, called CLIME (clustering by inferred models of evolution), accepts as input a predefined species tree, a homology matrix, and a gene set of interest. CLIME partitions the input gene set into disjoint modules, simultaneously learning the number of modules and an evolutionary model that defines each module. Using these modules CLIME scores all genes in the genome for the likelihood of having emerged under a module’s inferred history, thereby expanding its membership. We applied CLIME to a tree of life consisting of 138 eukaryotic organisms. CLIME faithfully recovers known evolutionary modules within mitochondrial complex I, the calcium uniporter, and cilia while yielding new predictions. We have also applied it systematically to over 1000 classically defined human pathways, as well as the entire proteomes of yeast, red algae, and malaria. The results reveal unanticipated evolutionary modularity and novel, co-evolving components within many well-studied pathways. CLIME should become increasingly useful with the growing wealth of genome sequences from highly diverse organisms.

Based on joint work with Yang Li, Sarah E. Calvo, Roee Gutman, and Vamsi Mootha


Tuesday, December 17, 2013
12:30-2:00, FXB G13

Raphael Gottardo
Fred Hutchinson Cancer Research Center
Vaccine and Infectious Disease Division
Public Health Sciences Division

Characterizing Antigen-specific T-cell Poly-functionality Using Single-cell Assays

Cell populations in blood and tissue are not homogeneous; even clonotypes of individual cells can exist in different biochemical states that define measurable functional differences between them. This single-cell heterogeneity is informative, but it is lost in assays that measure cell mixtures. Recent technical advances such as cytometry and multiplexed microfluidics have enabled the high-throughput quantification of genes or proteins at the single-cell level. Although many analytic tools exist for analyzing high-dimensional data, such as data from gene expression arrays, none have been developed specifically for the analysis of single-cell data, which has its own bioinformatics and statistical challenges. During this talk I will give an overview of statistical challenges involved in the analysis of single-cell data and show how such technologies can be used to characterize antigen-specific T-cells.


Tuesday, November 12, 2013
12:30-2:00, FXB G13

Robert Plenge
Vice President, Head of Genetics and Pharmacogenomics
Merck Research Laboratories

Human Genetics for Target Validation in Drug Discovery

More than 90% of the compounds that enter clinical trials fail to demonstrate sufficient safety and efficacy to gain regulatory approval. Most of this failure is due to the limited predictive value of preclinical models of disease, and our continued ignorance regarding the consequences of perturbing specific targets over long periods of time in humans. ‘Experiments of nature’ — naturally occurring mutations in humans that affect the activity of a particular protein target or targets — can be used to estimate the probable efficacy and toxicity of a drug targeting such proteins, as well as to establish causal rather than reactive relationships between targets and outcomes. In my talk I will describe the concept of dose–response curves derived from experiments of nature, with an emphasis on human genetics as a valuable tool to prioritize molecular targets in drug development. I will discuss empirical examples of drug–gene pairs that support the role of human genetics in testing therapeutic hypotheses at the stage of target validation, provide objective criteria to prioritize genetic findings for future drug discovery efforts and highlight the limitations of a target validation approach that is anchored in human genetics. Further, I will emphasize how human genetics can be used to uncover critical biological pathways, and how these pathways can be used to guide drug discovery.


12:30-2:00, FXB G13

William Cookson
Professor of Genomic Medicine
Faculty of Medicine, National Heart & Lung Institute
Imperial College London

Epigenetic Association Mapping: the Mother of all Information

The extent and determinants of epigenetic variation in the human genome are not systematically understood, hindering the application of genome-wide studies of epigenetic status for common complex diseases. By studying DNA from peripheral blood lymphocytes (PBL) in families we have used segregation analyses to quantify the heritable and environmental components of CpG Island (CGI) methylation. We find CGI methylation captures cell-specific genomic responses to factors that are largely environmentally derived. We have identified reproducible CGI associations that account for 20% of variation in the total serum IgE (an important quantitative trait underlying asthma and allergy), and attributed this variation to circulating eosinophils. Genome function is mediated through interactive networks of genes and regulatory elements, and we have discovered the presence of strongly co-ordinated regulation of CGI in the form of 30 scale-free correlation networks (meQTN). Enrichment analysis identified meQTN modules that could be attributed to peripheral blood neutrophils, lymphocytes, monocytes and eosinophils. These and other modules were enriched for CGI associated with asthma and the total serum IgE. Although DNA is usually considered a static repository of information, our studies suggest genome-wide mapping of CGI may directly and deeply inform on mechanisms of diverse diseases.


Tuesday, September 17, 2013
12:30-2:30, Kresge G2

Alon Keinan
Assistant Professor
Department of Biological Statistics & Computational Biology
Cornell University

Recent human population growth: rare variants, mutation load, and complex disease

Human populations have experienced recent explosive growth since the Neolithic revolution. We demonstrate how such growth predicts an abundance of rare variants, and show that it has not been captured by earlier demographic modeling studies, mostly due to small sample size. Recent studies that sequenced a very large number of individuals observed an extreme excess of rare variants, and provided clear evidence of recent population growth, though demographic estimates have varied greatly among studies. These studies were based on protein-coding genes, in which variants are also impacted by natural selection. Hence, we introduce new targeted sequencing data for studying recent human history with minimal confounding by natural selection. Our models of recent demographic history, based on the distribution of allele frequencies in these data, fit very well and shed light on the discrepancies among recent studies. Another important question is how negative selection operates during a recent epoch of rapid population growth, when the population is not at equilibrium. We examined the trajectories of mutations with different fitness effects using computer simulations and conclude that each individual carries slightly more deleterious alleles than expected in the absence of growth, but the average fitness effect of these alleles is less deleterious. Combined, our results point to an increased load of rare variants with small effect size playing a role in the individual genetic burden of complex disease risk.



2012-2013 Seminars

Tuesday, May 7, 2013

Franziska Michor, Ph.D.
Associate Professor
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
Department of Biostatistics, Harvard School of Public Health

Evolution of a Cancer Genome

Cancer emerges due to an evolutionary process in somatic tissue. The fundamental laws of evolution can best be formulated as exact mathematical equations. Therefore, the process of cancer initiation and progression is amenable to mathematical investigation. Current areas of research of the lab include cancer stem cells, evolution of drug resistance, and the dynamics of metastasis formation. In this talk I will introduce two examples of the application of evolutionary theory to cancer genomics and treatment.

Tuesday, April 30, 2013

Jonathan Pritchard
Professor, Department of Human Genetics
The University of Chicago

The genetic basis of human gene expression variation

Genetic variants that impact gene regulation likely play a central role in both evolution and the genetics of complex traits. Yet the mechanisms by which they do so are poorly understood and it remains very difficult to predict which variants have regulatory effects in any given cell type. In this talk I will describe work that our group has done on mapping regulatory variants such as expression QTLs to understand the primary mechanisms by which such variants act. I will discuss both our work on analytical methods and the biological results.

Tuesday, April 9, 2013
Kresge G3

Douglas M. Robinson
Senior Associate Director of Biostatistics
Novartis Institute for Biomedical Research, Inc.

The predictive value of a 5-gene signature as a patient pre-selection tool in medulloblastoma for Hedgehog pathway inhibitor therapy

Medulloblastoma (MB), an invasive primitive neuroectodermal tumor of the posterior fossa, is the most common brain tumor in children, comprising ~20% of childhood and <2% of adult brain tumors. Current standard of care treatment, surgery followed by craniospinal radiation and chemotherapy, can lead to significant long term toxicities, especially in very young patients. At the time of relapse, no standard salvage therapy exists. Therefore, targeted therapies are needed. Several studies have used gene expression profiling to identify distinct molecular subgroups of MB, including one characterized by activated Hedgehog (Hh) signaling. Using available gene expression data, a 5-gene Hh signature that can be assayed in formalin-fixed paraffin-embedded (FFPE) samples by standard RT-PCR was identified.

Two sets of matched fresh frozen and FFPE MB specimens were used; one for development of the 5-gene signature and one for its independent validation. Hh activation status was determined in fresh frozen samples by gene expression profiling using the GeneChip human genome U133 Plus 2.0 array (Affymetrix, Santa Clara, CA) and in FFPE samples by RT-PCR analysis.

The 5-gene Hh signature was selected from a larger panel of 73 genes that were associated with the Hh subgroup classification, as determined by standard Affymetrix gene expression profiling. Eighteen of these genes shown to be differentially expressed in FFPE were chosen for the RT-PCR gene card that formed the basis of the Elastic Net model building exercise. Based on the expression levels of the 5-gene signature, a predictive model was used to compute a propensity score (0–100%) representative of the Hh activation status of each tumor sample. The median propensity score for the 17 non-Hh-activated tumors was 0.7% (range: 0.1–3.0%) compared to 87.9% (range: 69.1–97.6%) in the eight Hh-activated tumors. Hh activation status of 25 independent MB samples defined by the 5-gene signature and assayed by RT-PCR was in 100% agreement with the Hh activation status determined by gene expression profiling. In order to determine the predictive value of this assay as a tool to identify patients who might benefit from treatment with a Hh pathway inhibitor, MB samples from patients (n=13) enrolled in recent phase I trials of the Smoothened inhibitor LDE225 were analyzed and correlated with the respective tumor responses. Using the 5-gene signature, all patients (n=4) who responded to LDE225 treatment (PR or CR) were found to have Hh-pathway activated tumors, whereas all patients who did not respond (n=9) were found to have Hh non-activated tumors. These results suggest an association between Hh activation status determined by the 5-gene Hh signature and tumor response to LDE225 treatment. Data from an ongoing phase I/II trial in pediatric patients will enable determination of the predictive value of this patient pre-selection assay.

Tuesday, March 12, 2013
12:30-2:00, KRESGE G2

Francis Ouellette
Associate Director
Informatics and Bio-computing
Ontario Institute for Cancer Research

Processing cancer genomic data at the Ontario Institute for Cancer Research for the International Cancer Genome Consortium

The goal of the International Cancer Genome Consortium (ICGC) is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe. This goal is well under way, and we plan to complete this in the next few years. The Ontario Institute for Cancer Research (OICR) is active with the ICGC on many fronts: it is involved in generating data for 2 of the 50 different tumour types, it is the executive headquarters for the ICGC, and it is also the home for the Data Coordinating Center (DCC). I will present on this, and how this integrates with some of The Cancer Genome Atlas (TCGA) activities and some of the work we have done in my group on using data from the ICGC.


Tuesday, February 19, 2013
12:30-2:00, KRESGE G2

Benjamin Neale, PhD
Assistant Professor, Analytic and Translational Genetics Unit - Mass General Hospital
Instructor, Harvard Medical School
Associated Researcher, The Broad Institute

Analytic Issues in the Assessment of Rare Variation

With advances in sequencing and genotyping technology, human genetics is now capable of assaying rare variation. In this talk, analytic issues associated with assessing the impact of rare variation on complex traits will be described. Specifically, the asymptotic properties of common association tests will be described. In addition, a comprehensive overview will be given of how to model the probability of de novo mutation and how to leverage this to assess the strength of evidence for a given gene.



Tuesday, January 29, 2013
12:30-2:00, KRESGE G2

Brad Bernstein, MD, PhD
Associate Pathologist, Massachusetts General Hospital
Associate Professor of Pathology, Harvard Medical School
Early Career Scientist, Howard Hughes Medical Institute
Senior Associate Member, Broad Institute

Global Epigenetic Regulation of Normal and Malignant Cells

We are using high-throughput sequencing technology to characterize chromatin state and regulation in normal and cancer stem cells. The data enable systematic annotation of proximal and distal gene regulatory elements, their cell type-specificities and their functional interactions. We show how these approaches can be used to interpret genetic variants associated with human disease, as well as to reconstruct transcriptional regulatory networks in cancer stem cells. Finally, we will present characterizations of chromatin regulator proteins that suggest strategies for modulating genome function and cell state with chemical inhibitors of chromatin enzymes.


Tuesday, December 11, 2012
12:30-2:00, KRESGE G2

Cecilia Lindgren
University Research Lecturer, Group Head / PI and Fellow
Nuffield Department of Medicine
The University of Oxford
The Broad Institute

Loci Associated with Fat Distribution Have Complex Sex-combined and Sex-specific Effects

Waist-hip ratio (WHR) is a heritable measure of body fat distribution and a significant predictor of metabolic and cardiovascular risk independent of overall obesity, as measured by body mass index (BMI). We performed sex-combined fixed effects inverse variance meta-analysis for BMI-adjusted WHR on 142,762 individuals from 57 genome-wide association studies and 67,325 individuals from 28 studies genotyped on the Metabochip. Given previous reports of sexually dimorphic genetic effects on fat distribution, we also performed male-specific (N=93,482) and female-specific (N=116,742) meta-analyses. The sex-combined analysis identified 39 loci (25 novel) associated with BMI-adjusted WHR at genome-wide significance (P<5×10⁻⁸), and an additional nine female-specific loci. Twelve of the sex-combined loci showed significantly different sex effects, with seven having an effect only in females, four loci with stronger effects in females than males, and one locus with effects only in males. The enrichment of female-specific associations is consistent with much higher estimated heritability of BMI-adjusted WHR in women (h²~46%) compared to men (h²~19%). We used GCTA to perform approximate conditional analysis, estimating the linkage disequilibrium between SNPs using combined GWAS and Metabochip genotype data from 949 Swedish individuals from the PIVUS study. Nine loci harbor multiple signals of association at genome-wide significance. These loci show complex patterns of sexual dimorphism in genetic effects. For example, at the VEGFA locus both association signals are stronger in women than men (1df test of heterogeneity between sex-specific allelic effects and 2.2×10⁻²). In contrast, at the WARS2/SPAG17 locus, we observed one male-specific, one female-specific, and two sex-combined association signals.
Our results highlight the importance of sex-specific analyses, conditional analysis, and fine mapping to fully elucidate the complex genetic architecture and biological mechanisms of body fat distribution.
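
As a rough illustration of the fixed effects inverse variance meta-analysis and the 1df sex-heterogeneity test mentioned in the abstract, the following minimal Python sketch may help. It is not the authors' pipeline: the function names and the toy numbers are hypothetical, and the heterogeneity statistic assumes the male and female estimates come from independent samples.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effects inverse-variance meta-analysis of per-study effects.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled effect and its standard error.
    """
    weights = [1.0 / se ** 2 for se in ses]  # weight = 1 / variance
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled_beta, pooled_se


def sex_heterogeneity_z(beta_f, se_f, beta_m, se_m):
    """Z statistic for a 1 df test of a difference between sex-specific effects.

    Assumption (not from the abstract): the two estimates are independent.
    """
    return (beta_f - beta_m) / math.sqrt(se_f ** 2 + se_m ** 2)


# Toy numbers, purely illustrative.
pooled_beta, pooled_se = inverse_variance_meta([0.1, 0.3], [0.1, 0.1])
z_het = sex_heterogeneity_z(0.3, 0.1, 0.1, 0.1)
```

With equal standard errors the pooled estimate is simply the mean of the study effects; studies with smaller standard errors would otherwise pull the pooled estimate towards themselves.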

Tuesday, November 13, 2012
12:30-2:00, KRESGE G2

Peter Laird
Director, USC Epigenome Center
Professor of Surgery
Professor of Biochemistry & Molecular Biology
Keck School of Medicine
USC / Norris Comprehensive Cancer Center

Exploring the Cancer Methylome

Cancer develops not only as a result of genetic mutations and genomic rearrangements, but also as a consequence of numerous epigenetic alterations, including extensive changes in the distribution of DNA methylation throughout the genome. DNA methylation changes contribute directly to cancer by transcriptional silencing of tumor-suppressor genes through promoter CpG island hypermethylation. Broad epigenomic analysis of human tumors can reveal relationships between large numbers of epigenetic events and can provide insight into the mechanisms underlying concerted epigenetic change. Genomic loci targeted by Polycomb Group Repressors in embryonic stem cells, and involved in cellular differentiation are predisposed to aberrant DNA methylation in cancer cells, suggesting that an epigenetic block to cellular differentiation may sometimes be an initiating event in carcinogenesis. The very strong associations between distinct epigenetic subtypes, such as CpG Island Methylator Phenotypes (CIMP) and specific somatic genetic events, such as BRAF mutation in colorectal cancer and IDH1 mutation in glioblastoma multiforme are consistent with an early role for DNA methylation alterations, providing a favorable cellular context for the subsequent somatic mutation. The analysis of whole methylomes at single-basepair resolution reveals that cancer-associated changes occur differentially across defined regions of the genome associated with the nuclear lamina. It is apparent that epigenomic analysis is essential for a full understanding of the relationship between alterations in the cancer genome and the origin and clinical diversity of individual tumors.

Tuesday, October 16, 2012
12:30-2:00, KRESGE G2

Mark Gerstein
Albert L. Williams Professor of Biomedical Informatics,
Molecular Biophysics & Biochemistry and Computer Science

Human Genome Annotation

My talk will be concerned with the analysis of networks and the use of networks as a "next-generation annotation" for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression. Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the "middle-managers" acting as information-flow bottlenecks and with more "influential" TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).

Architecture of the human regulatory network derived from ENCODE data.
Gerstein et al. (2012). Nature 489: 91.

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.
KY Yip et al. (2012). Genome Biol 13: R48.

Understanding transcriptional regulation by integrative analysis of transcription factor binding data.
C Cheng et al. (2012). Genome Res 22: 1658-67.

The GENCODE pseudogene resource.
B Pei et al. (2012). Genome Biol 13: R51.

Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.
KK Yan et al. (2010). Proc Natl Acad Sci U S A 107: 9186-91.

Tuesday, September 18, 2012
12:30-2:00, KRESGE G2

Dan L. Nicolae
Associate Professor
Departments of Medicine and Statistics
The University of Chicago

Next-generation genetic association studies - some challenges and opportunities

There are many challenges in the transition from genome-wide association studies to whole exome and to whole genome sequencing investigations. We have moved from single-marker tests on common SNPs to set-based inference on all variants in a functional element, and this has led to the development of many statistical tools. I will present a framework for the analysis of sequence data where we harness population genetics theory to provide prior information on effect sizes that allows a general and powerful test for association. I will also discuss some of the challenges in this transition, including the implicit and explicit assumptions on underlying genetic models of risk for a given set, and the interpretation of results.


2011-2012 Seminars

Tuesday, May 1, 2012
12:00-1:30, KRESGE 502

Neil Risch
Director, Institute for Human Genetics
Professor, Division of Biostatistics, Department of Epidemiology and Biostatistics
University of California, San Francisco

Genetic Epidemiology Research on Aging in a Cohort of 100,000

Through an ARRA Grand Opportunity Award, we have recently completed genome-wide genotyping and telomere length analyses on a cohort of over 100,000 individuals who are participants in the Research Program on Genes, Environment and Health at Kaiser Permanente, Northern California. The goal is to link the genomic information to extensive clinical, laboratory, radiologic, pathology and pharmacy information contained in Kaiser databases for extensive genetic-epidemiologic analysis. The cohort has mean age 65, with membership spanning over 20 years on average, providing extensive longitudinal health information. Information on environmental and behavioral risk factors has been obtained through geo-coded linkage to state and federal environmental/social databases, and from survey self-report. The cohort is multi-ethnic and contains numerous first, second and third degree relatives, allowing for a variety of genetic epidemiologic analyses on a large scale, to better understand the genetic and environmental contributors to age-related disease and healthy aging.

Tuesday, April 24, 2012
12:30-2:00, FXB G12

Curtis Huttenhower
Assistant Professor of Computational Biology and Bioinformatics
Department of Biostatistics
Harvard School of Public Health

Reducing microbial unemployment: functional roles in the human microbiome

Among many surprising insights, the genomic revolution has helped us to realize that we're never alone and, in fact, barely human. For most of our lives, we share our bodies with some ten times as many microbes as human cells; these are resident in our gut and on nearly every body surface, and they are responsible for a tremendous diversity of metabolic activity, immunomodulation, and intercellular signaling. In order to understand these microbes' relationship with their hosts, however, we must establish how homeostasis is maintained in health or dysregulated in disease. I will present an overview of microbial metabolism and function core to the healthy human microbiome and a survey of microbes that cooperate and compete to fulfill these metabolic roles. Since even bacteria within the same "species" regularly carry strikingly different genomes, it is critical to identify community membership at the species or strain level whenever possible. Finally, I will discuss how metabolic function normally present in the gut microbiota is disrupted in inflammatory diseases such as Crohn's disease and ulcerative colitis.


Tuesday, March 20, 2012
12:30-2:00, FXB G12

John Novembre
Assistant Professor
Department of Ecology and Evolutionary Biology
Interdepartmental Program in Bioinformatics
University of California, Los Angeles

Insights to recombination and rare variants from large-scale human polymorphism data

Population structure can be a key factor in shaping patterns of genetic variation in a sample. Depending on the end goals while analyzing genetic data, it may be a primary focus, a problematic nuisance, or a useful tool that can give insight to another process. This talk will highlight two recent projects that center around the concept of population structure and ancestry inference. The first involves the use of chromosomal-level ancestry as a tool to estimate genome-wide recombination intensity maps. The second addresses how rare variants are distributed geographically in a deep sequencing project based on 202 genes sequenced in >14,000 human individuals. The analysis of the deep resequencing project also raises larger questions about the extent of purifying selection in humans and the importance of population growth for understanding patterns of genetic diversity in large samples.



Tuesday, February 14, 2012
12:30-2:00, Kresge G2

Xihong Lin
Professor of Biostatistics
Department of Biostatistics
Harvard School of Public Health

Design and Analysis of Whole Exome (Genome) Sequencing Association Studies

Sequencing studies, such as targeted, whole exome and whole genome sequencing studies, are increasingly being conducted to identify rare variants that are associated with complex traits. Design and analysis of such population based sequencing association studies face many challenges. The talk has three parts. I will first provide an overview of several methods for studying rare variant effects, including burden tests, SKAT and optimal unified tests. Analysis pipelines for whole exome sequencing association studies, such as filtering criteria and small sample adjustments of statistical methods, will be discussed. In the second part of the talk, I will discuss designs of sequencing association studies, such as sample size and power calculations, and pros and cons of extreme phenotype sampling and analysis strategies for extreme phenotype sequencing studies. In the last part of the talk, I will discuss the performance of imputation using GWAS data for studying rare variant effects. Simulation studies and real data will be used to illustrate the results.



Tuesday, December 6, 2011
12:30-2:00, Kresge G2

Matthew Stephens
Professor, Department of Human Genetics and Department of Statistics
The University of Chicago

A Unified Framework for Association Analysis of Multiple Related Phenotypes

In many ongoing genome-wide association studies, multiple related phenotypes are available for testing for association with genetic variants. In most cases, however, these related phenotypes are analysed independently from one another. For example, several studies have measured multiple lipid-related phenotypes, such as LDL-cholesterol, HDL-cholesterol, and triglycerides, but in most cases the primary analysis has been a simple univariate scan for each phenotype. This type of univariate analysis fails to make full use of potentially rich phenotypic data.

While this observation is in some sense obvious, much less obvious is the right way to go about examining associations with multiple phenotypes. Common existing approaches include the use of methods such as MANOVA, canonical correlations, or Principal Components Analysis, to identify linear combinations of outcomes that are associated with genetic variants. However, if such methods give a significant result, these associations are not always easy to interpret. Indeed the usual approach to explaining observed multivariate associations is to revert to univariate tests, which seems far from ideal.

In this work we outline an approach to dealing with multiple phenotypes based on Bayesian model averaging. The method attempts to identify which subset of phenotypes is associated with a given genotype. In this way it incorporates the null model (no phenotypes associated with genotype); the simple univariate alternative (only one phenotype associated with genotype) and the general alternative (all phenotypes associated with genotype) into a single unified framework. In particular our approach both tests for and explains multivariate associations within a single model, avoiding the need to resort to univariate tests when explaining and interpreting significant multivariate findings. We illustrate the approach on examples, and show how, when combined with multiple phenotype data, the method can improve both power and interpretation of association analyses.

Tuesday, November 15, 2011
12:30-2:00, Kresge G2

Alex Meissner, Ph.D.
Harvard University Department of Stem Cell and Regenerative Biology
Broad Institute

DNA Methylation Dynamics in Stem Cells and Development

Tuesday, October 4, 2011
12:30-2:00, Kresge G2

a pizza lunch will be provided

Daniel J. Schaid, Ph.D.
Professor of Biostatistics, Mayo Clinic

Enhancing Analysis of Genome Wide Association Studies with Gene Ontology and other Structures for Gene Sets

Genome wide association studies (GWAS) measure hundreds of thousands of genetic markers (single nucleotide polymorphisms, SNPs) on large numbers of diseased cases and non-diseased controls, with analyses tending to focus on the association of single SNPs with disease status. Although many GWAS find that the associations of SNPs with disease status tend to be small, with odds ratios of 1.25 to 1.5, and that modeling of multiple SNPs has limited ability to accurately predict disease status, the greatest benefit of GWAS might be new leads towards complex genetic pathways. This benefit can be enhanced by using prior information about how genes work together in biological pathways in order to form groups of genes: a group of genes with modest effect might be more powerful than individual genes or single SNPs. This presentation will discuss general strategies for analyzing sets of SNPs, or sets of genes, as well as newly developed computational and statistical methods that use information from the publicly available Gene Ontology (GO). The GO provides standardized representations of gene and gene product attributes across species and databases, and achieves this by a controlled vocabulary. Specifically, GO structures details about genes in a directed acyclic graph, such that specific details are linked to more general details. This provides a natural way to recursively create gene sets, from highly-specific small sets of genes to very general large sets of genes. By mapping SNPs from a GWAS to genes, and then mapping genes to the GO structure, we are able to scan the entire GO structure in terms of gene sets, to seek the set with the most extreme statistic for the association of the gene set with disease. Computational and statistical methods will be emphasized in this presentation, along with applications to several GWAS data sets. Strengths and limitations of our approach will be discussed, as well as future research directions.
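
The idea of building gene sets recursively from a directed acyclic graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's method: the term names, gene names and the `gene_sets_from_dag` helper are all hypothetical, and real GO data would first have to be loaded from the ontology and annotation files.

```python
def gene_sets_from_dag(parents, direct_genes):
    """Build a gene set for every term in a DAG.

    parents: dict mapping a term to the list of its more general parent terms.
    direct_genes: dict mapping a term to the genes annotated directly to it.
    Each term's set is the union of genes annotated to it or to any descendant,
    so specific terms yield small sets and general terms yield large ones.
    """
    gene_sets = {}
    for term, genes in direct_genes.items():
        stack, seen = [term], set()
        while stack:  # walk upwards from the term to all of its ancestors
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            gene_sets.setdefault(current, set()).update(genes)
            stack.extend(parents.get(current, ()))
    return gene_sets


# Hypothetical toy ontology: two specific terms under one general term.
toy_parents = {
    "specific_a": ["general"],
    "specific_b": ["general"],
    "general": ["root"],
}
toy_annotations = {
    "specific_a": {"G1"},
    "specific_b": {"G2", "G3"},
    "general": {"G4"},
}
toy_sets = gene_sets_from_dag(toy_parents, toy_annotations)
```

A scan of the kind described in the abstract would then compute an association statistic for each resulting set; that step is omitted here because the abstract does not specify the statistic.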


Tuesday, September 20, 2011
12:30-2:00, Kresge G2

Steven Watkins, Ph.D.
Senior Visiting Scientist at Harvard School of Public Health
Senior Vice President of Research at Tethys Bioscience

Metabolomics in the Laboratory, the Clinic and the Marketplace: An opinion on the state of the field after a decade of earnest effort.

The technologies for measuring metabolites broadly, rapidly and accurately are mature enough to support both basic research and medical applications. On the basic research side, there are many success stories and metabolomics has been widely adopted into the workflow. On the clinical side, there are still hurdles to overcome and successes with this hugely promising technology have not been as forthcoming. The reasons for this are intertwined with the very nature of metabolomics itself, and center on complexity in the required instrumentation and the interpretation of multi-marker panels. This talk will review our experience with metabolomics (and a little bit of proteomics) at Lipomics and Tethys in approaching basic and clinical studies. Those efforts spawned successful advances in the knowledge of metabolism as well as commercial diagnostic products, but not routine strategies that generalize across applications and fields. Metabolomics is beginning to bear fruit, and remains immensely promising, but from our perspective unfortunately still requires specialized expertise to conduct.


2010-2011 Seminars

Tuesday, May 17, 2011
12:30-2:00, FXB G13

Barbara E. Stranger, Ph.D.
Instructor of Medicine
Division of Genetics, Department of Medicine
Harvard Medical School and Brigham and Women's Hospital

Genomics of Human Gene Expression

Genetic variation in gene expression has long been studied with the aim not only of understanding the landscape of regulatory variants but also of assisting in the interpretation and elucidation of disease signals. We present two projects: (1) Analysis of the genetic basis of genome-wide gene expression patterns in lymphoblastoid cell lines from 726 individuals from 8 populations from the HapMap3 project. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We dissect the specific functional pathways differentiated between populations and highlight patterns of sharing of expression quantitative trait loci (eQTLs) between populations, which are determined by population relatedness, and discover significant sharing of eQTL effects between Asians, European-admixed and African subpopulations. (2) To understand the evolutionary and functional consequences of immune-mediated disease susceptibility, we performed a series of distinct, but interrelated large-scale analyses of three different data types: (a) genetic variants reported to be associated with ten different immune-mediated diseases from published genome-wide association studies (GWAS); (b) a genome-wide scan for signatures of positive selection in a population of European ancestry; and (c) an eQTL mapping study in peripheral blood mononuclear cells (PBMCs). Our results suggest that changes in gene expression levels influencing immune-mediated disease have been targets of recent positive selection, perhaps in some cases, due to a selective advantage from protection against infectious disease in the past.



Tuesday, April 12, 2011
12:30-2:00, Kresge G3

Gonçalo Abecasis, D.Phil
Felix E. Moore Collegiate Professor of Biostatistics
Center for Statistical Genetics
Department of Biostatistics
School of Public Health, University of Michigan

Sequencing Thousands of Human Genomes

Identifying and characterizing the genetic variants that affect human traits is one of the central objectives of human genetics. Ultimately, this aim will be achieved by examining the relationship between interesting traits and the whole genome sequences of many individuals. Whole genome re-sequencing of thousands of individuals is not yet feasible, but advances in laboratory methods (for example, to enable the genotyping of thousands of individuals at hundreds of thousands of SNP sites) and in statistical methodology (for example, to enable accurate correction for population stratification and genotype imputation) have resulted in substantial progress in our understanding of complex disease biology. Here, I discuss some of the analytical and study design challenges posed by the first generation of whole genome sequencing studies. These studies will enable the examination of 1,000s of individuals at >15 million polymorphic sites. These studies have been made possible by continuing advances in laboratory technology and statistical methods and should further refine our understanding of complex disease genetics. I illustrate the possibilities both with simulation and with results from ongoing studies.

Tuesday, March 29, 2011
12:30-2:00, Kresge G2

Hongbing Shen, M.D., Ph.D.
Dean of School of Public Health
Professor of Epidemiology
Nanjing Medical University

A Large-Scale Genomewide Association Study of Lung Cancer in Han Chinese: Cumulative Effects of Six Genetic Variants

This is the first large-scale GWAS of lung cancer in Han Chinese. We performed a GWAS scan in 5,430 subjects (2,342 lung cancer cases and 3,088 controls), followed by a two-stage validation among 12,722 subjects (6,313 cases and 6,409 controls). The combined analyses identified 6 well-replicated SNPs with significant associations (P < 5×10⁻⁸) with lung cancer in the genes of TP63 (at 3q28), TERT-CLPTM1L (at 5p15.33), MIPEP-TNFRSF19 (at 13q12.12), and MTMR3-HORMAD2-LIF (at 22q12.2). Among the above 6 SNPs, 4 were newly identified in Chinese. The population attributable risk (PAR) of these six SNPs was 59.1% and could be increased to 77.3% (male: 84.0%, female: 65.6%) after incorporating pack-years of smoking. The results suggest that genetic variations in 3q28, 5p15.33, 13q12.12 and 22q12.2 may cumulatively contribute to the susceptibility of lung cancer in Chinese.

FRIDAY, March 4, 2011
12:30-2:00, FXB G13

Peter M. Visscher, Ph.D.
Senior Principal Research Fellow
Queensland Statistical Genetics Laboratory
Queensland Institute of Medical Research
Brisbane, Australia

Genome-partitioning of Genetic Variation for Complex Traits Using GWAS Data

Common complex disease is caused by a combination of multiple genes and environmental effects. Traditionally the genetics of disease has been studied using concepts that refer to the combined effect of all genes (e.g., heritability or sibling risk), for example by studying the recurrence risk or phenotypic correlation of relatives. Genome-wide association studies (GWAS) facilitate the dissection of heritability into individual locus effects. They have been successful in finding many SNPs associated with complex traits and have greatly increased the number of genes where variation is known to affect the trait. However, GWAS have been criticized for not explaining more of the genetic variation that we know exists in the population, and many hypotheses have been put forward to explain the missing heritability. The most plausible explanations are that (i) causal effects are too small to be detected with statistical significance and (ii) causal variants are not well tagged by the SNPs on the commercial arrays, for example because their heterozygosity is lower than that of genotyped SNPs. The use of all GWAS data simultaneously in an estimation rather than hypothesis testing framework can capture much more variation than in gene discovery approaches, and allows the partitioning of variation across chromosomes and chromosome segments. We show how such whole genome methods can be used to better understand the genetic architecture of complex traits, with applications in height, BMI, schizophrenia and other traits. The results demonstrate that for all traits studied, a substantial proportion of additive genetic variation is tagged by common SNPs and that genetic variation is smeared out over the entire genome. We conclude that these traits are highly polygenic, that variation explained by causal variants is small on average and that GWAS with increasing sample size will discover more variants.
In addition, we show that the same statistical approaches used to estimate genetic variance are relevant in genetic risk prediction for complex traits.

Tuesday, February 8, 2011
12:30-2:00, Kresge G3 - PLEASE NOTE ROOM CHANGE

Eleazar Eskin, Ph.D.
Associate Professor
Computer Science
Human Genetics
University of California, Los Angeles

Statistical Methods for Association Studies with Rare Variants

Sequencing studies have been discovering numerous rare variants, allowing one to test effects of rare variants on disease susceptibility. Recently, several groupwise association tests that group rare variants in genes and detect associations between groups and diseases have been proposed to increase the statistical power of studies on rare variants. One major challenge in these methods is to determine which variants are the actual causal variants in a group, and to overcome this challenge, previous methods used prior information that specifies how likely each variant is to be causal. Another challenge is how to combine information from multiple rare variants in a gene. Both of these challenges affect the statistical power of these methods in more complicated ways than in traditional association studies. In this talk, I discuss some recent work on measuring the statistical power of rare variant association methods and present some new methods motivated by our observations.


Tuesday, December 14, 2010
12:30-2:00, Kresge G2

Steven R. Kleeberger, Ph.D.
Acting Deputy Director
Environmental Genetics Group
National Institute of Environmental Health Sciences
& National Toxicology Program

Genetic Mechanisms of Susceptibility to Oxidant-induced Lung Disease: New Insights

Environmental oxidants remain a major public health concern in industrialized cities throughout the world. Population and epidemiological studies have associated oxidant air pollutant exposures with morbidity and mortality outcomes, and underscore the important detrimental effects of these pollutants on the lung. Inter-individual variation in human pulmonary responses to air pollutants suggests that some subpopulations are at increased risk of the detrimental effects of pollutant exposure, and it is becoming increasingly clear that genetic background is an important susceptibility factor. We have utilized multiple positional cloning approaches in mice to identify genes that determine differential responsiveness to ozone-induced injury and inflammation, including Tnf, Tlr4, and MHC Class II genes. Integrative genomics approaches in mouse models have led to the identification of additional susceptibility gene candidates including Marco, Nqo1, and Hsp70. Importantly, comparative mapping between the human and other genomes can also yield candidate susceptibility genes. Ongoing association studies in human subjects and tissue-specific gene expression profiling in juvenile rhesus macaque monkeys have provided compelling validation of a number of oxidant susceptibility gene candidates. Results from these studies have also been informative for other oxidant-related disorders, including susceptibility to respiratory syncytial virus (RSV) disease. The combined investigations in inbred mice, human subjects, and non-human primates have provided, and will continue to provide, important insight into the genetic factors that contribute to differential susceptibility to oxidants.


Tuesday, November 30, 2010
12:30-2:00, Kresge G2

Tianxi Cai, Ph.D.
Associate Professor of Biostatistics
Department of Biostatistics, HSPH

Adaptive Naive Bayes Kernel Machine Approach to Classification with GWAS Data

As genetic studies of human diseases progress, it is becoming increasingly evident that genetics often play a major and complex role in many types of diseases. The complexity of the genetic architecture of human health and disease makes it difficult to identify genomic markers associated with disease risk or to construct accurate genetic risk prediction models. Accurate risk assessment is further complicated by the availability of a large number of markers that may be predominantly unrelated to the outcome or may explain a relatively small amount of genetic variation. Often, standard prediction models merely rely on additive or marginal relationships between the markers and the phenotype of interest. Marginal association based analysis has limited power in identifying markers truly associated with disease, resulting in a large number of false positives and false negatives. Simple additive modeling does not perform well when the underlying structure of association involves interactions and other non-linear effects. Additionally, these methods do not make use of information that may be available regarding genetic pathways or gene structure. We propose a multi-stage method relating possibly predictive markers to the risk of disease by first forming multiple gene-sets based on certain biological criteria. By imposing a naive Bayes kernel machine (KM) model, we estimate gene-set specific risk models that relate information from each gene-set to the outcome. In the second stage, we aggregate information across all gene-sets by adaptively estimating the weights for each gene-set via a regularization procedure. The KM framework efficiently models the potentially non-linear effects of predictors without specifying a particular functional form.
Estimation and predictive accuracy are further improved with kernel PCA approximation to reduce the degrees of freedom in the first stage and with adaptive regularization in the second stage to remove non-informative regions from the final prediction model. Prediction accuracy is assessed with bias-corrected ROC curves and AUC statistics. Numerical studies suggest that the model performs well in the presence of non-informative regions and both linear and non-linear effects.


Tuesday, October 26, 2010
12:30-2:00, Kresge G2

Andrea Baccarelli, M.D., Ph.D.
Associate Professor of Environmental Epigenetics
Department of Environmental Health, HSPH

Epigenetic Modifications Induced by Environmental Pollutants: Results from Human Studies

Epigenetics investigates heritable changes in gene expression that occur without changes in DNA sequence. Several epigenetic mechanisms, including DNA methylation and histone modifications, can change genome function under exogenous influence. Results obtained from animal models indicate that in utero or early-life environmental exposures produce effects that can be inherited transgenerationally and are accompanied by epigenetic alterations. The search for human equivalents of the epigenetic mechanisms identified in animal models is in progress. I will present evidence from human studies of individuals exposed to air pollution and metals indicating that epigenetic alterations mediate effects caused by exposure to environmental toxicants. In these investigations we have shown that environmental toxicants cause altered methylation of human repetitive elements or genes. Some exposures can alter epigenetic states and the same and/or similar epigenetic alterations can be found in patients with the disease of concern. In recent preliminary studies, we have shown alterations of histone modifications in subjects exposed to metal-rich airborne particles. I will present original data demonstrating that altered DNA methylation in blood and other tissues is associated with environmentally-induced disease, such as cardiovascular disease and asthma. On the basis of current evidence, I will propose possible models for the interplay between environmental exposures and the human epigenome.


Tuesday, September 28, 2010
12:30-2:00, Kresge G2

Tyler VanderWeele, Ph.D.
Associate Professor of Epidemiology
Departments of Epidemiology & Biostatistics, HSPH

Genetic Variants on 15q25.1, Smoking and Lung Cancer: an Assessment of Mediation and Interaction

Genome-wide association studies have identified variants on chromosome 15q25.1 that increase the risk of both lung cancer and nicotine dependence and associated smoking behavior. However, there remains debate as to whether the effect on lung cancer is direct or operates through pathways related to smoking behavior. Of the three studies that initially reported the association between the variants and lung cancer, two suggested that the association was direct and one that it was primarily through nicotine dependence. Thorgeirsson and co-authors note also a third possible explanation of the associations: that the variant may increase individuals’ vulnerability to the harmful effect of tobacco smoke, a form of gene-environment interaction. In order to determine the extent to which variants in the 15q25.1 region affect lung cancer through nicotine dependence and associated smoking behavior or through other pathways, we applied novel methodology for mediation analysis to a case-control study of lung cancer with 1836 cases and 1452 controls. For two SNPs, rs8034191 and rs1051730, on 15q25.1, we estimated the indirect effect mediated by smoking (cigarettes per day), the direct effect through other pathways and the overall proportion mediated. Analyses allowed for the possibility that the effect of smoking varied by groups defined by the genetic variant. The effect of the variants on lung cancer mediated through smoking appears to be smaller than the independent effect through other pathways.


2009-2010 Seminars

Tuesday, April 6, 2010
12:30-2:00, Kresge G2

Nick Patterson, Ph.D.
Senior Computational Biologist
The Broad Institute

Admixture Graphs: Learning Genetic History

We describe a new methodology, useful for the analysis of genetic data, that generalizes phylogenetic trees. The new methods raise a number of statistical and mathematical issues, and yield some surprising insights into ancient human history. We will give examples from Eurasia, India, and South America.

Tuesday, March 2, 2010
12:30-2:00, Kresge G2

Hakon Hakonarson, M.D., Ph.D.
Center for Applied Genomics
The Children's Hospital of Philadelphia

Genetics of Complex Pediatric Disorders: Novel Analytical Approaches

Genome-wide association studies have delivered on the promise of uncovering genetic determinants of complex disease, using high-throughput methods allowing large volumes of SNPs (10^5-10^6) to be genotyped in large cohort studies. The GWA approach serves the critical need for a comprehensive and unbiased strategy to identify causal genes related to complex disease and is rapidly replacing the more traditional candidate gene studies and microsatellite-based linkage mapping approaches that have dominated the gene discovery attempts for common diseases in previous years. As a consequence of employing this array-based technology over the last three years, dramatic discoveries of key variants involved in multiple complex diseases and related traits have been reported in the top scientific literature, including over 2000 novel loci with multiple replications in over 100 disease areas by independent groups. In this talk, these discoveries will be reviewed, and large-scale database efforts and their use in complex genetic disorders and the genomics of drug response will be discussed. Novel analytical approaches will also be presented addressing pathway-based analyses and tagging of rare variants that may account for some of the missing heritability in GWAS.

Tuesday, February 2, 2010
12:30-2:00, Kresge G2

Shamil Sunyaev, Ph.D.
Assistant Professor of Medicine
Harvard Medical School
Brigham & Women's Hospital

"Human Deleterious Mutations: Evolution, Function and Disease"

About one hundred new mutations occur in the genome of the average human. We can learn about the origin, evolutionary fate, and functional and phenotypic effects of human mutations from the rapidly increasing DNA sequencing data. Evolutionary analysis of human deleterious mutations suggests strategies for identifying genes involved in human complex diseases. Computational and statistical approaches based on evolutionary and structural considerations can assist in genetic diagnostics of human monogenic or oligogenic diseases, as was tested in the case of cardiomyopathy.

Tuesday, December 1, 2009
12:00-1:30, FXB G13

David Reich, Ph.D.
Associate Professor
Harvard Medical School, Department of Genetics
Associate Member, Broad Institute

"Reconstructing Indian Population History and Implications for South Asian Gene Discovery"

India has been underrepresented in genome-wide surveys of human variation. Here I describe an analysis of patterns of variation in 25 diverse Indian groups to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most groups today. One, the "Ancestral North Indians" (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, while the other, the "Ancestral South Indians" (ASI), is not close to any group outside the subcontinent. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39% to 71%, and is higher in traditionally upper-caste groups and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the Andamanese are an ASI-related group without ANI-related ancestry, showing that the peopling of the islands must have occurred before ANI-ASI gene flow on the mainland. Allele frequency differences between groups in India are larger than in Europe, which we show reflects strong founder effects whose genetic signatures have been maintained for thousands of years due to endogamy. There are two key medical implications. First, our observations predict that there will be an excess of recessive diseases in India, different in each group, whose risk variants should be easy to identify using standard genetic methods. Second, the genetic risk factors that are only present in the ASI will be very difficult to discover without building specific genetic variation resources for South Asians.

Tuesday, November 3, 2009
12:00-1:30, FXB G13

Giovanni Parmigiani, Ph.D.
Chair, Department of Biostatistics & Computational Biology, DFCI
Professor, Department of Biostatistics, HSPH

"Cross-study Differential Gene Expression"

In this lecture I will present statistical issues associated with combining microarray data across studies. I will focus on the role of hierarchical Bayesian models in constructing useful rules to shrink across both genes and studies, and to classify genes based on the patterns of concordance across studies. I will describe in detail a model we call XDE, and evaluate its performance in a comprehensive fashion, using both artificial data and a "split sample" validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis but also under realistic alternatives. Compared to a more direct combination of t- or SAM-statistics, the 1 - AUC values for the Bayesian model are roughly half of the corresponding values for the t- and SAM-statistics. In small studies, XDE generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. Finally, I will illustrate our model using four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.

A technical report can be obtained from:


Tuesday, October 6, 2009
12:00-1:30, FXB G13

Peter Park, Ph.D.
Assistant Professor of Pediatrics
HMS Center for Biomedical Informatics
Children's Hospital Informatics Program

"ChIP-sequencing: Data Analysis and Applications"

ChIP-seq combines chromatin immunoprecipitation (ChIP) with next-generation sequencing to identify protein-DNA interactions on a genome-wide scale. After a brief introduction to next-generation sequencing, a number of practical issues in analysis of ChIP-seq data will be discussed, including experimental design, detection of binding sites, and determination of whether a sufficient depth of sequencing has been achieved. Applications of ChIP-seq to the study of X-chromosome dosage compensation in Drosophila and nucleosome positioning will be described. If time allows, updates from the Cancer Genome Atlas and the model organism ENCODE projects will be given.


Tuesday, September 29, 2009
12:00-1:30, FXB G13

Mitchell Gail, M.D., Ph.D.
Senior Investigator
Biostatistics Branch
Division of Cancer Epidemiology and Genetics
National Cancer Institute

"The value of adding single nucleotide polymorphism data to a model that predicts breast cancer risk"

Seven single nucleotide polymorphisms (SNPs) have recently been confirmed to be associated with breast cancer. I assessed the value of adding these SNPs to the Breast Cancer Risk Assessment Tool (BCRAT), which is based on ages at menarche and at first live birth, family history of breast cancer, and history of breast biopsy examinations. The model with these SNPs (BCRATplus7) had an area under the receiver operating characteristic curve (AUC) of 0.632, compared to 0.607 for BCRAT. This improvement is less than that from adding mammographic density to BCRAT. I also assessed how much BCRATplus7 reduced expected losses in deciding whether a woman should take tamoxifen to prevent breast cancer and in deciding whether a woman should have a mammogram. In addition, I examined whether BCRATplus7 was more effective than BCRAT in allocating a scarce public health resource, such as access to mammography, based on ranking women on their breast cancer risk and allocating the resource to those at highest risk. In none of these applications did BCRATplus7 perform substantially better than BCRAT. A cross-classification of risk by the two models indicated that some women would change risk categories, depending on the risk threshold, if BCRATplus7 were used instead of BCRAT, but it is not known if BCRATplus7 is well calibrated. These results changed little when three additional very recently identified SNPs were added. I conclude that the available SNPs do not improve the performance of models to estimate breast cancer risk enough to warrant their use outside the research setting.