Seminar Archive

The goal of the PQG Seminar Series is to encourage the exchange of ideas and to promote interaction, collaboration, and research in quantitative genomics. It seeks to further the development and application of quantitative methods, especially for high-dimensional data, and to support the training of quantitative genomic scientists.

2020-2021 Seminars

May 4, 2021

Heng Li

Assistant Professor
Biomedical Informatics, Harvard Medical School
Biostatistics and Computational Biology, Dana-Farber Cancer Institute

The assembly of a human pangenome


April 13, 2021

Associate Professor, Department of Human Genetics
Emory University School of Medicine

Clonal hematopoiesis of indeterminate potential and DNA methylation in human aging

Age-related changes to the epigenome are well-documented, especially the pattern of genome-wide DNA methylation (DNAm) changes observed in blood and other tissues. This pattern is so robust it can accurately estimate individual age and predict lifespan and health outcomes, but its underlying causes remain to be elucidated. Age is also associated with an increased prevalence of somatic mutation in nuclear DNA, particularly in frequently regenerating cells such as hematopoietic stem cells (HSCs). Occasionally, mutations in HSCs may occur in a leukemogenic driver gene causing clonal selection and expansion – termed clonal hematopoiesis of indeterminate potential (CHIP). The genes most commonly implicated in CHIP, DNMT3A and TET2, have roles in epigenomic regulation. Thus, we hypothesized that age-related DNAm changes may be partially driven by age-related CHIP.  Here, I will review work on DNAm and aging by our group and others, followed by recent work on CHIP and aging.  I will then describe the results from an epigenome-wide association study (EWAS) to test for association between CHIP and DNAm in two population-based cohort studies.


March 2, 2021

Karen Miga

Satellite DNA biologist, Co-lead
telomere-to-telomere (T2T) consortium

Telomere-to-Telomere Chromosome Assemblies: New Insights Into Genome Biology and Structure

We are entering an exciting era of genomics where truly complete, high-quality assemblies of human chromosomes are available end-to-end, or from ‘telomere-to-telomere’ (T2T). Recently, the Telomere-to-Telomere (T2T) consortium announced our v1.0 assembly that includes more than 150 Mbp of novel sequence compared to GRCh38, achieves near-perfect sequence accuracy, and unlocks the most complex regions of the genome to functional study. This technological advance, owing to the confluence of new assembly methods with long-read sequencing technologies, offers a new opportunity to comprehensively study the genomic structure and epigenetic organization in the most repeat-dense regions of our chromosomes. In particular, I will focus on the release of an initial genetic and epigenetic reference of all human centromeric regions. High-resolution study of the pericentromeric sequence content and organization reveals new satellite families, sites of transposable element insertion, segmental duplications, and pericentromeric gene predictions. Using unique markers (marker-assisted method) to anchor ultra-long nanopore reads to human centromeric regions, we report hypomethylated dips at every centromeric region, as previously described for the T2TX centromere. These sites are shown to coincide with regions enriched in centromere protein A (CENP-A) and may provide a signature of sites of kinetochore assembly genome-wide.


February 2, 2021

Jordan Smoller

Professor of Psychiatry
Harvard Medical School

Leveraging Genomics and Big Data to Advance Precision Psychiatry

Neuropsychiatric disorders are common, often disabling conditions. Despite substantial advances in our understanding of the nature of these disorders, fundamental unsolved questions remain to be answered.  The classification of psychiatric disorders is still based on symptom-based syndromes defined by expert consensus. We recognize that early intervention can improve prognosis, but we have precious few tools for identifying those at risk or for preventing illness. We have therapeutics that help many people and are life-saving for some, but available treatments are based on decades-old insights and administered in a largely trial-and-error fashion. The recent emergence of precision medicine emphasizes the role of individual differences in biology, environment, and lifestyle to develop more targeted and effective approaches to diagnosis, treatment, and prevention. In this presentation, I will review recent work by our group to advance a precision medicine approach to psychiatry, leveraging genomics and large-scale data resources (e.g. electronic health records) to clarify the structure of mental illness, facilitate risk stratification and prevention, and translate genomic insights into more targeted treatment options.


December 8, 2020

Chunyu Liu

Research Associate Professor, Biostatistics
BU School of Public Health

Analysis of mitochondrial genome from whole genome sequencing

Mitochondrial DNA copy number and heteroplasmic mutations can be measured from whole genome sequencing. In this talk, Dr. Liu will discuss association analyses of mitochondrial DNA copy number with age, sex, and cardiometabolic traits using a few large cohorts from NHLBI’s TOPMed project. In addition, Dr. Liu will talk about the inheritance of heteroplasmic mutations in humans.


November 17, 2020

Chunyu Liu

Professor of Psychiatry and Behavioral Sciences, Neuroscience and Physiology
SUNY Upstate Medical University

Functional annotation of GWAS findings of psychiatric disorders

Genetic studies have discovered hundreds of significant associations for major psychiatric disorders. Explaining those genetic associations requires multi-dimensional investigation of gene functions. QTL mapping is one of the high-throughput approaches. In this talk, I will introduce several of our ongoing QTL projects for annotating genetic variants in the human brain, including a study of East Asian brains, first-trimester fetal brain, major neural cell types, multi-omics QTL in adult brain, and cellular models to validate the predicted functions.


October 20, 2020

Kari North

Professor of Epidemiology
Carolina Center for Genome Sciences, UNC Gillings School of Public Health

Integrative Approaches to Identifying Function and Significance of Adiposity Susceptibility Genes

In 2019, ~100 million Americans were obese, fueling increases in obesity-related morbidity, mortality, and health care costs, largely from cardiometabolic diseases (CMD). GWAS have demonstrated the fundamental role of genetic susceptibility in obesity risk, including the >1000 loci identified to date. Each GWAS-identified locus potentially provides novel biologic insight; yet the identification of the functional variants, genes, and underlying pathways at these loci has limited translation for precision medicine. OMICs (e.g. genetics, transcriptOMICs, methylOMICs, and metabolOMICs) lie along pathways linking genetic susceptibility to obesity and are emerging as powerful disease biomarkers that provide targetable “mechanistic bridges” linking GWAS findings with obesity risk.  OMIC scans in the same individuals in which obesity associated loci discoveries were made are now available, thereby facilitating comprehensive and efficient integration with genetic data to illuminate the underlying genes and mechanistic pathways of obesity-associated loci. Leveraging collaborations in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), TransOMICs for Precision Medicine (TOPMed) Program, and the Genome Sequencing Project (GSP),  we have begun the process of identifying the genes underlying GWAS signals so that we can perform clinical characterization and conduct in vitro functional studies to characterize the molecular underpinnings and biological mechanisms of obesity-risk loci. Our approach will substantially move the field away from tag variants and loci to causal variants, genes, and mechanisms. We anticipate that this work will generate fundamental and important insights into the underlying etiology of obesity and ultimately point the way forward towards prevention and treatment.


October 6, 2020

Shai Carmi

Senior Lecturer
Braun School of Public Health and Community Medicine
The Hebrew University of Jerusalem, Israel (HUJI)

Evaluating the utility of screening human embryos for polygenic traits

It is now technically feasible to use polygenic scores and preimplantation genetic testing to screen human embryos for traits (e.g., height or cognitive ability). However, the expected outcomes of embryo screening are unclear, which undermines discussion of associated ethical concerns. We used simulations, theory, and real data to evaluate the potential gain of embryo screening, defined as the increase in trait value when selecting the top-scoring embryo for implantation. The gain increases slowly with the number of embryos but more rapidly with the variance explained by the score. Given current technology, the average gain due to screening would be ~2.5 cm for height and ~2.5 IQ points for cognitive ability. I will also discuss a preliminary assessment of the utility of embryo screening for complex disease risk.

2019-2020 Seminars

May 5, 2020 – cancelled

Naomi Wray

Professorial Research Fellow, Institute for Molecular Bioscience
Affiliate Professor, Queensland Brain Institute
University of Queensland


April 14, 2020 – cancelled

David Gifford

Professor of Electrical Engineering and Computer Science
Professor of Biological Engineering
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)


March 24, 2020 – cancelled

Jordan Smoller

Professor of Psychiatry, Harvard Medical School
Professor of Epidemiology, Harvard School of Public Health


March 3, 2020

Nir Yosef

Associate Professor, Department of Electrical Engineering and Computer Science
Center for Computational Biology, UC Berkeley

Computational tools for multimodal analysis of single cell trajectories

Single-cell RNA-sequencing has emerged as a popular modality for dissecting temporal processes such as tissue development and cellular differentiation. Current efforts in the field now go beyond mRNA and provide additional types of information for every cell, which may further enhance these studies. In this talk, I will survey our efforts to develop computational tools that can leverage two such data types. The first data type I will discuss is protein abundance – this is enabled by Total-Seq or CITE-Seq, a method which estimates the abundance of dozens of proteins on a cell’s membrane in addition to its transcriptome. We have designed a method, Total-VI, that learns a joint representation of a cell’s state from paired mRNA and protein measurements while capturing uncertainty and propagating it to a range of tasks (e.g., batch correction, subpopulation identification, differential expression). I will demonstrate the utility of Total-VI in the context of an ongoing study of T cell lineage specification in the thymus. The second data type I will cover is traceable genetic alterations – this approach uses CRISPR/Cas9 to introduce heritable marks to cells, enabling the reconstruction of cellular lineages. I will discuss a suite of tools (Cassiopeia) we have designed to help build such lineages, while tackling hurdles such as missing data and the need for scalability. I will conclude by briefly presenting a third method (HotSpot) designed for a joint analysis of transcriptome and CRISPR/Cas9-based lineage information, and its application to characterizing heritable gene programs during embryogenesis.


December 3, 2019

Su-In Lee

Associate Professor, Paul G. Allen School of Computer Science & Engineering
Adjunct Associate Professor, Depts of Genome Sciences, Electrical and Computer Engineering, and Biomedical Informatics and Medical Education
University of Washington, Seattle

Explainable Artificial Intelligence for Biology and Health

Modern machine learning (ML) models can accurately predict patient progress, an individual’s phenotype, or molecular events such as transcription factor binding. However, they do not explain why selected features make sense or why a particular prediction was made. For example, a model may predict that a patient will get chronic kidney disease, which can lead to kidney failure. The lack of explanations about which features drove the prediction – e.g., high systolic blood pressure, high BMI, or others – hinders medical professionals in making diagnoses and decisions on appropriate clinical actions. I will briefly describe my group’s efforts to develop interpretable ML techniques for varied biological and medical applications, including treating cancer based on a patient’s own molecular profile, identifying therapeutic targets for Alzheimer’s, predicting kidney diseases, preventing complications during surgery, enabling pre-hospital diagnoses for trauma patients, and improving our understanding of pan-cancer biology and genome biology. My talk will focus in greater detail on interpretable ML techniques to identify molecular markers for anti-cancer drugs for acute myeloid leukemia, published in Nature Communications (Jan 2018), and our more recent work in collaboration with Harvard Medical School; our explainable artificial intelligence system, Prescience, for preventing hypoxemia in patients under anesthesia, featured on the cover of Nature Biomedical Engineering (Oct 2018); SHAP, our general ML framework on model interpretability, published as a full oral presentation at Neural Information Processing Systems (2017; cited >480 times in 2 years); Tree explainer, our polynomial time algorithm for computing SHAP values for tree models (accepted to Nature Machine Intelligence); and our ML approach to identifying Alzheimer’s disease therapeutic targets.

 


November 12, 2019

Adam Siepel

Professor, Watson School of Biological Sciences
Chair, Simons Center for Quantitative Biology
Cold Spring Harbor Laboratory

Applications of Nascent RNA-sequencing: Transcriptional Regulation, RNA Stability, and Regulatory Evolution


October 21, 2019

Raluca Gordân

Associate Professor, Biostatistics & Bioinformatics
Duke Center for Genomic and Computational Biology

Formation and consequences of DNA mutations in transcription factor binding sites

The vast majority of germline and somatic mutations fall within non-coding regions of the genome, where they can affect interactions with transcription factor (TF) proteins. I will present QBiC-Pred, a new method to quantitatively predict the effects of single nucleotide mutations on TF-DNA binding (http://qbic.genome.duke.edu). Our predictions are in great agreement with in vitro and in vivo TF binding data, as well as gene expression data. I will also present recent results investigating how mutations form in TF binding sites. Several genomic studies suggested that TF binding might be, to some extent, mutagenic, because TFs could interfere with DNA repair at binding sites containing DNA lesions. But it is currently unknown whether TFs can even bind to such sites. Focusing on mismatch lesions (i.e. non-complementary base pairs that result from replication errors, 5mC deamination, and other cellular processes), we investigated the binding of 21 human TFs to sites containing mismatches. Surprisingly, we found a widespread increase in TF binding due to mismatches, which can be explained based on dynamic changes in DNA structure. Given the high affinity with which all tested TFs can bind to mismatch lesions, we conclude that competition with mismatch repair enzymes is indeed a potential mechanism by which TFs can modulate the formation of mutations in their binding sites.

2018-2019 Seminars

May 7, 2019

Iuliana Ionita-Laza

Associate Professor of Biostatistics
Columbia University, Mailman School of Public Health

Integrative statistical approaches for the analysis of whole-genome sequencing data

Continuous advances in massively parallel sequencing technologies make large whole-genome sequencing studies increasingly feasible. The analysis of such data is challenging due to the large number of rare variants in noncoding regions of the genome, our limited understanding of their functional effects, and the lack of natural units for testing. In this talk I will describe some of our work to address these challenges. In particular, I will discuss unsupervised and semi-supervised approaches to predict cell type/tissue specific regulatory function for variants in noncoding regions. I will also discuss sequence-based association tests for noncoding regions that are able to integrate a large number of functional predictions for improved power to identify the signals in noncoding regions. Throughout the talk I will show applications to several datasets.


April 30, 2019

Rahul Satija

Assistant Professor of Biology
Center for Genomics and Systems Biology, NYU

Comprehensive Integration of Single Cell Data

Single cell transcriptomics (scRNA-seq) has transformed our ability to discover and annotate cell types and states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, including high-dimensional immunophenotypes, chromatin accessibility, and spatial positioning, a key analytical challenge is to integrate these datasets into a harmonized atlas that can be used to better understand cellular identity and function. Here, we develop a computational strategy to “anchor” diverse datasets together, enabling us to integrate and compare single cell measurements not only across scRNA-seq technologies, but different modalities as well. After demonstrating substantial improvement over existing methods for data integration, we anchor scRNA-seq experiments with scATAC-seq datasets to explore chromatin differences in closely related interneuron subsets, and project single cell protein measurements onto a human bone marrow atlas to annotate and characterize lymphocyte populations. Lastly, we demonstrate how anchoring can harmonize in-situ gene expression and scRNA-seq datasets, allowing for the transcriptome-wide imputation of spatial gene expression patterns, and the identification of spatial relationships between mapped cell types in the visual cortex. Our work presents a strategy for comprehensive integration of single cell data, including the assembly of harmonized references, and the transfer of information across datasets.

April 16, 2019

Anshul Kundaje

Assistant Professor of Genetics and Computer Science
Stanford University

Deciphering the cis-regulatory code of the genome using interpretable deep learning models

Functional genomics experiments profiling genome-wide regulatory state have revealed millions of putative regulatory elements in diverse cell states. These massive datasets have spurred the development of Deep Neural Networks (DNNs) that can accurately map DNA sequence to associated cell-type specific molecular phenotypes such as TF binding, chromatin accessibility, splicing and gene expression. I will present a critical overview of a variety of deep learning architectures, training and model evaluation strategies for learning predictive regulatory models from functional genomics profiles. Beyond high prediction accuracy, the primary appeal of DNNs is that they are capable of automatically learning predictive, biologically relevant patterns directly from raw data representations (e.g. raw DNA sequence) without many prior assumptions. I will present efficient interpretation engines for deep learning models to decipher nucleotide-resolution transcription factor binding events, improved TF motif representations, combinatorial motif interactions in cis-regulatory sequence grammars, dynamic cis and trans regulatory drivers of cellular differentiation and non-coding regulatory genetic variants. Finally, I will present Kipoi (http://kipoi.org/), the first machine learning model zoo for genomics. Kipoi democratizes machine learning for genomics by providing a unified, user-friendly framework to archive, share, access, use and build on models developed by the community.

March 5, 2019

Michael Beer

Associate Professor, Department of Biomedical Engineering
Johns Hopkins School of Medicine

Using Machine Learning to Predict the Impact of Non-coding Genetic Variation and Enhancer-Promoter Interactions

Most SNPs associated with common human disease are intergenic and contribute to disease susceptibility by altering enhancer activity.  Several machine learning methods have recently been proposed to detect the disrupted regulatory sequences and to predict the impact of regulatory mutations, based on SVMs and Deep Neural Networks (DNN). While these methods have similar cross-fold validation rates, variability in feature detection and importance can lead to significantly different predictions for mutation impact.  Here I will show that most human population variation arises from weak transcription factor binding site disruption, and that differences in machine learning approaches can dramatically affect the accuracy of the predictions.   I will then discuss serious limitations in most current machine learning formulations to predict enhancer promoter interactions and potential improvements.  Finally, I will describe how these methods can identify functionally conserved regulatory elements missed by conventional sequence alignment methods.  Together, these results show that statistical learning from large functional datasets can more accurately determine the quantitative contribution of weak binding sites to enhancer function in their native cellular contexts.


February 5, 2019

Sharon Browning

Research Professor, Department of Biostatistics
University of Washington

Our archaic human ancestors

Archaic human populations such as Neanderthals split off from the ancestors of modern humans hundreds of thousands of years ago. Around fifty thousand years ago, as modern humans expanded their geographical range, they encountered and admixed with archaic human populations. Recent sequencing of DNA from Neanderthal fossils showed that Eurasians carry several percent of their DNA from Neanderthal ancestors. It is clear that other archaic humans also contributed to the human lineage. In particular, Denisovans, an archaic human population that lived in Asia, contributed significantly to the ancestry of Papuans, and to a lesser extent to the ancestry of Asians. My analysis of human genetic data from the 1000 Genomes Project revealed that there were at least two occurrences of admixture between Denisovans and modern humans: one occurrence led to significant Denisovan ancestry in Papuans and a small fraction of Denisovan ancestry throughout Asia, and the other occurrence contributed a small fraction of the ancestry of modern East Asians. I will present this result and the statistical methodology that was used to obtain it.


December 4, 2018

J. Paul Brooks

Associate Professor, Department of Statistical Sciences and Operations Research
Virginia Commonwealth University

Methods and Results for High-Throughput Investigations of the Microbiome in Pregnancy

Advances in sequencing technology have facilitated comprehensive surveys of the human microbiome, the community of microorganisms that reside in various body habitats.  Longitudinal and multi-omic measurements provide a glimpse of the dynamics of host-microbiome interactions which can provide insight into human health and disease.  These high-throughput measurements pose unique challenges for data analysis, motivating the development of new methods.  A class of new outlier-insensitive analysis methods for microbiome data will be presented.  Then results will be discussed from the two phases of the Human Microbiome Project based at Virginia Commonwealth University, whose aims included an understanding of the role of the microbiome in healthy pregnancies and pregnancy complications.


November 13, 2018

Ying Zhang

Assistant Professor, Cell and Molecular Biology
University of Rhode Island

From genomes to ecosystems: microbial metabolism and evolution as revealed by genome-scale models

Metabolism is an important process that governs the homeostasis of individual cells, multicellular organisms, and complex multi-organismal interactions. In humans, animals, and invertebrates, metabolism of host-associated microbiomes is essential for mediating food digestion and is correlated with the health and disease of the host. In the marine environment, metabolism of microbial communities mediates diverse ecosystem function and is essential for the circulation of carbon nutrients and energy compounds. This presentation will demonstrate the application of genome-scale models to the study of microbial metabolism. Case studies will be provided to illustrate the molecular mechanisms underlying metabolic adaptation, energy conservation, and metabolic variations among different species.


October 23, 2018

Shamil Sunyaev

Professor and Distinguished Chair of Computational Genomics
Harvard Medical School, Brigham & Women’s Hospital

From genetic data to the molecular function and disease risk prediction

Genetic studies of human complex phenotypes revealed the important role of the non-coding fraction of the genome. There is a growing interest in translating the genetic discoveries to mechanistic models of the functional effect of allelic variants. The experimental studies could be guided by hypotheses generated by statistical and computational methods. Our methods use epigenomic annotations, co-localization of GWAS signals with association peaks for molecular and cellular phenotypes, and Mendelian genetics data. Massive genetic data also propelled the research on polygenic risk prediction models. A new method for predicting risk is based on “non-parametric shrinkage”.


September 18, 2018

Maximilian Haeussler

Associate Research Scientist, Biomolecular Engineering
Center for Biomolecular Science & Engineering
Jack Baskin School of Engineering, UCSC

Endless scores most beautiful – an overview of computational predictions for CRISPR specificity and efficiency

More than a dozen scoring algorithms have been published to predict CRISPR on-target efficiency and off-target effects. I will review the field, the differences between the scores, practical advice on using them in experiments and how they are presented on the website crispor.org and the UCSC Genome Browser.

2017-2018 Seminars

May 8, 2018

Eimear Kenny

Assistant Professor
Genetics and Genomic Sciences
Icahn School of Medicine at Mount Sinai

Human Diversity and Health Through the Lens of Genomics


April 17, 2018

Noah Zaitlen

Assistant Professor
Department of Medicine
Lung Biology Center
University of California, San Francisco

Phenotypic heterogeneity within and between populations

Complex phenotypes vary in their distributions between worldwide populations due to both genetic and environmental factors. Partitioning the relative contributions of genetics and environmental exposures to population differences can guide clinical care and inform epidemiological studies of disease. Within populations, disease states are assigned on the basis of easily measured biomarkers or surveys. However, mutations from independent genetic pathways can manifest with similar phenotypes, resulting in disease heterogeneity. This reduces the power of genetic studies and prevents precise treatment of disease. We consider statistical methods to identify disease heterogeneity when genotypes and multiple phenotypes are jointly measured.


March 6, 2018

Li Hsu

Affiliate Professor
University of Washington
Fred Hutchinson Cancer Research Center

A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomics: An Application to a Large-Scale Genome-wide Association Study of Colorectal Cancer

Genome-wide association studies (GWAS) have successfully identified thousands of genetic variants for many complex diseases; however, these variants explain only a small fraction of the heritability. Recently, genetic association studies that leverage external transcriptome data have received much attention and shown promise for discovering novel variants. One such approach, PrediXcan, uses gene expression predicted through genetic regulation. However, there are limitations in this approach. The predicted gene expression may be biased as a result of regularized regression applied to reference studies of moderate sample size. Further, some variants can individually influence disease risk through alternative functional mechanisms besides expression. Thus, testing only the association of predicted gene expression may potentially lose power. To tackle these challenges, we consider a unified mixed effects model that formulates the association of intermediate phenotypes such as imputed gene expression through fixed effects, while allowing for the residual effects of individual variants. We consider a set-based score testing framework, MiST (Mixed effects Score Test), and propose two data-driven combination approaches to jointly testing for the fixed and random effects. We also extend the mixed effects modeling framework to gene-environment interaction. We apply our approach to a large-scale GWAS of colorectal cancer and identify two genes, POU5F1B and ATF1, which would have otherwise been missed by PrediXcan.


February 6, 2018

Hongyu Zhao

Department Chair and Ira V. Hiscock Professor of Biostatistics,
Professor of Genetics and Professor of Statistics and Data Science

Dissecting Genetic Architecture of Complex Diseases Through Integrated Genomic Analysis

Genome-wide association studies (GWAS) have been a great success in the past decade. However, significant challenges still remain in both identifying new risk loci and interpreting results. Complex structure of linkage disequilibrium also makes it challenging to separate causal variants from nonfunctional ones in large haplotype blocks. In this presentation, I will describe our recent efforts to integrate genomic functional annotations from computational predictions (e.g. genomic conservation) and high-throughput experiments (e.g. the ENCODE and Roadmap Epigenomics Projects) with GWAS summary statistics. Tissue and cell specific annotations allow us to infer relevant tissue/cell types at each risk locus. The usefulness of our methods will be demonstrated through their applications to several large GWASs. I will also discuss our approach to inferring genetic correlations from summary statistics. Joint analysis of multiple GWAS results allows us to infer genetic correlations among many complex traits. Finally, I will briefly discuss the improvement of genetic risk prediction using annotation data.


December 5, 2017

Bhramar Mukherjee

John D. Kalbfleisch Collegiate Professor of Biostatistics
Associate Chair of Biostatistics
Professor of Epidemiology
The University of Michigan School of Public Health

Genes, Environment and Data Science: Some Like it All!

Last month, while reviewing an article by a famous statistician from Harvard University (you have to attend my lecture to know the name or simply guess!), I was compelled by the following sentence: “Seeing scientific applications turn into methodological advances is always a joy, at least for those of us who care about advancing the science of data, in addition to advancing science with data.” In this talk I will try to share this “joy” (and associated anxiety) of being a quantitative scientist at a time when our science and society are undergoing an unprecedented information/data revolution. I will present two examples: (1) A phenome-wide association study with polygenic risk scores and electronic health records using data from the Michigan Genomics Initiative; (2) Development of set-based inference for gene-environment interaction in the longitudinal Multi-Ethnic Study of Atherosclerosis. The examples are designed to illustrate that principled study design and data science methodology are at the heart of doing good science with data. This is joint work with many students and colleagues at University of Michigan.


November 14, 2017

Reka Albert

Distinguished Professor of Physics and Biology
Pennsylvania State University

Network-based cancer therapeutic target identification

Dynamic models of within-cell networks can explain how these networks integrate internal and external inputs to give rise to the appropriate cellular response. These models can be fruitfully used in cancer cells, whose aberrant decision-making can be connected to errors in the state of nodes or edges of gene regulatory or signaling networks. Over nearly fifteen years of collaboration with wet-bench scientists, my group has found that discrete dynamic modeling is very useful in synthesizing qualitative interaction information into a predictive model. This talk will present our recent discrete dynamic models of signal transduction networks relevant to proliferation, apoptosis, or epithelial to mesenchymal transition of cancer cells. Our model of PI3K mutant breast cancer identifies resistance mechanisms to PI3K inhibitors and successful combinatorial interventions. Our model of epithelial to mesenchymal transition in liver cancer identifies combinatorial interventions that can successfully block the transition, or revert to the epithelial state. Discrete (logical) regulatory functions can be integrated into the regulatory network, allowing the insightful use of graph theoretic methods. For example, we identified specific strongly connected components, called stable motifs, that can maintain an associated state regardless of the rest of the network, and thus represent points of no return in the dynamics of the system. Control of (a subset of) these stable motifs can guide the system into a desired attractor. I expect that network-based analyses will play an increasing role in the rational design of high-order therapeutic combinations.


October 17, 2017

Joe Pickrell

Junior Group Leader and Core Member
New York Genome Center

What can we learn from human genomics at scale and how do we get there?


September 19, 2017

Peter Kraft

Professor of Epidemiology
Harvard T.H. Chan School of Public Health

Genome-wide contributions to breast cancer risk, inferred from integrative analyses of 230,000 women

2016-2017 Seminars


Tuesday, May 2, 2017
12:30-1:30 PM
Kresge G2

Albert Hofman
Chair of the Department of Epidemiology
Stephen B. Kay Family Professor of Public Health and Clinical Epidemiology

The Alzheimer enigma: finding genes for dementia


Tuesday, April 4, 2017
12:30-1:30 PM
Kresge G2

Peter Park
Professor of Biomedical Informatics
Harvard Medical School

Somatic mutations in the brain

Whereas somatic mutations that confer a selective advantage to the proliferating cells play a critical role in tumorigenesis, the extent to which somatic mutations are present in post-mitotic neurons and whether they could play a role in neurodevelopmental or neurodegenerative diseases have been less clear. I will describe our efforts in using single cell whole-genome sequencing to characterize mutations in single neurons, focusing on the computational challenges in distinguishing true mutations from those generated from DNA amplification.

For computational scientists and statisticians, engaging in productive collaborations with experimentalists is essential. In the second part of the talk, I will discuss some of the lessons I have learned in how to best collaborate.

Thursday, March 23, 2017
12:30-1:30 PM
Kresge G1


Sarah Teichmann
Head of Cellular Genetics
Wellcome Trust Sanger Institute

Understanding Cellular Heterogeneity

From techniques such as microscopy and FACS analysis, we know that many cell populations harbour heterogeneity in morphology and protein expression. With the advent of high throughput single cell RNA-sequencing, we can now quantify transcriptomic cell-to-cell variation. I will discuss technical advances and biological insights into understanding cellular heterogeneity in T cells and ES cells using single cell RNA-sequencing.


Tuesday, December 6, 2016
12:30-1:30 PM
FXB G12

Ben Voight

Assistant Professor of Pharmacology and Genetics
University of Pennsylvania

“Genome-wide mutation rates viewed through a local nucleotide window”

The rate of mutation varies substantially across the human genome and fundamentally influences evolution and the incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site—the site’s trinucleotide sequence context—to study polymorphism levels across the genome. But the impact of larger sequence contexts has not been fully clarified even though context substantially influences rates of polymorphism. I will describe a new statistical framework we apply to data from the 1000 Genomes Project to demonstrate that a heptanucleotide (seven nucleotide) context window explains a substantial fraction of variability in rate of single nucleotide polymorphisms across the human genome. I will also describe new work and applications to model the frequency of small insertion/deletion polymorphisms and population-specific mutations.


Tuesday, November 15, 2016
12:30-1:30 PM
FXB G12

John Stamatoyannopoulos

Professor of Genome Sciences and Medicine
University of Washington

“Decoding human genome regulation: Moving beyond cells and maps”


Tuesday, October 18, 2016

12:30-1:30 PM
FXB G12

Cheng-Zhong Zhang

Assistant Professor
Dana-Farber Cancer Institute

“Sequencing is believing: probing mutagenesis one cell at a time”

Single-cell sequencing can directly reveal cell-to-cell variation at both genomic and transcriptomic levels. The digital nature and base-pair resolution of single-cell sequencing also make it a high-resolution, high-throughput assay for studying molecular biology at the single-cell level. We have recently developed an approach (‘Look-Seq’) combining DNA sequencing and live-cell imaging to characterize DNA damage due to cell division errors. The Look-Seq analysis uses the chromosomal haplotype to directly link genetic mutations detected by DNA sequencing to aberrant chromosomes detected by live-cell imaging. Importantly, comparison of the genetic variants phased to each homologous chromosome distinguishes biological variation on a single chromosome (but not its homolog) from single-cell sequencing artifacts affecting both homologs. I will discuss how DNA damage and chromosome missegregation can be measured by single-cell sequencing and how we apply the Look-Seq analysis to characterize mutagenesis on lagging or bridging chromosomes.


Tuesday, September 20, 2016
12:30-1:30 PM
FXB G12

Bogdan Pasaniuc

Assistant Professor, Pathology and Laboratory Medicine
David Geffen School of Medicine, UCLA

“Emerging methods from summary GWAS data to understand genetics of complex traits”

Many complex traits and diseases are correlated at the phenotypic level. Such correlations can be attributed to shared environmental or genetic architectures. Quantifying the correlation in phenotypes that is due to genetics is of great interest in understanding the causal relationship between complex traits. Standard approaches to estimating it either require individual-level genotype data or make assumptions (e.g. random-effect) that render them less suitable for local estimation. Here I present new methods for estimation of local genetic variance/covariance and discuss their relationship with the recently proposed methods for gene expression prediction as a tool to integrate eQTL and GWAS for a transcriptome-wide association study (TWAS).

2015-2016 Seminars


Tuesday, April 12, 2016
12:30-2:00 PM
Kresge 200

Giovanni Parmigiani
Professor of Biostatistics
Department of Biostatistics and Computational Biology, DFCI
Department of Biostatistics, Harvard Chan School

Cross-study Analysis of Prediction Algorithms in Genomics


Tuesday, February 23, 2016
12:30-2:00 PM
Kresge 200

Mark Daly
Associate Professor of Medicine,
Harvard Medical School
Chief, Analytic and Translational Unit
Center for Human Genetic Research
Department of Medicine
Massachusetts General Hospital
Senior Associate Member
The Broad Institute of MIT and Harvard

Progress in Human Genetics: Putting GWAS results to work


Tuesday, February 2, 2016
2:00-3:00 PM
Kresge 502

Ivana Bozic
Research Associate
Department of Mathematics
Harvard University

Stochastic Evolutionary Modeling of Cancer Development and Resistance to Treatment

Cancer is the result of a stochastic evolutionary process characterized by the accumulation of mutations that are responsible for tumor growth, immune escape, and drug resistance, as well as mutations with no effect on the phenotype. Stochastic modeling can be used to describe the dynamics of tumor cell populations and obtain insights into the hidden evolutionary processes leading to cancer. I will present recent approaches that use branching process models of cancer evolution to quantify intra-tumor heterogeneity and the development of drug resistance, and their implications for interpretation of cancer sequencing data and the design of optimal treatment strategies.


Monday, February 1, 2016
12:30-2:00 PM
FXB G11

Hector Corrada Bravo
Assistant Professor
Department of Computer Science
University of Maryland

Visualization, Statistical Modeling and Discovery in Computational Epigenomics

The use of epigenomics to study mechanisms in development and disease using high-throughput techniques has been one of the most active areas in life and clinical sciences in the last five years. In this talk, I will present advances in statistical learning methods and data visualization for computational epigenomics and fundamental discoveries of molecular mechanisms in cancer facilitated by these tools.


Thursday, January 28, 2016
2:00-3:00 PM
Building 2, Room 426

a pizza lunch will be provided

Christine Peterson
Postdoctoral Scholar
Department of Health Research and Policy
Stanford University

Statistical Approaches for Making Sense of High-throughput Biological Data

In this talk, I will discuss statistical approaches I have developed to gain insight into the complex networks of regulation and interaction that govern biological systems. Understanding these networks and how they are disrupted by disease is an important step in identifying potential targets for the treatment of disease. Firstly, I will describe my work on the inference of biological networks such as metabolic or protein interaction networks from high-throughput data. In particular, I will address graphical modeling methods I have proposed in the Bayesian framework for inferring such networks based on limited sample sizes, and illustrate the application of these approaches to highlight mechanisms underlying cancer progression. Secondly, I will address the problem of establishing the genetic basis of multivariate traits such as gene expression or other molecular profiling data. Here I propose a multi-stage multiple testing procedure which controls important error rates regarding the discovery of regulatory variants and the association of these variants to traits.


Monday, January 25, 2016
2:00-3:30 PM
FXB G11

a pizza lunch will be provided

Michael I. Love
Postdoctoral Research Fellow, Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute and Department of Biostatistics, Harvard T.H. Chan School of Public Health

Statistical Methods for RNA-seq Data

Quantification of gene expression from RNA sequencing data is a fundamental task in computational biology, critical for projects across biological and biomedical sciences. Statistical analysis of RNA-seq data, such as identification of differentially expressed genes across samples or estimation of isoform abundances, presents new challenges: non-normality of count data, dependence of the variance on the mean, as well as technical artifacts in measurements. In this talk, I will discuss statistical methods I have developed for RNA-seq data, including robust estimators for inference of differential expression and an approach to remove systematic errors in isoform abundance estimates arising from variations in sample preparation.


Thursday, January 21, 2016
12:30-2:00 PM
FXB G13

a pizza lunch will be provided

Manuel Rivas
Broad Institute and Massachusetts General Hospital
Wellcome Trust Centre for Human Genetics Research, University of Oxford

Finding signals in human genome sequencing studies: Data, Models, and Inference

Genome sequencing studies applied to large case-control series, populations or biobanks with extensive phenotyping raise novel analytical challenges and present new opportunities to interrogate the human genome to better understand disease. In this talk I will focus on a special class of genetic variants that are increasingly found in sequencing studies, protein truncating variants (PTVs), which are typically expected to have a large effect on gene function and are enriched for disease-causing mutations, although in the past few years some have been found to be protective against disease. PTVs, while not the only variants relevant to disease, offer unique insights into likely benefits and risks from therapeutic inhibition of the gene. I will consider recent sequencing efforts in autoimmune diseases and cardiometabolic traits to identify protective PTVs, and discuss the importance of improving our understanding of their functional consequences. Finally, I will introduce a statistical framework, named MRP, for rare variant association studies, that considers correlation, scale, and directionality of genetic effects across a group of 1) genetic variants, 2) phenotypes, and 3) studies. In so doing I am able to present formulations of the framework that consider the use of summary statistic data, the standard univariate and multivariate gene-based models, models for identifying protective protein-truncating variants, or computational algorithms to estimate the underlying mixture of neutral and functional variants from the distribution of rare variants and phenotype. These formulations may provide opportunities for discovery and inference that are not addressed by the traditional one variant-one phenotype association study.
These extensions are critical and poised to take advantage of major biobanking and precision medicine initiatives since we need to understand the full range of medical consequences – good and bad – of variation in a gene in order to confidently generate effective therapeutic hypotheses while recognizing unintended consequences up front rather than after tremendous investment is made.


Tuesday, January 19, 2016
12:30-2:00 PM
FXB G12

James Zou
Postdoctoral Fellow
Microsoft Research New England

Harnessing the unseen for next generation population genomics and epigenomics

Sequencing of large human populations has the potential to transform disease diagnosis and treatment. In order to harness the power of this data avalanche, it is crucial to model and leverage the data and covariates that we do not see. I will illustrate this concept with two examples in genomics and epigenomics, where I developed scalable statistical algorithms with strong mathematical guarantees. I will first discuss my close collaboration with the largest exome sequencing consortium (ExAC) to infer statistical properties of rare and unseen human genetic variations. This work provides a unified framework to quantify the natural selection acting on our genome, annotate functional constraints, and predict the discovery rate of future sequencing projects. In the second part, I will describe complementary work to identify changes in the packing and chemical modifications of DNA—i.e., epigenomic variation—that are associated with diseases. This work requires flexible models of unseen covariates, especially cell-type composition. I will conclude by discussing the general statistical lessons we have learned and new research directions.


Tuesday, December 15, 2015
12:30-2:00 PM
FXB G12

Elin Grundberg
Assistant Professor
Human Genetics
McGill University

Capture the functional (adipose) epigenome for insight into metabolic disease risk

Common diseases such as obesity affect an alarmingly large number of individuals worldwide. Obesity is complex in nature, meaning the disease is caused by multiple underlying factors, of which only 30-40% are believed to be genetic effects. In past years most of these gene regions have been characterized, but how they manifest themselves, interact with the environment or differ from pure environmental effects is still not known. With the development of novel high-throughput DNA sequencing technologies we are now able to screen the millions of sites in our genome that are susceptible to not only genetic but also environmental (‘epigenetic’) modulation. I will discuss our efforts in implementing novel next-generation sequencing-based epigenomic tools and how these technology breakthroughs allow us to extend our knowledge of biological processes associated with common diseases.


Tuesday, November 17, 2015
12:30-2:00 PM
FXB G12

Gad Getz
Director
The Cancer Genome Computational Analysis Group
The Broad Institute of MIT and Harvard

Cancer Genomics and Evolution


Tuesday, October 20, 2015
12:30-2:00 PM
Kresge 502

Scott Carter
Assistant Professor of Computational Biology
Department of Biostatistics
Dana-Farber Cancer Institute
Harvard T.H. Chan School of Public Health

Computational dissection of intra-tumor genetic heterogeneity and applications to the study of cancer treatment, evolution, and metastasis


Tuesday, September 29, 2015
12:30-2:00 PM
Kresge 502

Timothy Rebbeck
Professor of Cancer Epidemiology
Department of Epidemiology
Harvard T.H. Chan School of Public Health

Prediction and Modification of Cancer Risk in BRCA1/2 Mutation Carriers

Inherited mutations in BRCA1 and BRCA2 confer breast and ovarian cancer risks that are substantially higher than in the general population. In the two decades since these genes were identified, we have learned that risks associated with these mutations may be modified by other factors and exhibit substantial genotype-phenotype heterogeneity. While BRCA1/2 are not likely to be representative of all disease predisposing genes in the population, they serve as a paradigm for our understanding of genomic etiology and precision prevention.

2014-2015 Seminars


Friday, May 15, 2015
12:30-2:00 PM
Kresge G2

Christopher Amos
Professor of Community and Family Medicine
Professor of Genetics
Associate Director for Population Sciences, Norris Cotton Cancer Center
Dartmouth Geisel School of Medicine

Modeling Genome Wide Effects on Cancer Risk

Genome wide association studies (GWAS) have been a highly effective tool for exploring genetic contributions to complex diseases, but have failed to explain very much of the heritability or total genetic risk associated with cancer development. In this talk I describe ongoing efforts to characterize the features of single nucleotide polymorphisms that are associated with successful replication of findings during GWAS. These features indicate the attributes that should be considered when designing studies, and particularly when assigning weights for evaluating findings from GWAS and large scale sequencing studies. Finally I describe new methods for seeking to identify further components of missing heritability that reflect gene-gene and gene-environment interactions. These interactions are modeled using a Bayesian approach. Results from simulation studies and application to data from association studies of lung cancer show that a model that weakly constrains the prior probabilities that interactions are included generally outperformed, as reflected by root mean squared error and posterior probabilities, both models with stricter constraints and models that did not impose constraints.


Tuesday, May 5, 2015
12:30-2:00 PM
FXB G12

Steve Horvath
Professor, Human Genetics & Biostatistics
University of California, Los Angeles

The Epigenetic Clock and Biological Age

I recently developed a DNA methylation based biomarker of aging known as the “epigenetic clock”, which can be used to measure the DNA methylation (DNAm) age of any human (or chimpanzee) tissue, cell type, or fluid that contains DNA (with the exception of sperm). DNA methylation age of blood has been shown to predict all-cause mortality in later life, even after adjusting for known risk factors, which suggests that it relates to the biological aging process. Similarly, markers of physical and mental fitness are also found to be associated with the epigenetic clock (lower abilities associated with age acceleration). These results suggest that we may be close to achieving a long standing milestone in aging research: the development of an accurate measure of tissue age or even biological age. I will present several applications of this measure of tissue age, e.g. obesity and trisomy 21.

References:
1) Horvath S (2013) DNA methylation age of human tissues and cell types. Genome Biology 14:R115. DOI: 10.1186/gb-2013-14-10-r115 PMID: 24138928
2) Marioni R, Shah S, McRae A, Chen B, Colicino E, Harris S, Gibson J, Henders A, Redmond P, Cox S, Pattie A, Corley J, Murphy L, Martin N, Montgomery G, Feinberg A, Fallin M, Multhaup M, Jaffe A, Joehanes R, Schwartz J, Just A, Lunetta K, Murabito JM, Starr J, Horvath S, Baccarelli A, Levy D, Visscher P, Wray N, Deary I (2015) DNA methylation age of blood predicts all-cause mortality in later life. Genome Biology 16:25 doi:10.1186/s13059-015-0584-6
3) Horvath S, Erhart W, Brosch M, Ammerpohl O, von Schoenfels W, Ahrens M, Heits N, Bell JT, Tsai PC, Spector TD, Deloukas P, Siebert R, Sipos B, Becker T, Roecken C, Schafmayer C, Hampe J (2014) Obesity accelerates epigenetic aging of human liver. Proc Natl Acad Sci U S A. pii: 201412759. DOI: 10.1073/pnas.1412759111 PMID: 25313081
4) Horvath S, Garagnani P, Bacalini MG, Pirazzini C, Salvioli S, Gentilini D, DiBlasio AM, Giuliani C, Tung S, Vinters HV, Franceschi C (2015) Accelerated Epigenetic Aging in Down Syndrome. Aging Cell. 9 FEB 2015 DOI: 10.1111/acel.12325 PMID: 25678027


Tuesday, April 14, 2015
12:30-2:00 PM
FXB G12

Hua Tang
Associate Professor, Genetics
Associate Professor (By courtesy), Statistics
Member, Bio-X
Member, Stanford Cancer Institute
Stanford School of Medicine

Learning about the Genetic Architecture of Complex Traits Across Populations

Genome-wide association studies (GWAS) have become a standard approach for identifying loci influencing complex traits. However, GWAS in non-European populations are hampered by limited sample sizes and are thus underpowered. Can GWAS results in one population be exploited to boost the power of mapping loci relevant in another population? The first part of this talk will describe a set of analyses that address the question, “to what extent does the genetic architecture of a complex trait overlap between human populations?” The second part of the talk will introduce an empirical Bayes approach, which improves the power of mapping trait loci relevant in a specific minority population through adaptively leveraging multi-ethnic evidence. A case study on plasma lipid concentration will be presented.


Tuesday, December 9, 2014
12:30-2:00 PM
Kresge G2

Brian Browning
Associate Professor
Department of Medicine, Division of Medical Genetics
University of Washington

Haplotype frequency models: what they are and why they matter

Haplotype frequency models are used to estimate the population frequency of a sequence of alleles at tightly-linked loci. These models are used in a wide variety of genetic analyses because they enable analyses to use information from correlated, closely-spaced variants. This talk will describe a new haplotype frequency model that uses relatedness to reduce computational complexity. The new model uses the same graphical model as the Beagle haplotype frequency model, but unlike the original Beagle model, the new model incorporates genetic recombination, genotype error, and identity by descent. We can use this model to estimate haplotypes from unphased genotype data. The new model produces more accurate haplotypes than existing methods, and it requires substantially less computation time than the most accurate existing method.


Tuesday, November 4, 2014
12:30-2:00 PM
Kresge G2

Molly Przeworski
Professor
Department of Biological Sciences &
Department of Systems Biology
Columbia University

An Evolutionary Perspective on Human Germline Mutation

The revolution in sequencing technologies has made it feasible to identify de novo mutations in transmissions from parents to offspring, providing an unprecedented opportunity to learn about the genesis and properties of germline mutations. As we show, however, when recent pedigree studies are considered jointly and alongside results from other methodologies, it becomes clear that the pieces of the puzzle do not fit together. We discuss these gaps in our understanding in terms of three sets of interwoven questions: (i) On a mechanistic level, what proportion of mutations is introduced through mistakes in the replication process versus non-replicative, “spontaneous” errors? (ii) In terms of variation among individuals, why do mutation rates depend so strongly on sex and age? (iii) From an evolutionary perspective, how do mating systems and life history traits shape the mutation rate of a species? We present simple mathematical models for the behavior of replication-driven and spontaneous errors over ontogenesis and discuss implications for human genetics and evolutionary biology.


Tuesday, October 14, 2014
12:30-2:00 PM
Kresge G2

Hongzhe Li
Professor
Departments of Biostatistics & Epidemiology
University of Pennsylvania – Perelman School of Medicine

Microbiome, Metagenomics and High-Dimensional Compositional Data Analysis

Next-generation sequencing technologies allow 16S ribosomal RNA gene surveys or whole metagenome shotgun sequencing in order to characterize taxonomic and functional compositions of gut microbiomes. The outputs from such studies are short sequence reads derived from a mixture of genomes of different species in a given microbial community. We first present a brief overview of the statistical methods we used for 16S rRNA data analysis. We then introduce a multi-sample model-based method to quantify the bacterial compositions based on shotgun metagenomics data using species-specific marker genes. The resulting data are high-dimensional compositional data, which complicate many of the downstream analyses. We introduce GLMs with linear constraints on the regression parameters in order to identify the bacterial taxa that are associated with clinical outcomes, and a composition-adjusted thresholding procedure to estimate the correlation network from compositional data. We demonstrate the methods using two on-going gut microbiome studies at the University of Pennsylvania.


Tuesday, September 16, 2014
12:30-2:00 PM
Kresge G2

Soumya Raychaudhuri
Assistant Professor of Medicine, Harvard Medical School
Divisions of Genetics & Rheumatology
Department of Medicine, Brigham and Women’s Hospital

“Disentangling effects of colocalizing genomic annotations to functionally prioritize non-coding variants within complex trait loci”

2013-2014 Seminars


Thursday, May 6, 2014
12:30-2:00, FXB G13

Steven McCarroll
Professor in the Genetics Department
Harvard Medical School

Where is the rest of the human genome?

Whole-genome sequencing is increasingly used to search for genetic variants underlying human disease. In this seminar, I want to describe ways in which every sequencing experiment can also be used to teach us surprising things about how genomes work in everyone. First, there are large amounts of human genome sequence that are missing from maps of the human genome – but using a combination of mathematics and historical mixtures of human populations, we can figure out where these genes have been hiding and how they have remained hidden from view. Second, some regions of the human genome segregate in many different structural forms within human populations, and appear to contribute to biological variation among humans. Third, we can use whole genome sequence data to study active processes of DNA replication in human cells, with surprising findings about how DNA replication varies from person to person.


Thursday, April 17, 2014
12:30-2:00, FXB G12

Ben Yung
Head of Department of Health Technology and Informatics
Chair Professor of Biomedical Science
The Hong Kong Polytechnic University

A multidisciplinary research on cancer and its metabolic risk factor: from computational characterization, functional discovery to clinical diagnostics development

Nucleophosmin 1 (NPM1) was first identified as a nucleolar phosphoprotein and subsequently shown to be highly expressed in the granular region of the nucleolus. NPM1 increases rapidly in response to mitogenic stimuli and elevated expression of NPM protein is detected in highly proliferating and malignant cells. In addition, NPM has proven to be a multifunctional protein involved in many cellular activities including ribosomal biogenesis, centrosome duplication and transcription regulation. As our understanding of NPM1 has increased, more complex mechanisms in cancer cells will be revealed.

Distributions of expressional correlations over neoplastic and normal states reveal structural difference at a threshold, which defines a strongly co-expressed gene network with the best coherence with neoplasm. By such novel structural co-expression analysis, genome-wide co-expression in normal state was found to be stronger than that in chronic myelogenous leukemia (CML). Conversely, more links between NPM1 and BCR-ABL-related pathway were noted in CML. Normal-specific network showed dissociation of NPM1 with ribosomal proteins (RP) while CML-specific co-expressions rendered a large network connecting NPM1 to RP genes through RPL10A, RPL31 and RPL36A. Our results implicated a critical role of NPM1 in joining a cascade of ribosomal biogenesis, protein synthesis, cell proliferative and anti-apoptotic events in CML. Furthermore, we speculated that NPM1 and its co-expressed genes may be illegitimately activated in CML, as inferred by their positive expressional correlations and targeted by the same transcription factor set. This novel network analysis platform can also be applied to other cancers in order to discover promising markers for diagnostic, prognostic and therapeutic applications. Furthermore, we postulate that for gene networks that are transcriptionally dis-coordinated in cancer, their methylation states may have already been altered before the disease onset. To reveal such molecular association, we aim to study the differential methylation patterns of metabolic syndromes for cancer gene co-expression networks that will be identified in the project.


Tuesday, April 15, 2014
12:30-2:00, FXB G13

Eli Stahl

Assistant Professor, Psychiatry
Assistant Professor, Genetics and Genomic Sciences
Mount Sinai Icahn School of Medicine

Rare Variant Genetic Architecture of Schizophrenia and Bipolar Disorder


Tuesday, March 11, 2014
12:30-2:00, Kresge G2

Lior Pachter
Raymond and Beverly Sackler Chair in Computational Biology
Director of the Center for Computational Biology
Professor of Molecular and Cell Biology, Mathematics, and Computer Science
University of California, Berkeley

Making sense of RNA-Seq

RNA-Seq has become one of the primary applications for high-throughput sequencing, providing an unprecedented view of the dynamics of transcriptomes in a wide variety of organisms, tissues and settings. I will discuss recent technological developments in RNA-Seq and the implications they have for analysis and interpretation. Using examples from my own research, I will show how transcriptomics is now tractable at the isoform level, and how it is shedding light on previously unsolved problems in developmental biology and population genomics.


Tuesday, February 18, 2014
12:30-2:00, Kresge G2

Laura Lazzeroni
Associate Professor (Research) of Psychiatry and Behavioral Sciences
Stanford University School of Medicine

Interpretation of P-values: Uncertainty, Estimation and Replication

Scientists often use the p-value as a measure of evidence in high-dimensional analyses such as genome wide association studies (GWAS). I will present some ideas about interpreting p-values and utilizing the information they provide. Topics to be discussed include:

· The p-value as an estimator.
· P-values and replication.
· Designing GWAS follow-up studies.
· Selection bias corrections to offset the “winner’s curse”.
· Comparing evidence from independent SNPs.

My collaborators on this research are Ilana Belitskaya-Levy and Ying Lu.


Tuesday, January 28, 2014
12:30-2:00, Kresge G2

Jun Liu, Ph.D.
Professor of Statistics, Department of Statistics, Harvard University
Professor in the Department of Biostatistics, HSPH

Detection and Expansion of Gene Modules Based on Evolutionary History

Availability of genome sequences from diverse organisms provides a special opportunity to chart the evolutionary history of genes of interest. Such analyses provide insights into evolutionary pressures driving gene retention or loss and help to predict gene function based on correlated evolution. A major challenge in defining the phylogeny of a pathway, however, lies in the fact that its members typically do not exhibit a single, coherent ancestry, but rather, comprise a mosaic of evolutionary gene modules, each with a distinct history. We introduce a new computational method for automated detection and expansion of such modules in eukaryotes. Our method, called CLIME (clustering by inferred models of evolution), accepts as input a predefined species tree, a homology matrix, and a gene set of interest. CLIME partitions the input gene set into disjoint modules, simultaneously learning the number of modules and an evolutionary model that defines each module. Using these modules CLIME scores all genes in the genome for the likelihood of having emerged under a module’s inferred history, thereby expanding its membership. We applied CLIME to a tree of life consisting of 138 eukaryotic organisms. CLIME faithfully recovers known evolutionary modules within mitochondrial complex I, the calcium uniporter, and cilia while yielding new predictions. We have also applied it systematically to over 1000 classically defined human pathways, as well as the entire proteomes of yeast, red algae, and malaria. The results reveal unanticipated evolutionary modularity and novel, co-evolving components within many well-studied pathways. CLIME should become increasingly useful with the growing wealth of genome sequences from highly diverse organisms.

Based on joint work with Yang Li, Sarah E. Calvo, Roee Gutman, and Vamsi Mootha


Tuesday, December 17, 2013
12:30-2:00, FXB G13

Raphael Gottardo
Fred Hutchinson Cancer Research Center
Vaccine and Infectious Disease Division
Public Health Sciences Division

Characterizing Antigen-specific T-cell Poly-functionality Using Single-cell Assays

Cell populations in blood and tissue are not homogeneous; even clonotypes of individual cells can exist in different biochemical states that define measurable functional differences between them. This single-cell heterogeneity is informative, but lost in assays that measure cell mixtures. Recent technical advances such as cytometry and multiplexed microfluidics have enabled the high-throughput quantification of genes or proteins at the single-cell level. Although many analytic tools exist for analyzing high-dimensional data, such as from gene expression arrays, none have been developed specifically for the analysis of single-cell data, which has its own bioinformatics and statistical challenges. During this talk I will give an overview of statistical challenges involved in the analysis of single-cell data and show how such technologies can be used to characterize antigen-specific T-cells.


Tuesday, November 12, 2013
12:30-2:00, FXB G13

Robert Plenge
Vice President, Head of Genetics and Pharmacogenomics
Merck Research Laboratories

Human Genetics for Target Validation in Drug Discovery

More than 90% of the compounds that enter clinical trials fail to demonstrate sufficient safety and efficacy to gain regulatory approval. Most of this failure is due to the limited predictive value of preclinical models of disease, and our continued ignorance regarding the consequences of perturbing specific targets over long periods of time in humans. ‘Experiments of nature’ — naturally occurring mutations in humans that affect the activity of a particular protein target or targets — can be used to estimate the probable efficacy and toxicity of a drug targeting such proteins, as well as to establish causal rather than reactive relationships between targets and outcomes. In my talk I will describe the concept of dose–response curves derived from experiments of nature, with an emphasis on human genetics as a valuable tool to prioritize molecular targets in drug development. I will discuss empirical examples of drug–gene pairs that support the role of human genetics in testing therapeutic hypotheses at the stage of target validation, provide objective criteria to prioritize genetic findings for future drug discovery efforts and highlight the limitations of a target validation approach that is anchored in human genetics. Further, I will emphasize how human genetics can be used to uncover critical biological pathways, and how these pathways can be used to guide drug discovery.


MONDAY, OCTOBER 21, 2013
12:30-2:00, FXB G13

William Cookson
Professor of Genomic Medicine
Faculty of Medicine, National Heart & Lung Institute
Imperial College London

Epigenetic Association Mapping: the Mother of all Information

The extent and determinants of epigenetic variation in the human genome are not systematically understood, hindering the application of genome-wide studies of epigenetic status for common complex diseases. By studying DNA from peripheral blood lymphocytes (PBL) in families we have used segregation analyses to quantify the heritable and environmental components of CpG Island (CGI) methylation. We find CGI methylation captures cell-specific genomic responses to factors that are largely environmentally derived. We have identified reproducible CGI associations that account for 20% of variation in the total serum IgE (an important quantitative trait underlying asthma and allergy), and attributed this variation to circulating eosinophils. Genome function is mediated through interactive networks of genes and regulatory elements, and we have discovered the presence of strongly co-ordinated regulation of CGI in the form of 30 scale-free correlation networks (meQTN). Enrichment analysis identified meQTN modules that could be attributed to peripheral blood neutrophils, lymphocytes, monocytes and eosinophils. These and other modules were enriched by CGI associated to asthma and the total serum IgE. Although DNA is usually considered a static repository of information, our studies suggest genome-wide mapping of CGI may directly and deeply inform on mechanisms of diverse diseases.


Tuesday, September 17, 2013
12:30-2:30, Kresge G2

Alon Keinan
Assistant Professor
Department of Biological Statistics & Computational Biology
Cornell University

Recent human population growth: rare variants, mutation load, and complex disease

Human populations have experienced recent explosive growth since the Neolithic revolution. We demonstrate how such growth predicts an abundance of rare variants, and show that it has not been captured by earlier demographic modeling studies mostly due to small sample size. Recent studies that sequenced a very large number of individuals observed an extreme excess of rare variants, and provided clear evidence of recent population growth, though demographic estimates have varied greatly among studies. These studies were based on protein-coding genes, in which variants are also impacted by natural selection. Hence, we introduce new targeted sequencing data for studying recent human history with minimal confounding by natural selection. Modeling recent demographic history based on the distribution of allele frequencies in these data, our models fit very well and shed light on the discrepancies among recent studies. Another important question is how negative selection operates during a recent epoch of rapid population growth, when the population is not at equilibrium. We examined the trajectories of mutations with different fitness effects using computer simulations and conclude that each individual carries slightly more deleterious alleles than expected in the absence of growth, but the average fitness effect of these alleles is less deleterious. Combined, our results point to increased load of rare variants with small effect size playing a role in the individual genetic burden of complex disease risk.

2012-2013 Seminars


Tuesday, May 7, 2013
12:30-2:00
FXB G13

Franziska Michor, Ph.D.
Associate Professor
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
Department of Biostatistics, Harvard School of Public Health

Evolution of a Cancer Genome

Cancer emerges due to an evolutionary process in somatic tissue. The fundamental laws of evolution can best be formulated as exact mathematical equations. Therefore, the process of cancer initiation and progression is amenable to mathematical investigation. Current areas of research of the lab include cancer stem cells, evolution of drug resistance, and the dynamics of metastasis formation. In this talk I will introduce two examples of the application of evolutionary theory to cancer genomics and treatment.


Tuesday, April 30, 2013
12:30-2:00
FXB G12

Jonathan Pritchard
Professor, Department of Human Genetics
The University of Chicago

The genetic basis of human gene expression variation

Genetic variants that impact gene regulation likely play a central role in both evolution and the genetics of complex traits. Yet the mechanisms by which they do so are poorly understood and it remains very difficult to predict which variants have regulatory effects in any given cell type. In this talk I will describe work that our group has done on mapping regulatory variants such as expression QTLs to understand the primary mechanisms by which such variants act. I will discuss both our work on analytical methods and the biological results.


Tuesday, April 9, 2013
12:30-2:00
Kresge G3

Douglas M. Robinson
Senior Associate Director of Biostatistics
Novartis Institute for Biomedical Research, Inc.

The predictive value of a 5-gene signature as a patient pre-selection tool in medulloblastoma for Hedgehog pathway inhibitor therapy

Medulloblastoma (MB), an invasive primitive neuroectodermal tumor of the posterior fossa, is the most common brain tumor in children, comprising ~20% of childhood and <2% of adult brain tumors. Current standard of care treatment, surgery followed by craniospinal radiation and chemotherapy, can lead to significant long term toxicities, especially in very young patients. At the time of relapse, no standard salvage therapy exists. Therefore, targeted therapies are needed. Several studies have used gene expression profiling to identify distinct molecular subgroups of MB, including one characterized by activated Hedgehog (Hh) signaling. Using available gene expression data, a 5-gene Hh signature that can be assayed in formalin-fixed paraffin-embedded (FFPE) samples by standard RT-PCR was identified.

Two sets of matched fresh frozen and FFPE MB specimens were used; one for development of the 5-gene signature and one for its independent validation. Hh activation status was determined in fresh frozen samples by gene expression profiling using the GeneChip human genome U133 Plus 2.0 array (Affymetrix, Santa Clara, CA) and in FFPE samples by RT-PCR analysis.

The 5-gene Hh signature was selected from a larger panel of 73 genes that were associated with the Hh subgroup classification, as determined by standard Affymetrix gene expression profiling. Eighteen of these genes shown to be differentially expressed in FFPE were chosen for the RT-PCR gene card that formed the basis of the Elastic Net model building exercise. Based on the expression levels of the 5-gene signature, a predictive model was used to compute a propensity score (0–100%) representative of the Hh activation status of each tumor sample. The median propensity score for the 17 non-Hh-activated tumors was 0.7% (range: 0.1–3.0%) compared to 87.9% (range: 69.1–97.6%) in the eight Hh-activated tumors. The Hh activation status of 25 independent MB samples, as defined by the 5-gene signature and assayed by RT-PCR, was in 100% agreement with the Hh activation status determined by gene expression profiling. In order to determine the predictive value of this assay as a tool to identify patients who might benefit from treatment with a Hh pathway inhibitor, MB samples from patients (n=13) enrolled in recent phase I trials of the Smoothened inhibitor LDE225 were analyzed and correlated with the respective tumor responses. Using the 5-gene signature, all patients (n=4) who responded to LDE225 treatment (PR or CR) were found to have Hh-pathway activated tumors, whereas all patients who did not respond (n=9) were found to have Hh non-activated tumors. These results suggest an association between Hh activation status determined by the 5-gene Hh signature and tumor response to LDE225 treatment. Data from an ongoing phase I/II trial in pediatric patients will enable determination of the predictive value of this patient pre-selection assay.


Tuesday, March 12, 2013
12:30-2:00, KRESGE G2

Francis Ouellette
Associate Director
Informatics and Bio-computing
Ontario Institute for Cancer Research

Processing cancer genomic data at the Ontario Institute for Cancer Research for the International Cancer Genome Consortium

The goal of the International Cancer Genome Consortium (ICGC) is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe. This goal is well under way, and we plan to complete this in the next few years. The Ontario Institute for Cancer Research (OICR) is active with the ICGC on many fronts: it is involved in generating data for 2 of the 50 different tumour types, it is the executive headquarters for the ICGC, and it is also the home for the Data Coordinating Center (DCC). I will present on this, and how this integrates with some of The Cancer Genome Atlas (TCGA) activities and some of the work we have done in my group on using data from the ICGC.


Tuesday, February 19, 2013
12:30-2:00, KRESGE G2

Benjamin Neale, PhD
Assistant Professor, Analytic and Translational Genetics Unit – Mass General Hospital
Instructor, Harvard Medical School
Associated Researcher, The Broad Institute

Analytic Issues in the Assessment of Rare Variation

With advances in sequencing and genotyping technology, human genetics is now capable of assaying rare variation. In this talk, analytic issues associated with assessing the impact of rare variation on complex traits will be described. Specifically, the asymptotic properties of common association tests will be described. In addition, a comprehensive overview will be given of how to model the probability of de novo mutation and how to leverage this to assess the strength of evidence for a given gene.


Tuesday, January 29, 2013
12:30-2:00, KRESGE G2

Brad Bernstein, MD, PhD
Associate Pathologist, Massachusetts General Hospital
Associate Professor of Pathology, Harvard Medical School
Early Career Scientist, Howard Hughes Medical Institute
Senior Associate Member, Broad Institute

Global Epigenetic Regulation of Normal and Malignant Cells

We are using high-throughput sequencing technology to characterize chromatin state and regulation in normal and cancer stem cells. The data enable systematic annotation of proximal and distal gene regulatory elements, their cell type-specificities and their functional interactions. We show how these approaches can be used to interpret genetic variants associated with human disease, as well as to reconstruct transcriptional regulatory networks in cancer stem cells. Finally, we will present characterizations of chromatin regulator proteins that suggest strategies for modulating genome function and cell state with chemical inhibitors of chromatin enzymes.


Tuesday, December 11, 2012
12:30-2:00, KRESGE G2

Cecilia Lindgren
University Research Lecturer, Group Head / PI and Fellow
Nuffield Department of Medicine
The University of Oxford
The Broad Institute

Loci Associated with Fat Distribution Have Complex Sex-combined and Sex-specific Effects

Waist-hip ratio (WHR) is a heritable measure of body fat distribution and a significant predictor of metabolic and cardiovascular risk independent of overall obesity, as measured by body mass index (BMI). We performed sex-combined fixed effects inverse variance meta-analysis for BMI-adjusted WHR on 142,762 individuals from 57 genome-wide association studies and 67,325 individuals from 28 studies genotyped on the Metabochip. Given previous reports of sexually dimorphic genetic effects on fat distribution, we also performed male-specific (N=93,482) and female-specific (N=116,742) meta-analyses. The sex-combined analysis identified 39 loci (25 novel) associated with BMI-adjusted WHR at genome-wide significance (P<5×10⁻⁸), and an additional nine female-specific loci. Twelve of the sex-combined loci showed significantly different sex effects, with seven having an effect only in females, four loci with stronger effects in females than males, and one locus with effects only in males. The enrichment of female-specific associations is consistent with much higher estimated heritability of BMI-adjusted WHR in women (h²~46%) compared to men (h²~19%). We used GCTA to perform approximate conditional analysis, estimating the linkage disequilibrium between SNPs using combined GWAS and Metabochip genotype data from 949 Swedish individuals from the PIVUS study. Nine loci harbor multiple signals of association at genome-wide significance. These loci show complex patterns of sexual dimorphism in genetic effects. For example, at the VEGFA locus both association signals are stronger in women than men (1df test of heterogeneity between sex-specific allelic effects and 2.2×10⁻²). In contrast, at the WARS2/SPAG17 locus, we observed one male-specific, one female-specific, and two sex-combined association signals.
Our results highlight the importance of sex-specific analyses, conditional analysis, and fine mapping to fully elucidate the complex genetic architecture and biological mechanisms of body fat distribution.
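
As a rough illustration of the fixed-effects inverse-variance meta-analysis and the 1-df sex-heterogeneity test mentioned in the abstract above, the following Python sketch combines per-study effect estimates. It is not the authors' pipeline: the function names are hypothetical, the heterogeneity test assumes the male and female estimates are independent, and any numbers passed in are made up for illustration.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effects inverse-variance meta-analysis.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for se in ses]           # weight = 1 / variance
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, math.sqrt(1.0 / total)

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def sex_heterogeneity(beta_f, se_f, beta_m, se_m):
    """1-df test of a difference between female and male effects.

    Assumes the two sex-specific estimates are independent.
    """
    z = (beta_f - beta_m) / math.sqrt(se_f ** 2 + se_m ** 2)
    return z, two_sided_p(z)
```

Each sex-specific analysis would call `inverse_variance_meta` on the estimates from that sex only, and `sex_heterogeneity` would then compare the two pooled results.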


Tuesday, November 13, 2012
12:30-2:00, KRESGE G2

Peter Laird
Director, USC Epigenome Center
Professor of Surgery
Professor of Biochemistry & Molecular Biology
Keck School of Medicine
USC / Norris Comprehensive Cancer Center

Exploring the Cancer Methylome

Cancer develops not only as a result of genetic mutations and genomic rearrangements, but also as a consequence of numerous epigenetic alterations, including extensive changes in the distribution of DNA methylation throughout the genome. DNA methylation changes contribute directly to cancer by transcriptional silencing of tumor-suppressor genes through promoter CpG island hypermethylation. Broad epigenomic analysis of human tumors can reveal relationships between large numbers of epigenetic events and can provide insight into the mechanisms underlying concerted epigenetic change. Genomic loci targeted by Polycomb Group Repressors in embryonic stem cells, and involved in cellular differentiation are predisposed to aberrant DNA methylation in cancer cells, suggesting that an epigenetic block to cellular differentiation may sometimes be an initiating event in carcinogenesis. The very strong associations between distinct epigenetic subtypes, such as CpG Island Methylator Phenotypes (CIMP) and specific somatic genetic events, such as BRAF mutation in colorectal cancer and IDH1 mutation in glioblastoma multiforme are consistent with an early role for DNA methylation alterations, providing a favorable cellular context for the subsequent somatic mutation. The analysis of whole methylomes at single-basepair resolution reveals that cancer-associated changes occur differentially across defined regions of the genome associated with the nuclear lamina. It is apparent that epigenomic analysis is essential for a full understanding of the relationship between alterations in the cancer genome and the origin and clinical diversity of individual tumors.


Tuesday, October 16, 2012
12:30-2:00, KRESGE G2

Mark Gerstein
Albert L. Williams Professor of Biomedical Informatics,
Molecular Biophysics & Biochemistry and Computer Science

Human Genome Annotation

My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression. Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the “middle-managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).

http://networks.gersteinlab.org
http://tyna.gersteinlab.org

Architecture of the human regulatory network derived from ENCODE data.
MB Gerstein et al. (2012). Nature 489: 91-100.

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.
KY Yip et al. (2012). Genome Biol 13: R48.

Understanding transcriptional regulation by integrative analysis of transcription factor binding data.
C Cheng et al. (2012). Genome Res 22: 1658-67.

The GENCODE pseudogene resource.
B Pei et al. (2012). Genome Biol 13: R51.

Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.
KK Yan et al. (2010). Proc Natl Acad Sci U S A 107:9186-91.


Tuesday, September 18, 2012
12:30-2:00, KRESGE G2

Dan L. Nicolae
Associate Professor
Departments of Medicine and Statistics
The University of Chicago

Next-generation genetic association studies – some challenges and opportunities

There are many challenges in the transition from genome-wide association studies to whole exome and to whole genome sequencing investigations. We have moved from single-marker tests on common SNPs to set-based inference on all variants in a functional element, and this has led to the development of many statistical tools. I will present a framework for the analysis of sequence data where we harness population genetics theory to provide prior information on effect sizes that allows a general and powerful test for association. I will also discuss some of the challenges in this transition, including the implicit and explicit assumptions on underlying genetic models of risk for a given set, and the interpretation of results.


2011-2012 Seminars


Tuesday, May 1, 2012
12:00-1:30, KRESGE 502

Neil Risch
Director, Institute for Human Genetics
Professor, Division of Biostatistics, Department of Epidemiology and Biostatistics
University of California, San Francisco

Genetic Epidemiology Research on Aging in a Cohort of 100,000

Through an ARRA Grand Opportunity Award, we have recently completed genome-wide genotyping and telomere length analyses on a cohort of over 100,000 individuals who are participants in the Research Program on Genes, Environment and Health at Kaiser Permanente, Northern California. The goal is to link the genomic information to extensive clinical, laboratory, radiologic, pathology and pharmacy information contained in Kaiser databases for extensive genetic-epidemiologic analysis. The cohort has a mean age of 65, with membership spanning over 20 years on average, providing extensive longitudinal health information. Information on environmental and behavioral risk factors has been obtained through geo-coded linkage to state and federal environmental/social databases, and from survey self-report. The cohort is multi-ethnic and contains numerous first, second and third degree relatives, allowing for a variety of genetic epidemiologic analyses on a large scale, to better understand the genetic and environmental contributors to age-related disease and healthy aging.


Tuesday, April 24, 2012
12:30-2:00, FXB G12

Curtis Huttenhower
Assistant Professor of Computational Biology and Bioinformatics
Department of Biostatistics
Harvard School of Public Health

Reducing microbial unemployment: functional roles in the human microbiome

Among many surprising insights, the genomic revolution has helped us to realize that we’re never alone and, in fact, barely human. For most of our lives, we share our bodies with some ten times as many microbes as human cells; these are resident in our gut and on nearly every body surface, and they are responsible for a tremendous diversity of metabolic activity, immunomodulation, and intercellular signaling. In order to understand these microbes’ relationship with their hosts, however, we must establish how homeostasis is maintained in health or dysregulated in disease. I will present an overview of microbial metabolism and function core to the healthy human microbiome and a survey of microbes that cooperate and compete to fulfill these metabolic roles. Since even bacteria within the same “species” regularly carry strikingly different genomes, it is critical to identify community membership at the species or strain level whenever possible. Finally, I will discuss how metabolic function normally present in the gut microbiota is disrupted in inflammatory diseases such as Crohn’s disease and ulcerative colitis.


Tuesday, March 20, 2012
12:30-2:00, FXB G12

John Novembre
Assistant Professor
Department of Ecology and Evolutionary Biology
Interdepartmental Program in Bioinformatics
University of California, Los Angeles

Insights to recombination and rare variants from large-scale human polymorphism data

Population structure can be a key factor in shaping patterns of genetic variation in a sample. Depending on the end goals while analyzing genetic data, it may be a primary focus, a problematic nuisance, or a useful tool that can give insight into another process. This talk will highlight two recent projects that center around the concept of population structure and ancestry inference. The first involves the use of chromosomal-level ancestry as a tool to estimate genome-wide recombination intensity maps. The second addresses how rare variants are distributed geographically in a deep sequencing project based on 202 genes sequenced in >14,000 human individuals. The analysis of the deep resequencing project also raises larger questions about the extent of purifying selection in humans and the importance of population growth for understanding patterns of genetic diversity in large samples.


Tuesday, February 14, 2012
12:30-2:00, Kresge G2

Xihong Lin
Professor of Biostatistics
Department of Biostatistics
Harvard School of Public Health

Design and Analysis of Whole Exome (Genome) Sequencing Association Studies

Sequencing studies, such as targeted, whole exome and whole genome sequencing studies, are increasingly being conducted to identify rare variants that are associated with complex traits. Design and analysis of such population-based sequencing association studies face many challenges. The talk has three parts. I will first provide an overview of several methods for studying rare variant effects, including burden tests, SKAT and optimal unified tests. Analysis pipelines for whole exome sequencing association studies, such as filtering criteria and small sample adjustments of statistical methods, will be discussed. In the second part of the talk, I will discuss designs of sequencing association studies, such as sample size and power calculations, and pros and cons of extreme phenotype sampling and analysis strategies for extreme phenotype sequencing studies. In the last part of the talk, I will discuss the performance of imputation using GWAS data for studying rare variant effects. Simulation studies and real data will be used to illustrate the results.
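
To make the idea of a burden test mentioned above concrete, here is a minimal Python sketch of the collapsing step only: rare variants in a region are summed into one score per individual, which would then be tested against the trait. The function name, the unweighted sum, and the 1% frequency cutoff are assumptions chosen for illustration; they are not taken from the talk, and SKAT and the unified tests are not shown.

```python
def burden_scores(genotypes, mafs, maf_threshold=0.01):
    """Collapse rare variants into one burden score per individual.

    genotypes: one list per individual of minor-allele counts (0, 1 or 2),
               one entry per variant in the region.
    mafs:      minor-allele frequency of each variant.
    maf_threshold: hypothetical cutoff; variants below it count as rare.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    # Unweighted sum of rare minor alleles carried by each individual.
    return [sum(row[j] for j in rare) for row in genotypes]
```

The resulting scores can be used as a single predictor in a regression of the trait on burden, which is what gives the test its single degree of freedom.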


Tuesday, December 6, 2011
12:30-2:00, Kresge G2

Matthew Stephens
Professor, Department of Human Genetics and Department of Statistics
The University of Chicago

A Unified Framework for Association Analysis of Multiple Related Phenotypes

In many ongoing genome-wide association studies, multiple related phenotypes are available for testing for association with genetic variants. In most cases, however, these related phenotypes are analysed independently from one another. For example, several studies have measured multiple lipid-related phenotypes, such as LDL-cholesterol, HDL-cholesterol, and Triglycerides, but in most cases the primary analysis has been a simple univariate scan for each phenotype. This type of univariate analysis fails to make full use of potentially rich phenotypic data.

While this observation is in some sense obvious, much less obvious is the right way to go about examining associations with multiple phenotypes. Common existing approaches include the use of methods such as MANOVA, canonical correlations, or Principal Components Analysis, to identify linear combinations of outcome that are associated with genetic variants. However, if such methods give a significant result, these associations are not always easy to interpret. Indeed the usual approach to explaining observed multivariate associations is to revert to univariate tests, which seems far from ideal.

In this work we outline an approach to dealing with multiple phenotypes based on Bayesian model averaging. The method attempts to identify which subset of phenotypes is associated with a given genotype. In this way it incorporates the null model (no phenotypes associated with genotype); the simple univariate alternative (only one phenotype associated with genotype) and the general alternative (all phenotypes associated with genotype) into a single unified framework. In particular our approach both tests for and explains multivariate associations within a single model, avoiding the need to resort to univariate tests when explaining and interpreting significant multivariate findings. We illustrate the approach on examples, and show how, when combined with multiple phenotype data, the method can improve both power and interpretation of association analyses.


Tuesday, November 15, 2011
12:30-2:00, Kresge G2

Alex Meissner, Ph.D.
Harvard University Department of Stem Cell and Regenerative Biology
Broad Institute

DNA Methylation Dynamics in Stem Cells and Development


Tuesday, October 4, 2011
12:30-2:00, Kresge G2

a pizza lunch will be provided

Daniel J. Schaid, Ph.D.
Professor of Biostatistics, Mayo Clinic

Enhancing Analysis of Genome Wide Association Studies with Gene Ontology and other Structures for Gene Sets

Genome wide association studies (GWAS) measure hundreds of thousands of genetic markers (single nucleotide polymorphisms, SNPs) on large numbers of diseased cases and non-diseased controls, with analyses tending to focus on the association of single SNPs with disease status. Although many GWAS find that the associations of SNPs with disease status tend to be small, with odds ratios 1.25 – 1.5, and that modeling of multiple SNPs has limited ability to accurately predict disease status, the greatest benefit of GWAS might be new leads towards complex genetic pathways. This benefit can be enhanced by using prior information about how genes work together in biological pathways in order to form groups of genes; a group of genes with modest effect might be more powerful than individual genes or single SNPs. This presentation will discuss general strategies for analyzing sets of SNPs, or sets of genes, as well as newly developed computational and statistical methods that use information from the publicly available Gene Ontology (GO). The GO provides standardized representations of gene and gene product attributes across species and databases, and achieves this by a controlled vocabulary. Specifically, GO structures details about genes in a directed acyclic graph, such that specific details are linked to more general details. This provides a natural way to recursively create gene sets, from highly-specific small sets of genes to very general large sets of genes. By mapping SNPs from a GWAS to genes, and then mapping genes to the GO structure, we are able to scan the entire GO structure in terms of gene sets, to seek the set with the most extreme statistic for the association of the gene set with disease. Computational and statistical methods will be emphasized in this presentation, along with applications to several GWAS data sets. Strengths and limitations of our approach will be discussed, as well as future research directions.
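
The recursive construction of gene sets from a directed acyclic graph, as described in the abstract above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy ontology, term names and gene names below are invented, and the speaker's actual method and test statistic are not reproduced here.

```python
from functools import lru_cache

# Hypothetical toy ontology: each term lists its more specific child terms.
CHILDREN = {
    "general": ["mid_a", "mid_b"],
    "mid_a": ["specific"],
    "mid_b": ["specific"],   # a DAG: "specific" has two parents
    "specific": [],
}

# Hypothetical genes annotated directly to each term.
DIRECT_GENES = {
    "general": set(),
    "mid_a": {"GENE1"},
    "mid_b": {"GENE2"},
    "specific": {"GENE3"},
}

@lru_cache(maxsize=None)
def gene_set(term):
    """Genes annotated to a term or to any of its descendants."""
    genes = set(DIRECT_GENES.get(term, set()))
    for child in CHILDREN.get(term, []):
        genes |= gene_set(child)      # recurse into more specific terms
    return frozenset(genes)
```

Caching matters because a term with several parents would otherwise be expanded once per path. A scan over the ontology would compute `gene_set(term)` for every term and evaluate an association statistic on each resulting set.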


Tuesday, September 20, 2011
12:30-2:00, Kresge G2

Steven Watkins, Ph.D.
Senior Visiting Scientist at Harvard School of Public Health
Senior Vice President of Research at Tethys Bioscience

Metabolomics in the Laboratory, the Clinic and the Marketplace: An opinion on the state of the field after a decade of earnest effort.

The technologies for measuring metabolites broadly, rapidly and accurately are mature enough to support both basic research and medical applications. On the basic research side, there are many success stories and metabolomics has been widely adopted into the workflow. On the clinical side, there are still hurdles to overcome and successes with this hugely promising technology have not been as forthcoming. The reasons for this are intertwined with the very nature of metabolomics itself – and center on complexity in the required instrumentation and the interpretation of multi-marker panels. This talk will review our experience with metabolomics (and a little bit of proteomics) at Lipomics and Tethys in approaching basic and clinical studies. Those efforts spawned successful advances in the knowledge of metabolism as well as commercial diagnostic products, but not routine strategies that generalize across applications and fields. Metabolomics is beginning to bear fruit, and remains immensely promising, but from our perspective unfortunately still requires specialized expertise to conduct.

2010-2011 Seminars


Tuesday, May 17, 2011
12:30-2:00, FXB G13

Barbara E. Stranger, Ph.D.
Instructor of Medicine
Division of Genetics, Department of Medicine
Harvard Medical School and Brigham and Women’s Hospital

Genomics of Human Gene Expression

Genetic variation in gene expression has long been studied with the aim to understand the landscape of regulatory variants but also to assist in the interpretation and elucidation of disease signals. We present two projects: (1) Analysis of the genetic basis of genome-wide gene expression patterns in lymphoblastoid cell lines from 726 individuals from 8 populations from the HapMap3 project. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We dissect the specific functional pathways differentiated between populations and highlight patterns of sharing of expression quantitative trait loci (eQTLs) between populations, which are determined by population relatedness and discover significant sharing of eQTL effects between Asians, European-admixed and African subpopulations. (2) To understand the evolutionary and functional consequences of immune-mediated disease susceptibility, we performed a series of distinct, but interrelated large-scale analyses of three different data types: (a) genetic variants reported to be associated with ten different immune-mediated diseases from published genome-wide association studies (GWAS); (b) a genome-wide scan for signatures of positive selection in a population of European ancestry; and (c) an eQTL mapping study in peripheral blood mononuclear cells (PBMCs). Our results suggest that changes in gene expression levels influencing immune-mediated disease have been targets of recent positive selection, perhaps in some cases, due to a selective advantage from protection against infectious disease in the past.


Tuesday, April 12, 2011
12:30-2:00, Kresge G3

Gonçalo Abecasis, D.Phil
Felix E. Moore Collegiate Professor of Biostatistics
Center for Statistical Genetics
Department of Biostatistics
School of Public Health, University of Michigan

Sequencing Thousands of Human Genomes

Identifying and characterizing the genetic variants that affect human traits is one of the central objectives of human genetics. Ultimately, this aim will be achieved by examining the relationship between interesting traits and the whole genome sequences of many individuals. Whole genome re-sequencing of thousands of individuals is not yet feasible, but advances in laboratory methods (for example, to enable the genotyping of thousands of individuals at hundreds of thousands of SNP sites) and in statistical methodology (for example, to enable accurate correction for population stratification and genotype imputation) have resulted in substantial progress in our understanding of complex disease biology. Here, I discuss some of the analytical and study design challenges posed by the first generation of whole genome sequencing studies. These studies will enable the examination of 1,000s of individuals at >15 million polymorphic sites. These studies have been made possible by continuing advances in laboratory technology and statistical methods and should further refine our understanding of complex disease genetics. I illustrate the possibilities both with simulation and with results from ongoing studies.


Tuesday, March 29, 2011
12:30-2:00, Kresge G2

Hongbing Shen, M.D., Ph.D.
Dean of School of Public Health
Professor of Epidemiology
Nanjing Medical University

A Large-Scale Genomewide Association Study of Lung Cancer in Han Chinese: Cumulative Effects of Six Genetic Variants

This is the first large-scale GWAS of lung cancer in Han Chinese. We performed a GWAS scan in 5,430 subjects (2,342 lung cancer cases and 3,088 controls), followed by a two-stage validation among 12,722 subjects (6,313 cases and 6,409 controls). The combined analyses identified 6 well-replicated SNPs with significant associations (P < 5×10⁻⁸) with lung cancer in the genes of TP63 (at 3q28), TERT-CLPTM1L (at 5p15.33), MIPEP-TNFRSF19 (at 13q12.12), and MTMR3-HORMAD2-LIF (at 22q12.2). Among the above 6 SNPs, 4 were newly identified in Chinese. The population attributable risk (PAR) of these six SNPs was 59.1% and could be increased to 77.3% (male: 84.0%, female: 65.6%) after incorporating pack-years of smoking. The results suggest that genetic variations in 3q28, 5p15.33, 13q12.12 and 22q12.2 may cumulatively contribute to the susceptibility of lung cancer in Chinese.


FRIDAY, March 4, 2011
12:30-2:00, FXB G13

Peter M. Visscher, Ph.D.
Senior Principal Research Fellow
Queensland Statistical Genetics Laboratory
Queensland Institute of Medical Research
Brisbane, Australia

Genome-partitioning of Genetic Variation for Complex Traits Using GWAS Data

Common complex disease is caused by a combination of multiple genes and environmental effects. Traditionally the genetics of disease has been studied using concepts that refer to the combined effect of all genes (e.g., heritability or sibling risk), for example by studying the recurrence risk or phenotypic correlation of relatives. Genome-wide association studies (GWAS) facilitate the dissection of heritability into individual locus effects. They have been successful in finding many SNPs associated with complex traits and have greatly increased the number of genes where variation is known to affect the trait. However, GWAS have been criticized for not explaining more of the genetic variation that we know exists in the population, and many hypotheses have been put forward to explain the missing heritability. The most plausible explanations are that (i) causal effects are too small to be detected with statistical significance and (ii) causal variants are not well tagged by the SNPs on the commercial arrays, for example because their heterozygosity is lower than that of genotyped SNPs. The use of all GWAS data simultaneously in an estimation rather than hypothesis testing framework can capture much more variation than in gene discovery approaches, and allows the partitioning of variation across chromosomes and chromosome segments. We show how such whole genome methods can be used to better understand the genetic architecture of complex traits, with applications in height, BMI, schizophrenia and other traits. The results demonstrate that for all traits studied, a substantial proportion of additive genetic variation is tagged by common SNPs and that genetic variation is smeared out over the entire genome. We conclude that these traits are highly polygenic, that variation explained by causal variants is small on average and that GWAS with increasing sample size will discover more variants.
In addition, we show that the same statistical approaches used to estimate genetic variance are relevant in genetic risk prediction for complex traits.


Tuesday, February 8, 2011
12:30-2:00, Kresge G3 – PLEASE NOTE ROOM CHANGE

Eleazar Eskin, Ph.D.
Associate Professor
Computer Science
Human Genetics
University of California, Los Angeles

Statistical Methods for Association Studies with Rare Variants

Sequencing studies have been discovering numerous rare variants, allowing one to test effects of rare variants on disease susceptibility. Recently, several groupwise association tests that group rare variants in genes and detect associations between groups and diseases have been proposed to increase the statistical power of studies on rare variants. One major challenge in these methods is to determine which variants are the actual causal variants in a group, and to overcome this challenge, previous methods used prior information that specifies how likely each variant is causal. Another challenge is how to combine information from multiple rare variants in a gene. Both of these challenges affect the statistical power of these methods in more complicated ways than in traditional association studies. In this talk, I discuss some recent work on measuring the statistical power of rare variant association methods and present some new methods motivated by our observations.


Tuesday, December 14, 2010
12:30-2:00, Kresge G2

Steven R. Kleeberger, Ph.D.
Acting Deputy Director
Environmental Genetics Group
National Institute of Environmental Health Sciences
& National Toxicology Program

Genetic Mechanisms of Susceptibility to Oxidant-induced Lung Disease: New Insights

Environmental oxidants remain a major public health concern in industrialized cities throughout the world. Population and epidemiological studies have associated oxidant air pollutant exposures with morbidity and mortality outcomes, and underscore the important detrimental effects of these pollutants on the lung. Inter-individual variation in human pulmonary responses to air pollutants suggests that some subpopulations are at increased risk to the detrimental effects of pollutant exposure, and it is becoming increasingly clear that genetic background is an important susceptibility factor. We have utilized multiple positional cloning approaches in mice to identify genes that determine differential responsiveness to ozone-induced injury and inflammation, including Tnf, Tlr4, and MHC Class II genes. Integrative genomics approaches in mouse models have led to the identification of additional susceptibility gene candidates including Marco, Nqo1, and Hsp70. Importantly, comparative mapping between the human and other genomes can also yield candidate susceptibility genes. Ongoing association studies in human subjects and tissue specific gene expression profiling in juvenile rhesus macaque monkeys have provided compelling validation of a number of oxidant susceptibility gene candidates. Results from these studies have also been informative for other oxidant-related disorders, including susceptibility to respiratory syncytial virus (RSV) disease. The combined investigations in inbred mice, human subjects, and non-human primates have provided, and will continue to provide, important insight to understanding genetic factors that contribute to differential susceptibility to oxidants.


Tuesday, November 30, 2010
12:30-2:00, Kresge G2

Tianxi Cai, Ph.D.
Associate Professor of Biostatistics
Department of Biostatistics, HSPH

Adaptive Naive Bayes Kernel Machine Approach to Classification with GWAS Data

As genetic studies of human diseases progress, it is becoming increasingly evident that genetics often play a major and complex role in many types of diseases. Therefore, the complexity of the genetic architecture of human health and disease makes it difficult to identify genomic markers associated with disease risk or to construct accurate genetic risk prediction models. Accurate risk assessment is further complicated by the availability of a large number of markers that may be predominately unrelated to the outcome or may explain a relatively small amount of genetic variation. Often, standard prediction models merely rely on additive or marginal relationships between the markers and the phenotype of interest. Marginal association based analysis has limited power in identifying markers truly associated with disease, resulting in a large number of false positives and false negatives. Simple additive modeling does not perform well when the underlying structure of association involves interactions and other non-linear effects. Additionally, these methods do not make use of information that may be available regarding genetic pathways or gene structure. We propose a multi-stage method relating possibly predictive markers to the risk of disease by first forming multiple gene-sets based on certain biological criteria. By imposing a naive Bayes kernel machine (KM) model, we estimate gene-set specific risk models that relate information from each gene-set to the outcome. In the second stage, we aggregate information across all gene-sets by adaptively estimating the weights for each gene-set via a regularization procedure. The KM framework efficiently models the potentially non-linear effects of predictors without specifying a particular functional form.
Estimation and predictive accuracy are further improved with kernel PCA approximation to reduce the degrees of freedom in the first stage and with adaptive regularization in the second stage to remove non-informative regions from the final prediction model. Prediction accuracy is assessed with bias-corrected ROC curves and AUC statistics. Numerical studies suggest that the model performs well in the presence of non-informative regions and both linear and non-linear effects.


Tuesday, October 26, 2010
12:30-2:00, Kresge G2

Andrea Baccarelli, M.D., Ph.D.
Associate Professor of Environmental Epigenetics
Department of Environmental Health, HSPH

Epigenetic Modifications Induced by Environmental Pollutants: Results from Human Studies

Epigenetics investigates heritable changes in gene expression that occur without changes in DNA sequence. Several epigenetic mechanisms, including DNA methylation and histone modifications, can change genome function under exogenous influence. Results obtained from animal models indicate that in utero or early-life environmental exposures produce effects that can be inherited transgenerationally and are accompanied by epigenetic alterations. The search for human equivalents of the epigenetic mechanisms identified in animal models is in progress. I will present evidence from human studies of individuals exposed to air pollution and metals indicating that epigenetic alterations mediate effects caused by exposure to environmental toxicants. In these investigations we have shown that environmental toxicants cause altered methylation of human repetitive elements or genes. Some exposures can alter epigenetic states and the same and/or similar epigenetic alterations can be found in patients with the disease of concern. In recent preliminary studies, we have shown alterations of histone modifications in subjects exposed to metal-rich airborne particles. I will present original data demonstrating that altered DNA methylation in blood and other tissues is associated with environmentally-induced disease, such as cardiovascular disease and asthma. On the basis of current evidence, I will propose possible models for the interplay between environmental exposures and the human epigenome.


Tuesday, September 28, 2010
12:30-2:00, Kresge G2

Tyler VanderWeele, Ph.D.
Associate Professor of Epidemiology
Departments of Epidemiology & Biostatistics, HSPH

Genetic Variants on 15q25.1, Smoking and Lung Cancer: an Assessment of Mediation and Interaction

Genome-wide association studies have identified variants on chromosome 15q25.1 that increase the risk of both lung cancer and nicotine dependence and associated smoking behavior. However, there remains debate as to whether the effect on lung cancer is direct or operates through pathways related to smoking behavior. Of the three studies that initially reported the association between the variants and lung cancer, two suggested that the association was direct and one that it was primarily through nicotine dependence. Thorgeirsson and co-authors note also a third possible explanation of the associations: that the variant may increase individuals’ vulnerability to the harmful effect of tobacco smoke, a form of gene-environment interaction. In order to determine the extent to which variants on the 15q25.1 region affect lung cancer through nicotine dependence and associated smoking behavior or through other pathways, we applied novel methodology for mediation analysis to a case-control study for lung cancer of 1836 cases and 1452 controls. For two SNPs, rs8034191 and rs1051730, on 15q25.1, we estimated the indirect effect mediated by smoking (cigarettes per day), the direct effect through other pathways and the overall proportion mediated. Analyses allowed for the possibility that the effect of smoking varied by groups defined by the genetic variant. The effect of the variants on lung cancer mediated through smoking appears to be smaller than the independent effect through other pathways.

2009-2010 Seminars


Tuesday, April 6, 2010
12:30-2:00, Kresge G2

Nick Patterson, Ph.D.
Senior Computational Biologist
The Broad Institute

Admixture Graphs: Learning Genetic History

We describe a new methodology, useful for the analysis of genetic data, that generalizes phylogenetic trees. The new methods raise a number of statistical and mathematical issues, and yield some surprising insights into ancient human history. We will give examples from Eurasia, India, and South America.


Tuesday, March 2, 2010
12:30-2:00, Kresge G2

Hakon Hakonarson, M.D., Ph.D.
Director
Center for Applied Genomics
The Children’s Hospital of Philadelphia

Genetics of Complex Pediatric Disorders: Novel Analytical Approaches

Genome wide association studies have delivered on the promise of uncovering genetic determinants of complex disease, using high-throughput methods allowing large volumes of SNPs (10⁵-10⁶) to be genotyped in large cohort studies. The GWA approach serves the critical need for a comprehensive and unbiased strategy to identify causal genes related to complex disease and is rapidly replacing the more traditional candidate gene studies and microsatellite-based linkage mapping approaches that have dominated the gene discovery attempts for common diseases in previous years. As a consequence of employing this array-based technology over the last three years, dramatic discoveries of key variants involved in multiple complex diseases and related traits have been reported in the top scientific literature, including over 2000 novel loci with multiple replications in over 100 disease areas by independent groups. In this talk, discoveries will be reviewed, and large-scale database efforts will be discussed along with their use in complex genetic disorders and genomics of drug response. Novel analytical approaches will also be presented addressing pathway-based analyses and tagging of rare variants that may account for some of the missing heritability in GWAS.


Tuesday, February 2, 2010
12:30-2:00, Kresge G2

Shamil Sunyaev, Ph.D.
Assistant Professor of Medicine
Harvard Medical School
Brigham & Women’s Hospital

“Human Deleterious Mutations: Evolution, Function and Disease”

About one hundred new mutations occur in the genome of the average human. We can learn about the origin, evolutionary fate, functional and phenotypic effects of human mutations from the rapidly increasing DNA sequencing data. Evolutionary analysis of human deleterious mutations suggests strategies for identifying genes involved in human complex diseases. Computational and statistical approaches based on evolutionary and structural considerations can assist in genetic diagnostics of human monogenic or oligogenic diseases, as was tested in the case of cardiomyopathy.


Tuesday, December 1, 2009
12:00-1:30, FXB G13

David Reich, Ph.D.
Associate Professor
Harvard Medical School, Department of Genetics
Associate Member, Broad Institute

“Reconstructing Indian Population History and Implications for South Asian Gene Discovery”

India has been underrepresented in genome-wide surveys of human variation. Here I describe an analysis of patterns of variation in 25 diverse Indian groups to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most groups today. One, the “Ancestral North Indians” (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, while the other, the “Ancestral South Indians” (ASI), is not close to any group outside the subcontinent. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71%, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the Andamanese are an ASI-related group without ANI-related ancestry, showing that the peopling of the islands must have occurred before ANI-ASI gene flow on the mainland. Allele frequency differences between groups in India are larger than in Europe, which we show reflects strong founder effects whose genetic signatures have been maintained for thousands of years due to endogamy. There are two key medical implications. First, our observations predict that there will be an excess of recessive diseases in India, different in each group, whose risk variants should be easy to identify using standard genetic methods. Second, the genetic risk factors that are only present in the ASI will be very difficult to discover without building specific genetic variation resources for South Asians.


Tuesday, November 3, 2009
12:00-1:30, FXB G13

Giovanni Parmigiani, Ph.D.
Chair, Department of Biostatistics & Computational Biology, DFCI
Professor, Department of Biostatistics, HSPH

“Cross-study Differential Gene Expression”

In this lecture I will present statistical issues associated with combining microarray data across studies. I will focus on the role of hierarchical Bayesian models in constructing useful rules to shrink across both genes and studies, and to classify genes based on the patterns of concordance across studies. I will describe in detail a model we call XDE, and evaluate its performance in a comprehensive fashion, using both artificial data, and a “split sample” validation approach that provides an agnostic assessment of the model’s behavior not only under the null hypothesis but also under realistic alternatives. Compared to a more direct combination of t- or SAM-statistics, the 1 – AUC values for the Bayesian model are roughly half of the corresponding values for the t- and SAM-statistics. In small studies, XDE generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. Finally, I will illustrate our model using four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.

A technical report can be obtained from: http://www.bepress.com/jhubiostat/paper158/


Tuesday, October 6, 2009
12:00-1:30, FXB G13

Peter Park, Ph.D.
Assistant Professor of Pediatrics
HMS Center for Biomedical Informatics
Children’s Hospital Informatics Program

“ChIP-sequencing: Data Analysis and Applications”

ChIP-seq combines chromatin immunoprecipitation (ChIP) with next-generation sequencing to identify protein-DNA interactions on a genome-wide scale. After a brief introduction to next-generation sequencing, a number of practical issues in analysis of ChIP-seq data will be discussed, including experimental design, detection of binding sites, and determination of whether a sufficient depth of sequencing has been achieved. Application of ChIP-seq to the study of X-chromosome dosage compensation in Drosophila and nucleosome positioning will be described. If time allows, updates from the Cancer Genome Atlas and the model organism ENCODE projects will be given.


Tuesday, September 29, 2009
12:00-1:30, FXB G13

Mitchell Gail, M.D., Ph.D.
Senior Investigator
Biostatistics Branch
Division of Cancer Epidemiology and Genetics
National Cancer Institute

“The value of adding single nucleotide polymorphism data to a model that predicts breast cancer risk”

Seven single nucleotide polymorphisms (SNPs) have recently been confirmed to be associated with breast cancer. I assessed the value of adding these SNPs to the Breast Cancer Risk Assessment Tool (BCRAT), which is based on ages at menarche and at first live birth, family history of breast cancer, and history of breast biopsy examinations. The model with these SNPs (BCRATplus7) had an area under the receiver operating characteristic curve (AUC) of 0.632, compared to 0.607 for BCRAT. This improvement is less than from adding mammographic density to BCRAT. I also assessed how much BCRATplus7 reduced expected losses in deciding whether a woman should take tamoxifen to prevent breast cancer and in deciding whether a woman should have a mammogram. In addition, I examined whether BCRATplus7 was more effective than BCRAT in allocating a scarce public health resource, such as access to mammography, based on ranking women on their breast cancer risk and allocating the resource to those at highest risk. In none of these applications did BCRATplus7 perform substantially better than BCRAT. A cross-classification of risk by the two models indicated that some women would change risk categories, depending on the risk threshold, if BCRATplus7 were used instead of BCRAT, but it is not known if BCRATplus7 is well calibrated. These results were hardly changed if three additional very recently identified SNPs were added. I conclude that the available SNPs do not improve the performance of models to estimate breast cancer risk enough to warrant their use outside the research setting.
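
For readers unfamiliar with the AUC values quoted in this abstract, the statistic can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below computes it that way for illustration only: the `auc` helper and the risk scores are hypothetical and are not taken from BCRAT or BCRATplus7.

```python
def auc(case_scores, control_scores):
    """Share of case-control pairs in which the case has the higher score.

    Tied pairs count as one half, so an uninformative model scores 0.5.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks for three cases and three controls.
cases = [0.30, 0.20, 0.10]
controls = [0.25, 0.10, 0.05]
print(round(auc(cases, controls), 3))
```

On this scale 0.5 means no discrimination and 1.0 means perfect separation, which helps put the difference between 0.607 and 0.632 in context.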