Department of Biostatistics
Big Data Seminar

2014-2015

Organizers: Marieke Kuijjer and Godwin Yung

Schedule: Mondays, 12:30-2:00 p.m.
HSPH2, Room 426 (unless otherwise notified)

Seminar Description
This working seminar focuses on statistical and computational methods for analyzing big data. Big data arise from a wide range of studies in health science research, such as genetics and genomics, environmental health research, comparative effectiveness research, electronic medical records, neuroscience, and social networks. We discuss recent developments in statistical and computational methodology for analyzing big data, as well as health science applications in which big data arise. The goal of this seminar is to exchange ideas and stimulate more quantitative research in this challenging and important area.


October 6

John Quackenbush, Ph.D.
Professor of Computational Biology and Bioinformatics, Harvard School of Public Health / Dana-Farber Cancer Institute

"Taming the Big Data Dragon"
ABSTRACT: Nearly every major scientific revolution in history has been driven by one thing: data. Today, the availability of Big Data from a wide variety of sources is transforming health and biomedical research into an information science, where discovery is driven by our ability to effectively collect, manage, analyze, and interpret data. New technologies are providing abundance levels for thousands of proteins, population levels for thousands of microbial species, expression measures for tens of thousands of genes, information on patterns of genetic variation at millions of locations across the genome, and quantitative imaging data—all on the same biological sample. These omic data can be linked to vast quantities of clinical metadata, allowing us to search for complex patterns that correlate with meaningful health and medical endpoints. Environmental sampling and satellite data can be cross-referenced with health claims information and Internet searches to provide insights into the impact of atmospheric pollution on human health. Anonymized data from cell-phone records and text messages can be tied to health outcomes data, helping us explore disease transmission networks. Realizing the full potential of Big Data will require that we develop new analytical methods to address a number of fundamental issues and that we develop new ways of integrating, comparing, and synthesizing information to leverage the volume, variety, and velocity of Big Data. Using experiences derived from our work, I will present some examples that highlight the challenges and opportunities that present themselves in today's data-rich environment.

October 20

Ray Liu, Ph.D.
Head of the Analytical Innovation and Consultation Group, Takeda Pharmaceutical Company, Japan

"Quantitative Considerations on Big Data for Pharmaceutical R&D"
ABSTRACT: With the advancement of technology platforms, researchers in pharmaceutical R&D have in recent years encountered data that are large in both size and dimension. In addition, data are increasingly generated from novel sources. These bigger and more diverse data pose special challenges and opportunities for pharmaceutical statisticians. For example, how can we extract information from the unstructured text data that appear in medical records? How can we integrate information from various data sources to generate novel knowledge and increase the chance of regulatory approval? How can we identify associations between clinical outcomes and genomic markers in high-dimensional data with small sample sizes? In this talk we will share hands-on experience gained from handling data in such situations, focusing on applying quantitative thinking to impact pharmaceutical R&D.

November 3

Jennifer Listgarten, Ph.D.
Researcher, Microsoft Research New England

"Efficient and Powerful Methods for Genome and Epigenome-Wide Association Studies"
ABSTRACT: Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Genome-wide association studies, in which individual genetic markers or sets of markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be found by applying a simple statistical test. However, a wide variety of confounders lie hidden in the data, leading to both spurious and missed associations if not properly addressed. These confounders include population structure, family relatedness, cell type heterogeneity, and environmental factors. I will discuss state-of-the-art approaches (based on linear mixed models) for conducting these analyses, in which the confounders are automatically deduced from the data and then corrected for by the model.
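
To make the linear mixed model idea concrete, here is a minimal sketch in Python on simulated data. Everything in it (the simulated genotypes, the fixed 0.5/0.5 variance components, and all variable names) is an assumption for illustration; this is not the speaker's implementation, and production LMM tools estimate the variance components from the data by restricted maximum likelihood rather than fixing them.

```python
# Minimal sketch of an LMM association scan on simulated data.
# The kinship matrix K plays the role of the hidden confounders
# (population structure, relatedness) deduced from the data itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 200, 500                            # individuals, SNPs
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Gs = (G - G.mean(axis=0)) / G.std(axis=0)  # standardized genotypes
K = Gs @ Gs.T / m                          # empirical kinship matrix
y = Gs[:, 0] + rng.normal(size=n)          # phenotype with signal at SNP 0

# Fix the variance components (an assumption; normally estimated by REML)
# and whiten the model so generalized least squares reduces to OLS.
Sigma = 0.5 * K + 0.5 * np.eye(n)
L = np.linalg.cholesky(Sigma)
y_w = np.linalg.solve(L, y)
X_w = np.linalg.solve(L, Gs)
ones_w = np.linalg.solve(L, np.ones(n))

for j in range(3):                         # scan the first few SNPs
    X = np.column_stack([ones_w, X_w[:, j]])
    beta, ssr, *_ = np.linalg.lstsq(X, y_w, rcond=None)
    dof = n - X.shape[1]
    se = np.sqrt((ssr[0] / dof) * np.linalg.inv(X.T @ X)[1, 1])
    t_stat = beta[1] / se
    p_val = 2 * stats.t.sf(abs(t_stat), dof)
    print(f"SNP {j}: beta = {beta[1]:.3f}, p = {p_val:.2e}")
```

Whitening by the Cholesky factor of the assumed covariance is what lets the correlated random effect be "corrected for" with an ordinary per-SNP regression.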

December 1

Denis Agniel, Ph.D.
Research Associate in Biomedical Informatics, Center for Biomedical Informatics at Countway, Harvard Medical School / Center for Population Studies, Harvard School of Public Health

"Genome-wide Association Studies for Longitudinal Outcomes"
ABSTRACT: Genome-wide association studies (GWAS) can be important tools for exploring susceptibility to the acquisition and progression of complex diseases. The bulk of GWAS findings, however, have not specifically been related to disease progression as monitored by longitudinally measured biological markers. Such longitudinal measures allow researchers to more clearly characterize clinical outcomes that cannot be captured in a few measurements. To this end, we seek to provide a framework in which to test for association between genetic markers and a longitudinal outcome. The most common approach to analyzing longitudinal data in this context is the linear mixed effects model, which may be overly simplistic for a complexly varying outcome. On the other hand, existing non-parametric methods may suffer from low power due to high degrees of freedom (DF) and may be computationally infeasible at the genome-wide scale. We propose a functional principal variance component (FPVC) testing framework that captures the nonlinearity in the data with potentially low DF and is fast enough to be carried out genome-wide. FPVC testing unfolds in two stages. In the first stage, we summarize the longitudinal outcome according to its major patterns of variation via functional principal components analysis (FPCA). In the second stage, we employ a simple working model and variance component testing to examine the association between the summary of each patient's outcome and a set of genetic markers. Simulation results indicate that FPVC testing can offer large power gains over the standard linear mixed effects model.
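
A rough sketch of the two-stage idea on toy data (not the authors' code): stage 1 summarizes each subject's trajectory by functional principal component scores; stage 2 tests a SNP set against each score with a variance-component-style statistic. The simulation settings and names below are assumptions, and a permutation reference stands in for the mixture-of-chi-squares null a real implementation would use.

```python
# Toy two-stage FPVC-style analysis on simulated dense trajectories.
import numpy as np

rng = np.random.default_rng(1)
n, T, m = 150, 20, 10                    # subjects, time points, SNPs in set
G = rng.binomial(2, 0.4, size=(n, m)).astype(float)
tgrid = np.linspace(0.0, 1.0, T)
amp = 0.5 * G[:, 0] + rng.normal(size=n) # genotype shifts trajectory amplitude
Y = amp[:, None] * np.sin(2 * np.pi * tgrid) + rng.normal(0.0, 0.3, (n, T))

# Stage 1: FPCA via eigendecomposition of the sample covariance on the grid.
Yc = Y - Y.mean(axis=0)
evals, evecs = np.linalg.eigh(Yc.T @ Yc / n)
phi = evecs[:, ::-1][:, :2]              # top two eigenfunctions
scores = Yc @ phi                        # per-subject FPC scores (n x 2)

# Stage 2: variance-component score statistic Q = s' G G' s per FPC score,
# with a permutation null for illustration.
Gc = G - G.mean(axis=0)
for k in range(scores.shape[1]):
    s = scores[:, k] - scores[:, k].mean()
    Q = s @ Gc @ Gc.T @ s / (s @ s)
    perm = np.empty(500)
    for b in range(perm.size):
        p = rng.permutation(s)
        perm[b] = p @ Gc @ Gc.T @ p / (p @ p)
    p_val = (1 + np.sum(perm >= Q)) / (1 + perm.size)
    print(f"FPC {k + 1}: Q = {Q:.1f}, permutation p = {p_val:.3f}")
```

The point of stage 1 is dimension reduction: a 20-point trajectory collapses to two scores, so stage 2 spends few degrees of freedom per test.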

December 15

Nicholas Horton, Sc.D.
Professor, Department of Mathematics and Statistics, Amherst College

"Building Precursors to the Analysis of Big Data: Guidelines for Undergraduate Programs in Statistics"
ABSTRACT: In a world of increasingly complex and sophisticated data, there is growing demand for graduates who can extract actionable information from such data. They need to be able to "think with data" and undertake computation in a nimble fashion. Undergraduates need practice in using all steps of the scientific method to tackle real research questions. The statistical analysis process involves formulating good questions, considering whether available data are appropriate for addressing a problem, choosing from a set of different tools, undertaking the analyses in a reproducible manner, assessing the analytic methods, drawing appropriate conclusions, and communicating results. To address this new challenge and opportunity, the American Statistical Association recently updated its guidelines for undergraduate programs in statistics. The new guidelines describe how these programs should emphasize concepts and approaches for working with complex data and provide experience in designing studies and analyzing real data. In this talk, I will provide background on the increasing number of undergraduate statistics majors and minors and discuss recommendations for curriculum revisions to help prepare these students to use data to make evidence-based decisions.

February 9 (Kresge G2)

Andreas Matern
Vice President, Disruptive Innovation at Thomson Reuters

"Talk Title TBD"
ABSTRACT: None Given

February 23 (Kresge G2)

JP Onnela, Ph.D.
Assistant Professor, Department of Biostatistics, Harvard School of Public Health

"Talk Title TBD"
ABSTRACT: None Given

March 9

Peter J. Park, Ph.D.
Associate Professor, Harvard Medical School Center for Biomedical Informatics

"Structural Alterations in Cancer Genomes (how we analyzed 100TB of data and lived to tell about it)"
ABSTRACT: It is now possible to generate whole-genome sequencing data for a patient at an affordable cost, and the amount of publicly available data continues to grow rapidly. I will give an overview of the computational methods we use to identify various structural alterations in cancer genomes. I will also describe the challenges associated with large-scale data management and analysis: at ~150 GB of raw data per patient, some of our analyses involve more than 100 TB of raw data. Finally, I will share my thoughts on the role of statisticians in these genomics projects.

March 30 (Kresge G2)

Paul McDonagh, Ph.D.
Director, Computational Biology, Biogen Idec

"Talk Title TBD"
ABSTRACT: None Given

April 13 (Kresge G2)

Michelle Girvan, Ph.D.
Associate Professor, Department of Physics and the Institute for Physical Science and Technology (IPST), University of Maryland

"Talk Title TBD"
ABSTRACT: None Given

April 27 (Kresge G2)

Miguel Hernán, M.D., M.P.H., Dr.P.H.
Professor of Epidemiology, Departments of Epidemiology and Biostatistics, Harvard School of Public Health

"Talk Title TBD"
ABSTRACT: None Given

May 11 (Kresge G2)

Jeffrey Leek, Ph.D.
Associate Professor, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

"Talk Title TBD"
ABSTRACT: None Given


