Department of Biostatistics
Big Data Seminar
2014 - 2015
ABSTRACT: Nearly every major scientific revolution in history has been driven by one thing: data. Today, the availability of Big Data from a wide variety of sources is transforming health and biomedical research into an information science, where discovery is driven by our ability to effectively collect, manage, analyze, and interpret data. New technologies are providing abundance levels for thousands of proteins, population levels for thousands of microbial species, expression measures for tens of thousands of genes, information on patterns of genetic variation at millions of locations across the genome, and quantitative imaging data, all on the same biological sample. These omic data can be linked to vast quantities of clinical metadata, allowing us to search for complex patterns that correlate with meaningful health and medical endpoints. Environmental sampling and satellite data can be cross-referenced with health claims information and Internet searches to provide insights into the impact of atmospheric pollution on human health. Anonymized data from cell-phone records and text messages can be tied to health outcomes data, helping us explore disease transmission networks. Realizing the full potential of Big Data will require that we develop new analytical methods to address a number of fundamental issues and that we develop new ways of integrating, comparing, and synthesizing information to leverage the volume, variety, and velocity of Big Data. Using experiences derived from our work, I will present some examples that highlight the challenges and opportunities that present themselves in today's data-rich environment.
ABSTRACT: With the advancement in technology platforms, researchers in pharmaceutical R&D have encountered data of increasing size and dimension in recent years. In addition, data are increasingly generated from novel sources. These bigger and more diverse data pose special challenges and opportunities for pharmaceutical statisticians. For example, how can we extract information from the unstructured text data that appear in medical records? How can we integrate information from various data sources to generate novel knowledge and increase the chance of regulatory approval? How can we identify associations between clinical outcomes and genomic markers in high-dimensional data with small sample sizes? In this talk we will share hands-on experience gained from handling data in such situations, focusing on applying quantitative thinking to impact pharmaceutical R&D.
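As an illustration of the last question, one common approach (offered here as an example, not necessarily the speaker's method) is penalized regression such as the lasso, which can screen many genomic markers for association with a clinical outcome even when markers far outnumber subjects. A minimal sketch, assuming a continuous outcome y and a marker matrix X, follows.

# Minimal, illustrative sketch: lasso screening of genomic markers against a
# clinical outcome when the number of markers far exceeds the sample size.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 80, 5000                               # small sample, many markers (assumed sizes)
X = rng.standard_normal((n, p))               # stand-in for a genotype/expression matrix
y = 0.8 * X[:, 0] + rng.standard_normal(n)    # outcome driven by a single marker

model = LassoCV(cv=5).fit(X, y)               # cross-validation chooses the penalty
selected = np.flatnonzero(model.coef_)
print("markers retained by the lasso:", selected)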
ABSTRACT: Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Genome-wide association analyses, in which individual genetic markers or sets of markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be found by use of a simple statistical test. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. These confounders include population structure, family relatedness, cell type heterogeneity, and environmental confounders. I will discuss the state-of-the-art approaches (based on linear mixed models) for conducting these analyses, in which the confounders are automatically deduced, and then corrected for, by the data and model.
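To make the linear mixed model idea concrete, the sketch below (an illustration, not code from the talk) tests one marker as a fixed effect while a random effect with covariance proportional to a genetic similarity (kinship) matrix K absorbs confounding from relatedness and population structure; the variance components are assumed to have been estimated already, for example by REML under the null model.

# Illustrative sketch: generalized least squares test of a single marker under
# the linear mixed model y = mu + g*beta + u + e, with u ~ N(0, sg2*K).
import numpy as np

def lmm_marker_test(y, g, K, sg2, se2):
    # Returns the marker effect estimate and a z-like statistic, taking the
    # variance components sg2 (genetic) and se2 (residual) as given.
    n = len(y)
    V = sg2 * K + se2 * np.eye(n)              # phenotypic covariance
    Vinv = np.linalg.inv(V)
    X = np.column_stack([np.ones(n), g])       # intercept + marker genotype
    XtVinv = X.T @ Vinv
    cov_beta = np.linalg.inv(XtVinv @ X)
    beta = cov_beta @ (XtVinv @ y)
    return beta[1], beta[1] / np.sqrt(cov_beta[1, 1])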
ABSTRACT: Genome-wide association studies (GWAS) can be important tools to explore susceptibility to acquisition and progression of complex diseases. The bulk of GWAS findings have not specifically been related to progression of disease monitored by longitudinally measured biological markers. Such longitudinal measures allow researchers to more clearly characterize clinical outcomes that cannot be captured in a few measurements. To this end, we seek to provide a framework in which to test for the association between genetic markers and a longitudinal outcome. The most common approach to analyzing longitudinal data in this context is linear mixed effects models, which may be overly simplistic for a complexly varying outcome. On the other hand, existing non-parametric methods may suffer from low power due to high degrees of freedom (DF) and may be computationally infeasible at the genome-wide scale. We propose a functional principal variance component (FPVC) testing framework which captures the nonlinearity in the data with potentially low DF and is fast enough to carry out for genome-wide studies. The FPVC testing unfolds in two stages. In the first stage, we summarize the longitudinal outcome according to its major patterns of variation via functional principal components analysis (FPCA). In the second stage, we employ a simple working model and variance component testing to examine the association between the summary of each patient's outcome and a set of genetic markers. Simulation results indicate that FPVC testing can offer large power gains over the standard linear mixed effects model.
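The two-stage structure can be sketched in a few lines (a simplified illustration under assumptions of my own, not the authors' implementation): stage one summarizes each subject's trajectory, observed here on a common time grid, by its leading functional principal component scores; stage two links those scores to a set of markers through a variance-component (score-type) statistic.

# Simplified two-stage FPVC sketch; Y is subjects x time points, G is subjects x markers.
import numpy as np

def fpca_scores(Y, n_components=2):
    # Stage 1: leading FPC scores summarizing each subject's trajectory.
    Yc = Y - Y.mean(axis=0)                    # center curves pointwise
    cov = Yc.T @ Yc / (Y.shape[0] - 1)         # empirical covariance function
    _, vecs = np.linalg.eigh(cov)              # eigenfunctions in ascending order
    phi = vecs[:, ::-1][:, :n_components]      # keep the leading components
    return Yc @ phi                            # subject-level scores

def variance_component_stat(scores, G):
    # Stage 2: score-type statistic linking FPC scores to the marker set G,
    # using an intercept-only working model for each score.
    resid = scores - scores.mean(axis=0)
    return sum(resid[:, j] @ G @ G.T @ resid[:, j] for j in range(resid.shape[1]))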
ABSTRACT: In a world of increasingly complex and sophisticated data, there is additional demand for graduates who are able to extract actionable information from those data. They need to be able to "think with data" and undertake computation in a nimble fashion. Undergraduates need practice in utilizing all steps of the scientific method to tackle real research questions. The statistical analysis process involves formulating good questions, considering whether available data are appropriate for addressing a problem, choosing from a set of different tools, undertaking the analyses in a reproducible manner, assessing the analytic methods, drawing appropriate conclusions, and communicating results. To address this new challenge and opportunity, the American Statistical Association recently updated its guidelines for undergraduate programs in statistics. The new guidelines describe how undergraduate programs in statistics should emphasize concepts and approaches for working with complex data and provide experience in designing studies and analyzing real data. In this talk, I will provide background on the increasing number of undergraduate statistics majors and minors and discuss recommendations for curriculum revisions to help prepare these students to use data to make evidence-based decisions.
ABSTRACT: None Given
ABSTRACT: Cell phones are now ubiquitous: it is estimated that in 2015 the number of phones in use exceeds the size of the global population. I will talk about two lines of work in our lab. First, our recent and ongoing work uses call detail records (CDRs) to investigate the structure of large-scale social networks and their relationship to underlying geography. Second, a parallel approach, one that generates even richer data, is based on instrumenting patients with a customized, scientific smartphone application. This moment-by-moment monitoring of behavior, or digital phenotyping, enables the collection of both "active data" (surveys, voice samples, etc.) and "passive data" (location, social engagement, etc.). I will talk about some of our work that utilizes these two approaches, what types of insights they may yield for the study of social networks and human behavior, and how this work interfaces with public health. Finally, I will highlight some results from a recent pilot study that involved psychiatric outpatients.
ABSTRACT: None Given
ABSTRACT: It is now possible to generate whole-genome sequencing data for a patient at an affordable cost, and the amount of publicly available data continues to grow rapidly. I will give an overview of the computational methods we use to identify various structural alterations in cancer genomes. I will also describe the challenges associated with large-scale data management and analysis: at ~150 GB of raw data per patient, some of our analyses involve >100 TB of raw data. I will also share my thoughts on the role of statisticians in these genomics projects.
ABSTRACT: The complex process of genetic control relies upon an elaborate network of interactions between genes. Our goal is to combine simple mathematical models with empirical data to understand the role of network structure in gene regulation. Our modeling efforts focus primarily on Boolean systems, which have received extensive attention as useful models for genetic control. An important aspect of Boolean network models is the stability of their dynamics in response to small perturbations. Previous approaches to studying stability have generally assumed uncorrelated random network structure, even though real gene networks typically have nontrivial topology significantly different from the random network paradigm. To address such situations, we present a general method for determining the stability of large gene networks, given some specified network topology. Additionally, we generalize our framework to handle a variety of more biologically realistic update rules, including non-synchronous update and non-Boolean models, in which there are more than two possible gene states. We discuss the application of our modeling approach to experimentally inferred gene networks, and explore the role of dynamical instability in both the evolution of gene networks and the occurrence of some cancers.
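As a concrete toy version of the stability question (an example of my own, not the speaker's model), one can build a random Boolean network, flip a single gene in one of two identical initial states, and watch whether the perturbation dies out or spreads under synchronous update; the sketch below does exactly that.

# Toy stability experiment for a random Boolean network with k inputs per gene.
import numpy as np

rng = np.random.default_rng(1)
n_genes, k = 50, 2
inputs = np.array([rng.choice(n_genes, size=k, replace=False) for _ in range(n_genes)])
tables = rng.integers(0, 2, size=(n_genes, 2 ** k))          # random Boolean update rules

def step(state):
    idx = (state[inputs] * (2 ** np.arange(k))).sum(axis=1)  # encode each gene's inputs
    return tables[np.arange(n_genes), idx]                   # synchronous update

x = rng.integers(0, 2, n_genes)
y = x.copy()
y[0] ^= 1                                                    # perturb a single gene
for _ in range(20):
    x, y = step(x), step(y)
print("Hamming distance after 20 steps:", int((x != y).sum()))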
ABSTRACT: The last decade or so has seen the rise of big data, attracting interest from industry and academia alike. In contrast to the popular but somewhat unactionable notion of the Vs (volume, velocity, variety, ...), I will discuss what concrete results we can expect from the data using three characterizations: (1) predictive vs. descriptive; (2) probabilistic vs. deterministic; (3) multi-dimensional vs. single-dimensional. Through examples drawn from various fields, we will see how this understanding sheds light on the real value of big data.
Bio: Dr. Lei Ding leads Consumer Data Science efforts at PayPal. In this role, he is responsible for the design, architecture, development, and quality of systems that operationalize data science at scale. Previously, he led Data Science at Intent Media, an advertising startup headquartered in New York. He contributed significantly to online predictive systems that served millions of visitors each day on a number of premium e-commerce websites. Earlier in his career, he conducted applied R&D at several leading research labs, including IBM Watson Lab. He received his PhD from Ohio State and studied management at Stanford.
ABSTRACT: When we cannot conduct randomized experiments, we analyze observational data. Causal inference from large observational databases (Big Data) is an attempt to emulate a randomized experiment—the target experiment or target trial—that would answer the question of interest. Therefore analyses of observational data need to be evaluated with respect to how well they emulate a particular target trial.
This talk outlines a framework for comparative effectiveness research using Big Data that makes the target trial explicit. Our framework channels the existing counterfactual theory for comparing the effects of sustained interventions, organizes analytic approaches dispersed throughout the literature, provides a structured process for the criticism of observational studies, and helps avoid common methodologic pitfalls.
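One analytic choice that often appears when a point-intervention target trial is emulated, offered here purely as an illustration rather than as the framework described in the talk, is to restrict to individuals who meet the trial's eligibility criteria at "time zero" and then adjust for measured baseline confounders with inverse probability weighting. A minimal sketch, assuming a pandas DataFrame df of eligible individuals with columns treated, outcome, and baseline confounders x1 and x2, follows.

# Hypothetical sketch: IP-weighted risk difference for an emulated point-intervention trial.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_risk_difference(df, confounders=("x1", "x2")):
    # Propensity score: probability of treatment given baseline confounders.
    ps = LogisticRegression().fit(df[list(confounders)], df["treated"]).predict_proba(
        df[list(confounders)])[:, 1]
    w = np.where(df["treated"] == 1, 1.0 / ps, 1.0 / (1.0 - ps))   # IP weights
    treated = df["treated"] == 1
    risk1 = np.average(df.loc[treated, "outcome"], weights=w[treated])
    risk0 = np.average(df.loc[~treated, "outcome"], weights=w[~treated])
    return risk1 - risk0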
ABSTRACT: The global Big Data market is predicted to reach USD 48.3 billion by 2018 (source: Transparency Market Research). Big Data is fast becoming the new normal. The high variety of available information, and the need to connect many disparate data sets, is as much a critical aspect of Big Data as its sheer volume. A growing talent gap in the market is a challenge in its own right. By 2018, the United States alone could face a shortage of 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with Big Data know-how (source: McKinsey Global Institute). The pharmaceutical and biomedical industries are driven to seek new ways to better utilize their data by consolidating diverse internal and external resources, as traditional translational research is deemed inefficient and costly, and better use of data has become a key focus for competitive advantage. Additionally, the effective linking of diverse data sources can reveal hidden relationships and guide novel research strategies. Linked data, a term coined by Sir Tim Berners-Lee, is a method of publishing structured data so that it can be cross-referenced and processed by computers, and it is now increasingly being used to publish large biomedical datasets. The Open Pharmacological Concepts Triple Store (Open PHACTS) is one such innovative initiative that uses semantic web and Linked Data technologies to enable scientists to easily access and process data from multiple sources to drive new scientific and strategic insights in target finding, drug repurposing, and precision medicine. Simply put, Big Data is the challenge that many organizations face in attempting to effectively use and gain insight from all of the available data.
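To illustrate the Linked Data idea itself (a toy example with hypothetical identifiers, not actual Open PHACTS content), data are published as subject-predicate-object triples that software can traverse and query across sources; the sketch below uses the Python rdflib library and a tiny SPARQL query.

# Toy Linked Data example with made-up pharmacology triples (not Open PHACTS data).
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/pharma/")
g = Graph()
g.add((EX.compoundA, EX.inhibits, EX.targetX))
g.add((EX.targetX, EX.associatedWith, Literal("hypertension")))

# Which diseases can be reached from compoundA through an inhibited target?
results = g.query("""
    PREFIX ex: <http://example.org/pharma/>
    SELECT ?disease WHERE {
        ex:compoundA ex:inhibits ?target .
        ?target ex:associatedWith ?disease .
    }
""")
for row in results:
    print(row.disease)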
ABSTRACT: Together with Brian Caffo and Roger Peng I created the Johns Hopkins Data Science Specialization. This program has generated 2.79 million enrollments in its first year - 1 million more enrollments than all MITx and HarvardX courses over the 2-year period 2012-2014 combined. This specialization was created with less than $5,000 in equipment by three professors with no online teaching experience, in their spare time, over about a five-month period. In this talk I'll discuss the creation and history of the DSS, what we have learned from the experience, and share some qualitative and quantitative descriptions of the program. I will wrap up with a discussion of the role of statisticians in the big data era and the future of MOOCs from my perspective.
Last Update: May 4, 2015