Department of Biostatistics
Big Data Seminar 2015 - 2016

Organizer: Sheila Gaynor

Schedule:

This working seminar focuses on statistical and computational methods for analyzing big data. Big data arise from a wide range of studies in health science research, such as genetics and genomics, environmental health research, comparative effectiveness research, electronic medical records, neuroscience, and social networks. We discuss recent developments in statistical and computational methodology for analyzing big data and health science applications where big data arise. The goal of this seminar is to exchange ideas and stimulate more quantitative research in this challenging and important area.

Assistant Professor of Statistics, Rice University, and Chief Scientist, RStudio

ABSTRACT: A fluent interface lets you easily express yourself in code. Over time a fluent interface retreats to your subconscious. You don't need to bring it to mind; the code just flows out of your fingers. I strive for this fluency in all the packages I write, and while I don't always succeed, I think I've learned some valuable lessons along the way.

In this talk, I'll discuss three guidelines that make it easier to develop fluent interfaces:

1. Pure functions. A pure function only interacts with the world through its inputs and outputs; it has no side effects. Pure functions make great building blocks because they're easy to reason about and can be easily composed.
2. Predictable interfaces. It's easier to learn a function if it's consistent, because you can learn the behaviour of a whole group of functions at once. I'll highlight the benefits of predictability with some of my favourite R "WAT"s (including `c()`, `sapply()` and `sample()`).
3. Pipes. Pure, predictable functions are nice in isolation but are most powerful in combination. The pipe, `%>%`, is particularly important when combining many functions because it turns function composition on its head so you can read it from left to right. I'll show you how this has helped me build dplyr, rvest, ggvis, lowliner, stringr and more.

This talk will help you make the best use of my recent packages, and teach you how to apply the same principles to make your own code easier to use.
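The pure-function and pipe ideas above carry over to other languages as well. Here is a minimal Python analogue of left-to-right composition in the spirit of `%>%` (the `pipe` helper and the `double`/`increment` functions are illustrative inventions, not part of any of the packages mentioned):

```python
from functools import reduce

def pipe(value, *funcs):
    """Apply funcs left-to-right, in the spirit of R's %>% operator."""
    return reduce(lambda acc, f: f(acc), funcs, value)

# Pure functions: outputs depend only on inputs, no side effects,
# so they compose safely.
def double(x):
    return x * 2

def increment(x):
    return x + 1

# Reads left-to-right: start with 3, double it, then increment.
result = pipe(3, double, increment)
print(result)  # 7
```

Because `double` and `increment` are pure, the chain can be read step by step without worrying about hidden state.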

Assistant Professor, Departments of Sociology and Statistics, Indiana University - Bloomington

ABSTRACT: The exponential random graph model (ERGM) has become a standard statistical tool for modeling social networks. In particular, the ERGM provides great flexibility to account for covariate effects on tie formation as well as endogenous network formation processes (e.g., reciprocity and transitivity). However, due to its reliance on Markov chain Monte Carlo, it is difficult to estimate ERGMs on large networks (e.g., networks composed of hundreds of nodes and edges). This paper describes several methods to address the computational challenges in estimating ERGMs on large networks and compares their advantages and disadvantages. The paper also uses a school friendship network to demonstrate selected methods.

W. R. Kenan, Jr. Distinguished Professor and Chair, Department of Biostatistics, and Professor, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill

ABSTRACT: There has recently been an explosion of interest and activity in personalized medicine. However, the goal of personalized medicine—wherein treatments are targeted to take into account patient heterogeneity—has been a focus of medicine for centuries. Precision medicine, on the other hand, is a much more recent refinement which seeks to develop personalized medicine that is empirically based, scientifically rigorous, and reproducible. In this presentation, we describe several new machine learning developments which advance this quest through discovering individualized treatment rules based on patient-level features. Regression and classification are useful statistical tools for estimating such rules based on either observational data or data from a randomized trial, and machine learning approaches can help with this because of their ability to artfully handle high dimensional feature spaces with potentially complex interactions. For the multiple decision setting, reinforcement learning, which is similar to but different from regression, is necessary to properly account for delayed effects. There are several other intriguing nonstandard machine learning tools which can also greatly facilitate discovery of treatment rules. One of these is outcome weighted learning, or O-learning, which directly estimates the decision rules without requiring regression modeling and is thus robust to model misspecification. Several clinical examples illustrating these approaches will also be given.

Associate Director, Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health

ABSTRACT: Until recently, estimating the dynamics of human populations was almost impossible at the scale of populations in low-income regions. However, the rapid adoption of mobile technologies, particularly among vulnerable and hard-to-reach populations, has led to enormous amounts of information on the location and mobility patterns of millions of individuals in the most remote parts of the world. We use this data to parameterize human mobility, and combine our estimates with infectious disease epidemiological models to predict how people spread diseases. I will discuss the application of this new approach to infectious disease epidemiology, as well as some of the limitations of the methods and future directions of the field.
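As a rough illustration of what "parameterizing human mobility" can mean, a classic baseline is the gravity model, in which the flow between two locations grows with their populations and decays as a power of the distance between them. The function and parameter values below are a generic textbook sketch, not the speaker's model:

```python
def gravity_flow(pop_origin, pop_dest, distance,
                 k=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    """Predicted trip volume between two locations under a gravity model.

    Flow scales with both populations (exponents alpha, beta) and decays
    as a power gamma of distance; k is an overall scaling constant.
    """
    return k * (pop_origin ** alpha) * (pop_dest ** beta) / (distance ** gamma)

# Two cities of 1,000,000 and 500,000 people, 100 km apart.
print(gravity_flow(1_000_000, 500_000, 100.0))  # 50000000.0
```

In practice the exponents would be fit to observed trip counts (e.g., from mobile phone records) rather than fixed in advance.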

Cardiologist, Brigham and Women's Hospital

Research Scientist, Broad Institute of MIT and Harvard

Venture Partner, Google Ventures / Broad Institute

ABSTRACT: Fueled by an explosion of next generation sequence data, the Broad Institute has recently created a new group called the "Data Sciences Platform" that is dedicated to developing a scalable platform for storage and analyses of genomic data. In this talk, we will overview the formation and mission of the Broad Data Sciences Platform. Topics covered include: new methods for genomic data analysis, the Broad's NCI Cloud Pilot, and the formation of patient-centric data donation platforms.

Postdoctoral Research Associate, Department of Statistical Science, University of Pennsylvania

ABSTRACT: Rotational post-hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity-inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations and (c) better oriented sparse solutions. To avoid the pre-specification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian Buffet Process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the Spike-and-Slab LASSO prior, a two-component refinement of the Laplace prior (Ročková 2015). A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional gene expression data, for which posterior simulation would be impractical.

Postdoctoral Fellow, Department of Statistical Science, Duke University

ABSTRACT: The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure, particularly when the data set is large. We introduce a simple, coherent approach to Bayesian inference that improves robustness to small departures from the model: rather than condition on the observed data exactly, one conditions on the event that the model generates data close to the observed data, with respect to a given statistical distance. When closeness is defined in terms of relative entropy, the resulting "coarsened posterior" can be approximated by simply raising the likelihood to a certain fractional power, making the method computationally efficient and easy to implement in practice. We illustrate with real and simulated data, and provide theoretical results.
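To make the "raise the likelihood to a fractional power" step concrete, here is a toy conjugate sketch for a Beta-Bernoulli model. Tempering the Bernoulli likelihood by a power zeta keeps the posterior Beta and simply down-weights the data counts. This is my own minimal illustration of the mechanism, not code from the paper:

```python
def power_posterior(successes, failures, a=1.0, b=1.0, zeta=0.5):
    """Beta posterior parameters after raising the likelihood to power zeta.

    With a Beta(a, b) prior, likelihood^zeta = theta^(zeta*k) *
    (1-theta)^(zeta*(n-k)), so conjugacy is preserved. zeta = 1 recovers
    the standard posterior; smaller zeta discounts the data, which gives
    robustness to small departures from the assumed model.
    """
    return a + zeta * successes, b + zeta * failures

# 80 successes in 100 trials, likelihood tempered by zeta = 0.5:
print(power_posterior(80, 20, zeta=0.5))  # (41.0, 11.0)
```

Note how the tempered posterior behaves as if only half the data had been observed, so it concentrates more slowly than the exact posterior, which is the desired robustness effect for large, slightly misspecified data sets.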

Principal Researcher, Microsoft Research (NYC Lab)

ABSTRACT: We study the general problem of how to learn through experience to make intelligent decisions. In this setting, called the contextual bandits problem, the learner must repeatedly decide which action to take in response to an observed context, and is then permitted to observe the received reward, but only for the chosen action. The goal is to learn through experience to behave nearly as well as the best policy (or decision rule) in some possibly very large and rich space of candidate policies. Previous approaches to this problem were all highly inefficient and often extremely complicated. In this work, we present a new, fast, and simple algorithm that learns to behave as well as the best policy at a rate that is (almost) statistically optimal. Our approach assumes access to a kind of oracle for classification learning problems which can be used to select policies; in practice, most off-the-shelf classification algorithms could be used for this purpose. Our algorithm makes very modest use of the oracle, calling it far less than once per round on average, a huge improvement over previous methods. These properties suggest this may be the most practical contextual bandits algorithm among all existing approaches that are provably effective for general policy classes.

This is joint work with Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford and Lihong Li.
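To illustrate the setting (not the talk's oracle-based algorithm, which is far more sophisticated), here is a toy epsilon-greedy contextual bandit: in each round the learner sees a context, picks one action, and observes a reward only for that action. The contexts, actions, and reward probabilities below are illustrative assumptions:

```python
import random
from collections import defaultdict

def run_bandit(reward_fn, contexts, actions, rounds=5000, eps=0.1, seed=0):
    """Epsilon-greedy contextual bandit: per-(context, action) reward averages."""
    rng = random.Random(seed)
    totals = defaultdict(float)   # (context, action) -> summed reward
    counts = defaultdict(int)     # (context, action) -> number of pulls
    gained = 0.0
    for _ in range(rounds):
        ctx = rng.choice(contexts)
        if rng.random() < eps:    # explore: random action
            act = rng.choice(actions)
        else:                     # exploit: best estimated action for this context
            act = max(actions, key=lambda a: totals[ctx, a] / counts[ctx, a]
                      if counts[ctx, a] else 0.0)
        r = reward_fn(ctx, act, rng)   # reward observed only for chosen action
        totals[ctx, act] += r
        counts[ctx, act] += 1
        gained += r
    return gained / rounds

# The best action depends on the context: action 0 pays off (p = 0.8) in
# context "A", action 1 in context "B"; the wrong action pays off with p = 0.2.
mean_reward = run_bandit(
    lambda ctx, act, rng: float(rng.random() < (0.8 if (ctx == "A") == (act == 0) else 0.2)),
    contexts=["A", "B"], actions=[0, 1])
print(round(mean_reward, 2))
```

A context-blind policy can earn at most 0.5 on average here, while a learner that conditions on context approaches 0.8 (minus exploration cost), which is why the contextual formulation matters.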

K.T. Li Professor of International Health, Director, Harvard Global Health Institute, Department of Health Policy and Management, Harvard T.H. Chan School of Public Health

ABSTRACT: None Given

Distinguished Research Staff Member and Senior Manager, Health Informatics Research, Thomas J. Watson Research Center, IBM Research

ABSTRACT: None Given

Continuing Education/Special Program Instructor, Harvard University / Senior Enterprise Architect, NTT Data, Inc.

ABSTRACT: None Given

Assistant Professor, Statistics Department, Yale University

ABSTRACT: None Given

Dean's Career Development Professor and Associate Professor of Information Systems

Director, Event and Pattern Detection Laboratory, H.J. Heinz III College, Carnegie Mellon University

ABSTRACT: None Given

Professor of Computational Biology, Harvard T.H. Chan School of Public Health / Dana-Farber Cancer Institute

ABSTRACT: None Given

Maintained by the Biostatistics Webmaster. Last Update: February 4, 2016