Colloquium Seminar Series

Upcoming Colloquium

March 23rd
4:00 – 5:00 pm
Kresge 200

Lexin Li
Professor of Biostatistics and Epidemiology
University of California, Berkeley

Statistical Neuroimaging Analysis: An Overview

Understanding the inner workings of the human brain, as well as its connections with neurological disorders, is one of the most intriguing scientific questions. Studies in neuroscience are greatly facilitated by a variety of neuroimaging technologies, including anatomical magnetic resonance imaging (MRI), functional magnetic resonance imaging (fMRI), electroencephalography (EEG), diffusion tensor imaging, and positron emission tomography (PET), among many others. The size and complexity of medical imaging data, however, pose numerous challenges and call for the constant development of new statistical methods. In this talk, I give an overview of a range of neuroimaging topics our group has been investigating, including imaging tensor analysis, brain connectivity network analysis, multimodality analysis, and imaging causal analysis. I also illustrate with a number of specific case studies.


Upcoming Colloquia:

March 23rd - Lexin Li

April 13th - Stijn Vansteelandt

April 20th - Eric Laber

Colloquium Archive

March 9th, 2023 - Susan Murphy

March 9th
4:00 – 5:00 pm
FXB- G13
Susan Murphy 

Mallinckrodt Professor of Statistics and of Computer Science,
Radcliffe Alumnae Professor at the Radcliffe Institute, Harvard University

Inference for Longitudinal Data After Adaptive Sampling


Adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, are increasingly used for the real-time personalization of interventions in digital applications like mobile health and education. As a result, there is a need to be able to use the resulting adaptively collected user data to address a variety of inferential questions, including questions about time-varying causal effects. However, current methods for statistical inference on such data (a) make strong assumptions regarding the environment dynamics, e.g., assume the longitudinal data follows a Markovian process, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes algorithms that learn to select actions using data collected from multiple users. These are major obstacles preventing the use of adaptive sampling algorithms more widely in practice. In this work, we provide statistical inference for the common Z-estimator based on adaptively sampled data. The inference (a) is valid even when observations are non-stationary and highly dependent over time, and (b) allows the online adaptive sampling algorithm to learn using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work in designing the Oralytics oral health clinical trial, in which an RL adaptive sampling algorithm will be used to select treatments, yet valid statistical inference is essential for conducting primary data analyses after the trial is over.
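A Z-estimator, as referenced in the abstract, is an estimator defined as the root of an estimating equation, with standard errors typically obtained from a sandwich formula. As a generic illustration only (not the talk's adaptive-sampling method, and with all variable names ours), here is least squares cast as a Z-estimator in Python:

```python
import numpy as np

def z_estimate_ols(X, y):
    """OLS viewed as a Z-estimator: solve sum_i x_i (y_i - x_i' b) = 0,
    with a sandwich variance estimate A^{-1} B A^{-1} / n that remains
    valid under heteroscedastic errors."""
    n = len(y)
    beta = np.linalg.solve(X.T @ X, X.T @ y)  # root of the estimating equation
    resid = y - X @ beta
    A = X.T @ X / n                           # minus the mean derivative of psi
    Xr = X * resid[:, None]
    B = Xr.T @ Xr / n                         # empirical second moment of psi
    Ainv = np.linalg.inv(A)
    cov = Ainv @ B @ Ainv / n                 # sandwich covariance of beta-hat
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Heteroscedastic noise: ordinary OLS standard errors would be wrong here.
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))
beta, se = z_estimate_ols(X, y)
```

The sandwich form is what survives when the usual variance assumptions fail; the adaptively sampled, non-stationary setting of the talk requires considerably more delicate arguments than this i.i.d. sketch.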

February 23rd, 2023 - Samuel Kou

February 23rd

4:00 – 5:00 pm
FXB – G13

Samuel Kou
Departments of Statistics and Biostatistics
Harvard University

Catalytic Prior Distributions for Bayesian Inference

The prior distribution is an essential part of Bayesian statistics, and yet in practice, it is often challenging to quantify existing knowledge into pragmatic prior distributions. In this talk we will discuss a general method for constructing prior distributions that stabilize the estimation of complex target models, especially when the sample sizes are too small for standard statistical analysis, which is a common situation encountered by practitioners with real data. The key idea of our method is to supplement the observed data with a relatively small amount of “synthetic” data generated, for example, from the predictive distribution of a simpler, stably estimated model. This general class of prior distributions, called “catalytic prior distributions,” is easy to use and allows direct statistical interpretation. In the numerical evaluations, the resulting posterior estimation using the catalytic prior distribution outperforms the maximum likelihood estimate from the target model and is generally superior to or comparable in performance to competitive existing methods. We will illustrate the usefulness of the catalytic prior approach through real examples and explore the connection between the catalytic prior approach and a few popular regularization methods.
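The data-augmentation idea can be sketched in a toy linear-regression example of our own (the weight tau and synthetic sample size M below are illustrative choices, not prescriptions from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small observed sample: p = 4 predictors, only n = 10 observations.
n, p = 10, 4
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5, 0.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# Step 1: fit a simpler, stably estimated model (here: intercept only).
mu, sigma = y.mean(), y.std(ddof=1)

# Step 2: generate M synthetic observations from its predictive distribution,
# with covariates resampled from the observed rows.
M = 400
X_syn = X[rng.integers(0, n, size=M)]
y_syn = rng.normal(mu, sigma, size=M)

# Step 3: downweight the synthetic data by tau/M and fit the target model
# by weighted least squares -- the posterior mode under the catalytic prior.
tau = 4.0  # total prior weight, measured in "synthetic sample size" units
w = np.concatenate([np.ones(n), np.full(M, tau / M)])
Xa = np.vstack([X, X_syn])
ya = np.concatenate([y, y_syn])
WX = Xa * w[:, None]
beta_cat = np.linalg.solve(WX.T @ Xa, WX.T @ ya)
```

The synthetic observations act like a prior worth roughly tau observations, pulling the fit toward the simpler model exactly when the real sample is too small to stand on its own.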

February 9th, 2023 - Fernanda Viegas & Martin Wattenberg

February 9th, 2023
4:00 – 5:00 pm

Combined Colloquium with:

Fernanda Viegas
Sally Starling Seaver Professor at Harvard Radcliffe Institute
Gordon McKay Professor of Computer Science
Harvard John A. Paulson School of Engineering and Applied Sciences

Martin Wattenberg
Gordon McKay Professor of Computer Science

Beyond graphs and charts: harnessing the power of data visualization

While most of us are familiar with simple graphs such as bar charts and line charts, data visualization is a broad and expressive medium. In this presentation we’ll touch on some of the visual techniques for powerful exploratory data analysis that look at different kinds of rich data such as text and images. We’ll also discuss storytelling with data and the key differences between communication and exploration with data visualization.

January 26th, 2023 - Tamara Broderick, MIT

January 26th, 2023
4:00 – 5:00 pm
Kresge G2

Tamara Broderick, PhD
Associate Professor
Machine Learning and Statistics, MIT

An Automatic Finite-Sample Robustness Metric: Can Dropping a Little Data Change Conclusions?

One hopes that data analyses will be used to make beneficial decisions regarding people’s health, finances, and well-being. But the data fed to an analysis may systematically differ from the data where these decisions are ultimately applied. For instance, suppose we analyze data in one country and conclude that microcredit is effective at alleviating poverty; based on this analysis, we decide to distribute microcredit in other locations and in future years. We might then ask: can we trust our conclusion to apply under new conditions? If we found that a very small percentage of the original data was instrumental in determining the original conclusion, we might expect the conclusion to be unstable under new conditions. So we propose a method to assess the sensitivity of data analyses to the removal of a very small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide an approximation. We call our resulting method the Approximate Maximum Influence Perturbation. Our approximation is automatically computable, theoretically supported, and works for common estimators. We show that any non-robustness our metric finds is conclusive. Empirics demonstrate that while some applications are robust, in others the sign of a treatment effect can be changed by dropping less than 0.1% of the data even in simple models and even when standard errors are small.
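The approximation can be sketched for ordinary least squares (our own simplification; the actual Approximate Maximum Influence Perturbation applies to general estimators and comes with theoretical support): to first order, dropping observation i shifts a coefficient by about minus that observation's empirical influence, so the most damaging small subset to drop is the one with the largest summed influence.

```python
import numpy as np

def amip_ols(X, y, coord, alpha):
    """First-order approximation to the largest downward change in one OLS
    coefficient achievable by dropping an alpha-fraction of the data
    (the AMIP idea, sketched for least squares only)."""
    n = len(y)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Influence of each observation on the chosen coefficient.
    infl = (X @ XtX_inv[:, coord]) * resid
    k = max(1, int(alpha * n))
    # Drop the k points whose removal pushes the coefficient down the most.
    drop = np.argsort(infl)[-k:]
    approx_change = -infl[drop].sum()
    return beta[coord], beta[coord] + approx_change, drop

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 0.05 * x + rng.normal(size=n)  # small, noisy "treatment effect"
b, b_dropped, dropped = amip_ols(X, y, coord=1, alpha=0.01)
```

Sorting one influence score per observation replaces the combinatorial search over all subsets, which is what makes the metric automatically computable.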

April 21st, 2022 - Ting Ye, University of Washington

Ting Ye, Ph.D.
Assistant Professor
Department of Biostatistics
University of Washington

Robust Mendelian Randomization in the Presence of Many Weak Instruments and Widespread Horizontal Pleiotropy

Mendelian randomization (MR) has become a popular approach to study the effect of a modifiable exposure on an outcome by using genetic variants as instrumental variables (IVs). Two distinct challenges persist in MR: (i) each genetic variant explains a relatively small proportion of variance in the exposure and there are many such variants, a setting known as many weak IVs; and (ii) many genetic variants may have direct effects on the outcome not through the exposure, or in genetic terms, when there exists widespread horizontal pleiotropy. To address these two challenges simultaneously, we propose two novel estimators, the debiased inverse-variance weighted (dIVW) estimator for summary-data MR and the GENIUS-MAWII estimator for individual-data MR, and we establish their statistical properties. We conclude by demonstrating these two methods in simulated and real datasets.
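For summary-data MR, the standard IVW estimator and the debiasing idea behind dIVW can be sketched in a few lines (a simplified simulation of our own, assuming no pleiotropy and known standard errors; see the paper for the actual estimator and its theory):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated GWAS summary statistics for p weak instruments.
p, beta_true = 200, 0.3
gamma = rng.normal(0, 0.02, size=p)  # true SNP-exposure effects (weak)
se_x = np.full(p, 0.01)              # SEs of the exposure associations
se_y = np.full(p, 0.01)              # SEs of the outcome associations
gamma_hat = gamma + rng.normal(0, se_x)
Gamma_hat = beta_true * gamma + rng.normal(0, se_y)

# Standard IVW: biased toward zero with many weak instruments, because
# E[gamma_hat^2] = gamma^2 + se_x^2 inflates the denominator.
b_ivw = (np.sum(gamma_hat * Gamma_hat / se_y**2)
         / np.sum(gamma_hat**2 / se_y**2))

# Debiased IVW: subtract the measurement-error term se_x^2 in the denominator.
b_divw = (np.sum(gamma_hat * Gamma_hat / se_y**2)
          / np.sum((gamma_hat**2 - se_x**2) / se_y**2))
```

In this simulation the IVW estimate is attenuated toward the null while the debiased denominator recovers an estimate near the true effect of 0.3.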

Short bio: Ting Ye is an Assistant Professor in the Department of Biostatistics at the University of Washington. Her research interests focus on developing pragmatic and robust statistical methods for causal inference in biomedical and social sciences. Most of her research has been about addressing complications in clinical trials and hidden biases in observational studies.

March 24th, 2022 - James Scott, University of Texas at Austin

March 24th, 2022 - James Scott, PhD
Professor of Statistics and Data Science
Fayez Sarofim & Co. Centennial Professor in Business
University of Texas at Austin

BART and Its Variations: Three Applications in Obstetrics

In this talk, I will describe some of my group’s ongoing work to address statistical challenges in functional regression, with specific application to obstetrics, and with the overall public-health goal of reducing complications of pregnancy. I will first describe a general Bayesian approach for nonparametric regression, called “BART with targeted smoothing” (or tsBART). TsBART is based on a very popular nonparametric regression framework called Bayesian Additive Regression Trees (BART), but modified in some crucial ways to address what our medical collaborators have identified as some of the most common data-science motifs in obstetric research. I will then describe how we’ve adapted this framework in a variety of directions, to help answer three different questions in obstetrics research: (1) how to recommend an optimal gestational age of delivery that minimizes the overall risk of perinatal mortality in high-risk pregnancies; (2) how preeclampsia is related to birthweight in low-resource hospital settings, such as those common in lower and middle-income countries; and (3) how two different dosing protocols for early medical abortion compare in effectiveness over the first nine weeks of gestation.


February 17th, 2022 - Chiara Sabatti, Stanford University

February 17th, 2022 - Chiara Sabatti, PhD
Professor of Biomedical Data Science
Professor of Statistics
Stanford University 

Genetic variants across human populations and our understanding of the genetic basis of traits

Abstract: Identifying which genetic variants influence medically relevant phenotypes is an important task both for therapeutic development and for risk prediction. In the last decade, genome-wide association studies have been the most widely used instrument to tackle this question. One challenge that they encounter is in the interplay between genetic variability and the structure of human populations. In this talk, we will focus on some opportunities that arise when one collects data from diverse populations and present statistical methods that allow us to leverage them.

The presentation will be based on joint work with M. Sesia, S. Li, Z. Ren, Y. Romano and E. Candes.


April 15th, 2021 - Bhramar Mukherjee, University of Michigan

April 15th - Bhramar Mukherjee, PhD
Chair of Biostatistics
University of Michigan

Handling Outcome Misclassification and Selection Bias in Association Studies Using Electronic Health Records

In this talk we will discuss statistical challenges and opportunities with joint analysis of electronic health records and genomic data through “Genome and Phenome-Wide Association Studies (GWAS and PheWAS)”. We posit a modeling framework that helps us to understand the effect of both selection bias and outcome misclassification in assessing genetic associations across the medical phenome. We will propose various inferential strategies that handle both sources of bias to yield improved inference. We will use data from the UK Biobank and the Michigan Genomics Initiative, a longitudinal biorepository at Michigan Medicine launched in 2012, to illustrate the analytic framework.

The examples illustrate that understanding sampling design and selection bias matters for big data and is at the heart of doing good science with data. This is joint work with Lauren Beesley and Lars Fritsche at the University of Michigan.


For Zoom Info Please Email:

March 25th, 2021 - Kathryn Roeder, Carnegie Mellon University

March 25th - Kathryn Roeder, PhD
Department of Statistics and Data Science
UPMC Professor of Statistics and Life Sciences
Carnegie Mellon University

Statistical challenges in the analysis of single-cell RNA-seq from brain cells

Quantification of gene expression using single-cell RNA sequencing of brain tissues can be a critical step in understanding cell development and differences between cells sampled from case and control subjects. We describe statistical challenges encountered in analyzing the expression of brain cells in the context of two projects. First, over-correction has been one of the main concerns in employing various data integration methods, which risk removing biological distinctions and thereby harming cell type identification. Here, we present a simple yet surprisingly effective transfer learning model named cFIT for removing batch effects across experiments, technologies, subjects, and even species. Second, gene co-expression networks yield critical insights into biological processes, and single-cell RNA sequencing provides an opportunity to target inquiries at the cellular level. However, due to the sparsity and heterogeneity of transcript counts, it is challenging to construct accurate gene co-expression networks. We develop an alternative approach that estimates cell-specific networks for each single cell. We use this method to identify differential network genes in a comparison of cells from brains of individuals with autism spectrum disorder and those without.

February 12, 2021 - Devan V. Mehrotra, University of Pennsylvania

Friday, February 12th, 2021
4:00 – 5:00 pm
Lunchtime Career Chat: 1:00 – 2:00 pm
For Zoom Info Please Email:

Devan V. Mehrotra
University of Pennsylvania

A Novel Approach for Survival Analysis in Randomized Clinical Trials

Biostatistics and Research Decision Sciences, Merck & Co., Inc.

Randomized clinical trials are often designed to assess whether a test treatment prolongs survival relative to a control treatment. Increased patient heterogeneity, while desirable for generalizability of results, can weaken the ability of common statistical approaches to detect treatment differences, potentially hampering the regulatory approval of safe and efficacious therapies. A novel solution to this problem is proposed. A list of baseline covariates that have the potential to be prognostic for survival under either treatment is pre-specified in the analysis plan. At the analysis stage, using observed survival times but blinded to patient-level treatment assignment, ‘noise’ covariates are removed with elastic net Cox regression. The shortened covariate list is subsequently used by a conditional inference tree algorithm to segment the heterogeneous trial population into subpopulations of prognostically homogeneous patients (risk strata). After patient-level treatment unblinding, a treatment comparison is done within each formed risk stratum and stratum-level results are combined for overall statistical inference. The impressive power-boosting performance of our proposed 5-step stratified testing and amalgamation routine (5-STAR), relative to that of the logrank test and other common approaches that do not leverage inherently structured patient heterogeneity, is illustrated using a hypothetical and two real datasets along with simulation results. In addition, the importance of reporting stratum-level treatment effects is highlighted as a potential enabler of personalized medicine. An R package is available for implementation. (Joint work with Rachel Marceau West at Merck).
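The final amalgamation step, combining stratum-level logrank terms into a single stratified test, can be sketched as follows (a minimal illustration of our own with pre-defined strata and no censoring; 5-STAR itself forms the strata data-adaptively via elastic net Cox regression and conditional inference trees):

```python
import numpy as np

def logrank_terms(t0, t1):
    """Observed-minus-expected and variance terms of the two-sample
    logrank statistic (no censoring, for brevity)."""
    times = np.unique(np.concatenate([t0, t1]))
    o_minus_e, var = 0.0, 0.0
    for t in times:
        n0, n1 = np.sum(t0 >= t), np.sum(t1 >= t)  # numbers at risk
        d0, d1 = np.sum(t0 == t), np.sum(t1 == t)  # events at time t
        n, d = n0 + n1, d0 + d1
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e, var

def stratified_logrank(strata):
    """Amalgamate stratum-level (O-E, V) terms into one Z statistic."""
    terms = [logrank_terms(t0, t1) for t0, t1 in strata]
    return sum(o for o, _ in terms) / np.sqrt(sum(v for _, v in terms))

rng = np.random.default_rng(4)
# Two risk strata with different baseline hazards; treatment halves the hazard.
strata = []
for base in (1.0, 4.0):
    t_ctrl = rng.exponential(base, size=150)
    t_trt = rng.exponential(2 * base, size=150)  # longer survival on treatment
    strata.append((t_ctrl, t_trt))
z = stratified_logrank(strata)
```

Summing O-E and variance contributions within prognostically homogeneous strata is what prevents heterogeneity in baseline risk from diluting the treatment signal.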

September 20, 2018 - MYRTO AWARD - Elizabeth Stuart, Johns Hopkins

Myrto Lefkopoulou Award
Thursday, September 20, 2018
Kresge G2
Award & Lecture: 3:45-4:45pm
Coffee & Tea served at 3:30pm
Reception to follow at 4:45pm in the FXB Atrium

Elizabeth Stuart
Associate Dean for Education
Professor of Mental Health, Biostatistics, and Health Policy and Management
Johns Hopkins Bloomberg School of Public Health

Dealing with observed and unobserved effect moderators when estimating population average treatment effects

Many decisions in public health and public policy require estimation of population average treatment effects, including questions of cost effectiveness or when deciding whether to implement a screening program across a population. While randomized trials are seen as the gold standard for (internally valid) causal effects, they do not always yield accurate inferences regarding population effects. In particular, in the presence of treatment effect heterogeneity, the average treatment effect (ATE) in a randomized controlled trial (RCT) may differ from the average effect of the same treatment if applied to a target population of interest. If all treatment effect moderators are observed in the RCT and in a dataset representing the target population, then we can obtain an estimate for the target population ATE by adjusting for the difference in the distribution of the moderators between the two samples. However, that is often an unrealistic assumption in practice. This talk will discuss methods for generalizing treatment effects under that assumption, as well as sensitivity analyses for two situations: (1) where we cannot adjust for a specific moderator observed in the RCT because we do not observe it in the target population; and (2) where we are concerned that the treatment effect may be moderated by factors not observed even in the RCT. These sensitivity analyses are particularly crucial given the often limited data available from trials and on the population. The methods are applied to examples in drug abuse treatment. Implications for study design and analyses are also discussed, when interest is in a target population ATE.
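The adjustment described above, reweighting subgroup-specific effects to the target population's moderator distribution, can be sketched for a single observed binary moderator (a toy simulation of our own, not the talk's drug abuse treatment examples):

```python
import numpy as np

rng = np.random.default_rng(5)

# RCT: binary moderator m; the treatment effect differs by m.
n = 4000
m = rng.binomial(1, 0.3, size=n)  # 30% of trial participants have m = 1
a = rng.binomial(1, 0.5, size=n)  # randomized treatment assignment
y = 1.0 * a * m + 0.2 * a * (1 - m) + rng.normal(size=n)

# Unadjusted trial ATE mixes the subgroup effects at trial proportions.
ate_trial = y[a == 1].mean() - y[a == 0].mean()

# Target population has 70% m = 1: post-stratify the subgroup effects.
target_share = np.array([0.3, 0.7])  # P(m = 0), P(m = 1) in the target
sub_ate = np.array([
    y[(a == 1) & (m == g)].mean() - y[(a == 0) & (m == g)].mean()
    for g in (0, 1)
])
ate_target = float(target_share @ sub_ate)
```

Here the trial ATE is roughly 0.3(1.0) + 0.7(0.2) while the target ATE is roughly 0.3(0.2) + 0.7(1.0); the gap is exactly the phenomenon that motivates the sensitivity analyses in the talk, since this fix is only available when the moderator is observed in both samples.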

October 11, 2018 - LAGAKOS AWARD - Amy Herring, Duke University

Lagakos Distinguished Alumni Award
Thursday, October 11, 2018
Kresge G3
Award & Lecture: 3:45-4:45pm
Coffee & Tea served at 3:30pm
Reception to follow at 4:45pm in the FXB Atrium

Amy Herring
Professor of Statistical Science
Research Professor of Global Health
Duke University

Statistics for Science’s Sake

From decapitated cats (my first Harvard biostatistics project!) to birth defects (my second project and continued scholarly focus) and beyond, we will consider a series of case studies of scientific problems that pose interesting statistical challenges motivating new methodological development. We will address the motivating scientific problems, drawbacks of existing or standard analysis approaches, and the process of collaboration in multiple disciplines, with a focus on strategies for generating ideas for research beyond graduate school and throughout one’s career.

November 15, 2018 - Pierre Bushel, National Institute of Environmental Health Sciences

November 15, 2018
Building 2, Room 426
Happy Hour to follow in the FXB Atrium

Pierre Bushel
Staff Scientist, Biostatistics and Computational Biology Branch
National Institute of Environmental Health Sciences

A Mashup of Statistics and Bioinformatics Applied to Genomic and Genetic Data for Better Understanding of Biological Consequences

Exposure to certain environmental and chemical stressors can cause adverse health conditions and is of immense public concern. Determining the mechanisms by which these insults affect biological systems is paramount to deriving remedies that improve public health. Genetics and genomics are increasingly being used as tools to investigate environmental health sciences and toxicology. In this presentation we will examine two cases of applying statistics and bioinformatics for a better understanding of the biological consequences elicited by chemical and environmental exposures. In the first case, a human clinical study of gene expression changes in the blood of responders and non-responders to acetaminophen (APAP) is utilized. APAP is the active ingredient in Tylenol and is toxic to the liver when taken with alcohol, in overdose, or in some cases of possible or unknown genetic susceptibility. Here we demonstrate the use of piecewise linear regression modeling of the data and pathway analysis to identify gene signatures and molecular pathways that are indicative of an adverse response to APAP. In particular, a network of genes associated with protein misfolding and the accumulation of misfolded proteins in the endoplasmic reticulum may play a crucial role in mediating APAP toxicity. In the second case, strains of mice exposed to hyperoxia and normoxia are genotyped and characterized phenotypically for genetic analysis. Chronic high oxygen saturation causes mitochondrial dysfunction and excessive reactive oxygen species that damage DNA, adversely affecting the lungs of preterm infants. Here we illustrate the use of mixed linear modeling with a specified correlation structure, together with the MitoMiner database, to reveal epistatic interactions between the nuclear and mitochondrial genomes that are associated with the hyperoxia-induced lung injury phenotype. In particular, nuclear genes with allelic interactions function in the mitochondrial respiratory chain, which is involved in oxidative phosphorylation to create adenosine triphosphate (i.e., the cell’s energy source).
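Piecewise linear regression of the kind used in the first case can be illustrated with a hinge basis (a generic sketch, not the study's model; the breakpoint here is assumed known rather than estimated):

```python
import numpy as np

rng = np.random.default_rng(6)

# Dose-like covariate with a change in slope at a known breakpoint.
x = rng.uniform(0, 10, size=300)
knot = 5.0
y = 1.0 + 0.2 * x + 1.5 * np.maximum(x - knot, 0) + rng.normal(0, 0.3, size=300)

# Piecewise linear regression via a hinge basis: the coefficient on
# max(x - knot, 0) is the change in slope after the breakpoint.
B = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
```

A significant hinge coefficient is the kind of signal that flags a dose-dependent change in expression response.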

Marvin Zelen Leadership Award in Statistical Science
Thursday, May 9, 2019
Room TBD
Award & Lecture: 3:45-4:45pm
Coffee & Tea served at 3:30pm
Reception to follow at 4:45pm

2019 recipient to be announced


May 3, 2018 - Joseph Hogan, Brown University

Thursday, May 3, 2018
3:30 – 4:30 PM

Joseph Hogan
Carole and Lawrence Sirovich Professor of Public Health
Professor of Biostatistics
Chair of Biostatistics
Brown University

Using electronic health records to model engagement and retention in HIV care

The HIV care cascade is a conceptual model describing the stages of care leading to long-term viral suppression of those with HIV. Distinct stages include case identification, linkage to care, initiation of antiretroviral treatment, and eventual viral suppression. After entering care, individuals are subject to disengagement from care, dropout, and mortality.

Owing to the complexity of the cascade, evaluation of efficacy and cost effectiveness of specific policies has primarily relied on simulation-based approaches of mathematical models, where model parameters may be informed by multiple data sources that come from different populations or samples. The growing availability of electronic health records and large-scale cohort data on HIV-infected individuals presents an opportunity for a more unified, data-driven approach using statistical models.

We describe a statistical framework based on multistate models that can be used for regression analysis, prediction and causal inferences. We illustrate using data from a large HIV care program in Kenya, focusing on comparisons between statistical and mathematical modeling approaches for inferring causal effects about treatment policies.

April 26, 2018 - ZELEN AWARD - Constantine Gatsonis, Brown University

Marvin Zelen Leadership Award in Statistical Science

Thursday, April 26, 2018
FXB Building, Room G13
Award & Lecture: 3:45-4:45pm
Coffee & Tea served
Reception to follow at 4:45pm in the FXB Atrium

Constantine Gatsonis
Henry Ledyard Goddard University Professor of Public Health
Professor of Biostatistics
Director of Statistical Sciences
Brown University

The Evaluation of Diagnostic Imaging in the Era of Radiomics

The quantitative analysis of imaging via machine learning methods for high-dimensional data is defining the new frontier for diagnostic imaging. A vast array of imaging-based markers is becoming available, each marker carrying claims of potential utility in clinical care and each being a potential candidate for inclusion in clinical trials. The research enterprise in the clinical evaluation of diagnostic imaging has made great strides in the past three decades. In particular, we now have well-developed paradigms for the study of diagnostic and predictive accuracy in the multi-center setting. We are also making progress on the more complex problem of assessing the impact of tests on patient outcomes, via randomized and observational studies, analysis of large databases, and simulation modeling. However, the volume of the new radiomics-based markers and their potential for fast evolution, even without formal learning, poses a new set of challenges for researchers and regulators. In this presentation we will survey the methodologic advances in the clinical evaluation of diagnostic imaging in recent decades, showcase examples of radiomics-based modalities, and discuss the statistical and regulatory challenges they create.

March 29, 2018 - Catherine Calder, Ohio State

Thursday, March 29, 2018
3:30 – 4:30 PM

Catherine Calder
Professor of Statistics
The Ohio State University

Activity Patterns and Ecological Networks: Identifying Shared Exposures to Social Contexts

In the social and health sciences, research on ‘neighborhood effects’ focuses on linking features of social contexts or exposures to health, educational, and criminological outcomes. Traditionally, individuals are assigned a specific neighborhood, frequently operationalized by the census tract of residence, which may not contain the locations of routine activities. In order to better characterize the many social contexts to which individuals are exposed as a result of the spatially-distributed locations of their routine activities, and to understand the consequences of these socio-spatial exposures, we have developed the concept of ecological networks. Ecological networks are two-mode networks that indirectly link individuals through the spatial overlap in their routine activities. This presentation focuses on statistical methodology for understanding and comparing the structure of ecological network(s). In particular, we propose a continuous latent space (CLS) model that allows for third-order dependence patterns in the interactions between individuals and the places they visit, and a parsimonious non-Euclidean CLS model that facilitates extensions to multi-level modeling. We illustrate our methodology using activity pattern and sample survey data from Los Angeles, CA and Columbus, OH.

March 5, 2018 - JOINT w. STATS - Emily Fox, University of Washington

Monday, March 5, 2018
3:30 – 4:30 PM

Emily Fox
Associate Professor & The Amazon Professor of Machine Learning
Paul G. Allen School of Computer Science & Engineering
Department of Statistics, The University of Washington

Machine Learning for Analyzing Neuroimaging Time Series

Recent neuroimaging modalities, such as high-density electrocorticography (ECoG) and magnetoencephalography (MEG), provide rich descriptions of brain activity over time, opening the door to new analyses of observed phenomena, including seizures, and to understanding the neural underpinnings of complex cognitive processes. However, such data likewise present new challenges for traditional time series methods. One challenge is how to model the complex evolution of the time series with intricate and possibly evolving relationships between the multitude of dimensions. Another challenge is how to scale such analyses to long and possibly streaming recordings in the case of ECoG, or how to leverage few and costly MEG observations. In this talk, we first discuss methods for automatically parsing the complex dynamics of ECoG recordings for the sake of analyzing seizure activity. To handle the lengthy ECoG recordings, we develop stochastic gradient MCMC and variational algorithms that iterate on subsequences of the data while properly accounting for broken temporal dependencies. We then turn to studying the functional connectivity of auditory attention using MEG recordings. We explore notions of undirected graphical models of the time series in the frequency domain as well as learning time-varying directed interactions with state-space models. We conclude by discussing recent work on inferring Granger causal networks in nonlinear time series using penalized neural network models.

January 25, 2018 - Peter Mueller, UT Austin

Thursday, January 25, 2018
3:30 – 4:30 PM

Peter Mueller
Department Chair (Interim), Professor
Department of Statistics & Data Science, Department of Mathematics
The University of Texas at Austin

Bayesian Feature Allocation Models for Tumor Heterogeneity

We characterize tumor variability by hypothetical latent cell types that are defined by the presence of some subset of recorded SNV’s (single nucleotide variants, that is, point mutations). Assuming that each sample is composed of some sample-specific proportions of these cell types, we can then fit the observed proportions of SNV’s for each sample. In other words, by fitting the observed proportions of SNV’s in each sample we impute latent underlying cell types, essentially by a deconvolution of the observed proportions as a weighted average of binary indicators that define cell types by the presence or absence of different SNV’s. In the first approach we use the generic feature allocation model of the Indian buffet process (IBP) as a prior for the latent cell subpopulations. In a second version of the proposed approach we make use of pairs of SNV’s that are jointly recorded on the same reads, thereby contributing valuable haplotype information.
Inference now requires feature allocation models beyond the binary IBP. We introduce a categorical extension of the IBP. Finally, in a third approach we replace the IBP by a prior based on a stylized model of a phylogenetic tree of cell subpopulations.
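The Indian buffet process prior used in the first approach has a simple generative description: sample i takes each existing feature k with probability m_k/i (where m_k is the number of earlier samples holding it) and then introduces Poisson(alpha/i) brand-new features. A minimal simulation of the generic binary IBP (not the categorical extension proposed in the talk):

```python
import numpy as np

def simulate_ibp(n_samples, alpha, rng):
    """Draw a binary feature-allocation matrix Z from the Indian buffet
    process prior."""
    features = []  # features[k] = list of sample indices holding feature k
    for i in range(1, n_samples + 1):
        # Take each existing feature with probability m_k / i.
        for holders in features:
            if rng.random() < len(holders) / i:
                holders.append(i)
        # Introduce Poisson(alpha / i) new features.
        for _ in range(rng.poisson(alpha / i)):
            features.append([i])
    Z = np.zeros((n_samples, len(features)), dtype=int)
    for k, holders in enumerate(features):
        Z[np.array(holders) - 1, k] = 1
    return Z

rng = np.random.default_rng(7)
Z = simulate_ibp(20, alpha=3.0, rng=rng)
```

Each row of Z is a sample, each column a latent cell subpopulation; the number of columns is unbounded a priori, which is what lets the model learn the number of cell types from the data.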

December 7, 2017 - David Draper, UC Santa Cruz

Thursday, December 7, 2017
3:30 – 4:30 PM
Kresge 202A

David Draper
Professor, Department of Applied Mathematics and Statistics
Jack Baskin School of Engineering
University of California, Santa Cruz

Optimal Bayesian Analysis of A/B Tests (Randomized Controlled Trials) in Data Science at Big-Data Scale

When coping with uncertainty, You typically begin with a problem P = (Q, C), in which Q lists the questions of principal interest to be answered and C summarizes the real-world context in which those questions arise. Q and C together define (θ, D, B), in which θ (which may be infinite-dimensional) is the unknown of principal interest, D summarizes Your data resources for decreasing Your uncertainty about θ, and B is a finite (ideally exhaustive) set of (true/false) propositions summarizing, and all rendered true by, context C.

The Bayesian paradigm provides one way to arrive at logically-internally-consistent inferences about θ, predictions for new data D∗, and decisions under uncertainty (either through Bayesian decision theory or Bayesian game theory). In this paradigm it’s necessary to build a stochastic model M that relates knowns (D and B) to unknowns (θ). For inference and prediction, on which I focus in this talk, the model takes the form M = {p(θ | B), p(D | θ, B)}, in which the (prior) distribution p(θ | B) quantifies Your information about θ external to Your dataset D and the (sampling) distribution p(D | θ, B) quantifies Your information about θ internal to D, when converted to Your likelihood distribution l(θ | D, B) ∝ p(D | θ, B).

The fundamental problem of applied statistics is that the mapping from P to M is often not unique: You have basic uncertainty about θ, but often You also have model uncertainty about how to specify Your uncertainty about θ. It turns out, however, that there are situations in which problem context C implies a unique choice of p(θ | B) and/or l(θ | D, B); let’s agree to say that optimal Bayesian model specification has occurred in such situations, leading to optimal Bayesian analysis via Bayes’s Theorem and its corollaries.

In this talk I’ll identify an important class of problems arising in the analysis of randomized controlled trials (typically called A/B tests in Data Science) in which optimal Bayesian analysis is made possible by the use of Bayesian non-parametric modeling, and I’ll illustrate this type of analysis with an A/B test at Big-Data scale (involving about 22 million observations). Along the way I’ll demonstrate that the frequentist bootstrap is actually a Bayesian non-parametric method in disguise, and I’ll discuss how large-scale observational studies may admit similar analyses via conditional exchangeability assumptions (although such analyses will typically not be optimal in the sense defined here, because such exchangeability assumptions are typically not uniquely specified by problem context).
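The connection between the frequentist bootstrap and Bayesian non-parametrics can be illustrated with a small simulation (a sketch, not the speaker's analysis; the dataset and sample size are made up): the Bayesian bootstrap (Rubin, 1981) replaces with-replacement resampling by Dirichlet(1,…,1) weights on the observed data points, and the two resulting distributions of the sample mean are nearly indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)  # stand-in for A/B outcome data
B = 4000

# Frequentist bootstrap: resample with replacement, recompute the mean.
freq_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(B)])

# Bayesian bootstrap (Rubin 1981): draw Dirichlet(1,...,1) weights over the
# observed points and form the weighted mean -- a posterior draw under a
# non-parametric model that puts all its mass on the observed values.
weights = rng.dirichlet(np.ones(data.size), size=B)
bayes_means = weights @ data

# The two distributions of the mean have essentially the same center and spread.
print(np.std(freq_means), np.std(bayes_means))
```

Both spreads approximate the usual standard error of the mean, which is why the frequentist bootstrap can be read as an approximate Bayesian non-parametric posterior.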

November 16, 2017 - Joe Koopmeiners, University of Minnesota

Thursday, November 16, 2017
3:30 – 4:30 PM
Kresge 202A

Joe Koopmeiners
Associate Professor, Division of Biostatistics
University of Minnesota

A Multi-source Adaptive Platform Design for Emerging Infectious Diseases

Emerging infectious diseases challenge traditional paradigms for clinical translation of therapeutic interventions. The Ebola virus disease outbreak in West Africa was a recent example that heralded the need for alternative designs flexible enough to compare multiple potentially effective treatment regimes in a context with high mortality and limited available treatment options. The PREVAIL II master protocol was designed to address these concerns by sequentially evaluating multiple treatments in a single trial and incorporating aggressive interim monitoring with the purpose of identifying efficacious treatments as soon as possible. One shortcoming, however, is that supplemental information from controls in previous trial segments was not utilized. In this talk we address this limitation by proposing an adaptive design methodology that facilitates “information sharing” across potentially non-exchangeable segments using multi-source exchangeability models (MEMs). The design uses multi-source adaptive randomization to target information balance within a trial segment in relation to posterior effective sample size. When compared to the standard platform design, we demonstrate that MEMs with adaptive randomization can improve power with limited type-I error inflation. Further, the adaptive platform achieves greater balance in the distribution of acquired information among study arms, with more patients randomized to experimental regimens, which, when effective, yields reductions in the overall mortality rate for trial participants.

October 19, 2017 - LAGAKOS AWARD - Nicholas Horton, Amherst College

Thursday, October 19, 2017
FXB Building, Room G12
Award & Lecture: 3:30-4:30pm
Coffee & Tea served at 3:00pm
Reception to follow at 4:30pm in the FXB Atrium

Nicholas Horton
Professor of Statistics at Amherst College

Multivariate thinking and the introductory biostatistics course: preparing students to make sense of a world of observational data

We live in a world of ever-expanding found (or what we might call observational) data. To make decisions and disentangle complex relationships in such a world, students need a background in design and confounding. The GAISE College Report enunciated the importance of multivariate thinking as a way to move beyond bivariate thinking. But how do such learning outcomes compete with other aspects of statistics knowledge (e.g., inference and p-values) in introductory courses that are already overfull? In this talk I will offer some reflections and guidance about how we might move forward, with specific implications for introductory biostatistics courses.

October 2, 2017 - JOINT w. STATS - Rebecca Steorts, Duke University

Monday, October 2, 2017
4:15 PM
Science Center, Hall A
Cambridge, MA

Rebecca C. Steorts
Department of Statistical Science, affiliated faculty in Computer Science, Biostatistics and Bioinformatics, the information initiative at Duke, and the Social Science Research Institute

Entity Resolution with Societal Impacts in Statistical Machine Learning

Very often information about social entities is scattered across multiple databases. Combining that information into one database can result in enormous benefits for analysis, resulting in richer and more reliable conclusions. Questions that have been, and can be, addressed by combining information include: How accurate are census enumerations for minority groups? How many of the elderly are at high risk for sepsis in different parts of the country? How many people were victims of war crimes in recent conflicts in Syria? In most practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because they are not a part of some databases or are not available due to privacy concerns. In such cases, analysts need to use methods from statistical and computational science known as entity resolution (record linkage or de-duplication) to proceed with analysis. Entity resolution is not only a crucial task for social science and industrial applications, but is a challenging statistical and computational problem itself. In this talk, we describe the past and present challenges with entity resolution, with applications to the Syrian conflict but also to official statistics and the food and music industries. This work, a joint collaboration with researchers at Rice University and the Human Rights Data Analysis Group (HRDAG), touches on interdisciplinary research that is crucial to problems with societal impacts at the forefront of both national and international news.

September 28, 2017 - MYRTO AWARD - Ciprian Crainiceanu, Johns Hopkins

Thursday, September 28, 2017
FXB Building, Room G12
Award & Lecture: 3:30-4:30pm
Coffee & Tea served at 3:00pm
Reception to follow at 4:30pm in the FXB Atrium

Ciprian Crainiceanu
Professor, Department of Biostatistics
Johns Hopkins University

Biostatistical Methods for Wearable and Implantable Technology

Wearable and Implantable Technology (WIT) is rapidly changing the biostatistical data-analytic landscape due to its reduced bias and measurement error, as well as the sheer size and complexity of its signals. In this talk I will review some of the most used and useful sensors in the health sciences and the ever-expanding WIT analytic environment. I will describe the use of WIT sensors, including accelerometers, heart monitors, and glucose monitors, and their combination with ecological momentary assessment (EMA). This rapidly expanding data ecosystem is characterized by multivariate, densely sampled time series with complex and highly non-stationary structures. I will introduce an array of scientific problems that can be answered using WIT, and I will describe methods designed to analyze WIT data from the micro- (sub-second-level) to the macro-scale (minute-, hour-, or day-level).

April 5, 2017 - Enrique Schisterman, NIH

Wednesday April 5, 2017
3:00 – 4:00 PM
Building 2, Room 426

Enrique F. Schisterman
Chief and Senior Investigator
Epidemiology Branch

Pooling Biomarkers as a Cost-Effective Design

Evaluating biomarkers in epidemiological studies can be expensive and time consuming. Many investigators use techniques such as random sampling or pooling biospecimens in order to cut costs and save time on experiments. Commonly, analyses based on pooled data are strongly restricted by distributional assumptions that are challenging to validate because of the pooled biospecimens. Random sampling provides data that can be easily analyzed. However, random sampling methods are not optimal cost-efficient designs for estimating means. We propose and examine a cost-efficient hybrid design that involves taking a sample of both pooled and unpooled data in an optimal proportion in order to efficiently estimate the unknown parameters of the biomarker distribution.
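The intuition behind the hybrid design can be sketched in a toy simulation (all numbers hypothetical, and this is an illustration of the idea rather than the authors' estimator): a pooled assay measures the average of g specimens, so a single pooled measurement has variance σ²/g, and a hybrid sample of pooled and unpooled assays can be combined by inverse-variance weighting to estimate the biomarker mean at lower assay cost.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 5.0, 1.5   # true biomarker mean and SD (hypothetical)
g = 4                  # specimens physically combined per pooled assay

# Hypothetical hybrid sample: 50 assays on individual specimens plus
# 50 assays, each run on a pool of g specimens. A pooled assay measures
# the average of its g specimens, hence the sqrt(g) variance reduction.
individual = rng.normal(mu, sigma, size=50)
pooled = rng.normal(mu, sigma / np.sqrt(g), size=50)

# Inverse-variance-weighted combination of the two unbiased sample means.
w_ind = 50 / sigma**2          # information in the unpooled assays
w_pool = 50 * g / sigma**2     # information in the pooled assays
mu_hat = (w_ind * individual.mean() + w_pool * pooled.mean()) / (w_ind + w_pool)
print(mu_hat)
```

Here 100 assays carry the information of 250 individual measurements for the mean; the paper goes further by choosing the pooled/unpooled proportion optimally and by retaining unpooled observations to check the distributional assumptions that pooling alone cannot validate.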

March 23, 2017 - Amy Herring, UNC

Thursday, March 23, 2017
3:00 – 4:00 PM
Building 2, Room 426

Amy Herring
Associate Chair
Department of Biostatistics
UNC Gillings School of Global Public Health

Bayesian models for multivariate dynamic survey data

Modeling and computation for multivariate longitudinal data have proven challenging, particularly when data contain discrete measurements of different types.  Motivated by data on the fluidity of sexuality from adolescence to adulthood, we propose a novel nonparametric approach for mixed-scale longitudinal data.  The proposed approach relies on an underlying variable mixture model with time-varying latent factors.  We use our approach to address hypotheses in The National Longitudinal Study of Adolescent to Adult Health, which selected participants via stratified random sampling, leading to discrepancies between the sample and population that are further compounded by missing data.  Survey weights have been constructed to address these issues, but including them in complex statistical models is challenging.  Bias arising from the survey design and nonresponse is addressed.  The approach is assessed via simulation and used to answer questions about the associations among sexual orientation identity, behaviors, and attraction in the transition from adolescence to young adulthood.

February 9, 2017 - Jennifer Hill, NYU

Thursday, February 9, 2017
12:30-1:30 PM
Building 2, Room 426

Jennifer L. Hill
Professor of Applied Statistics & Data Science
New York University

Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition

Statisticians have made great strides towards assumption-free estimation of causal estimands in the past few decades. However, this explosion in research has resulted in a breadth of inferential strategies that both create opportunities for more reliable inference and complicate the choices that an applied researcher has to make and defend. Relatedly, researchers advocating for new methods typically compare their method to (at best) two or three other causal inference strategies and test using simulations that may or may not be designed to equally tease out flaws in all the competing methods. The causal inference data analysis challenge, “Is Your SATT Where It’s At?”, launched as part of the 2016 Atlantic Causal Inference Conference, sought to make progress on both of these issues. The researchers creating the data testing grounds were distinct from the researchers submitting the methods to be evaluated. Results from over 30 competitors in the two parallel versions of the competition (Black Box Algorithms and Do It Yourself Analyses) are presented, along with post-hoc analyses that reveal which characteristics of causal inference strategies and settings affect performance. The most consistent conclusion was that the automated (black box) methods performed better overall than the user-controlled methods across scenarios.