Colloquium Seminar Series

Upcoming Colloquium


April 18th - Kara Rudolph (Columbia – NYC)

Speakers will share their own perspectives; they do not speak for Harvard.


Upcoming Colloquia:

April 18th, 2024
Kara Rudolph, PhD, MHS
Assistant Professor of Epidemiology
Columbia University
4:00 – 5:00 PM
FXB-301

Improving Efficiency in Transporting Average Treatment Effects

We develop flexible, semiparametric estimators of the average treatment effect (ATE) transported to a new population (“target population”) that offer potential efficiency gains. First, we propose two one-step semiparametric estimators that incorporate knowledge of which covariates are effect modifiers and which are both effect modifiers and differentially distributed between the source and target populations. These estimators can be used even when not all covariates are observed in the target population; one requires that only effect modifiers are observed, and the other requires that only those modifiers that are also differentially distributed are observed. Second, we propose a collaborative one-step estimator when researchers do not have knowledge about which covariates are effect modifiers and which differ in distribution between the populations, but require all covariates to be measured in the target population. We use simulation to compare finite sample performance across our proposed estimators and existing estimators of the transported ATE, including in the presence of practical violations of the positivity assumption. Lastly, we apply our proposed estimators to a large-scale housing trial.
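
For readers who want a concrete reference point, the simplest transported-ATE estimator reweights source-trial subjects by the inverse odds of being in the source population. The sketch below illustrates only that basic weighting idea; it is not the one-step or collaborative semiparametric estimators proposed in the talk, and the function name and array inputs are hypothetical placeholders.

```python
# Minimal sketch (not the talk's estimators): transport an ATE from a source
# trial (S = 1) to a target population (S = 0) via inverse-odds weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression

def transported_ate(X_src, A_src, Y_src, X_tgt):
    """Weight source subjects by P(S=0|X)/P(S=1|X) times inverse treatment probability."""
    X = np.vstack([X_src, X_tgt])
    S = np.concatenate([np.ones(len(X_src), dtype=int), np.zeros(len(X_tgt), dtype=int)])

    # Model membership in the source vs. target population given covariates.
    sel = LogisticRegression(max_iter=1000).fit(X, S)
    p_src = sel.predict_proba(X_src)[:, 1]          # P(S = 1 | X)
    odds = (1 - p_src) / p_src                      # P(S = 0 | X) / P(S = 1 | X)

    # Model treatment assignment in the source trial (known by design in an RCT).
    ps = LogisticRegression(max_iter=1000).fit(X_src, A_src).predict_proba(X_src)[:, 1]

    w1 = odds * A_src / ps                          # weights for treated source subjects
    w0 = odds * (1 - A_src) / (1 - ps)              # weights for control source subjects
    return np.sum(w1 * Y_src) / np.sum(w1) - np.sum(w0 * Y_src) / np.sum(w0)
```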

Colloquium Archive

April 11th, 2024 - Yu Shen

April 11th, 2024
Yu Shen
Professor & Chair ad interim
Department of Biostatistics
The University of Texas MD Anderson Cancer Center
4:00 – 5:00 PM
FXB-301

Data Integration in Statistical Inference and Risk Prediction

In comparative effectiveness research and risk prediction for rare types of cancer, it is desirable to combine multiple sources of data, e.g., the primary cohort data together with aggregate information derived from cancer registry databases. Such integration of data may improve statistical efficiency and accuracy of risk prediction, but it also poses statistical challenges owing to incomparability between different sources of data. We develop adaptive estimation procedures, which use the combined information to determine the degree of information borrowing from the aggregate data of the external resource. We apply the proposed methods to evaluate the long-term effect of several commonly used treatments for inflammatory breast cancer by tumor subtypes, combining the inflammatory breast cancer patient cohort at MD Anderson with external data.

March 28th, 2024 - Jose Zubizarreta

March 28th, 2024
Jose Zubizarreta, PhD
Professor, Department of Health Care Policy, Harvard Medical School; Professor, Department of Biostatistics, Harvard School of Public Health; Faculty Affiliate, Department of Statistics, Harvard University
4:00 – 5:00 PM
FXB-301

Anatomy of Event Studies: Hypothetical Experiments, Exact Decomposition, and Robust Estimation

In recent decades, event studies have emerged as a leading methodology in health and social research for evaluating the causal effects of staggered interventions. In this paper, we analyze event studies from the perspective of experimental design and provide a novel characterization of the classical dynamic two-way fixed effects (TWFE) regression estimator for event studies. Our decomposition is expressed in closed form and reveals in finite samples the hypothetical experiment that TWFE regression adjustments approximate. This decomposition offers insights into how standard regression estimators use information from various units and time points, clarifying and generalizing the notion of forbidden comparison noted in the literature in simpler settings. We propose a robust weighting approach for estimation in event studies, which allows investigators to progressively build larger valid weighted contrasts by leveraging, in a sequential manner, increasingly stronger assumptions on the potential outcomes and the assignment mechanism. This weighting approach is adaptable to a generally defined estimand and allows for generalization. We provide weighting diagnostics and visualization tools. We illustrate these methods in a case study of the impact of divorce reforms on female suicide.

March 21st, 2024 - Stephen Bates

March 21st, 2024
Stephen Bates, PhD
Assistant Professor in AI and Decision-making
MIT EECS
4:00 – 5:00 PM
FXB-301


Hypothesis Testing with Information Asymmetry

Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a regulator such as the FDA) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent’s strategic behavior — their choice to run a trial or not. In particular, we show how the principal can design a policy to elicit partial information about the agent’s private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be slightly less than the cost of the trial divided by the firm’s profit if the trial is successful.
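
The closing guideline amounts to a one-line calculation. A hedged illustration with invented numbers (not taken from the talk):

```python
# Hypothetical illustration of the stated guideline: set the type-I error level
# slightly below (trial cost) / (profit if the trial succeeds). Numbers are invented.
trial_cost = 5e6            # e.g., $5 million to run the trial
profit_if_success = 2e8     # e.g., $200 million profit if approved

alpha_upper_bound = trial_cost / profit_if_success
print(f"Choose a significance threshold slightly below {alpha_upper_bound:.3f}")  # 0.025 here
```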

March 7th, 2024 - Boris Landa

March 7th
Boris Landa
Associate Research Scientist, Applied Mathematics
Yale University
4:00 – 5:00 PM
FXB-301

Standardizing Noise Spectra for Signal Detection and Recovery in Heteroskedastic Environments

Detecting and recovering a low-rank signal in a noisy data matrix is a fundamental task in data analysis. Typically, this task is addressed by inspecting and manipulating the spectrum of the observed data, often by thresholding singular values at a certain critical level. While this approach is well-established for homoskedastic noise whose variance is identical across entries, numerous real-world applications, such as single-cell RNA sequencing (scRNA-seq), exhibit heteroskedastic noise. In this case, noise characteristics may vary considerably across rows and columns, posing a challenge for signal detection and recovery.

In this talk, I will present a principled approach for standardizing heteroskedastic noise levels by judiciously scaling the rows and columns. This standardization aims to enforce the standard spectral behavior of homoskedastic noise – the celebrated Marchenko-Pastur law – allowing for straightforward detection and recovery of low-rank signals. I will discuss two methods for accurately determining the correct scaling from the data: one tailored for count data (e.g., Poisson, negative binomial) with a general variance pattern, and another for generic data with a more restrictive variance pattern. I will demonstrate the effectiveness of these methods through simulations and real scRNA-seq data, highlighting their benefits for signal detection and recovery and showcasing excellent fits to the MP law post-normalization.
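
As a toy illustration of the general idea (not the talk's estimators, which handle count data and more general variance patterns), the sketch below simulates noise with a separable variance pattern, standardizes it by row and column scaling, and flags singular values above the approximate Marchenko-Pastur bulk edge for unit-variance noise; all sizes, scales, and the signal strength are arbitrary choices.

```python
# Toy illustration (not the talk's method): standardize noise with a separable
# variance pattern by row/column scaling, then flag singular values that exceed the
# approximate Marchenko-Pastur bulk edge sqrt(n) + sqrt(p) for unit-variance noise.
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 500
row_var = rng.uniform(0.5, 5.0, size=n)              # heteroskedastic row variances
col_var = rng.uniform(0.5, 5.0, size=p)              # heteroskedastic column variances
noise = rng.normal(size=(n, p)) * np.sqrt(np.outer(row_var, col_var))
u, v = rng.normal(size=n), rng.normal(size=p)
X = noise + 0.3 * np.outer(u, v)                     # weak rank-1 signal plus noise

# Standardize: divide rows, then columns, by their root-mean-square entries.
Y = X / np.sqrt((X**2).mean(axis=1, keepdims=True))
Y = Y / np.sqrt((Y**2).mean(axis=0, keepdims=True))

edge = np.sqrt(n) + np.sqrt(p)                       # MP bulk edge for unit-variance noise
svals = np.linalg.svd(Y, compute_uv=False)
print("singular values above 1.05 * bulk edge:", int(np.sum(svals > 1.05 * edge)))
```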

February 29th, 2024 - Layla Parast Bartroff

February 29th, 2024
Layla Parast Bartroff
Associate Professor, University of Texas at Austin
Department of Statistics and Data Sciences
4:00 – 5:00 PM
FXB-301

Robust Methods for Surrogate Marker Evaluation

For many clinical outcomes, randomized clinical trials to evaluate the effectiveness of a treatment or intervention require long-term follow-up of participants. In such settings, there is substantial interest in identifying and using surrogate markers – measurements or outcomes measured at an earlier time or with less cost that are predictive of the primary clinical outcome of interest – to evaluate the treatment effect. Several statistical methods have been proposed to evaluate potential surrogate markers including parametric and nonparametric methods to estimate the proportion of treatment effect explained by the surrogate, methods within a principal stratification framework, and methods for a meta-analytic setting i.e., where information from multiple trials is available. While useful, these methods generally do not address potential heterogeneity in the utility of the surrogate marker. In addition, available methods do not perform well when the sample size is small. In this talk, I will discuss various robust methods for surrogate marker evaluation including methods to examine and test for heterogeneity, and methods developed for the small sample setting. These methods will be illustrated using data from an AIDS clinical trial and a small pediatric trial among children with nonalcoholic fatty liver disease.

February 22nd, 2024 - Mei Sheng Duh

February 22nd
Mei Sheng Duh
Chief Epidemiologist at Analysis Group
FXB-301
4:00 – 5:00 PM

Statistical Methods for FDA-Mandated Vaccine Safety Surveillance: Case Study of Pfizer-BioNTech COVID-19 Vaccine

As a post-marketing requirement for the Emergency Use Authorization of the Pfizer-BioNTech COVID-19 vaccine in December 2020, Pfizer collaborated with the US Veterans Health Administration (VHA) and Analysis Group to initiate a long-term safety surveillance study that assessed whether the VHA population experienced an increased risk of safety events after receiving the Pfizer-BioNTech COVID-19 vaccine. An interim analysis using VHA’s electronic medical record data from 12/11/2020 to 8/31/2022 is reported here. Signal detection, signal evaluation, and signal verification analyses following Pfizer-BioNTech COVID-19 vaccine doses 1-4 were conducted for 48 pre-specified safety events in addition to a prioritized analysis of myocarditis/pericarditis.

The study population includes 1,649,677 VHA enrollees with a mean age of 64.3 years who received ≥ 1 dose of the Pfizer-BioNTech COVID-19 vaccine. During the signal detection phase, binomial-based MaxSPRT with a self-controlled risk interval design and Poisson-based MaxSPRT with a historical seasonal influenza vaccine active comparator design were used. Potential signals were detected if the log-likelihood ratio exceeded the pre-specified critical value. During the signal evaluation phase, multivariate Poisson regression with an active comparator design and conditional Poisson regression with a self-controlled case series design were applied, with a pre-specified signal definition of effect estimate >3 and p-value <0.01. The signal verification phase consisted of chart review for adverse event adjudication.
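
For orientation, the binomial MaxSPRT statistic in a self-controlled risk-interval design is a one-sided binomial log-likelihood ratio compared against a pre-computed critical value. The sketch below is a generic illustration with made-up window lengths, counts, and a placeholder critical value; it is not the study's analysis code, and in practice the critical value comes from MaxSPRT tables or sequential-analysis software.

```python
# Hedged sketch of the binomial MaxSPRT statistic used in self-controlled
# risk-interval designs. Window lengths, counts, and the critical value below
# are hypothetical placeholders.
import math

def binomial_maxsprt_llr(c, n, p0):
    """One-sided LLR for c risk-window events out of n, with null probability p0."""
    if n == 0 or c / n <= p0:
        return 0.0
    llr = c * math.log(c / (n * p0))
    if c < n:
        llr += (n - c) * math.log((n - c) / (n * (1 - p0)))
    return llr

# Example: 21-day risk window vs. 21-day control window -> p0 = 0.5 under the null.
p0 = 21 / (21 + 21)
llr = binomial_maxsprt_llr(c=34, n=50, p0=p0)
critical_value = 3.0   # placeholder; taken from MaxSPRT tables/software in practice
print(f"LLR = {llr:.2f}; signal flagged: {llr > critical_value}")
```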

Overall, no adverse safety signals were detected based on pre-established criteria. Real-world long-term active safety surveillance is an essential part of pandemic response to monitor any potential safety risks associated with vaccination and to address the limitations of clinical trials due to their smaller sample sizes and shorter observation durations.

February 8th, 2024 - Rafael Irizarry

February 8th
Rafael Irizarry
Professor and Chair of the Department of Data Science at Dana-Farber Cancer Institute; Professor of Applied Statistics at Harvard
FXB-301
4:00 – 5:00 PM

25 Years of Data Science: Music, Genomics, and Public Health Surveillance

In this talk I will describe three examples of Statistics in action, from different application areas. The first example relates to the analysis of musical sound signals. I will describe how locally harmonic models can be used to provide meaningful parameters useful for manipulation of sounds. The second example relates to how biological discoveries can be enabled by genomics technology, exploratory data analysis, and statistical reasoning. Specifically, I will describe how statistical thinking led to a major improvement in the measurements produced by a technology for measuring gene expression. Finally, I will describe how we estimated excess mortality after hurricane María and how this motivated collaborations during the pandemic. I will describe how we built a system to gather and share SARS-CoV-2 testing, COVID-19 hospitalization and death, and vaccination data. Throughout the talk I will highlight both the statistical insights and the important considerations that fall outside the realm of the current scope of our discipline, including data acquisition, data wrangling, software development, and collaborating with the press.

 

November 30th, 2023 - Lucas Janson


November 30th
Lucas Janson
Harvard Statistics
Kresge G3
4:00 – 5:00 PM

Controlled Discovery and Localization of Signals via Bayesian Linear Programming

Abstract: Scientists often must simultaneously localize and discover signals. For instance, in genetic fine-mapping, high correlations between nearby genetic variants make it hard to identify the exact locations of causal variants. So the statistical task is to output as many disjoint regions containing a signal as possible, each as small as possible, while controlling false positives. Similar problems arise in any application where signals cannot be perfectly localized, such as locating stars in astronomical surveys and changepoint detection in sequential data. Common Bayesian approaches to these problems involve computing a posterior distribution over signal locations. However, existing procedures to translate these posteriors into actual credible regions for the signals fail to capture all the information in the posterior, leading to lower power and (sometimes) inflated false discoveries. With this motivation, we introduce Bayesian Linear Programming (BLiP). Given a posterior distribution over signals, BLiP outputs credible regions for signals which verifiably nearly maximize expected power while controlling false positives, overcoming an extremely high-dimensional and nonconvex optimization problem to do so. BLiP is very computationally efficient compared to the cost of computing the posterior and can wrap around nearly any Bayesian model and algorithm. Applying BLiP to existing state-of-the-art analyses of UK Biobank data (for genetic fine-mapping) and the Sloan Digital Sky Survey (for astronomical point source detection) increased power by 30-120% in just a few minutes of additional computation. BLiP is implemented in pyblip (Python) and blipr (R). This is joint work with Asher Spector (Stanford).


November 16th, 2023 - Tianxi Cai

November 16th
Tianxi Cai
Harvard Chan
Kresge G3
4:00 – 5:00 pm

Crowdsourcing with Multi-institutional EHR to Improve Reliability of Real World Evidence – Opportunities and Challenges

The wide adoption of electronic health records (EHR) systems has led to the availability of large clinical datasets for discovery research. EHR data, linked with bio-repositories, are a valuable new source for deriving real-world, data-driven prediction models of disease risk and progression. Yet they also bring analytical difficulties, especially when aiming to leverage multi-institutional EHR data. Synthesizing information across healthcare systems is challenging due to heterogeneity and privacy. Statistical challenges also arise due to high dimensionality in the feature space. In this talk, I’ll discuss analytical approaches for mining EHR data to improve the reliability and generalizability of real world evidence generated from the analyses. These methods will be illustrated using EHR data from Mass General Brigham and the Veterans Health Administration.

November 9th, 2023 - Elizabeth (Betz) Halloran

November 9th
Elizabeth (Betz) Halloran
Fred Hutchinson Cancer Center
Kresge G3
4:00 – 5:00 PM

Estimating Population Effects of Pertussis Vaccination using Routinely Collected Data

Several scientific questions remain open concerning aspects of pertussis vaccination such as effects on transmission and duration of protection. Using data on King County pertussis disease surveillance and Washington State childhood immunization records, we estimated the direct, indirect, total, and overall effects of pertussis vaccination in a highly vaccinated population. We assessed the durability of vaccine effectiveness. We present our work in a historical context of previous examples of estimating various effects of pertussis vaccination on household transmission and indirect effects. We briefly compare these approaches to methods that use dynamic models or causal inference.

 

October 26th, 2023 - David Benkeser

October 26th
David Benkeser
Emory Rollins Biostatistics
FXB-301
4:00 – 5:00 PM

Exposure-Induced Confounding of Missingness in Cause of Failure with Applications in Estimating Strain-Specific Efficacy of Vaccines

A common goal of studies of preventive vaccines is to estimate whether and how their efficacy for preventing infection and/or disease varies by the strain of the infecting pathogen. This issue has come to the forefront of public interest over the past year as new strains of SARS-CoV-2 have emerged. One of the challenges in learning about strain-specific efficacy of vaccines from randomized trials is the fact that many breakthrough infections have missing sequence data. This missingness can bias estimation of strain-specific efficacy. For example, vaccines often cause less virulent forms of infection, leading to lower levels of viral genetic material in samples taken from breakthrough infections and, in turn, a higher probability of sequencing failure. On the other hand, some strains of pathogen may be naturally less virulent irrespective of host vaccination status. Thus, viral load may be an exposure-induced confounder in the context of estimation of strain-specific efficacy. In this talk, we will describe methodology to address this bias using flexible semiparametric methods for causal inference and illustrate these methods using examples from modern vaccine studies.

 

October 12th, 2023 - Caroline Uhler


October 12th 

Caroline Uhler
MIT EECS + IDSS / The Broad Institute
FXB-301
4:00 – 5:00 PM

Causal Representation Learning and Optimal Intervention Design

Massive data collection holds the promise of a better understanding of complex phenomena and, ultimately, of better decisions. Representation learning has become a key driver of deep learning applications, since it allows learning latent spaces that capture important properties of the data without requiring any supervised annotations. While representation learning has been hugely successful in predictive tasks, it can fail miserably in causal tasks including predicting the effect of an intervention. This calls for a marriage between representation learning and causal inference. An exciting opportunity in this regard stems from the growing availability of interventional data (drugs, knockouts, overexpression, etc.) in the biomedical sciences. However, these datasets are still minuscule compared to the action spaces of interest in these applications (e.g., interventions can take on continuous values like the dose of a drug or can be combinatorial as in combinatorial drug therapies). In this talk, we will present initial ideas towards building a statistical and computational framework for causal representation learning and discuss its applications to optimal intervention design in the context of drug design and single-cell biology.


October 5th, 2023 - Peter Gilbert


October 5th
Peter Gilbert
Fred Hutchinson Cancer Center
FXB-301
4:00 – 5:00 PM

Assessing Pathogen-Sequence Dependent Immune Correlates in Vaccine Efficacy Trials

For genetically diverse pathogens, generally a vaccine’s efficacy to prevent an infectious disease outcome depends on variation in both (1) biomarkers quantifying vaccine-elicited immune responses and (2) exposing pathogen sequences. Over the past 3 years, Coronavirus Prevention Network biostatisticians analyzed data from randomized, placebo-controlled COVID-19 vaccine efficacy trials using causal inference and survival analysis methods in the pursuit of modularly understanding (1) and (2).  With application to the international ENSEMBLE trial of Janssen’s Ad26 vector COVID-19 vaccine, this talk describes methods for joint assessment of (1) and (2).  Based on a mark-specific proportional hazards model (Sun, Qi, Heng, Gilbert, 2020, JRSS-C), the methods use nonparametric kernel smoothing in a quantitative SARS-CoV-2 amino acid sequence feature, augmented inverse probability weighting for missing biomarkers, and multiple imputation for missing pathogen sequences.  This talk aims to encourage statistical methods that jointly account for host and pathogen variation, toward improving biomarkers of vaccine efficacy and enhancing insights into how vaccines work.

 

September 28th, 2023 - Polina Golland

September 28th
Polina Golland
MIT EECS + CSAIL
FXB-301
4:00 – 5:00 PM

Learning to Read X-ray: Applications to Heart Failure Monitoring

We propose and demonstrate a novel approach to training image classification models based on large collections of images with limited labels. We take advantage of the availability of radiology reports to construct a joint multimodal embedding that serves as a basis for classification. We demonstrate the advantages of this approach in application to the assessment of pulmonary edema severity in congestive heart failure, which motivated the development of the method.

Polina Golland is a Henry Ellis Warren (1894) professor of Electrical Engineering and Computer Science at MIT and a principal investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). Her primary research interest is in developing novel techniques for medical image analysis and understanding. With her students, Polina has demonstrated novel approaches to image segmentation, shape analysis, functional image analysis and population studies. She has served as an associate editor of the IEEE Transactions on Medical Imaging and of the IEEE Transactions on Pattern Analysis. Polina is currently on the editorial board of the Journal of Medical Image Analysis. She is a Fellow of the International Society for Medical Image Computing and Computer Assisted Interventions (MICCAI) and of the American Institute for Medical and Biological Engineering (AIMBE).

 

April 13th, 2023 - Stijn Vansteelandt

April 13th
4:00 – 5:00 pm
Kresge G3

Stijn Vansteelandt, Ghent University (Belgium)
Professor in the Department of Applied Mathematics, Computer Science and Statistics

Assumption-lean (causal) Modeling

In 2001, Leo Breiman criticized the statistical community for its reliance on “data models” (Breiman, 2001). In this talk, I will revisit Breiman’s critiques in light of recent developments in algorithmic modeling, debiased machine learning, and targeted learning that have taken place over the past 2 decades, largely within the causal inference literature (Vansteelandt, 2021). These advancements resolve Breiman’s critiques and have brought significant progress in providing more useful answers to scientific questions while also targeting well-understood causal estimands. However, these techniques require in-depth training in causal inference and primarily focus on evaluating the effects of dichotomous exposures, sometimes leading to poor practice (such as dichotomization of a continuous exposure, purely to ‘fit the framework’) or reversion to the traditional modeling culture when faced with more complex settings. Additionally, translating causal questions into the effects of interventions may not always be feasible or of immediate interest in the more descriptive phases of research where no specific interventions are (yet) envisaged; and even when it is, one may find the data to lack information about the wanted effects (e.g., because the considered intervention is unlikely for some in current practice).

To address these concerns, I will present a conceptual framework on assumption-lean regression (Vansteelandt and Dukes, 2022) that resolves Breiman’s critiques and other typical regression concerns, while building on the debiased/targeted machine learning literature. The framework aims to be as broadly useful as standard regression methods.

A large part of this talk will be conceptual and aims to be widely accessible; parts of the talk will demonstrate in more detail how assumption-lean regression works in the context of generalised linear models (Vansteelandt and Dukes, 2022) and Cox proportional hazards models (Vansteelandt et al., 2022). I will also discuss ongoing extensions to flexible (non-parametric) assumption-lean modeling of the effects of a continuous exposure on an outcome.

References:
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science, 16(3), 199-231.
Vansteelandt, S. (2021). Statistical Modelling in the Age of Data Science. Observational Studies, 7(1), 217-228.
Vansteelandt, S. and Dukes, O. (2022). Assumption-lean inference for generalised linear model parameters (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(3), 657–685.
Vansteelandt, S., Dukes, O., Van Lancker, K., & Martinussen, T. (2022). Assumption-lean Cox regression. Journal of the American Statistical Association, 1-10.

March 23rd, 2023 - Lexin Li

March 23rd
4:00 – 5:00 pm
Kresge 200

Lexin Li
Professor of Biostatistics at the Department of Biostatistics and Epidemiology
University of California, Berkeley

Statistical Neuroimaging Analysis: An Overview

Understanding the inner workings of human brains, as well as their connections with neurological disorders, is one of the most intriguing scientific questions. Studies in neuroscience are greatly facilitated by a variety of neuroimaging technologies, including anatomical magnetic resonance imaging (MRI), functional magnetic resonance imaging (fMRI), electroencephalography (EEG), diffusion tensor imaging, positron emission tomography (PET), among many others. The size and complexity of medical imaging data, however, pose numerous challenges, and call for constant development of new statistical methods. In this talk, I give an overview of a range of neuroimaging topics our group has been investigating, including imaging tensor analysis, brain connectivity network analysis, multimodality analysis, and imaging causal analysis. I also illustrate with a number of specific case studies.

March 9th, 2023 - Susan Murphy

March 9th
4:00 – 5:00 pm
FXB-G13
Susan Murphy 

Mallinckrodt Professor of Statistics and of Computer Science,
Radcliffe Alumnae Professor at the Radcliffe Institute, Harvard University

Inference for Longitudinal Data After Adaptive Sampling

 

Adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, are increasingly used for the real-time personalization of interventions in digital applications like mobile health and education. As a result, there is a need to be able to use the resulting adaptively collected user data to address a variety of inferential questions, including questions about time-varying causal effects. However, current methods for statistical inference on such data (a) make strong assumptions regarding the environment dynamics, e.g., assume the longitudinal data follows a Markovian process, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes algorithms that learn to select actions using data collected from multiple users. These are major obstacles preventing the use of adaptive sampling algorithms more widely in practice. In this work, we develop statistical inference for the common Z-estimator based on adaptively sampled data. The inference (a) is valid even when observations are non-stationary and highly dependent over time, and (b) allows the online adaptive sampling algorithm to learn using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work in designing the Oralytics oral health clinical trial, in which an RL adaptive sampling algorithm will be used to select treatments, yet valid statistical inference is essential for conducting primary data analyses after the trial is over.

February 23rd, 2023 - Samuel Kou



February 23rd

4:00 – 5:00 pm
FXB-G13

Samuel Kou
Departments of Statistics and Biostatistics
Harvard University

Catalytic Prior Distributions for Bayesian Inference

The prior distribution is an essential part of Bayesian statistics, and yet in practice, it is often challenging to quantify existing knowledge into pragmatic prior distributions. In this talk we will discuss a general method for constructing prior distributions that stabilize the estimation of complex target models, especially when the sample sizes are too small for standard statistical analysis, which is a common situation encountered by practitioners with real data. The key idea of our method is to supplement the observed data with a relatively small amount of “synthetic” data generated, for example, from the predictive distribution of a simpler, stably estimated model. This general class of prior distributions, called “catalytic prior distributions,” is easy to use and allows direct statistical interpretation. In the numerical evaluations, the resulting posterior estimation using catalytic priors outperforms the maximum likelihood estimate from the target model and is generally superior to or comparable in performance to competitive existing methods. We will illustrate the usefulness of the catalytic prior approach through real examples and explore the connection between the catalytic prior approach and a few popular regularization methods.
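
One way to read the key idea: the posterior mode under a catalytic prior can be computed as a weighted maximum-likelihood fit on the observed data augmented with synthetic data of total weight tau. The sketch below illustrates that interpretation for logistic regression; the choice of simple model, tau, and the synthetic sample size M are illustrative assumptions, not recommendations from the talk.

```python
# Minimal sketch of the catalytic-prior idea for logistic regression, read as a
# weighted MLE on data augmented with synthetic observations of total weight tau.
# The simple model, tau, and M below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def catalytic_logistic_fit(X, y, tau=4.0, M=400, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Step 1: a simpler, stably estimated model -- here, intercept-only logistic.
    p_hat = y.mean()

    # Step 2: synthetic covariates resampled column-wise from the observed rows,
    # synthetic responses drawn from the simple model's predictive distribution.
    X_syn = np.column_stack([rng.choice(X[:, j], size=M) for j in range(p)])
    y_syn = rng.binomial(1, p_hat, size=M)

    # Step 3: weighted fit on the augmented data; synthetic rows get weight tau / M.
    X_aug = np.vstack([X, X_syn])
    y_aug = np.concatenate([y, y_syn])
    w_aug = np.concatenate([np.ones(n), np.full(M, tau / M)])
    model = LogisticRegression(C=1e8, max_iter=5000)  # large C ~ essentially unpenalized
    model.fit(X_aug, y_aug, sample_weight=w_aug)
    return model
```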

February 9th, 2023 - Fernanda Viegas & Martin Wattenberg

February 9th, 2023
4:00 – 5:00 pm
FXB-G13

Combined Colloquium with:

Fernanda Viegas
Sally Starling Seaver Professor at Harvard Radcliffe Institute
Gordon McKay Professor of Computer Science
Harvard John A. Paulson School of Engineering and Applied Sciences

Martin Wattenberg
Gordon McKay Professor of Computer Science

Beyond graphs and charts: harnessing the power of data visualization

While most of us are familiar with simple graphs such as bar charts and line charts, data visualization is a broad and expressive medium. In this presentation we’ll touch on some of the visual techniques for powerful exploratory data analysis that look at different kinds of rich data such as text and images. We’ll also discuss storytelling with data and the key differences between communication and exploration with data visualization.

January 26th, 2023 - Tamara Broderick, MIT

January 26th, 2023
4:00 – 5:00 pm
Kresge G2

Tamara Broderick, PhD
Associate Professor
Machine Learning and Statistics
MIT

An Automatic Finite-Sample Robustness Metric: Can Dropping a Little Data Change Conclusions?

One hopes that data analyses will be used to make beneficial decisions regarding people’s health, finances, and well-being. But the data fed to an analysis may systematically differ from the data where these decisions are ultimately applied. For instance, suppose we analyze data in one country and conclude that microcredit is effective at alleviating poverty; based on this analysis, we decide to distribute microcredit in other locations and in future years. We might then ask: can we trust our conclusion to apply under new conditions? If we found that a very small percentage of the original data was instrumental in determining the original conclusion, we might expect the conclusion to be unstable under new conditions. So we propose a method to assess the sensitivity of data analyses to the removal of a very small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide an approximation. We call our resulting method the Approximate Maximum Influence Perturbation. Our approximation is automatically computable, theoretically supported, and works for common estimators. We show that any non-robustness our metric finds is conclusive. Empirics demonstrate that while some applications are robust, in others the sign of a treatment effect can be changed by dropping less than 0.1% of the data even in simple models and even when standard errors are small.
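
To give a rough sense of how such an approximation can work for ordinary least squares: the effect of dropping a set of observations on a coefficient is approximated by the sum of their influence scores, and the most influential small set is found by sorting. The sketch below is an illustrative reimplementation of that idea for a single OLS coefficient, not the authors' released software.

```python
# Rough numpy sketch of the "drop a small fraction of data" sensitivity check for
# one OLS coefficient, using the linear (influence-function) approximation in the
# spirit of the Approximate Maximum Influence Perturbation. Illustrative only.
import numpy as np

def smallest_sign_flipping_fraction(X, y, j):
    """Approximate the smallest fraction of rows whose removal flips the sign of
    OLS coefficient j, via the influence-function (linear) approximation."""
    n = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta

    # Dropping observation i changes beta[j] by approximately -psi[i].
    psi = (XtX_inv @ (X * resid[:, None]).T)[j]
    t = np.sign(beta[j]) * psi                 # contribution of each row toward a sign flip
    cum = np.cumsum(np.sort(t)[::-1])          # drop the most flip-inducing rows first
    hit = np.where(cum > abs(beta[j]))[0]
    return (hit[0] + 1) / n if hit.size else None   # fraction needed, or None if no flip
```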

April 21st, 2022 - Ting Ye, University of Washington

Ting Ye, Ph.D.
Assistant Professor
Department of Biostatistics
University of Washington

Robust Mendelian Randomization in the Presence of Many Weak Instruments and Widespread Horizontal Pleiotropy

Mendelian randomization (MR) has become a popular approach to study the effect of a modifiable exposure on an outcome by using genetic variants as instrumental variables (IVs). Two distinct challenges persist in MR: (i) each genetic variant explains a relatively small proportion of variance in the exposure and there are many such variants, a setting known as many weak IVs; and (ii) many genetic variants may have direct effects on the outcome not through the exposure, or in genetic terms, when there exists widespread horizontal pleiotropy. To address these two challenges simultaneously, we propose two novel estimators, the debiased inverse-variance weighted (dIVW) estimator for summary-data MR and the GENIUS-MAWII estimator for individual-data MR, and we establish their statistical properties. We conclude by demonstrating these two methods in simulated and real datasets.
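
For context, the classical IVW estimate divides the weighted sum of products of variant-outcome and variant-exposure associations by the weighted sum of squared exposure associations; the debiasing idea is to subtract the exposure-association variance from that squared term in the denominator. The sketch below shows that contrast for point estimation only, with summary-statistic arrays as hypothetical inputs; it omits the paper's variance formulas and the GENIUS-MAWII estimator entirely.

```python
# Minimal sketch of IVW vs. debiased IVW (dIVW) point estimates from two-sample
# summary statistics. Inputs: per-variant exposure associations (gamma_hat, se_x)
# and outcome associations (Gamma_hat, se_y). A sketch of the debiasing idea only.
import numpy as np

def ivw_and_divw(gamma_hat, se_x, Gamma_hat, se_y):
    w = 1.0 / se_y**2
    num = np.sum(w * gamma_hat * Gamma_hat)
    beta_ivw = num / np.sum(w * gamma_hat**2)
    beta_divw = num / np.sum(w * (gamma_hat**2 - se_x**2))  # debiased denominator
    return beta_ivw, beta_divw
```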

Short bio: Ting Ye is an Assistant Professor in the Department of Biostatistics at the University of Washington. Her research interests focus on developing pragmatic and robust statistical methods for causal inference in biomedical and social sciences. Most of her research has been about addressing complications in clinical trials and hidden biases in observational studies.

March 24th, 2022 - James Scott, University of Texas at Austin

March 24th, 2022 - James Scott, PhD
Professor of Statistics and Data Science
Fayez Sarofim & Co. Centennial Professor in Business
University of Texas at Austin

BART and Its Variations: Three Applications in Obstetrics

In this talk, I will describe some of my group’s ongoing work to address statistical challenges in functional regression, with specific application to obstetrics, and with the overall public-health goal of reducing complications of pregnancy. I will first describe a general Bayesian approach for nonparametric regression, called “BART with targeted smoothing” (or tsBART). TsBART is based on a very popular nonparametric regression framework called Bayesian Additive Regression Trees (BART), but modified in some crucial ways to address what our medical collaborators have identified as some of the most common data-science motifs in obstetric research. I will then describe how we’ve adapted this framework in a variety of directions, to help answer three different questions in obstetrics research: (1) how to recommend an optimal gestational age of delivery that minimizes the overall risk of perinatal mortality in high-risk pregnancies; (2) how preeclampsia is related to birthweight in low-resource hospital settings, such as those common in lower and middle-income countries; and (3) how two different dosing protocols for early medical abortion compare in effectiveness over the first nine weeks of gestation.

 

February 17th, 2022 - Chiara Sabatti, Stanford University


February 17th, 2022 - Chiara Sabatti, PhD
Professor of Biomedical Data Science
Professor of Statistics
Stanford University 

Genetic variants across human populations and our understanding of the genetic basis of traits

Abstract: Identifying which genetic variants influence medically relevant phenotypes is an important task both for therapeutic development and for risk prediction. In the last decade, genome wide association studies have been the most widely-used instrument to tackle this question. One challenge that they encounter is in the interplay between genetic variability and the structure of human populations. In this talk, we will focus on some opportunities that arise when one collects data from diverse populations and present statistical methods that allow us to leverage them.

The presentation will be based on joint work with M. Sesia, S. Li, Z. Ren, Y. Romano and E. Candes.

 

April 15th, 2021 - Bhramar Mukherjee, University of Michigan

April 15th – Bhramar Mukherjee, PhD
Chair of Biostatistics
University of Michigan

Handling Outcome Misclassification and Selection Bias in Association Studies Using Electronic Health Records

In this talk we will discuss statistical challenges and opportunities with joint analysis of electronic health records and genomic data through “Genome and Phenome-Wide Association Studies (GWAS and PheWAS)”. We posit a modeling framework that helps us to understand the effect of both selection bias and outcome misclassification in assessing genetic associations across the medical phenome. We will propose various inferential strategies that handle both sources of bias to yield improved inference. We will use data from the UK Biobank and the Michigan Genomics Initiative, a longitudinal biorepository at Michigan Medicine launched in 2012, to illustrate the analytic framework.

The examples illustrate that understanding sampling design and selection bias matters for big data and is at the heart of doing good science with data. This is joint work with Lauren Beesley and Lars Fritsche at the University of Michigan.

 

For Zoom Info Please Email: Kpietrini@hsph.harvard.edu

March 25th, 2021 - Kathryn Roeder, Carnegie Mellon University

March 25th – Kathryn Roeder, PhD
Department of Statistics and Data Science
UPMC Professor of Statistics and Life Sciences
Carnegie Mellon University

Statistical challenges in the analysis of single-cell RNA-seq from brain cells

Quantification of gene expression using single cell RNA-sequencing of brain tissues can be a critical step in the understanding of cell development and differences between cells sampled from case and control subjects. We describe statistical challenges encountered analyzing expression of brain cells in the context of two projects. First, over-correction has been one of the main concerns in employing various data integration methods that risk removing the biological distinctions, which is harmful for cell type identification. Here, we present a simple yet surprisingly effective transfer learning model named cFIT for removing batch effects across experiments, technologies, subjects, and even species. Second, gene co-expression networks yield critical insights into biological processes, and single-cell RNA sequencing provides an opportunity to target inquiries at the cellular level. However, due to the sparsity and heterogeneity of transcript counts, it is challenging to construct accurate gene co-expression networks. We develop an alternative approach that estimates cell-specific networks for each single cell. We use this method to identify differential network genes in a comparison of cells from brains of individuals with autism spectrum disorder and those without.

February 12, 2021 - Devan V. Mehrotra, University of Pennsylvania

Friday, February 12th, 2021
4:00 – 5:00 PM
Lunchtime Career Chat: 1:00 – 2:00 PM
For Zoom Info Please Email: Kpietrini@hsph.harvard.edu

Devan V. Mehrotra
Biostatistics and Research Decision Sciences, Merck & Co., Inc.
University of Pennsylvania

A Novel Approach for Survival Analysis in Randomized Clinical Trials

Randomized clinical trials are often designed to assess whether a test treatment prolongs survival relative to a control treatment. Increased patient heterogeneity, while desirable for generalizability of results, can weaken the ability of common statistical approaches to detect treatment differences, potentially hampering the regulatory approval of safe and efficacious therapies. A novel solution to this problem is proposed. A list of baseline covariates that have the potential to be prognostic for survival under either treatment is pre-specified in the analysis plan. At the analysis stage, using observed survival times but blinded to patient-level treatment assignment, ‘noise’ covariates are removed with elastic net Cox regression. The shortened covariate list is subsequently used by a conditional inference tree algorithm to segment the heterogeneous trial population into subpopulations of prognostically homogeneous patients (risk strata). After patient-level treatment unblinding, a treatment comparison is done within each formed risk stratum and stratum-level results are combined for overall statistical inference. The impressive power-boosting performance of our proposed 5-step stratified testing and amalgamation routine (5-STAR), relative to that of the logrank test and other common approaches that do not leverage inherently structured patient heterogeneity, is illustrated using a hypothetical and two real datasets along with simulation results. In addition, the importance of reporting stratum-level treatment effects is highlighted as a potential enabler of personalized medicine. An R package is available for implementation. (Joint work with Rachel Marceau West at Merck).

September 20, 2018 - MYRTO AWARD - Elizabeth Stuart, Johns Hopkins

Myrto Lefkopoulou Award
Thursday, September 20, 2018
Kresge G2
Award & Lecture: 3:45-4:45pm
Coffee & Tea served at 3:30pm
Reception to follow at 4:45pm in the FXB Atrium

Elizabeth Stuart
Associate Dean for Education
Professor of Mental Health, Biostatistics, and Health Policy and Management
Johns Hopkins Bloomberg School of Public Health

Dealing with observed and unobserved effect moderators when estimating population average treatment effects

Many decisions in public health and public policy require estimation of population average treatment effects, including questions of cost effectiveness or when deciding whether to implement a screening program across a population. While randomized trials are seen as the gold standard for (internally valid) causal effects, they do not always yield accurate inferences regarding population effects. In particular, in the presence of treatment effect heterogeneity, the average treatment effect (ATE) in a randomized controlled trial (RCT) may differ from the average effect of the same treatment if applied to a target population of interest. If all treatment effect moderators are observed in the RCT and in a dataset representing the target population, then we can obtain an estimate for the target population ATE by adjusting for the difference in the distribution of the moderators between the two samples. However, that is often an unrealistic assumption in practice. This talk will discuss methods for generalizing treatment effects under that assumption, as well as sensitivity analyses for two situations: (1) where we cannot adjust for a specific moderator observed in the RCT because we do not observe it in the target population; and (2) where we are concerned that the treatment effect may be moderated by factors not observed even in the RCT. These sensitivity analyses are particularly crucial given the often limited data available from trials and on the population. The methods are applied to examples in drug abuse treatment. Implications for study design and analyses are also discussed, when interest is in a target population ATE.

October 11, 2018 - LAGAKOS AWARD - Amy Herring, Duke University

Lagakos Distinguished Alumni Award
Thursday, October 11, 2018
Kresge G3
Award & Lecture: 3:45-4:45pm
Coffee & Tea served at 3:30pm
Reception to follow at 4:45pm in the FXB Atrium

Amy Herring
Professor of Statistical Science
Research Professor of Global Health
Duke University

Statistics for Science’s Sake

From decapitated cats (my first Harvard biostatistics project!) to birth defects (my second project and continued scholarly focus) and beyond, we will consider a series of case studies of scientific problems that pose interesting statistical challenges motivating new methodological development. We will address the motivating scientific problems, drawbacks of existing or standard analysis approaches, and the process of collaboration in multiple disciplines, with a focus on strategies for generating ideas for research beyond graduate school and throughout one’s career.

November 15, 2018 - Pierre Bushel, National Institute of Environmental Health Sciences

November 15, 2018
Building 2, Room 426
3:45-4:45pm
Happy Hour to follow in the FXB Atrium

Pierre Bushel
Staff Scientist, Biostatistics and Computational Biology Branch
National Institute of Environmental Health Sciences

A Mashup of Statistics and Bioinformatics Applied to Genomic and Genetic Data for Better Understanding of Biological Consequences

Exposure to certain environmental and chemical stressors can cause adverse health conditions and is of immense public concern. Determining the mechanisms by which these insults affect biological systems is paramount to deriving remedies to improve public health. Genetics and genomics are increasingly being used as tools to investigate environmental health sciences and toxicology. In this presentation we will examine two cases of applying statistics and bioinformatics for better understanding of biological consequences elicited from chemical and environmental exposures. In the first case, a human clinical study of gene expression changes in the blood from responders and non-responders to acetaminophen (APAP) is utilized. APAP is the active ingredient in Tylenol and is toxic to the liver when taken with alcohol, in overdose, or in some cases of possible/unknown genetic susceptibility. Here we demonstrate the use of piecewise linear regression modeling of the data and pathway analysis to identify gene signatures and molecular pathways that are indicative of an adverse response to APAP. Particularly, a network of genes associated with protein misfolding and the accumulation of misfolded proteins in the endoplasmic reticulum may potentially play a crucial role in mediating APAP toxicity. In the second case, strains of mice exposed to hyperoxia and normoxia are genotyped and characterized phenotypically for genetic analysis. Chronic high oxygen saturation causes mitochondrial dysfunction and excessive reactive oxygen species that damage DNA, and adversely affects the lungs of preterm infants. Here we illustrate the use of mixed linear modeling with a specified correlation structure and the MitoMiner database to reveal epistatic interactions between the nuclear and mitochondrial genomes that are associated with the hyperoxia-induced lung injury phenotype. In particular, nuclear genes with allelic interactions function in the mitochondrial respiratory chain, which is involved in oxidative phosphorylation to create adenosine triphosphate (i.e., the cell’s energy source).
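
As a small aside on the first case's modeling tool, a one-breakpoint piecewise linear regression can be fit by grid-searching the breakpoint and solving ordinary least squares at each candidate. The sketch below is illustrative only and far simpler than the analysis described in the talk; the function name and grid settings are assumptions.

```python
# Illustrative sketch of one-breakpoint piecewise linear regression: breakpoint
# chosen by grid search over candidate locations, each fit by ordinary least squares.
import numpy as np

def fit_piecewise_linear(x, y, n_grid=50):
    """Fit y ~ b0 + b1*x + b2*max(x - c, 0), choosing the breakpoint c by SSE."""
    best = None
    for c in np.linspace(np.quantile(x, 0.1), np.quantile(x, 0.9), n_grid):
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ beta) ** 2)
        if best is None or sse < best[0]:
            best = (sse, c, beta)
    sse, c, beta = best
    return c, beta   # breakpoint and coefficients (intercept, slope, slope change)
```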

Marvin Zelen Leadership Award in Statistical Science
Thursday, May 9, 2019
Room TBD
Award & Lecture: 3:45-4:45pm
Coffee & Tea served at 3:30pm
Reception to follow at 4:45pm

2019 recipient to be announced

Archive


May 3, 2018 - Joseph Hogan, Brown University

Thursday, May 3, 2018
3:30 – 4:30 PM
FXB G13

Joseph Hogan
Carole and Lawrence Sirovich Professor of Public Health
Professor of Biostatistics
Chair of Biostatistics
Brown University

Using electronic health records to model engagement and retention in HIV care

The HIV care cascade is a conceptual model describing the stages of care leading to long-term viral suppression of those with HIV. Distinct stages include case identification, linkage to care, initiation of antiretroviral treatment, and eventual viral suppression. After entering care, individuals are subject to disengagement from care, dropout, and mortality.

Owing to the complexity of the cascade, evaluation of efficacy and cost effectiveness of specific policies has primarily relied on simulation-based approaches of mathematical models, where model parameters may be informed by multiple data sources that come from different populations or samples. The growing availability of electronic health records and large-scale cohort data on HIV-infected individuals presents an opportunity for a more unified, data-driven approach using statistical models.

We describe a statistical framework based on multistate models that can be used for regression analysis, prediction and causal inferences. We illustrate using data from a large HIV care program in Kenya, focusing on comparisons between statistical and mathematical modeling approaches for inferring causal effects about treatment policies.

April 26, 2018 - ZELEN AWARD - Constantine Gatsonis, Brown University

Marvin Zelen Leadership Award in Statistical Science

Thursday, April 26, 2018
FXB Building, Room G13
Award & Lecture: 3:45-4:45pm
Coffee & Tea served
Reception to follow at 4:45pm in the FXB Atrium

Constantine Gatsonis
Henry Ledyard Goddard University Professor of Public Health
Professor of Biostatistics
Director of Statistical Sciences
Brown University

The Evaluation of Diagnostic Imaging in the Era of Radiomics

The quantitative analysis of imaging via machine learning methods for high dimensional data is defining the new frontier for diagnostic imaging. A vast array of imaging-based markers is becoming available, each marker carrying claims of potential utility in clinical care and each being a potential candidate for inclusion in clinical trials. The research enterprise in the clinical evaluation of diagnostic imaging has made great strides in the past three decades. In particular, we now have well developed paradigms for the study of diagnostic and predictive accuracy in the multi-center setting. We are also making progress on the more complex problem of assessing the impact of tests on patient outcomes, via randomized and observational studies, analysis of large databases, and simulation modeling. However the volume of the new radiomics-based markers and their potential for fast evolution, even without formal learning, poses a new set of challenges for researchers and regulators. In this presentation we will survey the methodologic advances in the clinical evaluation of diagnostic imaging in recent decades, showcase examples of radiomics-based modalities, and discuss the statistical and regulatory challenges they create.

March 29, 2018 - Catherine Calder, Ohio State

Thursday, March 29, 2018
3:30 – 4:30 PM
FXB G13

Catherine Calder
Professor of Statistics
The Ohio State University

Activity Patterns and Ecological Networks: Identifying Shared Exposures to Social Contexts

In the social and health sciences, research on ‘neighborhood effects’ focuses on linking features of social contexts or exposures to health, educational, and criminological outcomes. Traditionally, individuals are assigned a specific neighborhood, frequently operationalized by the census tract of residence, which may not contain the locations of routine activities. In order to better characterize the many social contexts to which individuals are exposed as a result of the spatially-distributed locations of their routine activities, and to understand the consequences of these socio-spatial exposures, we have developed the concept of ecological networks. Ecological networks are two-mode networks that indirectly link individuals through the spatial overlap in their routine activities. This presentation focuses on statistical methodology for understanding and comparing the structure of ecological network(s). In particular, we propose a continuous latent space (CLS) model that allows for third-order dependence patterns in the interactions between individuals and the places they visit and a parsimonious non-Euclidean CLS model that facilitates extensions to multi-level modeling. We illustrate our methodology using activity pattern and sample survey data from Los Angeles, CA and Columbus, OH.

March 5, 2018 - JOINT w. STATS - Emily Fox, University of Washington

Monday, March 5, 2018
3:30 – 4:30 PM
FXB G13

Emily Fox
Associate Professor & The Amazon Professor of Machine Learning
Paul G. Allen School of Computer Science & Engineering
Department of Statistics, The University of Washington

Machine Learning for Analyzing Neuroimaging Time Series

Recent neuroimaging modalities, such as high-density electrocorticography (ECoG) and magnetoencephalography (MEG), provide rich descriptions of brain activity over time, opening the door to new analyses of observed phenomena including seizures and understanding the neural underpinnings of complex cognitive processes. However, such data likewise present new challenges for traditional time series methods. One challenge is how to model the complex evolution of the time series with intricate and possibly evolving relationships between the multitude of dimensions. Another challenge is how to scale such analyses to long and possibly streaming recordings in the case of ECoG, or how to leverage few and costly MEG observations. In this talk, we first discuss methods for automatically parsing the complex dynamics of ECoG recordings for the sake of analyzing seizure activity. To handle the lengthy ECoG recordings, we develop stochastic gradient MCMC and variational algorithms that iterate on subsequences of the data while properly accounting for broken temporal dependencies. We then turn to studying the functional connectivity of auditory attention using MEG recordings. We explore notions of undirected graphical models of the time series in the frequency domain as well as learning time-varying directed interactions with state-space models. We conclude by discussing recent work on inferring Granger causal networks in nonlinear time series using penalized neural network models.

January 25, 2018 - Peter Mueller, UT Austin

Thursday, January 25, 2018
3:30 – 4:30 PM
FXB G13

Peter Mueller
Department Chair (Interim), Professor
Department of Statistics & Data Science, Department of Mathematics
The University of Texas at Austin

Bayesian Feature Allocation Models for Tumor Heterogeneity

We characterize tumor variability by hypothetical latent cell types that are defined by the presence of some subset of recorded SNVs (single nucleotide variants, that is, point mutations). Assuming that each sample is composed of some sample-specific proportions of these cell types, we can then fit the observed proportions of SNVs for each sample. In other words, by fitting the observed proportions of SNVs in each sample we impute latent underlying cell types, essentially by a deconvolution of the observed proportions as a weighted average of binary indicators that define cell types by the presence or absence of different SNVs. In the first approach we use the generic feature allocation model of the Indian buffet process (IBP) as a prior for the latent cell subpopulations. In a second version of the proposed approach we make use of pairs of SNVs that are jointly recorded on the same reads, thereby contributing valuable haplotype information. Inference now requires feature allocation models beyond the binary IBP. We introduce a categorical extension of the IBP. Finally, in a third approach we replace the IBP by a prior based on a stylized model of a phylogenetic tree of cell subpopulations.
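
For readers unfamiliar with the IBP prior mentioned above, its generative scheme is short enough to sketch: the first customer samples Poisson(alpha) dishes, and customer n takes each existing dish k with probability m_k/n plus Poisson(alpha/n) new dishes. The snippet below is a generic illustration of that scheme, not the categorical or tree-based extensions developed in the talk.

```python
# Generic sketch of the Indian buffet process generative scheme underlying this
# kind of feature-allocation prior. Illustrative only.
import numpy as np

def sample_ibp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []                              # m_k for each dish sampled so far
    rows = []
    for n in range(1, n_customers + 1):
        row = [rng.random() < m / n for m in dish_counts]   # revisit existing dishes
        for k, taken in enumerate(row):
            dish_counts[k] += int(taken)
        n_new = rng.poisson(alpha / n)                       # sample new dishes
        dish_counts.extend([1] * n_new)
        rows.append(row + [True] * n_new)
    K = len(dish_counts)
    Z = np.zeros((n_customers, K), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z                                       # binary feature-allocation matrix
```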

December 7, 2017 - David Draper, UC Santa Cruz

Thursday, December 7, 2017
3:30 – 4:30 PM
Kresge 202A

David Draper
Professor, Department of Applied Mathematics and Statistics
Jack Baskin School of Engineering
University of California, Santa Cruz

Optimal Bayesian Analysis of A/B Tests (Randomized Controlled Trials) in Data Science at Big-Data Scale

When coping with uncertainty, You typically begin with a problem P = (Q,C), in which Q lists the questions of principal interest to be answered and C summarizes the real-world context in which those questions arise. Q and C together define (θ,D,B), in which θ (which may be infinite-dimensional) is the unknown of principal interest, D summarizes Your data resources for decreasing Your uncertainty about θ, and B is a finite (ideally exhaustive) set of (true/false) propositions summarizing, and all rendered true by, context C.

The Bayesian paradigm provides one way to arrive at logically-internally-consistent inferences about θ, predictions for new data D∗, and decisions under uncertainty (either through Bayesian decision theory or Bayesian game theory). In this paradigm it’s necessary to build a stochastic model M that relates knowns (D and B) to unknowns (θ). For inference and prediction, on which I focus in this talk, the model takes the form M = {p(θ | B), p(D | θ, B)}, in which the (prior) distribution p(θ | B) quantifies Your information about θ external to Your dataset D and the (sampling) distribution p(D | θ, B) quantifies Your information about θ internal to D, when converted to Your likelihood distribution l(θ | D, B) ∝ p(D | θ, B).

The fundamental problem of applied statistics is that the mapping from P to M is often not unique: You have basic uncertainty about θ, but often You also have model uncertainty about how to specify Your uncertainty about θ. It turns out, however, that there are situations in which problem context C implies a unique choice of p(θ | B) and/or l(θ | D, B); let’s agree to say that optimal Bayesian model specification has occurred in such situations, leading to optimal Bayesian analysis via Bayes’s Theorem and its corollaries.

In this talk I’ll identify an important class of problems arising in the analysis of randomized controlled trials (typically called A/B tests in Data Science) in which optimal Bayesian analysis is made possible by the use of Bayesian non-parametric modeling, and I’ll illustrate this type of analysis with an A/B test at Big-Data scale (involving about 22 million observations). Along the way I’ll demonstrate that the frequentist bootstrap is actually a Bayesian non-parametric method in disguise, and I’ll discuss how large-scale observational studies may admit similar analyses via conditional exchangeability assumptions (although such analyses will typically not be optimal in the sense defined here, because such exchangeability assumptions are typically not uniquely specified by problem context).
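One standard way to see the bootstrap-as-Bayesian-nonparametrics point is the Bayesian bootstrap (Rubin, 1981), which places Dirichlet(1, ..., 1) weights on the observed values. The sketch below is a stylized A/B comparison in that spirit, not the speaker's analysis; the outcome distributions and sample sizes are made up.

# Stylized Bayesian-bootstrap A/B comparison (illustrative sketch only)
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical A/B test outcomes (e.g., a per-user metric in each arm)
y_a = rng.gamma(shape=2.0, scale=1.00, size=2000)     # control
y_b = rng.gamma(shape=2.0, scale=1.05, size=2000)     # treatment

def bayesian_bootstrap_means(y, n_draws, rng):
    """Posterior draws of the population mean under Dirichlet(1,...,1) weights."""
    w = rng.dirichlet(np.ones(len(y)), size=n_draws)  # n_draws x n weight vectors
    return w @ y

n_draws = 2000
delta = (bayesian_bootstrap_means(y_b, n_draws, rng)
         - bayesian_bootstrap_means(y_a, n_draws, rng))

lo, hi = np.percentile(delta, [2.5, 97.5])
print(f"posterior mean lift: {delta.mean():.3f}")
print(f"95% interval: ({lo:.3f}, {hi:.3f})")
print(f"P(lift > 0) = {(delta > 0).mean():.3f}")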

November 16, 2017 - Joe Koopmeiners, University of Minnesota

Thursday, November 16, 2017
3:30 – 4:30 PM
Kresge 202A

Joe Koopmeiners
Associate Professor, Division of Biostatistics
University of Minnesota

A Multi-source Adaptive Platform Design for Emerging Infectious Diseases

Emerging infectious diseases challenge traditional paradigms for clinical translation of therapeutic interventions. The Ebola virus disease outbreak in West Africa was a recent example that heralded the need for alternative designs that can be sufficiently flexible to compare multiple potentially effective treatment regimes in a context with high mortality and limited available treatment options. The PREVAIL II master protocol was designed to address these concerns by sequentially evaluating multiple treatments in a single trial and incorporating aggressive interim monitoring with the purpose of identifying efficacious treatments as soon as possible. One shortcoming, however, is that supplemental information from controls in previous trial segments was not utilized. In this talk we address this limitation by proposing an adaptive design methodology that facilitates “information sharing” across potentially non-exchangeable segments using multi-source exchangeability models (MEMs). The design uses multi-source adaptive randomization to target information balance within a trial segment in relation to posterior effective sample size. When compared to the standard platform design, we demonstrate that MEMs with adaptive randomization can improve power with limited type-I error inflation. Further, the adaptive platform effectuates more balance with respect to the distribution of acquired information among study arms, with more patients randomized to experimental regimens, which, when effective, yields reductions in the overall mortality rate for trial participants.
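As a stylized two-source sketch of the exchangeability-weighting idea (not the PREVAIL II implementation), one can compare conjugate Beta-Binomial marginal likelihoods under "supplemental controls are exchangeable with the current segment" versus "they are not," and scale borrowing by the posterior probability of exchangeability. Counts and prior probabilities below are hypothetical.

# Stylized two-source exchangeability weighting with conjugate marginals
from math import lgamma, exp

def log_beta_integral(y, n, a=1.0, b=1.0):
    """log of the integral of p^y (1-p)^(n-y) against a Beta(a, b) prior
    (binomial coefficients are identical in both configurations and cancel)."""
    return (lgamma(y + a) + lgamma(n - y + b) - lgamma(n + a + b)
            - (lgamma(a) + lgamma(b) - lgamma(a + b)))

# Hypothetical control data: current segment and a previous trial segment
y_cur, n_cur = 12, 40        # events among current-segment controls
y_sup, n_sup = 30, 100       # events among supplemental (historical) controls

# Exchangeable configuration: both sources share one response probability
log_m_exch = log_beta_integral(y_cur + y_sup, n_cur + n_sup)
# Non-exchangeable configuration: each source has its own probability
log_m_indep = log_beta_integral(y_cur, n_cur) + log_beta_integral(y_sup, n_sup)

prior_exch = 0.5             # prior probability of exchangeability
post_exch = 1.0 / (1.0 + (1 - prior_exch) / prior_exch
                   * exp(log_m_indep - log_m_exch))

# Rough effective number of borrowed controls under this weighting
borrowed = post_exch * n_sup
print(f"P(exchangeable | data) = {post_exch:.3f}")
print(f"approx. borrowed controls = {borrowed:.1f} of {n_sup}")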

October 19, 2017 - LAGAKOS AWARD - Nicholas Horton, Amherst College

Thursday, October 19, 2017
FXB Building, Room G12
Award & Lecture: 3:30-4:30pm
Coffee & Tea served at 3:00pm
Reception to follow at 4:30pm in the FXB Atrium

Nicholas Horton
Professor of Statistics at Amherst College

Multivariate thinking and the introductory biostatistics course: preparing students to make sense of a world of observational data

We live in a world of ever-expanding found (or what we might call observational) data. To make decisions and disentangle complex relationships in such a world, students need a background in design and confounding. The GAISE College Report enunciated the importance of multivariate thinking as a way to move beyond bivariate thinking. But how do such learning outcomes compete with other aspects of statistics knowledge (e.g., inference and p-values) in introductory courses that are already overfull? In this talk I will offer some reflections and guidance about how we might move forward, with specific implications for introductory biostatistics courses.

October 2, 2017 - JOINT w. STATS - Rebecca Steorts, Duke University

Monday, October 2, 2017
4:15 PM
Science Center, Hall A
Cambridge, MA

Rebecca C. Steorts
Department of Statistical Science, affiliated faculty in Computer Science, Biostatistics and Bioinformatics, the information initiative at Duke, and the Social Science Research Institute

Entity Resolution with Societal Impacts in Statistical Machine Learning

Very often information about social entities is scattered across multiple databases. Combining that information into one database can result in enormous benefits for analysis, resulting in richer and more reliable conclusions. The types of questions that have been, and can be, addressed by combining information include: How accurate are census enumerations for minority groups? How many of the elderly are at high risk for sepsis in different parts of the country? How many people were victims of war crimes in recent conflicts in Syria? In most practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because they are not a part of some databases or are not available due to privacy concerns. In such cases, analysts need to use methods from statistical and computational science known as entity resolution (record linkage or de-duplication) to proceed with analysis. Entity resolution is not only a crucial task for social science and industrial applications, but is a challenging statistical and computational problem itself. In this talk, we describe the past and present challenges with entity resolution, with applications to the Syrian conflict but also official statistics, and the food and music industry. This work, a joint collaboration with researchers at Rice University and the Human Rights Data Analysis Group (HRDAG), touches on the interdisciplinary research that is crucial to problems with societal impacts at the forefront of both national and international news.

September 28, 2017 - MYRTO AWARD - Ciprian Crainiceanu, Johns Hopkins

Thursday, September 28, 2017
FXB Building, Room G12
Award & Lecture: 3:30-4:30pm
Coffee & Tea served at 3:00pm
Reception to follow at 4:30pm in the FXB Atrium

Ciprian Crainiceanu
Professor, Department of Biostatistics
Johns Hopkins University

Biostatistical Methods for Wearable and Implantable Technology

Wearable and Implantable Technology (WIT) is rapidly changing the Biostatistics data analytic landscape due to its reduced bias and measurement error as well as to the sheer size and complexity of the signals. In this talk I will review some of the most used and useful sensors in the Health Sciences and the ever-expanding WIT analytic environment. I will describe the use of WIT sensors including accelerometers, heart monitors, and glucose monitors, and their combination with ecological momentary assessment (EMA). This rapidly expanding data ecosystem is characterized by multivariate, densely sampled time series with complex and highly non-stationary structures. I will introduce an array of scientific problems that can be answered using WIT and I will describe methods designed to analyze WIT data from the micro-scale (sub-second-level) to the macro-scale (minute-, hour-, or day-level).

April 5, 2017 - Enrique Schisterman, NIH

Wednesday April 5, 2017
3:00 – 4:00 PM
Building 2, Room 426

Enrique F. Schisterman
Chief and Senior Investigator
Epidemiology Branch
NICHD/DIPHR

Pooling Bio-Markers as a Cost-Effective Design

Evaluating biomarkers in epidemiological studies can be expensive and time consuming. Many investigators use techniques such as random sampling or pooling biospecimens in order to cut costs and save time on experiments. Commonly, analyses based on pooled data are strongly restricted by distributional assumptions that are challenging to validate because of the pooled biospecimens. Random sampling provides data that can be easily analyzed. However, random sampling methods are not optimal cost-efficient designs for estimating means. We propose and examine a cost-efficient hybrid design that involves taking a sample of both pooled and unpooled data in an optimal proportion in order to efficiently estimate the unknown parameters of the biomarker distribution.
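A minimal Monte Carlo sketch of why pooling can be cost-efficient for estimating a biomarker mean (not the speaker's method): with a fixed budget of assays, each pooled assay measures the average of g specimens plus measurement error, so its variance around the mean is sigma^2/g + tau^2 instead of sigma^2 + tau^2. All parameter values below are hypothetical.

# Pooled vs. unpooled designs for estimating a biomarker mean (sketch)
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, tau = 10.0, 3.0, 1.0      # biomarker mean/SD and assay error SD
n_assays, pool_size, n_sim = 50, 4, 20_000

est_unpooled = np.empty(n_sim)
est_pooled = np.empty(n_sim)
for s in range(n_sim):
    # Design A: one specimen per assay
    x = rng.normal(mu, sigma, n_assays)
    est_unpooled[s] = np.mean(x + rng.normal(0, tau, n_assays))
    # Design B: each assay measures a pool of `pool_size` specimens
    pools = rng.normal(mu, sigma, (n_assays, pool_size)).mean(axis=1)
    est_pooled[s] = np.mean(pools + rng.normal(0, tau, n_assays))

print("var, unpooled design:", est_unpooled.var().round(4),
      "theory:", round((sigma**2 + tau**2) / n_assays, 4))
print("var, pooled design:  ", est_pooled.var().round(4),
      "theory:", round((sigma**2 / pool_size + tau**2) / n_assays, 4))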

March 23, 2017 - Amy Herring, UNC

Thursday, March 23, 2017
3:00 – 4:00 PM
Building 2, Room 426

Amy Herring
Associate Chair
Department of Biostatistics
UNC Gillings School of Global Public Health

Bayesian models for multivariate dynamic survey data

Modeling and computation for multivariate longitudinal data have proven challenging, particularly when data contain discrete measurements of different types.  Motivated by data on the fluidity of sexuality from adolescence to adulthood, we propose a novel nonparametric approach for mixed-scale longitudinal data.  The proposed approach relies on an underlying variable mixture model with time-varying latent factors.  We use our approach to address hypotheses in The National Longitudinal Study of Adolescent to Adult Health, which selected participants via stratified random sampling, leading to discrepancies between the sample and population that are further compounded by missing data.  Survey weights have been constructed to address these issues, but including them in complex statistical models is challenging.  Bias arising from the survey design and nonresponse is addressed.  The approach is assessed via simulation and used to answer questions about the associations among sexual orientation identity, behaviors, and attraction in the transition from adolescence to young adulthood.

February 9, 2017 - Jennifer Hill, NYU

Thursday, February 9, 2017
12:30-1:30 PM
Building 2, Room 426

Jennifer L. Hill
Professor of Applied Statistics & Data Science
New York University

Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition

Statisticians have made great strides towards assumption-free estimation of causal estimands in the past few decades. However, this explosion in research has resulted in a breadth of inferential strategies that both create opportunities for more reliable inference and complicate the choices that an applied researcher has to make and defend. Relatedly, researchers advocating for new methods typically compare their method to (at best) 2 or 3 other causal inference strategies and test using simulations that may or may not be designed to equally tease out flaws in all the competing methods. The causal inference data analysis challenge, “Is Your SATT Where It’s At?”, launched as part of the 2016 Atlantic Causal Inference Conference, sought to make progress with respect to both of these issues. The researchers creating the data testing grounds were distinct from the researchers submitting methods whose efficacy would be evaluated. Results from over 30 competitors in the two parallel versions of the competition (Black Box Algorithms and Do It Yourself Analyses) are presented along with post-hoc analyses that reveal information about the characteristics of causal inference strategies and settings that affect performance. The most consistent conclusion was that the automated (black box) methods performed better overall than the user-controlled methods across scenarios.