B3D Mail List 
Recordings of Past Seminars


Organized by the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, the Big Data (B3D) Seminar is a series of research talks on statistical, computational, and machine learning methods for analyzing large, complex data sets, with a focus on applications in biomedical science and public health, including:

  • Genetics and genomics
  • Epidemiological and environmental health science
  • Comparative effectiveness research
  • Electronic medical records
  • Digital health
  • Neuroscience
  • Network science
  • Precision health

The goal of the seminar is to provide a forum for brainstorming and exchanging ideas and for promoting interdisciplinary collaboration among researchers from a variety of disciplines such as biostatistics/statistics, biomedical informatics, computer science, computational biology, biomedicine, public health, social sciences, and other related areas. The seminar will feature local, national, and international speakers who are leaders in their fields.

For speaker recommendations or scientific questions:
Please contact seminar organizers Jeff Miller or Susanne Churchill

For event information or other logistical questions:
Please contact seminar coordinator Katie Pietrini (HSPH)

Upcoming Virtual Seminars


2023 – Upcoming Schedule

Seminar Archive

May 2nd, 2023 - Leo Duan

Leo Duan, PhD
University of Florida
Department of Statistics

Model-based Spectral Clustering: Uncovering Clear Signals in fMRI Images using Measures of Network Vulnerability

Spectral clustering views the similarity matrix as a weighted graph, and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. As a result, one reduces the chance of model misspecification, which is often a risk in mixture model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct way of quantifying the clustering uncertainty (such as the assignment probability) or allowing easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors as the ones used by normalized spectral clustering. To induce a distribution for the forest, we develop a “forest process” as a graph extension to the urn process, while carefully characterizing the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation, and demonstrate superior performance compared to existing algorithms. In this talk, I’ll demonstrate an interesting extension of the hierarchical mixture model applied to a heterogeneous set of resting-state fMRI data from an Alzheimer’s disease study. Despite high within-group variability, we were able to uncover meaningful signals using the proposed framework, which characterizes the vulnerability of the brain to disruptions.
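As context for the graph-cut view described above, here is a minimal sketch of standard normalized spectral clustering, the baseline whose leading eigenvectors the forest model is shown to (almost) share; the data, kernel bandwidth, and cluster count are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated point clouds in 2-D.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

# Similarity matrix viewed as a weighted graph (RBF kernel).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)

# Leading eigenvectors of the normalized affinity D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
A = W / np.sqrt(np.outer(d, d))
vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
v2 = vecs[:, -2]                 # second leading eigenvector

# The sign pattern of the second eigenvector separates the two clusters.
labels = (v2 > 0).astype(int)
```

With more clusters, one would keep the top k eigenvectors and run k-means on the rows instead of thresholding a single eigenvector.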


April 3rd, 2023 - Gemma Moran

April 3rd, 2023
1:00 – 2:00 pm
Zoom – Please email kpietrini@hsph.harvard.edu for link

Gemma Moran, PhD
Postdoctoral Research Scientist
Columbia University

Identifiable deep generative models via sparse decoding

We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. As examples, in ratings data each movie is only described by a few genres; in text data each word is only applicable to a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods.
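The sparsity structure described above, with each observed feature depending on only a few latent factors, can be sketched as a masked linear decoder; everything here (dimensions, masking rule, variable names) is an invented illustration, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_factors, n_features = 4, 12    # invented sizes

# Each observed feature is allowed to load on at most 2 of the 4
# latent factors (cf. "each movie is only described by a few genres").
mask = np.zeros((n_features, n_factors))
for j in range(n_features):
    mask[j, rng.choice(n_factors, size=2, replace=False)] = 1.0
W = mask * rng.normal(size=(n_features, n_factors))

# Decoding step: observed features are a sparse function of the factors
# (a real VAE decoder would be a neural network plus observation noise).
z = rng.normal(size=n_factors)   # latent representation for one example
x = W @ z                        # reconstructed features
```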


March 27th, 2023 - Nima Hejazi

March 27th, 2023
1:00 – 2:00 pm

Nima Hejazi, PhD
Assistant Professor of Biostatistics
Harvard T.H. Chan School of Public Health

Combining Causal Inference and Machine Learning for Model-Agnostic Discovery in High-Dimensional Biology

The widespread availability of high-dimensional data has catalyzed biological pattern discovery. Today, the simultaneous measurement of thousands to millions of biological characteristics (in, e.g., genomics, metabolomics, proteomics) is commonplace in many experimental settings, making the simultaneous screening of such a large magnitude of characteristics a central problem in computational biology and allied sciences. The information that may be gleaned from such studies promises substantial progress, yet population-level biomedical and public health sciences must often operate without access to the great precision offered by modern biological techniques for molecular-cellular level manipulations (as used in, e.g., chemical biology). Here, statistical innovations bridge the gap — being used to dissect mechanistic processes and to mitigate the inferential obstacles imposed by potential confounding in observational (non-randomized) studies. Unfortunately, most off-the-shelf statistical techniques rely on restrictive assumptions that invite opportunities for bias due to model misspecification (when the biological process under study fails to obey assumed mathematical conveniences). Model-agnostic statistical inference, drawing on causal inference and semiparametric efficiency theory, provides an avenue for avoiding restrictive modeling assumptions while obtaining robust statistical inference on scientifically relevant target parameters. We outline this framework briefly and introduce a model-agnostic approach to biomarker discovery. The proposed approach readily accommodates statistical parameters informed by causal inference and leverages state-of-the-art machine learning to construct flexible and robust estimators (mitigating model misspecification bias) while using variance moderation to deliver stable, conservative inference in high-dimensional settings. 
The approach is implemented in the open-source biotmle R/Bioconductor package (https://bioconductor.org/biotmle). This talk is based on joint work with Alan Hubbard, Mark van der Laan, and Philippe Boileau, and is based on a recently published manuscript (doi: https://doi.org/10.1177/09622802221146313; pre-print: https://arxiv.org/abs/1710.05451).

February 13th, 2023 - Alexander Franks

February 13th, 2023
1:00 – 2:00 pm 

Alexander Franks
Dept of Statistics and Applied Probability
UC Santa Barbara

Sensitivity to Unobserved Confounding in Studies with Factor-structured Outcomes

We propose an approach for assessing sensitivity to unobserved confounding in studies with multiple outcomes. We demonstrate how prior knowledge unique to the multi-outcome setting can be leveraged to strengthen causal conclusions beyond what can be achieved from analyzing individual outcomes in isolation.  We argue that it is often reasonable to make a shared confounding assumption, under which residual dependence amongst outcomes can be used to simplify and sharpen sensitivity analyses.  We focus on a class of factor models for which we can bound the causal effects for all outcomes conditional on a single sensitivity parameter that represents the fraction of treatment variance explained by unobserved confounders.  We characterize how causal ignorance regions shrink under additional prior assumptions about the presence of null control outcomes,  and provide new approaches for quantifying the robustness of causal effect estimates.  Finally, we illustrate our  sensitivity analysis workflow in practice, in an analysis of both simulated data and a case study with data from the National Health and Nutrition Examination Survey (NHANES).

May 2nd, 2022 - Corwin Zigler

Corwin Zigler, Ph.D.
Associate Professor
Department of Statistics and Data Sciences
College of Natural Sciences | The University of Texas at Austin

Weather2vec: Representation Learning for Causal Inference with Non-Local Confounding in Air Pollution Studies

Causal effects of spatially-varying exposures on spatially-varying outcomes can be subject to non-local confounding (NLC), which exists when treatments and outcomes for an index unit are dictated in part by covariates of other (perhaps nearby) units. We formalize the problem of NLC and offer a deep-learning approach to encode neighboring covariate information into a vector defined for each observational unit that can be used to adjust for NLC. We evaluate the approach in two studies of causal effects of air pollution exposure, where meteorology is an inherently regional construct that threatens causal estimates with both local and neighborhood-level information. We illustrate the ability of the proposed U-net representation to capture relevant neighboring confounding information that cannot be fully characterized with simple functions of local and regional meteorological covariates.

April 4th, 2022 - Jyotishka Datta

April 4th, 2022
Jyotishka Datta
Department of Statistics, Virginia Tech

New Directions in Bayesian Shrinkage for Sparse, Structured Data

Sparse signal recovery remains an important challenge in large-scale data analysis, and global-local (G-L) shrinkage priors have undergone explosive development in the last decade in both theory and methodology. In this talk, I will present two recent developments in Bayesian inference using G-L priors for sparse and structured data. In the first half of my talk, I will present a new prior (called GIGG) for sparse Bayesian linear regression models designed explicitly for predictors with bi-level grouping structure, which generalizes normal beta prime and horseshoe shrinkage. In the second half, I consider high-dimensional compositional data with sparsity or complex dependence structure, as routinely observed in many areas including ecology, microbiomics, geology, environmetrics, and social sciences, and propose a new prior distribution specially designed to enable scaling to data with many categories. I will discuss the methodological challenges associated with each of these problems, and briefly review and discuss these recent developments. I will provide some theoretical support for the proposed methods and show improved performance in simulation settings and in applications to environmetrics and microbiome data.


March 7th, 2022 - Dawn Woodard (Uber)

March 7th, 2022 
Dawn Woodard
Senior Director of Data Science, Uber

Geospatial Technologies for Ride-Sharing and Delivery Platforms

Ride-sharing and delivery platforms such as Uber require complex geospatial inputs in order to generate their user experiences, match demand with drivers, and calculate fares.  For example, route planning for meal deliveries uses predictions of the travel time between any two locations in the road network, and platform efficiency heavily depends on the accuracy of these predictions.  I will describe the data-driven geospatial technologies, including those for travel time prediction, route optimization, and map error detection, that form the foundation of such multi-sided platforms.  I will detail the challenges, such as data sparsity on parts of the road network, and show that highly accurate predictions need to take into account the granular dynamics of the physical system (traffic patterns in the road network).  I will also compare several common approaches for travel time prediction, and provide rigorous theoretical results showing that one class of approaches has higher accuracy than alternatives.
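As a toy illustration of the routing step described above, one can run a shortest-path search over predicted per-edge travel times; the graph, node names, and times below are invented, and a fixed dictionary stands in for a learned prediction model:

```python
import heapq

# Predicted travel time (seconds) per directed road segment; in a real
# platform these values come from a learned model, not a fixed table.
travel_time = {
    ("A", "B"): 60, ("B", "C"): 45, ("A", "C"): 150, ("C", "D"): 30,
}

def fastest(src, dst):
    """Dijkstra shortest path over the predicted edge travel times."""
    pq, seen = [(0, src)], set()
    while pq:
        t, u = heapq.heappop(pq)
        if u == dst:
            return t
        if u in seen:
            continue
        seen.add(u)
        for (a, b), w in travel_time.items():
            if a == u:
                heapq.heappush(pq, (t + w, b))
    return None

# A -> B -> C -> D (135 s) beats the direct A -> C -> D route (180 s).
```

The accuracy point in the abstract is about the travel_time inputs: the routing algorithm is standard, so prediction quality is what drives platform efficiency.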

May 17th, 2021 - Caroline Colijn (SFU)


What’s next for SARS-CoV-2:  evolution, selection and the next VOCs

I will discuss modelling transmission and vaccination, and then turn to where the SARS-CoV-2 virus is likely to be headed next and the public health implications of future VOCs.

May 3rd, 2021 - Qi Long, UPenn

Communication Efficient Distributed Regression Analysis with Differential Privacy

Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. In particular, it is often the case that patient-level data in EHRs cannot be shared across institutions (data sources) due to privacy concerns, government regulations, and institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data.
To tackle such challenges, we propose a novel communication-efficient method that aggregates local optimal estimates by recasting the problem of interest as a missing data problem. In addition, we propose incorporating posterior samples from individual sites, which can provide partial information on the missing quantities and improve the efficiency of parameter estimates while still maintaining the differential privacy property, thus mitigating the risk of information leakage. The proposed approach allows for proper statistical inference and can accommodate sparse regression, while preserving privacy. We provide theoretical investigation of the asymptotic properties of the proposed method and of its differential privacy guarantee. Our method is shown in simulations and real data analyses to outperform several recently developed distributed analysis methods. This is joint work with Changgee Chang and Zhiqi Bu.
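For intuition, here is a generic sketch of aggregating local regression fits without sharing patient-level data: each site sends only summary statistics, and a central node combines them. This precision-weighted combination is a standard stand-in, not the authors' missing-data or posterior-sampling method, and it provides no differential privacy on its own:

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([1.0, -2.0])

def local_fit(n):
    """Fit a regression locally; only summary statistics leave the site."""
    X = rng.normal(size=(n, 2))
    y = X @ beta_true + rng.normal(size=n)
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    return beta_hat, XtX         # no patient-level rows are shared

fits = [local_fit(n) for n in (200, 500, 300)]  # three "sites"

# Precision-weighted combination of the site-level estimates.
precision = sum(XtX for _, XtX in fits)
combined = np.linalg.solve(precision,
                           sum(XtX @ b for b, XtX in fits))
```

For linear regression this combination exactly reproduces the pooled estimate; the methods in the talk target the harder settings (sparse models, privacy guarantees) where such a simple identity no longer holds.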

April 19th, 2021 - Babak Alipanahi

April 19th - Babak Alipanahi, PhD
Senior Research Scientist at Google Health
University of Toronto

Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology

In this talk, I will describe our efforts at Google Health on automated phenotyping of biobank-scale cohorts. Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous optic nerve head features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; P≤5e−8) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR and primary open-angle glaucoma in the independent EPIC-Norfolk cohort.

April 5th, 2021 - Jonathan Huggins

April 5th - Jonathan Huggins
Assistant Professor
Department of Mathematics & Statistics
Boston University

Algorithmically Robust, General-Purpose Variational Inference

Black-box variational inference (BBVI) has become an increasingly attractive fast alternative to Markov chain Monte Carlo methods for general-purpose approximate Bayesian inference. However, two major obstacles to the widespread use of BBVI methods have remained – particularly in statistical and machine learning applications where reliable inferences are a necessity. The first challenge is that, as I will show, the stochastic optimization methods used for black-box variational inference lack robustness across diverse model types. Motivated by these findings, I will present a more robust and accurate stochastic optimization framework obtained by viewing the underlying optimization algorithm as producing a Markov chain. The second challenge is that variational methods lack post-hoc accuracy measures that are both theoretically justified and computationally efficient. To close this gap, I will present rigorous bounds on the error of posterior mean and uncertainty estimates that arise from full-distribution approximations, as in variational inference. These bounds are widely applicable, as they require only that the approximating and exact posteriors have polynomial moments. The bounds are also computationally efficient for variational inference because they require only standard values from variational objectives, straightforward analytic calculations, and simple Monte Carlo estimates. Our accuracy bounds and optimization framework point toward a new and improved workflow for more trustworthy variational inference.

March 22nd, 2021 - Molei Liu

March 22nd - Molei Liu
4th Year PhD Student
Department of Biostatistics
Harvard Chan School of Public Health

Federated learning of large-scale heterogeneous healthcare data from multiple health systems

Evidence-based decision-making often relies on meta-analyzing multiple studies, which enables more precise estimation, more powerful knowledge extraction, and investigation of generalizability. For large-scale healthcare data from multiple sites, such integration usually encounters practical problems including individual data privacy, high dimensionality, and heterogeneity of data distributions across different sites. I will introduce our recent progress in developing federated learning methods for analyzing electronic health records (EHR) data from multiple healthcare systems. Our methods preserve individual data privacy and can be used for model-based estimation, prediction, variable selection, and hypothesis testing. They accommodate ultra-high dimensionality as well as model and design heterogeneity, and are shown to be statistically efficient. We also demonstrate their use through two real-world examples.

March 8th, 2021 - Sohini Ramachandran

March 8th – Sohini Ramachandran, PhD
Associate Professor of Ecology and Evolutionary Biology
Associate Professor of Computer Science
Director, Center for Computational Molecular Biology

Leveraging linkage disequilibrium to gain insights into human complex trait architecture

Since 2005, genome-wide association (GWA) datasets have been largely biased toward sampling European-ancestry individuals, and recent studies have shown that GWA results estimated from European-ancestry individuals apply heterogeneously in non-European-ancestry individuals. Here, we argue that enrichment analyses which aggregate SNP-level association statistics at multiple genomic scales—to genes and pathways—have been overlooked and can generate biologically interpretable hypotheses regarding the genetic basis of complex trait architecture. We illustrate examples of the insights generated by enrichment analyses while studying 25 continuous traits assayed in 566,786 individuals from seven self-identified human ancestries in the UK Biobank and the Biobank Japan, as well as 44,348 admixed individuals from the PAGE consortium including cohorts of African-American, Hispanic and Latin American, Native Hawaiian, and Native American individuals. By testing for statistical associations at multiple genomic scales, enrichment analyses also illustrate the importance of reconciling contrasting results from association tests, heritability estimation, and prediction models in order to make personalized medicine a reality for all.


February 22nd, 2021 - Isaac Kohane

February 22nd - Isaac Kohane, MD, PhD
Chair of the Department of Biomedical Informatics
Harvard Medical School

Keeping the Doctor in the Loop

Medicine is driving many investigators from the machine learning community to the exciting opportunities presented by applying their methodological toolkits to improve patient care. They are inspired by the impressive successes in image analysis (e.g. in radiology, pathology and dermatology) to proceed to broad application to decision support across the time series of patient encounters with healthcare. I will examine closely some of the under-appreciated assumptions in that research/engineering agenda and how ignoring these will limit success in medical applications and conversely how these assumptions define a necessary and ambitious research program in shared human-ML decision making.

November 9th, 2020 - James Zou

James Zou, PhD
Stanford University

Assistant Professor of
Biomedical Data Science, CS and EE
Chan-Zuckerberg Investigator
Faculty Director of Stanford AI for Health

Computer Vision to Deeply Phenotype Human Diseases across Physiological, Tissue, and Molecular Scales

I will present new computer vision algorithms to learn complex morphologies and phenotypes that are important for human diseases. I will illustrate this approach with examples that capture physical scales from macro to micro: 1) video-based AI to assess heart function (Ouyang et al, Nature 2020), 2) generating spatial transcriptomics from histology images (He et al, Nature BME 2020), 3) learning morphodynamics of immune cells, and 4) making genome editing safer (Leenay et al, Nature Biotech 2019). Throughout the talk I illustrate the general principles and tools for human-compatible ML that we’ve developed to enable these technologies (Ghorbani et al, NeurIPS 2020; Abid et al, Nature MI 2020).

October 19th, 2020 - Marinka Zitnik

Marinka Zitnik, PhD
Assistant Professor of Biomedical Informatics
Graph Neural Networks for Biomedical Data

The success of machine learning depends heavily on the choice of features on which the methods are applied. For that reason, much of the actual effort in deploying algorithms goes into the engineering of features that support effective learning. In this talk, I describe our efforts to expand the scope and ease the applicability of machine learning on interconnected data and networks. First, I outline our methods for graph representation learning. The methods specify deep graph neural functions that map nodes in a graph to points in a compact vector space, termed embeddings. Importantly, these graph neural methods are optimized to embed networks such that performing algebraic operations in the learned embedding space reflects the network topology. We show how embeddings enable repurposing of drugs for new diseases, including COVID-19, and the discovery of dozens of drug combinations safe in patients, with considerably fewer unwanted side effects than today’s treatments. Further, embeddings allow for accurate molecular phenotyping by identifying drug targets, disease proteins, and molecular functions better than much more complex algorithms. Lastly, I describe our efforts in learning actionable representations that allow users of our models to ask what-if questions and receive predictions that are accurate and can be interpreted meaningfully.

October 5th, 2020 - Rui Duan, Harvard

Rui Duan, PhD
Assistant Professor of Biostatistics


Efficient Integration of EHR and Other Healthcare Datasets

The growth of availability and variety of healthcare data sources has provided unique opportunities for data integration and evidence synthesis, which can potentially accelerate knowledge discovery and enable better clinical decision making. However, many practical and technical challenges, such as data privacy, high-dimensionality and heterogeneity across different datasets, remain to be addressed. In this talk, I will introduce several methods for effective and efficient integration of electronic health records and other healthcare datasets. Specifically, we develop communication-efficient distributed algorithms for jointly analyzing multiple datasets without the need of sharing patient-level data. Our algorithms do not require iterative communication across sites, and are able to account for heterogeneity across different datasets. We provide theoretical guarantees for the performance of our algorithms, and examples of implementing the algorithms to real-world clinical research networks.

September 30th, 2020 - Eran Segal

Wednesday, September 30th
Eran Segal, PhD
Computational Biologist

Personalizing treatments using microbiome and clinical data

Accumulating evidence supports a causal role for the human gut microbiome in obesity, diabetes, metabolic disorders, cardiovascular disease, and numerous other conditions. I will present our research on the role of the human microbiome in health and disease, ultimately aimed at developing personalized medicine approaches that combine human genetics, microbiome, and nutrition.
In one project, we tackled the subject of personalization of human nutrition, using a cohort of over 1,000 people in which we measured blood glucose response to >50,000 meals, lifestyle, medical and food frequency questionnaires, blood tests, genetics, and gut microbiome. We showed that blood glucose responses to meals greatly vary between people even when consuming identical foods; devised the first algorithm for accurately predicting personalized glucose responses to food based on clinical and microbiome data; and showed that personalized diets based on our algorithm successfully balanced blood glucose levels in prediabetic individuals.
Using the same cohort, we also studied the set of metabolites circulating in the human blood, termed the serum metabolome, which contain a plethora of biomarkers and causative agents. With the goal of identifying factors that determine levels of these metabolites, we devised machine learning algorithms that predict metabolite levels in held-out subjects. We show that a large number of these metabolites are significantly predicted by the microbiome and unravel specific bacteria that likely modulate particular metabolites. These findings pave the way towards microbiome-based therapeutics aimed at manipulating circulating metabolite levels for improved health.
Finally, I will present an algorithm that we devised for identifying variability in microbial sub-genomic regions. We find that such Sub-Genomic Variations (SGVs) are prevalent in the microbiome across multiple microbial phyla, that they are associated with bacterial fitness, and that their member genes are enriched for CRISPR-associated and antibiotic-producing functions and depleted of housekeeping genes. We find over 100 novel associations between SGVs and host disease risk factors and uncover possible mechanistic links between the microbiome and its host, demonstrating that SGVs constitute a new layer of metagenomic information.

March 9th, 2020 - Shirley Liu, DFCI

Shirley Liu, DFCI
Professor of Biostatistics
Harvard University

I will discuss two algorithms that our laboratory developed to extract useful cancer immunology insights from treatment-naïve RNA-seq samples in The Cancer Genome Atlas. First, we developed a computational method, TRUST, that can assemble T cell receptor (TCR) and B cell receptor (BCR) complementarity-determining regions (CDR3s) from unselected bulk tumor RNA-seq data (Li et al, Nat Genet 2016; Li et al, Nat Genet 2019). Specifically, IgG1 and IgG3 B cells are associated with natural killer cell activity in the tumors, implicating their important roles in B-cell-mediated tumor immunity (Zhang et al, Genome Med 2019). Second, we derived Tumor Immune Dysfunction and Exclusion (TIDE) gene expression signatures from pretreatment tumors to predict patient response to anti-PD1 and anti-CTLA4 treatment (Jiang et al, Nat Med 2018). Recently we integrated the omics data for over 33K samples in 188 tumor cohorts from public databases, 998 tumors from twelve ICB clinical studies, and eight immune-related CRISPR screens, and used it for hypothesis generation, biomarker optimization, and patient stratification (Fu et al, Genome Med 2020). Our work demonstrates how tumor RNA-seq, even on treatment-naïve tumors, can be a cost-effective way to characterize the tumor microenvironment and immunity.

February 24, 2020 - Lucas Janson, Harvard

Lucas Janson, PhD
Assistant Professor of Statistics
Harvard University
Should We Model X in High-Dimensional Inference?

Many important scientific questions are about the relationship between a response variable Y and a set of explanatory variables X. For instance, Y might be a disease state and the X’s might be SNPs, and the scientific question is which of these SNPs are related to the disease. For answering such questions, most statistical methods assume a model for the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the joint distribution of X, especially for high-dimensional data. First, modeling X can lead to assumptions that are more realistic and verifiable. Second, there are substantial methodological payoffs in terms of much greater flexibility to use, e.g., domain knowledge and state-of-the-art machine learning for powerful inference while maintaining precise theoretical guarantees. I will briefly mention some of my recent and ongoing work on methods for high-dimensional inference that model X instead of Y, as well as some challenges and interesting directions for the future.
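One way to see the payoff of modeling X is a toy conditional randomization test: when the distribution of a candidate variable is known, resampling it from that distribution gives a valid test with no model for Y | X at all. All details below are an invented illustration of the idea, not a method from the talk:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = rng.normal(size=n)          # truly associated with Y
X2 = rng.normal(size=n)          # null variable, independent of Y
y = 2.0 * X1 + rng.normal(size=n)

def stat(xj, y):
    """Any test statistic is valid here; we use absolute correlation."""
    return abs(np.corrcoef(xj, y)[0, 1])

def crt_pvalue(xj, y, draws=500):
    # The model for X is known in this toy (standard normal, independent
    # of the other variables), so resampling xj yields an exact null.
    obs = stat(xj, y)
    null = [stat(rng.normal(size=n), y) for _ in range(draws)]
    return (1 + sum(t >= obs for t in null)) / (1 + draws)

p_null = crt_pvalue(X2, y)       # should typically be non-small
p_signal = crt_pvalue(X1, y)     # should be small
```

The statistic inside stat() could be replaced by any machine learning measure of association without invalidating the test, which is the flexibility the abstract points to.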

February 10, 2020 - David Van Valen, Caltech

Monday, February 10, 2020
2:15 – 3:15 PM 
LAHEY Room, 5th Floor
Countway Library, HMS

David Van Valen, MD, PhD
Assistant Professor of Biology and Biological Engineering
The Division of Biology and Biological Engineering, Caltech

Single cell biology in a Software 2.0 world

The study of living systems is challenging because of their high dimensionality, spatial and temporal heterogeneity, and high degree of variability in the fundamental unit of life – the living cell. Recently, advances in genomics, imaging, and machine learning are enabling researchers to tackle all of these challenges. In this talk, I describe my research group’s efforts to use machine learning to connect imaging and genomics measurements to enable high-dimensional measurements of living systems. We show how deep learning-based image segmentation enables the quantification of dozens of protein markers in spatial proteomics measurements of breast cancer and describe a new method for deep learning-based cell tracking which will enable information-theoretic measurements of cell signaling. Lastly, we relay our efforts in deploying deep learning models in the cloud for large-scale deep learning-enabled image analysis. By using single-cell imaging as the read out for a genetic screen, we show how we can identify deep connections between host cell energetics and viral decision making in a model system of viral infections.

December 2, 2019 - Ben Reis, Harvard Medical School

Monday, December 2, 2019
1:15 – 2:15 PM
MINOT Room, 5th Floor
Countway Library, HMS

Ben Reis
Faculty, Computational Health Informatics Program (CHIP)
Director, Predictive Medicine Group, Computational Health Informatics Program (CHIP)
Assistant Professor of Pediatrics, Harvard Medical School

The Age of Predictive Medicine

November 18, 2019 - Jean Fan, Harvard University

Monday, November 18, 2019
1:15 – 2:15 PM
LAHEY Room, 5th Floor
Countway Library, HMS

Jean Fan
Postdoctoral Fellow, Department of Chemistry and Chemical Biology, Harvard University
Incoming Assistant Professor, Department of Biomedical Engineering, Johns Hopkins University (July 2020)

Analysis of spatially-resolved transcriptomic profiles by MERFISH reveals subcellular RNA compartmentalization and cell-cycle-associated transcriptional velocity

The spatial organization of RNAs within cells and spatial patterning of cells within tissues play crucial roles in many biological processes. Here, we demonstrate that multiplexed error-robust FISH (MERFISH) can achieve near-genome-wide, spatially resolved RNA profiling of individual cells with high accuracy and high detection efficiency. Using this approach, we identified RNA species enriched in different subcellular compartments, observed transcriptionally distinct cell states corresponding to different cell-cycle phases, and revealed spatial patterning of transcriptionally distinct cells. Spatially resolved transcriptome quantification within cells further enabled RNA velocity and pseudotime analysis, which revealed numerous genes with cell cycle-dependent expression. We anticipate that spatially resolved transcriptome analysis will advance our understanding of the interplay between gene regulation and spatial context in biological systems.

October 28, 2019 - Rachel Nethery, HSPH

Monday, October 28, 2019
1:15 – 2:15 PM
LAHEY Room, 5th Floor
Countway Library, HMS

Rachel Nethery
Assistant Professor of Biostatistics, Harvard T.H. Chan School of Public Health

Causal inference and machine learning approaches for evaluation of the health impacts of large-scale air quality regulations

We develop a causal inference approach to estimate the number of adverse health events prevented by large-scale air quality regulations via changes in exposure to multiple pollutants. This approach is motivated by regulations that impact pollution levels in all areas within their purview. We introduce a causal estimand called the Total Events Avoided (TEA) by the regulation, defined as the difference in the expected number of health events under the no-regulation pollution exposures and the observed number of health events under the with-regulation pollution exposures. We propose a matching method and a machine learning method that leverage high-resolution, population-level pollution and health data to estimate the TEA. Our approach improves upon traditional methods for regulation health impact analyses by clarifying the causal identifying assumptions, utilizing population-level data, minimizing parametric assumptions, and considering the impacts of multiple pollutants simultaneously. To reduce model-dependence, the TEA estimate captures health impacts only for units in the data whose anticipated no-regulation features are within the support of the observed with-regulation data, thereby providing a conservative but data-driven assessment to complement traditional parametric approaches. We apply these methods to investigate the health impacts of the 1990 Clean Air Act Amendments in the US Medicare population.
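
As a toy sketch of the TEA estimand, the counterfactual no-regulation event counts can be imputed by matching; everything below (the single covariate, the Poisson counts, the 1-nearest-neighbour match) is a hypothetical illustration, not the authors’ actual matching estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Historical (no-regulation) units: a covariate x and observed event counts.
x_hist = rng.uniform(0, 10, size=500)
events_hist = rng.poisson(lam=2.0 + 0.5 * x_hist)

# With-regulation units: same covariate space, fewer events on average.
x_reg = rng.uniform(0, 10, size=200)
events_reg = rng.poisson(lam=1.0 + 0.5 * x_reg)

def impute_no_regulation(x_new, x_pool, y_pool):
    """1-NN match: borrow the event count of the closest historical unit."""
    idx = np.abs(x_pool[None, :] - x_new[:, None]).argmin(axis=1)
    return y_pool[idx]

imputed = impute_no_regulation(x_reg, x_hist, events_hist)

# TEA = expected events under no regulation minus observed events with it.
tea = imputed.sum() - events_reg.sum()
print(f"Estimated Total Events Avoided: {tea}")
```

The support restriction described in the abstract would correspond to discarding with-regulation units whose covariates fall outside the range of the historical pool before matching.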

October 7, 2019 - Hyunghoon Cho, Broad

Monday, October 7, 2019
1:15 – 2:15 PM
LAHEY Room, 5th Floor
Countway Library, HMS

Hyunghoon Cho, PhD
Schmidt Fellow at the Broad Institute of MIT and Harvard

Biomedical Data Sharing and Analysis with Privacy

Researchers around the globe are gathering biomedical information at a massive scale. However, privacy and intellectual property concerns hinder open sharing of these data, presenting a key barrier to collaborative science. In this talk, I will describe how modern cryptographic tools present a path toward broader data sharing and collaboration in biomedicine as demonstrated by my recent work on genome-wide association studies (GWAS) and pharmacological machine learning. For each domain, I will introduce our efficient privacy-preserving analysis protocol that achieves state-of-the-art accuracy while ensuring the input data remain private throughout the protocol. Our protocols newly achieve scalability to a million genomes or drug compounds by drawing on a set of techniques aimed at reducing redundancy in computation. Key components of our pipelines, including secure principal component analysis (PCA) and secure neural networks, are broadly applicable to other data science domains. These results lay a foundation for more effective and cooperative biomedical research where individuals and institutes across nations pool their data together to enable novel life-saving discoveries.
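
As a flavor of the cryptographic building blocks involved, here is a toy additive secret-sharing scheme in which parties pool a sum without revealing their individual inputs; this is only an illustrative primitive, not the talk’s actual GWAS protocol (which adds secure multiplication, fixed-point encoding, and much more):

```python
import secrets

P = 2**61 - 1  # a Mersenne prime used as the field modulus

def share(value, n_parties):
    """Split `value` into n additive shares modulo P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares modulo P."""
    return sum(shares) % P

# Three institutes pool private case counts without revealing them:
private_counts = [120, 340, 75]
all_shares = [share(v, 3) for v in private_counts]

# Each party locally sums the one share it holds from every input;
# only these partial sums are exchanged, never the raw counts.
partial_sums = [sum(col) % P for col in zip(*all_shares)]
total = reconstruct(partial_sums)
print(total)  # 535
```

Any single share is a uniformly random field element, so it carries no information about the underlying value on its own.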

Hoon Cho is a Schmidt Fellow at the Broad Institute of MIT and Harvard. He uses mathematics, cryptography, and machine learning to enhance the information we can gather from massive biomedical datasets. Cho is especially interested in solving problems in the areas of biomedical data privacy, single-cell genomics, and network biology. A key focus of his research is to broaden data sharing and collaboration in biomedical research by developing secure methods for analyzing sensitive data from individuals. He works closely with the Broad’s Data Sciences Platform.

Cho received his Ph.D. in electrical engineering and computer science at MIT, advised by Bonnie Berger. He also holds an M.S. in computer science and a B.S. with honors in computer science from Stanford University.

May 6, 2019 - Luwan Zhang, HSPH

Lahey Room*

Luwan Zhang, PhD
Postdoctoral Research Fellow at Harvard T.H. Chan School of Public Health

Automating consensus medical knowledge extraction using an ensemble of healthcare data from Electronic Health Records, insurance claims, and medical literature

The increasingly widespread adoption of Electronic Health Records (EHR) has enabled phenotypic information collection at an unprecedented granularity and scale. Despite its great promise for advancing clinical decision making, two major challenges must be solved to fully unleash its power. The first is that a medical concept (e.g. a diagnosis, prescription, or symptom) is often described by various synonyms, which greatly hinders data integration and analysis reproducibility. The second stems from the inherent heterogeneity across different EHR systems, which calls for an efficient way of combining them into a more general and unbiased representation of the underlying network linking medical concepts. In this talk, I will discuss recent advances addressing these two challenges, including a novel spectral clustering method for grouping synonymous codes and a graph learning algorithm for discovering a consensus clinical knowledge graph from multiple up-to-date data sources.

April 29, 2019 - Matthew T. Harrison, Brown

Minot Room*

Matthew T. Harrison, PhD
Associate Professor
Division of Applied Mathematics
Brown University

Bayesian methods for brain-computer interfaces

Brain-computer interfaces (BCIs) allow people to directly control devices with their thoughts. The BrainGate clinical trial is studying the use of intracortical BCIs in people with paralysis. Research participants in this trial have successfully used BCIs to control computer cursors, robotic arms, and other devices. A key component of this technology, called neural decoding, is the algorithm that translates the recorded brain signal into a movement command. In this talk I will describe some recent advances in neural decoding for BCIs, including a fast, learnable approximation for non-Gaussian filtering that we call the discriminative Kalman filter. Bayesian methods play a prominent role. This is joint work with David Brandman, Michael Burkhart, and the BrainGate research team.

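
For orientation, the classic linear-Gaussian Kalman filter that the discriminative variant builds on can be sketched as follows (a generic textbook filter with toy numbers, not the BrainGate decoder itself; the discriminative version replaces the linear measurement model with a learned regression from neural signal to state):

```python
import numpy as np

def kalman_filter(zs, A, W, H, Q, x0, P0):
    """Run predict/update steps over observations zs; return state means."""
    x, P = x0, P0
    means = []
    for z in zs:
        # Predict: propagate the state estimate through the dynamics model.
        x = A @ x
        P = A @ P @ A.T + W
        # Update: correct with the new observation via the Kalman gain.
        S = H @ P @ H.T + Q
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(len(x)) - K @ H) @ P
        means.append(x.copy())
    return np.array(means)

# Toy example: 1-D state observed directly with noise.
A = np.array([[1.0]]); W = np.array([[0.01]])
H = np.array([[1.0]]); Q = np.array([[0.25]])
zs = np.array([[1.1], [0.9], [1.0], [1.05]])
means = kalman_filter(zs, A, W, H, Q, x0=np.zeros(1), P0=np.eye(1))
print(means[-1])  # converges toward the true state near 1.0
```
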
Matthew T. Harrison, Ph.D., is an Associate Professor of Applied Mathematics at Brown University in Providence, RI. His current research focuses on topics in statistics, machine learning, and data science, including statistical methods in neuroscience, conditional inference, multivariate binary time series and point processes, nonparametric Bayesian mixture models, importance sampling, and random graphs and tables. His research has been published in the leading journals of multiple disciplines, and he is an award-winning teacher at Brown. Dr. Harrison received his Ph.D. in Applied Mathematics from Brown University in 2005. Afterwards he was a postdoctoral member of the Mathematical Sciences Research Institute in Berkeley, CA, and a Visiting Assistant Professor of Statistics at Carnegie Mellon University in Pittsburgh, PA. He returned to Brown in 2009.

April 8, 2019 - Ani Eloyan, Brown

Lahey Room*

Ani Eloyan, PhD
Assistant Professor of Biostatistics
Brown University

Estimation of Tumor Heterogeneity for Radiomics in Cancer

Cancer patients routinely undergo radiological evaluations in which images of various modalities, including computed tomography, positron emission tomography, and magnetic resonance imaging, are collected for diagnosis and to evaluate disease progression. Tumor characteristics, often referred to as tumor heterogeneity, can be computed from these clinical images and used as predictors of disease progression and patient survival. Several approaches to quantifying tumor heterogeneity have been proposed, including simple intensity-histogram-based measures, metrics attempting to quantify average distance from a homogeneous surface, and texture-analysis-based methods. In this talk, I will describe a novel clustering-based statistical approach to quantifying tumor heterogeneity for prediction of survival of lung cancer patients using linear Cox proportional hazards models and penalized regression. Time-dependent receiver operating characteristic analysis will be used to compare the various heterogeneity estimation approaches.

April 1, 2019 - Peter X.K. Song, University of Michigan

Lahey Room*

Peter X.K. Song, PhD
Professor of Biostatistics
Department of Biostatistics
School of Public Health
University of Michigan

Renewable Estimation and Incremental Inference with Streaming Data

I will present a new statistical paradigm for the analysis of streaming data based on renewable estimation and incremental inference in the context of generalized linear models. Streaming data arise in many practical fields due to significant advances in technologies for data collection and storage. Our proposed renewable estimation enables us to sequentially update the maximum likelihood estimate and associated inference with current data and summary statistics of historic data, with no use of the historic raw data themselves. In the implementation, we design a new data flow, called the Rho architecture, to accommodate the storage of current and historic data and to communicate with the computing layer of the Spark system, facilitating sequential learning. We establish both estimation consistency and asymptotic normality for the renewable estimation and incremental inference. We illustrate our methods with numerical examples from both simulation experiments and real-world data analyses. This is joint work with Lan Luo.
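
The renewable-estimation idea is easiest to see in the linear-model special case, where updating summary statistics batch by batch reproduces the full-data fit exactly; this sketch is an illustrative simplification of the talk’s GLM framework:

```python
import numpy as np

class RenewableOLS:
    """Keep only X'X and X'y; fold in each batch and re-solve.
    The historic raw data are never revisited."""
    def __init__(self, p):
        self.xtx = np.zeros((p, p))
        self.xty = np.zeros(p)

    def update(self, X, y):
        # Fold a new batch into the running summary statistics.
        self.xtx += X.T @ X
        self.xty += X.T @ y

    @property
    def beta(self):
        return np.linalg.solve(self.xtx, self.xty)

rng = np.random.default_rng(1)
true_beta = np.array([2.0, -1.0])
model = RenewableOLS(p=2)

X_all, y_all = [], []
for _ in range(5):  # five streaming batches of 100 observations each
    X = rng.normal(size=(100, 2))
    y = X @ true_beta + rng.normal(scale=0.1, size=100)
    model.update(X, y)
    X_all.append(X); y_all.append(y)

# The streamed estimate equals the full-data least-squares fit.
full = np.linalg.lstsq(np.vstack(X_all), np.concatenate(y_all), rcond=None)[0]
print(np.allclose(model.beta, full))  # True
```

For a general GLM the update is no longer exact, which is where the paper’s incremental-inference machinery comes in.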

March 11, 2019 - Hongfang Liu, Mayo Clinic College of Medicine

Lahey Room*

Hongfang Liu, Ph.D.
Professor of Biomedical Informatics
Mayo Clinic College of Medicine

Accelerating Digital Health Sciences through Advanced Informatics and Analytics for Clinical Excellence (ADVANCE)      

The digital revolution in healthcare has led to the generation of a large amount of data from diverse sources which create tremendous opportunities for coordinating and monitoring patient care, analyzing and improving systems of care, conducting research to discover new treatments, assessing the effectiveness of medical interventions, and advancing population health. In this talk, I will provide an overview of the activities carried out in the ADVANCE lab.  I will take a deep dive into three specific research projects: i) digital phenotyping infrastructure empowered by natural language processing (NLP) and machine learning, ii) precision medicine delivery leveraging expert or data-driven semantics, and iii) risk stratification using advanced machine learning strategies.

WEDNESDAY, March 6, 2019 - Richard Neher, University of Basel

Lahey Room*

Richard Neher, PhD
Head of Research Group, Biozentrum
University of Basel

Tracking and predicting the spread of pathogens and resistance

Routine whole genome sequencing of viruses and bacteria now enables the reconstruction of spread and evolution of pathogens at unprecedented resolution. Influenza viruses, for example, are collected by more than one hundred national influenza centers and their sequences are available publicly within weeks of sampling. To analyze this continuous data stream, we developed automated analysis pipelines and web-based visualization available on nextstrain.org. In addition, we predict which viral variants will dominate future influenza seasons. These predictions now routinely inform the vaccine strain selection process by the WHO.

The nextstrain tool-chain and visualization platform has since been extended to a large number of viruses and bacterial pathogens. However, traditional phylogenetic approaches are of limited use for many bacterial pathogens, since drug resistance spreads by horizontal transfer. Furthermore, the relevant genes are flanked by repetitive sequences, and short-read sequencing data do not assemble into contigs long enough to resolve their spread. Long-read sequencing is necessary to successfully assemble these genomes and reconstruct the evolutionary history of acquired resistance. We sequenced a large fraction of the carbapenemase-resistant gram-negative bacteria collected at University Hospital Basel using Oxford Nanopore technology, and developed a number of semi-automatic assembly quality-assessment metrics as well as computational methods to reconstruct plasmid evolution. Long-read sequencing should soon allow comprehensive and routine tracking of AMR at the gene, genome, and pan-genome level.

February 25, 2019 - Alan Beggs, Boston Children's Hospital

Minot Room*

Alan Beggs
Director, The Manton Center
Sir Edwin and Lady Manton Professor of Pediatrics
Boston Children’s Hospital

Genomic sequencing of newborns: the BabySeq Study

February 11, 2019 - Jian Ma, Carnegie Mellon

Minot Room*

Jian Ma, PhD
Associate Professor, Computational Biology
School of Computer Science
Carnegie Mellon University

Algorithms for studying nuclear genome compartmentalization

The chromosomes of the human genome are organized in three dimensions by compartmentalizing the cell nucleus. However, the principles underlying nuclear compartmentalization and the functional impact of this organization are poorly understood. In this talk, I will introduce some of our recent work on developing new computational strategies based on whole-genome mapping data to study the spatial localization of chromosome regions within the cell nucleus, both in different cellular conditions and across different mammalian species. We hope that these methods will provide new insights into nuclear spatial and functional compartmentalization.

February 4, 2019 - Franziska Michor, DFCI

Lahey Room*

Franziska Michor
Professor of Computational Biology
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
Department of Biostatistics, Harvard T.H. Chan School of Public Health
Department of Stem Cell and Regenerative Biology, Harvard University

Optimizing treatment strategies in glioblastoma: mathematical modeling and clinical validation

January 28, 2019 - Haytham Kaafarani, HMS/MGH

Minot Room*

Haytham Kaafarani, MD, MPH, FACS
Associate Professor of Surgery, Harvard Medical School
Director, Patient Safety & Quality
Director, Clinical Research & International Research Fellowship
Division of Trauma, Emergency Surgery and Surgical Critical Care
Massachusetts General Hospital

Should I Operate on Marie? Big Data & Machine Learning in the Operating Room

Accurately predicting the risk of adverse patient outcomes following surgery remains challenging. Most existing risk assessment tools presume that the impact of risk factors is linear and cumulative. We used novel and advanced machine-learning techniques to better model the risk of postoperative mortality and morbidity following emergency surgery (ES). We trained our predictive models on all ES patients in the ACS-NSQIP 2007-2013 database, leveraging Optimal Classification Trees (OCT) to predict postoperative mortality, morbidity, and 20 specific complications (e.g. sepsis, surgical site infection). Unlike classic methods (e.g. logistic regression), OCT is adaptive and rebuilds itself with each variable, thus accounting for non-linear interactions among variables. The resulting OCT models are also readily interpretable, unlike other machine learning approaches that yield black-box models. In preliminary testing, our approach delivers significant improvements over all existing methods in the literature (e.g. the Emergency Surgery Score (ESS) and the ACS-NSQIP calculators), and can accurately predict outcomes with as few as 4, and at most 10, variables per tree.

December 10, 2018 - Alexander Gusev, DFCI

Minot Room*

Alexander Gusev, PhD
Assistant Professor of Medicine
Department of Medicine
Dana-Farber Cancer Institute

Leveraging Genome-wide Association Studies to elucidate germline cancer mechanisms and germline-somatic interactions

Genome-wide association studies (GWAS) have been a powerful tool for identifying genetic variants associated with disease risk. However, understanding the mechanisms by which these variants impact phenotype remains a great challenge. I will discuss methods that integrate multi-omics transcriptional and epigenetic data with GWAS to connect non-coding risk variants to their cellular context, target regulatory elements, and susceptibility genes. I will also discuss a new statistical method for quantifying the causal contribution of multi-omics features to total disease heritability, and propose a framework for GWAS-style analysis of polygenic germline-somatic interactions in large-scale clinical data.

October 29, 2018 - Junwei Lu, Biostatistics, HSPH

Minot Room*

Junwei Lu, PhD
Assistant Professor
Department of Biostatistics
Harvard T.H. Chan School of Public Health

Combinatorial Inference for Brain Imaging Datasets

Abstract: We propose combinatorial inference to explore the global topological structures of graphical models. In particular, we conduct hypothesis tests on many combinatorial graph properties, including connectivity, hub detection, and perfect matching. Our methods can be applied to any graph property that is invariant under the deletion of edges. We also develop a generic minimax lower bound showing the optimality of the proposed method for a large family of graph properties. We apply our methods to neuroscience by discovering hub voxels that contribute to visual memories.
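
The graph properties in question are monotone under edge deletion: removing edges can only destroy them, never create them. A minimal sketch of one such property check, connectivity of a graph obtained by thresholding a hypothetical estimated precision matrix, looks like this (illustration of the property only, not the inferential procedure):

```python
import numpy as np
from collections import deque

def is_connected(adj):
    """BFS reachability check on a boolean adjacency matrix."""
    n = adj.shape[0]
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

# Hypothetical estimated precision matrix for 4 voxels.
theta = np.array([
    [1.0, 0.4, 0.0, 0.0],
    [0.4, 1.0, 0.3, 0.0],
    [0.0, 0.3, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
])
adj = (np.abs(theta) > 0.1) & ~np.eye(4, dtype=bool)
print(is_connected(adj))  # True: the chain 0-1-2-3 is connected

adj_cut = adj.copy()
adj_cut[1, 2] = adj_cut[2, 1] = False  # deleting an edge can disconnect
print(is_connected(adj_cut))  # False
```
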

October 15, 2018 - Kun-Hsing Yu, HMS

Lahey Room*

Kun-Hsing Yu, MD, PhD
Department of Biomedical Informatics
Harvard Medical School

Toward precision oncology: quantitative pathology, multi-omics, and artificial intelligence

Precision medicine aims to account for individual differences to guide disease treatment. Omics profiling and digital pathology technologies provide a unique opportunity to characterize patients’ genotypes and microscopic phenotypes at an unprecedented resolution. However, the terabytes of data resulting from these high-throughput methods pose a great challenge for analyzing and integrating the signals across modalities. To address this challenge, we developed machine-learning methods to connect histopathology and omics aberrations with patients’ phenotypes. Our methods employ convolutional neural networks to distinguish patients with different cancer types and transcriptomic subtypes. We further built an integrative histopathology-transcriptomics model that generates better prognostic predictions for stage I lung adenocarcinoma patients than gene expression or histopathology studies alone. To facilitate studies on other cancer types, we established a generalizable deep learning platform for quantitative histopathology analyses (patent pending), with successful applications to characterizing the omics subtypes and treatment responses of brain, ovarian, gastric, colorectal, renal, and hematological cancers. Our results suggest that integrative omics-histopathology analyses can improve diagnosis and prognosis prediction and thereby contribute to precision oncology. Our approach is extensible to analyzing histopathology and omics changes in other complex diseases.

October 1, 2018 - Sebastien Haneuse, HSPH

Lahey Room*

Sebastien Haneuse, PhD
Associate Professor of Biostatistics
Department of Biostatistics
Harvard T.H. Chan School of Public Health

Adjusting for selection bias due to missing data in electronic health records-based research

While EHR data provide unique opportunities for public health research, selection due to incomplete data is an under-appreciated source of bias. When framed as a missing-data problem, standard methods could be applied, although these typically fail to acknowledge the often-complex interplay of clinical decisions made by patients, providers, and the health system that is required for data to be complete. As such, residual selection bias may remain. Building on a recently-proposed framework for characterizing how data arise in EHR-based studies, we develop and evaluate a statistical framework for regression modeling, based on inverse probability weighting, that adjusts for selection bias in the complex setting of EHR-based research. We show that the resulting estimator is consistent and asymptotically normal, and derive the form of the asymptotic variance. We use simulations to highlight the potential for bias when standard approaches are used to account for selection bias, and evaluate the small-sample operating characteristics of the proposed framework. Finally, the methods are illustrated using data from an on-going, multi-site EHR-based study of the effect of bariatric surgery on BMI.
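
A minimal simulated sketch of the inverse-probability-weighting idea follows; for clarity the selection probabilities are taken as known, whereas the proposed framework estimates them from a model of how the EHR data arise:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
x = rng.normal(size=n)                      # covariate driving selection
y = 2.0 + x + rng.normal(size=n)            # outcome; true mean is 2.0

# Probability a patient's record is complete depends on x (selection).
p_complete = 1 / (1 + np.exp(-(1.0 + 1.5 * x)))
complete = rng.random(n) < p_complete

naive = y[complete].mean()                  # complete-case mean: biased up
w = 1 / p_complete[complete]                # inverse-probability weights
ipw = np.average(y[complete], weights=w)    # weighted mean: bias corrected

print(f"naive={naive:.3f}  ipw={ipw:.3f}  truth=2.000")
```

Patients with high x are over-represented among complete records, so the naive mean overshoots; weighting each complete case by the inverse of its completeness probability restores the target population.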

September 17, 2018 - Andrew Beam, HMS

Lahey Room*

Andrew Beam, PhD
Department of Biomedical Informatics
Harvard Medical School

Machine Learning and Artificial Intelligence in Healthcare: Balancing the Promise and the Peril

Due to the emergence of large sources of healthcare data, machine learning researchers have an unprecedented opportunity to impact the lives of real patients through their work. From understanding the impact of legislation on drug pricing to predicting drug resistance in tuberculosis, data-driven methods allow us to examine healthcare decision points from a multitude of perspectives. In this talk I will highlight several of my own projects that attempt to answer meaningful clinical questions, while providing an overarching conceptual framework for machine learning in healthcare. Though there is immense potential for good, I will conclude with a cautionary example of how machine learning may in fact open the healthcare system to new avenues for harm and abuse through the use of adversarial examples.

Andrew Beam, PhD is a junior faculty member within the Department of Biomedical Informatics at Harvard Medical School. His research develops and applies machine-learning methods to extract meaningful insights from massive clinical datasets. He earned his PhD in bioinformatics for work on Bayesian neural network models of genome-wide association studies. His current work has special focus on machine learning for neonatal/perinatal medicine, but the unifying theme of all of his research is the use of data-driven methods to improve decision-making in healthcare.

May 7, 2018 - Rafael Irizarry, DFCI

Rafael Irizarry
Professor of Applied Statistics
Dana-Farber Cancer Institute
Harvard T.H. Chan School of Public Health

April 30, 2018 - Po-Ru Loh, Harvard Chan School

Po-Ru Loh
Assistant Professor of Medicine
Division of Genetics, Department of Medicine
Brigham and Women’s Hospital and Harvard Medical School

Insights into genetic mosaicism through the lens of statistical phasing

Mosaicism in genetics refers to the state in which cells of a single individual belong to two or more sub-populations with distinct genotypes. Mosaicism can arise either through mutations early in development, or through mutations later in life that subsequently undergo clonal expansion (e.g., cancer). In this talk, I will present new statistical methods that leverage population-based haplotype phasing to enable detection of mosaic chromosomal alterations present at very low cell fractions. I will describe insights into the causes and consequences of age-related clonal hematopoiesis revealed by this approach, and I will also discuss ongoing collaborations applying this methodology to study autism spectrum disorder.

April 2, 2018 - David Sontag, MIT

David Sontag
Hermann L. F. von Helmholtz Career Development Professor of Medical Engineering, Massachusetts Institute of Technology
Assistant Professor of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

AI for Health Needs Causality

Recent success stories of using machine learning for diagnosing skin cancer, diabetic retinopathy, pneumonia, and breast cancer may give the impression that artificial intelligence (AI) is on the cusp of radically changing all aspects of health care. However, many of the most important problems, such as predicting disease progression, personalizing treatment to the individual, drug discovery, and finding optimal treatment policies, all require a fundamentally different way of thinking. Specifically, these problems require a focus on *causality* rather than simply prediction. Motivated by these challenges, my lab has been developing several new approaches for causal inference from observational data. In this talk, I describe our recent work on the deep Markov model (Krishnan, Shalit, Sontag AAAI ’17) and TARNet (Shalit, Johansson, Sontag, ICML ’17).

February 26, 2018 - Vincent Carey, HMS-Channing Lab

Vincent Carey
Professor, Medicine, Harvard Medical School
Associate Biostatistician, Channing Laboratory, Brigham And Women’s Hospital

Semantically rich interfaces for cloud-scale genomics
The problem of matching statistically and computationally efficient inference methods to problems in genome-scale biology is intrinsically difficult. The difficulty is compounded by rapid changes in both domains: computational environments for data management and statistics are evolving both technically and conceptually, and readouts in genome-scale biology are growing in diversity, size, and complexity.
This talk will review some approaches under development in the Bioconductor project to streamline exploratory data analysis for cloud-scale genomic data.  Apart from fostering scalable access to resources of arbitrary size, important concerns include the definition and use of biologically meaningful filters and covariates for integrative modeling.  Examples will be drawn from TCGA, the 10x million neuron dataset, and GTEx.

February 12, 2018 - Bonnie Berger, MIT

Bonnie Berger
Simons Professor of Mathematics
Departments of Electrical Engineering and Computer Science
Head of the Computation & Biology Group, Computer Science and AI Lab
Massachusetts Institute of Technology

Compressive algorithms for biomedical data science

January 29, 2018 - Ben Raphael, Princeton

Ben Raphael
Professor of Computer Science
Princeton University

Phylogenetic Reconstruction of Clonal Evolution in Tumors and Metastases

Cancer is an evolutionary process driven by somatic mutations that accumulate in a population of cells that form a primary tumor.  In later stages of cancer progression, cells migrate from a primary tumor and seed metastases at distant anatomical sites.  I will describe algorithms to reconstruct this evolutionary process from DNA sequencing data of tumors.  These algorithms address challenges that distinguish the tumor phylogeny problem from classical phylogenetic tree reconstruction, including challenges due to mixed samples and complex migration patterns.

January 22, 2018 - Valen Johnson, Texas A&M

Valen Johnson
University Distinguished Professor
Department Head of Statistics
Texas A&M University

Statistical factors that contribute to non-reproducibility of science

This talk examines two factors that lead to the misrepresentation or misinterpretation of scientific studies.  The first factor involves the information conveyed in the report of p-values that are marginally significant at the 0.05 level.  The second factor involves omission of information regarding the prior or marginal probability that a tested hypothesis is true.

December 11, 2017 - Ramnik Xavier, MGH

Ramnik Xavier
Chief, Gastrointestinal Unit, Massachusetts General Hospital
Director, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital
Kurt Isselbacher Professor of Medicine, Harvard Medical School

November 13, 2017 - Emmanouil Dermitzakis

Emmanouil Dermitzakis
Professor, University of Geneva Medical School
Director, Health 2030 Genome Center

Molecular QTLs in 3D: implications for complex disease and cancer

Molecular phenotypes inform us about genetic and environmental effects on cellular and tissue state. The elucidation of the genetic basis of gene expression and other cellular phenotypes is highly informative about the impact of genetic variants on the cell and the subsequent consequences for the organism. In this talk I will discuss recent advances in key areas of the analysis and integration of the genomics of gene expression, chromatin, and cellular phenotypes in human populations and multiple tissues from various cohorts, including the GTEx consortium, and how this assists in the interpretation of regulatory networks and human disease variants. I will also discuss how these recent advances are informing us about the impact of regulatory variation in cancer.

October 30, 2017 - Susan Murphy, Harvard

Cosponsored by the Harvard Data Science Initiative
Reception to follow lecture in the Ballard Room, Countway Library

Susan Murphy
Professor of Statistics,
Radcliffe Alumnae Professor at the Radcliffe Institute, Harvard University and
Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences

Assessing Time-Varying Causal Interactions and Treatment Effects with Applications to Mobile Health

Mobile devices along with wearable sensors facilitate our ability to deliver supportive treatments anytime and anywhere. Indeed, mobile interventions are being developed and employed across a variety of health fields, including supporting HIV medication adherence, encouraging physical activity and healthier eating, and supporting recovery from addiction. A critical question in the optimization of mobile health interventions is: “When, and in which contexts, is it most useful to deliver treatments to the user?” This question concerns time-varying dynamic moderation by context (location, stress, time of day, mood, ambient noise, etc.) of the effectiveness of the treatments on user behavior. In this talk we discuss the micro-randomized trial design and associated data analyses for use in assessing moderation. We illustrate this approach with the micro-randomized trial of HeartSteps, a physical activity mobile intervention.

October 16, 2017 - Jian Peng, University of Illinois at U-C

Jian Peng
Assistant Professor
Department of Computer Science
University of Illinois at Urbana-Champaign

Integration and dissection of molecular networks for functional analysis and disease modeling 

Recent advances in biotechnology have enabled large-scale measurements of molecular interactions, alterations, and expression that occur in human cells. Identifying connections, patterns and deeper functional annotations among such heterogeneous measurements will potentially enhance our capability to identify biological processes underlying diseases, useful biomarkers, and novel drugs. In this talk, I will describe algorithms that use existing molecular interactions to integrate functional genomic data into molecular networks that uncover novel disease-related pathways. First, I will introduce Mashup, a machine learning algorithm that integrates multiple heterogeneous networks into compact topological features for functional inference. Second, I will discuss extensions of Mashup for discovering new disease factors and subnetworks in neurodegeneration and cancer. Finally, I will briefly introduce our most recent work on network-based gene-set functional analysis that extends the reach of current methods in functional enrichment analysis.

October 2, 2017 - Sharon-Lise Normand, Harvard

Sharon-Lise Normand
Professor of Biostatistics in the Dept of Health Care Policy, Harvard Medical School
Professor in the Dept of Biostatistics, Harvard T.H. Chan School of Public Health
Director, Medical Device Epidemiology Network (MDEpiNet) Methodology Center
Director, Massachusetts Data Analysis Center (Mass-DAC)

Data, Statistics, and Inference

Increased access to electronic health information has intensified attempts to understand the effects of new medical interventions in usual-care populations, as well as of state-specific or country-specific public health policy decisions. By conditioning on rich confounding information, utilizing larger populations, and drawing on multiple sources of information, researchers aim to comply with key principles underpinning causal inference. This talk will examine current (and future) substantive and methodological problems. We consider Bayesian techniques for causal effect estimation using high-dimensional data, including regularizing priors and Bayesian additive regression trees. The methodology is illustrated by assessing the comparative effectiveness of implantable medical devices. Funded by R01-GM111339 and U01-FDA004493. Joint work with Jacob Spertus and Sherri Rose at Harvard Medical School.

September 18, 2017 - Jennifer Listgarten, Microsoft

Jennifer Listgarten
Senior Researcher
Microsoft Research New England

From Genetics to CRISPR Gene Editing with Machine Learning

Molecular biology, healthcare, and medicine have been slowly morphing into large-scale, data-driven sciences dependent on machine learning and applied statistics. In this talk I will start by explaining some of the modeling challenges in finding the genetic underpinnings of disease, which is important for screening, treatment, drug development, and basic biological insight. Genome- and epigenome-wide associations, wherein individual or sets of (epi)genetic markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be found by use of a simple statistical test. However, a wide variety of structure and confounders lie hidden in the data, leading to both spurious and missed associations if not properly addressed. Much of this talk will focus on how to model these types of data. Once we uncover genetic causes, genome editing — which is about deleting or changing parts of the genetic code — will one day let us fix the genome in a bespoke manner. Editing will also help us understand mechanisms of disease, enable precision medicine, and aid drug development, to name just a few more important applications. I will close by discussing how we developed machine learning approaches to enable more effective CRISPR gene editing.

May 15, 2017 - David Gotz, UNC-Chapel Hill

Lahey Room
Countway Library, HMS

David Gotz
Associate Professor of Information Science
Assistant Director for the Carolina Health Informatics Program
University of North Carolina – Chapel Hill

Visual analytics methods for high-dimensional temporal event data

Large-scale temporal event data, with vast numbers of long and complex sequences of time-stamped events, are found in a wide range of domains including social networking, security, and healthcare. In the medical domain, for example, electronic health record data, with thousands of variables and millions of patients, can be used in comparative effectiveness studies, epidemiological investigations, and patient-centered outcomes research. However, exploratory analysis of these data is often slow, cumbersome, and error-prone. This talk will review some of the challenges associated with this form of data and discuss new visual analytics methods being developed to provide practitioners with exploratory analysis tools that are faster, more intuitive, and more reliable.

May 8, 2017 - Marylyn Ritchie, Geisinger Health System

Room 403 – 4th floor, Countway Library

Marylyn D. Ritchie
Senior Investigator and Director of Biomedical and Translational Informatics, Geisinger Health System
Professor, Biochemistry and Molecular Biology, Pennsylvania State University
Director, Center for Systems Genomics – The Huck Institutes of the Life Sciences, Pennsylvania State University

Machine Learning Strategies in the Genome and the Phenome – Toward a Better Understanding of Complex Traits

Modern technology has enabled massive data generation; however, tools and software to work with these data in effective ways are limited. Genome science, in particular, has advanced at a tremendous pace during recent years with dramatic innovations in molecular data generation technology, data collection, and a paradigm shift from single-lab science to large, collaborative network/consortia science. Still, the techniques to analyze these data to extract maximal information have not kept pace. Comprehensive collections of phenotypic data can be used in more integrated ways to better subset or stratify patients based on the totality of their health information. Similarly, the availability of multi-omics data continues to increase. Given the complexity of the networks underlying biological systems, it is unlikely that every patient with a given disease has exactly the same underlying genetic architecture. Success in understanding the architecture of complex traits will require a multi-pronged approach. By applying machine learning to the rich phenotypic data of the EHR, these data can be mined to identify new and interesting patterns of disease expression and relationships. Machine learning strategies can also be used for meta-dimensional analysis of multiple omics datasets. We have been exploring machine learning technologies for evaluating both the phenomic and genomic landscape to improve our understanding of complex traits. These techniques show great promise for the future of precision medicine.

May 1, 2017 - Sam Kou, Harvard

Samuel Kou
Professor of Statistics
Department of Statistics
Harvard University

Big data, Google and disease detection: the statistical story

Big data collected from the internet have generated significant interest in not only the academic community but also industry and government agencies. They bring great potential in tracking and predicting massive social activities. In this talk we focus on tracking disease epidemics. We will discuss the applications, in particular Google Flu Trends, some of its fallacies, and the statistical implications. We will propose a new model that utilizes publicly available online data to estimate disease epidemics. Our model outperforms all previous real-time tracking models for influenza epidemics at the national level in the US. We will also draw some lessons for big data applications.
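As a rough illustration of the modeling idea (not Prof. Kou's actual model, and using synthetic data in place of real search volumes), one can regress current disease activity on its own lag plus contemporaneous search-term frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
flu = np.zeros(T)
for t in range(1, T):                      # synthetic epidemic curve: AR(1) process
    flu[t] = 0.8 * flu[t - 1] + rng.normal(0, 0.1)
# Five made-up "search frequency" series that noisily track the epidemic
searches = flu[:, None] + rng.normal(0, 0.3, size=(T, 5))

# Design matrix: lag-1 disease activity plus the 5 contemporaneous search series
X = np.column_stack([flu[:-1], searches[1:]])
y = flu[1:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"in-sample R^2: {r2:.2f}")
```

Real models in this spirit add regularization, many more search terms, and careful out-of-sample validation; the sketch only shows the basic nowcasting regression structure.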

April 24, 2017 - Jessica Franklin & Sebastian Schneeweiss, BWH/HMS

Jessica Franklin
Assistant Professor of Medicine, Harvard Medical School
Biostatistician, Division of Pharmacoepidemiology & Pharmacoeconomics
Brigham and Women’s Hospital

Sebastian Schneeweiss
Professor in the Department of Epidemiology
Department of Epidemiology, Harvard Chan School
Professor of Medicine, Harvard Medical School
Vice Chief, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham & Women’s Hospital

Causal inference challenges and opportunities in administrative healthcare databases

Randomized controlled trials (RCTs) remain the gold standard for establishing the causal relationship between medications and health outcomes. However, for some clinical questions RCTs may be infeasible, unethical, costly, or generalizable to only a very narrow population. In these cases, observational studies from routinely collected “real-world” health data (RWD), such as health insurance claims and electronic health records, are crucial for supplementing the evidence from RCTs. In this talk, we will discuss some of the ongoing challenges in drawing causal inference about medications from these data as well as progress made. In particular, we will present research related to the automated selection of confounders from large retrospective databases to reduce confounding bias.

April 10, 2017 - Peter Kharchenko, HMS

Peter Kharchenko
Assistant Professor of Biomedical Informatics
Department of Biomedical Informatics
Harvard Medical School

From one to millions of cells: computational approaches for single-cell analysis

Over the last five years, our ability to isolate and analyze detailed molecular features of individual cells has expanded greatly. In particular, the number of cells measured by single-cell RNA-seq (scRNA-seq) experiments has gone from dozens to over a million cells, thanks to improved protocols and fluidic handling. Analysis of such data can provide detailed information on the composition of heterogeneous biological samples and the variety of cellular processes that together comprise the cellular state. Such inferences, however, require careful statistical treatment to take into account measurement noise as well as inherent biological stochasticity. I will discuss several approaches we have developed to address such problems, including error modeling techniques, statistical interrogation of heterogeneity using gene sets, and visualization of complex heterogeneity patterns, implemented in the PAGODA package. I will discuss how these approaches have been modified to enable fast analysis of very large datasets in PAGODA2, and how the flow of typical scRNA-seq analysis can be adapted to take advantage of potentially extensive repositories of scRNA-seq measurements. Finally, I will also discuss a focused computational approach for linking transcriptional and genetic heterogeneity within scRNA-seq measurements of human cancers.

April 3, 2017 - Finale Doshi-Velez, Harvard

Finale Doshi-Velez
Assistant Professor of Computer Science
John A. Paulson School of Engineering & Applied Sciences
Harvard University

Reinforcement Learning for Optimizing Treatment in HIV

HIV is a disease where a long-term view is needed to keep a rapidly mutating virus in check. Recently, there has been some success in predicting therapies that will produce an immediate reduction in viral load using kernel-based methods — methods that essentially cluster together patients with similar characteristics. In this work, we first show that these methods can also be adapted to choose therapies with long-term benefits. Second, we note that these kernel-based methods typically fail for patients who don't closely align with other patients. We present a combined kernel and model-based approach to optimize treatment, which improves outcomes by choosing a model — which offers better generalization — when there are no nearby neighbors. Finally, I will discuss ongoing work (in simulation) to further personalize these treatments to individual patients.

March 27, 2017 - Jason Flannick, Broad & MGH

Waterhouse Room – 1st Floor
Gordon Hall – Harvard Medical School

Jason Flannick
Senior Group Leader, The Broad Institute of Harvard and MIT
Research Associate, Massachusetts General Hospital

Approaches to understand the genetic architecture and biology of type 2 diabetes

Genome-wide association studies (GWAS) have identified numerous common variants associated with modest increases in risk of type 2 diabetes (T2D), suggesting clues into the causes and biology of the disease. And yet, these findings have explained only a limited fraction of the genetic basis of T2D and have been slow to translate into improved patient care. In this talk, I will discuss what large-scale next-generation sequencing of thousands to tens of thousands of T2D patients has taught us beyond GWAS, and what this means for the future of genetic research into T2D and other complex diseases. I will first present methods and an analysis to quantify the genetic architecture of T2D based on large-scale sequencing, suggesting that the models motivating next-generation sequencing may have been overly optimistic; I will then, however, present specific examples where next-generation sequencing has in fact identified high-impact alleles leading to important biological insights into T2D. I will argue that these findings suggest two paths toward the future of T2D genetics research: studies of larger and larger scale, but analyses that are richer and more individualistic. I will conclude with thoughts on new research paradigms and bioinformatics methods development to leverage the valuable sequence data that now exists for T2D and other complex diseases, in order to more rapidly translate genetic associations into new medicines or improved patient care.

March 22, 2017 - Peter Robinson, Jackson Lab

Armenise Amphitheater
Harvard Medical School
210 Longwood Ave.

Peter Robinson
Professor of Computational Biology
The Jackson Laboratory

Phenotype Driven Genomic Analysis

March 6, 2017 - David Parkes, Harvard

David Parkes
Area Dean for Computer Science
George F. Colony Professor of Computer Science
John A. Paulson School of Engineering & Applied Sciences
Harvard University

Robust Peer Prediction: Information without Verification

There are many settings where we want to promote the contribution of useful information but have no easy way to verify its correctness. Consider crowdsourcing measurements of air quality, asking users to answer questions about places in a city, or tagging social media stories as real or fake. How can good information be promoted, even when the correct answer is costly to verify, or intrinsically unknowable because data is subjective or noisy? The idea of peer prediction is to ask multiple people the same question and score reports based on similarity. Done right, this promises to promote informative reports. But done wrong, it can have unintended consequences (high-paying, uninformative equilibria!). In this talk, I describe the correlated agreement (CA) mechanism, which can be combined with machine learning to provide a remarkably robust method of peer prediction. I demonstrate its properties in simulation on statistical models from user reports on Google Local Guides as well as through a replicator dynamics study using student peer-assessment data from edX.

Joint work with Arpit Agarwal (U Penn), Rafael Frongillo (CU Boulder), Matthew Leifer (Harvard), Debmalya Mandal (Harvard), Galen Pickard (Google), Nisarg Shah (Harvard), and Victor Shnayder (Harvard).
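A toy reduction of the agreement-based scoring idea (not the full CA mechanism; binary signals with positive correlation are assumed, and the simulation parameters are made up): reward agreement with a peer on the same task, and subtract the rate of agreement across *different* tasks, so that uninformative constant reports score zero.

```python
import random

def ca_score(reports_a, reports_b):
    """Average same-task agreement minus average cross-task agreement."""
    n = len(reports_a)
    same = sum(a == b for a, b in zip(reports_a, reports_b)) / n
    cross = sum(reports_a[i] == reports_b[j]
                for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return same - cross

random.seed(0)
truth = [random.random() < 0.5 for _ in range(500)]
# Two honest reporters who observe the truth with 10% noise
honest_a = [t if random.random() < 0.9 else not t for t in truth]
honest_b = [t if random.random() < 0.9 else not t for t in truth]
lazy = [True] * 500   # an uninformative "always report True" strategy

print(f"honest vs honest: {ca_score(honest_a, honest_b):+.3f}")
print(f"lazy   vs honest: {ca_score(lazy, honest_b):+.3f}")
```

The honest pair scores positive (same-task agreement exceeds chance agreement), while the constant reporter's same-task and cross-task agreement rates coincide, so its score is zero regardless of what the peer reports.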

January 30, 2017 - Maha Farhat, HMS DBMI

Maha Farhat
Assistant Professor of Biomedical Informatics
Department of Biomedical Informatics
Harvard Medical School

Bacterial genomics and its translation to an improved understanding and diagnosis of infectious disease

In 2017, the diagnosis of many infectious diseases still involves the collection of patient samples for culture, which often has poor sensitivity and incurs considerable delays in diagnosis and in the characterization of antibiotic resistance. Patients are often left with no diagnosis or are treated empirically with broad-spectrum antibiotics, exposing them and society to ever-increasing rates of drug resistance. Sequencing technologies hold immense promise for increasing the sensitivity of testing and for capturing drug resistance in a rapid, point-of-care fashion, but several important barriers remain. Here I will discuss some of these barriers and the work that I and others have been doing to surmount them. I will also discuss how sequencing of clinical isolates can help us gain insight into the biology of bacterial traits relevant to infectious disease, such as virulence, transmissibility, and antibiotic resistance.

December 12, 2016 - Emery Brown, MIT

Emery Brown, MD, PhD 

Edward Hood Taplin Professor of Medical Engineering and of Computational Neuroscience, Massachusetts Institute of Technology
Professor of Health Sciences and Technology, Massachusetts Institute of Technology
Warren M. Zapol Professor of Anesthesia, Harvard Medical School, Massachusetts General Hospital
Director, Harvard-MIT Health Sciences and Technology Program, MIT
Associate Director, Institute for Medical Engineering and Science, MIT
Investigator, Picower Center for Learning and Memory, Department of Brain and Cognitive Sciences, MIT

The Dynamics of the Unconscious Brain Under General Anesthesia

General anesthesia is a drug-induced, reversible condition comprised of five behavioral states: unconsciousness, amnesia (loss of memory), analgesia (loss of pain sensation), akinesia (immobility), and hemodynamic stability with control of the stress response. Our work shows that a primary mechanism through which anesthetics create these altered states of arousal is by initiating and maintaining highly structured oscillations. These oscillations impair communication among brain regions. We illustrate this effect by presenting findings from our human studies of general anesthesia using high-density EEG recordings and intracranial recordings. These studies have allowed us to give a detailed characterization of the neurophysiology of loss and recovery of consciousness due to propofol. We show how these dynamics change systematically with different anesthetic classes and with age. We present a neuro-metabolic model of burst suppression, the profound state of brain inactivation seen in deep states of general anesthesia. We use our characterization of burst suppression to implement a closed-loop anesthesia delivery system for control of a medically-induced coma. Finally, we demonstrate that the state of general anesthesia can be rapidly reversed by activating specific brain circuits. The success of our research has depended critically on tight coupling of experiments, signal processing research and mathematical modeling.

December 5, 2016 - Atul Butte, UCSF

Atul Butte
Director, Institute for Computational Health Sciences
Professor of Pediatrics
University of California, San Francisco

Translating a Trillion Points of Data into Therapies, Diagnostics, and New Insights into Disease

There is an urgent need to take what we have learned in our new genome era and use it to create a new system of precision medicine, delivering the best preventative or therapeutic intervention at the right time, for the right patients. Dr. Butte’s lab at the University of California, San Francisco builds and applies tools that convert trillions of points of molecular, clinical, and epidemiological data (measured by researchers and clinicians over the past decade, and now commonly termed “big data”) into diagnostics, therapeutics, and new insights into disease. Several of these methods or findings have been spun out into new biotechnology companies. Dr. Butte, a computer scientist and pediatrician, will highlight his lab’s recent work, including the use of publicly available molecular measurements to find new uses for drugs, including new therapies for autoimmune diseases and cancer; discovering new druggable targets in disease; the evaluation of patients and populations presenting with whole genomes sequenced; integrating and reusing the clinical and genomic data that result from clinical trials; discovering new diagnostics, including blood tests for complications during pregnancy; and how the next generation of biotech companies might even start in your garage.

November 28, 2016 - Shamil Sunyaev, HMS

Shamil Sunyaev
Professor and Distinguished Chair of Computational Genomics
Division of Genetics, Brigham & Women’s Hospital, Harvard Medical School
Department of Biomedical Informatics, Harvard Medical School

Genome function and evolution through the lens of human genetics

Genetics has traditionally provided an entry point into new areas of biology. Now, large-scale human genetics studies have discovered a vast trove of noncoding alleles involved in complex traits. The focus is shifting to the interpretation of these genetic findings in terms of mechanistic hypotheses. Analysis of molecular phenotypes, such as gene expression, provides one possible avenue for analyzing the functional effects of alleles associated with human phenotypes. We developed a method that tests for co-localization of eQTLs and GWAS signals, and showed that a fraction of GWAS hits in autoimmune diseases are due to known eQTLs. We also developed a method to assist fine-mapping using functional annotations. Importantly, the method leverages phenotype-specific information. Sequencing studies of Mendelian phenotypes also require biological interpretation. We developed evolutionary models that help prioritize Mendelian mutations and analyze the role of genetic background in Mendelian genetics.

November 14, 2016 - Madalena Costa, HMS/BIDMC

Madalena Costa
Assistant Professor of Medicine, Harvard Medical School
Division of Interdisciplinary Medicine and Biotechnology
Co-Director, The Margret & HA Rey Institute for Nonlinear Dynamics in Medicine
Beth Israel Deaconess Medical Center

Dynamical Assays: A New Frontier in Precision Medicine?

Statistical physics seems to be an unlikely frontier of biomedical informatics and precision medicine. The goal of this talk is to introduce new concepts and computational methods from complex systems, which may be useful in “unpacking” big data, such as multichannel sleep recordings, and for extracting hidden information from time series derived from wearable devices. I will also talk about complex signals informatics in the context of open-source science.

October 17, 2016 - Tamara Broderick, MIT

Tamara Broderick
ITT Career Development Assistant Professor
Massachusetts Institute of Technology

Fast Quantification of Uncertainty and Robustness with Variational Bayes

In Bayesian analysis, the posterior follows from the data and a choice of a prior and a likelihood. These choices may be somewhat subjective and reasonably vary over some range. Thus, we wish to measure the sensitivity of posterior estimates to variation in these choices. While the field of robust Bayes has been formed to address this problem, its tools are not commonly used in practice—at least in part due to the difficulty of calculating robustness measures from MCMC draws. We demonstrate that, by contrast to MCMC, variational Bayes (VB) techniques are readily amenable to robustness analysis.
Since VB casts posterior inference as an optimization problem, its methodology is built on the ability to calculate derivatives of posterior quantities with respect to model parameters. We use this insight to develop local prior robustness measures for mean-field variational Bayes (MFVB), a particularly popular form of VB due to its fast runtime on large data sets. MFVB, however, has a well-known major failing: it can severely underestimate uncertainty and provides no information about covariance. We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for MFVB — both for individual variables and coherently across variables. We call our method linear response variational Bayes (LRVB).
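The uncertainty-underestimation failing is easy to see in a toy case (a standard textbook example, not LRVB itself): for a bivariate Gaussian target with correlation rho, the mean-field fixed point reports marginal variance 1/Lambda_ii = 1 - rho^2 for each coordinate, instead of the true marginal variance Sigma_ii = 1.

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])  # true covariance of the target
Lam = np.linalg.inv(Sigma)                  # precision matrix

# Mean-field fixed point for a Gaussian target: each factor's variance is
# the conditional variance 1/Lambda_ii, not the marginal variance Sigma_ii.
mfvb_var = 1.0 / np.diag(Lam)
true_var = np.diag(Sigma)

print("true marginal variances:", true_var)   # [1. 1.]
print("MFVB marginal variances:", mfvb_var)   # [0.19 0.19], i.e. 1 - rho**2
```

The stronger the correlation, the worse the understatement — which is exactly the gap that linear response corrections are designed to close.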

October 3, 2016 - Michael Baym, HMS

Michael Baym
Research Fellow in Systems Biology
Kishony Lab, Harvard Medical School

Evolutionary strategies to combat antibiotic resistance

Antibiotics are among the most important tools in medicine, but today their efficacy is threatened by the evolution of resistance. While resistance continues to spread, the development of new antibiotics is slowing. We need new strategies to delay or reverse the course of resistance evolution. In this talk I will describe several approaches to studying and manipulating resistance evolution in detail. Using a new experimental device, we have been able to study previously elusive aspects of the evolution of antibiotic resistance in spatial environments. Second, I will show how increasing the scale of experiments allows both the discovery of new avenues of attack and potential failure modes of evolutionary interventions. I will conclude with the algorithmic and biological challenges in the practical application of these approaches.

September 19, 2016 - Jukka-Pekka Onnela, HSPH

Jukka-Pekka Onnela
Assistant Professor of Biostatistics
Harvard T.H. Chan School of Public Health

Digital Phenotyping: Concept and Early Pilot Studies

We have defined the term “digital phenotyping” as smartphone-based moment-by-moment quantification of the individual-level human phenotype. As part of our efforts in this area, we have introduced a scalable research platform that enables us to customize the associated smartphone applications to collect large amounts of different kinds of social and behavioral data. This approach opens up some intriguing scientific opportunities in the biomedical sciences, but it also presents several interesting challenges for statistical learning. I will talk about the general concept of digital phenotyping and how we are currently using it in various pilot studies. I will also share some of our early results and highlight some data analytic and statistical challenges.