Quantitative Issues in Cancer Research Working Group Seminar

Please direct any logistical questions to David Cruikshank

The Cancer Working Group is currently sponsoring seminars. These are listed in the schedule below.

Upcoming Seminar


Cancer Working Group seminar meetings will be held in person in Building 2, Room 426 unless otherwise listed.  The link to each meeting will be posted along with the talk information.

Wednesday, May 8, 2024
Cancer Working Group
4:00-5:00 PM
Location: Building 2, Room 426
Event Link: https://www.hsph.harvard.edu/biostatistics/event/quantitative-issues-in-cancer-research-working-group-seminar-114/

Daniel Schwartz, Postdoctoral Research Fellow, Department of Biostatistics, Harvard University

Dynamic Latent Factor Models To Infer Dietary Patterns From Nutrition Survey Data

Abstract: A growing body of research has shown that poor diet is a leading risk factor for death, especially in connection with chronic diseases such as cardiovascular disease. However, these studies provide limited insights because they use simplistic measures of diet measured at a single timepoint. To address this issue, we develop a Bayesian dynamic latent factor model to succinctly describe multivariate dietary patterns. Our approach flexibly incorporates multivariate, longitudinal nutrition survey data such as food frequency questionnaires with multiple outcome types (e.g., ordinal and continuous). A truncated multiplicative gamma process prior is placed on the factor loadings to adaptively estimate low-dimensional dietary patterns. Importantly, our model also incorporates covariates such as demographics to assess how dietary patterns differ across subpopulations of interest. As a motivating application we consider the Black Women’s Health Study, where we construct dynamic measures of diet that will be used in downstream analyses to better understand cardiovascular disease risk among Black women in the United States.

2023-2024 Dates


September 13, 2023

Amy Zhou
Doctoral Student, Department of Biostatistics, Harvard University

Title: A Class of Case-Cohort Designs for Semi-Competing Risks

ABSTRACT: Semi-competing risks refers to the setting where interest lies in some non-terminal event, the occurrence of which is subject to some terminal event (usually death). While existing analysis methods generally assume complete data on all relevant covariates, it is often the case, particularly with electronic health records databases and/or disease registries, that some information is not readily available. To mitigate this, outcome-dependent sampling is a common strategy, especially in resource-limited settings, although researchers currently have only limited options in the semi-competing risks setting. We present a novel class of case-cohort designs for semi-competing risks within which researchers have flexibility to tailor allocation of resources in a variety of ways that best suit the disease context and study at hand. For estimation and inference, we propose to use inverse-probability weighting for a parametric hazard regression-based frailty illness-death model. We present asymptotic results, along with a practical estimator of the asymptotic variance. Simulation results are presented that verify performance of the proposed analysis methods in finite-sample settings and illustrate potential efficiency gains associated with the design. The work is motivated by and illustrated with data from the Center for International Blood & Marrow Transplant Research.

September 20, 2023

Luke Benz
Doctoral Student, Department of Biostatistics, Harvard University

Title: Understanding Missing Data when Emulating Target Trials in EHR-Based Observational Studies

ABSTRACT: In recent years, target trial emulation has emerged as a framework for helping researchers mitigate or avoid potential biases in observational studies. In simplest terms, target trial emulation requires researchers to specify the protocol for an ideal clinical trial they would run if possible, and subsequently establish an analogous version of the protocol for the observational study that adheres as closely as possible to that of the target trial.

A critical component of this target trial emulation framework is specifying the eligibility criteria for inclusion in the study. Electronic health record (EHR) databases serve as a useful data source for observational analyses. However, because EHR data are collected for billing purposes rather than any particular clinical question, useful information for statistical analyses may be unavailable. In particular, when using EHR databases to emulate target trials, it is frequently the case that subjects’ eligibility status cannot be ascertained due to missing data in the covariates that comprise the inclusion criteria for the study. Nearly every observational analysis under the target trial emulation framework excludes all subjects with missing eligibility data, yet this could plausibly introduce selection bias, particularly when a sequence of emulated trials is pooled to increase power.

In this talk, I will outline ongoing work on building infrastructure for several simulation studies to better understand settings where excluding subjects with missing eligibility data is problematic, and potential solutions. These simulation studies are motivated by the study of long-term effects of bariatric surgery, and simulation settings are informed by prior analyses conducted on EHR-based studies on the DURABLE cohort at Kaiser Permanente.

September 27, 2023

October 4, 2023

Phillip Nicol
Doctoral Student, Department of Biostatistics, Harvard University

Title: Model-based dimension reduction for spatial transcriptomics data

ABSTRACT: Recently developed technologies can measure gene expression at single-cell resolution while simultaneously preserving the spatial location of samples. Standard dimension reduction techniques such as principal component analysis (PCA) can be applied to find a small set of genes that contribute biologically relevant variation. However, standard approaches do not model the count nature of the data, which can lead to spurious results. Moreover, the resulting PCA factors may not be spatially coherent in the sense that nearby cells could have very different factor scores. In this talk I will discuss preliminary work on adding spatial penalties to a Poisson-based model for dimension reduction of single-cell gene expression data (scGBM). We will demonstrate the ability of our method to produce spatially coherent factors on both real and simulated data.

October 11, 2023

Mónica Robles Fontán
Doctoral Student, Department of Biostatistics, Harvard University

Title: Time-varying effectiveness of the COVID-19 Bivalent Vaccine

ABSTRACT: SARS-CoV-2 has now become a constant in our daily lives. To mitigate the severe outcomes of infection, the scientific community updates vaccines targeting both ancestral and newer, more prevalent strains. On October 12, 2022, the FDA approved the administration of a bivalent COVID-19 vaccine targeting Omicron strain infections. We intend to evaluate the bivalent vaccine’s effectiveness by analyzing cases, hospitalizations, and deaths accumulated in Puerto Rico. In particular, we will compare improvements in effectiveness with respect to different groups: the unvaccinated, those who received the primary series, and those who received the primary series followed by a booster shot. Preliminary results suggest that bivalent vaccine effectiveness significantly declines within 6 months after administration, consistent with the decline in vaccine effectiveness associated with the primary series.

October 25, 2023

Kimberly Greco
Doctoral Student, Department of Biostatistics, Harvard University

Title: Representation Learning from EHR Data

ABSTRACT: Gleaning meaningful insights from electronic health record (EHR) data is complex due to its sparse and high-dimensional nature. Representation learning is a fundamental tool for extracting features from EHR data and transforming a collection of clinical concepts into a lower-dimensional vector space optimized for machine learning applications. This presentation will introduce an innovative framework for integrating multi-source EHR data to generate vector embeddings, as well as in-progress work focused on sub-phenotyping rare disease patients at Boston Children’s Hospital using this framework.

November 1, 2023

Jodeci Wheaden
Doctoral Student, Department of Biostatistics, Harvard University

Title: The Link between Food Environment and Colorectal Cancer: A Systematic Review

ABSTRACT: This talk will discuss the paper “The Link between Food Environment and Colorectal Cancer: A Systematic Review” by Masdor et al. (2022), available at https://doi.org/10.3390/nu14193954, in relation to the speaker’s research interests.

November 8, 2023

Sajia Darwish
Doctoral Student, Department of Biostatistics, Harvard University

Title: Theory-driven statistical estimation of racism-related themes in the medical literature

Abstract: The scientific study of racism as a root cause of health inequities has been hampered by the policies and practices of medical journals. Monitoring the discourse around racism and health inequities in scientific publications is a critical aspect of understanding, confronting, and ultimately dismantling racism in medicine. Our goal is to develop a framework to evaluate the changes in the prevalence and composition of racism narratives over time and across journals through the use of natural language processing.

November 29, 2023

Carmen Rodriguez Cabrera
Doctoral Student, Department of Biostatistics, Harvard University

Title: Addressing Disparities in Polygenic Risk Scores Prediction

December 6, 2023

Rong Ma
Assistant Professor, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Title: A Spectral Approach to Assessing and Combining Multiple Data Embeddings

Abstract: Dimension reduction is an indispensable part of modern data science, and many algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it important to evaluate their relative performance, and to leverage and combine their individual strengths. This paper proposes a spectral method for assessing and combining multiple data embeddings of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure – the visualization eigenscore – of the relative performance of the embeddings for preserving the structure around each data point. It also generates a consensus embedding, having improved quality over individual visualizations in capturing the underlying structure. Our approach is flexible and works as a wrapper around any data embeddings. We analyze multiple real-world datasets to demonstrate the effectiveness of the method. We also provide theoretical justifications based on a general statistical framework, yielding several fundamental principles along with practical guidance. This is a joint work with Eric Sun and James Zou.

December 13, 2023

Daniel Schwartz
Postdoctoral Research Fellow, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Treatment Effect Estimation in Multisite Trials with Endogenous Design: Old Estimators, New Results

Abstract: In large-scale multisite randomized trials, key design features such as the sample size at each site often arise from an unpredictable social process. As a result, the sample sizes of the treated and control groups within each site, which are generally considered fixed, are more appropriately considered random. When the treatment effect also varies across sites it is often plausible that it will be associated with the site-specific sample sizes, for instance when larger sites are more effective. In this case we say that the trial design is endogenous. In this paper we argue that endogenous designs have a major consequence: to realistically evaluate the performance of an estimator under repeated sampling one must now integrate over the design features. This can challenge intuitions about when different estimators are best used. We present a simple model for multisite trials with random, endogenous designs. We then show, with asymptotic and simulation results, how popular treatment effect estimators can perform very differently under endogenous designs than they do under a classical perspective assuming that the design is fixed. We illustrate these ideas in a case study of a major multisite trial in education, showing how Bayesian analyses can explore the likely performance of the common estimators given the study’s data, under a range of prior beliefs about how endogenous the study design is.

January 24, 2024


Elizabeth Graff
PhD Student, Department of Biostatistics, Harvard University

Discussion of “Shift-aware Human Mobility Recovery with Graph Neural Network” (Sun et al. 2021)

Abstract: Human mobility recovery is of great importance for a wide range of location-based services. However, recovering human mobility is not trivial because of three challenges: 1) complex transition patterns among locations; 2) multi-level periodicity and shifting periodicity of human mobility; 3) sparsity of the collected trajectory data. In this paper, we propose PeriodicMove, a neural attention model based on graph neural network for human mobility recovery from lengthy and sparse trajectories. In PeriodicMove, we first construct a directed graph for each trajectory and capture complex location transition patterns using graph neural network. Then, we design two attention mechanisms which capture multi-level periodicity and shifting periodicity of human mobility respectively. Finally, a spatial-aware loss function is proposed to incorporate spatial proximity into the model optimization, which alleviates the data sparsity problem. We perform extensive experiments and the evaluation results demonstrate that PeriodicMove yields significant improvements over the competitors on two representative real-life mobility datasets. In addition, by providing high-quality mobility data, our model can benefit a variety of mobility-oriented downstream applications.

January 31, 2024


Jeremy Simon
Senior Research Scientist, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health

Abstract: I will discuss ongoing work from a personalized neoantigen vaccine clinical trial in melanoma, focusing on single-cell RNA-seq and TCR-seq data from patient samples to better understand the immune response following vaccination. I will also discuss preliminary work exploring whether single-cell chromatin accessibility (scATAC-seq) features help to define a rare but fascinating subtype of CLL.

February 7, 2024


Jodeci Wheaden
PhD Student, Department of Biostatistics, Harvard University

Lifestyle factors and impact on colon cancer risk

Abstract: In my exploration titled “Lifestyle Factors and Impact on Colon Cancer Risk,” I initially set out to understand the relationship between lifestyle choices and colon cancer risk. However, a pivot was necessary due to the scarcity of data on colon cancer incidence in the NHANES dataset. This led me to a more focused inquiry into how obesity, a significant public health concern, might play a role in the development of colon cancer. Delving into the NHANES 2015-2016 data for adults over 20, I examined various lifestyle factors and their association with obesity.

February 14, 2024


Giovanni Parmigiani
Professor, Department of Data Science, Dana-Farber Cancer Institute and Department of Biostatistics, Harvard T.H. Chan School of Public Health

Digressions on Simpson’s Paradox

February 21, 2024


Luke Benz
PhD Student, Department of Biostatistics, Harvard University

Adjusting for Selection Bias Due to Missing Eligibility Criteria in Emulated Target Trials

Abstract: Target trial emulation (TTE) is a popular framework for observational studies based on electronic health records (EHR). A key component of this framework is determining the patient population eligible for inclusion in the study. Missingness in variables that define eligibility criteria, however, presents a major challenge. In practice, patients with incomplete data are frequently excluded from analysis despite the possibility of selection bias, which can arise when subjects with observed eligibility data are fundamentally different from excluded subjects. To the best of our knowledge, very little work has developed methods to mitigate this concern. In this work, we propose a novel conceptual framework to address selection bias in TTE studies, tailored towards time-to-event endpoints, and describe estimation and inferential procedures via inverse probability weighting (IPW). Under a realistic simulation infrastructure, which adequately captures the unique intricacies of EHR data, we characterize common settings under which missing eligibility data poses the threat of selection bias and evaluate the ability of IPW to address it. Finally, we demonstrate use of our method to evaluate the effect of bariatric surgery on microvascular outcomes.

February 28, 2024


Raphael Kim
Doctoral Student, Department of Biostatistics, Harvard University

Title: Cost constrained optimal treatment regimes under bipartite network interference

ABSTRACT: Adverse health effects of coal-fired power plant emissions are often studied under bipartite network interference (BNI) settings, in which the treated units are different from the units on which outcomes are observed and treatment units can affect multiple outcome units. There is a growing literature on causal effect estimation under BNI, but to our knowledge, none have considered the problem of optimal treatment regimes under BNI. We introduce a Q-learning and A-learning approach for determining cost-constrained treatment regimes under arbitrary BNI, and derive the asymptotic properties of our proposed estimators. We demonstrate the efficacy of our methods in a simulation study.

March 6, 2024


Phillip Nicol, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Estimation in Poisson log-bilinear models

Abstract: The Poisson log-bilinear model, also known as GLM-PCA, is a commonly used approach for dimension reduction in single-cell RNA-seq data. Model parameters are usually estimated via maximum likelihood. However, we show that the MLE can be undefined for some realistic single-cell datasets. In this talk, we show how this issue can be resolved by adding appropriate priors to the model parameters. Importantly, the prior information can be incorporated with minor adjustments to existing estimation algorithms. We demonstrate the approach on real and simulated single-cell data and discuss extensions to spatial transcriptomics.

March 20, 2024


Sajia Darwish, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Discussion of “What is the probability of replicating a statistically significant effect?” (Miller 2009)

Abstract: If an initial experiment produces a statistically significant effect, what is the probability that this effect will be replicated in a follow-up experiment? [This paper] argues that this seemingly fundamental question can be interpreted in two very different ways and that its answer is, in practice, virtually unknowable under either interpretation. Although the data from an initial experiment can be used to estimate one type of replication probability, this estimate will rarely be precise enough to be of any use. The other type of replication probability is also unknowable, because it depends on unknown aspects of the research context. Thus, although it would be nice to know the probability of replicating a significant effect, researchers must accept the fact that they generally cannot determine this information, whichever type of replication probability they seek.

March 27, 2024


Mónica Robles Fontán, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Leveraging Record Linkage To Enhance Public Health Research

Abstract: Record linkage is the task of combining records from different populations that belong to a single entity to create a new single population. This task allows researchers to take advantage of existing data sources to answer scientific questions that otherwise would be difficult to assess, such as studies requiring large sample sizes. There are two main approaches to performing record linkage tasks: deterministically and probabilistically, although most implementations combine both approaches. In this talk, we will explore the problem of record linkage and discuss the theoretical framework as developed by Fellegi & Sunter (1969). We will discuss practical issues that arise in the task of record linkage, as well as a real-world example in the context of observational data for COVID-19 vaccination and outcomes from Puerto Rico.

April 3, 2024


Christian Covington, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Statistical theory and the practice of data analysis: A brief and biased history

Abstract: This talk gives an account of the replication crisis and how different disciplines, namely applied sciences, statistics, and theoretical computer science (TCS), have developed their own research agendas in order to address it. I distinguish between two tracks in the history of methodological development: one regarding adaptivity in data analysis, the other regarding “methodological uncertainty” in model selection, data processing choices, etc.

I provide an overview of a few different methodological approaches, developed in the statistics and TCS communities, for achieving valid inference under adaptivity. Then I describe two increasingly popular frameworks developed primarily by psychologists for incorporating methodological uncertainty into a data analysis pipeline: multiverse analysis and specification curve analysis. Through examples, I explore confusion and disagreement about how these ideas ought to be used. Finally, I argue that more work is needed to understand what these methods can and can’t provide, both philosophically and statistically, and provide some preliminary ideas to this end.

April 10, 2024

Elizabeth Graff, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Applications of Deep Learning for Graph-Structured Data: From Disease Spread to Social Networks

Abstract: How can we apply deep learning to solve problems in modeling the spread of disease? In this talk, we will explore the components and applications of Graph Neural Networks (GNNs), a class of neural networks that are specifically designed to learn from graph-structured data. We will discuss the versatility of graphs in representing complex relationships across various domains from molecular structures to social networks, which necessitates models like GNNs that can capture both the graph topology and node-level information. We will examine examples of studies that leverage GNNs to achieve machine learning tasks on graphs, including node and edge predictions and whole graph classifications.

April 17, 2024


Kimberly Greco, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

Graph Attention Framework to Enhance Rare Disease Sub-Phenotyping from EHR

Abstract: Accurately sub-phenotyping patients according to their risk for an adverse clinical outcome can significantly enhance clinical decision-making. Recent advances in patient representation learning have enabled the development of sophisticated clustering algorithms designed to accurately sub-phenotype patients in ways that are predictive of these outcomes. To optimize data for clustering, we introduce a methodology utilizing a Graph Attention Network (GAT) to enhance Electronic Health Record (EHR) code-level embeddings. This approach facilitates the generation of rich patient-level embeddings, which are then leveraged in downstream clustering tasks aimed at sub-phenotyping patients based on their risk of experiencing a particular outcome. Building on this foundation, we explore ongoing work focused on advancing personalized medicine for patients with rare diseases.

April 24, 2024

Carmen B. Rodriguez, PhD Student, Department of Biostatistics, Harvard T.H. Chan School of Public Health

A Bayesian Mixture Model Approach to Examining Socioeconomic Disparities in Endometrial Cancer Care in Massachusetts

Abstract: Endometrial cancer (EC) is the most common gynecologic cancer in the United States. On average, African American women have a 55% higher 5-year mortality risk compared to white women, and like other minority groups, they are vulnerable to receiving care that is not concordant with evidence-based treatment guidelines. These differences are linked to systemic and structural factors relating to difficulties in accessing care and the socioeconomic environments in which individuals reside. Previous research has examined socioeconomic factors (e.g., education, income) individually, but these often interact as social determinants of health. In this project, we took a multifactorial approach to examining racial-ethnic and socioeconomic factors leading to bias and disparities in EC care. We follow a social determinants of health framework to describe neighborhood socioeconomic status (NSES) profiles/clusters. We identified NSES profiles through the application of a multivariate Bernoulli mixture model. Using census tract aggregate level data and patient-level information from 9318 patients collected in the 2006-2017 Massachusetts Cancer Registry, we examined differences in receipt of optimal care for EC patients in Massachusetts by NSES profiles. We compared the stability of our cluster profiles across three waves of the American Community Survey 5-year estimates (2006-2010, 2011-2015, and 2015-2019), and compared these results to other aggregate measures used in cancer surveillance datasets.

May 1, 2024


Amy Zhou, PhD Student, Department of Biostatistics, Harvard University

Comparison of Outcome-Dependent Sampling for Semi-Competing Risks

Abstract: Outcome-dependent sampling is a commonly used design tool to collect otherwise unavailable information on a subset of participants rather than all participants. This is particularly useful in research settings where one or more covariates of interest may not be readily available, whether cost-prohibitive, time-consuming, or difficult to obtain in a resource-limited setting. Two common outcome-dependent sampling methods used in time-to-event settings are nested case-control and case-cohort. Classes of designs for both nested case-control and case-cohort were developed to extend their use to analysis of semi-competing risks. Semi-competing risks refers to the setting where interest lies in some non-terminal event, the occurrence of which is subject to some terminal event (typically, but not always, death). We compare the efficiency of these two designs for semi-competing risks through simulation to show the effect of censoring, type of risk factor, subcohort size, and other factors, and illustrate the flexibility of these two designs to tailor resource allocation to best suit the disease context and study goals.

Cancer Working Group Seminar Archive