Training Grant in Quantitative Sciences for Cancer Research



Current and Recent Fellows (2015-2016)

Sixing Chen (advisor Xihong Lin; supported on this grant from 2012-2015) is a third year student and current trainee on this grant who is working with Dr. Lin on approximating statistical power in the presence of measurement error. His current project develops methods for analyzing data from differing sequencing platforms. He is examining two approaches, one of which allows analysis with naive variance estimates. He plans to apply this work to a cancer dataset and is currently preparing a manuscript. He will also present this work at JSM in August.

Kevin Galinsky (advisor Alkes Price; dissertation committee Nick Patterson [Broad Institute] and Liming Liang; supported on this grant from 2012-2014) is a third year student.

Mr. Galinsky is working with Dr. Price on a project to estimate principal components (PCs) of genomic data. Principal component analysis (PCA) is a technique that can infer population structure in genomic data and has also been used to correct for population stratification in genome-wide association studies (GWAS). Unfortunately, traditional PCA runs slowly as the number of samples grows, with running time proportional to the cube of the number of samples and memory consumption proportional to its square. Mr. Galinsky is applying to genomic data an accurate approximation to PCA whose running time is linear in the number of subjects. This algorithm runs orders of magnitude faster on data sets with tens of thousands of individuals and allows the computation of PCs on very large datasets without specialized hardware. Mr. Galinsky has also extended a selection statistic that operates on discrete populations to instead use SNP loadings from PCA. He applied these tools to a cohort of 54k individuals of European ancestry from the San Francisco Bay Area and found natural selection acting upon the ADH1B, IGFBP3, and IGH loci. IGFBP3 has been implicated in a number of cancers and is associated with breast cancer. The results of this project have been submitted to Nature Genetics and are currently under review.
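To illustrate why avoiding the samples-by-samples matrix matters, the following is a minimal sketch of power iteration for the top principal component. It is not Mr. Galinsky's algorithm, which the text does not specify; the function name `top_pc_scores` and the toy genotype matrix are hypothetical. The sketch only touches the data through matrix-vector products, so each iteration costs time proportional to the number of samples times the number of SNPs.

```python
import math
import random


def top_pc_scores(X, n_iter=200, seed=0):
    """Approximate top-PC scores and SNP loadings of X (samples x SNPs).

    Power iteration uses only matrix-vector products with the centered
    data, so the samples x samples matrix is never formed.
    Assumes X is not constant in every column (otherwise the norm is zero).
    """
    n, m = len(X), len(X[0])
    # Center each SNP (column) so PCs describe variation around the mean.
    col_means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - col_means[j] for j in range(m)] for row in X]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]  # random start in SNP space
    for _ in range(n_iter):
        # u = Xc v (one value per sample), then v = Xc^T u (one per SNP).
        u = [sum(r[j] * v[j] for j in range(m)) for r in Xc]
        v = [sum(Xc[i][j] * u[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Project the centered samples onto the loading vector.
    scores = [sum(r[j] * v[j] for j in range(m)) for r in Xc]
    return scores, v


# Hypothetical toy genotypes (0/1/2 allele counts): two groups of three samples.
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [2, 2, 2, 0, 1],
]
scores, loadings = top_pc_scores(genotypes)
print(scores)    # the sign of the first three scores differs from the last three
print(loadings)  # per-SNP loadings on the top PC
```

The overall sign of a principal component is arbitrary, so only the separation between the two groups is meaningful here, not which group receives positive scores.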

Mr. Galinsky's next project studies cross-population heritability. Heritability is defined as the proportion of phenotypic variance explainable by genetics. Mixed model methods have been employed to estimate heritability for numerous phenotypes, and these methods have been extended to estimate heritability for two traits simultaneously, along with the correlation of genetic effects for these two traits. The goal of the project is to extend this technique from the study of two phenotypes in one population to one phenotype in two populations. This project seeks to address how applicable a GWAS in one population is to another, and whether additive causal effect sizes are indeed the same across populations or whether gene-by-gene and gene-by-environment interactions must be given greater consideration.
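The quantities described above can be written compactly. The notation below is an assumption introduced for illustration (it does not appear in the report): \sigma_g^2 and \sigma_e^2 denote the genetic and residual variance components of a standard variance-components model, and the subscripts 1 and 2 index the two traits or, in the cross-population setting, the two populations.

```latex
% Heritability: the share of phenotypic variance attributable to genetics
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}

% Correlation of genetic effects between settings 1 and 2,
% where \sigma_{g,12} is the genetic covariance
r_g = \frac{\sigma_{g,12}}{\sqrt{\sigma_{g,1}^2 \, \sigma_{g,2}^2}}
```

Under this notation, r_g close to 1 would be consistent with shared additive effect sizes across the two populations, while a markedly smaller value would point toward the interaction effects mentioned above.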

Daniel Schlauch (advisor John Quackenbush; supported on this grant from 2012-2014) is a third year student. In the summer of 2013, Mr. Schlauch worked on a project with Professor Quackenbush identifying novel, validated molecular subtypes of ovarian cancer which have clinical and prognostic relevance. He has continued this research, focusing on the biological networks and signal passing mechanisms which drive tumor progression and which are differentiated between subtypes of cancer.

Mr. Schlauch has also been developing novel, computationally efficient methods for inferring gene regulatory networks based on the use of complementary data sources such as gene expression data and transcription factor binding motifs. He has developed a method, BERE, which infers bipartite regulatory networks with accuracy comparable to or better than that of leading algorithms and which can be run on a local machine in a fraction of the time.

Mr. Schlauch recently contributed a Bioconductor package, pandaR, an R implementation of the network inference method PANDA, which has been shown to be a leader in inferring transcription factor-gene regulatory structure from expression data, motif data, and protein-protein interaction data. A manuscript describing this package has been drafted for submission to Bioinformatics. In addition, much of his work has been on the analysis of regulatory networks for identifying biologically meaningful structural differences. His work has applications in the discovery of mechanisms for disease progression by identifying the key components in the regulatory network that are altered between two study groups. The analytic approaches are general and may suggest biological mechanisms or potential therapeutic targets in a wide range of diseases, including but not limited to cancer and COPD.

Ryan Sun (advisor Xihong Lin; dissertation committee Peter Kraft, Eric Tchetgen Tchetgen, David Christiani; supported on this grant from 2012-2014) is a third year student who has completed his written and oral exams.

Mr. Sun has been working with Dr. Xihong Lin to study statistical models used in detecting specific gene-environment interactions that may be associated with disease. It is believed that complex relationships between genetic and environmental risk factors characterize the etiology of many diseases, but there are very few validated gene-environment interaction findings in the literature. Studying these interactions can have a great public health benefit by helping to better understand the biological mechanisms that cause complex diseases, identify populations susceptible to disease, and discover risk factors that may not show large marginal effects. Specifically, Mr. Sun's focus is on improving methods of inference for these interactions so that interactions between genes and environmental factors can be detected more reliably. Mr. Sun is in the later stages of drafting his first paper on this work, tentatively titled "Testing for Gene-Environment Interactions under Misspecification of Environmental Exposure." He is developing theoretical results about the validity of testing for gene-environment interactions in the common regression setting and proposing a new resampling-based procedure for performing this inference. The methods are broadly applicable to most combinations of environmental exposure and disease, including cancer genetics studies. He will be presenting this work at the Joint Statistical Meetings (JSM) in August, and has also submitted an abstract to present at the ICSA (International Chinese Statistical Association) Statistics Symposium. Mr. Sun has also been performing statistical analysis for the HSPH Superfund Research Project on their gene-environment team, specifically on a study of how toxic metals impact infant neurodevelopment.

David Tucker (advisor Paul Catalano; supported on this grant from 2012-2014) graduated from the program with an MS in Spring 2015.

In addition to completing coursework, Mr. Tucker undertook summer research under the advisement of Professor Francesca Dominici. Their work used Bayesian hierarchical modelling to characterize the exposure-response relationship between aircraft noise and cardiovascular disease (CVD) and to investigate evidence of a change point. The primary scientific question is whether there is a noise threshold, or change point, below which the effect on a person's risk of CVD hospitalization is mitigated and above which that risk increases significantly. The dataset they worked with combines Medicare data with noise contours provided by the Federal Aviation Administration.

David Cheng (advisor Lee-Jen Wei; supported on this grant from 2013-2015) is a second year student and current trainee on this grant.

Mr. Cheng is interested in statistical methodology related to identifying high-risk patient subgroups and optimal therapies for patient subgroups. This problem is particularly relevant to treating cancer due to the diversity of disease subtypes and available treatments. Such methodologies may aid the development of individualized strategies for the prevention, diagnosis, and treatment of cancer. He is also interested in developing methodology for comparative effectiveness research to compare the benefits and risks of competing treatments using observational data. These methods may help patients identify the optimal treatment among all available treatments despite the lack of direct head-to-head evidence from randomized trials. Mr. Cheng is currently working with Professor Wei on the development of risk prediction methodology using decision tree models that are more interpretable and actionable for clinical decision making than traditional risk scoring approaches. This work was presented at a session with FDA CDER Office of Biostatistics director Lisa LaVange at the Dana-Farber Cancer Institute. He is also developing a method for treatment selection based on decision trees to identify enrichable patient subgroups that are likely to benefit from treatment.

In a separate research project, Mr. Cheng is working on a manuscript on a method that adjusts for differences in patient characteristics when comparing outcomes between studies and the full individual patient data are available for only one study. This method would potentially facilitate comparisons of treatments across studies when direct comparisons in a clinical trial would otherwise be unavailable or infeasible.

Ina Jazic (advisor Sebastien Haneuse; supported on this grant from 2013-2015) is a second year student and current trainee on this grant. In addition to completing her coursework and written qualifying exam, Ms. Jazic has been working on a project with DFCI Research Scientist Svitlana Tyekucheva and Professor Giovanni Parmigiani using gene expression data to detect cross-talk between tumor and stroma in breast and ovarian cancer. She is currently preparing an R package that will implement this method.

In the summer of 2014, Ms. Jazic began working with Sebastien Haneuse on a project related to semicompeting risks: the situation where primary interest lies in a non-terminal event, such as diagnosis, metastasis, or hospital readmission, whose occurrence is subject to a terminal event. Specifically, when evaluating the impact of interventions on quality of life among populations at high risk of death (such as terminal cancer patients), it is crucial to correctly handle death as a competing risk. She is currently preparing a manuscript illustrating the implications of semicompeting risks for the use of composite endpoints in clinical studies.

In March 2014, Ms. Jazic presented "Cross-Talk Analysis in Breast Cancer Tissues" at the Joint Symposium of the Dana-Farber/Harvard Cancer Center Programs in Breast and Gynecologic Cancer in Boston, MA. This poster won an award in the Genetics and Genomics session. She also submitted an abstract based on the composite endpoints project to the International Conference on Health Policy Statistics, which took place in Providence, RI, in October 2015.

Sarah Peskoe (advisor Donna Spiegelman; supported on this grant from 2013-2015) is a second year student and current trainee on this grant. In addition to completing her coursework, Ms. Peskoe is working on two research projects. The first project develops methodology for estimating latency parameters in Cox regression and investigates the effect of measurement error on the estimation of those parameters. Examples include estimating the age at which aspirin intake puts people at the greatest risk of subsequently developing colorectal cancer, or estimating the time since infection with HIV at which it is best to start ARVs. This project has three aims. The first is to develop methodology to incorporate point and interval estimates of a latency period (such as cumulative exposure variables, relevant exposure windows, and the beginning, end, and other landmark features of time-varying exposure metrics) in Cox regression. The second, which is Ms. Peskoe's primary focus, is to understand the effect of measurement error and misclassification on estimation of these latency parameters, specifically in Cox regression. Finally, she expects to develop statistical methods that explicitly account for measurement error in order to obtain unbiased estimates of a broad range of latency parameters.

The second project is the validation of a risk prediction model for colorectal cancer, incorporating both family history and modifiable/behavioral risk factors, in the Health Professionals Follow-Up Study. Ms. Peskoe is working on her first dissertation paper with Professors Donna Spiegelman and Molin Wang; the working title is "Impact of measurement error on latency estimation in linear models." She presented the early findings from this work at HSPH Research Day in November 2014. She has also been participating in the Nurses' Health Study Air Pollution meetings, with a particular interest in studying the incidence of lung cancer when there is uncertainty in air pollution measures. Ms. Peskoe attended the Joint Statistical Meetings in 2014 (Boston) and will be attending again in 2015 (Seattle).

Theodore Huang (advisor Robert Gray; supported on this grant from 2014-2015) is a first year student and current trainee on this grant. As part of his coursework, Mr. Huang has taken the cancer cognate classes Epidemiology of Cancer (EPI 213) and Fundamental Concepts in Gene Mapping (BIO 227), which can be applied to cancer problems. He has also been conducting research with Dr. Parmigiani during the past winter and the current spring semester, organizing and analyzing a data set collected by the Cancer Genetics Network (CGN) with the goal of building and validating models that predict the risk of carrying cancer susceptibility genes, particularly those for breast cancer.

Kelly Mosesso (advisor Paul Catalano; supported on this grant from 2014-2015) is a first year student and current trainee on this grant. In the summer of 2015, Ms. Mosesso will undertake a research project investigating the impact of differential cluster sizes on Bayesian analyses of cluster-correlated semicompeting risks data on hospital readmission among late-stage pancreatic cancer patients.

Abigail Sloan (advisor Lee-Jen Wei; supported on this grant from 2014-2015) is a first year student and current trainee on this grant. She has completed the cancer cognate courses Epidemiology of Cancer and Environmental Epigenetics, and her planned summer research project involves subtype analysis with missing subtype data. Missing data problems are common for tumor subtype data. The current "naive" approach is the missing indicator method, which treats the missing subtypes as a new subtype category. The research goal is to determine the assumptions about the missingness pattern required for the missing indicator method to be valid, and to propose methods that handle missing subtype data validly and efficiently. This research was motivated by cancer subtype analyses in the Nurses' Health Study and other cohort studies.
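As a concrete illustration of the missing indicator method described above, the following minimal sketch encodes subtype labels as indicator columns and gives missing values a column of their own. The function name `missing_indicator_encode`, the label "missing", and the subtype names "A" and "B" are hypothetical; the sketch shows only the coding step, not the validity analysis that is the subject of the research.

```python
def missing_indicator_encode(subtypes, missing_label="missing"):
    """One-hot encode subtype labels, treating None as its own category.

    Assumes no observed subtype is literally named `missing_label`.
    """
    filled = [missing_label if s is None else s for s in subtypes]
    categories = sorted(set(filled))
    # One indicator column per category, including the missing category.
    rows = [[1 if s == c else 0 for c in categories] for s in filled]
    return categories, rows


# Hypothetical subtype labels; None marks a tumor with unknown subtype.
categories, rows = missing_indicator_encode(["A", None, "B", "A", None])
print(categories)  # ['A', 'B', 'missing']
print(rows)        # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Any downstream model then treats "missing" exactly like an observed subtype, which is why the validity of the method depends on how the subtypes came to be missing.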

Divyagash Swargaloganathan (advisor Lorenzo Trippa; supported on this grant from 2014-2015) is a first year student and current trainee on this grant. During his upcoming summer research project, Mr. Swargaloganathan will work with Professor Martin Aryee to develop statistical and bioinformatics methods for the analysis of single-cell epigenetic data. The dual goals of this project are to perform data analysis for a study of the role of epigenetic regulation in early embryo development, and to implement the methods developed in software that will make single-cell data analysis more accessible to other researchers. This research is motivated by recently developed genomics methods that enable single-cell analysis, allowing researchers to probe the role of cellular heterogeneity and to better understand the importance of cell-to-cell variability in normal and disease biology. It has been observed, for example, that the degree of cellular heterogeneity within a tumor correlates strongly with negative prognosis, for reasons that are as yet poorly understood.

Samuel Tracy (advisor Xiaole Liu; supported on this grant from 2014-2015) is a first year student and current trainee on this grant. He has completed three cognate and other cancer-related courses: Epigenetics (EH 298), Gene Mapping (BIO 227), and Computational Biology (BIOS 298).

Mr. Tracy's winter research project was undertaken under the direction of Professor Xiaole Shirley Liu and Research Scientist Cliff Meyer at DFCI; it considered the effectiveness and efficiency of predicting regulatory gene sequences and enhancers with support vector machine classification in Python, using k-mers and training with sequence features determined by ChIP-seq analysis. His spring research study was undertaken with Professor Sebastien Haneuse and involved the weighted likelihood, pseudo-likelihood, and maximum likelihood estimation and inference procedures applicable to two-phase sampling in the case-control setting under logistic model assumptions. Much of that time was spent re-deriving, programming, and verifying the results of recent papers on the subject, as well as compiling a series of notes that document the historical development of these methods. His summer project will be an extension of the study with Dr. Haneuse. Mr. Tracy will be exploring two-phase sampling of clustered data (e.g. patient information collected in groups from multiple hospitals) with the goal of developing and programming valid estimation and inference procedures that properly account for the clustering effect.
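The k-mer representation mentioned above for the support vector machine classifier can be sketched as follows. This is an illustrative feature-extraction step only, not the code used in the project; the function name `kmer_counts`, the choice k=2, and the example sequence are assumptions. Each sequence becomes a fixed-length vector of k-mer counts, which is the kind of input a classifier such as an SVM could then be trained on.

```python
from itertools import product


def kmer_counts(sequence, k=2, alphabet="ACGT"):
    """Return (kmers, counts): every possible k-mer and how often it occurs."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper()
    # Slide a window of width k along the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in index:  # skip windows with N or other symbols
            counts[index[window]] += 1
    return kmers, counts


kmers, counts = kmer_counts("ACGTAC", k=2)
print({kmer: c for kmer, c in zip(kmers, counts) if c})
# {'AC': 2, 'CG': 1, 'GT': 1, 'TA': 1}
```

The vector length is 4^k regardless of sequence length, so sequences of different lengths map to comparable feature vectors.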

Michael Love (advisor Rafael Irizarry; supported on this grant from 2013-2015) is a Postdoctoral Fellow and current trainee on this grant. Dr. Love received his Ph.D. in Computational Biology from Freie Universitaet in 2013. His main research accomplishments during training are summarized below:

With support from NIH BD2K grant R25GM114818, Rafael Irizarry and Michael Love (Teaching Fellow and content lead) created and updated the free online biomedical data science course HarvardX PH525x Data Analysis for Genomics. PH525x is broken into 8 distinct courses, each 2-4 weeks long, with many new lectures and screencasts. Drs. Irizarry and Love created new programming assessments which guide students step-by-step through typical bioinformatic analyses. The first offering, which launched in January 2015, had more than 9000 students actively engaging with the content (watching a video or taking an assessment). The new series brings in expert faculty for these courses, including: Vincent Carey of Harvard Medical School and Brigham and Women's Hospital, Shannan Ho Sui of the Harvard Bioinformatics Core, Peter Kraft of HSPH, and Xiaole Shirley Liu and her lab at the Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health.

Maintained the DESeq2 package for differential expression analysis of sequencing data. From November 2013 through May 2014, this involved finishing benchmarking analyses for the manuscript "Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2", which was published in Genome Biology in December 2014 (PMC4302049) and has already received 27 citations; adding new functionality to the software, including stable multi-group comparisons; and providing support to DESeq2 users on various online mailing lists and forums. During this time, DESeq2 was added as a built-in tool to Illumina's BaseSpace cloud computing environment for analyzing RNA-Seq data, and two independent manuscripts [1,2] found this software to be one of the most powerful for differential analysis of sequencing data. The DESeq2 open source software package has been downloaded by 26,000 unique IP addresses in the past year, making it the 30th most downloaded package within the Bioconductor project and one of the top packages for statistical analysis of RNA-seq data [3,4]. DESeq2 users are also actively supported through engagement on the Bioconductor support website / mailing list. Dr. Love is the 17th top responder by ranking on the Bioconductor support site [5].

Proposed the GenomicFiles package [1] and collaborated with the Bioconductor project team on its implementation. The package provides core software infrastructure for performing statistical analysis of large numbers of files containing genomics data. This widens the scope of large-memory datasets that can be analyzed using Bioconductor and R statistical packages.


Dr. Love has given four presentations during training. The first, "Analysis of RNA-Seq at the gene level," was given in November 2013 at the Quantitative Issues in Cancer Research Working Seminar in the Department of Biostatistics at the Harvard School of Public Health. The second, "Shrinkage estimators for differential analysis of RNA-Seq," was given in May 2014 at the Immunology Division Bioinformatics Seminar at Harvard Medical School. The third, "Multiple group comparisons for RNA-Seq and stable effect size estimates," was given at HiTSeq in July 2014. The fourth, "Simplified processing of large genomic datasets with GenomicFiles," was presented at the BioC Conference in August 2014.

Dr. Love also presented at two panels at ISMB in July 2014: "RNA-Seq workflows in Bioconductor," in the "Trends in genomic data analysis with R/Bioconductor" workshop and "Details on running a MOOC in Bioinformatics and Biostatistics" in the "Workshop on Education in Bioinformatics." Finally, he taught a 2-hour lab, "Analysis of RNA-Seq using the DESeq2 package" at the BioC Conference in August 2014.