Current and Recent Fellows (2015-2016)
Sixing Chen (advisor Xihong Lin; supported on this grant from 2012-2015) is a third year student and current trainee on this grant who is working with Dr. Lin on approximating power in the presence of measurement error. His current project concerns methods for analyzing data from differing sequencing platforms. He is looking at two approaches, one of which allows analysis with naive variance
estimates. He plans to apply this work to a cancer dataset and is currently working on the manuscript for this
paper. He will also be presenting this work at JSM in August.
Kevin Galinsky (advisor Alkes Price; dissertation committee Nick Patterson [Broad Institute] and
Liming Liang; supported on this grant from 2012-2014) is a third year student.
Mr. Galinsky is working with Dr. Price on a project to estimate principal components (PCs) of genomic
data. Principal components analysis (PCA) is a technique that can infer population structure in genomic data
and has also been used to correct for population stratification in genome wide association studies (GWAS).
Unfortunately, traditional PCA can run slowly as the number of samples gets large, with running time
proportional to the number of samples cubed and memory consumption proportional to the number of samples
squared. Mr. Galinsky is working on applying to genomic data an accurate approximation to PCA whose running time is linear in the
number of subjects. This algorithm runs orders of magnitude faster on data sets with tens of
thousands of individuals. It allows the computation of PCs on very large datasets without specialized hardware.
Mr. Galinsky has also extended a selection statistic that operates on discrete populations to instead use SNP
loadings from PCA. He applied these tools to a cohort of 54k individuals of European ancestry from the San
Francisco Bay Area and has found natural selection acting upon the ADH1B, IGFBP3, and IGH loci. IGFBP3
has been implicated in a number of cancers and is associated with breast cancer. The results of this project
have been submitted to Nature Genetics and are currently being reviewed.
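The contrast drawn above between cubic-time traditional PCA and an approximation that scales linearly with the number of samples can be illustrated with a generic technique. The sketch below uses plain power iteration, which is an assumption chosen for illustration and is not claimed to be the algorithm Mr. Galinsky applied; the function names `center_columns` and `top_pc` are hypothetical. It never builds the samples-by-samples matrix, so each iteration costs time proportional to the number of samples times the number of features.

```python
import random


def center_columns(rows):
    """Subtract each column's mean so PCA describes variation around the mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_pc(rows, n_iter=200, seed=0):
    """Approximate the leading principal component by power iteration.

    rows: list of samples, each a list of numeric features (e.g. genotypes).
    Returns (loadings, scores): one loading per feature, one score per sample.
    Illustrative sketch only; assumes the data are not all identical.
    """
    x = center_columns(rows)
    n_features = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    for _ in range(n_iter):
        # scores = X v : one pass over the data, linear in the number of samples
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        # v = X^T scores : a second pass, again linear in the number of samples
        v = [sum(s * row[j] for s, row in zip(scores, x))
             for j in range(n_features)]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores
```

The loadings returned here play the role of the per-SNP loadings mentioned above, and the scores are the per-individual PC coordinates. Further components would require deflation or a block method, which this sketch omits.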
Mr. Galinsky's next project is studying cross-population heritability. Heritability is defined as the
proportion of phenotypic variance explainable by genetics. Mixed model methods have been employed to
estimate heritability for numerous phenotypes and these methods have been extended to estimate heritability
for two traits simultaneously, along with the correlation of genetic effects for these two traits. The goal of the
project is to extend this technique from the study of two phenotypes in one population to one phenotype in two
populations. This project seeks to address how applicable a GWAS in one population is to
another, and potentially to answer whether additive causal effect sizes are indeed the
same across populations or whether gene-by-gene and gene-by-environment interactions must be considered more closely.
Daniel Schlauch (advisor John Quackenbush; supported on this grant from 2012-2014) is a third
year student. In the summer of 2013, Mr. Schlauch worked on a project with Professor Quackenbush
identifying novel, validated molecular subtypes of ovarian cancer which have clinical and prognostic relevance.
He has continued this research, focusing on the biological networks and signal passing mechanisms which
drive tumor progression and which are differentiated between subtypes of cancer.
Mr. Schlauch has also been developing novel, computationally efficient methods for inferring gene regulatory
networks based on the use of complementary data sources such as gene expression data and transcription
factor binding motifs. He has developed a method, BERE, which infers bipartite regulatory networks with
accuracy comparable to or better than that of leading algorithms and which can be run on a local machine in a fraction of the time.
Mr. Schlauch recently contributed a Bioconductor package, pandaR, an R implementation of the network
inference method PANDA, which has been shown to be a leader in inferring transcription factor-gene
regulatory structure from expression data, motif data, and protein-protein interaction data. A draft publication
for submission to Bioinformatics has been written regarding this package. In addition, much of his work has
been on the analysis of regulatory networks for identifying biologically meaningful structural differences. His
work has applications in the discovery of mechanisms for disease progression by identifying the key
components in the regulatory network which are altered between two study groups. The analytic approaches
are general and may suggest biological mechanism or potential therapeutic targets in a wide range of diseases
not limited to cancer and COPD.
Ryan Sun (advisor Xihong Lin; dissertation committee Peter Kraft, Eric Tchetgen Tchetgen, David
Christiani; supported on this grant from 2012-2014) is a third year student who has completed his written and
Mr. Sun has been working with Dr. Xihong Lin to study statistical models used in detecting specific
gene-environment interactions that may be associated with disease. It is believed that complex relationships
between genetic and environmental risk factors characterize the etiology of many diseases, but there exist very
few validated gene-environment interaction findings in the literature. Studying these interactions can have a great
public health benefit by helping to better understand biological mechanisms that cause complex diseases,
identify populations susceptible to disease, and discover risk factors that may not show large marginal effects.
Specifically, Mr. Sun's focus is on improving methods of inference so that they can
better detect interactions between genes and environmental factors. Mr. Sun is in the later stages of drafting his first
paper on this work, tentatively titled "Testing for Gene-Environment Interactions under Misspecification of
Environmental Exposure." He is developing theoretical results about the validity of testing for gene-environment
interactions in the common regression setting and proposing a new resampling-based procedure
for performing this inference. The methods are broadly applicable to most combinations of environmental
exposure and disease, including cancer genetics studies. He will be presenting this work at the Joint Statistical
Meetings (JSM) in August, and has also submitted an abstract to present at the ICSA (International Chinese
Statistical Association) Statistics Symposium. Mr. Sun has also been performing statistical analysis for the
HSPH Superfund Research Project on their gene-environment team, specifically on a study of how toxic
metals impact infant neurodevelopment.
David Tucker (advisor Paul Catalano; supported on this grant from 2012-2014) graduated from
the program with an MS in Spring 2015.
In addition to completing coursework, Mr. Tucker undertook summer research under the advisement
of Professor Francesca Dominici. Their work involves utilizing Bayesian hierarchical modelling to model
the exposure-response relationship between aircraft noise and cardiovascular disease (CVD) and investigate
evidence of a change point. The primary scientific question of interest seeks to determine if there is a noise
threshold or change point below which there is a mitigated effect on a person's risk of CVD hospitalization and
above which the risk of CVD hospitalization significantly increases. The dataset they worked with
combines Medicare with noise contours provided by the Federal Aviation Administration.
David Cheng (advisor Lee-Jen Wei; supported on this grant from 2013-2015) is a second year
student and current trainee on this grant.
Mr. Cheng is interested in statistical methodology related to identifying high-risk patient subgroups and
optimal therapies for patient subgroups. This problem is particularly relevant to treating cancer due to the
diversity of disease sub-types and treatments available in cancer. Such methodologies may aid the
development of individualized strategies for the prevention, diagnosis, and treatment of cancer. He is also
interested in developing methodology for comparative effectiveness research to compare the benefit and risks
of competing treatments from observational data. These methods may help inform patient decisions regarding
identifying the optimal treatment among all treatments that are available despite the lack of direct head-to-head
evidence from randomized trials. Mr. Cheng is currently working with Professor Wei on the development of risk
prediction methodology using decision tree models that are more interpretable and actionable for clinical
decision making compared to traditional risk scoring approaches. This work was presented at a session with
FDA CDER Office of Biostatistics director Lisa LaVange at the Dana-Farber Cancer Institute. He is also
currently working on developing a method for treatment selection based on decision trees to
identify enrichable patient subgroups that are likely to benefit from treatment.
In a separate research project, Mr. Cheng is working on a manuscript on a method to allow for
adjustments for differences in patient characteristics when comparing outcomes between studies when the full
individual patient data is available for only one study. This method would potentially facilitate comparisons of
treatments across studies when direct comparisons in a clinical trial would otherwise be unavailable.
Ina Jazic (advisor Sebastien Haneuse; supported on this grant from 2013-2015) is a second year
student and current trainee on this grant. In addition to completing her coursework and written qualifying
exam, Ms. Jazic has been working on a project with DFCI Research Scientist Svitlana Tyekucheva and
Professor Giovanni Parmigiani using gene expression data to detect cross-talk between tumor and stroma in
breast and ovarian cancer. She is currently preparing an R package that will implement this method.
In the summer of 2014, Ms. Jazic began working with Sebastien Haneuse on a project related to semi-competing
risks -- the situation where primary interest lies in a non-terminal event, such as diagnosis,
metastasis, or hospital readmission, whose occurrence is subject to a terminal event. Specifically, when
evaluating the impact of interventions on quality of life among populations at high risk of death (such as
terminal cancer patients), it is crucial to correctly handle death as a competing risk. She is currently preparing a
manuscript illustrating the implications of semi-competing risks on the use of composite endpoints in clinical trials.
In March 2014, Ms. Jazic presented "Cross-Talk Analysis in Breast Cancer Tissues," at the Joint
Symposium of the Dana-Farber/Harvard Cancer Center Programs in Breast and Gynecologic Cancer in
Boston, MA. This poster won an award in the Genetics and Genomics session. She submitted an abstract
based on the composite endpoints project to the International Conference on Health Policy Statistics, which
took place in Providence, RI, in October 2015.
Sarah Peskoe (advisor Donna Spiegelman; supported on this grant from 2013-2015) is a second
year student and current trainee on this grant. In addition to completing her coursework, Ms. Peskoe is
working on two research projects. The first project is to develop methodology for the estimation of latency
parameters in Cox regression and to investigate the effect of measurement error on the estimation of those
latency parameters. Examples include estimating the age at which aspirin intake puts people at the
greatest risk of subsequently developing colorectal cancer, or estimating the time since infection with HIV at which
it is best to start ARVs. This project is three-fold. The first aim is to develop methodology to incorporate point
and interval estimates of a latency period (such as cumulative exposure variables, relevant exposure windows,
and the beginning, end, and other landmark features of time-varying exposure metrics) in Cox regression. A
second component of the project, which is Ms. Peskoe's primary focus, is to understand the effect of
measurement error and misclassification on estimation of these latency parameters, specifically in Cox
regression. Finally, she expects to develop statistical methods that take measurement error explicitly into
account to obtain unbiased estimates of a broad range of latency parameters.
The second project is validation of a risk prediction model for colorectal cancer, incorporating both
family history and modifiable/behavioral risk factors in the Health Professionals Follow-Up Study. Ms. Peskoe
is working on her first dissertation paper with Professors Donna Spiegelman and Molin Wang; the working title
is "Impact of measurement error on latency estimation in linear models." She presented the early findings from
this work at HSPH Research Day in November 2014. She has also been participating in the Nurses' Health
Study Air Pollution meetings, with a particular interest in studying the incidence of lung cancer when there is
uncertainty in air pollution measures. Ms. Peskoe attended the Joint Statistical Meetings in 2014 (Boston) and
will be attending again in 2015 (Seattle).
Theodore Huang (advisor Robert Gray; supported on this grant from 2014-2015) is a first year
student and current trainee on this grant. As part of his coursework, Mr. Huang has taken the cancer cognate
classes Epidemiology of Cancer (EPI 213) and Fundamental Concepts in Gene Mapping (BIO 227), which can
be applied to cancer problems. He has also been conducting research with Dr. Parmigiani for the past winter
and this spring semester, and has been working on organizing and analyzing a data set collected by the
Cancer Genetics Network (CGN), with the goal of building and validating models that predict risk of carrying
cancer susceptibility genes, particularly for breast cancer.
Kelly Mosesso (advisor Paul Catalano; supported on this grant from 2014-2015) is a first year
student and current trainee on this grant. In the summer of 2015, Ms. Mosesso will undertake a research
project investigating the impact of differential cluster sizes on Bayesian analyses of cluster-correlated semi-competing
risks hospital readmission data for late-stage pancreatic cancer patients.
Abigail Sloan (advisor Lee-Jen Wei; supported on this grant from 2014-2015) is a first year
student and current trainee on this grant. She has completed cancer cognate courses, Epidemiology of
Cancer and Environmental Epigenetics, and her summer research project plan includes subtype analysis with
missing subtype data. Missing data problems are common for tumor subtype data. The current "naive" method
is to use the missing indicator method to treat the missing subtypes as a new subtype category. The research
goal is to determine the missing-data pattern assumptions required for the missing indicator method to be valid, and
to propose methods to validly and efficiently handle missing subtype problems. This research was motivated by
cancer subtype analyses in the Nurses' Health Study and other cohort studies.
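The "naive" missing indicator method described above amounts to a simple recoding step. The following is a minimal sketch under stated assumptions: unknown subtypes are represented as `None`, the label `"missing"` and the function names are hypothetical, and the subtype names in the usage note are invented for illustration rather than taken from any study.

```python
from collections import Counter

# Hypothetical label for the extra category created by the method.
MISSING = "missing"


def missing_indicator(subtypes):
    """Recode unknown subtypes (None) as their own subtype category."""
    return [MISSING if s is None else s for s in subtypes]


def subtype_counts(subtypes):
    """Count cases per category after the missing-indicator recoding."""
    return Counter(missing_indicator(subtypes))
```

For example, `subtype_counts(["luminal", None, "basal", None])` counts the two unknown cases under `"missing"` alongside one case each of the two observed subtypes. Whether an analysis built on this recoding is valid depends on why the subtypes are missing, which is the question the project sets out to answer.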
Divyagash Swargaloganathan (advisor Lorenzo Trippa; supported on this grant from 2014-2015)
is a first year student and current trainee on this grant. During his upcoming summer research project, Mr.
Swargaloganathan will work with Professor Martin Aryee to develop statistical and bioinformatics methods for
the analysis of single-cell epigenetic data. The dual goals of this project are to perform data analysis in the
context of a study of the role of epigenetic regulation in early embryo development, as
well as to implement the methods developed into software that will make single-cell data analysis more
accessible to other researchers. This research is motivated by recently developed genomics methods for
single-cell analysis that allow researchers to probe the role of cellular heterogeneity and to better
understand the importance of cell-to-cell variability in normal and disease biology. It has been observed, for
example, that the degree of cellular heterogeneity within a tumor correlates strongly with negative prognosis,
for reasons that are as yet poorly understood.
Samuel Tracy (advisor Xiaole Liu; supported on this grant from 2014-2015) is a first year student
and current trainee on this grant. He has completed three cognates and other cancer-related courses:
Epigenetics (EH 298), Gene Mapping (BIO 227), and Computational Biology (BIOS 298).
Mr. Tracy's winter research project was undertaken under the direction of Professor Xiaole Shirley Liu
and Research Scientist Cliff Meyer at DFCI; it considered the effectiveness and efficiency of predicting
regulatory gene sequences and enhancers with support vector machine classification in Python, using k-mers
and training with sequence features determined by ChIP-seq analysis. His spring research study was
undertaken with Professor Sebastien Haneuse, and involved the weighted, pseudo-, and maximum
likelihood estimation and inference procedures applicable to two-phase sampling in the case-control setting
under logistic model assumptions. Much of that time was spent re-deriving, programming, and verifying the
results of recent papers on the subject, as well as compiling a series of notes that document its historical
progression. His summer project will be an extension of the study with Dr. Haneuse. Mr. Tracy will be exploring
two-phase sampling of clustered data (e.g., patient information collected in groups from multiple hospitals) with
the goal of developing and programming valid estimation and inference procedures that properly account for
the clustering effect.
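As a rough illustration of the k-mer features used in the winter project described above, the sketch below turns a DNA sequence into a fixed-length vector of k-mer counts, the kind of input a support vector machine classifier could be trained on. It is an assumption-laden example, not the project's code: the function names `kmer_counts` and `kmer_vector` are hypothetical, and the classifier itself is not shown.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(sequence, k, alphabet="ACGT"):
    """Return counts for all len(alphabet)**k possible k-mers, in sorted
    alphabet order, so that every sequence maps to a vector of equal length.

    k-mers containing characters outside the alphabet (such as N) are ignored.
    """
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[kmer] for kmer in vocabulary]
```

Vectors of this form, computed for sequences labeled by ChIP-seq analysis, could then be passed to any standard SVM implementation for training.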
Michael Love (advisor Rafael Irizarry; supported on this grant from 2013-2015) is a Postdoctoral
Fellow and current trainee on this grant. Dr. Love received his Ph.D. in Computational Biology from Freie
Universitaet in 2013. His main research accomplishments during training are summarized below:
With support from an NIH BD2K grant R25GM114818, Rafael Irizarry and Michael Love (Teaching
Fellow and content lead) created and updated the biomedical data science free online course, HarvardX
PH525x Data Analysis for Genomics. PH525x is broken into 8 distinct courses, each 2-4 weeks long, with
many new lectures and screencasts. Drs. Irizarry and Love created new programming assessments which
guide students step-by-step through typical bioinformatic analyses. The first offering, which launched in
January 2015, had more than 9000 students actively engaging with the content (watched a video or took an
assessment). The new series brings in expert faculty for these courses, including: Vincent Carey of Harvard
Medical School and Brigham and Women's Hospital, Shannan Ho Sui of the Harvard Bioinformatics Core,
Peter Kraft of HSPH, and Xiaole Shirley Liu and her lab at the Dana-Farber Cancer Institute and Harvard T.H.
Chan School of Public Health. More information can be found at:
http://genomicsclass.github.io/book/pages/classes.html. Class notes are available here: http://genomicsclass.github.io/book/.
Maintained the DESeq2 package for differential expression of sequencing data. From November
2013 through May 2014, this involved finishing benchmarking analyses for the manuscript, "Moderated
estimation of fold change and dispersion for RNA-Seq data with DESeq2", which was published in Genome
Biology in December 2014 (PMC4302049), and has already received 27 citations; adding new functionality to
the software, including stable multi-group comparisons; and providing support to DESeq2 users on various
online mailing lists and forums. During this time, DESeq2 was added as a built-in tool to Illumina's BaseSpace
cloud computing environment for analyzing RNA-Seq data, and two independent manuscripts [1,2] found this
software to be one of the most powerful for differential analysis of sequencing data. The DESeq2 open source
software package has been downloaded by 26,000 unique IP addresses in the past year, making it the 30th
most downloaded package within the Bioconductor project, and one of the top packages for statistical analysis
of RNA-seq data [3,4]. DESeq2 users are also actively supported through engagement on the Bioconductor
support website / mailing list. Dr. Love is the 17th top responder by ranking on the Bioconductor support site.
Proposed and collaborated with Bioconductor project team in the implementation of the
GenomicFiles package, which provides core software infrastructure for performing statistical analysis of
large numbers of files containing genomics data. This widens the scope of large-memory datasets which can
be analyzed using Bioconductor and R statistical packages.
Dr. Love has given four presentations during training. The first, "Analysis of RNA-Seq at the gene
level," was given in November 2013 at the Quantitative Issues in Cancer Research Working Seminar in the
Department of Biostatistics at the Harvard School of Public Health. The second, "Shrinkage estimators for
differential analysis of RNA-Seq," was given in May 2014 at the Immunology Division Bioinformatics Seminar
at Harvard Medical School. The third, "Multiple group comparisons for RNA-Seq and stable effect size
estimates," was given at HiTSeq in July 2014. The fourth, "Simplified processing of large genomic datasets
with GenomicFiles," was presented at the BioC Conference in August 2014.
Dr. Love also presented at two panels at ISMB in July 2014: "RNA-Seq workflows in Bioconductor," in
the "Trends in genomic data analysis with R/Bioconductor" workshop and "Details on running a MOOC in
Bioinformatics and Biostatistics" in the "Workshop on Education in Bioinformatics." Finally, he taught a 2-hour
lab, "Analysis of RNA-Seq using the DESeq2 package" at the BioC Conference in August 2014.