Biostatistics/Epidemiology Training Grants in AIDS
 


Fellows






Current Fellows (2013-2014)

David Cheng (advisor Lee-Jen Wei) is a first year student who is currently working to complete his coursework. In the current academic year he has taken the following courses: EPI213 Epidemiology of Cancer, BIO230 Probability Theory and Applications I, BIO231 Statistical Inference I, BIO232 Methods I, BIO233 Methods II, EPI201 Epidemiologic Methods I, EPI202 Epidemiologic Methods II, BIO510 Programming I, BIO509 Intro to Statistical Computing Environments, BIO238 Clinical Trials, EPI293 Analysis of Genetic Association Studies, and BIO523 Statistical and Quantitative Methods for Pharmaceutical Regulatory Science. David is interested in statistical methodology for identifying high-risk patient subgroups and optimal therapies for specific patient subgroups. This problem is particularly relevant to cancer because of the diversity of disease subtypes and available treatments. Such methodologies may aid the development of individualized strategies for the prevention, diagnosis, and treatment of cancer. He is also interested in methodology for comparative effectiveness research, which compares the benefits and risks of competing treatments using observational data. These methods may help patients identify the optimal treatment among all available treatments. David plans to begin work on a project related to these problems this summer. He will look at the development of risk prediction methodology based on random forest models, examining ways of using random forests to identify and validate high-risk subgroups in a clinical trial population. Specifically, the project will consider ways to estimate tree models that have a meaningful clinical interpretation and identify subgroups that can potentially be replicated in external data. The resulting methodology will be applied to data from a breast cancer clinical trial.

Ina Jazic (advisor Francesca Dominici) is a first year student who is currently working to complete her coursework. In the current academic year she has taken the following courses: BIO230 Probability Theory and Applications I, BIO231 Statistical Inference I, BIO232 Methods I, BIO233 Methods II, BIO238 Clinical Trials, BIO509 Intro to Statistical Computing Environments, EPI201 Introduction to Epidemiology, EPI213 Epidemiology of Cancer, and BIO523 Statistical and Quantitative Methods for Pharmaceutical Regulatory Science. In addition, Ina has been working on a project with Svitlana Tykeucheva and Giovanni Parmigiani using gene expression data to detect cross-talk between tumor and stroma in breast and ovarian cancer. This summer, she will be working with Sebastien Haneuse on a project whose goal is to establish a framework for semi-competing risks with applications to pancreatic cancer.

In March 2014, Ina presented the poster "Cross-Talk Analysis in Breast Cancer Tissues" at the Joint Symposium of the Dana-Farber/Harvard Cancer Center Programs in Breast and Gynecologic Cancer in Boston, MA. This poster won an award in the Genetics and Genomics session.

Sarah Peskoe (advisor Rebecca Betensky) is a first year student who is currently working to complete her coursework. In the current academic year she has taken the following courses: EPI213 Epidemiology of Cancer, BIO230 Probability Theory and Applications I, BIO231 Statistical Inference I, BIO232 Methods I, BIO233 Methods II, EPI201 Epidemiologic Methods I, EPI202 Epidemiologic Methods II, BIO510 Programming I, BIO509 Intro to Statistical Computing Environments, BIO238 Clinical Trials, EPI510 Global Cancer Epidemiology, and EPI509 Evidence Based Epidemiology. In addition, she is working on two projects. The first project develops methodology for estimating latency parameters in Cox regression and investigates the effect of measurement error on the estimation of those parameters. Examples include estimating the age at which aspirin intake puts people at the greatest risk of subsequently developing colorectal cancer, or estimating the time since infection with HIV at which it is best to start ARVs.

This first project has three aims. The first aim is to develop methodology to incorporate point and interval estimates of a latency period (such as cumulative exposure variables, relevant exposure windows, and the beginning, end, and other landmark features of time-varying exposure metrics) in Cox regression. The second aim, which will be Sarah's primary focus, is to understand the effect of measurement error and misclassification on the estimation of these latency parameters, specifically in Cox regression. The plan is to investigate this analytically, perhaps starting with a linear model and then moving to the Cox model. If this proves intractable, the investigation will proceed through numerical calculations and/or a series of simulations that incorporate the methods developed to estimate latency parameters, evaluating the effect of measurement error under a series of scenarios likely to be encountered in practice. Finally, time permitting, the third aim is to develop statistical methods that take measurement error explicitly into account in order to estimate a broad range of latency parameters without bias. The second project is the validation of a risk prediction model for colorectal cancer, incorporating both family history and modifiable/behavioral risk factors in the Health Professionals Follow-Up Study.

Kevin Galinsky (advisor Alkes Price) is a second year student who is working to complete his core coursework. In the current academic year he has taken the following courses: BIOSTAT 235 Advanced Regression and Statistical Learning, BIOSTAT 249 Bayesian Methodology in Biostatistics, BIOSTAT 250 Probability Theory and Applications II, and STAT 244 Linear and Generalized Linear Models. In addition to his coursework he is working with Dr. Price on a project dealing with population stratification in genome-wide association studies (GWAS). Population stratification can confound the results of these studies if subjects come from populations with different disease and genetic profiles. One method to remove this confounding effect is to remove the effect of the top eigenvectors from principal components analysis (PCA). Unfortunately, PCA can run slowly, with a running time proportional to the square of the number of subjects in the study. Kevin is working on applying to genomic data an accurate approximation to PCA whose running time is linear in the number of subjects. This algorithm consumes much less memory and runs orders of magnitude faster on data sets with tens of thousands of individuals. It allows the computation of previously intractable PCs on commodity hardware.
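
As a rough illustration of how an approximate PCA can scale linearly with the number of subjects, the sketch below uses a generic randomized projection with power iterations. It is not the algorithm Kevin is developing, which is not described here; the function name, parameters, and defaults are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def randomized_pca(genotypes, n_components=2, n_oversamples=10,
                   n_power_iter=4, seed=0):
    """Approximate the top principal components of a subjects x SNPs matrix.

    Generic randomized-projection sketch (hypothetical helper, not the
    method described in the text). Every matrix product costs
    O(n_subjects * n_snps * k), so the work grows linearly in n_subjects.
    """
    rng = np.random.default_rng(seed)
    # Center each SNP column so the components describe variation around the mean.
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)
    n_snps = X.shape[1]
    k = n_components + n_oversamples

    # Project onto k random directions to capture the dominant subspace.
    Y = X @ rng.standard_normal((n_snps, k))
    # Power iterations sharpen the separation between top and trailing components.
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(X.T @ Y)
        Y = X @ Z

    # Orthonormal basis for the sampled subspace, then a small exact SVD.
    Q, _ = np.linalg.qr(Y)
    U_small, singular_values, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
    subject_pcs = Q @ U_small
    return subject_pcs[:, :n_components], singular_values[:n_components]
```

The returned subject-level components could then be included as covariates in an association model to adjust for stratification; the exact SVD is only ever taken on a small k x n_snps matrix.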

Daniel Schlauch (advisor John Quackenbush) is a second year student who is working to complete his core coursework. In the current academic year he has taken BIOSTAT 235 Advanced Regression and Statistical Learning, BIOSTAT 244 Analysis of Failure Time Data, BIOSTAT 250 Probability Theory and Applications II, BIOSTAT 298 Introduction to Computational Biology and Bioinformatics, EPI511 Advanced Population and Medical Genetics, BIOSTAT 249 Bayesian Methodology in Biostatistics, BIO514 Data Structures and Algorithms, BIO515 Measurement Error and Misclassification, and EPI223 Cancer Prevention. In the summer of 2013 Daniel worked on a project under advisor John Quackenbush identifying novel, validated molecular subtypes of ovarian cancer which have clinical and prognostic relevance. He has continued this research under Dr. Quackenbush, focusing on the biological networks and signal passing mechanisms which drive tumor progression and which differ between subtypes of cancer.

Sixing Chen (advisor Xihong Lin) is a second year student who is working to complete his coursework. In the current academic year he has taken BIO244 Analysis of Failure Time Data, BIO245 Analysis of Multivariate and Longitudinal Data, and BIO251 Statistical Inference II. Sixing is working with Dr. Lin to attempt to approximate power in the presence of measurement error. He is simulating situations with both prospective and case-control sampling to see how well his approximation performs.

David Tucker (advisor Paul Catalano) is a second year student who is working to complete his coursework. In the current academic year he has taken BIO249 Bayesian Methodology in Biostatistics, BIO250 Probability Theory and Applications II, STAT111 Introduction to Theoretical Statistics, BIO244 Analysis of Failure Time Data, and BIO245 Analysis of Multivariate and Longitudinal Data. In addition to his coursework he is continuing his summer research under the advisement of Professor Francesca Dominici. Their work uses Bayesian hierarchical modeling to characterize the exposure-response relationship between aircraft noise and cardiovascular disease (CVD) and to investigate evidence of a change point. The primary scientific question is whether there is a noise threshold or change point below which there is a mitigated effect on a person's risk of CVD hospitalization and above which the risk of CVD hospitalization significantly increases. The dataset they are working with combines Medicare data with noise contours provided by the Federal Aviation Administration.

Ryan Sun (advisor Xihong Lin) is a second year student who has completed his coursework. In the current academic year he has taken BIO245 Analysis of Multivariate and Longitudinal Data, BIO244 Analysis of Failure Time Data, BIO251 Statistical Inference II, and EPI511 Advanced Population and Medical Genetics. Ryan is currently working with Dr. Xihong Lin on studying how we can better perform inference on high-dimensional data, specifically genomic data. Generally when researchers conduct studies to determine which genes may be linked to diseases, they generate enormous data sets which contain millions of possible candidate SNPs (single nucleotide polymorphisms). The challenge of conducting inference on these millions of markers carries a number of problems that have not been considered in more classical settings. The point of this work is to see if we can improve these methods to better detect the links between genetics and diseases like cancer.

Ian Barnett (advisor Xihong Lin) is a fourth year student who has recently defended his thesis, "SNP-set Tests for Sequencing and Genomewide Association Studies". He has developed a method for SNP-set testing in GWAS based on the higher criticism. This method is powerful when the disease-causing SNPs within the SNP-set are sparse. The effectiveness of this method is demonstrated on the CGEM breast cancer GWAS dataset, successfully implicating the FGFR2 gene. The corresponding paper is in preparation and will be submitted to JASA data and application this summer. Also as part of this body of work, a preliminary paper that sets a foundation for applying the higher criticism in SNP-set testing has just been accepted to Biometrika.
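
For readers unfamiliar with the higher criticism, the sketch below computes the classical statistic from a set of p-values, under the assumption that the p-values are independent. It is only meant to show why the statistic responds to sparse signals; it is not the SNP-set method described above, and the function name and clipping constant are hypothetical.

```python
import numpy as np

def higher_criticism(p_values, eps=1e-12):
    """Classical higher criticism statistic for independent p-values.

    Hypothetical illustration only; it does not account for correlation
    between SNPs. Compares the sorted p-values with their expected
    uniform quantiles and returns the largest standardized excess.
    """
    # Clip to avoid dividing by zero when a p-value is exactly 0 or 1.
    p = np.clip(np.sort(np.asarray(p_values, dtype=float)), eps, 1 - eps)
    n = p.size
    ranks = np.arange(1, n + 1)
    # Under the null, about a fraction i/n of p-values fall below the i-th smallest.
    standardized = np.sqrt(n) * (ranks / n - p) / np.sqrt(p * (1 - p))
    return standardized.max()
```

A single very small p-value among otherwise unremarkable ones produces a large value, which is the sparse-signal setting mentioned in the text.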

Ian won the student paper award at ENAR 2014 on March 17th in Baltimore, where he presented "The generalized higher criticism for testing SNP-sets in genetic association studies". He also won the JSM Statistics in Epidemiology student paper award for JSM 2014 in August, where he will present the same work.

Florence Yong (advisor Lee-Jen Wei) is a fourth year student. She is currently working on her dissertation, which focuses on quantitative methods for stratified medicine. In paper one, Florence raises awareness of overly optimistic models derived from conventional survival analysis approaches and proposes an efficient and more reliable modeling approach for selecting models and making inferences. In paper two, the objective is to propose a systematic procedure for evaluating candidate models and a discretization procedure for grouping patients into more meaningful and actionable categories. These research methods can be applied in many areas; however, application to cancer research is especially timely in light of rapid genetic target screening and drug discovery efforts.

In October 2013 Florence was awarded the Certificate of Distinction in Teaching which recognized her for outstanding teaching during the 2012-2013 academic year.

Florence also has been accepted to present at two upcoming meetings. The first is, "Predicting subject-specific outcome via an optimal stratification procedure with baseline covariates," at the JSM meeting in Boston, MA on August 6, 2014. The second is, "An analytical approach to personalized risk prediction and stratified intervention strategy," at the Inaugural Predictive Analytics World for Healthcare conference also in Boston, MA on October 6-7, 2014.

Michael Love (mentor Rafael Irizarry) is a Postdoctoral Fellow in the Department of Biostatistics at the Harvard School of Public Health and the Dana-Farber Cancer Institute. Dr. Love received his PhD in Computational Biology from Freie Universitaet in 2013.

His main research accomplishments during the time funded by the cancer training grant T32 are summarized below.

  • Teaching Fellow and content lead in the creation of PH525x: "Data Analysis for Genomics", a free online edX class offered by the Harvard School of Public Health and taught by Dr. Rafael Irizarry. The class runs from April 7 to June 13, 2014. It offers 8 weeks of lectures in basic biology, genomics technologies and statistics, as well as hands-on labs for performing statistical analyses of real genomics datasets. More than 5,000 students from around the world are participating.

    1. URL: https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401

  • Maintained the DESeq2 package [1] for differential expression of sequencing data. From November 2013 through May 2014, this involved finishing benchmarking analyses for the manuscript, "Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2", which was submitted in February 2014; adding new functionality to the software, including stable multi-group comparisons; and providing support to DESeq2 users on various online mailing lists and forums. During this time, DESeq2 was added as a built-in tool to Illumina's BaseSpace cloud computing environment for analyzing RNA-Seq data, and two independent manuscripts [2,3] found this software to be one of the most powerful for differential analysis of sequencing data.

    1. URL: http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
    2. http://nar.oxfordjournals.org/content/early/2014/04/20/nar.gku310.abstract
    3. http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003531

  • Proposed and collaborated with Bioconductor project team in the implementation of the GenomicFiles package [1], which provides core software infrastructure for performing statistical analysis of large numbers of files containing genomics data. This widens the scope of large-memory datasets which can be analyzed using Bioconductor and R statistical packages.

    1. URL: http://www.bioconductor.org/packages/devel/bioc/html/GenomicFiles.html

Michael has also given two presentations. The first, "Analysis of RNA-Seq at the gene level," was given on November 21, 2013 at the Quantitative Issues in Cancer Research Working Seminar in the Department of Biostatistics at the Harvard School of Public Health. The second, "Shrinkage estimators for differential analysis of RNA-Seq," was given on May 1, 2014 at the Immunology Division Bioinformatics Seminar at Harvard Medical School.