Department of Biostatistics
Big Data Seminar
2013 - 2014
ABSTRACT: Given the availability of genomic data, there has been growing interest in integrating multi-platform data. Here I propose to model epigenetic DNA methylation, micro-RNA expression and gene expression data as a biological process to delineate phenotypic traits under the framework of causal mediation modeling. I propose a regression model for the joint effect of methylation, micro-RNA expression and gene expression and their non-linear interactions on the outcome, and study three path-specific effects: the direct effect of methylation on the outcome, the effect mediated through gene expression and the effect mediated through micro-RNA expression. I characterize correspondences between the three path-specific effects and coefficients in the regression model, which are influenced by causal relations among methylation, micro-RNA and gene expression. A score test for variance components of regression coefficients is developed to assess path-specific effects. The test statistic under the null follows a mixture of chi-square distributions, which can be approximated using a characteristic function inversion method or a perturbation procedure. I construct tests for candidate models determined by different combinations of methylation, micro-RNA, gene expression and their interactions, and further propose an omnibus test to accommodate different models. The utility of the method will be illustrated in numerical simulation studies and a glioblastoma data set from The Cancer Genome Atlas (TCGA).
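As a rough illustration of the perturbation-style approximation mentioned above: a weighted sum of independent 1-df chi-squares can be simulated directly to calibrate the score test. This is a minimal sketch, not the speaker's implementation; the function name is hypothetical and `lams` stands in for the eigenvalues governing the mixture.

```python
import numpy as np

def mixture_chi2_pvalue(q_obs, lams, n_sim=100_000, seed=0):
    """Monte Carlo p-value for Q = sum_k lams[k] * chi2_1 under the null.

    A simple stand-in for the characteristic-function inversion
    (Davies-type) or perturbation approximations; `lams` plays the role
    of the eigenvalues of the relevant variance-component matrix.
    """
    rng = np.random.default_rng(seed)
    # Each row: one null realization of the weighted sum of independent
    # 1-df chi-square variables.
    draws = rng.chisquare(1, size=(n_sim, len(lams))) @ np.asarray(lams, float)
    return (draws >= q_obs).mean()
```

A large observed statistic relative to the mixture weights then yields a small Monte Carlo p-value.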
ABSTRACT: In cancer clinical research, discovery of molecular biomarkers has dramatically redefined how we evaluate efficacy of therapeutics. Further, the availability and decreasing costs of high-throughput assays allow for interrogation of the entire genome from biospecimens. As we have overcome the bioinformatic hurdle of "drinking from the firehose", a new series of challenges is emerging in translating the large number of genomic discoveries into multiplex assays as biomarkers of disease. In this talk, I focus on two recent extensions to the statistical methods used in "pathway analysis" and their utility and performance as molecular data increase in size. First, we have derived accurate analytic approximations to several forms of the array-permutation based tests, which greatly reduce computational requirements. Second, by adding real and imaginary weights to a summation of score statistics, one can obtain positive and negative squared terms. In this way, gene signatures that are linear predictors or probabilistic definitions of genesets can be tested under our framework. These extensions allow pathway analyses to serve as an intermediate step for evaluating signatures before translating predictive models into a new clinical context. Illustrations of their utilization draw from recent retrospective and prospective studies of molecular signatures in breast and ovarian cancer.
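One way to read the real/imaginary-weight construction (a sketch of the algebra only, not necessarily the speakers' exact formulation): split the genes into a positively weighted set A and a negatively weighted set B, and square the weighted sum of score statistics.

```latex
% Z_j: per-gene score statistics; w_j = 1 for j in A, w_j = i for j in B.
S = \sum_{j \in A} Z_j \;+\; i \sum_{j \in B} Z_j,
\qquad
\operatorname{Re}\!\left(S^2\right)
  = \Big(\sum_{j \in A} Z_j\Big)^{2} \;-\; \Big(\sum_{j \in B} Z_j\Big)^{2}.
```

The real part of the squared sum thus contains both positive and negative squared terms, which is what lets signed linear signatures be tested in a single summation framework.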
ABSTRACT: In oncology, individualized medicine requires techniques that enable visualization, quantification and monitoring of disease processes over time, in a non-invasive way, in individual patients. Medical imaging is intuitively very suitable for this purpose. The emerging field of "Radiomics" converts medical images (CT, PET and MRI) into minable data by the high-throughput application of large numbers of data characterization algorithms. This provides an opportunity to quantify the tumor phenotype and to search for predictive biomarkers using non-invasive imaging assays that can be used throughout the course of treatment. The focus of my group is the development and application of radiomic methods in large datasets of various cancer types.
ABSTRACT: Rapidly evolving genomic technologies are providing unprecedented quantities and varieties of data with the potential to provide new insights into the processes driving disease. PANDA (Passing Attributes between Networks for Data Assimilation) is a network inference method that uses "message passing" to integrate multiple sources of genomic data. PANDA begins with a map of potential transcription factor regulatory "edges" and uses phenotype-specific expression and other data types to maximize the flow of information through the transcriptional network. We have applied PANDA to the analysis of a number of human diseases, including ovarian cancer, Chronic Obstructive Pulmonary Disease and Alzheimer's disease. By comparing networks between related phenotypes, we have identified potential therapeutic interventions and found changes in regulatory patterns that help to explain sexual dimorphism in the disease. These compelling examples demonstrate how integrative network approaches can be used to uncover potentially clinically-relevant biological mechanisms that could not be discovered using simple statistical comparisons between phenotypes. As the volume and variety of data available on individual samples increases, we are working to extend PANDA to better integrate additional complementary data types and to use these to infer the most likely networks for individual patients.
ABSTRACT: We present a class of loss functions that achieve Fisher consistency in a multiclass setting, and a generic boosting algorithm that goes with them. We further suggest a cross-validation procedure for combining boosting algorithms over the same dataset, as well as a generalization of the SAMME algorithm to various loss functions.
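For orientation, the core update of the original SAMME algorithm (Zhu et al.), which the talk generalizes, can be written in a few lines; this sketch shows only the learner-weighting step, not the speaker's extended loss-function framework.

```python
import numpy as np

def samme_alpha(err, n_classes):
    """Learner weight in SAMME, the multi-class AdaBoost variant.

    Positive (the learner is kept) exactly when its weighted error
    beats random guessing, i.e. err < (K - 1) / K.
    """
    return np.log((1.0 - err) / err) + np.log(n_classes - 1.0)

def reweight(w, misclassified, alpha):
    """One boosting round: upweight misclassified samples, renormalize."""
    w = w * np.exp(alpha * misclassified.astype(float))
    return w / w.sum()
```

With K = 2 this reduces to the classical AdaBoost weight, since the log(K - 1) correction vanishes.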
ABSTRACT: Larger sample sizes and greater demographic diversity would improve power and clinical utility in genetic studies, particularly as focus shifts to sequencing and rare variants; however, many genetic studies are nested in preexisting cohort studies which are necessarily of fixed size, and historically of limited ethnic diversity. These issues could be overcome by nesting genetic studies within hospitals with an infrastructure of electronic medical records (EMRs) linked to biological samples, provided that accurate information about disease outcomes and important clinical covariates can be ascertained from the EMRs. Recently, numerous algorithms have been developed to effectively extract phenotype information from EMR data using natural language processing techniques. These algorithms typically do not provide perfect classification due to the difficulty in extracting information from the records; instead, they produce for each patient a probability that the patient is a disease case. This probability can be thresholded to define case-control status, and this estimated case-control status has been used to replicate known disease-SNP associations in EMR-based genetic studies. However, using the estimated disease status in place of true disease status results in outcome misclassification, which can diminish test power and bias odds ratio estimates. In this paper, we propose an alternative analysis approach which uses the probability of disease from the algorithm directly. We demonstrate how our approach improves test power and effect estimation in simulation studies, and we describe its performance in a study of rheumatoid arthritis. Our work provides an easily implemented solution to a major practical challenge that arises in the use of EMR data, which can facilitate the use of EMR infrastructure for cost-effective, diverse genetic studies.
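A toy version of the key idea, regressing the algorithm's disease probability on genotype directly rather than thresholding it first, can be sketched as follows. All quantities here are simulated and the simple least-squares slope is a stand-in for the likelihood-based analysis the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
g = rng.binomial(2, 0.3, n)                 # SNP genotype (0/1/2), toy data
logit = -1.0 + 0.4 * g                      # assumed true disease model
prob_true = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, prob_true)              # true status: unobserved in EMR data
# NLP-algorithm output: a noisy probability that each patient is a case.
phat = np.clip(prob_true + rng.normal(0, 0.05, n), 0.01, 0.99)

def assoc(outcome, g):
    """Slope from a one-SNP least-squares fit (score-test flavour)."""
    gc = g - g.mean()
    return gc @ (outcome - outcome.mean()) / (gc @ gc)

beta_prob = assoc(phat, g)                          # use probability directly
beta_thresh = assoc((phat > 0.5).astype(float), g)  # thresholded case status
```

Thresholding discards the information in intermediate probabilities, which is the misclassification the abstract argues diminishes power and biases effect estimates.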
ABSTRACT: A uniquely powerful, but thus far underutilized, aspect of EMR research is the establishment and interrogation of very large patient registries across diverse health care institutions, with detailed data on patient histories, therapies, and clinical outcomes. The Harvard-Partners Healthcare Informatics for Integrating Biology and the Bedside (I2B2) Program has established an extensive framework to enable such studies, with critical features such as: extracting highly accurate patient data from the EMR through the use of structured data (e.g., billing codes) as well as narrative data (e.g., natural language processing of physician notes); maintaining patient confidentiality by rigorously de-identifying patient records; linking patient data to blood samples to enable genetic studies of disease susceptibility; and enabling participating institutions to maintain local control of confidential patient clinical data. I2B2 has successfully developed highly accurate algorithms for identifying cases of RA, T2D, and other diseases from EMRs. These algorithms have been validated at other US institutions and have been used to rapidly initiate genetic studies. In this talk, I'll describe various approaches to developing such algorithms and how the results from the algorithms can be used in follow-up discovery research studies.
ABSTRACT: In this talk, I propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: legal decisions on workers' compensation claims (to understand relevant case law) and an OSHA database of occupation-related accident reports (to search for high-risk circumstances). Our summarization framework, built on sparse classification methods, is a lightweight and flexible tool that offers a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., emotional disability, or chemical gas), we automatically label documents as being either on- or off-topic, and then use sparse classification methods to predict these labels from the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of predictive phrases is then harvested as the summary. Using a branch-and-bound approach, this method can be extended to allow for phrases of arbitrary length, which allows for potentially rich summarization. I further discuss how the focus on specific aspects of the corpus and the purpose of the summaries can inform choices of regularization parameters and constraints on the model. Overall, I argue that sparse methods have much to offer text analysis, and hope that this work opens the door for a new branch of research in this important field.
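The label-then-select pipeline can be sketched on a toy corpus: fit an L1-penalized regression of the on/off-topic labels on phrase counts and harvest the phrases with nonzero coefficients. This is a minimal stand-in (plain ISTA on a least-squares loss, with an invented four-phrase vocabulary), not the talk's actual classifier or data.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """L1-penalized least squares via ISTA (proximal gradient descent)."""
    n, p = X.shape
    lr = 1.0 / np.linalg.norm(X, 2) ** 2        # conservative step size
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = beta - lr * grad
        # Soft-thresholding: the proximal step for the L1 penalty.
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# Toy corpus: 6 documents x 4 phrases; phrase 0 marks the topic.
X = np.array([[3, 1, 0, 1],
              [2, 0, 1, 0],
              [4, 1, 1, 1],
              [0, 2, 1, 0],
              [0, 1, 2, 1],
              [0, 0, 1, 2]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)   # on-/off-topic labels
vocab = ["chemical gas", "report", "injury", "ladder"]
beta = lasso_ista(X, y, lam=0.05)
summary = [w for w, b in zip(vocab, beta) if abs(b) > 1e-3]
```

The sparsity penalty is what keeps the harvested summary short: only phrases that genuinely predict the topic labels survive the soft-thresholding.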
ABSTRACT: Numerous gene signatures of patient prognosis for late-stage, high-grade ovarian cancer have been published, but diverse data and methods have made these difficult to compare objectively. However, the corresponding large volume of publicly available expression data creates an opportunity to validate previous findings and to develop more robust signatures. We thus built a database of uniformly processed and curated public ovarian cancer microarray data and clinical annotations, and re-implemented and validated 14 prognostic signatures published between 2007 and 2012. In this lecture I will describe the methodology and tools we developed for evaluating published signatures in this context. I will also use this application as the springboard for a more general discussion on how to evaluate statistical learning methods based on a collection of related studies.
ABSTRACT: The MIT SuperCloud provides a novel solution to the problem of merging enterprise clouds, databases, big data, and supercomputing technology. More specifically, the MIT SuperCloud reverses the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capabilities on a supercomputer. The result is a system that can handle heterogeneous, massively parallel workloads while also providing high-performance elastic computing, virtualization, and databases. Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (d4m.mit.edu) – an associative array environment for MATLAB developed at MIT. Based on mathematical and statistical properties, the method leverages big data techniques and an Apache Accumulo database to speed up the computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This talk will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.
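The associative-array flavour of the D4M approach can be illustrated with a plain-Python stand-in (D4M itself is a MATLAB environment backed by Accumulo): index every k-mer of the reference sequences, then score a query by the number of k-mers it shares with each reference, with no alignment step.

```python
from collections import defaultdict

def kmer_index(seqs, k=10):
    """Associative array mapping each k-mer to the IDs of sequences containing it.

    A plain-Python stand-in for a D4M associative array; the toy k and
    sequences below are illustrative, not from the talk.
    """
    index = defaultdict(set)
    for sid, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(sid)
    return index

def match_counts(query, index, k=10):
    """Number of k-mers the query shares with each reference sequence."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for sid in index.get(query[i:i + k], ()):
            counts[sid] += 1
    return dict(counts)
```

Because both steps are sparse array operations, they map naturally onto a distributed tuple store such as Accumulo.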
Combined work of Jeremy Kepner, Stephanie Dodson, Darrell Ricke, Vijay Gadepally, Pete Michaleas, Albert Reuther, Mayank Varia, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Matthew Hubbell, Julie Mullen, Andrew Prout, and Antonio Rosa.
ABSTRACT: We study the problem of testing a global null hypothesis against sparse alternatives in Gaussian linear regression and binary regression frameworks. In particular, we study the minimax detection boundary, i.e., the necessary and sufficient conditions for the possibility of successful detection as both the sample size $n$ and the number of regressors $p$ tend to infinity.
In the first part of the talk we revisit some existing literature on minimax hypothesis testing for Gaussian linear regression against sparse alternatives and provide detection boundaries for a class of tests based on non-convex penalized likelihood (NPL) procedures satisfying a null consistency condition. We contrast our result with the simple ANOVA test and demonstrate a phase transition in asymptotic power between ANOVA and NPL tests depending on the sparsity of the alternative. Apart from minimax considerations, we also provide theoretical comparisons between the local powers of these tests.
In the second part of the talk we contrast the results in the Gaussian linear regression framework with the binary regression framework. Motivated by sequencing data examples, we investigate the complexity of the hypothesis testing problem when the design matrix also has specific sparsity structures. We observe a new phenomenon in the behavior of the detection boundary which does not occur in the case of Gaussian linear models. We derive the detection boundary as a function of two components: the minimal signal strength required for successful detection and the maximal sparsity of the design matrix. If the design matrix with binary entries is too sparse, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices which are not too sparse, we derive detection boundaries for both dense and sparse regimes of sparsity. For the dense regime, our results are rate optimal; for the sparse regime, we provide sharp constants. In the dense regime the generalized likelihood ratio test continues to be asymptotically powerful above the detection boundary. In the sparse regime, a version of the popular Higher Criticism test attains the detection boundary as a sharp upper bound.
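For reference, the Higher Criticism statistic mentioned above can be computed from per-variable p-values in a few lines. This is the textbook Donoho-Jin form; the version attaining the boundary in the talk may differ in its details.

```python
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    """Donoho-Jin Higher Criticism statistic from per-variable p-values.

    Scans the smallest p-values for departures from uniformity; large
    values suggest a sparse set of non-null signals.
    """
    p = np.clip(np.sort(np.asarray(pvals, float)), 1e-12, 1 - 1e-12)
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    keep = i <= max(1, int(alpha0 * n))   # usual restriction to the smallest half
    return hc[keep].max()
```

Under the global null the p-values are roughly uniform and the statistic stays moderate; a handful of very small p-values among many nulls sends it sharply upward, which is exactly the sparse-alternative regime.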
ABSTRACT: We show that central limit theorems hold for the probabilities of high-dimensional normalized means hitting high-dimensional rectangles. These results apply even when p >> n. These theorems provide Gaussian distributional approximations that are not pivotal, but they can be consistently estimated via Gaussian multiplier methods and the empirical bootstrap. These results are useful for building confidence bands and for multiple testing via step-down methods. Moreover, these results hold for approximately linear estimators. As an application, we show that these central limit theorems apply to normalized Z-estimators of a p > n-dimensional target parameter in a class of problems, with the estimating equations for each target parameter orthogonalized with respect to the nuisance functions being estimated via sparse methods.
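The Gaussian multiplier method referred to above admits a very compact sketch: perturb the centered data with i.i.d. standard normal multipliers and take the quantile of the resulting max statistic. This is an illustrative implementation under simple assumptions (i.i.d. rows, no studentization), not the paper's full procedure.

```python
import numpy as np

def multiplier_bootstrap_quantile(X, level=0.95, n_boot=2000, seed=0):
    """Gaussian-multiplier bootstrap critical value for max_j |sqrt(n) * mean_j|.

    Works coordinate-wise, so the dimension p may far exceed n; such
    critical values calibrate simultaneous confidence bands and
    step-down multiple testing.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        e = rng.standard_normal(n)            # Gaussian multipliers
        stats[b] = np.abs(e @ Xc).max() / np.sqrt(n)
    return np.quantile(stats, level)
```

One then rejects, or excludes from a confidence band, every coordinate whose normalized mean exceeds this single data-driven threshold.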
(This talk is based primarily on the joint work with Denis Chetverikov and Kengo Kato.)
ABSTRACT: We propose a unified Bayesian framework for detecting genetic variants associated with a disease while exploiting image-based features as an intermediate phenotype. Traditionally, imaging genetics methods comprise two separate steps. First, image features are selected based on their relevance to the disease phenotype. Second, a set of genetic variants are identified to explain the selected features. In contrast, our method performs these tasks simultaneously to ultimately assign probabilistic measures of relevance to both genetic and imaging markers. We derive an efficient approximate inference algorithm that handles high dimensionality of imaging genetic data. We evaluate the algorithm on synthetic data and show that it outperforms traditional models. We also illustrate the application of the method in a study of Alzheimer's disease.
Joint work with Kayhan Batmanghelich, Adrian Dalca, Mert Sabuncu.
ABSTRACT: The microbial communities, or microbiomes, residing in the mammalian gut are inherently dynamic, changing due to many factors including host maturation, alteration of the diet, and exchange of microbes with the environment or other hosts. Recent advances in high-throughput technologies in experimental biology, such as DNA sequencing, are enabling collection of unprecedented amounts of microbiome data. I will discuss both computational and experimental projects in my lab for analyzing and generating longitudinal microbiome datasets. In particular, I will describe our work on nonparametric Bayesian models for analyzing time-series of microbial sequencing count data, and a synthetic biology experimental system for high-throughput improvement of fitness of micro-organisms during the dynamic process of gut colonization.
ABSTRACT: Building an accurate prediction model for disease risk is useful for prevention and for improving screening strategies. With advances in genetic testing, personalized medicine is within reach and we seek to develop risk scores based on a subject's genetic profile as well as environmental and clinical risk factors. However, current methods for constructing genetic risk scores do not include important types of genetic effects known as gene-environment interactions. The effects of genetic markers on disease risk may be modified by environmental risk factors and, conversely, the effects of environmental exposures may depend on genotype. We propose to build a risk prediction model that incorporates effects of genetics and environment and allows for possible interactions between them. We extend the method of adaptive naive Bayes kernel machine regression, a powerful machine learning method, to model gene-environment interaction effects as well as possibly nonlinear effects within gene-sets. This flexible method can incorporate genome-wide data including rare variants and can improve prediction over a simple additive model when complex effects are present. We examine how the strength of interactions influences prediction ability with extensive simulation studies and illustrate the performance of our method in a large set of GWAS data to study the risk of colorectal cancer.
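A standard way to encode a gene-environment interaction in a kernel machine is the Hadamard (element-wise) product of the genetic and environmental kernels. The sketch below uses generic kernel ridge regression on simulated data to show the construction; it is not the adaptive naive Bayes kernel machine method itself, and all data and tuning values are made up.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(K, y, lam=1.0):
    """Dual coefficients for kernel ridge regression."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

# Simulated data: G = genotypes in a gene-set, E = environmental exposures.
rng = np.random.default_rng(0)
n = 100
G = rng.binomial(2, 0.3, size=(n, 20)).astype(float)
E = rng.normal(size=(n, 2))
y = G[:, 0] * E[:, 0] + rng.normal(scale=0.5, size=n)   # pure G x E signal

K_G, K_E = rbf_kernel(G, G, 0.05), rbf_kernel(E, E, 0.5)
K = K_G + K_E + K_G * K_E      # Hadamard product encodes the G x E interaction
alpha = kernel_ridge_fit(K, y, lam=1.0)
fitted = K @ alpha
```

The product kernel corresponds to products of features from the two spaces, which is why an additive model (K_G + K_E alone) misses this kind of signal while the combined kernel can capture it.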
Last Update: April 28 2014