Department of Biostatistics
Correlated and High-Dimensional Data Seminar
2012 - 2013
ABSTRACT: Covariance structure is of fundamental importance in many areas of statistical inference and a wide range of applications, including genomics, fMRI analysis, risk management, and web search problems. In the high-dimensional setting, where the dimension p can be much larger than the sample size n, classical methods and results based on fixed p and large n are no longer applicable. In this talk, I will discuss some recent results on optimal estimation of large covariance/precision matrices as well as sparse linear discriminant analysis with high-dimensional data. The results and technical analysis reveal new features that are quite different from the conventional low-dimensional problems.
ABSTRACT: The pace of life accelerates with city size, manifested in a per capita increase in almost all socioeconomic quantities, such as GDP, wages, violent crime, and contagious diseases. In this presentation I show that the structure and dynamics of the underlying network of human interactions provide a possible unifying mechanism for the origin of these pervasive regularities. By analyzing billions of anonymized call records from two European countries, we find that human interactions follow a superlinear, scale-invariant relationship with population size and systematically accelerate within the constraints of social grouping. These results provide a general microscopic basis for a deeper understanding of urban socioeconomic processes.
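A superlinear, scale-invariant relationship of the kind described above has the form Y = Y0 * N**beta with beta > 1, and can be fit by ordinary least squares on the log-log scale. A minimal sketch on noise-free synthetic city data (the function name and the exponent 1.15 are illustrative, not taken from the talk):

```python
import math

def fit_power_law(populations, quantities):
    """Fit Y = Y0 * N**beta by ordinary least squares on the log-log
    scale; beta > 1 indicates superlinear scaling."""
    xs = [math.log(n) for n in populations]
    ys = [math.log(q) for q in quantities]
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    y0 = math.exp(ybar - beta * xbar)
    return y0, beta

# Noise-free synthetic cities with an assumed exponent of 1.15
pops = [1e4, 1e5, 1e6, 1e7]
gdp = [2.0 * n ** 1.15 for n in pops]
y0, beta = fit_power_law(pops, gdp)
```

On real data the fit is noisy and the estimated exponent, not an assumption, is the finding of interest.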
ABSTRACT: The human body is associated with ten times more microbial cells than human cells. The totality of these microbes, their genetic elements, and their environmental interactions constitutes the human microbiome, which has been shown to be associated with human health and disease. Next-generation sequencing technologies have enabled researchers to study the microbiome in a culture-independent way using direct DNA sequencing. One strategy sequences the bacterial 16S rRNA gene to study the bacterial component of the microbiome. The sequenced 16S tags are first clustered into Operational Taxonomic Units (OTUs), and downstream statistical analyses are performed on the OTU data. However, analysis of such OTU data raises several important statistical challenges, including modeling high-dimensional, over-dispersed compositional data and taking into account the phylogenetic relationships among OTUs. In this presentation, I will introduce two variable selection methods developed specifically for 16S data analysis, motivated by the problem of identifying gut microbiome-associated dietary nutrients and the OTUs they affect. To correlate two high-dimensional data sets such as the nutrient and OTU data, sparse canonical correlation analysis (SCCA) is usually applied as an exploratory method. To utilize the OTU phylogenetic structure in the SCCA setting, I developed a structure-constrained SCCA (SSCCA) method, which provides the flexibility to incorporate structure information for better OTU selection. In the second part of the talk, I will present a sparse Dirichlet-multinomial regression (SDMR) method, a model-based variable selection method that accounts for the overdispersion of OTU counts. SDMR uses a sparse group l1 penalty function to facilitate simultaneous selection of covariates and OTUs.
I will illustrate these methods using simulations and analysis of a human gut microbiome study to associate dietary nutrient intakes with gut microbiome composition. Our results demonstrate that dietary nutrients have significant effects on the human gut microbiome composition and the identified associations are consistent with previous findings.
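The Dirichlet-multinomial model underlying SDMR captures overdispersion by letting the multinomial probabilities themselves vary across samples, which inflates the count variances by a common factor. A minimal sketch of this variance inflation (a standard property of the model, not the SDMR method itself; the parameter values are illustrative):

```python
def dm_variance(n, alpha):
    """Per-category count variance under a Dirichlet-multinomial versus a
    plain multinomial with the same mean proportions. alpha holds the
    Dirichlet concentration parameters; a smaller sum(alpha) gives more
    overdispersion. Returns (multinomial_var, dm_var) lists."""
    a0 = sum(alpha)
    p = [a / a0 for a in alpha]
    inflate = (n + a0) / (1 + a0)        # overdispersion factor, >= 1
    mult = [n * pj * (1 - pj) for pj in p]
    dm = [v * inflate for v in mult]
    return mult, dm

# 1000 reads over three OTUs with concentrations (2, 3, 5)
mult, dm = dm_variance(1000, [2.0, 3.0, 5.0])
```

With a total concentration of 10 and 1000 reads, every variance is inflated by (1000 + 10)/(1 + 10), roughly 92-fold relative to the multinomial.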
ABSTRACT: Recent developments in RNA-sequencing (RNA-seq) technology have led to a rapid increase in gene expression data in the form of counts. RNA-seq can be used for a variety of applications; however, identifying differential expression (DE) remains a key task in functional genomics. A number of statistical methods have been developed for DE detection in RNA-seq data. One common feature of several leading methods is the use of the negative binomial (Gamma-Poisson mixture) model: the unobserved gene expression is modeled by a gamma random variable and, given the expression, the sequencing read counts are modeled as Poisson. The methods differ in how the variance, or dispersion, of the gamma distribution is modeled and estimated. We evaluate several large public RNA-seq datasets and find that the estimated dispersion in existing methods does not adequately capture the heterogeneity of biological variance among samples. We present a new empirical Bayes shrinkage estimate of the dispersion parameters and demonstrate improved DE detection.
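The Gamma-Poisson mixture described above integrates to the closed-form negative binomial distribution, whose variance exceeds the Poisson variance by a term quadratic in the mean: Var = mu + phi * mu**2, where phi is the dispersion. A small sketch verifying this mean-variance relation numerically (the parameter values are illustrative):

```python
import math

def nb_pmf(k, mu, phi):
    """Negative binomial pmf (the Gamma-Poisson mixture) parameterized by
    mean mu and dispersion phi, so that Var = mu + phi * mu**2."""
    r = 1.0 / phi                  # gamma shape of the mixing distribution
    p = r / (r + mu)               # implied success probability
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log(1 - p))

mu, phi = 10.0, 0.2
probs = [nb_pmf(k, mu, phi) for k in range(1000)]
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
# Expect Var = 10 + 0.2 * 100 = 30, i.e., overdispersed relative to Poisson
```

Estimating phi per gene, and how to stabilize those estimates across genes, is exactly where the methods discussed in the talk differ.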
ABSTRACT: It is essential in many application domains to have transparency in predictive modeling. Domain experts tend not to prefer "black box" predictive models. They would like to understand how predictions are made and, ideally, prefer models that emulate the way a human expert might make a decision, using a few important variables and offering a clear, convincing reason for each prediction.
I will discuss recent work on interpretable predictive modeling with decision lists. I will describe several approaches, including:
- an algorithm where not only the predictions but the algorithm itself is interpretable to a human
- an algorithm based on Bayesian analysis
- an algorithm based on mixed-integer linear optimization
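A decision list of the kind described above is simply an ordered sequence of if-then rules with a default prediction, where the first rule that fires also serves as the stated reason. A minimal sketch (the rules and data here are hypothetical, not from the talk):

```python
def make_decision_list(rules, default):
    """Build a decision list from an ordered sequence of
    (condition, prediction, reason) rules. The first rule whose condition
    fires determines the prediction; an explicit default covers the rest.
    Each prediction is returned with the single rule that produced it,
    which is what makes the model transparent."""
    def predict(x):
        for condition, label, reason in rules:
            if condition(x):
                return label, reason
        return default, "default rule"
    return predict

# Hypothetical toy rules in a medical-style setting
rules = [
    (lambda x: x["age"] > 60 and x["bp"] == "high", 1, "age>60 and high bp"),
    (lambda x: x["smoker"], 1, "smoker"),
]
predict = make_decision_list(rules, default=0)
label, reason = predict({"age": 70, "bp": "high", "smoker": False})
```

The approaches listed above differ in how such rule lists are learned from data, not in this prediction-time structure.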
ABSTRACT: This talk is concerned with screening features in ultrahigh-dimensional data analysis, which has become increasingly important in diverse scientific fields. I will first introduce a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve on the SIS. Fan and Lv (2008) established the sure screening property for the SIS under linear models, whereas the sure screening property of the DC-SIS holds under more general settings, including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh-dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS and conduct simulations to examine its finite-sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example. If time permits, I will introduce a newly developed model-free screening procedure for categorical high-dimensional data.
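The distance correlation at the heart of DC-SIS is computed from doubly centered pairwise distance matrices; it is zero (in the population) exactly when the two variables are independent, so it can flag nonlinear dependence that the Pearson correlation misses. A minimal sketch of the standard sample statistic of Szekely, Rizzo and Bakirov for scalar variables (not the full screening procedure, and the example data are illustrative):

```python
def distance_correlation(x, y):
    """Sample distance correlation of two scalar samples of equal length."""
    n = len(x)
    def centered(z):
        d = [[abs(z[i] - z[j]) for j in range(n)] for i in range(n)]
        row = [sum(r) / n for r in d]
        col = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
        grand = sum(row) / n
        return [[d[i][j] - row[i] - col[j] + grand for j in range(n)]
                for i in range(n)]
    A, B = centered(x), centered(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvarx = sum(a * a for r in A for a in r) / n**2
    dvary = sum(b * b for r in B for b in r) / n**2
    return (dcov2 / (dvarx * dvary) ** 0.5) ** 0.5

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# y = x**2 on a symmetric grid: Pearson is ~0, distance correlation is not
x = [i / 10 - 1 for i in range(21)]
y = [v * v for v in x]
r = pearson(x, y)
dcor = distance_correlation(x, y)
```

DC-SIS ranks predictors by their distance correlation with the response and keeps the top-ranked ones, which is why it needs no model specification.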
ABSTRACT: Estimating rewiring gene regulatory networks over developing biological systems, such as proliferating cells, growing embryos, and differentiating cell lineages, is central to a deeper understanding of how cells evolve during development. However, one challenge in estimating such evolving networks is that their host cells are not only continuously evolving but also branching over time. For example, stem cells evolve into two more specialized daughter cells at each division, forming a tree of networks. Another example arises in a laboratory setting: a biologist may apply several different drugs to a malignant cancer cell to analyze the changes each drug produces in the treated cells. Each treated cell is not directly related to another treated cell, but rather to the malignant cancer cell from which it was derived.
We propose a novel statistical framework, which builds on L1 plus total variation penalized graphical logistic regression, to effectively estimate multiple evolving gene networks corresponding to cell types related by either a linear sequence or a tree genealogy, based on only a few samples from each cell type. Our method takes advantage of the similarity between related networks along the biological lineage while at the same time exposing sharp differences between the networks. We demonstrate via simulation that our method performs significantly better than existing methods, and it enjoys strong statistical guarantees that heuristic-based approaches lack. We explore an application to breast cancer analysis. Based on only a few microarray measurements, our algorithm is able to produce biologically valid results that provide insight into the progression and reversion of breast cancer. Finally, I will discuss a few additional complex scenarios for network estimation, in which graphs are directional, have missing values, or have multiple attributes, along with ideas for consistent structure estimation.
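The L1 plus total variation penalty structure can be made concrete by writing out an objective over a lineage tree: a logistic loss per cell type, an L1 term enforcing sparsity within each network, and a total variation term penalizing differences between parent and child networks. The sketch below only evaluates such an objective on toy inputs; it is not the authors' estimator or solver, and all names and values are illustrative:

```python
import math

def penalized_objective(theta_by_cell, edges, data_by_cell, lam1, lam_tv):
    """Evaluate (not minimize) an L1 + total-variation penalized logistic
    loss over a tree of cell types. theta_by_cell maps each cell type to
    a dict of edge weights for one neighborhood regression; edges lists
    (parent, child) pairs of the lineage tree; data_by_cell maps each
    cell type to (feature dict, binary label) samples."""
    def logistic_loss(theta, samples):
        loss = 0.0
        for x, label in samples:
            z = sum(theta.get(j, 0.0) * xj for j, xj in x.items())
            loss += math.log1p(math.exp(-z if label == 1 else z))
        return loss

    obj = sum(logistic_loss(theta_by_cell[c], data_by_cell[c])
              for c in theta_by_cell)
    # L1 penalty: sparsity within each network
    obj += lam1 * sum(abs(w) for th in theta_by_cell.values()
                      for w in th.values())
    # Total-variation penalty: parent and child networks differ sparsely
    for parent, child in edges:
        keys = set(theta_by_cell[parent]) | set(theta_by_cell[child])
        obj += lam_tv * sum(abs(theta_by_cell[parent].get(k, 0.0)
                                - theta_by_cell[child].get(k, 0.0))
                            for k in keys)
    return obj

theta = {"stem": {"g1": 1.0}, "daughter": {"g1": 0.5}}
obj = penalized_objective(theta, [("stem", "daughter")],
                          {"stem": [({"g1": 1.0}, 1)],
                           "daughter": [({"g1": 1.0}, 0)]},
                          lam1=0.1, lam_tv=0.2)
```

Because both penalties are convex, the full objective can be minimized with standard proximal methods, which is what makes statistical guarantees tractable.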
ABSTRACT: How do cells make decisions? Despite the large amount of microarray and RNA-seq data publicly available, this question remains poorly understood. The major problem is that traditional genomic approaches can only measure the average behavior over a large population of cells, whereas cellular responses are highly heterogeneous even within the same cell type. To overcome this limitation, new technologies are being developed at a rapid pace to profile gene expression at single-cell resolution. Such technologies will have a profound impact on how we understand biology. One accompanying computational challenge is how to distinguish deterministic signal from stochastic variation in single-cell data. This cannot be addressed by traditional methods, such as PCA and clustering, which are useful for analyzing average gene expression profiles. As an initial attempt to overcome this challenge, we have developed a new method, called SCUBA, to analyze single-cell gene expression data by combining dynamic clustering and bifurcation analysis. We have applied this method to a published dataset on early mouse embryo development, providing a molecular view of Waddington's epigenetic landscape for early development. Furthermore, in collaboration with experimental biologists, we have also analyzed the cell hierarchy within the mouse blood system. Finally, I will discuss some remaining challenges for single-cell data analysis.
ABSTRACT: The last decade has witnessed a surging scientific interest in, and a growing public awareness of, the connectedness of modern society. Physicians too are embedded in informal networks that result from their sharing of patients, information, and behaviors. I will talk about our work on analyzing these physician-physician networks constructed from Medicare data. I will also describe how a network science based approach can be used to identify naturally occurring groups of physicians that might be best suited to becoming accountable care organizations.
ABSTRACT: The presented work is motivated by the need to reliably estimate and predict the survival rate of an individual whose gene expression information is available. In regression analyses of such high-throughput genomic data, the key analytical challenge is the high-dimensional covariate space, whose dimension far exceeds the number of subjects. In addition, it is increasingly important for statistical methodologies to address the group variable selection problem, because genes are naturally grouped according to underlying biological processes. To this end, our proposed framework utilizes special shrinkage priors corresponding to the elastic net, fused lasso, and group lasso, which are widely used in the literature to incorporate the cluster structure of the covariates into regression models. The time-to-event outcomes are related to the covariates using a Bayesian Cox proportional hazards model in which the cumulative baseline hazard function is modeled through a discrete gamma process prior. The tuning parameters that control the degrees of shrinkage are jointly estimated and updated with the other model parameters via Markov chain Monte Carlo (MCMC) sampling. We have developed an efficient MCMC algorithm to fit our models and provided it in an easy-to-use R package. The variable selection and prediction performance of the proposed methods are assessed through simulation studies assuming several different underlying dependence structures of the covariates, and through applications to three real-life data sets.
ABSTRACT: Two related classification methods, ROAD and FANS, are introduced for high-dimensional settings, where the number of features p is comparable to or larger than the sample size n. Both methods exploit correlation among features to boost classification performance. ROAD is based on a two-class Gaussian model with a common covariance matrix; it is a regularized classifier that directly targets the classification error. We motivate FANS by generalizing the naive Bayes model, modeling the log ratio of the class conditional densities as a linear combination of the log ratios of the marginal class conditional densities. Theoretical properties and efficient algorithms are developed for both methods, and their empirical performance is demonstrated through simulated and real data analyses.
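The FANS construction above can be sketched as a feature transform: each coordinate is replaced by the log ratio of its estimated marginal class-conditional densities, and a linear combination of these transformed features is then fit. The sketch below uses Gaussian marginal fits and unit weights, i.e., the naive Bayes special case that FANS generalizes; the data and parameters are illustrative, not from the talk:

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return (-0.5 * math.log(2 * math.pi * sd * sd)
            - (x - mean) ** 2 / (2 * sd * sd))

def log_ratio_features(x, params0, params1):
    """FANS-style transformed features: for each coordinate, the log ratio
    of the two estimated marginal class-conditional densities. Summing
    them with unit weights recovers the naive Bayes score; FANS instead
    fits a penalized linear combination of these features."""
    return [normal_logpdf(xj, m1, s1) - normal_logpdf(xj, m0, s0)
            for xj, (m0, s0), (m1, s1) in zip(x, params0, params1)]

# Hypothetical marginal fits: class 0 ~ N(0,1), class 1 ~ N(2,1) per feature
params0 = [(0.0, 1.0), (0.0, 1.0)]
params1 = [(2.0, 1.0), (2.0, 1.0)]
feats = log_ratio_features([2.0, 2.0], params0, params1)
score = sum(feats)   # naive Bayes score: positive favors class 1
```

Fitting the combination weights, rather than fixing them at one, is what lets FANS exploit correlation among features.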
Last Update: May 2, 2013