Project 3

Large-scale genomic, proteomic and other "omic" research has become increasingly important and common for discovering disease genes and "omic" biomarkers for cancer prevention and intervention, and for studying gene-environment interactions in population-based studies. Such high-dimensional "omic" data present fundamental statistical and computational challenges in data analysis and result interpretation. Few statistical methods have been developed for the analysis of high-dimensional "omic" data in population-based studies. This methodological gap slows the use of genomic and proteomic data to advance population sciences.

The purpose of this proposal is to respond to this need by developing advanced statistical methods in conjunction with other advanced quantitative methods for analysis of high-dimensional genomic and proteomic data arising from population-based studies. The specific aims are to develop:

  1. variable selection methods for gene/biomarker discovery in the presence of a large number of SNPs or proteins and in studying gene-environment (space) interactions;
  2. false discovery rate (FDR) estimation for correlated SNP and expression data;
  3. a suite of tools using contemporary methods in signal processing based on local Fourier analysis to effectively preprocess biological data;
  4. supervised clustering methods for array CGH (aCGH) data to identify aCGH profiles related to disease outcomes;
  5. efficient, user-friendly statistical software that implements these methods, with the goal of disseminating it freely to health science researchers.

The proposed methods will be applied to data from the motivating Harvard/MGH lung cancer genetic susceptibility and progression studies, the Harvard/MGH lung cancer proteomic study, the DFCI lung cancer LBK mutation microarray study, the longitudinal HIV codon mutation study, and the Harvard/MGH brain tumor aCGH study. This project integrates closely with the spatial and surveillance projects 1 and 2 and the cores, as they share a common theme of analyzing high-dimensional observational study data, require advanced computing, and jointly provide tools for studying gene-space interactions.

Project examples

  1. The Effect of Correlation in False Discovery Rate Estimation. Armin Schwartzman and Xihong Lin. Current false discovery rate (FDR) methods mostly ignore the correlation structure in the data. The objective of this paper is to quantify the effect of correlation in FDR analysis. Specifically, we derive practical approximations for the mean, variance, distribution, and quantiles of the FDR estimator for arbitrarily correlated data... read the paper.
  2. Sparse Outcome Selection methods for multivariate responses data. Tamar Sofer, Arnab Maity, Brent Coull, Andrea Baccarelli, Joel Schwartz, and Xihong Lin. In environmental epigenetics, one is interested in studying the effects of environmental exposures on DNA methylations in a genetic pathway/network, which often consists of a large number of genes, many of which are likely to be unaffected by exposures. We develop three sparse outcome selection (SOS) methods for modeling the association between multivariate responses in a genetic pathway/network and exposures. Our approach aims at selecting a subset of genetic outcomes in the pathway whose linear combination yields the highest correlation with a linear combination of exposure variables. Specifically, we propose Sparse Outcome Selection Principal Components Analysis (SOS-PCA), a semi-supervised correlation method, together with Sparse Outcome Selection Canonical Correlation Analysis (SOS-CCA) and step-forward Canonical Correlation Analysis (step-CCA), two supervised correlation methods that introduce sparsity in the CCA weights. We investigate three criteria for selecting the tuning parameter: prediction correlation, the Bayesian Information Criterion, and the Correlation Information Criterion (CIC), which we developed. We compare the performance of these methods with existing methods via simulations, and apply the methods to the Normative Aging data to study the effects of exposure to airborne particulate matter on DNA methylations in the asthma pathway.
  3. Penalized multivariate regression applied to gene methylation. Tamar Sofer and Xihong Lin. We discuss multivariate regression under concave penalty functions that are known to have oracle properties in univariate regression. We show that these functions also have oracle properties when the responses are correlated, under reasonable conditions. We propose an efficient estimation procedure for the regression parameters that combines coordinate descent regression with the GLASSO algorithm for sparse inverse covariance estimation. The BIC is shown to be consistent for model selection under these settings. The procedure is studied by simulations and then applied to an original epigenetic data set from a GWAS study of gene methylation, where the effect of gene-specific methylation in the asthma pathway on clinical covariates is investigated.
  4. A nonparametric test for stationarity based on local Fourier analysis. Prabahan Basu, Daniel Rudoy, and Patrick J Wolfe. In this paper we propose a nonparametric hypothesis test for stationarity based on local Fourier analysis. We employ a test statistic that measures the variation of time-localized estimates of the power spectral density of an observed random process. For the case of a white Gaussian noise process, we characterize the asymptotic distribution of this statistic under the null hypothesis of stationarity, and use it to directly set test thresholds corresponding to constant false alarm rates. For other cases, we introduce a simple procedure to simulate from the null distribution of interest. After validating the procedure on synthetic examples, we demonstrate one potential use for the test as a method of obtaining a signal-adaptive means of local Fourier analysis and corresponding signal enhancement scheme.
  5. A Latent Class Model with Hidden Markov Dependence for Array CGH Data. Stacia M DeSantis, Andy Houseman, Brent Coull, David N Louis, Gayatry Mohapatra, and Rebecca Betensky. Array CGH is a high-throughput technique designed to detect genomic alterations linked to the development and progression of cancer. The technique yields fluorescence ratios that characterize DNA copy number change in tumor versus healthy cells. Classification of tumors based on aCGH profiles is of scientific interest, but the analysis of these data is complicated by the large number of highly correlated measures. In this article, we develop a supervised Bayesian latent class approach for classification that relies on a hidden Markov model to account for the dependence in the intensity ratios. Supervision means that classification is guided by a clinical endpoint... more.
  6. sLDA: Sparse Linear Discriminant Analysis. Michael C. Wu, Lingsong Zhang, Zhaoxi Wang, David C. Christiani, and Xihong Lin. An R function for testing for differential expression of a gene set/pathway based on the sparse linear discriminant analysis approach... more information.
  7. Estimating equation-based methods for variable selection and estimation. Lee Dicker and Xihong Lin. In many applied statistics problems, estimating equation-based methods provide a convenient, powerful, and often necessary alternative to likelihood-based methods. However, variable selection methods for estimating equations - an important topic in high-dimensional data analysis - have received relatively little attention in the literature. In this project, we develop regularized estimating equation methods for simultaneous variable selection and estimation. Theoretical and empirical results establish the validity and practicality of our methods.
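
The concern raised in project example 1, that correlation changes the behavior of the FDR estimator, can be illustrated with a small simulation. The sketch below is not the approximation derived in that paper. It assumes a standard plug-in estimator (expected null rejections divided by observed rejections), N(0, 1) null statistics, and a hypothetical equicorrelated model in which every pair of z-scores has correlation `rho`. The function names and all parameter values are illustrative choices.

```python
from math import erfc, sqrt

import numpy as np


def fdr_estimate(z, threshold):
    """Plug-in FDR estimate for the rule |z| >= threshold.

    Assumes every test statistic is N(0, 1) under the null, so the expected
    number of false discoveries is at most m * P(|Z| >= threshold).
    """
    m = z.size
    expected_false = m * erfc(threshold / sqrt(2.0))  # two-sided normal tail
    discoveries = max(int(np.sum(np.abs(z) >= threshold)), 1)  # avoid 0/0
    return min(expected_false / discoveries, 1.0)


def simulate(rho, m=500, n_signal=50, effect=3.0, threshold=2.5,
             reps=400, seed=0):
    """Return the mean and standard deviation of the FDR estimate.

    z-scores follow a hypothetical equicorrelated model: a shared factor
    plus independent noise, scaled so each z-score has unit variance and
    every pair has correlation rho.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(m)
    mu[:n_signal] = effect  # the first n_signal tests carry a true effect
    estimates = np.empty(reps)
    for r in range(reps):
        shared = rng.standard_normal()
        own = rng.standard_normal(m)
        z = mu + sqrt(rho) * shared + sqrt(1.0 - rho) * own
        estimates[r] = fdr_estimate(z, threshold)
    return estimates.mean(), estimates.std()


if __name__ == "__main__":
    for rho in (0.0, 0.5):
        mean, sd = simulate(rho)
        print(f"rho={rho:.1f}: mean FDR estimate={mean:.3f}, sd={sd:.3f}")
```

Under this toy model the spread of the estimate across replicates grows when `rho` increases, because the shared factor moves many null statistics across the threshold together. Real SNP or expression data have more complex dependence than a single shared factor.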


Copyright by Xihong Lin, 2011