Department of Biostatistics Correlated and High-Dimensional Data Seminar 2011 - 2012
ABSTRACT: Single variant or single gene analyses generally account for only a small proportion of the phenotypic variation in complex traits. Alternatively, gene set or pathway association analyses are playing an increasingly important role in uncovering genetic architectures of complex traits through the identification of systematic genetic interactions. Two dominant paradigms for gene set analyses are association analyses based on SNP genotypes and those based on gene expression profiles. However, gene-disease association can manifest in many ways, such as alterations of gene expression, genotype and copy number; thus an integrative approach combining multiple forms of evidence can more accurately and comprehensively capture pathway associations. We have developed a single statistical framework, Gene Set Association Analysis (GSAA), that simultaneously measures genome-wide patterns of genetic variation and gene expression variation to identify sets of genes enriched for differential expression and/or trait-associated genetic markers. Simulation studies illustrate that joint analyses of genomic data increase the power to detect real associations when compared to gene set methods that use only one genomic data type. The analysis of two human diseases, glioblastoma and Crohn’s disease, detected abnormalities in previously identified disease-associated pathways, such as pathways related to PI3K signaling, DNA damage response, and the activation of NF-κB. In addition, GSAA predicted novel pathway associations, for example differential genetic and expression characteristics in genes from the ABC transporter family in glioblastoma and from the HLA system in Crohn’s disease. These results demonstrate that GSAA can help uncover biological pathways underlying human diseases and complex traits.
Lu Tian, Ph.D.
Assistant Professor, Health Research & Policy - Biostatistics, Stanford University
"AUC based Biomarker Ensemble with an Application on Gene Scores Predicting Low Bone Mineral Density"
ABSTRACT: Motivation: The area under the receiver operating characteristic (ROC) curve (AUC), long regarded as a "golden" measure for the predictiveness of a continuous score, has propelled the need to develop AUC-based predictors. However, AUC-based ensemble methods are rather scarce, largely because the associated objective function is neither continuous nor concave. Indeed, there is no reliable numerical algorithm for identifying the optimal combination of a set of biomarkers to maximize the AUC, especially when the number of biomarkers is large.
Results: We have proposed novel AUC-based statistical ensemble methods for combining multiple biomarkers to differentiate a binary response of interest. Specifically, we propose to replace the non-continuous and non-convex AUC objective function by a convex surrogate loss function, whose minimizer can be efficiently identified. With the established framework, the lasso and other regularization techniques enable feature selection. Extensive simulations have demonstrated the superiority of the new methods over the existing methods. The proposal has been applied to a gene expression data set to construct gene expression scores that differentiate elderly women with low bone mineral density (BMD) from those with normal BMD. The AUCs of the resulting scores in the independent test data set have been satisfactory.
Conclusion: Aiming to maximize the AUC directly, the proposed AUC-based ensemble method provides an efficient means of generating a stable combination of multiple biomarkers, which is especially useful in high-dimensional settings.
Research by XG. Zhao, W. Dai, Y. Li and L. Tian.
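The idea of replacing the discontinuous AUC objective with a convex surrogate can be sketched as follows. This is a minimal illustration and not the authors' method: it assumes a pairwise logistic surrogate on case-minus-control differences and a lasso penalty handled by soft-thresholding. The function names, the toy data and all tuning values (`lam`, `lr`, `n_iter`) are hypothetical.

```python
import math
import random

def empirical_auc(scores, labels):
    """Fraction of (case, control) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_auc_surrogate(X, y, lam=0.01, lr=0.1, n_iter=200):
    """Minimise mean log(1 + exp(-w.(x_case - x_control))) + lam * ||w||_1.

    Assumed surrogate (pairwise logistic loss), fitted by proximal gradient
    descent; the soft-thresholding step is what the lasso penalty contributes.
    """
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    diffs = [[a - b for a, b in zip(p, n)] for p in pos for n in neg]
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for z in diffs:
            margin = sum(wj * zj for wj, zj in zip(w, z))
            # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); clip to avoid overflow
            g = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for j in range(d):
                grad[j] += g * z[j] / len(diffs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        # Soft-thresholding: proximal step for the L1 penalty
        w = [math.copysign(max(abs(wj) - lr * lam, 0.0), wj) for wj in w]
    return w

def make_toy_data(rng, n_per_class=30):
    """Hypothetical data: two informative biomarkers and one pure-noise marker."""
    shifts = (1.5, 1.0, 0.0)
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * s, 1.0) for s in shifts])
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data(random.Random(0))
    w = fit_auc_surrogate(X, y)
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
    print("weights:", w)
    print("training AUC:", empirical_auc(scores, y))
```

Because the surrogate is convex in the weights, plain gradient steps suffice here; a larger `lam` would push the weight of the noise marker toward exactly zero.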
ABSTRACT: In this work, a topological multiple testing approach to peak detection is proposed for the problem of detecting transcription factor binding sites in ChIP-Seq data. In the proposed algorithm, kernel smoothing is applied to the tag counts over the genome, and the presence of a peak is tested at each observed local maximum, followed by multiple testing correction at the desired false discovery rate level. Valid p-values for candidate peaks are computed via Monte Carlo simulations of smoothed Poisson sequences, whose background Poisson rates are obtained via linear regression from a Control sample and the local GC content. Application of the proposed method to two different data sets shows that it resolves nearby binding sites that other methods, such as MACS and cisGenome, do not.
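The pipeline described above (smooth, test at local maxima, correct for multiplicity) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the background Poisson rate is taken as a known constant rather than estimated by regression on a Control sample and GC content, the kernel is Gaussian, and the false discovery rate correction is assumed to be Benjamini-Hochberg. All function names and parameter values are hypothetical.

```python
import bisect
import math
import random

def poisson(rng, lam):
    """Draw one Poisson variate (Knuth's method; adequate for small rates)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def smooth(counts, bandwidth=2.0):
    """Gaussian kernel smoothing, truncated at three bandwidths."""
    half = int(3 * bandwidth)
    kernel = [math.exp(-0.5 * (i / bandwidth) ** 2) for i in range(-half, half + 1)]
    total = sum(kernel)
    n = len(counts)
    out = []
    for t in range(n):
        s = 0.0
        for j, k in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < n:
                s += k * counts[idx]
        out.append(s / total)
    return out

def local_maxima(y):
    """Interior positions that are higher than both neighbours."""
    return [t for t in range(1, len(y) - 1) if y[t - 1] < y[t] >= y[t + 1]]

def detect_peaks(counts, rate, q=0.05, n_sim=200, seed=0, bandwidth=2.0):
    """Return positions of local maxima declared significant at FDR level q."""
    rng = random.Random(seed)
    # Monte Carlo null: heights of local maxima of smoothed pure-background data
    null_heights = []
    for _ in range(n_sim):
        sim = smooth([poisson(rng, rate) for _ in counts], bandwidth)
        null_heights.extend(sim[t] for t in local_maxima(sim))
    null_heights.sort()
    y = smooth(counts, bandwidth)
    peaks = local_maxima(y)
    m0 = len(null_heights)
    # p-value: share of null maxima at least as high as the observed one
    pvals = [(m0 - bisect.bisect_left(null_heights, y[t]) + 1) / (m0 + 1)
             for t in peaks]
    # Benjamini-Hochberg step-up procedure
    order = sorted(range(len(peaks)), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / len(peaks):
            k_max = rank
    return sorted(peaks[i] for i in order[:k_max])

if __name__ == "__main__":
    rng = random.Random(1)
    counts = [poisson(rng, 1.0) for _ in range(300)]
    counts[100] += 30  # planted binding site
    counts[200] += 30  # planted binding site
    print(detect_peaks(counts, rate=1.0))
```

Testing only at local maxima keeps the number of hypotheses close to the number of candidate peaks rather than the number of genomic positions, which is what makes the correction step manageable.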
ABSTRACT: In this talk, we consider the problem of learning a target function that belongs to the linear span of a large number of reproducing kernel Hilbert spaces. Such a problem arises naturally in many practical situations, with ANOVA, the additive model and multiple kernel learning as the most well-known and important examples. A couple of regularization techniques that exploit the sparse nature of the problem will be investigated. The optimality and adaptivity of these methods will be assessed through oracle-type inequalities providing bounds on the excess risk of the resulting prediction rule.
ABSTRACT: A gene set test is a differential expression analysis in which a p-value is assigned to a set of genes as a unit. Gene set tests are valuable for increasing statistical power, organizing and interpreting results, and for relating expression patterns across different experiments.
Breast cancer is a heterogeneous disease with at least 6 major subtypes. We have microarray data for 4 cell subpopulations in the normal human mammary gland: stem cell enriched cells, luminal progenitor, mature luminal and stromal cells. I will demonstrate how we identify the cell of origin for breast cancer subtypes using these data and publicly available data.
A variety of methods are used, including defining gene signature scores for each sample and some novel gene set testing methods. The rotation gene set test, ROAST, overcomes the limitations of sample size and inter-gene correlation. Two other gene set tests, ROMER and CAMERA, were developed to test different statistical hypotheses.
ABSTRACT: We propose an estimation method for the covariance structure of a genetic pathway, while modeling the effect of covariates. We assume a linear mixed model for the pathway as a response to covariates. This model corresponds to a simple yet reasonable model of the association between the genes in the pathway. The method penalizes coefficients using an oracle penalty function. We show that, using an iterated least squares procedure, we are able to recover the model parameters with high accuracy. We propose a powerful variance component score test for the effect of covariates. The method is applied to estimate the covariance matrix of methylation scores of genes in the NFkB pathway, an inflammation pathway, as a function of CRP, an inflammation marker.
ABSTRACT: Tumor cells frequently harbor an abnormal genome with gains or losses of large chromosome regions or entire chromosomes that affect the expression level of hundreds of genes and correlate with patient prognosis. How these changes contribute to and predict tumor initiation, progression and patient outcomes, together with other genomic changes such as mutations and methylation, is among the key questions in cancer genomics studies. The latest technologies generate genome-wide data of multiple types that capture snapshots of cancer genomes. At very high dimension, such data challenge existing statistical and computational methods and motivate novel ones to analyze them in the context of incompletely known but dynamic biological networks, with the aim of answering biological and clinical questions that can benefit patients. I will present our efforts in this big trend of integrating multiple genomic data types, such as the use of TF-miRNA feed-forward loops (FFLs) to explain expression variation and infer TF-miRNA regulatory pathways that play roles in tumor biology. I will also discuss challenging questions down the road.
ABSTRACT: Feature extraction for signal reconstruction from observed noisy samples is a common important problem in statistics and engineering. Emerging advances in genomic sequencing have prompted the development of new computational methods for studying the genomic sources of human diseases.
We present a general statistical approach to the problem of detecting non-zero regions in a long data sequence. Specifically, this method is used to solve the problem of finding non-zero genomic regions with significant copy number alterations (CNAs) between two cell populations, such as normal and cancer cells. Mapping such alterations is of interest as they may be potentially related to higher susceptibility to diseases. The proposed technique employs multi-scale kernel smoothing in conjunction with statistical multiple testing to detect the non-zero regions while controlling their false discovery rate and maximizing their signal to noise ratio (SNR) via matched filtering. The proposed method is applied to synthetic data and a real data set of lung cancer DNA copy numbers.
ABSTRACT: Nonparametric Bayes modeling has wide applications in high-dimensional genetic data analysis. In this presentation, I discuss a couple of our recent works using nonparametric Bayes modeling. The first concerns the joint analysis of family-based and unrelated data to enlarge the sample size and thereby secure adequate power for studies. Family-based data have the advantage of increasing the chance of detecting true risk variants but are limited by the difficulty of recruiting enough people. This problem can become even more serious for emerging next-generation sequencing data because of the increased cost. We first propose a novel Bayesian model for the unified analysis of family-based and population-based unrelated data. We adopt a matched case-control design within a conditional likelihood framework to account for the ascertainment effect in family data and the population stratification inherent in unrelated data. The nonparametric Bayesian model does not rely on any known information about the parameters, but automatically adapts to the data to handle the family/stratum-specific parameters. Our model can flexibly incorporate correlation and variance component parameters into the analysis of any family structure. Our studies indicate better estimates than those from family-based data alone, and much greater efficiency than the conditional likelihood model. (This is joint work with Andrew Allen and Yi-Ju Li at Duke University.)
The second concerns Bayesian inference for genomic data integration to predict gene-gene/protein-protein interactions. Gene-gene/protein-protein interactions (PPIs) are essential to most fundamental cellular processes. There has been increasing interest in reconstructing biological networks. However, several critical difficulties exist in obtaining reliable predictions. For example, false positive rates in predicting PPIs can exceed 80%. No computational method has been designed to effectively reduce false positives. We propose a novel Bayesian integration method to lower the misclassification rate (both false positives and false negatives) by automatically up-weighting the data sources that are most informative, while down-weighting less informative and biased sources. Such reliable predictions may provide a solid platform for other studies, such as protein function prediction, the roles of PPIs in disease susceptibility, and the integration of multiple biological sources for better prediction of disease-risk variants. (This is joint work with David Dunson at Duke University.)
ABSTRACT: In this talk I will present two of my papers on network structure estimation from replicated data. In the first paper, we propose a graphical model for ordinal variables, a type of categorical data with a natural order. The proposed model assumes that the ordinal variables are generated by discretizing the marginal distributions of a latent multivariate Gaussian distribution, and that the relationships among these ordinal variables are described by the underlying Gaussian graphical model. To estimate the concentration matrix of the underlying Gaussian graphical model, we develop an EM-like algorithm. Simulation studies indicate that the developed algorithm works well. We also apply the new model to a movie rating example to identify dependence structures among movies. In the second paper, we explore effective approaches to estimating the structure of networks with modular structure, i.e., networks consisting of several dense subgraphs (modules). In this paper, we propose a modularized Gaussian graphical model, which incorporates the unknown modular structure present in many real-world networks. To this end, we develop two algorithms that identify the modules and recover the graph structure simultaneously. Numerical results from both synthetic and real data demonstrate that the estimation of the graphs can be improved by utilizing prior modular information.
ABSTRACT: Physiological signals such as ECG consist of mixtures of patterns and phenomena occurring at different times. Traditional signal processing and analysis methods are optimized to handle signals that include a single class of patterns, such as Fourier Representation for pure harmonics or Wavelets Representation for piece-wise smooth functions. In simplified and unrealistic scenarios, simple operations such as thresholding or filtering in the appropriate space can be very effective for separating signal from noise. However, using a single representation method usually yields mediocre results on real-life signals. Matching Pursuit and Basis Pursuit use the idea of merging several different representation methods to create a so-called over-complete dictionary. These methods can decompose a signal into relatively few meaningful components by searching for the sparsest possible representation, given the right choice of dictionary.
We show that such tools can provide much better insight into a heart-beat signal's basic components and their relation to different physiological states in general, and pain in particular, than traditional analysis methods based on a single dictionary. We have analyzed the inter-beat time series (known as R-R signals) measured during a controlled pain-related experiment. A simultaneous Wavelet and Fourier analysis was applied using both Orthogonal Matching Pursuit and Basis Pursuit to project all wavelet-like features into the wavelet domain and Fourier-like features into the frequency domain. The analysis clearly reveals pain-related events, and our proposed method outperforms traditional spectral analysis methods with respect to both sensitivity and time delay.
Joint work with Tobias Moeller and Shai Tejman-Yarden (Medical school, UC San Diego), and Michael Saunders (ICME, Stanford University)
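The decomposition over an over-complete dictionary can be illustrated with plain Matching Pursuit. This is a minimal sketch rather than the analysis used in the talk: it assumes a toy dictionary of spikes plus discrete cosines in place of the wavelet and Fourier dictionaries, and it uses the basic greedy algorithm rather than Orthogonal Matching Pursuit or Basis Pursuit. The function names and the test signal are hypothetical.

```python
import math

def build_dictionary(n):
    """Over-complete dictionary of 2n unit-norm atoms: n spikes, n cosines."""
    atoms = []
    for k in range(n):
        atoms.append(("spike", k, [1.0 if t == k else 0.0 for t in range(n)]))
    for k in range(n):
        vec = [math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        atoms.append(("cosine", k, [v / norm for v in vec]))
    return atoms

def matching_pursuit(signal, atoms, n_atoms=5, tol=1e-8):
    """Greedy decomposition: repeatedly remove the best-correlated atom.

    Returns the chosen (family, index, coefficient) triples and the residual.
    """
    residual = list(signal)
    chosen = []
    for _ in range(n_atoms):
        best_coef, best = 0.0, None
        for atom in atoms:
            coef = sum(r * a for r, a in zip(residual, atom[2]))
            if abs(coef) > abs(best_coef):
                best_coef, best = coef, atom
        if best is None or abs(best_coef) < tol:
            break
        residual = [r - best_coef * a for r, a in zip(residual, best[2])]
        chosen.append((best[0], best[1], best_coef))
    return chosen, residual

if __name__ == "__main__":
    n = 64
    atoms = build_dictionary(n)
    # Toy signal: one smooth oscillation plus one isolated transient
    signal = [3.0 * c + 2.0 * s for c, s in zip(atoms[n + 5][2], atoms[20][2])]
    chosen, residual = matching_pursuit(signal, atoms)
    for family, index, coef in chosen:
        print(family, index, round(coef, 3))
```

A signal built from one oscillation and one transient needs many coefficients in either basis alone, while the combined dictionary captures most of it with two atoms, which is the motivation for merging representations.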
ABSTRACT: The study of networks has become a substantial interdisciplinary endeavor that encompasses myriad disciplines in the natural, social, and information sciences. Here we introduce a framework for constructing taxonomies of networks based on their structural similarities. These networks can arise from any of numerous sources: they can be empirical or synthetic, they can arise from multiple realizations of a single process (either empirical or synthetic), they can represent entirely different systems in different disciplines, etc. Because mesoscopic properties of networks are hypothesized to be important for network function, we base our comparisons on summaries of network community structures. Although we use a specific method for uncovering network communities, much of the introduced framework is independent of that choice. After introducing the framework, we apply it to construct a taxonomy for 746 networks and demonstrate that our approach usefully identifies similar networks. We also construct taxonomies within individual categories of networks, and we thereby expose nontrivial structure. For example, we create taxonomies for similarity networks constructed from both political voting data and financial data. We also construct network taxonomies to compare the social structures of 100 Facebook networks and the growth structures produced by different types of fungi.
ABSTRACT: In large-scale multiple testing, correlation between tests can greatly affect inference on the null hypothesis. Specifically, correlation in the data can produce test statistics that appear to have a distribution quite different from their theoretical marginal distributions, even under the global null hypothesis. For a specific example, marginally standard Gaussian data can produce test statistics that appear to have significant shifts, narrowing, and even multiple modes under various forms of correlation. Therefore, we propose tests of the global null hypothesis which are applicable to Gaussian data under arbitrary covariance structure.
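The shift and narrowing effects mentioned above can be reproduced with a small simulation. This sketch assumes the simplest correlation structure, an equicorrelated one-factor model, which is only one of the "various forms of correlation" the abstract refers to; it does not implement the proposed tests. The function names and parameter values are hypothetical.

```python
import math
import random

def equicorrelated_z(n_tests, rho, rng, common=None):
    """Simulate z_i = sqrt(rho) * W + sqrt(1 - rho) * e_i.

    With W and e_i standard normal, each z_i is marginally N(0, 1) and every
    pair has correlation rho. Pass `common` to fix the shared factor W.
    """
    w = rng.gauss(0.0, 1.0) if common is None else common
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return [a * w + b * rng.gauss(0.0, 1.0) for _ in range(n_tests)]

def summarize(z):
    """Sample mean and standard deviation of the test statistics."""
    n = len(z)
    mean = sum(z) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / (n - 1))
    return mean, sd

if __name__ == "__main__":
    rng = random.Random(0)
    for rho in (0.0, 0.5):
        mean, sd = summarize(equicorrelated_z(5000, rho, rng))
        print("rho =", rho, "mean =", round(mean, 3), "sd =", round(sd, 3))
```

Within a single data set the shared factor W is one fixed draw, so the histogram of statistics is centred near sqrt(rho) * W with spread near sqrt(1 - rho), even though every hypothesis is null and every statistic is marginally standard Gaussian.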
Maintained by the Biostatistics Webmaster. Last Update: May 4, 2012