Department of Biostatistics Correlated and High-Dimensional Data Seminar 2009 - 2010
ABSTRACT: Adjusting for large numbers of covariates ascertained from patients' health care claims data may improve control of confounding in studies of intended and unintended treatment effects, as these variables may collectively be proxies for unobserved factors. We developed a multi-step algorithm to implement high-dimensional proxy adjustment in claims data. Steps include 1) identifying data dimensions, e.g. diagnoses, procedures, and medications, 2) empirically identifying candidate covariates, 3) assessing recurrence of codes, 4) prioritizing covariates, 5) selecting covariates for adjustment, 6) estimating the exposure propensity score, and 7) estimating an outcome model. This algorithm was tested in Medicare claims data, including a study of whether Cox-2 inhibitors (coxibs) reduce gastric toxicity compared to non-selective NSAIDs (nsNSAIDs). In a population of 49,653 new users of coxibs or nsNSAIDs, a crude association with upper GI toxicity of RR = 1.09 (95% CI: 0.91-1.30) was initially observed; adjusting for 15 predefined covariates resulted in a possible gastroprotective effect (RR = 0.94; 0.78-1.12). The gastroprotective effect became stronger when adjusting for an additional 500 algorithm-derived covariates (RR = 0.88; 0.73-1.06). Results of several other claims data studies showed similarly improved confounding control.
In typical pharmacoepidemiologic studies, the proposed high-dimensional propensity score resulted in improved effect estimates compared to adjustment limited to predefined covariates, when benchmarked against results expected from randomized trials. Beyond the improved confounding adjustment, the algorithm has significant operational advantages when pooling multiple healthcare data sources with varying information content, as in the planned FDA Sentinel System, but it also raises issues of variable selection.
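The covariate identification, prioritization, and selection steps (2, 4, and 5) of the algorithm described in this abstract can be sketched as follows. This is a minimal illustration, not the speakers' implementation: the abstract does not say how covariates are prioritized, so the score used here (absolute log ratio of code prevalence between exposed and unexposed patients) is an assumption, as are the function name `select_covariates`, the input format, and the toy codes.

```python
import math
from collections import Counter


def select_covariates(patient_codes, exposed, n_candidates, k):
    """Return the k highest-priority codes and the scores of all candidates.

    patient_codes: one set of claim codes per patient (hypothetical format)
    exposed: one bool per patient, True for the treatment of interest
    """
    # Step 2: candidate covariates are the most prevalent codes.
    counts = Counter(code for codes in patient_codes for code in codes)
    by_prevalence = sorted(counts, key=lambda c: (-counts[c], c))
    candidates = by_prevalence[:n_candidates]

    n_exposed = sum(exposed)
    n_unexposed = len(exposed) - n_exposed

    # Step 4: prioritize by exposure imbalance (illustrative score only).
    scores = {}
    for code in candidates:
        in_exposed = sum(
            1 for codes, e in zip(patient_codes, exposed) if e and code in codes
        )
        in_unexposed = counts[code] - in_exposed
        # Adding 0.5 keeps the ratio finite when a group has no carriers.
        p1 = (in_exposed + 0.5) / (n_exposed + 1)
        p0 = (in_unexposed + 0.5) / (n_unexposed + 1)
        scores[code] = abs(math.log(p1 / p0))

    # Step 5: keep the k top-ranked covariates for the propensity model.
    ranked = sorted(candidates, key=lambda c: (-scores[c], c))
    return ranked[:k], scores


# Toy data: six patients, three exposed and three unexposed.
codes = [
    {"dx:ulcer", "rx:ppi"},
    {"dx:ulcer", "rx:ppi"},
    {"rx:ppi", "px:endoscopy"},
    {"dx:htn"},
    {"dx:htn", "rx:ppi"},
    {"dx:htn"},
]
exposure = [True, True, True, False, False, False]
selected, scores = select_covariates(codes, exposure, n_candidates=3, k=2)
print(selected)
```

The selected codes would then enter the propensity score model as binary indicators (step 6). Ranking on exposure imbalance alone is the simplest possible choice; a prioritization rule that also uses the outcome could be substituted without changing the rest of the sketch.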
ABSTRACT: Open Life Science Gateway (OLSGW) integrates a group of bioinformatics applications and data collections into a portal so that life scientists can easily access the resources of the TeraGrid to submit their genome-related analyses. TeraGrid resources include more than 250 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks. OLSGW sets a new standard in providing easy-to-use Web 2.0 user interfaces, consisting of web portals and gadgets. It also allows users with no prior experience of Grid computing to compose Taverna-based workflows without facing a steep learning curve. At present, important bioinformatics applications for biological sequence analysis (BLAST, InterProScan and HMMER), multiple sequence alignment (CLUSTALW and MUSCLE), protein secondary structure prediction (PSIPRED), solvent accessibility prediction (ACCPRO), disorder prediction (DISPRO) and 3D homology modeling (MODELLER) are available to researchers for large-scale analysis. Using OLSGW applications, we derived sequence-based features from a curated dataset of annotated B-cell epitopes and used them to train a support vector machine classifier; the results show that including evolutionary information improves the quality of B-cell epitope prediction.
ABSTRACT: Reliably detecting biological signals from high-throughput genomics data is a challenging problem. Gene Expression Omnibus (GEO) contains tens of thousands of microarray samples, which represent a rich source of information that can be used to improve omics data analysis. In this talk, I introduce two of our recent efforts to improve tiling array and microarray analyses by incorporating information from GEO. First, I will introduce a new technique, TileProbe, for tiling array analysis. Individual probes on an Affymetrix tiling array usually behave differently. Modeling and removing these probe effects is critical for detecting signals from the array data. Current data processing techniques either require control samples or use probe sequences to model probe-specific variability, as MAT does. Although the MAT approach can be applied without control samples, residual probe effects continue to distort the true biological signals. We propose TileProbe, a new technique that builds upon the MAT algorithm by incorporating GEO datasets to remove tiling array probe effects. By using a large number of these readily available arrays, TileProbe robustly models the residual probe effects that the MAT model cannot explain. When applied to ChIP-chip data, TileProbe performs consistently better than MAT across a variety of analytical conditions. This shows that TileProbe resolves the issue of probe-specific effects more completely.
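The idea of learning probe effects from many existing arrays can be illustrated with a small sketch. It is deliberately simplified and is not the TileProbe or MAT algorithm: it assumes each probe's behavior can be summarized by its mean and standard deviation across a set of reference arrays, and it standardizes a new array against those values. The function names and the toy intensities are hypothetical.

```python
from statistics import mean, stdev


def probe_effects(reference_arrays):
    """Estimate a (mean, sd) pair per probe from many reference arrays.

    reference_arrays: list of arrays, each a list with one value per probe.
    """
    per_probe = zip(*reference_arrays)  # regroup values probe by probe
    return [(mean(values), stdev(values)) for values in per_probe]


def standardize(array, effects, min_sd=1e-8):
    """Remove the probe-specific location and scale from one array."""
    return [(x - m) / max(s, min_sd) for x, (m, s) in zip(array, effects)]


# Two probes measured on three reference arrays (toy values).
reference = [
    [1.0, 5.0],
    [1.5, 5.25],
    [0.5, 4.75],
]
effects = probe_effects(reference)

# On the raw scale the second probe looks brighter, but only the first
# probe departs from its own typical behavior.
new_array = [2.0, 5.0]
print(standardize(new_array, effects))
```

The point of the example is that a probe's raw intensity is not comparable across probes until its usual level and variability have been accounted for, which is what a large collection of public arrays makes possible to estimate.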
Second, I will introduce a correlation motif approach to jointly analyze multiple related microarray and tiling array datasets. Due to cost constraints, many high-throughput experiments contain only a few replicates. When the data are noisy, detecting biological signals is challenging. However, if the same biological system has been studied by multiple labs, it is possible to collect multiple related datasets from GEO and use them to improve statistical inference. Effective information pooling depends on adequate but parsimonious modeling of the unknown correlation structures among datasets. We propose to use a latent mixture model to characterize the correlation among datasets, and couple the model with various microarray and tiling array analysis algorithms. Both simulations and real data analyses show that the proposed method increases sensitivity and specificity compared to analyzing each dataset separately.
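How a latent mixture model lets datasets borrow strength from each other can be shown with a short sketch. The parameterization below is an assumption made for illustration, since the abstract does not give the model's form: each gene belongs to one of several latent classes ("motifs"), a motif gives the probability that the gene carries signal in each dataset, and datasets are independent given the motif. All names and numbers are hypothetical, and the parameters are taken as known rather than estimated.

```python
def posterior_signal(lik_signal, lik_null, weights, motifs):
    """Posterior probability of signal in each dataset for one gene.

    lik_signal[d], lik_null[d]: likelihood of the gene's data in dataset d
        with and without signal
    weights[k]: prior probability of motif k
    motifs[k][d]: probability of signal in dataset d under motif k
    """
    n_datasets = len(lik_signal)

    # Joint likelihood of the gene's data and each motif.
    joint = []
    for w, q in zip(weights, motifs):
        value = w
        for d in range(n_datasets):
            value *= q[d] * lik_signal[d] + (1 - q[d]) * lik_null[d]
        joint.append(value)
    total = sum(joint)
    responsibility = [j / total for j in joint]

    # Average the within-motif posteriors over the motif responsibilities.
    posterior = []
    for d in range(n_datasets):
        p = 0.0
        for r, q in zip(responsibility, motifs):
            with_signal = q[d] * lik_signal[d]
            p += r * with_signal / (with_signal + (1 - q[d]) * lik_null[d])
        posterior.append(p)
    return posterior


# Two datasets; motif 0 is "no signal anywhere", motif 1 is "signal in both".
weights = [0.8, 0.2]
motifs = [[0.01, 0.01], [0.9, 0.9]]

# Strong evidence in dataset 0, weak evidence in dataset 1.
lik_signal, lik_null = [10.0, 2.0], [1.0, 1.0]
joint_post = posterior_signal(lik_signal, lik_null, weights, motifs)

# Analyzing dataset 1 alone, with the same marginal prior for signal.
prior = sum(w * q[1] for w, q in zip(weights, motifs))
separate_post = prior * lik_signal[1] / (
    prior * lik_signal[1] + (1 - prior) * lik_null[1]
)
print(joint_post[1], separate_post)
```

In this toy setting the weak evidence in the second dataset is reinforced by the strong evidence in the first, so the joint posterior for the second dataset is higher than the one obtained from that dataset alone.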
ABSTRACT: We developed a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced to the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem whose criterion function is motivated by a penalized Bernoulli likelihood. A Majorization-Minimization algorithm is developed to efficiently solve the optimization problem. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.
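A rank-one version of this idea can be sketched in a few lines. The sketch rests on assumptions that the abstract does not confirm: the logit of each success probability is modeled as `a[i] * b[j]` with no mean term, the penalty is an L1 norm on the loadings `b`, and the majorizer is the standard quadratic bound that uses the fact that the curvature of the logistic loss is at most 1/4. The speakers' actual criterion and majorizer may differ, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_threshold(c, t):
    """Shrink c toward zero by t; this is the minimizer of the L1 subproblem."""
    return math.copysign(max(abs(c) - t, 0.0), c)


def objective(Y, a, b, lam):
    """Negative Bernoulli log-likelihood plus the L1 penalty on loadings."""
    nll = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            q = 2 * y - 1  # recode 0/1 as -1/+1
            nll -= math.log(sigmoid(q * a[i] * b[j]))
    return nll + lam * sum(abs(v) for v in b)


def mm_step(Y, a, b, lam):
    """One majorization-minimization step; a is kept at unit norm."""
    n, p = len(Y), len(Y[0])
    # Majorization: the quadratic bound turns the problem into a penalized
    # least-squares fit to a working response X.
    X = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            q = 2 * Y[i][j] - 1
            theta = a[i] * b[j]
            X[i][j] = theta + 4 * q * (1 - sigmoid(q * theta))
    # Minimization over the scores a (unit norm) with b fixed.
    Xb = [sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    norm = math.sqrt(sum(v * v for v in Xb))
    if norm > 0:
        a = [v / norm for v in Xb]
    # Minimization over the loadings b with a fixed: soft-thresholding.
    b = [
        soft_threshold(sum(a[i] * X[i][j] for i in range(n)), 4 * lam)
        for j in range(p)
    ]
    return a, b


def fit(Y, lam, n_iter=25):
    n, p = len(Y), len(Y[0])
    # Crude start: scores from the recoded first column, loadings at zero.
    a = [(2 * row[0] - 1) / math.sqrt(n) for row in Y]
    b = [0.0] * p
    history = [objective(Y, a, b, lam)]
    for _ in range(n_iter):
        a, b = mm_step(Y, a, b, lam)
        history.append(objective(Y, a, b, lam))
    return a, b, history


# Toy binary matrix: columns 0 and 2 carry opposite group structure.
Y = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
a, b, history = fit(Y, lam=0.5)
print(b)
```

Because each step minimizes an upper bound that touches the objective at the current iterate, the penalized objective cannot increase from one iteration to the next, and the soft-thresholding update is what sets uninformative loadings exactly to zero.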
ABSTRACT: See attached PDF
ABSTRACT: None Given
ABSTRACT: None Given
Maintained by the Biostatistics Webmaster. Last Update: November 2, 2009