Department of Biostatistics
Correlated and High-Dimensional Data Seminar

2009 - 2010

Organizers: Dr. Tianxi Cai and Dr. Armin Schwartzman

Schedule: Mondays, 12:30-2:00 p.m.
HSPH2, Room 426 (unless otherwise notified)

Seminar Description
This working seminar focuses on statistical and quantitative methods for analyzing correlated and high-dimensional data. High-dimensional data arise from a wide range of studies in health science research, such as microarray gene expression studies, proteomics, array CGH studies, and genome-wide association studies. We discuss recent developments in statistical and quantitative methodology for analyzing high-dimensional data, as well as interesting biomedical applications with high-dimensional data that could motivate new methodological research. The goal of this seminar is to stimulate more research in this challenging and important area and to promote the interface between statistics and other quantitative disciplines in biomedical research.


September 21 (FXB G11)

Sebastian Schneeweiss, M.D., Sc.D.
Associate Professor of Medicine and Epidemiology, Harvard Medical School
Vice Chief, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital

"High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data"
ABSTRACT: Adjusting for large numbers of covariates ascertained from patients' health care claims data may improve control of confounding in studies of intended and unintended treatment effects, as these variables may collectively be proxies for unobserved factors. We developed a multi-step algorithm to implement high-dimensional proxy adjustment in claims data. Steps include 1) identifying data dimensions, e.g. diagnoses, procedures, and medications, 2) empirically identifying candidate covariates, 3) assessing recurrence of codes, 4) prioritizing covariates, 5) selecting covariates for adjustment, 6) estimating the exposure propensity score, and 7) estimating an outcome model. This algorithm was tested in Medicare claims data, including a study of whether COX-2 inhibitors (coxibs) reduce gastric toxicity compared to non-selective NSAIDs (nsNSAIDs).

In a population of 49,653 new users of coxibs or nsNSAIDs, a crude association with upper GI toxicity of RR = 1.09 (95% CI: 0.91-1.30) was initially observed; adjusting for 15 predefined covariates resulted in a possible gastroprotective effect (RR = 0.94; 0.78-1.12). The gastroprotective effect became stronger when adjusting for an additional 500 algorithm-derived covariates (RR = 0.88; 0.73-1.06). Results of several other claims data studies showed similarly improved confounding control.

In typical pharmacoepidemiologic studies, the proposed high-dimensional propensity score resulted in improved effect estimates compared to adjustment limited to predefined covariates, when benchmarked against results expected from randomized trials. Beyond the improved confounding adjustment, the algorithm has significant operational advantages when pooling multiple healthcare data sources with varying information content, as in the planned FDA Sentinel System, but it also raises issues of variable selection.
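
The covariate prioritization and selection steps (steps 4 and 5 of the algorithm in the abstract) could be sketched roughly as follows. The abstract does not say how covariates are ranked; this sketch assumes a Bross-style bias multiplier, which is one common choice, and ranks binary covariates by the absolute log of that multiplier. The function names (`bias_multiplier`, `prioritize`) and the small-sample smoothing in `_rate` are hypothetical and are not taken from the talk.

```python
import math


def _rate(events, total):
    # Smoothed proportion. The +0.5 / +1 correction is an illustrative
    # assumption that avoids division by zero in small samples.
    return (events + 0.5) / (total + 1.0)


def bias_multiplier(exposure, covariate, outcome):
    """Bross-style bias multiplier for one binary covariate (assumed rule).

    exposure, covariate, outcome: equal-length lists of 0/1 values.
    """
    n_exp = sum(exposure)
    n_unexp = len(exposure) - n_exp
    # Prevalence of the covariate among exposed and unexposed patients.
    p_c1 = _rate(sum(c for e, c in zip(exposure, covariate) if e), n_exp)
    p_c0 = _rate(sum(c for e, c in zip(exposure, covariate) if not e), n_unexp)
    # Relative risk of the outcome for patients with vs. without the covariate.
    n_cov = sum(covariate)
    n_nocov = len(covariate) - n_cov
    risk_cov = _rate(sum(y for c, y in zip(covariate, outcome) if c), n_cov)
    risk_nocov = _rate(sum(y for c, y in zip(covariate, outcome) if not c), n_nocov)
    rr_cd = risk_cov / risk_nocov
    return (p_c1 * (rr_cd - 1.0) + 1.0) / (p_c0 * (rr_cd - 1.0) + 1.0)


def prioritize(exposure, outcome, covariates, k):
    """Return the names of the k covariates with the largest |log multiplier|.

    covariates: dict mapping a covariate name to its list of 0/1 values.
    """
    def score(name):
        return abs(math.log(bias_multiplier(exposure, covariates[name], outcome)))

    return sorted(covariates, key=score, reverse=True)[:k]
```

A covariate that is balanced across exposure groups, or unrelated to the outcome, gets a multiplier near 1 and therefore a low rank. The selected covariates would then enter the propensity score model in step 6, which is not shown here.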

October 5

Parantu Shah, Ph.D.
Research Fellow, Department of Biostatistics, Harvard School of Public Health and Dana-Farber Cancer Institute

"Open Life Science Gateway and its Application in Predicting the B-cell Epitopes"
ABSTRACT: Open Life Science Gateway (OLSGW) integrates a group of bioinformatics applications and data collections into a portal so that life scientists can easily access the resources of the TeraGrid to submit their genome-related analyses. TeraGrid resources include more than 250 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks. OLSGW sets a new standard in providing easy-to-use Web 2.0 user interfaces, consisting of web portals and gadgets. It also allows users with no prior experience of Grid computing to compose Taverna-based workflows without facing a steep learning curve.

At present, important bioinformatics applications for biological sequence analysis (BLAST, InterproScan and HMMER), multiple sequence alignment (CLUSTALW and MUSCLE), protein secondary structure prediction (PSIPRED), solvent accessibility prediction (ACCPRO), disorder prediction (DISPRO) and 3D homology modeling (MODELLER) are available to researchers for large-scale analysis. Using a curated dataset of annotated B-cell epitopes and sequence-derived features of these epitopes obtained with OLSGW applications, we trained a support vector machine classifier and show that inclusion of evolutionary information improves the quality of B-cell epitope prediction.

October 19

Hongkai Ji, Ph.D.
Assistant Professor, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

"Improving High-throughput Data Analysis by Using Gene Expression Omnibus"
ABSTRACT: Reliably detecting biological signals from high-throughput genomics data is a challenging problem. Gene Expression Omnibus (GEO) contains tens of thousands of microarray samples, which represent a rich source of information that can be used to improve omics data analysis. In this talk, I introduce two of our recent efforts to improve tiling array and microarray analyses by incorporating information from GEO.

First, I will introduce a new technique, TileProbe, for tiling array analysis. Individual probes on an Affymetrix tiling array usually behave differently. Modeling and removing these probe effects are critical for detecting signals from the array data. Current data processing techniques either require control samples or, as with MAT, use probe sequences to model probe-specific variability. Although the MAT approach can be applied without control samples, residual probe effects continue to distort the true biological signals. We propose TileProbe, a new technique that builds upon the MAT algorithm by incorporating GEO datasets to remove tiling array probe effects. By using a large number of these readily available arrays, TileProbe robustly models the residual probe effects that the MAT model cannot explain. When applied to analyzing ChIP-chip data, TileProbe performs consistently better than MAT across a variety of analytical conditions. This shows that TileProbe resolves the issue of probe-specific effects more completely.

Second, I will introduce a correlation motif approach to jointly analyze multiple related microarray and tiling array datasets. Due to cost constraints, many high-throughput experiments contain only a few replicates. When the data are noisy, detection of biological signals is challenging. However, if the same biological system has been studied by multiple labs, it is possible to collect multiple related datasets from GEO and use them to improve statistical inference. Effective information pooling depends on adequate but parsimonious modeling of the unknown correlation structures among datasets. We propose to use a latent mixture model to characterize the correlation among datasets, and couple the model with various microarray and tiling array analysis algorithms. Both simulations and real data analyses show that the proposed method increases sensitivity and specificity compared to the approach that analyzes individual datasets separately.

October 26

Seokho Lee, Ph.D.
Research Fellow, Department of Biostatistics, Harvard School of Public Health

"Sparse Logistic Principal Components Analysis for Binary Data"
ABSTRACT: We developed a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced to the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem with a criterion function motivated by the penalized Bernoulli likelihood. A Majorization-Minimization algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.
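
To give a feel for how a Majorization-Minimization update on a penalized Bernoulli likelihood can work, here is a minimal sketch on a deliberately reduced problem: a single logit shared by all observations, with an L1 penalty. It is not the sparse logistic PCA algorithm from the talk. The abstract specifies neither the majorizer nor the penalty, so the uniform quadratic bound (the logistic loss has curvature at most 1/4) and the L1 penalty used here are assumptions, and all function names are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def objective(theta, ys, lam):
    """Bernoulli negative log-likelihood in the logit theta, plus lam * |theta|."""
    nll = sum(math.log(1.0 + math.exp(theta)) - y * theta for y in ys)
    return nll + lam * abs(theta)


def mm_step(theta, ys, lam):
    """One MM update for the penalized objective above.

    The negative log-likelihood is majorized at the current theta by a
    quadratic with curvature 1/4 per observation. Minimizing that quadratic
    plus the L1 penalty reduces to soft-thresholding a working response.
    """
    n = len(ys)
    ybar = sum(ys) / n
    working = theta + 4.0 * (ybar - sigmoid(theta))
    return soft_threshold(working, 4.0 * lam / n)
```

Each call to `mm_step` cannot increase `objective`, which is the defining property of an MM algorithm, and a large enough `lam` sets the logit exactly to zero. In a sparse PCA setting the same kind of thresholding would presumably act on the entries of the loading vectors, which is where the sparsity comes from.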

November 9

Jiashun Jin, Ph.D.
Associate Professor, Department of Statistics, Carnegie Mellon University

"Higher Criticism Thresholding: Optimal Feature Selection when Useful Features are Rare and Weak"
ABSTRACT: See attached PDF

November 23

Charles Lee, Ph.D., FACMG
Director of Cytogenetics, Harvard Cancer Center
Associate Professor, Harvard Medical School
Associate Faculty Member, MIT Broad Institute
Clinical Cytogeneticist, Brigham and Women's Hospital

"Talk Title TBA"
ABSTRACT: None Given

December 7

Eric Kolaczyk, Ph.D.
Associate Professor, Departments of Mathematics and Statistics, Boston University

"Talk Title TBA"
ABSTRACT: None Given


Maintained by the Biostatistics Webmaster
Last Update: November 2, 2009