Department of Biostatistics
2008 - 2009
ABSTRACT: It has been a century since early reports suggested heredity in some psychiatric disorders, such as insanity. Decades have passed since the modes and levels of inheritance were documented for a number of psychiatric and behavioral disorders, such as Tourette's syndrome and nicotine dependence. Despite recent landmark successes that led to the discovery of genetic variants for several complex diseases, the hunt for genes underlying mental disorders has remained largely elusive. In addition to political challenges, there are major clinical and analytical challenges. Mental disorders are difficult to characterize both phenotypically and genetically. Beyond the challenges common to complex diseases such as cancer and age-related macular degeneration, there are great intrapersonal variations and uncertainties, particularly over time. Diagnoses of mental disorders generally depend on instruments that include many descriptive questions, and comorbidity is common. I will present some of the joint work conducted by my group in recent years, motivated by needs arising from the study of mental disorders. For example, we have developed methodology and software to analyze ordinal traits and multiple traits, both commonly encountered in mental health research. The potential of these methods has been demonstrated through simulations as well as genetic analyses of several mental disorders, including hoarding, nicotine dependence, and alcohol dependence.
ABSTRACT: In 2003, the U.S. FDA, the MHRA in the U.K., and the European Union released public health advisories warning of a possible causal link between antidepressant treatment and suicide in children and adolescents ages 18 and under. This led the U.S. FDA to issue a black box warning for antidepressant treatment of childhood depression in 2004, which was extended to include young adults (ages 18-24) in 2006. Following these warnings, rather than the anticipated decrease in youth suicide rates, record increases in youth suicide rates were observed in both the U.S. and Europe. In this presentation, we review the data and statistical methodology that led to the public health advisories and the black box warning, examine the data documenting the record increases in youth suicide rates, and discuss their possible relationship. New statistical and experimental design approaches to post-marketing drug safety surveillance are developed, discussed, and illustrated.
ABSTRACT: The primary goal of a randomized clinical trial is to make comparisons between two or more treatments. For example, in a two-arm trial with continuous response, the focus may be on the difference in treatment means; with more than two treatments, the comparison may be based on pairwise differences. With a binary outcome, pairwise odds ratios or log odds ratios may be used. In general, comparisons may be based on meaningful parameters in a relevant statistical model. Standard analyses for estimation and testing in this context are typically based only on the data collected on response and treatment assignment. In many trials, auxiliary baseline covariate information may also be available, and there has been considerable debate regarding whether and how these data should be used to improve the efficiency of inferences. Taking a semiparametric theory perspective, we propose a broadly applicable approach to achieving more efficient estimators and tests in the analysis of randomized clinical trials, where "adjustment" for auxiliary covariates is carried out in such a way that concerns over the potential for bias and subjectivity, often raised for other covariate adjustment methods, may be obviated. Simulations and applications demonstrate the performance of the methods.
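To convey the flavor of this kind of covariate adjustment (a generic augmented difference-in-means sketch, not necessarily the speakers' exact estimator), the following code adjusts a two-arm comparison using linear working models for the outcome in each arm; by randomization, the estimator remains consistent even if these working models are misspecified. The function name and working-model choice are illustrative assumptions.

```python
import numpy as np

def augmented_difference(y, z, x):
    """Covariate-adjusted estimate of the treatment mean difference in a
    randomized two-arm trial.  z is the 0/1 treatment indicator, x a
    baseline covariate.  Working outcome models are linear in x (an
    illustrative choice); randomization protects consistency even when
    these working models are wrong."""
    n = len(y)
    pi = z.mean()                      # estimated randomization probability
    X = np.column_stack([np.ones(n), x])
    # working regressions of y on x within each arm
    b1, *_ = np.linalg.lstsq(X[z == 1], y[z == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(X[z == 0], y[z == 0], rcond=None)
    m1, m0 = X @ b1, X @ b0
    # augmented (influence-function-based) estimators of each arm mean
    mu1 = np.mean(z * y / pi - (z - pi) / pi * m1)
    mu0 = np.mean((1 - z) * y / (1 - pi) + (z - pi) / (1 - pi) * m0)
    return mu1 - mu0
```

When the covariate is prognostic, the augmentation terms absorb chance imbalance in x between arms, which is the source of the efficiency gain over the unadjusted difference in means.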
This is joint work with Marie Davidian, Min Zhang, and Xiaomin Lu.
ABSTRACT: We address a major discrepancy in matching methods for causal inference in observational data. Since these data are typically plentiful, the goal of matching is to reduce bias, and only secondarily to keep variance low. However, most matching methods seem designed for the opposite problem, guaranteeing sample size ex ante but limiting bias by controlling for covariates through reductions in the imbalance between treated and control groups only ex post, and only sometimes. (The resulting practical difficulty may explain why most published applications do not check whether imbalance was reduced and so may not even be decreasing bias.) We introduce a new class of "Monotonic Imbalance Bounding" (MIB) matching methods that enables one to choose a fixed level of maximum imbalance, or to reduce maximum imbalance for one variable without changing the maximum imbalance for the others. We then discuss a specific MIB method called "Coarsened Exact Matching" (CEM) which, unlike most existing approaches, also explicitly bounds, through ex ante user choice, both the degree of model dependence and the treatment effect estimation error; eliminates the need for a separate procedure to restrict data to common support; meets the congruence principle; is robust to measurement error; works well with modern methods of imputation for missing data; is computationally efficient even with massive data sets; and is easy to understand and use. This method can improve causal inferences in a wide range of applications, and may be preferred for simplicity of use even when it is possible to design superior methods for particular problems. We also make available open source software which implements all our suggestions. This is joint work with Stefano M. Iacus and Giuseppe Porro; a copy of the paper can be found at http://gking.harvard.edu/files/abs/cem-abs.shtml.
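The coarsen-then-exact-match idea can be sketched in a few lines. The version below is a minimal illustration assuming equal-width bins for every covariate and simple within-stratum weights; the actual cem software lets users choose cutpoints per variable ex ante (which is what makes the imbalance bound user-controlled) and computes globally normalized weights.

```python
import numpy as np
from collections import defaultdict

def cem_weights(X, treated, n_bins=5):
    """Minimal Coarsened Exact Matching sketch.  Coarsen each covariate
    into n_bins equal-width bins, exact-match on the coarsened strata,
    prune strata lacking both treated and control units (weight 0), and
    reweight controls to the treated counts within retained strata."""
    n, p = X.shape
    coarse = np.empty((n, p), dtype=int)
    for j in range(p):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        # digitize against interior edges -> bin index in {0,...,n_bins-1}
        coarse[:, j] = np.digitize(X[:, j], edges[1:-1])
    strata = defaultdict(list)
    for i in range(n):
        strata[tuple(coarse[i])].append(i)   # exact match on coarsened values
    w = np.zeros(n)
    for idx in strata.values():
        idx = np.array(idx)
        t = treated[idx]
        if t.any() and (~t).any():              # keep only matched strata
            w[idx[t]] = 1.0                     # treated units keep weight 1
            w[idx[~t]] = t.sum() / (~t).sum()   # controls reweighted to treated
    return w
```

Pruning unmatched strata is what restricts the analysis to common support automatically, with no separate trimming step.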
ABSTRACT: Lots of statisticians are now involved in the analysis of genomic, transcriptomic, epigenomic, and proteomic data from tumors. I believe many more are needed, for with the explosion of data from next-generation DNA sequencing, I see the danger of our drowning in data. However, most of these analyses are quite a distance from decisions regarding the treatment of cancer patients. If the widely discussed notion of personalized medicine is to become a reality, these kinds of data have to be used to decide on therapies for patients, for example in the choice of drugs. What further analyses and other steps are required to achieve this goal? I'll discuss my necessarily limited perspective on these issues, in effect outlining my future hopes and plans and issuing challenges to others to join the game.
ABSTRACT: In this talk, we present a general framework for combining information using confidence distributions (CDs), and illustrate it through an example of incorporating expert opinions with information from clinical trial data. Confidence distributions, often viewed as "distribution estimators" of parameters, contain a wealth of information for inference, much more than point estimators or confidence intervals ("interval estimators"). We present a formal definition of CDs and develop a general framework for combining information based on them. This framework not only unifies the classical p-value combination and model-based meta-analysis approaches, but also allows us to propose new methodologies. In particular, we develop a Frequentist approach to combining surveys of expert opinions with binomial clinical trial data, and illustrate it using data from collaborative research with Johnson & Johnson Pharmaceuticals. The results from the Frequentist approach are compared with those from Bayesian approaches. We demonstrate that the Frequentist approach has distinct advantages over the Bayesian approaches, especially when the informative prior distribution is skewed.
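One concrete combining recipe in the CD framework is the normal (Stouffer-type) combination, a single member of the general family of CD combiners. The sketch below is my own illustration, not the Johnson & Johnson analysis: it combines a hypothetical normal CD encoding an expert-opinion survey with a normal-approximation CD from binomial trial data.

```python
import numpy as np
from statistics import NormalDist

Z = NormalDist()  # standard normal, used as the combining distribution

def combine_cds(cds, grid):
    """Combine confidence distributions H_1, ..., H_k pointwise via
    H_c(theta) = Phi( sum_i Phi^{-1}(H_i(theta)) / sqrt(k) ),
    one member of the general family of CD combination functions.
    Each element of cds is a callable theta -> H_i(theta) in (0, 1)."""
    k = len(cds)
    vals = []
    for th in grid:
        # clip to avoid +/- infinity at the boundaries of (0, 1)
        s = sum(Z.inv_cdf(min(max(cd(th), 1e-12), 1 - 1e-12)) for cd in cds)
        vals.append(Z.cdf(s / np.sqrt(k)))
    return np.array(vals)

# hypothetical inputs: expert opinion on a response rate centered at 0.30
# (sd 0.10), and a normal-approximation CD from 35 responders in a
# 100-patient trial -- both made up for illustration
expert = lambda p: Z.cdf((p - 0.30) / 0.10)
se = np.sqrt(0.35 * 0.65 / 100)
trial = lambda p: Z.cdf((p - 0.35) / se)
```

The median of the combined CD then serves as a point estimate, and its quantiles give confidence intervals at any level, which is the sense in which a CD carries more information than a single interval.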
ABSTRACT: This talk is based on the discussion paper in the latest issue of Statistical Science (jointly with Nicolae and Kong, 2008, 287-331; Reprint available at stat.harvard.edu/faculty_page.php?page=meng.html) with the following abstract:
Many practical studies rely on hypothesis testing procedures applied to datasets with missing information. An important part of the analysis is to determine the impact of the missing data on the performance of the test, and this can be done by properly quantifying the relative (to complete data) amount of available information. The problem is directly motivated by applications to studies, such as linkage analyses and haplotype-based association projects, designed to identify genetic contributions to complex diseases. In these genetic studies, the relative information measures are needed for experimental design, technology comparison, interpretation of the data, and for understanding the behavior of some of the inference tools. The central difficulties in constructing such information measures arise from the multiple, and sometimes conflicting, aims in practice. For large samples, we show that a satisfactory, likelihood-based general solution exists by using appropriate forms of the relative Kullback-Leibler information, and that the proposed measures are computationally inexpensive given the maximized likelihoods with the observed data. Two measures are introduced, under the null and alternative hypotheses respectively. We exemplify the measures on data from mapping studies of inflammatory bowel disease and diabetes. For small-sample problems, which appear rather frequently in practice and sometimes in disguised forms (e.g., measuring individual contributions to a large study), the robust Bayesian approach holds great promise, though the choice of a general-purpose "default prior" is a very challenging problem. We also report several intriguing connections encountered in our investigation, such as the connection with the fundamental identity for the EM algorithm, the connection with the second CR (Chapman-Robbins) lower information bound, the connection with entropy, and connections between likelihood ratios and Bayes factors.
We hope that these seemingly unrelated connections, as well as our specific proposals, will stimulate a general discussion and research in this theoretically fascinating and practically needed area.
ABSTRACT: Model selection and classification using high-dimensional features arise frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as random guessing, due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as badly as random guessing. Thus, it is paramount to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The connections with the sure independence screening (SIS) and iterative SIS (ISIS) of Fan and Lv (2008) in model selection will be elucidated and extended. A further extension of correlation learning yields independence learning for feature selection under general loss functions.
The choice of the optimal number of features, or equivalently the threshold value of the test statistics, is based on an upper bound of the classification error. Simulation studies and real data analyses support our theoretical results and convincingly demonstrate the advantage of our new classification procedure.
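The select-then-classify structure of FAIR can be sketched as follows: rank features by absolute two-sample t-statistics, keep the top m, and apply the independence (diagonal) discriminant rule on those features only. In this sketch m is passed in directly, whereas the method described above chooses it by minimizing an upper bound on the classification error; function names are mine.

```python
import numpy as np

def fair_fit(X, y, m):
    """Sketch of a Features Annealed Independence Rule: keep the m
    features with the largest absolute two-sample t-statistics and
    return what the diagonal discriminant rule needs at prediction
    time.  Choosing m from an error upper bound (as in the talk) is
    omitted here."""
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = len(X1), len(X0)
    mu1, mu0 = X1.mean(0), X0.mean(0)
    s1, s0 = X1.var(0, ddof=1), X0.var(0, ddof=1)
    t = (mu1 - mu0) / np.sqrt(s1 / n1 + s0 / n0)   # per-feature t-statistics
    keep = np.argsort(-np.abs(t))[:m]              # top-m features only
    pooled = ((n1 - 1) * s1 + (n0 - 1) * s0) / (n1 + n0 - 2)
    return keep, mu1, mu0, pooled

def fair_predict(X, keep, mu1, mu0, pooled):
    # independence rule: variance-weighted distance to each centroid,
    # computed over the selected features alone
    d1 = ((X[:, keep] - mu1[keep]) ** 2 / pooled[keep]).sum(1)
    d0 = ((X[:, keep] - mu0[keep]) ** 2 / pooled[keep]).sum(1)
    return (d1 < d0).astype(int)
```

With all features retained (m = p), this reduces to the plain independence rule whose noise-accumulation failure motivates the selection step.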