Department of Biostatistics
ABSTRACT: We propose new, optimal methods for analyzing randomized trials when it is suspected that treatment effects may differ in two predefined subpopulations. Such subpopulations could be defined by a biomarker or risk factor measured at baseline. The goal is to learn which subpopulations benefit from an experimental treatment while providing strong control of the familywise Type I error rate. We formalize this as a multiple testing problem and show that it is computationally infeasible to solve using existing techniques. Our solution first transforms the original multiple testing problem into a large, sparse linear program, which we then solve using advanced optimization techniques. This general method can solve a variety of multiple testing problems and decision theory problems related to optimal trial design for which no solution was previously available. Specifically, we construct new multiple testing procedures that satisfy minimax and Bayes optimality criteria. For a given optimality criterion, our approach yields the optimal tradeoff between power to detect an effect in the overall population and power to detect effects in subpopulations. We give examples where this tradeoff is favorable, in that improvements in power to detect subpopulation treatment effects are possible at relatively little cost in additional sample size. We demonstrate our approach in examples motivated by two randomized trials of new treatments for HIV. Below we give an image representing the rejection regions for an optimal procedure that will be discussed in the talk. This is joint work with Han Liu (Princeton) and En-Hsu Yen (University of Texas at Austin), which has been accepted for publication in the Journal of the American Statistical Association (Theory and Methods).
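The reduction described above, discretizing the testing problem and solving a sparse linear program, can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the paper's actual formulation: the grid, the effect size `delta`, independence of the two subpopulation z-statistics, and the expected-true-rejections objective.

```python
# Toy sketch (illustrative assumptions, not the paper's formulation):
# discretize the joint distribution of two subpopulation z-statistics
# and search for the multiple testing procedure maximizing expected
# true rejections subject to familywise Type I error <= alpha, as an LP.
import numpy as np
from scipy.stats import norm
from scipy.optimize import linprog
from scipy.sparse import lil_matrix, csr_matrix, vstack

alpha = 0.05    # familywise Type I error level (assumed)
delta = 2.5     # assumed subpopulation effect size, on the z-scale
edges = np.linspace(-4.0, 4.0, 33)         # grid over each z-statistic
p0 = np.diff(norm.cdf(edges))              # cell probabilities, mean 0
p1 = np.diff(norm.cdf(edges - delta))      # cell probabilities, mean delta
n = p0.size
m = n * n

# Joint cell probabilities under each null/alternative configuration
# (the two subpopulation z-statistics are assumed independent here).
f00 = np.outer(p0, p0).ravel()   # both nulls true
f10 = np.outer(p1, p0).ravel()   # only subpopulation 1 benefits
f01 = np.outer(p0, p1).ravel()   # only subpopulation 2 benefits
f11 = np.outer(p1, p1).ravel()   # both benefit

# Decision variables per grid cell: the probabilities of rejecting
# exactly {H1}, {H2}, or {H1, H2} when the z-statistics land there.
# Objective: maximize the expected number of true rejections under f11.
c = -np.concatenate([f11, f11, 2.0 * f11])

# Familywise error constraints, one per null configuration.
A_fwer = np.vstack([
    np.concatenate([f00, f00, f00]),      # global null: any rejection errs
    np.concatenate([0 * f10, f10, f10]),  # H2 true: rejecting H2 errs
    np.concatenate([f01, 0 * f01, f01]),  # H1 true: rejecting H1 errs
])
b_fwer = np.full(3, alpha)

# Per-cell constraint: the three rejection-set probabilities sum to <= 1.
A_cell = lil_matrix((m, 3 * m))
for j in range(m):
    A_cell[j, [j, j + m, j + 2 * m]] = 1.0

res = linprog(c,
              A_ub=vstack([csr_matrix(A_fwer), A_cell.tocsr()]),
              b_ub=np.concatenate([b_fwer, np.ones(m)]),
              bounds=(0, 1), method="highs")
print("expected true rejections at the optimum:", -res.fun)
```

The LP is sparse because each per-cell constraint touches only three variables; allowing randomized (fractional) rejection probabilities is what makes the problem linear rather than combinatorial.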
I will focus on an expression genetics experiment in the mouse, with gene expression microarray data on each of six tissues plus high-density genotype data in each of 500 mice. I argue that in research with such data, precise statistical inference is not as important as data visualization.
ABSTRACT: The public reporting of performance standards is central to efforts to monitor and improve health care provider quality. One approach is to set performance targets, or statistical benchmarks, that define a high level (e.g., top 10%) of observed provider performance. Widely used approaches to setting benchmarks often summarize direct estimates of hospital performance, such as the observed proportion of instances in which a provider delivers a particular type of care. Such benchmarks may be unduly affected by high-variance direct estimates of provider performance. While provider-specific posterior means offer more stable estimates, concerns have been raised about basing benchmarks on them. In this talk, I will discuss how identifying providers that exceed a performance benchmark requires fully considering the distribution of performance across all providers, and note that the two aforementioned approaches instead focus on the distribution of summaries of hospital-specific performance. Using publicly available data from the Medicare Hospital Compare website, I will illustrate how widely used statistical benchmarks of provider performance compare to those obtained by estimating the empirical distribution function of provider performance using hierarchical Bayesian modeling. In this analysis, the benchmarking approaches varied with respect to which providers exceeded the top 10% benchmark, but not the 50% threshold. The results illustrate that benchmarks derived from the histogram of provider performance under hierarchical Bayesian modeling offer a compromise between benchmarks based on direct estimates, which are over-dispersed relative to the true distribution of provider performance and prone to high variance for small providers, and benchmarks based on posterior mean provider performance, for which under-dispersion relative to the true performance distribution is a concern.
Given the rewards and penalties associated with characterizing top performance, the ability of statistical benchmarks to summarize key features of the provider performance distribution should be further examined.
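The over- and under-dispersion tradeoff described above can be seen in a small simulation. This is a hypothetical normal-normal sketch, not the Hospital Compare analysis: the true provider effects, noise level, number of providers, and known variances are all assumptions for illustration.

```python
# Hypothetical simulation (not the talk's analysis): compare top-10%
# benchmarks from direct estimates, posterior means, and the pooled
# posterior distribution of performance, in a normal-normal model.
import numpy as np

rng = np.random.default_rng(0)
n_prov, tau, sigma = 1000, 1.0, 1.5
theta = rng.normal(0, tau, n_prov)        # true provider performance
y = theta + rng.normal(0, sigma, n_prov)  # noisy direct estimates

# Posterior means shrink toward the grand mean (variances known here).
shrink = tau**2 / (tau**2 + sigma**2)
post_mean = shrink * y
post_sd = np.sqrt(shrink * sigma**2)

true_bench = np.quantile(theta, 0.9)
direct_bench = np.quantile(y, 0.9)          # over-dispersed: too high
pm_bench = np.quantile(post_mean, 0.9)      # under-dispersed: too low
# Benchmark from the estimated distribution of performance: pool
# posterior draws across providers, then take the 90th percentile.
draws = rng.normal(post_mean[:, None], post_sd, (n_prov, 200))
hist_bench = np.quantile(draws, 0.9)

print(direct_bench, pm_bench, hist_bench, true_bench)
```

In this toy model the pooled posterior draws have marginal variance equal to the true between-provider variance, so their 90th percentile tracks the true benchmark, while the direct-estimate and posterior-mean benchmarks bracket it from above and below.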
ABSTRACT: Identification of genes underlying variation in human traits provides information that can lead to interventions that modify disease risk. Success in identifying such genes, especially in the context of complex traits, depends on the adequacy of our assumed genetic models coupled with the use of statistical methods that make effective use of available data. In general, these methods make use of correlation in data resulting from well-understood biological processes. These include correlation between trait values because of inheritance in pedigrees, and correlation in the allelic states at different positions in the genome because of shared population histories. Challenges that arise in analysis include large computational demands, the need to make efficient use of the data, and complications introduced by the enormous variability present in genomic data. I will discuss approaches for integrating information from both population and pedigree data with data from high-throughput sequencing technologies under genetic models that include both rare and common underlying risk variants. These include (1) choice of samples to use and to sequence, (2) imputation of sequence data in pedigrees for efficient data use, and (3) restriction and grouping for reducing the multiple testing problem.
ABSTRACT: Surveys often ask respondents to report non-negative counts, but respondents may misremember or round to a nearby multiple of 5 or 10. The error inherent in this heaping can bias estimation. To avoid bias, we propose a novel reporting distribution arising from a general birth-death process whose underlying parameters are readily interpretable as rates of misremembering and rounding. The process accommodates a variety of heaping grids and allows for quasi-heaping to values nearly but not equal to heaping multiples. Inference using this stochastic process requires novel, efficient techniques for computing finite-time transition probabilities of arbitrary birth-death processes, which we provide through Laplace transforms and a continued fraction representation. We present a Bayesian hierarchical model for longitudinal samples with covariates to infer both the unobserved true distribution of counts and the parameters that control the heaping process. Finally, we apply our methods to longitudinal self-reported counts of sex partners in a study of high-risk behavior in HIV-positive youth.
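For context on the transition-probability computation, here is a naive sketch that truncates the state space and exponentiates the generator matrix. The rates, truncation level, and use of `scipy.linalg.expm` are illustrative assumptions; the talk's method instead works through Laplace transforms and a continued fraction representation, which avoids truncating the state space.

```python
# Naive illustration (not the talk's Laplace-transform method): compute
# finite-time transition probabilities of a birth-death process by
# exponentiating the generator matrix on a truncated state space.
import numpy as np
from scipy.linalg import expm

def bd_transition_matrix(birth, death, t, n_max):
    """Transition matrix P(t) for a birth-death process with birth rate
    birth(k) and death rate death(k), truncated at state n_max."""
    Q = np.zeros((n_max + 1, n_max + 1))
    for k in range(n_max + 1):
        lam = birth(k) if k < n_max else 0.0  # no births past truncation
        mu = death(k) if k > 0 else 0.0       # no deaths from state 0
        if k < n_max:
            Q[k, k + 1] = lam
        if k > 0:
            Q[k, k - 1] = mu
        Q[k, k] = -(lam + mu)                 # generator rows sum to 0
    return expm(Q * t)

# Example: a linear birth-death process, rates proportional to the count.
P = bd_transition_matrix(lambda k: 0.5 * k, lambda k: 0.3 * k,
                         t=1.0, n_max=60)
```

Each row of `P` is a probability distribution over states at time `t` given the starting state; truncation is harmless only when mass above `n_max` is negligible, which is one reason transform-based methods are attractive.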
ABSTRACT: None Given