Department of Biostatistics
ABSTRACT: We develop a Bayesian nonparametric model for inference on a longitudinal response in the presence of nonignorable missing data. Our general approach is to first specify a working model that flexibly models the missingness and full outcome processes jointly. We specify a Dirichlet process mixture of independent models as a prior on the joint distribution of the working model. This aspect of the model governs the fit of the observed data by modeling the observed data distribution as the marginalization over the missing data in the working model. We then separately specify the conditional distribution of the missing data given the observed data and dropout. This approach allows us to identify the distribution of the missing data using identifying restrictions as a starting point. We propose a framework for introducing sensitivity parameters, allowing us to vary the untestable assumptions about the missing data mechanism smoothly over a space of such assumptions. Informative priors on this space can be specified to combine inferences under many different assumptions into a final inference. We demonstrate this by applying the method to simulated data (and comparing with standard methods) and to data from two clinical trials.
This is joint work with Antonio Linero (University of Florida).
ABSTRACT: Investigators commonly gather longitudinal data to assess changes in responses over time and to relate these changes to within-subject changes in predictors. With rare or expensive outcomes such as uncommon diseases and costly radiologic measurements, outcome-dependent, and more generally outcome-related, sampling plans can improve estimation efficiency and reduce cost. Longitudinal follow-up of subjects gathered in an initial outcome-dependent sample can then be used to study the trajectories of responses over time and to assess the association of within-subject changes in predictors with changes in response. In this talk we develop two likelihood-based approaches for fitting generalized linear mixed models (GLMMs) to longitudinal data from a wide variety of outcome-dependent sampling designs. The first is an extension of the semi-parametric maximum likelihood approach developed in a series of papers by Neuhaus, Scott and Wild and applies quite generally. The second approach is an adaptation of standard conditional likelihood methods and is limited to random intercept models with a canonical link. Data from a study of Attention Deficit Hyperactivity Disorder in children motivate the work and illustrate the findings.
ABSTRACT: For almost 30 years a 1984 paper by Weir and Cockerham served as the standard reference for estimating the "F-statistic" Fst used to characterize genetic structure of populations. These measures are in wide use across many genetic fields: evolutionary, forensic, human, population and quantitative. They have been defined variously as correlations between allelic states or as probabilities of identity by descent, or even as ratios of sample heterozygosities. The moment estimators of Weir and Cockerham used sample size to weight allele frequencies in forming averages over populations, as was appropriate for their simple evolutionary scenario. An extension to allow for different degrees of correlation within different populations and for non-zero correlations between populations was given by Weir and Hill in 2002, although their moment estimators retained sample size weightings. Recently, Bhatia, Patterson, Sankararaman and Price suggested that equal weights would be more appropriate for the Weir and Hill model, especially when sample sizes were quite different among populations. Their suggestion will be amplified here for arbitrary numbers of populations and alleles per locus, and attention will be paid to population-specific Fst values for the detection of signatures of natural selection.
To illustrate this new methodology, the current issue of interpreting matching forensic profiles for the Y chromosome will be discussed. Population genetic arguments, involving Fst, lead to stronger evidence for profiles with more loci even for profiles not previously observed. Joint Fst values, for Y and autosomal loci, allow the effects of Y-matching on autosomal matching, and vice versa, to be quantified.
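To make the weighting contrast concrete, here is a sketch of the two-population "ratio of averages" Fst estimator advocated by Bhatia and colleagues, which weights the populations equally rather than by sample size. This is only an illustration of the simplest case; the talk's generalization to arbitrary numbers of populations and alleles, and to population-specific Fst, goes well beyond it, and the function name is our own.

```python
import numpy as np

def hudson_fst(p1, p2, n1, n2):
    """Hudson-style Fst estimator for two populations at biallelic loci,
    combining loci as a ratio of averages (equal population weighting).
    p1, p2 : arrays of sample alt-allele frequencies, one entry per locus
    n1, n2 : sample sizes (number of sampled alleles) in each population
    Note: the per-locus numerator is unbiased, so single-locus estimates
    can be slightly negative when the true Fst is near zero."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    # Squared frequency difference, corrected for sampling noise
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    # Between-population heterozygosity
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num.sum() / den.sum()
```

Averaging numerator and denominator separately before taking the ratio, rather than averaging per-locus ratios, is what keeps the multi-locus estimate stable when some loci are nearly monomorphic.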
ABSTRACT: Sample contamination is a common problem in DNA sequencing studies that results in genotype misclassification and reduction in statistical power for downstream genetic association analysis. I will describe methods to identify DNA sample contamination based on a combination of sequencing reads and array-based genotype data; sequence reads alone; and array-based genotype data alone; and demonstrate that by modeling contamination, we can substantially reduce its impact on variant calling and downstream association studies.
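One simple way to see how contamination can be estimated from sequence reads alone is a grid maximum-likelihood sketch under a deliberately simplified mixture model: the sample's genotype follows Hardy-Weinberg proportions at the known population allele frequency, and a fraction alpha of reads come from an unrelated contaminating individual. The methods in the talk are more elaborate; the function name and modeling choices below are our own assumptions.

```python
import numpy as np

def contamination_mle(alt, dep, freq, grid=np.linspace(0.0, 0.5, 501)):
    """Grid maximum-likelihood estimate of the contamination fraction
    alpha from sequencing reads alone.
    alt  : alt-read count per site
    dep  : total read depth per site
    freq : population alt-allele frequency per site
    Model: genotype g in {0,1,2} has Hardy-Weinberg prior at frequency p;
    expected alt-read fraction is (1 - alpha) * g/2 + alpha * p.
    The binomial coefficient is constant in alpha, so it is omitted."""
    alt, dep, freq = map(np.atleast_1d, (alt, dep, freq))
    best, best_ll = 0.0, -np.inf
    for a in grid:
        ll = 0.0
        for x, n, p in zip(alt, dep, freq):
            hwe = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
            mu = (1 - a) * np.array([0.0, 0.5, 1.0]) + a * p
            mu = np.clip(mu, 1e-6, 1 - 1e-6)
            # Marginalize the binomial read likelihood over the genotype
            lik = hwe * mu ** x * (1 - mu) ** (n - x)
            ll += np.log(lik.sum())
        if ll > best_ll:
            best, best_ll = a, ll
    return best
```

For example, sites that should be homozygous reference but show a consistent low fraction of alt reads pull the estimate of alpha upward, which is the basic signal such methods exploit.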
ABSTRACT: Mobile devices are being increasingly used by health researchers to collect symptoms and other information and to provide interventions in real time. These "Just in Time Adaptive Interventions" specify how patient information should be used to determine whether, when and which intervention to provide. We present generalizations of methods from the field of Reinforcement Learning for optimizing just in time adaptive interventions. We discuss how these methods relate to updated and improved stochastic approximation algorithms used in robotics, online games and online advertising.
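To illustrate the flavor of such online updates (this is not the talk's algorithm, and all names below are our own), a minimal epsilon-greedy contextual bandit with a decreasing Robbins-Monro-style stepsize might look like:

```python
import numpy as np

class LinearBandit:
    """Minimal epsilon-greedy contextual bandit with stochastic-
    approximation updates. Each arm a keeps a weight vector w[a];
    the predicted reward for context x is w[a] @ x. Illustrative
    sketch only."""
    def __init__(self, n_arms, dim, eps=0.1, seed=0):
        self.w = np.zeros((n_arms, dim))
        self.eps = eps
        self.t = 0
        self.rng = np.random.default_rng(seed)

    def choose(self, x):
        # Explore uniformly with probability eps, otherwise exploit
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.w)))
        return int(np.argmax(self.w @ x))

    def update(self, arm, x, reward):
        self.t += 1
        step = 1.0 / self.t              # decreasing Robbins-Monro stepsize
        err = reward - self.w[arm] @ x   # prediction error for chosen arm
        self.w[arm] += step * err * x    # stochastic gradient step
```

In the just-in-time setting, the "arms" would be intervention options (including sending nothing), the context would be current patient information, and the reward a proximal outcome such as symptom relief.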
ABSTRACT: This will be a two-part talk, summarizing my group's recent work on methods development and large-scale data analysis in the area of human population genomics. The first part of the talk will be concerned with the question of the genome-wide impact of mutations that influence gene regulation. I will describe a recent analysis of complete genome sequences and genome-wide chromatin immunoprecipitation and sequencing data that demonstrates that natural selection has exerted a profound influence on human transcription factor binding sites since our divergence from chimpanzees 4-6 million years ago. Our analysis is based on a new method, called INSIGHT, for characterizing natural selection from collections of short interspersed noncoding elements. We find that binding sites have experienced somewhat weaker selection than protein-coding genes, on average, but that the binding sites of several transcription factors show clear evidence of adaptation. We project that regulatory elements may make larger cumulative contributions than protein-coding genes to both adaptive substitutions and deleterious polymorphisms, which has important implications for human evolution and disease.
In the second part of the talk, I will discuss recent progress on the long-standing problem of inferring an "ancestral recombination graph" (ARG) from sequence data. The ARG provides a complete characterization of the correlation structure of a collection of sequences sampled from a population, and, in principle, fast, high-quality ARG inference could enable many improvements in population genomic analysis. However, the available methods for ARG inference are either extremely computationally intensive, depend on fairly crude approximations, or are limited to very small numbers of samples, and, as a consequence, they are rarely used in applied population genomics. I will present a new method for ARG inference, called ARGweaver, that is efficient enough to be applied on the scale of dozens of complete mammalian genomes. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG, for twenty or more megabase-long sequences generated under realistic parameters for human populations. We have begun to apply our methods to high-coverage individual human genome sequences from Complete Genomics, and I will show that signatures of selective sweeps, background selection, recombination hot spots, and other features are all evident from properties of the inferred ARGs.
ABSTRACT: The use of Bayesian-based designs and analyses in biomedical, environmental, political and many other applications has burgeoned, even though its use entails additional overhead. Consequently, it is evident that statisticians and collaborators are increasingly finding the approach worth the bother. Given this growing adoption, I highlight a subset of the potential advantages of the formalism in study design ("Everyone is a Bayesian in the design phase"), conduct, analysis and reporting. Approaches include designs and analyses with required frequentist properties (Bayes for frequentist) and for fully Bayesian goals (Bayes for Bayes). Examples are drawn from sample size estimation, design and analysis of cluster randomized studies, use of historical controls, frequentist CI coverage, evaluating subgroups, dealing with multiplicity, ranking and other nonstandard goals.
The Bayesian approach is by no means a panacea. Valid development and application places additional obligations on the investigative team, and so it isn't always worth the effort. However, the investment can pay big dividends, the cost/benefit relation is increasingly attractive, and in many situations it is definitely worth the bother.
ABSTRACT: Meta-analysis is widely used to compare and combine the results of multiple independent studies. To account for between-study heterogeneity, it is desirable to use random-effects models, under which the effect sizes of interest are allowed to differ among studies. It is common to estimate the mean effect size by a weighted linear combination of study-specific estimators, the weight for each study being inversely proportional to the sum of the variance of its effect size estimator and the estimated variance component of the random effect distribution. Because the estimator of the variance component involved in the weights is random and correlated with study-specific effect size estimators, the commonly used asymptotic normal approximation to the meta-analysis estimator is grossly inaccurate unless the number of studies is large. When individual participant data are available, one can also estimate the mean effect size by maximizing the joint likelihood. We investigate the asymptotic properties of the meta-analysis estimator and the maximum likelihood estimator when the number of studies is either fixed or divergent as the sizes of studies increase and discover the surprising result that the former estimator is always at least as efficient as the latter. We develop a novel resampling technique that substantially improves the accuracy of statistical inference. We demonstrate the benefits of the proposed inference procedures with simulated and empirical data.
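The weighting scheme described above can be sketched in a few lines. This is the standard DerSimonian-Laird-style moment estimator of the between-study variance plugged into inverse-variance weights; it is the conventional estimator whose normal approximation the talk critiques, not the talk's proposed resampling procedure, and the function name is our own.

```python
import numpy as np

def random_effects_meta(y, v):
    """Random-effects meta-analysis via the DerSimonian-Laird moment
    estimator of the between-study variance tau^2.
    y : study-specific effect size estimates
    v : their within-study variances
    Returns (pooled estimate, tau^2)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                                # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - y_fixed) ** 2)         # Cochran's Q statistic
    k = len(y)
    # Moment estimator of the variance component, truncated at zero
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    # Each study's weight is inversely proportional to v_i + tau^2
    w_star = 1.0 / (v + tau2)
    return np.sum(w_star * y) / np.sum(w_star), tau2
```

Because tau^2 is itself estimated from the same study-specific estimates it then weights, the pooled estimator and its weights are correlated, which is the source of the inferential inaccuracy discussed in the abstract.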