Department of Biostatistics
ABSTRACT: We develop a Bayesian nonparametric model for inference on a longitudinal response in the presence of nonignorable missing data. Our general approach is to first specify a working model that flexibly models the missingness and full outcome processes jointly. We specify a Dirichlet process mixture of independent models as a prior on the joint distribution of the working model. This aspect of the model governs the fit of the observed data by modeling the observed-data distribution as the marginalization over the missing data in the working model. We then separately specify the conditional distribution of the missing data given the observed data and dropout. This approach allows us to identify the distribution of the missing data using identifying restrictions as a starting point. We propose a framework for introducing sensitivity parameters, allowing us to vary the untestable assumptions about the missing data mechanism smoothly through a space of such assumptions. Informative priors on this space can be specified to combine inferences under many different assumptions into a final inference. We demonstrate the method by applying it to simulated data (comparing with standard methods) and to data from two clinical trials.
This is joint work with Antonio Linero (University of Florida).
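The abstract does not give formulas, but the role of a sensitivity parameter can be illustrated with a minimal sketch. Below, a hypothetical delta-adjustment (a standard device for indexing nonignorable departures, not the authors' Bayesian nonparametric method) shifts imputations of the missing follow-up outcome by a parameter delta, so that inference varies smoothly as the untestable assumption changes; the simple normal model and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a two-visit outcome with dropout that depends on the
# possibly unobserved follow-up value: a nonignorable mechanism.
n = 2000
y1 = rng.normal(0.0, 1.0, n)              # baseline, always observed
y2 = y1 + rng.normal(0.5, 1.0, n)         # follow-up, subject to dropout
p_drop = 1 / (1 + np.exp(-(y2 - 1.0)))    # dropout probability depends on y2
observed = rng.random(n) > p_drop

def delta_adjusted_mean(delta, n_imp=50):
    """Impute missing y2 from an observed-data regression, then shift the
    imputations by the sensitivity parameter delta.  delta = 0 corresponds
    to missing at random; delta != 0 indexes nonignorable departures."""
    X = np.column_stack([np.ones(observed.sum()), y1[observed]])
    beta, *_ = np.linalg.lstsq(X, y2[observed], rcond=None)
    resid_sd = np.std(y2[observed] - X @ beta)
    means = []
    for _ in range(n_imp):
        y2_imp = y2.copy()
        mu = beta[0] + beta[1] * y1[~observed]
        y2_imp[~observed] = mu + delta + rng.normal(0, resid_sd, (~observed).sum())
        means.append(y2_imp.mean())
    return float(np.mean(means))

# The estimated mean moves smoothly with the untestable assumption delta.
for delta in (0.0, 0.5, 1.0):
    print(delta, round(delta_adjusted_mean(delta), 3))
```

Averaging such estimates over a prior on delta is one way to combine inferences under many assumptions into a single final inference, as the abstract describes.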
ABSTRACT: Investigators commonly gather longitudinal data to assess changes in responses over time and to relate these changes to within-subject changes in predictors. With rare or expensive outcomes, such as uncommon diseases and costly radiologic measurements, outcome-dependent, and more generally outcome-related, sampling plans can improve estimation efficiency and reduce cost. Longitudinal follow-up of subjects gathered in an initial outcome-dependent sample can then be used to study the trajectories of responses over time and to assess the association of within-subject changes in predictors with changes in response. In this talk we develop two likelihood-based approaches for fitting generalized linear mixed models (GLMMs) to longitudinal data from a wide variety of outcome-dependent sampling designs. The first is an extension of the semiparametric maximum likelihood approach developed in a series of papers by Neuhaus, Scott and Wild, and applies quite generally. The second is an adaptation of standard conditional likelihood methods and is limited to random intercept models with a canonical link. Data from a study of Attention Deficit Hyperactivity Disorder in children motivate the work and illustrate the findings.
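The conditional-likelihood idea mentioned for random-intercept models with a canonical link can be sketched concretely. In a random-intercept logistic model, conditioning on the within-subject outcome total eliminates the subject-specific intercept, so only subjects with discordant outcomes contribute. The sketch below illustrates that standard device on simulated data (it is not the authors' outcome-dependent-sampling extension, and all names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

# Random-intercept logistic model with two visits per subject.
n, beta_true = 4000, 0.8
b = rng.normal(0.0, 2.0, n)                  # large subject-specific intercepts
x = rng.normal(0.0, 1.0, (n, 2))             # within-subject covariates
prob = 1 / (1 + np.exp(-(b[:, None] + beta_true * x)))
y = (rng.random((n, 2)) < prob).astype(int)

# Conditioning on y1 + y2 = 1 cancels the intercept: the conditional
# probability that visit 1 is the "1" is expit(beta * (x1 - x2)).
disc = y.sum(axis=1) == 1                    # discordant subjects only
dx = x[disc, 0] - x[disc, 1]                 # covariate contrast
z = (y[disc, 0] == 1).astype(float)

# Newton-Raphson for the one-parameter conditional logistic likelihood.
beta = 0.0
for _ in range(25):
    p = 1 / (1 + np.exp(-beta * dx))
    grad = np.sum((z - p) * dx)
    hess = -np.sum(p * (1 - p) * dx ** 2)
    beta -= grad / hess

print(round(beta, 2))   # close to beta_true despite the large intercepts
```

Because the intercepts cancel, the estimate is consistent for beta even though the subject effects were never modeled, which is what makes the conditional approach attractive under outcome-dependent sampling.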
ABSTRACT: For almost 30 years, a 1984 paper by Weir and Cockerham served as the standard reference for estimating the "F-statistic" Fst used to characterize the genetic structure of populations. These measures are in wide use across many genetic fields: evolutionary, forensic, human, population and quantitative. They have been defined variously as correlations between allelic states, as probabilities of identity by descent, or even as ratios of sample heterozygosities. The moment estimators of Weir and Cockerham used sample size to weight allele frequencies in forming averages over populations, as was appropriate for their simple evolutionary scenario. An extension allowing for different degrees of correlation within different populations and for non-zero correlations between populations was given by Weir and Hill in 2002, although their moment estimators retained sample-size weightings. Recently, Bhatia, Patterson, Sankararaman and Price suggested that equal weights would be more appropriate for the Weir and Hill model, especially when sample sizes differ substantially among populations. Their suggestion will be amplified here for arbitrary numbers of populations and alleles per locus, with attention paid to population-specific Fst values for detecting signatures of natural selection.
To illustrate this new methodology, the current issue of interpreting matching forensic profiles for the Y chromosome will be discussed. Population genetic arguments involving Fst lead to stronger evidence for profiles with more loci, even for profiles not previously observed. Joint Fst values for Y and autosomal loci allow the effects of Y-chromosome matching on autosomal matching, and vice versa, to be quantified.
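The weighting issue the abstract raises can be made concrete with the heterozygosity-ratio form of Fst that it mentions. The sketch below computes Fst = (Ht - Hs) / Ht for one biallelic locus under sample-size weights versus equal weights of the population allele frequencies; the numbers are illustrative, not the estimators from the talk, and with very unequal sample sizes the two weightings give noticeably different answers, which is the point of Bhatia et al.'s suggestion.

```python
import numpy as np

# Illustrative allele frequencies and (very unequal) sample sizes for
# three populations; these are made-up numbers, not data from the talk.
p = np.array([0.10, 0.30, 0.55])       # alt-allele frequency per population
n = np.array([1000, 50, 50])           # sample sizes per population

def fst(p, w):
    """Heterozygosity-based Fst with weights w on the populations."""
    w = w / w.sum()
    p_bar = np.sum(w * p)              # weighted mean allele frequency
    Hs = np.sum(w * 2 * p * (1 - p))   # mean within-population heterozygosity
    Ht = 2 * p_bar * (1 - p_bar)       # total heterozygosity at pooled frequency
    return (Ht - Hs) / Ht

print(round(fst(p, n.astype(float)), 3))   # sample-size weights
print(round(fst(p, np.ones(3)), 3))        # equal weights
```

Here the sample-size-weighted value is dominated by the one large sample, while equal weighting treats the three populations symmetrically; for equal sample sizes the two coincide.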
ABSTRACT: Sample contamination is a common problem in DNA sequencing studies; it results in genotype misclassification and reduced statistical power for downstream genetic association analysis. I will describe methods to identify DNA sample contamination based on a combination of sequencing reads and array-based genotype data, on sequence reads alone, and on array-based genotype data alone, and demonstrate that by modeling contamination we can substantially reduce its impact on variant calling and downstream association studies.
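The case where both sequencing reads and array genotypes are available admits a simple one-parameter illustration. In the sketch below (a simplified toy model, not the speaker's methods), a contamination fraction alpha makes the expected alt-allele read fraction at each site (1 - alpha) * g/2 + alpha * f, where g is the sample's array genotype (alt-allele count) and f the population alt-allele frequency of the contaminating DNA; alpha is then estimated by maximizing a binomial likelihood over a grid.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate reads at m sites for a sample contaminated at fraction alpha_true.
m = 5000
f = rng.uniform(0.05, 0.95, m)                 # population allele frequencies
g = rng.binomial(2, f)                         # array genotypes (HWE draw)
alpha_true = 0.08
depth = rng.poisson(30, m) + 1                 # read depth per site
p_alt = (1 - alpha_true) * g / 2 + alpha_true * f
alt_reads = rng.binomial(depth, p_alt)

def loglik(alpha):
    """Binomial log-likelihood of the alt-read counts at contamination alpha."""
    p = np.clip((1 - alpha) * g / 2 + alpha * f, 1e-6, 1 - 1e-6)
    return np.sum(alt_reads * np.log(p) + (depth - alt_reads) * np.log(1 - p))

grid = np.linspace(0.0, 0.5, 501)
alpha_hat = grid[np.argmax([loglik(a) for a in grid])]
print(round(alpha_hat, 3))   # recovers a value near alpha_true
```

The signal comes mainly from homozygous sites, where any alt (or ref) reads must originate from the contaminant; intuitively this is why modeling contamination explicitly can rescue variant calling rather than merely flagging bad samples.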