Department of Biostatistics Environmental Statistics Seminar 2011 - 2012
ABSTRACT: In this talk I will address methods for Bayesian variable selection for high-dimensional data. I will look at linear and nonlinear regression models and also extend the methods to probit models for classification and to models for survival outcomes. I will also consider clustering settings. I will show examples from genomics, mainly from DNA microarray studies. The analysis of the high-dimensional data generated by such studies often challenges standard statistical methods. The models and inferential algorithms are quite flexible and allow us to incorporate additional information, such as data substructure and/or knowledge on gene functions.
ABSTRACT: In studying the evolution of a process and effects of interventions on it, investigators often collect repeated measures of the process state (longitudinal data) and measure the time to occurrence of important events (survival data). Joint models for such longitudinal and survival data often use latent processes that evolve over time and contribute to both the longitudinal and survival outcomes. These allow substantial flexibility to incorporate association across repeated measurements, among multiple longitudinal outcomes, and between longitudinal and survival outcomes.
The joint modeling framework has been extended to handle many complexities of real data, but less attention has been paid to the properties of these models. We are interested in the "payoff" of joint modeling, that is, whether using two sources of data simultaneously offers better inference on individual- and population-level characteristics, as compared to using them separately. We consider the problem of attributing informational content to the data inputs of joint models by developing analytical and numerical approaches and demonstrating their use.
As a motivating application, we consider a clinical trial for treatment of mesothelioma, a rapidly fatal form of lung cancer. The trial protocol included patient-reported outcome (PRO) collection and follow-up until progression and/or death to determine progression-free survival (PFS). We develop models that extend the joint modeling framework to accommodate several features of the longitudinal PROs, including bounded support, excessive zeros, and multiple PROs measured simultaneously. Our approaches produce clinically relevant treatment effect estimates on several aspects of disease simultaneously and yield insights on individual-level variation in disease processes.
ABSTRACT: The estimation of population/subpopulation average treatment effects has been a subject of extensive research. In a randomized experiment, common practice entails estimating the average treatment effect by calculating the difference between the average outcome in the treatment group and the average outcome in the control group. In cases where additional covariates that affect the outcome exist, estimation of the treatment effect is typically controlled (adjusted) by using a model that combines the treatment effect and a function of the covariates in an additive manner. This type of model relies on the assumption that the response surfaces from the outcome given the covariates are parallel in the control and treatment groups. When this assumption is incorrect, the estimation of the treatment effect may be unreliable. In observational studies, the effect of this assumption is even more substantial, because the distributions of the covariates in the two groups are usually different.
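The two estimators described above can be illustrated with a minimal sketch. The function names and the toy data are assumptions made for illustration only; the data are constructed so that the response surfaces are parallel but the covariate is imbalanced between groups.

```python
import numpy as np

def difference_in_means(y, t):
    """Unadjusted estimate: mean treated outcome minus mean control outcome."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=int)
    return y[t == 1].mean() - y[t == 0].mean()

def additive_adjusted_effect(y, t, x):
    """Coefficient on treatment in the additive model y ~ 1 + t + x (least squares)."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), t, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Toy data (made up): true treatment effect is 2, and treated units
# happen to have larger covariate values than control units.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
t = np.array([0, 1, 0, 1, 0, 1])
y = 1.0 + 2.0 * t + 3.0 * x

raw_effect = difference_in_means(y, t)               # 5.0 on this toy data
adjusted_effect = additive_adjusted_effect(y, t, x)  # 2.0 up to rounding
```

On this toy data the raw difference (5.0) overstates the true effect of 2 because of the covariate imbalance, while the additive adjustment recovers it. That recovery depends on the parallel-surfaces assumption, which is exactly the assumption the abstract questions.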
This talk will focus on the proposal of an outcome-free three-stage procedure based on Rubin's framework for causal inference. First, we create subclasses that include observations from each group based on the covariates. Next, we independently estimate the response surface in each group using a flexible spline model. Lastly, multiple imputations of the missing potential outcomes are performed. A simulation analysis that resembles real-life situations and compares this procedure to other common methods is carried out. Compared with the other methods, and in many of the experimental conditions examined, our proposed method is the only one that yields a valid statistical procedure while providing a relatively precise point estimate and a relatively short interval estimate.
ABSTRACT: Recently there has been much interest in modeling the joint health effects of multiple pollutants. The objective is to construct statistical models that are both nonlinear and nonadditive in the variables of interest. Radial Basis Functions are a general methodology for representing a smooth function of several variables without linearity or additivity constraints, so they provide the foundation for a possible methodology for handling these problems.
In this talk, I shall review the basic methodology and discuss an application to the joint mortality effects of PM and ozone, using data from the NMMAPS database. I shall also outline some potential applications to health effects of climate change.
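As a rough illustration of the radial basis function idea, the sketch below fits a smooth function of two variables without assuming linearity or additivity. The Gaussian kernel, the length scale, the small ridge term, and the made-up two-variable surface are all assumptions for illustration; they are not taken from the talk or from NMMAPS.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale):
    """Gaussian radial basis between every row of a and every row of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))

def fit_rbf(centers, values, length_scale=1.0, ridge=1e-8):
    """Solve for weights w so that sum_j w_j * phi(||x - c_j||) matches values at the centers."""
    gram = gaussian_kernel(centers, centers, length_scale)
    return np.linalg.solve(gram + ridge * np.eye(len(centers)), values)

def predict_rbf(points, centers, weights, length_scale=1.0):
    """Evaluate the fitted surface at new points."""
    return gaussian_kernel(points, centers, length_scale) @ weights

# Hypothetical surface in two scaled exposures with an interaction term,
# so it is neither linear nor additive.
grid = np.linspace(0.0, 1.0, 5)
centers = np.array([[a, b] for a in grid for b in grid])
values = centers[:, 0] * centers[:, 1] + np.sin(3.0 * centers[:, 0])

weights = fit_rbf(centers, values, length_scale=0.2)
fitted = predict_rbf(centers, centers, weights, length_scale=0.2)
max_abs_error = np.max(np.abs(fitted - values))
```

With noisy health data one would typically use fewer basis functions than observations and a larger penalty, so that the surface smooths rather than interpolates.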
ABSTRACT: Model-based estimation of the effect of an exposure (or a treatment) on an outcome is generally sensitive to the choice of which confounding factors are included in the model. We propose a new approach, which we call Bayesian Adjustment for Confounding (BAC), to estimate the effect on the outcome associated with an exposure of interest while accounting for the uncertainty in the confounding adjustment. Our approach is based on specifying two models: 1) the outcome as a function of the exposure and the potential confounders (the outcome model); and 2) the exposure as a function of the potential confounders (the exposure model). We consider Bayesian variable selection on both models and link the two by introducing a dependence parameter (omega) denoting the prior odds of including a predictor in the outcome model, given that the same predictor is in the exposure model. In the absence of dependence (omega = 1), BAC reduces to traditional Bayesian Model Averaging (BMA). In simulation studies we show that BAC with omega > 1 estimates the exposure effect with smaller bias and better coverage than traditional BMA. We then compare BAC and traditional BMA in a time series data set of hospital admissions, air pollution levels and weather variables in Nassau, NY for the period 1999-2005. Using each approach, we estimate the short-term effects of PM2.5 on emergency admissions for cardiovascular diseases, accounting for confounding. This application illustrates the potentially significant pitfalls of misusing variable selection methods in the context of adjustment uncertainty.
Joint work with Chi Wang and Giovanni Parmigiani.
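The role of the dependence parameter can be written out as follows. The inclusion-indicator notation is an assumption introduced here for illustration; only the prior-odds definition of omega comes from the abstract.

```latex
% Assumed notation: \alpha^X_m, \alpha^Y_m \in \{0,1\} indicate whether
% potential confounder m is included in the exposure and outcome models.
\[
\frac{P(\alpha^Y_m = 1 \mid \alpha^X_m = 1)}{P(\alpha^Y_m = 0 \mid \alpha^X_m = 1)} = \omega
\quad\Longrightarrow\quad
P(\alpha^Y_m = 1 \mid \alpha^X_m = 1) = \frac{\omega}{1 + \omega}.
\]
```

Under this reading, omega = 1 gives a conditional prior inclusion probability of 1/2, and larger values of omega push predictors that appear in the exposure model toward inclusion in the outcome model.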
ABSTRACT: Land use regression (LUR) models provide good estimates of spatially resolved long-term exposures, but are poor at capturing short-term exposures. Satellite-derived Aerosol Optical Depth (AOD) measurements have the potential to provide spatio-temporally resolved predictions of both long- and short-term exposures, but previous studies have generally shown relatively low predictive power. Our objective was to extend our previous work on day-specific calibrations of AOD data using ground PM2.5 measurements by incorporating commonly used LUR variables and meteorological variables, thus benefiting from both the spatial resolution from the LUR models and the spatio-temporal resolution from the satellite models. We then use spatial smoothing to predict PM2.5 concentrations for day/locations with missing AOD measures. We used mixed models with random slopes for day to calibrate AOD data for 2000-2008 across New England with monitored PM2.5 measurements. We then used a generalized additive mixed model with spatial smoothing to estimate PM2.5 in location-day pairs with missing AOD, using regional measured PM2.5, AOD values in neighboring cells, and land use. Finally, local (100 m) land use terms were used to model the difference between grid cell prediction and monitored value to capture very local traffic particles. Out-of-sample tenfold cross-validation was used to quantify the accuracy of our predictions. For days with available AOD data we found high out-of-sample R² (mean out-of-sample R² = 0.830, year-to-year variation 0.725-0.904). For days without AOD values, our model performance was also excellent (mean out-of-sample R² = 0.810, year-to-year variation 0.692-0.887). Importantly, these R² values are for daily, rather than monthly or yearly, values. Our model allows one to assess short-term and long-term human exposures in order to investigate both the acute and chronic effects of ambient particles, respectively.
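The out-of-sample tenfold cross-validation step can be sketched as below. This is a simplified illustration, not the study's model: it substitutes an ordinary least-squares calibration for the mixed and additive models described above, uses simulated data in place of AOD and PM2.5 measurements, and defines R² as one minus the ratio of residual to total sum of squares (an assumption, since the abstract does not state the definition).

```python
import numpy as np

def tenfold_cv_r2(x, y, n_folds=10, seed=0):
    """Out-of-sample R^2: each observation is predicted by a model fit without its fold."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    predictions = np.empty_like(y)
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        design = np.column_stack([np.ones(len(train)), x[train]])
        coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        test_design = np.column_stack([np.ones(len(held_out)), x[held_out]])
        predictions[held_out] = test_design @ coef
    residual_ss = np.sum((y - predictions) ** 2)
    total_ss = np.sum((y - y.mean()) ** 2)
    return 1.0 - residual_ss / total_ss

# Simulated stand-in for a calibration data set (values are made up).
rng = np.random.default_rng(1)
aod = rng.uniform(0.0, 1.0, size=500)
pm25 = 5.0 + 20.0 * aod + rng.normal(0.0, 2.0, size=500)

cv_r2 = tenfold_cv_r2(aod, pm25)
```

Because every prediction comes from a model that never saw the held-out observation, the resulting R² measures predictive accuracy rather than in-sample fit.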
ABSTRACT: Air pollution is highly dependent on weather, and it follows that climate change could significantly impact air quality. Climate change could also affect wildfire frequency or extent, which in turn could adversely affect air quality downwind. We have investigated the consequences of climate change for PM2.5 following two approaches. First, using principal component analysis and regression, we have identified the main meteorological modes driving PM2.5 variability across the United States. Second, we have developed ecosystem-based models that explain the dependence of area burned in the western United States on meteorological variables. We then use projections from the IPCC Fourth Assessment Report to diagnose future changes in meteorology at mid-century and the subsequent effects on PM2.5. We estimate a likely increase in annual mean PM2.5 in the East by the 2050s, due to a more stagnant atmosphere, with large discrepancies among models. In the West, we find that future wildfires could increase mean summertime concentrations of organic particles by as much as 50%.
ABSTRACT: The move towards more focused control of airborne particulate matter (PM) is hindered by a limited understanding of the toxicity of various components of the PM mixture and of the sources that may contribute injurious particles. Promulgating a more refined US National Ambient Air Quality Standard (NAAQS), incorporating PM chemical components, specific sources, or other characteristics, requires a more complete scientific foundation than is currently available. The U.S. Environmental Protection Agency (EPA) maintains a number of national databases that could be used to address these policy questions directly; however, the incorporation of these databases into analyses of corresponding health effects requires new statistical approaches. We will give an overview of the data and the statistical models used for estimating the health effects of particulate matter sources using national databases and discuss the advantages and limitations of this approach.
ABSTRACT: Generalized linear mixed models are popular tools for modeling population health data for studies of health disparities. Researchers typically take georeferenced health data and population counts and model them via a Poisson log-linear model as a function of fixed covariates and area-level random effects. The random effects may follow a multilevel structure (nested normals) or may be spatially structured (e.g. following an intrinsic conditionally autoregressive (CAR) model). When the interest is social disparities (e.g. racial/ethnic inequalities) in health, researchers typically model race/ethnicity as a fixed covariate, and estimate the overall relative risk for racial/ethnic groups, conditional on area-level random effects that are assumed to be the same for all racial/ethnic groups. Given sociological perspectives on racial/ethnic inequality that posit "race" as a social, rather than essentially biological, category and racial/ethnic inequalities as the result of social relations (e.g. racial discrimination) embedded in social contexts, I argue for the appropriateness of modeling racial/ethnic specific area effects, and consider several model specifications: (a) independent racial/ethnic area effects; (b) a multivariate model allowing for correlations between racial/ethnic area effects; and (c) a shared components model that decomposes the risk surface into a shared component and a racial/ethnic specific component. I present results, model comparisons, and maps based on an analysis of premature mortality rates in three Massachusetts cities (Boston, Worcester, and Lawrence) in 2000-2005.
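One way to write the contrast between the usual model and the group-specific alternatives is sketched below. The notation is an assumption introduced for illustration, and the priors on the random effects (independent normal, multivariate, or CAR) are deliberately left unspecified.

```latex
% Assumed notation: y_{ir} deaths and N_{ir} population at risk
% in area i for racial/ethnic group r.
\[
y_{ir} \sim \mathrm{Poisson}(N_{ir}\,\theta_{ir}), \qquad
\log \theta_{ir} = \beta_0 + \beta_r + u_i
\quad \text{(area effect common to all groups)}
\]
\[
\log \theta_{ir} = \beta_0 + \beta_r + u_{ir}
\quad \text{(group-specific area effects, as in (a) and (b))}
\]
\[
\log \theta_{ir} = \beta_0 + \beta_r + s_i + v_{ir}
\quad \text{(shared component } s_i \text{ plus group-specific } v_{ir}\text{, as in (c))}
\]
```

In this form, (a) and (b) differ only in whether the effects u_{ir} are modeled as independent or correlated across groups.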
ABSTRACT: We propose a new multi/hyper-spectral image segmentation method designed for remote sensing applications. Specifically, we aim our analysis tools to improve fertilization decision making. The core of the application is segmenting multi/hyper-spectral images of agricultural fields into homogeneous zones as a basis for variable rate application.
As a first step toward accomplishing this goal, a new state-of-the-art segmentation method for multispectral images was developed. The proposed methodology is based on a multi-scale geometric transformation called the Beamlet Transform and the Beamlet Decorated Recursive Dyadic Partitioning (BD-RDP). The method is applicable to both mono-spectral and multispectral images where each pixel has its corresponding spectral profile vector. The proposed segmentation method is especially effective when the underlying image consists of relatively large segments with smooth boundaries. In this case it performs exceptionally well even when the Signal to Noise Ratio (SNR) is extremely low. The method is unsupervised and assumes no prior knowledge of the image characteristics or features. Furthermore, it involves a single sensitivity parameter which controls the segmentation granularity. Despite being relatively complex, the proposed segmentation algorithm has a low computational complexity, which is achieved by implicit computations through the Pseudo-Polar Fourier transform (PPFFT). In order to validate the efficiency of the proposed method we used an improved Fuzzy C-means algorithm as a benchmark for segmentation of multi-spectral images and show that our new method outperforms it.
The proposed method was applied to a sample from an aerial hyperspectral (HS) image taken over a potato plot under different nitrogen treatments. The HS image was acquired on 25/5/2007 using a push-broom AISA system in the range of 400-1000 nm, with 210 bands and a spectral resolution of 1.3 nm. The multi-scale segmentation successfully uncovered the spatial structures in the image according to differences in N levels.
This work was done in collaboration with the following contributors:
Yafit Cohen and Victor Alchanati from the Agricultural Research Organization, Volcani Center, Israel
Shaul Cohen and Ziv Mhabari from Ben-Gurion University of the Negev, Israel
ABSTRACT: Dozens of U.S. studies have investigated the association between bans on smoking in public places and hospitalizations for acute cardiovascular events, almost always reporting ban-associated reductions. However, small study sizes and possible publication bias have left the debate unresolved. Here we focus on challenges that arise in national-scale studies of smoking bans and Medicare data, especially the assumptions in the interrupted time series models that are standard in this literature and the potential pitfalls of Medicare data used in studies of smoking bans, air pollution, and other national public health questions. Our techniques lead to a broader scope, including investigation of a range of cardiovascular and respiratory outcomes, and to better-fitting models, which show that the efficacy of smoking bans may be limited.
ABSTRACT: The analysis of longitudinal trajectories usually focuses on evaluation of explanatory factors that are either associated with rates of change, or with overall mean levels of a continuous outcome variable. In this manuscript we introduce valid design and analysis methods that permit outcome-dependent sampling of longitudinal data for scenarios where all outcome data currently exist, but a targeted sub-study is being planned in order to collect additional key exposure information on a limited number of subjects. We propose a stratified sampling design based on specific summaries of individual longitudinal trajectories, and we detail an ascertainment-corrected maximum likelihood approach for estimation using the resulting biased sample of subjects. In addition, we demonstrate that the efficiency of an outcome-based sampling design relative to use of a simple random sample depends highly on the choice of outcome summary statistic used to direct sampling, and we show a natural link between the goals of the longitudinal regression model and corresponding desirable designs. Using data from the Childhood Asthma Management Program, where genetic information required retrospective ascertainment, we study a range of designs that examine lung function profiles over four years of follow-up for children classified according to their genotype for the IL-13 cytokine.
Maintained by the
Biostatistics Webmaster
Last Update: May 2, 2012