Project 2

Cancer surveillance plays an essential role in cancer prevention and intervention. This proposal develops new statistical methods that deal with complex data-related issues in cancer surveillance studies. In particular, the specific aims are motivated by problems encountered in surveillance studies that monitor cancer mortality and geographical patterns, and that study disproportionate disease burden on particular populations and important risk factors. We plan to

  1. develop new methods to analyze the cross-relationship matrix of the change trends (e.g., the annual rate changes, ARC) in mortality or incidence rates across multiple cancer sites for the period 1969-2004;
  2. propose disease clustering/surveillance methods for outcomes subject to censoring;
  3. propose a new test statistic for spatial clustering detection that incorporates the latency distributions associated with cancer and that accommodates studies in which disease clustering patterns differ according to genetic characteristics;
  4. develop and evaluate a spatio-temporal hidden Markov model for disease surveillance based on region-specific counts of disease incidence;
  5. develop efficient algorithms and user-friendly statistical software that implement these methods with the goal of disseminating them to health science researchers.

The proposed methods will be applied to several cancer and environmental health projects in which the investigators have been involved, namely the SEER cancer mortality data, the SEER prostate cancer incidence data, and the Taiwan Leukemia data. The methods will allow practitioners and health care policy makers to better understand the change trends of cancer deaths and incidence, and the cross-relationship of these trends, for the purpose of planning and resource allocation. The methods will also help reveal disproportionate disease burden on at-risk populations and identify important risk factors, including genetic susceptibility. The surveillance methods proposed in this project are linked to the spatio-temporal methods proposed in Project 1, and the regularized regression models proposed in this project are related to the analysis of high-dimensional observational study data. All projects will generate statistical methods and computational approaches that will inform those developed in the others.

Project examples

  1. Cross correlation matrix of cancer mortality rates in the United States when p>n. Miguel Marino, Yi Li, and Jean-Pierre Gillet (NCI). We develop various tests against complete randomness, including the Tracy-Widom test for the largest eigenvalue and the Smirnov test for the bulk density. We propose a new sequential-rescaling method to test for sample eigenvalues that deviate significantly from those expected under the null hypothesis. Such a test can therefore detect significant latent factors when the number of random variables is large. If signals are detected, we study and interpret the principal components corresponding to the largest few eigenvalues.
  2. Semiparametric Bayesian Analysis of High-Dimensional Censored Outcome Data: Discovering Spatial Variation and Shared Frailties in Breast Cancer Mortality Rates. Subha Guha, Yi Li, and Steven Melly. We discover spatial variation in cancer mortality rates, adjusting for known predictors such as age and tumor grade, and using semiparametric Dirichlet processes (DP) to model individual variability. Since DP models can be prohibitively expensive for the analysis of even a few hundred individuals, we propose a new cost-effective Markov chain Monte Carlo (MCMC) strategy for large datasets. Unlike existing data squashing techniques for DP models, it provides inferences from the posterior of interest rather than from an approximation.
  3. Spatio-temporal Analysis of Areal Data and Discovery of Neighborhood Relationships in Conditionally Autoregressive Models. Louise Ryan. Although computationally convenient, most conditionally autoregressive (CAR) model implementations are limited in that they assume ad hoc neighborhood structures and rely on area-specific random effects that are separable from temporal trends. We propose a Bayesian approach that jointly models the latent neighborhood structure and spatiotemporal variability in the data. The neighborhood relationships are modeled using a continuous-time Bernoulli process with Markovian dependence that is allowed to evolve over time. In the application, we analyze heart disease incidence rates in the Sydney Metropolitan Area of Australia and discover complex temporal trends in the postcode-specific random effects.
  4. Comparing Average Annual Percent Change in Cancer Rates across Overlapping Regions. Dhruv Sharma, Ram C Tiwari, and Yi Li. We develop extensions of the average annual percent change (AAPC), a measure of trends in age-adjusted cancer incidence and mortality rates over a specified time period. These extensions specifically address the comparison of the AAPC of a subset (such as a state) with that of the whole (such as the country). Tests are proposed to address the effects of this overlap.
  5. High-Dimensional Coefficient Shrinkage Methods in Spatial Cluster Detection for Cancer Survival Models. Dhruv Sharma and Yi Li. We investigate the detection of spatially arranged clusters in survival models using high-dimensional coefficient shrinkage methods. We focus specifically on lattice data, that is, discretely indexed regions such as counties. We propose a high-dimensional coefficient shrinkage approach that simultaneously identifies these clusters and their comparable effects by specifying an L1 penalty on the pairwise differences of the county effects.
  6. Comparative effects of drug regimens. Dhruv Sharma, Yi Li, and Deborah Schrag. We develop statistical methods for studying the comparative effects of different drug regimens on the survival and health outcomes of Stage 3B lung cancer patients.
  7. Surveillance Studies. Justin Manjourides and Marcello Pagano. We develop a global test for disease clustering that accounts for the lag time between the date of exposure and the date of diagnosis and that has power to identify disturbances from the null population distribution. Location at diagnosis is often used as a surrogate for the location of exposure; however, the causative exposure could have occurred at a previous address in a case's residential history. We incorporate models for the incubation distribution of a disease to weight each address in the residential history by the corresponding probability of the exposure occurring at that address. We then introduce a test statistic that uses these incubation-weighted addresses to test for a difference between the spatial distribution of the cases and the spatial distribution of the controls, or the background population. We follow the construction of the M statistic to evaluate the significance of these new distance distributions. Our results show that accounting for residential history increases detection power enough that it can determine whether spatial clustering is detected at all, which makes a strong argument for including residential history in the analysis of such data.
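
As a rough illustration of the incubation weighting described in example 7, the following Python sketch assigns each address in a residential history the probability that the exposure occurred while the case lived there. The lognormal lag distribution, its parameters, the function names, and the sample history are all assumptions made for illustration. They are not taken from the investigators' work, and the M statistic itself is not implemented here.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal(mu, sigma) lag, in years; zero for t <= 0."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def address_weights(history, diagnosis_year, mu=2.0, sigma=0.5):
    """Weight each address by the probability that exposure occurred there.

    history: list of (move_in_year, move_out_year, location) tuples.
    The exposure-to-diagnosis lag is ASSUMED lognormal(mu, sigma);
    mu and sigma are hypothetical values, not estimates from any data.
    """
    raw = []
    for move_in, move_out, _location in history:
        # Range of lags (years before diagnosis) covered by this residence.
        longest_lag = diagnosis_year - move_in
        shortest_lag = diagnosis_year - min(move_out, diagnosis_year)
        raw.append(
            lognormal_cdf(longest_lag, mu, sigma)
            - lognormal_cdf(shortest_lag, mu, sigma)
        )
    total = sum(raw)
    if total <= 0:
        raise ValueError("history does not overlap any plausible exposure time")
    # Renormalize so the weights over the recorded history sum to one.
    return [w / total for w in raw]


# Hypothetical residential history for one case diagnosed in 2005.
history = [(1970, 1985, "A"), (1985, 1998, "B"), (1998, 2005, "C")]
weights = address_weights(history, diagnosis_year=2005)
for (_, _, location), w in zip(history, weights):
    print(location, round(w, 3))
```

With these assumed parameters the median lag is about seven years, so the middle address receives more weight than the address at diagnosis. This is the situation in which relying on location at diagnosis alone could misplace the exposure.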
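
For reference, the AAPC discussed in example 4 is commonly defined as a length-weighted average of the slopes of a piecewise log-linear trend, transformed back to a percent change. The Python sketch below shows only that standard summary, not the proposed extensions or the tests for overlapping regions. The function name and the segment slopes are made-up values for illustration.

```python
import math


def aapc(segments):
    """Average annual percent change from a piecewise log-linear trend.

    segments: list of (slope, length_in_years) pairs, where each slope is
    the fitted annual change in the log of the age-adjusted rate.
    """
    total_length = sum(length for _slope, length in segments)
    if total_length <= 0:
        raise ValueError("segments must cover a positive number of years")
    # Weight each segment's slope by the number of years it covers.
    average_slope = sum(slope * length for slope, length in segments) / total_length
    return (math.exp(average_slope) - 1.0) * 100.0


# Hypothetical trend: a slow decline for 10 years, then a faster one for 5.
example = aapc([(-0.01, 10), (-0.03, 5)])
print(round(example, 2))
```

With a single segment the AAPC reduces to the ordinary annual percent change for that segment.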


Copyright by Xihong Lin, 2011