Department of Biostatistics
Neurostatistics Working Group
2012 - 2013
ABSTRACT: The use of matched designs in case-control studies can result in substantial improvements in efficiency and statistical power. However, it is quite common for studies, particularly those dealing with high-dimensional data, to ignore matching when applying variable selection techniques, which can reduce precision and validity. I will present my work in developing a new approach that accounts for matching in the context of Bayesian variable selection methods, which effectively handle high-dimensional data settings.
ABSTRACT: Survival bias is a long-recognized problem in case-control studies, and many varieties of bias can come under this umbrella term. We focus on one of them, termed Neyman's bias or "prevalence-incidence bias." It occurs in case-control studies when exposure affects both disease and disease-induced mortality, and we give a formula for the observed, biased odds ratio under such conditions. We compare our result with previous investigations into this phenomenon and consider models under which this bias may or may not be significant. Finally, we suggest different hypothesis tests to identify when Neyman's bias may be present in case-control studies.
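A toy simulation (not from the talk; all probabilities are invented for illustration) makes the prevalence-incidence mechanism concrete: when exposure raises both disease risk and disease-induced mortality, the odds ratio computed from surviving (prevalent) cases is attenuated relative to the true odds ratio among incident cases.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Binary exposure with prevalence 0.5.
e = rng.random(n) < 0.5

# Exposure multiplies the odds of disease by 3 (the true odds ratio).
base_odds = 0.05 / 0.95
odds = np.where(e, 3.0 * base_odds, base_odds)
d = rng.random(n) < odds / (1.0 + odds)

# Disease-induced mortality is higher among exposed cases, so exposed
# cases are less likely to survive to be sampled as prevalent cases.
p_die = np.where(e, 0.5, 0.2)
alive = ~(d & (rng.random(n) < p_die))

def odds_ratio(d, e):
    a = np.sum(d & e)
    b = np.sum(d & ~e)
    c = np.sum(~d & e)
    f = np.sum(~d & ~e)
    return (a * f) / (b * c)

or_incident = odds_ratio(d, e)                 # all cases: near the true OR of 3
or_prevalent = odds_ratio(d[alive], e[alive])  # surviving cases only: attenuated
```

Here the prevalent-case odds ratio lands well below 3 even though exposure truly triples the odds of disease, which is the bias the abstract's formula quantifies.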
ABSTRACT: Despite recent improvement in mortality rates, stroke remains the leading cause of adult disability in the United States and worldwide. The use of the only FDA-approved therapy for acute ischemic stroke (AIS), intravenous thrombolysis with tPA, is still limited and fraught with complications and insufficient post-stroke recovery rates. Therefore, better drugs and better approaches to patient selection for future clinical trials are crucial in order to alleviate the impending population burden of stroke and cerebrovascular disability in the coming decades. Individual burden of cerebrovascular disease is detected on brain MRI as severity of leukoaraiosis, or white matter hyperintensity (WMH), which in AIS subjects is strongly linked to cerebral infarct growth, poor functional outcomes, and greater risk of stroke recurrence. Furthermore, genetic contribution may account, at least in part, for variability in individual tissue and clinical outcomes after stroke. We hypothesized that a personalized prediction score that accounts for novel neuroimaging and genetic markers of clinical outcome will reliably predict functional outcomes in AIS subjects. Furthermore, this model can be used in future studies of novel drug targets and in clinical trials leading to the development of novel interventions that prevent and/or significantly reduce the devastating disability after stroke. We propose to develop and validate this personalized prediction score using a cohort of consecutive AIS patients with MRI and genome-wide data available at Massachusetts General Hospital and the multi-center Ischemic Stroke Genetics Study, utilizing cutting-edge neuroimaging and genetic tools to improve our current clinical prediction models of post-stroke outcomes.
ABSTRACT: In survival analysis and other applied fields, the observation of a lifetime or a more general time-to-event is influenced, beyond right censoring, by left truncation. In this framework, it has been shown that information about the distribution of the truncation time can be used to construct less variable survival function estimators. In particular, special attention has been devoted to estimation with left-truncated and right-censored data when the truncation time is uniformly distributed, leading to so-called length-biased data. After an overview of existing methods to model truncation parametrically, estimation procedures for the corresponding parameters and the resulting semi-parametric survival function are introduced. Asymptotic properties of the proposed estimators are discussed and their performance is investigated through simulations.
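As a quick illustration of why the truncation distribution carries usable information, here is a toy numpy sketch of the classic length-biased correction (an invented Exp(1) example, not the estimators of the talk): when the probability of observing a lifetime is proportional to its length, E[1/T] over the biased sample equals 1/mu, so the harmonic mean recovers the true mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# True lifetimes follow Exp(1), so the true mean is mu = 1.  Length-biased
# sampling observes T with density proportional to t * f(t); for Exp(1)
# this is a Gamma(2, 1), so the biased sample can be drawn directly.
t_obs = rng.gamma(shape=2.0, scale=1.0, size=100_000)

# The naive mean ignores the sampling bias and lands near 2, not 1.
naive_mean = t_obs.mean()

# Under length bias E[1/T] = 1/mu, so the harmonic mean recovers mu.
corrected_mean = 1.0 / np.mean(1.0 / t_obs)
```

The doubling of the naive mean is exactly the kind of distortion that exploiting the (here, known) truncation model undoes.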
ABSTRACT: In this talk, I'll describe a method for assessing variable importance in matched case-control investigations and other highly stratified studies characterized by high-dimensional data (p >> n). The proposed method is motivated by a cardiovascular disease systems biology study involving matched cases and controls. In simulated and real datasets, we show that the proposed algorithm performs better than a conventional univariate method (conditional logistic regression) and a popular multivariable algorithm (Random Forests) that does not take the matching into account.
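For readers unfamiliar with the univariate baseline mentioned above, here is a minimal sketch of conditional logistic regression for 1:1 matched pairs on simulated data (not the study's data). It uses the standard reduction: for one case and one control per stratum, the conditional likelihood equals an intercept-free logistic regression on within-pair covariate differences with all outcomes equal to one.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs, p = 500, 3
beta_true = np.array([1.0, -0.5, 0.0])

# Covariates for the two members of each matched pair.
x1 = rng.normal(size=(n_pairs, p))
x2 = rng.normal(size=(n_pairs, p))

# Under the conditional logistic model, P(member 1 is the case) depends
# only on the within-pair covariate difference.
p1 = 1.0 / (1.0 + np.exp(-(x1 - x2) @ beta_true))
member1_is_case = rng.random(n_pairs) < p1
d = np.where(member1_is_case[:, None], x1 - x2, x2 - x1)

# Fit the conditional likelihood: intercept-free logistic regression on
# case-minus-control differences, all outcomes = 1, via Newton-Raphson.
beta = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-(d @ beta)))       # fitted probabilities
    grad = d.T @ (1.0 - mu)                      # score vector
    w = mu * (1.0 - mu)
    info = (d * w[:, None]).T @ d                # observed information
    beta = beta + np.linalg.solve(info, grad)
```

With 500 pairs, `beta` recovers `beta_true` to within sampling error; the talk's point is that running such fits one variable at a time is a weak baseline when p >> n.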
This is joint work with E. Andres Houseman (Oregon State University), Rebecca A. Betensky (Harvard School of Public Health) and Brent A. Coull (Harvard School of Public Health).
ABSTRACT: For many progressive movement disorders such as Parkinson's, it would be valuable in practice to detect the symptoms of the disease remotely, noninvasively, and objectively. Voice impairment is one of the primary symptoms of Parkinson's. In this talk I'll describe techniques that can be used to detect Parkinson's and quantify the symptoms, using voice recordings alone. I'll give an in-depth discussion of the statistical issues that arise with the use of statistical machine learning methods in this context. I'll also describe early results from the Parkinson's Voice Initiative, a project that has captured a very large sample of voices from healthy controls and Parkinson's subjects from around the world, using the standard telephone network.
ABSTRACT: Modeling clustered bivariate outcomes arises in medical research, for example the length of stay of patients within a hospital and discharge status (alive or dead). These patients are clustered within hospitals and have two correlated outcomes. Linear mixed models and generalized estimating equations (GEE) are among the most influential recent developments in statistical practice for analyzing such data. GEE estimates are robust to the choice of working correlation structure, but greater efficiency is realized when the working correlation model is closer to the true correlation structure. The mixed-model procedure uses the Newton-Raphson algorithm, known to be faster than the EM algorithm. The linear mixed model takes into account all available information and accounts for both serial and cross-correlation. The efficiency of the model depends on the correlation structure. Our simulation studies reveal that the independence working correlation structure may lead to biased estimates, and that the choice of correlation structure may depend on the cluster size.
ABSTRACT: Although underlying genetic factors and familial disease have been linked to the development of prostate cancer, there is only limited evidence that prostate cancer is an important component of the most common multi-cancer syndromes. However, as a result of accumulating somatic mutations and/or long-term mutagenic effects of prior treatment, patients who develop multiple malignancies could potentially have prostate cancers with a more aggressive and treatment-resistant phenotype. It is currently unknown, however, whether men diagnosed with multiple malignancies have prostate cancers that differ from prostate cancers that develop as "only" malignancies in other individuals, specifically in terms of features at diagnosis, treatments employed, and clinical outcome. Other malignancies that occur after treatment for a first malignancy have been demonstrated to have worse prognoses, such as lung cancer following treatment for Hodgkin's disease and acute myelogenous leukemia following treatment with chemotherapy or radiotherapy.
Our objective is to compare features at diagnosis, primary treatment, and clinical outcomes between men diagnosed with prostate cancer in addition to another malignancy and men diagnosed with prostate cancer as an only malignancy. We hypothesize that men who develop prostate cancer in addition to another malignancy may undergo less aggressive treatment. However, even after adjusting for this differential treatment, men who develop prostate cancer in addition to another malignancy may develop lethal prostate cancer more rapidly.
ABSTRACT: Identifying interesting relationships between pairs of variables in large data sets is increasingly important. One way of doing so is to search such data sets for pairs of variables that are closely associated. This can be done by calculating some measure of dependence for each pair, ranking the pairs by their scores, and examining the top-scoring pairs. We outline two heuristic properties, generality and equitability, that the statistic we use to measure dependence should have in order for such a strategy to be effective.
We then present a measure of dependence for two-variable relationships, the maximal information coefficient (MIC), that performs well under these criteria. MIC captures a wide range of associations both functional and not (generality), and for functional relationships provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function (equitability). Finally, we show that MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships.
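The full MIC maximizes over many grid resolutions; the following sketch computes only a fixed-grid normalized mutual information (illustrative code, not the MINE implementation), which conveys the flavor of how a grid-based score can detect non-monotone, non-functional associations that correlation misses.

```python
import numpy as np

def grid_mi_score(x, y, bins=8):
    """Normalized mutual information on an equal-frequency bins-x-bins grid.
    A rough, MIC-flavored dependence score in [0, 1]; the real MIC searches
    over many grid resolutions and takes the maximum normalized score."""
    qx = np.quantile(x, np.linspace(0, 1, bins + 1))
    qy = np.quantile(y, np.linspace(0, 1, bins + 1))
    ix = np.clip(np.searchsorted(qx, x, side="right") - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(qy, y, side="right") - 1, 0, bins - 1)
    joint = np.zeros((bins, bins))
    np.add.at(joint, (ix, iy), 1.0)        # 2-D histogram of grid cells
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginals
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz]))
    return mi / np.log(bins)               # normalize into [0, 1]

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 5000)
score_parabola = grid_mi_score(x, x**2)                # strong, non-monotone
score_noise = grid_mi_score(x, rng.normal(size=5000))  # independent
```

The parabola, whose Pearson correlation with x is near zero, still scores high on this grid-based measure, while independent noise scores near zero.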
ABSTRACT: The development of predictive biomarkers in neuroscience is increasingly enabling better estimation of the likelihood of adverse events such as psychotic episodes, dementia, and seizures. One goal of this work is to identify clinically significant risks in psychiatry and neurology, that is, risks that are in some way important enough to be taken into account in the clinical management of patients. Because not all risks are clinically relevant, collaboration between statisticians and clinicians is generally recognized as integral. There has been comparatively little discussion, however, of the possibility of morally significant risks, that is, risks that are in some way important enough that someone gains a moral obligation to do or avoid doing something, for example an obligation to protect others from harm associated with that risk (e.g. avoiding driving to minimize the risk that having a seizure while driving will lead to harm to others).
In this seminar I introduce morally significant risk as a potentially fruitful area for collaboration between statisticians and bioethicists.
I begin with the foreseeability-grounded theories of moral responsibility, which posit that an agent can properly be held responsible for a harmful act, even when he exhibited no meaningful control or awareness during the act, as long as the possible harmful effects of entering a state of involuntariness or ignorance were foreseeable to the agent. I argue that development of predictive biomarkers forces us to grapple quantitatively with what we mean by 'foreseeable': How foreseeable do harmful outcomes of my possible future loss of control need to be for me to have such moral obligations? How strong are the obligations? I appeal to the established social obligations surrounding epilepsy and driving as one possible standard from which to approach these questions, but show that such an appeal leads to several surprising conclusions that make clear the need for further work on morally significant risk.
ABSTRACT: Recent studies have identified associations between exposure to ambient air pollution and impaired cognitive function, but cognitive tests such as the Mini Mental State Exam (MMSE) are often difficult to model because of non-standard distributions, ceiling effects, and censored data. This presentation will summarize preliminary work in progress on cognitive function in the Framingham Offspring Study and will discuss methodological challenges in studying associations between air pollution and measures of cognitive performance and functional measures of brain activity in the Greater Boston area.
ABSTRACT: Truncation is common in clinical trials and observational studies with complex sampling. Ignoring the issue of truncation or incorrectly assuming quasi-independence can lead to bias and incorrect results. Currently available approaches for dependently truncated data are sparse. We present an inverse probability weighting method to estimate the survival function of a failure time subject to left truncation and right censoring. Our method allows adjustment for informative truncation due to factors associated with both the time-to-event and the truncation time. Simulation studies show that the proposed method performs well in finite samples. We illustrate our approach in a real data application.
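A schematic of the weighting idea, under an invented toy model in which a shared binary factor drives both the event time and the truncation time (this is not the authors' estimator, and right censoring is omitted for brevity): each observed subject is weighted by the inverse of its probability of escaping truncation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000  # target population; truncation removes part of it

# A binary factor z affects both the event time T and the truncation
# time A, so truncation is informative (quasi-independence fails).
z = rng.random(n) < 0.5
t = rng.exponential(np.where(z, 2.0, 1.0))
c = np.where(z, 3.0, 1.0)
a = rng.uniform(0.0, c)           # truncation time A ~ Uniform(0, c_z)
keep = t > a                      # left truncation: observe only T > A

t_obs, z_obs = t[keep], z[keep]
c_obs = np.where(z_obs, 3.0, 1.0)

# Inverse probability weights: 1 / P(A < T | Z, T) for A ~ Uniform(0, c_z).
w = 1.0 / np.minimum(t_obs / c_obs, 1.0)

t0 = 1.0
naive = np.mean(t_obs > t0)                 # ignores truncation: too high
ipw = np.sum(w * (t_obs > t0)) / np.sum(w)  # weighted survival estimate
true_surv = 0.5 * np.exp(-t0) + 0.5 * np.exp(-t0 / 2.0)
```

Because long survivors are over-represented among the untruncated, the naive estimate of S(1) is badly inflated, while the weighted estimate tracks the true marginal survival.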
ABSTRACT: Many biomarkers are obtained with measurement error due to imperfect lab conditions or temporal variability within subjects, and it is therefore critical to develop analytical methods to quantify and adjust for measurement error in the evaluation of diagnostic markers. We develop a parametric bias-correction approach to adjust estimates of sensitivity, specificity, and other diagnostic measures for a single biomarker in the presence of measurement error by using an internal reliability sample. We also propose a parametric approach to compare two or more correlated AUCs when the biomarkers are subject to correlated measurement errors. We evaluate our methods through extensive simulations and illustrate our methods using a biomarker study in Alzheimer's disease.
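To illustrate the attenuation-and-correction idea under a classical normal measurement error model (a sketch with invented parameters, not the authors' method): duplicate measurements in an internal reliability sample identify the error variance, which can then be subtracted before computing a binormal AUC.

```python
import numpy as np
from math import erf, sqrt

def phi(x):  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(5)
n = 20_000
sigma_x, sigma_u, delta = 1.0, 0.8, 1.0   # true marker SD, error SD, case shift

# True marker values and error-contaminated observations W = X + U.
x_ctrl = rng.normal(0.0, sigma_x, n)
x_case = rng.normal(delta, sigma_x, n)
w_ctrl = x_ctrl + rng.normal(0.0, sigma_u, n)
w_case = x_case + rng.normal(0.0, sigma_u, n)

# Internal reliability sample: duplicate measurements on 2000 controls
# identify the error variance via var(W1 - W2) = 2 * var(U).
w1 = x_ctrl[:2000] + rng.normal(0.0, sigma_u, 2000)
w2 = x_ctrl[:2000] + rng.normal(0.0, sigma_u, 2000)
var_u_hat = np.var(w1 - w2, ddof=1) / 2.0

d_hat = w_case.mean() - w_ctrl.mean()
var_w = (w_case.var(ddof=1) + w_ctrl.var(ddof=1)) / 2.0

auc_naive = phi(d_hat / sqrt(2.0 * var_w))               # attenuated by error
auc_corr = phi(d_hat / sqrt(2.0 * (var_w - var_u_hat)))  # bias-corrected
auc_true = phi(delta / sqrt(2.0 * sigma_x**2))
```

The naive AUC understates the marker's true discriminative ability; subtracting the reliability-sample estimate of the error variance restores it, which is the essence of the bias-correction approach described above.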
Last Update: May 22, 2013