Department of Biostatistics
Quantitative Issues in Cancer Research Working Seminar
2012 - 2013
ABSTRACT: When there are multiple experimental treatments awaiting testing, multi-arm multi-stage (MAMS) trials provide large gains in efficiency over separate randomised trials of each treatment. They allow a shared control group, dropping of ineffective treatments before the end of the trial and stopping the trial early if sufficient evidence of a treatment being superior to control is found. One type of MAMS design proposed is a generalisation of a group-sequential design to multiple experimental treatments. This class of trial involves specification of futility and stopping boundaries which determine the (random) number of treatments that get through to each stage. I will discuss the choice of these stopping boundaries, and how they can be chosen in an optimal way. A disadvantage of group-sequential MAMS trials is that the sample size used is a random variable, which makes applying for funding and logistical planning difficult. An alternative MAMS design is a drop-the-loser design. In drop-the-loser designs a pre-specified number of experimental treatments are dropped at each stage, and so the sample size used is fixed, which is a big advantage. I shall discuss some recent work on drop-the-loser designs and compare their properties to those of group-sequential MAMS designs.
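To make concrete why the group-sequential MAMS sample size is random while the drop-the-loser sample size is fixed, the following sketch simulates a two-stage design. All design parameters (stage size, number of arms, futility bound, effect sizes) are invented for illustration and are not taken from the talk.

```python
import random
import statistics

random.seed(1)

def z_stat(treat, ctrl):
    # Two-sample z-statistic assuming known unit variances (illustrative).
    n = len(treat)
    return (statistics.mean(treat) - statistics.mean(ctrl)) / (2.0 / n) ** 0.5

def mams_trial(n_stage=20, k_arms=3, futility=0.0):
    """Two-stage group-sequential MAMS: arms whose stage-1 z-statistic falls
    below the futility boundary are dropped, so the number of arms reaching
    stage 2 -- and hence the total sample size -- is random."""
    ctrl = [random.gauss(0.0, 1.0) for _ in range(n_stage)]
    arms = {a: [random.gauss(0.2 * a, 1.0) for _ in range(n_stage)]
            for a in range(k_arms)}
    survivors = [a for a in arms if z_stat(arms[a], ctrl) > futility]
    stage1_n = n_stage * (1 + k_arms)                       # control + all arms
    stage2_n = n_stage * (1 + len(survivors)) if survivors else 0
    return stage1_n + stage2_n

sizes = [mams_trial() for _ in range(500)]
print("group-sequential total N ranges over", min(sizes), "to", max(sizes))

# Drop-the-loser comparator: exactly one experimental arm (plus control)
# continues to stage 2 by design, so the total sample size is a constant.
fixed_n = 20 * (1 + 3) + 20 * (1 + 1)
print("drop-the-loser total N is always", fixed_n)
```

The simulated group-sequential sample size varies from trial to trial, which is precisely the funding and logistics difficulty the abstract mentions, whereas the drop-the-loser design commits to a known total in advance.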
ABSTRACT: There are thousands of publicly available gene expression profiling datasets. These invaluable resources can be used for meta-analysis, training and external validation of gene signatures of various disease phenotypes, development of classification and prediction models, etc. However, combining such datasets can be difficult due to non-biological systematic variation between groups of profiled samples, known as batch effects. Batch effects are often observed within studies, and are almost inevitable between studies. They impair our ability to obtain reproducible, biologically and medically relevant results, and substantially limit the utility of combining several datasets for simultaneous analysis.
Several methods have been proposed to mitigate the batch effects. I will discuss existing methods for batch effect correction, and present a method that we recently developed.
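As a minimal sketch of what batch effect correction does, the following applies a location-only (mean-centering) adjustment to one gene across two hypothetical batches. This is a deliberately simplified stand-in, not the speaker's new method; established tools such as ComBat additionally adjust scale and apply empirical Bayes shrinkage.

```python
import statistics

def mean_center_by_batch(values, batches):
    """Location-only batch adjustment: subtract each batch's mean and add
    back the grand mean, so batch-level shifts are removed while the
    overall expression level is preserved. Illustrative sketch only."""
    grand = statistics.mean(values)
    per_batch = {}
    for v, b in zip(values, batches):
        per_batch.setdefault(b, []).append(v)
    batch_means = {b: statistics.mean(vs) for b, vs in per_batch.items()}
    return [v - batch_means[b] + grand for v, b in zip(values, batches)]

expr  = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]   # one gene, six samples (invented)
batch = ["A", "A", "A", "B", "B", "B"]   # batch B is shifted upward
adjusted = mean_center_by_batch(expr, batch)
print(adjusted)
```

After adjustment the two batch means coincide, so a downstream analysis pooling the batches no longer confounds the technical shift with biology; the caveat, which motivates more careful methods, is that this also removes any real biological difference that is perfectly confounded with batch.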
ABSTRACT: Active surveillance has been considered a plausible option for men diagnosed with early-stage, low-grade prostate cancer. Disease grade upgrading has been used as one of the criteria for initiating treatment in men enrolled in active surveillance studies. This criterion has its own limitations given the relatively high rates of disease grade misclassification on biopsy. However, combining serial biopsy data with available knowledge about biopsy misclassification rates and serial biomarkers may improve our ability to infer times to grade change. This talk proposes models for grade progression that account for the entire serial collection of biopsy results and that of a biomarker. The proposed models are assessed with a simulation study and applied to active surveillance data. It is shown that the proposed models can substantially improve inferences about the timing of disease grade progression while accounting for the uncertainty in biopsy sensitivity and specificity, the variability in biomarker growth and the serial correlation in the observations.
This is joint work with Bruce Trock, Ballentine Carter and Ruth Etzioni.
ABSTRACT: Patho-epidemiology is a relatively new discipline, which at the core involves the integration of tumor biomarkers and pathological data into cohorts of cancer cases nested within epidemiological studies. Beyond the fields of molecular epidemiology and pathology, patho-epidemiology incorporates biostatistics, clinical oncology and genetics. The research potential of patho-epidemiology studies includes: the identification of cancer subtypes, the discovery of biomarkers associated with cancer outcomes or response to lifestyle/therapeutic interventions, as well as the refinement of causal factors in the etiology and progression of cancer. In this seminar, we will present examples of studies in the patho-epidemiology of prostate cancer, discuss study design and analytic strategies, and highlight key findings.
ABSTRACT: Successful reform of the health care delivery system relies on improved information about the effectiveness of therapies in real world practice. Many comparative effectiveness studies rely on the analysis of observational data sources where patients are not randomized to treatment. One strategy to infer effectiveness in non-randomized settings involves ecological analyses that exploit geographic variation in treatment use. In these analyses outcomes in regions with high use of a treatment are compared to those with low use, and differences across areas are attributed to the treatment. The key assumption underpinning this method is that patient characteristics are similar across areas, so that geographic residence can mimic randomization to treatment. Few studies have carefully assessed this key assumption, and unobserved differences associated with geographic residence, such as patient preferences, socioeconomic status or health behaviors, may confound estimated treatment effects. Moreover, most studies have relied on cross-sectional differences across areas to infer treatment effects, when longitudinal analyses may be better able to control for unobserved differences across areas. Using data from population-based cohorts of elderly patients diagnosed with colorectal and prostate cancer in the Surveillance, Epidemiology, and End Results (SEER)-Medicare database and the Cancer Care Outcomes Research and Surveillance (CanCORS) collaboration, I evaluate the appropriateness of the use of geographic variation to infer effectiveness of cancer therapies. I also emphasize the importance of sensitivity analyses to examine the robustness of estimates to plausible violations of key assumptions underlying all analyses of observational data.
ABSTRACT: In cancer clinical research, the discovery of molecular biomarkers has dramatically redefined how we evaluate the efficacy of therapeutics. Further, the availability and decreasing costs of high-throughput assays allow for interrogation of the entire genome from biospecimens. As we have overcome the bioinformatic hurdle of "drinking from the firehose", a new series of challenges has emerged when translating genomic discoveries into multiplex assays as biomarkers of disease. In a 2012 report, the Institute of Medicine detailed a best-practice test development process that defines a Discovery and Test Validation Stage before moving into Evaluation for Clinical Utility. Under these guidelines, there remains a need for statistical methods that can account for multiplicity in testing signatures during the discovery process and give unbiased inferences from retrospective datasets.
In this talk, I focus on two recent extensions to the statistical methods used in "pathway-analysis". First, we have derived accurate analytic approximations to several forms of the array-permutation based tests, which greatly reduce computational requirements. Second, by adding real and imaginary weights to a summation of score statistics, one can obtain positive and negative squared terms. In this way, gene signatures that are linear predictors, or probabilistic definitions of genesets, can be tested by array-permutation and under our analytical approximations. These extensions allow for pathway analyses to serve as an intermediate step for evaluating signatures before translating predictive models into a new clinical context. Illustrations of their utilization draw from recent retrospective and prospective studies of molecular signatures in breast and ovarian cancer.
ABSTRACT: Important progress has been made in our understanding of cancer thanks to the ever growing amount of data originated by sequencing technologies. The integration of mathematical modeling and statistical analysis with sequencing and clinical data represents a new powerful approach for better understanding the evolutionary dynamics of cancer, and for implementing quantitative approaches to cancer classification and treatment.
ABSTRACT: Competing risks data are inherent to cancer research in which failure can be categorized by its type, and the information on each type of failure is as important as the overall probability of the failure. The first part of my talk will focus on the history of competing risks data, theoretical development, and an overview of competing risks data analysis. The second part will focus on issues and challenges of competing risks data analysis by presenting a survey of over 100 published papers that contains competing risks data analysis.
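Since the abstract stresses that the probability of each failure type matters as much as the overall failure probability, a worked sketch of the standard nonparametric estimator may help: the Aalen-Johansen cumulative incidence function (CIF), computed here in pure Python on a tiny invented dataset.

```python
def cumulative_incidence(times, causes, cause=1):
    """Aalen-Johansen estimate of the cumulative incidence of one failure
    type in the presence of competing risks. `causes` uses 0 for censoring
    and positive integers for failure types. Illustrative sketch only."""
    data = sorted(zip(times, causes))
    n = len(data)
    surv, cif = 1.0, 0.0          # overall survival just before t; CIF so far
    curve = []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i           # everyone with observed time >= t
        j = i
        d_cause = d_any = 0
        while j < n and data[j][0] == t:
            if data[j][1] == cause:
                d_cause += 1
            if data[j][1] != 0:
                d_any += 1
            j += 1
        cif += surv * d_cause / at_risk       # mass assigned to this cause
        surv *= 1.0 - d_any / at_risk         # overall KM-type survival update
        curve.append((t, cif))
        i = j
    return curve

times  = [1, 2, 3, 4]
causes = [1, 2, 1, 0]   # 0 = censored; 1 and 2 are competing failure types
curve1 = cumulative_incidence(times, causes, cause=1)
curve2 = cumulative_incidence(times, causes, cause=2)
print(curve1[-1], curve2[-1])
```

Note the defining property that distinguishes this from naively applying one-minus-Kaplan-Meier to each cause: the cause-specific CIFs plus the overall survival sum to one, so no probability mass is double counted across failure types.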
ABSTRACT: This paper discusses experimental design for the case that (i) we are given a distribution of covariates from a pre-selected random sample, and (ii) we are interested in the average treatment effect (ATE) of some binary treatment. The setup considered in this paper is nonparametric, and we take a decision theoretic perspective on experimental design. We consider the problem of jointly choosing a treatment assignment scheme and an estimator to minimize objective functions such as the mean squared error. We show that, in general, the optimal treatment assignment does not use randomization, where optimal is understood in the sense of minimizing Bayes risk or conditional minimax risk. Conditional independence of treatment and potential outcomes given covariates still holds for the deterministic assignments considered. We propose a class of nonparametric priors which are non-informative about the level of treatment effects, but do incorporate assumptions of smoothness. Based on these priors, tractable estimators as well as expressions for expected loss as a function of treatment assignment are derived. We suggest an automated procedure for choosing a treatment assignment vector minimizing expected loss. In simulations we find that optimal designs have mean squared errors of up to 20% less than randomized designs, where the gains are larger for smaller samples, more predictive covariates, and higher-dimensional covariates. In an application to the project STAR experiment, we found that mean squared error could have been reduced by about 19% by optimally assigning treatment, relative to the actual treatment assignment.
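To give a flavor of deterministic, optimization-based treatment assignment, the sketch below exhaustively searches balanced splits of a small sample and picks the one minimizing covariate-mean imbalance between arms. This imbalance criterion is a crude, invented stand-in for the paper's actual objective (expected loss under a nonparametric smoothness prior); it only illustrates the idea that the assignment vector is chosen, not randomized.

```python
from itertools import combinations

def best_assignment(covariates):
    """Deterministic assignment by exhaustive search: among all half/half
    splits of the sample, return the treated set minimizing the absolute
    difference in covariate means between arms. Illustrative proxy only;
    the paper minimizes Bayes or conditional minimax risk instead."""
    n = len(covariates)
    idx = range(n)
    best, best_score = None, float("inf")
    for treated in combinations(idx, n // 2):
        t = set(treated)
        mean_t = sum(covariates[i] for i in t) / len(t)
        mean_c = sum(covariates[i] for i in idx if i not in t) / (n - len(t))
        score = abs(mean_t - mean_c)
        if score < best_score:
            best, best_score = t, score
    return sorted(best), best_score

x = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]       # one covariate, six units (invented)
assign, imbalance = best_assignment(x)
print("treated units:", assign, "imbalance:", imbalance)
```

For this toy covariate vector a perfectly balanced split exists, so the optimized design achieves essentially zero imbalance, whereas a random coin-flip assignment would only balance the arms on average.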
ABSTRACT: Biomedical scientists seek to define genes and pathways that drive key processes and disease. Comparison of biological activity across experiments provides insight into the role and impact of these key pathways. High impact pathways in turn represent ideal targets for drug intervention but identification of appropriately targeted drugs remains a significant hurdle. We are developing approaches that provide insight into each of these stages of discovery. Modern biology is captured across a multitude of platforms that produce large quantities of sample data. Each is relatively independent and so presents a challenge when deeper understanding of the biology of the system requires insight into the concordance of heterogenous assays. Functional information is extracted by integrating such heterogeneous high-throughput genomic data. This is a non-trivial challenge as it is becoming clear that it is not individual genes, but rather biological pathways signaling through interacting networks of genes that drive development and an organism's response to the environment. So integration must not only address the heterogenous sources of high throughput assays, but also provide a synthesis that reflects the functional interactions of genes.
Recently, there has been an increasing focus on methods that establish the role of biological networks. Biologists seek to use these networks to identify and validate interactions between genes of interest. Researchers are now generating a multitude of 'network instances' that reflect a particular disease model or normal phenotype condition. Whether representing a consensus of experiments, or a particular sample, these instances provide anecdotal references for specific gene interactions.
To provide a consistent and portable representation of a particular normal or diseased phenotype; we must more systematically understand the way in which subsystems interact. We must capture and learn both the pathway instances through these underlying networks and the canonical relationships between these instances. We address these challenges through development and application of a systematic framework for pathway activity, comparison and translation to drug targeting in cancer stem cell systems. The Pathway Fingerprint provides a systems-level description of pathway and network activity that can be universally applied to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. By comparing Pathway Fingerprints, we generate a network of pathway interactions that allows us to determine the relative correspondence of activity of any pathway with any other.
Using comparisons of Pathway Fingerprints between human samples and mouse in vivo experiments, we reconstruct the blood developmental lineage across mammals, establishing that pathways are consistently activated across species in normal development. By comparing patient samples and mouse models, we find prognostic pathways that consistently contribute to survival outcome in Acute Myeloid Leukemia. Having established that pathways are consistently activated across species in cancer stem cell systems, we then show that a network of interacting pathways may be responsible for the success of winners of independent competitions between cancer stem cell clones in a mouse Leukemia system.
We include drug response pathways and so determine compounds that are most likely to correspond with interference with networks of pathways responsible for winning cancer clones. Ultimately this strategy drives provision of a functionally targeted treatment for cancer stem cells.
ABSTRACT: Most ovarian cancers (80%) are detected under standard clinical care in late stage, when the disease has spread beyond the ovaries and the prognosis is poor. However, when detected in early stage, surgical resection of the ovaries results in excellent prognosis, with five year survival exceeding 85%. Early detection through regular testing may identify more ovarian cancers in early stage disease and reduce ovarian cancer mortality. Over more than two decades, a series of trials for early detection have been conducted. Initial trials used a blood test (CA125) followed by ultrasound for women with a positive test. These trials provided longitudinal biomarker data. Statistical modeling of these results demonstrated that longitudinal biomarker levels contained information on the presence of undetected ovarian cancer. The screening test for subsequent trials estimated the risk of having ovarian cancer given a woman's biomarker levels and based screening decisions on the level of risk. We describe the history of this sequence of trials and the role statistical modeling has had in designing the intervention.
ABSTRACT: In most of the phase III cancer clinical trials to compare two interventions, the primary efficacy variable is a time-to-event outcome, such as overall survival or progression-free survival. The result is often reported via Kaplan-Meier plots and the two-sample log-rank test. The properties of "the class K statistics" --- a general class that includes various weighted log-rank tests (e.g., the Tarone-Ware family or the G-rho-gamma family) --- were investigated by Gill (1980), who characterized the alternatives against which tests in class K are efficient. For example, it is well-known that the log-rank test is optimal under proportional hazards alternatives. If the alternative were known, it would be wise to choose a powerful test in class K. However, it will not be easy to identify the alternative in practice. Therefore, employing a versatile test that covers various alternatives as the primary analysis might be very attractive when no reliable information regarding the alternative is available, and there is a considerable body of methodological research with regard to versatile tests.
In this talk, we will give a brief review of weighted log-rank tests, weighted Kaplan-Meier tests, and some other alternatives. We will then propose a new test based on the weighted differences of two Kaplan-Meier curves. The distribution of the proposed test statistic has a short tail under the null, but it gets quite large if one survival curve is above the other. The proposed procedure does not require a user to choose the weight or any parameters. It automatically chooses the weight adaptively, so that it can effectively capture the difference. Results from numerical studies will show that the proposed test is a useful alternative to the log-rank test.
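For readers unfamiliar with the weighted log-rank family reviewed above, the following pure-Python sketch computes a Fleming-Harrington G^rho statistic on a tiny invented dataset: rho=0 recovers the ordinary log-rank test (optimal under proportional hazards), while rho=1 up-weights early differences between the curves. This is illustrative code, not the adaptive test proposed in the talk.

```python
def weighted_logrank(times, events, groups, rho=0.0):
    """Fleming-Harrington G^rho weighted log-rank z-score for two groups.
    events: 1 = failure, 0 = censored; groups: 0 or 1. Weight at each
    failure time is S(t-)^rho, the pooled left-continuous KM survival."""
    data = sorted(zip(times, events, groups))
    n = len(data)
    s = 1.0                     # pooled KM survival just before t
    num = var = 0.0
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i
        at_risk1 = sum(1 for k in range(i, n) if data[k][2] == 1)
        d = d1 = 0
        j = i
        while j < n and data[j][0] == t:
            if data[j][1] == 1:
                d += 1
                if data[j][2] == 1:
                    d1 += 1
            j += 1
        if d > 0:
            w = s ** rho
            p1 = at_risk1 / at_risk
            num += w * (d1 - d * p1)                       # observed - expected
            if at_risk > 1:                                # hypergeometric variance
                var += w * w * d * p1 * (1 - p1) * (at_risk - d) / (at_risk - 1)
            s *= 1.0 - d / at_risk
        i = j
    return num / var ** 0.5 if var > 0 else 0.0

# Group 1 fails uniformly later than group 0 (invented toy data).
z_lr = weighted_logrank([1, 2, 3, 4, 5, 6], [1] * 6, [0, 0, 0, 1, 1, 1], rho=0.0)
z_fh = weighted_logrank([1, 2, 3, 4, 5, 6], [1] * 6, [0, 0, 0, 1, 1, 1], rho=1.0)
z_null = weighted_logrank([1, 1, 2, 2, 3, 3], [1] * 6, [0, 1, 0, 1, 0, 1], rho=0.0)
print(z_lr, z_fh, z_null)
```

When the two groups' failure times are interleaved symmetrically the statistic is exactly zero, and when one group systematically outlives the other the z-score is large in magnitude; the choice of rho controls which part of the follow-up period drives that magnitude, which is exactly the weight-selection problem the proposed adaptive test is designed to avoid.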
ABSTRACT: Epigenomic dysregulation is ubiquitous in human cancer and occurs early in carcinogenesis. Epigenetic alterations are therefore promising as potential biomarkers for the early detection of cancer. I will discuss two key requirements that must be established for epigenetic alterations that are to be used for screening purposes: stability and high disease-specificity. 1) Using data from a cohort of men with lethal metastatic prostate cancer, we recently showed that somatic DNA methylation alterations are remarkably stable. Despite showing marked inter-individual heterogeneity between patients, most somatic alterations were stably maintained across all metastases within the same individual. The overall extent of maintenance in DNA methylation changes was comparable to that of genetic copy number alterations. Furthermore, regions exhibiting the highest consistency of hypermethylation across metastases within individuals, even if variably methylated across individuals, showed enrichment for cancer-related genes. Given the high frequency of DNA methylation aberrations, these findings suggest that these stable alterations provide a large promising pool of potential biomarkers. 2) Identifying true disease-associated DNA methylation alterations can be challenging due to the confounding effect of differing cell type proportions among samples. We recently demonstrated how differences in cell type composition between diseased and normal tissue can lead to false positive findings of disease-associated DNA methylation alterations. I will discuss methods for identifying and correcting cell type heterogeneity bias, thereby improving power to identify true disease markers.
ABSTRACT: Attempts to predict risk using high dimensional genomic data can be made difficult by the large number of features and the potential complexity of the relationship between features and the outcome. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways and networks reduces the dimensionality of the problem and could improve models by making them more biologically grounded and interpretable. Pathways could have complex signals, so our approach to model pathway effects should allow for this complexity. The kernel machine framework has been proposed to model pathway effects because it allows for nonlinear relationships within pathways; it has been used to make predictions for various types of outcomes from individual pathways (Scholkopf and Smola, 2002; Liu et al., 2007, 2008; Li and Luan, 2003; Cai et al., 2011; Liu et al., 2010). When multiple pathways are under consideration, we propose a multiple kernel learning approach to select important pathways and efficiently combine information across pathways. We derive our approach for a general survival modeling framework with a convex objective function, and illustrate its application under the Cox proportional hazards and accelerated failure time (AFT) models. Numerical studies demonstrate that this approach performs well in predicting risk. The methods are illustrated with an application to breast cancer data.
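The core mechanical idea in the abstract, combining pathway-level kernels into a single kernel, can be sketched briefly. Below, each "pathway" is a hypothetical subset of gene indices, each gets its own Gaussian (RBF) kernel, and the kernels are mixed by a convex combination. In actual multiple kernel learning the weights are learned jointly with the predictor (here under Cox or AFT models); the fixed weights below are invented purely for illustration.

```python
import math

def rbf_kernel(X, features, gamma=1.0):
    """Gaussian (RBF) kernel matrix computed over one subset of features
    (one 'pathway'), allowing nonlinear within-pathway effects."""
    def k(a, b):
        d2 = sum((a[f] - b[f]) ** 2 for f in features)
        return math.exp(-gamma * d2)
    return [[k(x, y) for y in X] for x in X]

def combine_kernels(kernels, weights):
    """Convex combination of pathway kernels. In multiple kernel learning
    the weights would be estimated, with small weights dropping
    unimportant pathways; here they are fixed for illustration."""
    assert abs(sum(weights) - 1.0) < 1e-9
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

X = [[0.0, 1.0, 2.0], [0.1, 0.9, 2.2], [3.0, -1.0, 0.5]]  # 3 samples, 3 genes
pathway1, pathway2 = [0, 1], [2]          # hypothetical gene groupings
K = combine_kernels([rbf_kernel(X, pathway1), rbf_kernel(X, pathway2)],
                    [0.7, 0.3])
print(K)
```

The combined matrix remains a valid kernel (symmetric, positive semi-definite, ones on the diagonal here), so it can be plugged into any kernel machine; samples that are similar within the highly weighted pathways end up similar under the combined kernel.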
Last Update: May 13, 2013