Department of Biostatistics Colloquium Series 2011-2012
ABSTRACT: The widely applicable statistical and machine learning classification problem observes variables (Y,X), where Y is the class, 0 or 1, and X is a p-dimensional vector of features. We start with the report by Fix and Hodges (1951). We outline a new approach based on our research on fundamental statistical methods that extend from simple data to complex data and unify discrete and continuous random variables. Our approach estimates Pr[Y=1|X] by density estimation of comparison densities d(u), 0<u<1; typically d(u)=f(Q(u))/g(Q(u)), where Q(u) is the quantile function of G(x). We estimate f starting with g by the weighted distribution formula f(x)=g(x) d(G(x)). Define the comparison probability ComPr[B|A]=Pr[B|A]/Pr[B]. Bayes' rule can then be stated as ComPr[A|B]=ComPr[B|A]. Then ComPr[Y=1|X=x]=ComPr[X=x|Y=1]=d(u), letting x=Q(u), where Q is the quantile function of X. We interpret d(u) as the probability density of the mid-ranks (the mid-distribution transform) W=Fmid(X)=(Rank(X)-.5)/n, where Fmid(x)=F(x)-.5p(x) and p(x)=Pr[X=x]. We discuss perfect classification, in which Pr[Y=1|X] equals 0 or 1. We discuss many approaches to estimation of d(u) and emphasize the MaxEnt exponential model with sufficient statistics W_j that are nonparametric statistics, namely means of score functions S(X) constructed for each variable X from Fmid(X). We compute Wilcoxon-type W_j from CORR(Y=1,S(X)). Our modeling approach illustrates the profound chicken-and-egg puzzle: which comes first, the parameters or the sufficient statistics? It depends on whether the parameters are scientific or statistical. We first assume independent features; the extension to dependence is via nonparametric estimation of copula density functions. As an example we discuss the famous Hepatitis data. Our constructed score functions S(X) can also be used to perform logistic regression estimation of Pr[Y=1|X].
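The two definitions in this abstract that can be checked numerically are the mid-distribution transform Fmid(x)=F(x)-.5p(x) and the comparison-probability form of Bayes' rule. The following is only a minimal sketch on toy data, not the speaker's software: the function names `mid_distribution` and `comparison_probabilities` are hypothetical, the empirical distribution stands in for F, and X is treated as a single discrete feature.

```python
from collections import Counter
from fractions import Fraction


def mid_distribution(values):
    """Empirical Fmid(x) = F(x) - 0.5*p(x), evaluated at each observation.

    Assumption: F and p are the empirical distribution and empirical
    probability mass of the sample itself.
    """
    n = len(values)
    counts = Counter(values)
    fmid = {}
    cum = 0  # number of observations <= x
    for x in sorted(counts):
        cum += counts[x]
        fmid[x] = (cum - 0.5 * counts[x]) / n
    return [fmid[v] for v in values]


def comparison_probabilities(xs, ys):
    """For each distinct x, return (ComPr[Y=1|X=x], ComPr[X=x|Y=1]).

    Uses ComPr[B|A] = Pr[B|A] / Pr[B] with empirical frequencies and
    exact fractions, so the two sides of Bayes' rule can be compared
    without rounding error. Assumes at least one observation has y == 1.
    """
    n = len(xs)
    n1 = sum(ys)
    n_x = Counter(xs)
    n_x1 = Counter(x for x, y in zip(xs, ys) if y == 1)
    result = {}
    for x in n_x:
        com_y_given_x = Fraction(n_x1[x], n_x[x]) / Fraction(n1, n)
        com_x_given_y = Fraction(n_x1[x], n1) / Fraction(n_x[x], n)
        result[x] = (com_y_given_x, com_x_given_y)
    return result


# Toy data (made up for illustration only).
xs = ["a", "a", "b", "b", "b", "c"]
ys = [1, 0, 1, 1, 0, 0]

w = mid_distribution([3, 1, 2, 4])   # no ties: equals (rank - 0.5) / n
com = comparison_probabilities(xs, ys)
prior = Fraction(sum(ys), len(ys))   # Pr[Y=1]
posterior_b = prior * com["b"][0]    # Pr[Y=1|X=b] = Pr[Y=1] * ComPr[Y=1|X=b]
```

Without ties the transform reduces to (Rank(X)-.5)/n, and with ties each tied value receives its mid-rank. Multiplying the comparison probability by the prior Pr[Y=1] recovers Pr[Y=1|X=x], which is the sense in which estimating d(u) yields the classifier.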
ABSTRACT: There are many ways to personalize the diagnosis and treatment of diseases, pharmacogenomics being one of them. Personalization can be based on routinely collected information, molecular signatures, or on repeated trials on the patient whose treatment plan is being devised. However, current emphases in personalized medicine research often ignore characteristics known to impact treatment benefit, in favor of tests that either generate more revenue or are developed with research that is perhaps easier to fund than "low-tech" research. Failure of the research community to fully utilize rich datasets generated by randomized clinical trials only heightens this concern.
Research supporting personalized medicine can be made more rigorous and relevant. For example in acute diseases, multi-period crossover studies can be used to measure individual response to therapy, and these studies can provide an upper bound on the genome by treatment interaction. When patient by treatment interaction is demonstrated, crossover studies can form an ideal basis for pharmacogenomics. However, even with the best within-patient data, group average treatment effects need to be incorporated in order for predictions for individual patients to have high precision.
There are a few ways to do personalized medicine well but a multitude of ways to do it poorly. Biomarker research in particular has not fulfilled its early promises, a major reason being flawed methodology. The flaws include faulty experimental design, bias, overfitting, weak validation, irreproducible research, data processing and analysis practices, and failure to rigorously show that the new markers add information to readily available clinical data. This will be discussed in terms of Platt's concept of "strong inference", seeking alternative explanations of findings, and sensitivity analysis.
This talk is also a call for the biostatistics and clinical epidemiology communities to be more integrally involved in research related to personalized medicine.
ABSTRACT: In this talk I will describe a model for the analysis of infectious disease data collected over time in a set of areal units. The aim of the analysis of such data is often the prediction of future disease counts in areas over the study region. The model combines a Poisson model with B-splines and Gaussian Markov random field prior distributions to carry out spatio-temporal smoothing. The motivating data consist of counts of hand, foot and mouth disease collected in China over a two-year period. In addition to the counts, a small number of cases in each area provided strain-specific information. An extension of the model will be described that reconstructs strain information on all cases, again smoothing over time and space.
This is joint work with Cici Bauer.