Off The Cuff: It’s the data first, hypotheses second

Xihong Lin develops statistical and computational methods to analyze the flood of information now available from the human genome, the environment, and the clinic. Her work has been applied to a wide range of chronic conditions, from lung cancer to neurodevelopment to sleep apnea.

[Fall 2015]

Xihong Lin
Chair, Department of Biostatistics
Henry Pickering Walcott Professor of Biostatistics

Q: What can biostatisticians do today that they couldn’t 10 years ago?

A: Today, there are three terms to describe the massive amount of data that biostatisticians analyze: the genome, the exposome, and the phenome, all of which add up to the “omics” revolution. The genome is our genetic information, gathered from whole-genome sequencing. The exposome refers to all the substances and experiences that we are exposed to. And the phenome is every possible disease outcome, including whatever appears in a person’s electronic medical record. Biostatisticians try to find the needles in those haystacks: we tease signal out from noise, while accounting for random error, to understand what causes specific diseases.

In the old days, we used a candidate-gene approach to study genetic susceptibility for diseases. This means we looked at one spot of the human genome at a time. A few years ago, we looked at millions of locations across the genome simultaneously—but this only covered about 10 percent of the genome. Now, whole-genome sequencing allows us to study 3 billion base pairs.

In the old days, we measured one exposure at a time: for example, one type of particulate matter in air pollution, or a heavy metal such as mercury. Now we can simultaneously measure a spectrum of environmental exposures, both in a person’s blood or body tissue and through satellite data. Exposures include things like chemical contaminants, nutritional intake, and even social exposures, such as personal interactions and communication networks. And instead of looking at a few endpoints, we study a spectrum of disease-related outcomes using smartphones, electronic medical records, and the Medicare database.

Traditional epidemiological and environmental studies are hypothesis-driven. We could only peer at the puzzle one tiny corner at a time, testing, for example, whether a single gene or vitamin D intake is associated with lung cancer. Now we can generate new hypotheses directly from the data, matching pieces across this huge puzzle to identify multiple causes of disease, treatment targets, and prevention strategies. It’s a really different way of doing research. And it requires an open mind, curiosity, and creativity.

Madeline Drexler is editor of Harvard Public Health.