Searching for strong signals in big data’s noise

Zack McCaw

Zack McCaw, PhD ’19, is developing statistical tools to advance research into the genetic basis of complex diseases.

May 28, 2019 – Inside an unassuming facility on the outskirts of Manchester, England, lies a tantalizing treasure trove of data freely available online to approved researchers worldwide. There, biological samples from more than 500,000 volunteers are stored in a massive freezer, and an array of servers holds terabytes’ worth of the participants’ anonymized genomic and detailed clinical data. Since the UK Biobank opened for business in 2017, thousands of researchers have used the data to launch studies, many focused on finding the genetic variants associated with complex traits and diseases. But it can be a challenge to distinguish between true associations and false discoveries in that vast sea of data. Harvard T.H. Chan School of Public Health student Zachary McCaw is working to make the job easier.

For his dissertation, McCaw, who is graduating this month with a doctoral degree in biostatistics, used UK Biobank data to determine whether a particular type of mathematical operation (called inverse normal transformation) could help researchers more reliably identify locations in the genome associated with lung function. In a second dissertation project, McCaw proposed a way to fill in gaps in data that are difficult to measure by borrowing information from surrogate data that are easy to measure—in this case, gene expression in blood as a stand-in for gene expression in the brain. The goal of these projects was to aid research on genetic variants that could ultimately lead to better diagnostic procedures and targeted treatments.
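For readers curious about what an inverse normal transformation does, the sketch below shows one common rank-based form: each measurement is replaced by its rank, and the rank is mapped to the matching quantile of a standard normal distribution. It is an illustration only, not McCaw's own code. The function name `rank_inverse_normal`, the sample values, and the 3/8 offset (a widely used default known as the Blom offset) are assumptions rather than details from his dissertation.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.375):
    """Rank-based inverse normal transformation (illustrative sketch).

    Assumes `values` is a non-empty list of numbers with no missing entries.
    The default offset of 3/8 is an assumption, not taken from the article.
    """
    n = len(values)
    # Indices of the observations, sorted from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])

    # Assign 1-based ranks, giving tied values the average of their ranks.
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1

    # Map each rank to a probability strictly between 0 and 1, then to the
    # corresponding quantile of the standard normal distribution.
    standard_normal = NormalDist()
    return [
        standard_normal.inv_cdf((rank - offset) / (n - 2 * offset + 1))
        for rank in ranks
    ]


# Hypothetical, skewed measurements used only to demonstrate the function.
measurements = [3.1, 0.2, 15.0, 1.4, 2.2]
transformed = rank_inverse_normal(measurements)
```

The transformed values keep the original ordering of the measurements but are spread symmetrically around zero, which reduces the influence of extreme observations on downstream association tests.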

“I hope that my methods will be adopted by other investigators working on genome-wide association studies,” McCaw said. He hopes to continue working on interesting statistical problems, “and when I am able to propose effective new solutions, to share them with the broader community.”

Solving problems

McCaw first became interested in statistical genetics when he was growing up in North Carolina. His mother, a clinical chemist, performed screening tests on newborns for metabolic disorders, and shared stories about her work. As an undergraduate at the University of North Carolina at Chapel Hill, he worked on research into the genetic signature of respiratory syncytial virus, the leading cause of hospitalization during the first year of life and a contributor to childhood asthma and other respiratory problems.

More recently, he’s become interested in survival data analysis, comparing the health outcomes for patients in clinical trials on different treatments and placebos. It’s gotten him thinking about new areas to pursue in statistical genetics, such as examining the genetic basis for the ways that individuals respond to treatments.

McCaw said that his time at Harvard Chan School has made him a better problem solver, and that it impressed on him the importance of seeing research as an iterative process—identifying problems and the methods to solve them, frequently assessing performance, and making improvements. “I learned that a new method will seldom work as expected the first time through, and you will almost always benefit from multiple rounds of experimentation and revision,” he said.

His dissertation advisor, Xihong Lin, professor of biostatistics and statistics, praised his interdisciplinary research skills. “Zack has excelled in developing and applying novel scalable statistical and computational methods for analyzing massive genome, exposome and phenome data in biobanks,” she said. “It was truly a privilege to have worked with him.”

At the School, McCaw received support from the John F. and Virginia B. Taplin Endowment for Biostatistics. He also received an F31 Individual Research Fellowship from the National Heart, Lung, and Blood Institute.

After graduation, McCaw will work over the summer at the Broad Institute on the problem of fine-mapping, or determining which of the many genetic variants in a region of the genome is truly responsible for affecting a health outcome. This fall, McCaw will join Google as a data scientist, and one of his areas of focus will be developing causal inferences from longitudinal data. But he sees himself eventually returning to academia, in a role that involves both biomedical research and teaching.

McCaw served as a teaching assistant for classes in the Department of Biostatistics and with the School’s GINGER initiative, a training program in neuropsychiatric genetics for African physicians and researchers. McCaw spent two weeks in South Africa helping teach biostatistics to the program’s fellows. He said that he loves developing problems for courses and helping guide students through the process of solving them.

As he prepares to graduate, McCaw recalled that his experiences at the School weren’t all about crunching data. An enthusiastic athlete, McCaw enjoyed playing on the Biostatistics intramural soccer and volleyball teams and competing in local running races with friends from the program. “I truly enjoyed my time in graduate school,” McCaw said. “The experience was challenging but formative, and absolutely worth the investment of time and effort.”

Amy Roeder

Photo: Sarah Sholes