Researchers wrap their arms around the messy world of big data

November 9, 2018—The promises of big data in health care are seemingly endless. And so are the challenges—poor data quality, byzantine medical codes, and the complexity of human genetics, to name just a few.

A recent event, hosted by the Program in Quantitative Genomics at Harvard T.H. Chan School of Public Health, tackled some of these challenges. The conference “Biobanks: Study Design and Data Analysis,” held on November 1-2, 2018, at the Joseph B. Martin Center, brought together experts from around the world who talked candidly about the obstacles they encounter when working with huge amounts of data from multiple sources. They also showed off the tools and methods they’re developing to use big data to improve prevention, diagnosis, and treatment of a wide range of illnesses.

“Biomedical big data is transforming the study of human biology and disease. For epidemiologists, it’s a big playground. But it’s also a bit of a nightmare,” said Gil McVean, director of the Big Data Institute at the University of Oxford.

With the rise of genetic sequencing technologies, electronic medical records, and digital communications, researchers are able to collect more information on patients than ever before. To store and organize the data, they have started building what are known as biobanks: giant, freezing-cold repositories that can safely house biological samples such as blood, urine, and skin cells for long-term analysis and genetic sequencing.

The amounts of raw data that biobanks can generate are staggering. Consider the UK Biobank, located outside of Manchester, England. The program recruited 500,000 men and women between the ages of 40 and 69 who provided blood, urine, and other biological samples for storage. Each participant’s electronic medical records are linked to the biobank, allowing researchers to track their encounters with the health system. In addition, subsets of participants respond regularly to questionnaires on lifestyle habits, occupational history, mental health, cognitive function, and other health-related issues. Some wear ECG monitors and accelerometers to capture data on heart health and physical activity. On top of that, more than 30,000 participants agreed to have detailed MRI scans taken of vital organs to provide a visual record of changes over time, which could lead to new insights for diagnosing and treating various diseases.

Managing all of these data, let alone making good use of them, is challenging. “This endeavor requires industrial scale processes,” said Catherine Sudlow, chief scientist of the UK Biobank and one of the event’s keynote speakers. She noted that the UK Biobank is open access, meaning that researchers from around the world can use the data for free. “It’s messy, real-world data, and that’s why we’re interested in working with all the people in this room,” she told attendees.

The two-day conference featured more than a dozen speakers and highlighted the work of junior researchers with a Stellar Abstracts Award ceremony. Several presenters noted the creative ways they’re making use of the UK Biobank, including Tianxi Cai, the John Rock Professor of Population and Translational Data Sciences at Harvard Chan School. Cai works with large data sets from the U.S. Department of Veterans Affairs and said that she can use the UK Biobank to help validate her research on genetic markers for a wide range of diseases, including cardiovascular conditions, aneurysm, and skin conditions.

Cai, who joked during her presentation that she spent the first ten years of her career cleaning data, knows how far the field has come and how much further there is to go. “We want to do better,” she said.

Junior researchers are honored with Stellar Abstract Awards.

Catherine Sudlow, chief scientist of the UK Biobank, speaks to attendees.

–Chris Sweeney

Photo: Nilagia McCoy