Scientific Inference and Big Data: Addressing the opportunities, risks, and challenges

On June 8-9, 2016, the Committee on Applied and Theoretical Statistics (CATS) of the National Academies of Sciences, Engineering, and Medicine convened a workshop to discuss how scientific inference should be applied when working with large, complex datasets. The results of that workshop, Refining the Concept of Scientific Inference When Working With Big Data: Proceedings of a Workshop, were published by the National Academies in February. The proceedings of the workshop can be freely downloaded here.

One of the major themes addressed by the workshop was the creation of new opportunities, risks, and challenges related to the analysis of big data, that is, datasets whose size, complexity, and heterogeneity preclude conventional approaches to storage and analysis. One of the conclusions was that bigger data does not always yield better results, because of a lack of controls, unidentified bias, missing and irregular data, and other confounding factors. The workshop therefore emphasized the importance of statistical inference in the analysis of big data. As one strategy for mitigating these problems, biostatistics professor Sebastien Haneuse suggested comparing opportunistic data with data from a dedicated study, which can serve as a guidepost in assessing the suitability of the opportunistic data.
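The point that bigger data is not automatically better can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not a method presented at the workshop: the population, the selection mechanism, the sample sizes, and the function name `simulate` are all hypothetical. It draws a large "opportunistic" sample in which high values are more likely to be recorded, draws a small simple random sample standing in for a dedicated study, and then uses the dedicated estimate as a guidepost in the spirit of the comparison described above.

```python
import random
import statistics


def simulate(seed=0, population_size=100_000, dedicated_n=500):
    """Compare a large biased sample with a small random sample.

    All numbers here are illustrative assumptions, not workshop data.
    """
    rng = random.Random(seed)

    # Hypothetical population: an outcome centered near 50.
    population = [rng.gauss(50.0, 10.0) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # Opportunistic data: large, but units with outcomes above 50 are
    # four times as likely to be recorded (an unidentified selection bias).
    opportunistic = [
        x for x in population
        if rng.random() < (0.8 if x > 50.0 else 0.2)
    ]

    # Dedicated study: a small simple random sample of the population.
    dedicated = rng.sample(population, dedicated_n)

    opp_mean = statistics.fmean(opportunistic)
    ded_mean = statistics.fmean(dedicated)
    ded_se = statistics.stdev(dedicated) / dedicated_n ** 0.5

    # Guidepost check: is the opportunistic estimate within an
    # approximate 95% interval around the dedicated estimate?
    consistent = abs(opp_mean - ded_mean) <= 1.96 * ded_se

    return {
        "true_mean": true_mean,
        "opportunistic_n": len(opportunistic),
        "opportunistic_mean": opp_mean,
        "dedicated_n": dedicated_n,
        "dedicated_mean": ded_mean,
        "dedicated_se": ded_se,
        "consistent_with_dedicated": consistent,
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key}: {value}")
```

In this toy setting the opportunistic sample is roughly a hundred times larger than the dedicated one, yet its mean is pulled away from the population mean by the selection mechanism, while the small random sample stays close. A disagreement between the two estimates is a signal to investigate the opportunistic data before relying on it; it does not by itself identify the source of the bias.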

Because of the pitfalls associated with the analysis of complex datasets, a greater understanding of uncertainty is crucial, not only in research design but also in statistics education. Participants in the workshop emphasized not only that students at all levels should be taught the basics of computing, database management, and data sharing practices, but also that greater attention should be given to communicating the limitations of basic statistical assumptions, and that increased training should be provided to prevent the blind application of inappropriate models.

Best practices related to the management and analysis of big data will increasingly rely on the early participation and collaboration of statisticians in research teams. Xihong Lin cited Harvard’s new interdisciplinary BD2K training program, which engages students in laboratory rotations in computer science, informatics, statistics, and domain sciences. Statisticians should be encouraged to consider themselves not only as analysts but also, more importantly, as scientists fully engaged in research and discovery. Greater collaboration from the outset of a project can reduce instances of common statistical errors, encourage the development of appropriate analytic tools and research designs, and make research findings more reproducible and reliable.

Among the critical opportunities for future research, the issue of scale remains outstanding. Biomedical research has produced data that describe phenomena across orders of magnitude in spatial and temporal scales, and the challenge is to connect these multiscale data. Another area of focus should be reconciling statistical models and their underlying assumptions with the actual complexity of biological processes. Both challenges point to the need for continued collaboration between statisticians and domain scientists, and for the development of new models and computational techniques that are both scalable and adaptable to real-world scenarios.