From contributor Sheila Gaynor:
As a follow-up to last week’s interview, this week we will hear from John Quackenbush, the director of the Big Data to Knowledge (BD2K) Training Grant. He is a professor of Computational Biology and Bioinformatics in the department and serves as the principal investigator on this new grant.
What do you see as the most important issues or challenges in big data within the next couple years?
In 2013, the National Research Council published a report, “Frontiers of Massive Data Analysis,” that looked at the challenges of Big Data. Their primary finding was that the key element in meeting Big Data’s challenges is the development of rigorous quantitative and statistical methods. And I think they hit the nail on the head.
The challenge is no longer collecting data, but asking good questions and developing robust quantitative methods that will use the data and the models we build to inform our understanding of the systems we are studying. I think our faculty here at SPH and our colleagues across Harvard are extremely well positioned to develop these models, and we are lucky to have fabulous students whom we can train (and who can train us) to think creatively about solutions.
But in developing these rigorous quantitative models, I would go farther than the NRC. We have to change how we think about our models. We are always asking, “Is the model right?” I think the better question is “Does the model inform our understanding of the system we are studying?”
The Scottish writer, Andrew Lang, said, “He uses statistics as a drunken man uses lamp posts – for support rather than for illumination.” We constantly have to ask ourselves whether we are asking the right questions and then pick the best methods to answer them. I think that approach is a big part of what defines a scientist.
What skills do you think are most important for students to develop to tackle these challenges?
My PhD was in theoretical particle physics. The title of my thesis was “Two-dimensional quantum field theory and string theory models.” People often ask me whether my physics training has been useful in what I do today. And, while I haven’t solved the Dirac equation in many years, my training has been invaluable. My PhD taught me how to think and solve problems.
The courses we offer students are outstanding in that they give them tools that they can use to solve problems. We teach them statistical methodology, we teach them statistical theory, and we teach them computational methodology. Those are the foundation for their futures. But the most important part of a PhD is thesis research. We want students to build on the tools they have, to draw from their broader training, to borrow ideas from other disciplines, and to develop intuition about what the right questions are to ask and how to think about solving them. So we need to ensure that we teach our students how to think.
If you’ve ever seen me on a PhD thesis committee, I usually only ask a small number of questions, and there is only one that I really care about: “What is the next big problem that the field needs solved that you are uniquely positioned to tackle?” Because I want to know that the students understand the breadth and depth of their field and that they have thought about the most significant problems.
People sometimes present data-driven approaches, such as machine learning, and statistics as being competing, almost incompatible methodological frameworks. How do you see the difference?
I don’t think there is such a huge gap between statistical and machine learning methods. They often draw on very similar underlying methodologies, and the Venn diagram overlap can be substantial. The difference is sometimes in the application. Although there is a huge continuum, I think statisticians tend to rely more on a priori modeling, while computer scientists are more willing to let the data drive the model. And both approaches have applications where they excel. Machine learning is exceptional for problems like identifying tumors in CT scans, where there are a multitude of examples from which one can learn patterns and where the rules for finding the pattern matter less than the result. On the other hand, machine learning isn’t so good at predicting new things that aren’t in the training set, whereas a model that one can fit can provide more testable hypotheses. So what we want our students to have is a broad set of tools and the knowledge to choose which might be best for solving any particular problem.
How does the BD2K training program integrate with other programs in our department?
I think this is an important addition to our overall program. First, I hope it will continue to build bridges with departments in Cambridge, especially with computer science. The important questions today are often at the interface of multiple disciplines, and so broadening our training is going to be essential. Second, every statistical and quantitative method we teach and use is being stretched by the unprecedented quantities of data that we can now generate and access. This program, I hope, will help continue the improvements in computational training that we’ve been building into our overall graduate program. And third, I think this program will give students more options to explore research areas where their insight, imagination, and training can position them to make important contributions to our collective knowledge. I am proud to be a member of this department and to be able to help maintain and expand the outstanding training our graduates receive.