Big data and public health



Coming up on Harvard Chan: This Week in Health…Big data and public health.

{***Miguel Hernan Soundbite***}
(Every time we go to a doctor, and we receive a diagnosis, we start a new treatment, the information goes into a database. And when you put all of this together, that is a database with millions of data points that can be used for research purposes. In fact, chances are that our own health information is being used right now to gain new insights about health.)

Researchers are now harnessing vast amounts of information to assess what works in medicine.

This new data driven approach holds promise-but there are also potential risks.

And in this week’s episode we discuss that with an expert in the increasingly important field of causal inference.


Hello and welcome to Harvard Chan: This Week in Health. It’s Thursday, January 25, 2018. I’m Amie Montemurro.


And I’m Noah Leavitt.


Noah-Medicine and public health are constantly evolving as new research and technology open the doors to new ways to treat or prevent diseases.


And in other cases-new findings are challenging our preconceived notions of what works best.


And a key challenge for doctors and scientists is exactly that-figuring out what is best for patients and the public’s health at large.

That means asking questions like: When is the best time to start treatment in individuals with HIV? Or is it safe to give anti-depressants to pregnant women?
There are also policy questions that should be answered-such as nutrition recommendations regarding dietary fats.


In an ideal world, those questions would be answered through a randomized controlled trial-the gold standard of scientific research.

But in many cases that’s not possible-because it may be too expensive, too difficult to enroll the right number of people, or the study itself may be unethical.


And that’s where big data-and the focus of today’s podcast-comes in.


Researchers are now able to harness vast amounts of existing information on patients-such as in Medicare databases-to, in a sense, replicate randomized controlled trials.


It’s an approach with great promise-but also potential downsides if the research isn’t conducted properly.

And that’s where Miguel Hernan focuses much of his work.


Hernan is the Kolokotrones Professor of Biostatistics and Epidemiology here at the Harvard Chan School.

And he’s a leading expert in the field of causal inference, which includes comparative effectiveness research to guide policy and clinical decisions.


We spoke with Hernan about how researchers are using big data to answer important questions about health-and the safeguards that need to be in place to avoid misleading results.


I began our conversation by asking Hernan to define big data-which is a term you’ve probably been hearing a lot lately.

Take a listen.

{***Miguel Hernan Interview***}

MIGUEL HERNAN: Big data means different things to different people. And in fact, there’s not even an agreement on how big data need to be to be called big data. But in a health context, we use the term big data to refer to these large databases where our interactions with the health care system are stored.

Every time we go to a doctor, and we receive a diagnosis, we start a new treatment, the information goes into a database. And when you put all of this together, that is a database with millions of data points that can be used for research purposes.

And of course, there are very strict protocols to prevent any leaks of personal information. As an example, there’s a lot of research that is conducted based on the information of Medicare beneficiaries, or information on members of private insurance companies. In fact, chances are that our own health information is being used right now to gain new insights about health. In that sense, we are all part of the research enterprise, which I find very exciting.

NOAH LEAVITT: And so just to follow up on that, it seems from my perspective that this whole field of big data has seemed to grow really, really rapidly over maybe the last decade or so. Has that been the case? Or has big data been in use maybe longer than people realize?

MIGUEL HERNAN: That’s a very good question. People in health started to use big data, big databases, probably in the ’70s. At the time, the ones doing that were a minority, and now everybody is using these big databases. I think that the term big data comes to us from other places, from Google, and Facebook, and places like that. But health researchers have been using big data for a long time.

NOAH LEAVITT: So I know much of your research focuses on causal inference. So what does that mean when it comes to using these large databases of information?

MIGUEL HERNAN: Causal inference is a term that has become very fashionable among investigators. What it actually means, what we actually do is to try to learn what works and what doesn’t work to improve health. And we use big data for that.

We ask questions like how much screening colonoscopy lowers the risk of cancer, or what is the best time to start treatment in individuals with HIV, or is it safe to give anti-depressants to pregnant women. In the past, we had very little data to answer these questions.

For each of the questions, we had to recruit participants, collect data, which means that we had relatively small studies. But now, in the last decades, with the availability of these big databases, we can study these issues. We can ask these questions and try to answer them in a more efficient way and at a fraction of the cost.

NOAH LEAVITT: And so you touched on it there that the standard for assessing one of these questions might be to do a randomized controlled trial, recruit participants. But what big data allows you to do is to maybe measure the effectiveness of an intervention, without having to do that. So can you give an example of where that might occur?

MIGUEL HERNAN: Sure, first of all, you mentioned randomized trials, which have historically been the gold standard to learn what works and what doesn’t. And the idea of a randomized trial, as I’m sure many of our listeners know, is that we assign people to two different treatments, and we assign them at random to two different treatments, then we compare the outcomes between the two groups.

And because their treatment assignment happened by chance, any differences between the groups have to be due to the treatment they are receiving. So this is the best possible way of making causal inferences. Now, in the real world, there are many practical difficulties to carry out randomized trials.

Some trials could be so expensive that we cannot even conceive them. Others would not be ethical. Suppose that we want to learn about the risk of birth defects. Well, we cannot conduct a trial in which we intentionally expose pregnant women to various treatments. Other times, we are interested in the long-term effects of treatments, maybe after using them for 10 or more years. And again, a randomized trial would not be practical.

So as much as we love randomized trials, in many cases, we are not going to be able to conduct them. And that is when we use big databases. In those cases, our best chance to learn what works is really the use of these big databases.

And even when we can conduct a randomized trial, when we can actually do it, we will have to wait three, four, or five years until we know the results from the trials. And in the meantime, we still need to make decisions. For those decisions, again, we need some information, which will come from big databases.

NOAH LEAVITT: So in a sense, are you basically taking existing data that’s out there, and then kind of modeling what has already happened in the real world, and drawing conclusions from that?

MIGUEL HERNAN: That is exactly what we do. And that is what causal inference is. So we take the data that has been collected already, and we try to use this data to emulate a randomized trial that we would like to conduct, but we can’t.

NOAH LEAVITT: And so what are some of the benefits of this approach? And then on the flip side, what would some of the risks be?

MIGUEL HERNAN: The benefits are that being formal about causal inference, being formal means trying to be very precise about what is the randomized trial that is our target, what is the randomized trial that we would actually like to emulate, and then go about trying to emulate it. That approach results in fewer mistakes.

If we try to do it in a more casual way, in which, well, we have data, we do a data analysis, we find some associations, and we try to give them a causal interpretation, it’s more likely that we will make mistakes. For example, in a naive data analysis, we’ll find that cigarette smoking during pregnancy is associated with lower mortality in babies with low birth weight.

But that doesn’t mean that cigarette smoking during pregnancy lowers the risk of mortality. That is just something that we are guaranteed to find in the data, and a formal causal inference analysis will explain why cigarette smoking really does not lower the risk of mortality in those babies. So being formal with causal inference, we can eliminate some common biases that we sometimes see in data analysis, that are more casual or naive.

NOAH LEAVITT: It seems like, when you were talking about biases there, and the example of smoking during pregnancy, it seems like it’s an example of these seemingly random associations that people can find if they play with the data enough. What are some of the common biases that you do need to be aware of that would separate, as you mentioned, a naive study from a formal causal inference?

MIGUEL HERNAN: Well you just touched on a very important problem of this type of analysis with big data, which is the problem of multiple comparisons. Because you can compare anything that you want. Then just by chance, you are guaranteed to find some associations. And that’s a very serious problem.

One way of fighting that problem is precisely to be formal about the questions. So by pre-specifying the randomized trial that you would like to conduct, but you can’t, and then trying to emulate that trial using the big data, you can actually constrain the number of analyses that you’re going to do. Because you cannot just do anything: you have to do only the type of analysis that will help you answer that specific question, and not the other million questions that could come.
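The multiple-comparisons problem Hernan describes is easy to demonstrate with a quick simulation (a hypothetical sketch, not from the episode): generate purely random "exposures," test each one against an outcome, and a predictable fraction will clear a naive significance threshold by chance alone.

```python
import math
import random

random.seed(3)

N = 500  # people in the simulated database
K = 200  # candidate exposures, all pure noise

# Outcome is a coin flip, unrelated to anything.
outcome = [random.random() < 0.5 for _ in range(N)]

false_hits = 0
for _ in range(K):
    # Each "exposure" is also pure noise.
    exposed = [random.random() < 0.5 for _ in range(N)]
    n1 = sum(exposed)
    n0 = N - n1
    r1 = sum(o for o, e in zip(outcome, exposed) if e) / n1
    r0 = sum(o for o, e in zip(outcome, exposed) if not e) / n0
    # Two-proportion z-test; |z| > 1.96 mimics a naive "p < 0.05" screen.
    p = (r1 * n1 + r0 * n0) / N
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    if abs(r1 - r0) / se > 1.96:
        false_hits += 1

print(f"{false_hits} of {K} pure-noise exposures look 'significant'")
```

With 200 tests at the 5% level, roughly ten spurious "findings" are expected even though nothing real is in the data, which is exactly why pre-specifying one target trial constrains the analysis.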

But that is only one of the problems. The other problem when trying to make causal inferences with big data is that we have a lot of data, but that doesn’t mean that we have the data that we need. Of course, we need data on the treatments of interest. We need data on the outcomes of our interest.

If we are trying to estimate the effect of aspirin on stroke, we need data, and good data on aspirin, and good data on stroke. But besides that, we also need very good data on the reasons why people take aspirin. Because people who do take aspirin and people who don’t take aspirin in the real world are different. So we cannot just compare them.

This is not a randomized trial. Therefore, if we just compare people who take aspirin, who probably are people who have a higher risk of heart disease to start with, and people who don’t take aspirin, who have a lower risk of heart disease, then they will have different risk of stroke, and not because of aspirin, but just because they are different types of people.

That is the problem that randomization solves. And that is the problem that we have in this type of study. So we will need very good data on the variables that make the treated and the untreated different.

And that is, I would say, that’s a main limitation of many of these analyses, that in these large health care databases, we may have high quality information sometimes. We can have high quality information on treatments and high quality information on outcomes, but not always high quality information on these prognostic factors that are needed for a valid analysis.
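The aspirin example can be made concrete with a minimal, hypothetical simulation (all rates are invented for illustration): high-risk patients are both more likely to take aspirin and more likely to have a stroke, so a naive comparison shows a large risk difference even though aspirin does nothing here. Weighting each person by the inverse probability of their treatment, given their risk group, removes the confounding. In a real analysis those treatment probabilities are unknown and must be estimated from measured prognostic factors, which is why data on the reasons for treatment matters so much.

```python
import random

random.seed(0)

# Hypothetical cohort: high-risk people are more likely to take aspirin
# AND more likely to have a stroke. Aspirin itself has no effect here.
cohort = []
for _ in range(50_000):
    high_risk = random.random() < 0.5
    aspirin = random.random() < (0.8 if high_risk else 0.2)
    stroke = random.random() < (0.10 if high_risk else 0.02)  # no aspirin effect
    cohort.append((high_risk, aspirin, stroke))

treated = [r for r in cohort if r[1]]
untreated = [r for r in cohort if not r[1]]

def risk(rows):
    return sum(s for *_, s in rows) / len(rows)

# Naive contrast: biased upward, because the treated are sicker.
naive_diff = risk(treated) - risk(untreated)

def p_treat(high_risk):
    # In this toy example the treatment probabilities are known;
    # in practice they are estimated from prognostic factors.
    return 0.8 if high_risk else 0.2

def weighted_risk(rows, treated_flag):
    # Weight each person by 1 / P(their treatment | risk group),
    # which balances the two groups on high_risk.
    num = den = 0.0
    for high_risk, aspirin, stroke in rows:
        p = p_treat(high_risk) if treated_flag else 1 - p_treat(high_risk)
        num += stroke / p
        den += 1 / p
    return num / den

ipw_diff = weighted_risk(treated, True) - weighted_risk(untreated, False)
print(f"naive risk difference: {naive_diff:.3f}")
print(f"IPW risk difference:   {ipw_diff:.3f}")
```

The naive difference is driven entirely by who chooses to take aspirin; the inverse-probability-weighted difference sits near zero, matching the true (null) effect built into the simulation.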

NOAH LEAVITT: On some level, you’re a little bit at the mercy of the data available to you, where if the data on a particular intervention, or the variables you’re talking about, you might not be able to proceed. Is that a challenge that researchers find themselves running into a lot, where they do want to do this type of research, but the data just doesn’t exist yet?

MIGUEL HERNAN: Absolutely, absolutely. And that is one of the first decisions that all researchers have to make. They may want to answer certain important questions. They look at the data that they have, and sometimes you just have to decide that there’s not enough data there, that you cannot provide an accurate answer. Maybe because you have, again, very good data on aspirin, very good data on stroke, but you don’t have very good data on the reasons why people take aspirin. When that happens, you probably have to stop there and not try to use the observational data, the big databases.

On the other hand, there are many other examples in which we do have enough data to give an approximate answer. And we can also explore the data in ways that give us confidence in our answer. So we can do parallel analyses that show that it is unlikely that our results are explained by differences between the groups. That is what we sometimes refer to as sensitivity analyses, which are a very important part of the analysis of big databases.

NOAH LEAVITT: And in doing this sensitivity analysis, is that something where like controlling for like confounding variables comes in? Or is a confounding variable more what you were talking about, with like the reasons someone would take aspirin?

MIGUEL HERNAN: There are many different types of sensitivity analysis. A type that we like a lot is something known as negative controls. The way this works is– let me give you an example.

A few years ago, we conducted a study using a large database of electronic medical records. And we wanted to estimate the effect of statins, which are a treatment for cholesterol, on diabetes. So we found that people who initiated statin therapy had a 10% or so increased risk of diabetes compared with people who didn’t.

Now this might be due to many reasons. And one of the reasons is that people who start statins are, by definition, seeing their doctors more often. So it is possible that statins do not really increase the risk of diabetes. What happens is that you start the statins, you go to the doctor more often, and you are more likely to be diagnosed with diabetes that you already had and that would not have been diagnosed otherwise.

So how can we learn from the data whether that is likely to be the explanation or not? We can use a negative control, meaning we can find another outcome, which is not diabetes, that is not expected to be associated in any way with the statin therapy, but that could also be increased if you go to a doctor often.

For example, gastric ulcer– some people may have some symptoms of ulcer. But they are not diagnosed when the symptoms are mild, unless they go to the doctor for other reasons. So we did the same analysis that we have done for statins and diabetes, but now for statins and an ulcer. And we found that there was absolutely no association between the statins and ulcer. So that gives us some confidence that the association that we have found between the statins and diabetes was not due to visiting the doctor more often.
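The logic of that negative-control check can be sketched in a few lines (a hypothetical simulation; the rates are invented, and the real study’s data were far richer). Here statins are simulated to raise diabetes risk by roughly 10% and to have no effect on gastric ulcer, so the primary comparison is elevated while the negative-control comparison sits near 1:

```python
import random

random.seed(1)

# Hypothetical cohort: statins truly raise diabetes risk ~10% here,
# and have no effect on gastric ulcer (the negative-control outcome).
N = 100_000
rows = []
for _ in range(N):
    statin = random.random() < 0.3
    diabetes = random.random() < (0.110 if statin else 0.100)
    ulcer = random.random() < 0.030  # unrelated to statin use
    rows.append((statin, diabetes, ulcer))

def risk_ratio(outcome_index):
    # Risk in statin initiators divided by risk in non-initiators.
    treated = [r for r in rows if r[0]]
    untreated = [r for r in rows if not r[0]]
    r1 = sum(r[outcome_index] for r in treated) / len(treated)
    r0 = sum(r[outcome_index] for r in untreated) / len(untreated)
    return r1 / r0

rr_diabetes = risk_ratio(1)
rr_ulcer = risk_ratio(2)
print(f"statin vs diabetes risk ratio: {rr_diabetes:.2f}")
print(f"statin vs ulcer risk ratio:    {rr_ulcer:.2f}")
```

If surveillance bias (more doctor visits among statin users) were driving the diabetes result, the ulcer ratio would be elevated too; a ratio near 1 for the negative control is what lends confidence to the primary finding.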

NOAH LEAVITT: Interesting, so you almost like test something that’s unrelated to kind of validate what you’re doing in the study.


NOAH LEAVITT: That’s really interesting. To continue with the statin example, if I, or someone at home, am reading a story about statins and cholesterol– I guess the first thing would be to check with your doctor– but if someone is reading a study about the latest findings on statins, what are some things they should keep in mind when they’re reading news coverage of this type of research? To maybe find out, OK, this is something that is worth paying attention to, worth digging a little deeper into?

MIGUEL HERNAN: There are a few things that you have to pay attention to. One is, of course, whether there is appropriate adjustment for the differences between treatment users and non-users. Another one is how the treatment group is actually defined.

Because you can define treatment group in such a way that guarantees that treatment is going to be good, but it has nothing to do with the true effect of treatment. And one example is the use of statins in cancer patients.

Imagine that you define statin use in cancer patients and say, well, anyone who, after a cancer diagnosis, starts statins in the next four or five years will be in the statin user group. And everyone who doesn’t start will be in the non-user group.

Now imagine that someone dies one year after cancer diagnosis. That person has very little chance of being in the user group, because they have died very early. So that person will be automatically put in the non-user group.

That means that just by defining users and non-users in that way, we guarantee that non-users will have a shorter survival time than users. And that is a type of bias that is sometimes known as immortal time bias, because someone who is a user– who has started the statins four years after the diagnosis of cancer– is by definition immortal for those four years.

So that type of classification of users and non-users is as important as, or more important than, the proper adjustment for differences between groups, and sometimes it does not receive enough attention when reading a paper, or from the media when they report on a paper.
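Immortal time bias is also easy to reproduce in a toy simulation (hypothetical numbers, for illustration only): survival times are drawn with no treatment effect at all, yet anyone classified as a "user" must have survived long enough to reach their statin start date, so users appear to live longer.

```python
import random

random.seed(2)

# Survival has a mean of 6 years and does NOT depend on statins.
# A "user" is anyone who starts statins within 5 years of diagnosis,
# which is only possible if they are still alive at the start date.
N = 50_000
user_times, nonuser_times = [], []
for _ in range(N):
    survival = random.expovariate(1 / 6.0)   # years; no treatment effect
    intended_start = random.uniform(0, 5)    # planned statin start time
    would_start = random.random() < 0.5
    if would_start and survival > intended_start:
        user_times.append(survival)          # survived long enough to start
    else:
        nonuser_times.append(survival)       # includes all early deaths

mean_user = sum(user_times) / len(user_times)
mean_nonuser = sum(nonuser_times) / len(nonuser_times)
print(f"mean survival, 'users':     {mean_user:.1f} years")
print(f"mean survival, 'non-users': {mean_nonuser:.1f} years")
```

The "users" look markedly longer-lived purely because of how the groups were defined, which is the self-inflicted injury Hernan describes; defining groups at treatment initiation, as a trial would, avoids it.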

NOAH LEAVITT: So it seems like you’re saying, researchers need to be incredibly strict in setting the parameters. What is the process like if you want to conduct one of these studies? What is that process like, in terms of making sure that you are being strict about when you’re setting the follow-up times? What does that process look like in building out one of these studies?

MIGUEL HERNAN: Well, the funny thing is that we’ve always known how to do this right, because we conduct randomized trials, in which some basic principles of study design and analysis are followed. And the problem is that for some reason when we started to analyze these big databases, we forgot about those basic principles of design.

It turns out that some of the best-known failures of observational research are just the result of not following the same rules that we would follow for a randomized trial. So once we go back to the big databases and we analyze the data– as I said, making sure that we have defined the randomized trial that we would like to mimic, and then mimicking it– if we do that, then we will define our groups correctly, and we will define the follow-up correctly.

And the only thing that is left– and “only” is in quotes here– the only little thing that is left is to adjust correctly for the differences between the groups. That’s always going to be the biggest limitation of observational research from big databases, that we don’t know if we have adjusted for all those differences. But all the other problems, like the immortal time bias or other types of selection bias, et cetera, those are just self-inflicted injuries that we can very easily eliminate.

NOAH LEAVITT: And so we’ve talked a lot of the examples today are questions of effectiveness or safety. So moving forward, how do you see this growing use of big data, how do you see it affecting patients and the care they receive?

MIGUEL HERNAN: Well, it is already affecting patients and the care that they receive, because for many questions, as I said, we’re not going to be able to conduct randomized trials. So the only information will be coming from observational data, for some time to come. It’s possible that in some cases, there will be randomized trials at the end. But in the meantime, we can only use data from large databases.

Again, let me give you an example. A few years ago, there were questions about when is the optimal time to start therapy in patients infected with HIV. There were arguments for and against starting very early in the disease. And there were no randomized trials.

All that we had were observational studies, in which initiation of HIV therapy was not randomized, but you could compare groups that initiated at different times, adjust for the differences between those groups, and try to mimic the target randomized trial as well as possible.

And all of those studies found that early initiation was better than later, than delayed initiation. So the guidelines, the clinical guidelines for the treatment of HIV were changed based on the observational studies. A few years later, a couple of randomized trials were conducted that confirmed what the observational studies had found. But for that period of time, the only thing that we had were observational estimates.

NOAH LEAVITT: Is that, in a way, the ideal scenario, that you conduct the observational research, maybe it influences policy, and then down the line, when you can conduct the randomized control trial, it’s done, it validates what the original study said– is that a situation you think will play out more often in the future?

MIGUEL HERNAN: I think so. I think that this is going to happen more. Of course, this is the ideal situation. It’s possible also that in some cases the randomized trials will not validate what the observational studies found. And in those cases, we will learn something about what it is that we did wrong with the observational data. But in the absence of the randomized trials, the choice is either to make these decisions based on no information at all or based on the limited information that we can obtain from big data.

NOAH LEAVITT: And so just a last question. I know you run a MOOC– a free online course through HarvardX– focused on causal inference. So if people have listened to this podcast, and they’re really fascinated and want to learn more about this, can you tell me a little bit about that course and what it focuses on, and what you hope the course participants will learn?

MIGUEL HERNAN: Well, that is a course that describes the theory of causal graphs in non-technical terms. Causal graphs are a very helpful tool, because that’s how we express the assumptions that we have, the knowledge that we have, about a causal problem. And based on a few graphical rules you learn in the course, you can then make decisions about how to best analyze the data. The title of the course is Causal Diagrams: Draw Your Assumptions Before Your Conclusions. And that is exactly what it is about: how to draw causal graphs that summarize your causal assumptions, so that you can then extract conclusions from the data in the best possible way.


That was our interview with Miguel Hernan on big data and public health.

And as you heard us discuss at the end there, he does offer a free online class through HarvardX. If you’re interested in registering, we’ll have a link on our website.


Hernan has also written a free book on causal inference, and we’ll have a link to that as well.


That’s all for this week’s episode. A reminder that you can always find this podcast on iTunes, Soundcloud, and Stitcher.

January 25, 2018 — Researchers are now harnessing vast amounts of information to assess what works in medicine and public health. In this week’s podcast, we explore why this approach holds promise—but why it also comes with potential risks. You’ll hear from Miguel Hernan, Kolokotrones Professor of Biostatistics and Epidemiology, who is a leading expert in the field of causal inference, which includes comparative effectiveness research to guide policy and clinical decisions. We discussed how researchers are using big data to answer important questions about health—and the safeguards that need to be in place to avoid misleading results.

Learn more

Read Miguel Hernan’s free book on causal inference.

Enroll in the free HarvardX course, Causal Diagrams: Draw Your Assumptions Before Your Conclusions.