John Quackenbush
John Quackenbush, PhD, will serve as the PI and the Director of this proposed Statistical and Quantitative Training in Big Data Health Science program. He is Professor of Computational Biology and Bioinformatics in the Department of Biostatistics. He is a world leader in computational and systems biology, bioinformatics, and genomics, and the application of these disciplines to the study of human diseases, including cancer, Alzheimer’s, and COPD. His work has focused on the integration of diverse large-scale genomic and other data to develop new approaches to understanding human diseases. He has a long history of working in application areas involving what, at the time, was considered “Big Data.” After completing a PhD in theoretical particle physics, he worked for two years as a postdoctoral fellow in experimental high energy physics, participating in particle accelerator experiments that generated terabytes of data every hour and developing methods for analyzing the data.

In 1992, Dr. Quackenbush received a Special Emphasis Research Career Award (a K award) from the National Center for Human Genome Research (the precursor to NHGRI) to work on the Human Genome Project. His career paralleled the genome project, first working on physical mapping, then sequencing strategies, and later gene expression analysis as a component of functional annotation. He has had continuous grant support for 23 years, including funding from the DOE, NSF, and various institutes at the NIH. He received the 2013 White House Champion of Change Award for promoting and using open scientific data and publications in health science research. He has published more than 240 scientific papers that have been cited nearly 49,000 times and has an h-index of 76. He has developed freely available, open-source databases, software systems, and tools that support research; his MeV software tool has been downloaded more than 214,000 times since his group started tracking statistics in 2008.

Dr. Quackenbush has a longstanding commitment to training. During the past decade, he has mentored thirteen postdoctoral fellows and ten predoctoral students, six as primary thesis advisor. His trainees have gone on to take leading positions in industry, academia, and government in the US and around the world. He has served on more than twenty PhD dissertation committees and as an external examiner for students in Australia, Sweden, and Austria. He is currently supervising one PhD student and four postdoctoral fellows. He is actively involved in mentoring junior faculty through direct interaction with several Assistant Professors and as mentor on funded and pending K awards. For the past four years, he was co-Director with Dr. Xihong Lin of the Joint Interdisciplinary Training Program in Statistical Genetics and Computational Biology (NIGMS 5 T32 GM074897), the first joint biostatistics/bioinformatics training grant in the country. Dr. Quackenbush stepped down from that role to serve as Director for this proposed BD2K training program.

In 2014, Dr. Quackenbush led the development and approval process for a new MS in Computational Biology and Quantitative Genetics, and he currently serves as the Director of the MS program. The early success of the program and the feedback we have received from current and prospective graduate students have led us to begin the process of creating a PhD program with a similar focus and an emphasis on Big Data.

Associate Directors

Francesca Dominici

Francesca Dominici is Professor of Biostatistics and Senior Associate Dean for Research at Harvard T.H. Chan School of Public Health. Her research focuses on the development of statistical methods for integrating and analyzing large-scale observational data, with the goal of addressing important questions in environmental health science, health-related impacts of climate change, and comparative effectiveness research. Dr. Dominici led the two largest air pollution studies conducted to date: the National Morbidity Mortality Air Pollution Study (NMMAPS) and the National Medicare Cohort Air Pollution Study (MCAPS), which included 20 million people. She also led the effort to make the entire NMMAPS database available and to provide open-source software to facilitate the reproduction of study results. Dr. Dominici is the PI, with Dr. Xihong Lin, of the NCI-funded P01 project, “Statistical Informatics for Cancer Research.” Dr. Dominici oversees the management and analysis of several administrative databases, including Medicare files and SEER-Medicare, both linked to air pollution, weather, and socioeconomic data.

Dr. Dominici is strongly committed to training students. From 2004 to 2009, while a professor at Johns Hopkins University, Dr. Dominici mentored four postdoctoral and nine pre-doctoral students, four as primary advisor. From 2010 to 2014, Dr. Dominici mentored twelve postdoctoral and seven pre-doctoral students. Almost all of her trainees have taken academic faculty positions in leading biostatistics departments, including Johns Hopkins, University of Munich, Harvard University, and University of Washington. She has served on over thirty dissertation committees with students from biostatistics, statistics, epidemiology, environmental health, and global health. She is also actively involved in mentoring junior faculty.

Dr. Dominici’s input and perspective are particularly important because her work represents Big Data public health applications outside the domain of medical and genomic research. Her participation will help assure that students have appropriate training to deal with such data. In her role as Senior Associate Dean for Research, Dr. Dominici has been tasked with developing a high performance computing strategy that will serve Harvard T. H. Chan School of Public Health into the foreseeable future. As such, her participation will help to assure that the students being trained through the program will have access to computational resources sufficient to enable their research projects.

Rafael Irizarry
Rafael Irizarry is Professor of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and Harvard T. H. Chan School of Public Health. For the past sixteen years, Dr. Irizarry has worked as an applied statistician in diverse areas including cancer, neurobiology, physiology, and infectious diseases. During the last fourteen years, he has focused on genomics and computational biology problems. He has worked on the analysis and pre-processing of microarray and next-generation sequencing data. He has an extensive publication record describing his own methods as well as collaborative research. Dr. Irizarry also develops open source software implementing his statistical methodology. His software tools are widely used, and he is one of the leaders and founders of the Bioconductor Project, an open source and open development software project for the analysis of genomic data. His participation will help assure that the program provides students the appropriate level of statistical rigor while addressing relevant biological problems.

Dr. Irizarry is strongly committed to training students. He has mentored six pre-doctoral students and six post-doctoral students. Almost all of his trainees have taken academic faculty positions in leading biostatistics and computer science departments, including Johns Hopkins, Emory, and Brown University. He is currently training two pre-doctoral students and five post-doctoral students. Dr. Irizarry is deeply involved in open access education in data science. Since he joined the Harvard T. H. Chan School of Public Health and Dana-Farber Cancer Institute in fall 2013, he has taught a HarvardX massive open online course on Data Analysis for Genomics that was released online in April 2014; it attracted more than 16,000 registrations in two months. In the fall of 2014 and 2015, he co-taught a very popular course on Data Science for Harvard undergraduates with Verena Kaynig-Fittkau of the Harvard School of Engineering and Applied Sciences. In Spring 2016 he will teach the new course Introduction to Health Data Science at Harvard T. H. Chan School of Public Health.

Xihong Lin
Xihong Lin is Chair of the Department of Biostatistics, Professor of Biostatistics, and Coordinating Director of the Harvard T. H. Chan School of Public Health Program on Quantitative Genomics (PQG). Dr. Lin is a leading expert in statistical genetics and genomics and statistical methods for epidemiological data. Her research interests lie in the development and application of statistical and computational methods for analysis of high-throughput genetic and genomic data from studies in genetic and epigenetic epidemiology, environmental genomics, and medical genomics. She is a leading expert on genome-wide association studies, whole genome sequencing association studies, gene-environment interactions, genome-wide DNA methylation studies, pathway and network analysis, and integrative genetics and genomics. Her methodological research is supported by the NCI MERIT and Outstanding Investigator Awards. She is also the contact PI of the NCI P01 grant, with Dr. Dominici, on Statistical Informatics in Cancer Research. Dr. Lin has collaborated extensively with molecular and genetic epidemiologists and environmental health scientists at Harvard T. H. Chan School of Public Health and Harvard Medical School. She is the lead statistician in the Harvard T. H. Chan School of Public Health lung cancer GWAS, the acute lung injury GWAS consortium, and the sleep apnea genetic epidemiology consortium. She is the Director of the Biostatistics and Bioinformatics Core of the Harvard T. H. Chan School of Public Health Superfund Research Program and the lead statistician of several ongoing environmental epigenetic studies.

Dr. Lin has a strong training record. She has trained twenty-one doctoral students as the primary dissertation advisor and eleven postdocs. She is currently supervising six PhD students and three postdoctoral fellows. Almost all of her former PhD students and postdocs have taken academic faculty positions in leading biostatistics departments, such as those at the University of Michigan, University of Pennsylvania, Duke University, Fred Hutchinson Cancer Research Center, and MD Anderson Cancer Center. She has served on over 70 PhD dissertation committees, with students from fields ranging from biostatistics, epidemiology, and environmental health to health policy and management. She is currently serving on five dissertation committees as a member.

As mentioned previously, Dr. Lin is the co-Director of the Interdisciplinary Training Program in Statistical Genetics/Genomics and Computational Biology (NIGMS 5 T32 GM074897). She will help assure coordination of training activities between this proposed program and the one that she currently directs. As Chair and Coordinating Director of the PQG, she will also help to assure that resources are made available to support students in this program. Her involvement as an Associate Director is a sign of the commitment that she, as Chair, has made to the overall success of this training program.

David Parkes
David Parkes is George F. Colony Professor of Computer Science at the John A. Paulson School of Engineering and Applied Sciences (SEAS), Harvard University. Dr. Parkes is a world-recognized expert in topics at the interface between computation and economics, specifically as they relate to the application of machine learning and optimization to problems in economics and, in the other direction, to the integration of incentive and fairness considerations into computer science.

His research has focused on the use of statistical machine learning for the design of incentive-aligned mechanisms for resource allocation and social choice; the alignment of incentives in the context of experimental design where the treatments may adapt their behavior to perturb the outcome of an experiment; the use of inferential techniques for experimentation in adaptive systems where the outcomes of interest are long-term but inference must be made based on short-term data; the use of statistical machine learning for modeling rank data and for elicitation and optimization-based approaches to social choice; and the analysis of large-scale data sets from deployed social systems, such as Doodle-style calendaring systems and taxi pick-ups and drop-offs in New York City, to understand individual and group behavior.

Dr. Parkes has trained more than 30 postdoctoral fellows and graduate students and has mentored a number of junior faculty members since joining the Harvard John A. Paulson School of Engineering and Applied Sciences in 2001. He was a faculty lead on a joint PhD program (Technology, Information, and Management) between SEAS and Harvard Business School, and is currently the Area Dean for Computer Science with responsibility for all aspects of the program: undergraduate teaching and advising, graduate teaching and advising, and faculty mentoring and growth. He is a member of a cross-university committee on data science. His broad-based expertise in computer science, machine learning, and economics and their connection with data science, his extensive previous mentoring experience, and his involvement with strategic planning for data science across Harvard make him ideally suited to serve as an Associate Director for this proposed training program. His involvement will help to assure that the program remains at the cutting edge of data science training with access to the resources available across Harvard.