Data Science for Statisticians Short Course – 3/19

Harvard Catalyst Biostatistics Program Short Course 

Data Science for Statisticians: Visualization, Data Wrangling and an Introduction to Machine Learning

March 19, 2019 | 1:00 PM – 5:00 PM
Kresge G2 | Harvard T.H. Chan School of Public Health


Rafael Irizarry is a Professor of Applied Statistics at Harvard and the Dana-Farber Cancer Institute. He was recently named Chair of the Department of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and is a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health.


We will assume that you have basic knowledge of R and that you own a laptop with R and RStudio installed. We will not be teaching R basics.

Using case studies from world health and economics, demographic registry data from Puerto Rico, and hand-written digits, we will demonstrate how to use modern R packages such as ggplot2 and dplyr to visualize and wrangle data. The data visualization part will include a session on visualization principles. The data wrangling part will be particularly useful to statisticians who want to cut their (expensive) dependence on SAS. We will then introduce the basics of machine learning and show how to use the caret package to make predictions.
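To give a flavor of the wrangle-then-visualize workflow described above, here is a minimal sketch. It assumes the gapminder data frame that ships with dslabs, which may or may not be the data set used in the course; the year, the summary statistic, and the variable names by_continent and p are illustrative choices, not course material.

```r
library(dplyr)
library(ggplot2)
library(dslabs)  # provides the gapminder data frame

# Wrangle: keep one year, drop missing values, then average by continent.
# This is an unweighted mean across countries, not a population-weighted one.
by_continent <- gapminder %>%
  filter(year == 2010, !is.na(life_expectancy)) %>%
  group_by(continent) %>%
  summarize(mean_life_expectancy = mean(life_expectancy),
            n_countries = n())

# Visualize: one bar per continent, ordered by the summary value.
p <- ggplot(by_continent,
            aes(x = reorder(continent, mean_life_expectancy),
                y = mean_life_expectancy)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean life expectancy in 2010 (years)")

print(p)
```

The dplyr verbs (filter, group_by, summarize) produce a small summary table, and ggplot2 then maps its columns to the axes of the plot.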


Basic knowledge of R. For example, you should know how to define a numeric vector, how to access the elements of a data frame, and how to write a function.

A Wi-Fi enabled laptop with R and RStudio installed.
The tidyverse, dslabs and caret packages installed. The four expressions below should each return TRUE when you run them in R.

getRversion() >= "3.5.0"
packageVersion("tidyverse") >= "1.2.1"
packageVersion("dslabs") >= "0.5.1"
packageVersion("caret") >= "6.0.80"
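
If one of the package checks above fails with an error because the package is not installed, something like the following sketch may help. The helper missing_packages and the variables course_packages and to_install are hypothetical names introduced here for illustration; they are not part of the course material.

```r
# Hypothetical helper: returns the packages in `required` that do not
# appear in `installed` (by default, the packages installed locally).
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

course_packages <- c("tidyverse", "dslabs", "caret")
to_install <- missing_packages(course_packages)

# Install only in an interactive session, so that sourcing this file
# from a script never triggers downloads unexpectedly.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

This only installs packages that are absent. If a package is present but older than the required version, update.packages() can bring it up to date.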