Quantitative Image Analysis – Predicting genomic effects in high-throughput microscopy

Intro/motivation

Recently, machine-learning techniques, particularly “deep learning” methods, have garnered significant attention for their potential utility in biomedical applications. Previously, predictive models often relied on expert knowledge to select a set of known “features”; a simple example is the prediction of cardiovascular disease using a panel of covariates such as smoking history, HDL, and other relevant variables. However, when trying to extract useful information from biomedical images, such as pathology slides or CT images, such “hand-curated” models are less straightforward to build. Trained humans are typically able to derive meaningful insight from these images (albeit with varying success and biases), and the challenge is whether this process can be robustly automated.

The availability of large, specialized computing resources has allowed artificial intelligence researchers to build increasingly complex deep “convolutional neural networks” (CNNs), which are able to “automatically” capture predictive features in image data once appropriately trained. Previously, these techniques were infeasible due to computational constraints; large CNNs often have millions of parameters that are adjusted during the training process. Such methods are perhaps less satisfying since they often do not yield easily interpretable models. By contrast, significant covariates, such as smoking history, are immediately interpretable in a more conventional generalized linear model. Nevertheless, CNNs are currently the workhorse of many modern artificial intelligence systems and have proven indispensable for analyzing image data.

The Cellular Recursion challenge

In this exercise, described here, we used state-of-the-art CNNs and related techniques to classify images derived from high-throughput fluorescence microscopy assays. In similar work, such methods have been used to identify the “mechanism of action” of small-molecule compounds in high-throughput screening. In these images, various siRNA treatments (1108 in total) are applied to several cell lines. The genetic perturbations introduced can alter the appearance of the cells, so one can envision how such models might be informative for understanding more general genotype-phenotype relationships. Ultimately, the goal is to use our model to predict the siRNA treatment on an unseen set of images, as one might do for a novel clinical sample.

The microscopy data is arranged in a hierarchical manner, depicted in Figure 1:

For each of the four cell lines, there are a number of experiments, which may be regarded as biological replicates. Within each experiment, each of the 1108 siRNA is tested on one of four 384-well plates. The siRNA are assigned to plates in fixed “groups” of 277, so the four plates together cover all 1108. In Figure 1, we see that within experiment “EXP_1”, siRNA group “A” is featured on plate 1. In “EXP_N”, those same 277 are on plate 3. In addition to the 1108 siRNA, other wells are filled with 30 “positive control” siRNA, which are featured on every plate within an experiment. In Figure 2, we show two such experiments. The blue points denote the non-control siRNA (1108), orange points denote the 30 control siRNA, and the grey points denote wells that are not used for technical reasons. The three lines show how the same siRNA are featured in each experiment, but appear at different well positions on different plates.
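The plate arithmetic described above can be checked with a short sketch. The counts come from the text; the variable names are illustrative, and the sketch assumes that the only occupied wells are the treatment and positive-control wells mentioned here.

```python
# Counts taken from the text; all names are illustrative.
N_SIRNA = 1108         # non-control siRNA tested in each experiment
N_PLATES = 4           # plates per experiment
N_CONTROLS = 30        # positive-control siRNA repeated on every plate
WELLS_PER_PLATE = 384  # wells on each plate

# Each plate carries one group of non-control siRNA.
group_size = N_SIRNA // N_PLATES
assert group_size * N_PLATES == N_SIRNA  # the groups divide evenly

# Wells occupied on one plate, assuming only treatments and positive controls.
occupied_wells = group_size + N_CONTROLS

# Wells left over on one plate under that assumption.
remaining_wells = WELLS_PER_PLATE - occupied_wells

print(group_size, occupied_wells, remaining_wells)  # 277 307 77
```

Under these assumptions, roughly a fifth of each plate is not used for treatment or positive-control wells.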

Diving deeper, consider a single well (which is contained on a particular plate, in a particular experiment, for a certain cell line). In this well (where we are testing a single siRNA), two sites are imaged. These may be regarded as technical replicates, and ideally can provide a reasonable sample of the perturbed cell population within that well.

The microscopy images that form the basis of our model are acquired at each site with six different filters, each capturing fluorescence emissions within a target wavelength band. The fluorescence signals are designed to highlight particular components of the cell; for example, DNA-binding stains allow visualization of the cell nuclei. In Figure 3, we show these six images:

Results and outcomes

After splitting the data into training and validation sets, we trained our CNN on GPU-enabled machines over several days. Due to their high number of parameters, CNNs are quite prone to overfitting if certain precautions are not taken. That is, given enough training time, the model would be able to generate perfect predictions on the training data while generalizing poorly to new images. By evaluating the model on unseen validation data, we can judge more accurately when a model has completed its training.
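One common way to use validation data for this purpose is “early stopping”: keep the model from the epoch with the lowest validation loss, and stop once the loss has failed to improve for a set number of epochs. The text does not say which stopping criterion was actually used, so the following is only a minimal sketch; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def best_epoch(val_losses, patience=3):
    """Return the index of the epoch with the lowest validation loss.

    Scanning stops once `patience` consecutive epochs fail to improve on
    the best loss seen so far, mimicking an early-stopping rule.
    """
    if not val_losses:
        raise ValueError("val_losses must contain at least one value")
    best_idx = 0
    best_loss = float("inf")
    waited = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_idx = epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stalled; stop training here
    return best_idx


# Made-up validation losses: they improve, then drift upward as the
# model begins to overfit the training data.
example_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
stop_at = best_epoch(example_losses, patience=3)
print(stop_at)  # 2
```

A larger `patience` tolerates noisier validation curves at the cost of extra training time.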

For the CNN we implemented, our model was able to correctly predict the siRNA treatment for approximately 90% of the wells on an independent test set. Achieving this level of accuracy was partially dependent on exploiting the structure of the experiment; recall that siRNA occurred in groups of 277. Raw prediction accuracy, that is, predicting the correct siRNA out of all 1108 without this constraint, was closer to 65-75%.
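The text does not spell out how the group structure was exploited. One plausible reading is that, once the siRNA group on a plate is known, predictions for wells on that plate are restricted to that group's 277 candidates instead of all 1108. The sketch below illustrates the idea on a toy problem with six classes; the function names, the scores, and the assumption that the plate's group is already known are all hypothetical. Averaging the two imaged sites of a well is likewise an assumption, not a step described in the text.

```python
def average_sites(site_scores):
    """Average per-class scores across the imaged sites of one well."""
    n_sites = len(site_scores)
    n_classes = len(site_scores[0])
    return [sum(site[c] for site in site_scores) / n_sites
            for c in range(n_classes)]


def predict_within_group(scores, allowed_classes):
    """Return the highest-scoring class among `allowed_classes` only."""
    return max(allowed_classes, key=lambda c: scores[c])


# Toy example: 6 classes instead of 1108, groups of 3 instead of 277.
site_a = [0.10, 0.50, 0.10, 0.20, 0.05, 0.05]  # made-up model scores
site_b = [0.20, 0.30, 0.10, 0.30, 0.05, 0.05]
well_scores = average_sites([site_a, site_b])

# Unrestricted prediction considers every class.
raw_prediction = predict_within_group(well_scores, range(6))

# Restricted prediction considers only the group assumed to be on this plate.
plate_group = [3, 4, 5]
group_prediction = predict_within_group(well_scores, plate_group)

print(raw_prediction, group_prediction)  # 1 3
```

In the toy example the unrestricted prediction picks class 1, which cannot occur on this plate, while the restricted prediction picks the best candidate from the plate's own group. This is the kind of correction that could account for part of the gap between the raw and structured accuracies reported above.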