Abstract

Motivation

The identification and discovery of phenotypes from high content screening images is a challenging task. Earlier works rely on image analysis pipelines to extract biological features, on supervised training methods, or on features generated by neural networks pretrained on non-cellular images. We introduce a novel unsupervised deep learning algorithm to cluster cellular images with similar Mode-of-Action (MOA) together using only the images' pixel intensity values as input. It corrects for batch effects during training. Importantly, our method does not require the extraction of cell candidates and works directly from the entire images.

Results

The method achieves competitive results on the labeled subset of the BBBC021 dataset, with an accuracy of 97.09% for correctly classifying the MOA by nearest-neighbor matching. Importantly, we can train our approach on unannotated datasets. Therefore, our method can discover novel MOAs and annotate unlabeled compounds. The ability to train end-to-end on full-resolution images makes our method easy to apply and allows it to further distinguish treatments by their effect on proliferation.

Availability and implementation

Our code is available at https://github.com/Novartis/UMM-Discovery.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

High content screening (HCS) (Buchser et al., 2004; Caicedo et al., 2017; Götte et al., 2010) is a phenotypic screening method leveraging high-throughput microscopy imaging to explore and identify perturbations, here low molecular weight compounds, with phenotype-altering effects on cells. Experiments are performed on plates with multiple cell-containing wells. The cells are stained with different fluorescent probes highlighting distinct cellular components. Each well is treated with compounds at various concentrations. The goal is to perform quantitative analysis of the resulting images to uncover what happens in the cells upon treatment. In particular, this can yield a better understanding of the mechanism by which cells are affected by various treatments. This is called the Mode-of-Action or sometimes the Mechanism-of-Action (MOA). HCS is a high-throughput method that can generate up to millions of images per experiment, therefore requiring extensive computational analysis.

Several methods have been developed to tackle this problem. For example, Ljosa et al. (2013) as well as Singh et al. (2014) extract cell embeddings with the CellProfiler software (Carpenter et al., 2006) to predict the MOA based on nearest neighbors' annotations, and Kraus et al. (2016) use supervised convolutional multiple-instance learning for microscopy image classification and segmentation. Ando et al. (2017) follow a similar idea but use a pretrained deep metric network to extract features from individual cells segmented from larger images. The work of Godinez et al. (2017) contains a multi-scale neural network that makes the extraction of cell regions redundant, as the neural network picks up local cell features as well as global structures, such as cell density, from the full image. The neural network is trained in a supervised setup on an annotated subset, using known MOAs as labels, and applied to the full dataset. In a real-world scenario, however, we often do not know the MOA of treatments, or only have a partial understanding of it, which limits the application of such methods to well-annotated datasets. This common lack of MOA annotation is a strong limitation of approaches based on a priori knowledge and annotation. In more recent papers, image metadata are used as pseudo-labels to train the model in a self-supervised manner (Caicedo et al., 2018; Godinez et al., 2018; Spiegel et al., 2019), with promising results. Such metadata typically comprise compound identifier, dose and treatment duration. One concern is that this approach considers similar treatments (e.g. small dose variations of the same compound) as dissimilar and treats them as completely different ones (e.g. different MOAs). In addition, batch effects could cause images with identical metadata to look different. All the methods above, besides Ljosa et al. (2013) and Singh et al. (2014), use deep neural networks trained either in a supervised (Godinez et al., 2017; Kraus et al., 2016) or self-supervised setup (Caicedo et al., 2018; Godinez et al., 2018; Spiegel et al., 2019) on pre-annotated datasets, or make use of a neural network pretrained on non-cellular images (Ando et al., 2017; Pawlowski et al., 2016; Tabak et al., 2020). Neural networks trained on non-cellular images, or on a mix of cellular and non-cellular images, have been shown to perform well on cellular images (Ando et al., 2017; Pawlowski et al., 2016; Stringer et al., 2021). An unsupervised method trained on cellular images and based on variational autoencoders was developed by Lafarge et al. (2019). In parallel, Caron et al. (2019) developed an unsupervised method, applied to non-HCS data, achieving results close to supervised ones by using a standard clustering algorithm to create pseudo-labels during training. We note that all methods cited above, except those developed by Godinez et al. and Pawlowski et al., require first extracting single-cell candidates from the original images, a step that often requires manual tuning for specific assays (Godinez et al., 2018). In addition, focusing on single-cell sub-images prevents such methods from detecting population-level effects of treatments, such as proliferation effects.

Inspired by Caron et al. (2019), Ando et al. (2017) and Godinez et al. (2018), we developed a fully unsupervised neural network for clustering similar phenotypes together, based solely on the intensity values of the cellular images. Following the findings of Godinez et al. (2017), we make the extraction of cellular candidates redundant by using an updated multi-scale neural network. The backend is built on top of the deep clustering framework from Caron et al. (2019). Additionally, we show that combining two batch correction methods, Typical Variation Normalization (TVN) (Ando et al., 2017) and Combat (Johnson et al., 2007), during training significantly improves the results and creates more representative embeddings. We call our method UMM Discovery (Unsupervised Multi-scale Mode-of-action Discovery) and show that it reaches an accuracy similar to existing state-of-the-art (SOTA) methods on the BBBC021 dataset (Caie et al., 2010). To evaluate the proposed method in a real-world scenario, we trained it, unlike other methods, on the entire dataset (without MOA labels) and show that it can be used for novel MOA discovery. We also show that our method is able to capture the proliferation effect of treatments. As validation, we show that UMM Discovery was able to produce novel findings on the 1008 Tales dataset (Cox et al., 2020).

2 Materials and methods

2.1 Our approach: Unsupervised Multi-scale Mode-of-action Discovery (UMM Discovery)

We developed a fully unsupervised deep learning method to cluster cellular images with similar MOA together. We did so by applying a deep clustering framework developed by Caron et al. (2019), called DeepCluster, on cellular imaging. Following Godinez et al. (2018) findings, we used an updated version of the Deep Neural Network (DNN) architecture, called Multi-Scale-Net.

Together, these two methods create a fully unsupervised algorithm, which we call UMM Discovery, that clusters similar images together. Figure 1 illustrates our combined approach. The algorithm uses no prior knowledge of the treatments beyond their respective identifiers and relative concentrations. The DNN takes solely unannotated images as input, outputs one 64-dimensional feature vector per image, and does not require a pre-processing step to extract single-cell candidates.

Fig. 1.

Overview of the UMM Discovery architecture, combining an adjusted Multi-Scale NN and the DeepCluster backend. Training of the NN proceeds in two steps. (Step 1) Green arrows: images are fed through the NN, where features are extracted at different scales to create one representative embedding per image. The embeddings are batch corrected with Combat and TVN and preprocessed by t-SNE and L2 normalization. The normalized embeddings are clustered with k-means; the cluster assignments from k-means are the new pseudo-labels, and the number of clusters becomes the size of the classification layer. (Step 2) Blue arrows: after creation of the pseudo-labels, the training of the Multi-Scale network starts, where images are forwarded through the NN and the last classification layer to obtain a prediction for each image. The weights of the Multi-Scale NN are optimized by computing the cross-entropy loss between the prediction and the pseudo-label and backpropagating the loss through the NN

2.1.1 Deep clustering

We built upon the deep clustering framework developed by Caron et al. (2019). The framework consists of (i) a neural network for feature extraction, defined as f_θ : X → R^D, which takes as input a cellular image x ∈ X and maps it to a feature embedding z of dimension D, and (ii) a standard clustering algorithm for generating a pseudo-label y for each x. The training of the parameters θ can be summarized as follows. All images x ∈ X are fed through the neural network to create representative embeddings z = f_θ(x) ∈ R^D. A dimensionality reduction algorithm is applied to remove dimensions with low variance. Next, the embeddings Z are clustered with a standard clustering algorithm such as k-means, assigning each embedding z to one of K clusters. The resulting cluster assignment is used as pseudo-label y, and a classification layer g_w with parameters w and K units (one per cluster) is added to the end of the neural network to predict the labels y from the feature embeddings z. We can now train the neural network with the cluster assignments as pseudo-labels by optimizing

    min_{θ,w} (1/N) ∑_{n=1}^{N} ℓ( g_w(f_θ(x_n)), y_n ),

where ℓ is the multinomial logistic loss and N is the number of images. This process is repeated at each epoch.
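This alternating procedure can be sketched in a few lines. The feature extractor `f_theta` is a stand-in here (any function mapping an image to a D-dimensional vector), and the k-means routine is a minimal reimplementation for illustration rather than the implementation used in the paper; the cross-entropy update of the network with the resulting pseudo-labels is omitted.

```python
import numpy as np

def kmeans(Z, k, iters=20, seed=0):
    """Plain k-means on row vectors Z; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every embedding to every center
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(0)
    return labels

def deepcluster_epoch(images, f_theta, k):
    """One epoch of the alternating scheme from Caron et al. (2019):
    (1) embed all images, (2) cluster the embeddings into k groups to
    obtain pseudo-labels y; the pseudo-labels would then supervise a
    cross-entropy update of f_theta and a fresh classification head g_w
    (that training step is omitted in this sketch)."""
    Z = np.stack([f_theta(x) for x in images])   # z = f_theta(x) in R^D
    y = kmeans(Z, k)                             # pseudo-labels
    return Z, y
```

In the full method, the dimensionality reduction and batch correction steps described below are applied to `Z` before clustering.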

2.1.2 Multi-scale network

Next, we implemented the multi-scale neural network architecture from Godinez et al. (2017) and modified it by adding residual layers, dilation within the convolution layers and a collapse layer. The updated version of the multi-scale network is shown in Supplementary Figure S1. The method takes pre-processed images as input and performs a parallel analysis over several scales, allowing it to detect local features as well as population-level changes. This makes it possible to capture the phenotype over a large spatial region, so no extraction of cellular candidates is necessary.

Given a pre-processed image with spatial dimensions w×h, the neural network considers downscaled images with dimensions w/s×h/s. The neural network contains five multi-scale blocks operating at downscaling factors s = 1, 2, 4, 8, 16. Each multi-scale block consists of three convolutional layers: the first with a kernel size of 3 and a dilation of 1, the second with a kernel size of 3 and a dilation of 2, and the last with a kernel size of 1. This 1×1 convolutional layer can be seen as a fully connected layer between the feature maps. It is followed by a max pooling operation. Unlike the original multi-scale network from Godinez et al. (2017), the network includes residual layers. Instead of only considering a downscaled version of the original image, the downscaled image is concatenated with the pooled kernel maps of the last convolution layer and passed through an additional 1×1 convolutional layer. After the last multi-scale block, the spatial size is 16×20 pixels with 128 feature maps. A collapse layer is then applied: adaptive average pooling and adaptive maximum pooling, each with a kernel covering the full input spatial dimensions, are computed in parallel. These two pooled outputs are weighted by the trained factor w_f and summed together.

Next, another 1×1 convolutional layer and a fully connected layer with 64 units are applied. The fully connected layer is followed by a ReLU operation during training; during the embedding-generation phase, this ReLU is removed. Batch normalization followed by a ReLU layer is applied after each convolutional layer throughout the whole neural network. With this setup, the neural network takes images as input and outputs embeddings. It consists of 106 961 trainable parameters, which is low in comparison to other state-of-the-art neural network architectures, for example ResNet-50 with over 25 million parameters.
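As a rough illustration of the collapse step, here is a minimal numpy sketch. It assumes the two pooled outputs are mixed by a single scalar w_f; the exact combination learned in the paper (whose equation did not survive extraction) may differ.

```python
import numpy as np

def collapse_layer(feature_maps, w_f=0.5):
    """Collapse the spatial dimensions of a (C, H, W) feature tensor by
    combining global average pooling and global max pooling, weighted by
    a scalar w_f (trainable in the paper; the exact mixing rule is an
    assumption in this sketch)."""
    avg = feature_maps.mean(axis=(1, 2))   # (C,) adaptive average pool
    mx = feature_maps.max(axis=(1, 2))     # (C,) adaptive max pool
    return w_f * avg + (1.0 - w_f) * mx    # (C,) collapsed descriptor
```

Applied to the 128×16×20 output of the last multi-scale block, this yields a single 128-dimensional vector per tile.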

2.1.3 Clustering algorithms

To generate pseudo-labels, we tested four different clustering algorithms on the embeddings. The first is standard k-means, whose most important parameter is the number of clusters, also called seeds. In the baseline setup, the number of clusters is set a priori. However, in most real-world scenarios the exact number of clusters is unknown, so we set this parameter to the number of treatments, since identical treatments should in theory cluster together.

The true number of clusters, here the number of MOAs, is unknown. Therefore, two more dynamic clustering algorithms were tested as well: Power Iteration Clustering (PIC) (Lin and Cohen, 2010), an accelerated variant of spectral clustering, and the density-based method HDBSCAN (McInnes et al., 2017). The benefit of these two algorithms is that they do not require the number of clusters to be specified. PIC is very fast on large datasets and can provide results similar to spectral clustering. For PIC and HDBSCAN, we performed hyperparameter optimization to find their best parameters on the annotated subset of BBBC021.

We also implemented an adaptive version of k-means that reduces the number of clusters linearly during training, starting from double the number of treatments and reducing it to the number of compounds. The performance of the different clustering algorithms can be seen in Supplementary Table S1.
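The linear schedule of this adaptive k-means variant can be written as a small helper. The endpoints (`n_treatments`, `n_compounds`) follow the description above; the exact interpolation and rounding are assumptions of this sketch.

```python
def cluster_schedule(epoch, n_epochs, n_treatments, n_compounds):
    """Linear schedule for the adaptive k-means variant: start at twice
    the number of treatments and shrink linearly to the number of
    compounds over the course of training (Section 2.1.3)."""
    start, end = 2 * n_treatments, n_compounds
    frac = epoch / max(n_epochs - 1, 1)        # 0 at first epoch, 1 at last
    return round(start + frac * (end - start))  # cluster count for this epoch
```

For BBBC021 (901 treatments, 113 compounds) this would sweep k from 1802 down to 113.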

2.1.4 Batch correction

Batch effects arise from uncontrollable variations in biological experiments, such as slightly different concentrations of the same treatment, the location of a well on the plate, or the order of a plate within a batch, which can result in different rates of evaporation (Tabak et al., 2020). To correct for this nuisance variation, we tested two batch correction methods: Typical Variation Normalization (TVN) (Ando et al., 2017) and Combat (Johnson et al., 2007). Both methods require additional input information. TVN needs the embeddings corresponding to the control wells (untreated cells) to bring them together. Combat requires more prior knowledge, as it uses the treatment information (e.g. compound identifier and dose) as covariates. In a real-world scenario, treatment information is available at the well level, including which wells are used as controls. We apply TVN on the embeddings (Ando et al., 2017) as follows: all embeddings of the untreated cells are whitened by computing their PCA basis so that these embeddings have zero mean and unit variance, and the same affine transformation is then applied to all embeddings. Like Tabak et al. (2020) but unlike Ando et al. (2017), we do not use CORAL (Sun et al., 2017), as we do not have enough embeddings per plate to compute a positive-semidefinite covariance matrix. Instead, to further correct the batch effect, we use Combat (Johnson et al., 2007), which adjusts for batch effects using empirical Bayes methods. As covariates, we used treatment metadata (e.g. compound and concentration). Unlike other methods, we applied batch correction during training, at each epoch, before dimensionality reduction. The effect of the batch correction can be seen in Supplementary Figure S2.
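A minimal numpy sketch of the TVN step described above: the whitening transform is fitted on the control-well embeddings only, and the same affine map is then applied to all embeddings. The Combat step, which uses treatment covariates, is omitted here.

```python
import numpy as np

def fit_tvn(control_embeddings):
    """Fit Typical Variation Normalization (Ando et al., 2017): a PCA
    whitening transform computed from control (untreated) embeddings,
    giving them zero mean and unit variance in every direction."""
    mu = control_embeddings.mean(axis=0)
    X = control_embeddings - mu
    cov = X.T @ X / max(len(X) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + 1e-8)  # whitening basis V * Lambda^(-1/2)
    return mu, W

def apply_tvn(embeddings, mu, W):
    """Apply the same affine whitening transform to all embeddings."""
    return (embeddings - mu) @ W
```

After `fit_tvn` on the controls, `apply_tvn` is called on treated and untreated embeddings alike, so nuisance variation visible in the controls is divided out everywhere.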

2.1.5 Dimensionality reduction of the embedding

To further improve clustering performance, we tested two dimensionality reduction methods, PCA and t-SNE (Maaten and Hinton, 2008). We implemented an adaptive t-SNE that takes as input only the principal components that together explain at least 95% of the variance and outputs three components.

2.2 Image data and pre-processing

2.2.1 BBBC021 image dataset

We evaluated UMM Discovery on the BBBC021 (Caie et al., 2010) cellular dataset, available from the Broad Bioimage Benchmark Collection and presented in Ljosa et al. (2013). The dataset consists of high content images of MCF-7 breast cancer cells exposed for 24 h to a variety of chemical compounds (drugs). The cells are treated on 55 96-well plates across 10 batches. Three single-channel images are available, showing fluorescently labeled DNA, tubulin and actin. The full dataset covers 113 small molecules at eight concentrations, resulting in 901 treatments and 13 200 fields of view imaged over three channels.

A subset of treatments in BBBC021 is annotated with their MOA. This subset contains 38 compounds dosed at various concentrations, resulting in 103 treatments (3848 fields of view) representing 12 known molecular MOAs. For half of those, the MOA annotation was performed by visual inspection; the annotation of the other half was taken from the literature. Ljosa et al. (2013), Singh et al. (2014), Pawlowski et al. (2016), Ando et al. (2017), Godinez et al. (2017), Caicedo et al. (2018), Lafarge et al. (2019) and Tabak et al. (2020) also evaluated their methods on this gold-standard subset.

2.2.2 1008 Tales dataset

To further validate our method, we evaluated UMM Discovery on two different subsets of the phenotypic profiling assay created by Cox et al. (2020). The entire assay consists of 3 cell lines and 5 different marker combinations, resulting in 15 reporter cell lines in which morphological phenotypes were induced by a library of 1008 reference compounds (approved drugs, well-characterized tool compounds and natural products) at multiple concentrations, totaling 317 430 treated wells across 1200 plates. To capture the phenotypic changes, each reporter cell line carries three organelle or pathway markers.

From this dataset, we created two subsets, a validation subset with only the most distinguishable MOAs according to Cox et al. (2020) for validating the method and a bigger discovery subset for the discovery of novel MOAs.

The discovery subset of the 1008 Tales dataset was generated by focusing on the most active reporter cell line, the A549 lung adenocarcinoma cell line with the markers BFP, ACTB (actin cytoskeleton) and RAB5A (early endosome), named ‘A549-ACTB-RAB5A’. To reduce the dataset size further, the natural products and the MOAs with fewer than three co-annotated compounds were removed, and only compounds with a single annotated MOA were kept. This subset consists of 80 plates across 6 batches and contains 761 compounds at multiple concentrations, resulting in 3040 treatments (45 257 fields of view) representing 110 annotated MOAs.

The smaller validation subset was generated from the discovery subset by keeping only the distinguishable MOAs (AUC-ROC ≥ 0.85) according to Cox et al. (2020). This subset has a total of 780 treatments (16 583 fields of view), derived from 196 compounds at various concentrations, annotated with 22 MOAs.

2.2.3 Preprocessing

We followed the approach of Ando et al. (2017) for preprocessing the images. For each plate and each channel, a flatfield image is computed by taking the 10th percentile of all the intensity values and applying a Gaussian filter with a sigma of 50. Each image is then divided by its respective flatfield image. Pixel intensity values lower than 1 are set to 1 before a log transformation is applied in order to increase the dynamic range. To remove outlier values, any resulting value larger than five is clipped to five. In addition to the preprocessing performed in Ando et al. (2017), the pixel intensities are normalized image-wise and channel-wise by z-score. Unlike Ando et al. (2017) and Tabak et al. (2020), no cell candidates are generated, which eliminates the time-consuming steps of locating and cropping cells in the images. To reduce the input size, images are tiled into four pieces. The advantage of tiling is that we can feed the full-resolution images into the neural network without down-sampling them, and no cell candidates have to be computed. The downside is that some tiles can hold almost no information; aggregating embeddings for the clustering analysis alleviates this problem.
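The per-pixel normalization chain can be sketched per channel as follows, assuming the Gaussian-smoothed 10th-percentile flatfield has already been computed per plate and channel (that step is represented here by a precomputed array).

```python
import numpy as np

def preprocess_channel(img, flatfield):
    """Per-channel intensity normalization following Ando et al. (2017):
    flatfield division, log transform with clipping, then z-scoring.
    `flatfield` stands for the Gaussian-filtered (sigma = 50) 10th-percentile
    image computed per plate and per channel."""
    x = img / flatfield                 # illumination correction
    x = np.log(np.maximum(x, 1.0))      # values < 1 set to 1, then log
    x = np.minimum(x, 5.0)              # clip outlier values above 5
    return (x - x.mean()) / (x.std() + 1e-8)  # image- and channel-wise z-score
```

The resulting image is then split into four tiles before being fed to the network.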

2.3 Embedding generation

After training the multi-scale net, all tiles are fed through the trained neural network to generate one representative embedding of size 64 per tile. In the case of BBBC021, there are four tiles per field of view and four fields of view per well, resulting in 16 tile embeddings per well. As treatments are applied at the well level, we are mostly interested in well-level embeddings; we therefore aggregated the tile-level embeddings by computing their element-wise median, obtaining one embedding of size 64 for each well (Fig. 2). These well embeddings are used for the clustering analysis, to evaluate the clustering performance and to visually discover novel clusters. In this paper, we show that these feature vectors can be used to identify the MOAs of the corresponding treatments.
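The aggregation itself is a one-liner; for BBBC021 the input is the 16 tile embeddings of one well.

```python
import numpy as np

def well_embedding(tile_embeddings):
    """Aggregate the tile-level embeddings of one well into a single
    64-dimensional well embedding by element-wise median (Fig. 2)."""
    return np.median(np.asarray(tile_embeddings), axis=0)
```

The median is robust to near-empty tiles, which is what makes the tiling scheme workable without cell detection.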

Fig. 2.

Embedding generation. Multiple fields of view are imaged per well. Each field of view is split into four tiles during preprocessing. All tiles of a well are forwarded through the neural network to create the corresponding embeddings, which are then aggregated to obtain one embedding per well. This well embedding is used for further analysis

2.4 Evaluation strategy

To evaluate the clustering quality, we made the assumption that images corresponding to the same MOA should cluster together, while images from different MOAs should belong to different clusters. Suitable validation metrics take cohesion (intra-cluster) and separation (inter-cluster) into account (Palacio-Niño and Berzal, 2019). As each MOA represents one cluster, it is important to evaluate how closely the embeddings of treatments with the same MOA cluster together and how well they separate from clusters with different MOAs.

Another evaluation approach relies on using image metadata, by assuming that examples assigned to the same cluster have similar metadata (Palacio-Niño and Berzal, 2019). Because wells with the same treatment should cluster together, we computed the correlation between cluster assignment, MOA ground truth labels and treatment information (e.g. compound and dose).

The annotated dataset is evaluated by first-nearest-neighbor matching with the Not-Same-Compound (NSC) and Not-Same-Compound-and-Batch (NSCB) scores, and by the Silhouette score (SIL), with the MOA annotations as ground-truth labels. For the evaluation of the unannotated dataset, we use the adjusted mutual information (AMI), the completeness score (COM), the Silhouette score (SIL) and the total clustering score (TCS). All these metrics are described in the Metrics section of the Supplementary Material.
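A minimal implementation of the NSC score as described: for each treatment embedding, the first nearest neighbor under Euclidean distance is found among treatments of a different compound, and the fraction of MOA matches is reported. NSCB would additionally exclude neighbors from the same batch.

```python
import numpy as np

def nsc_accuracy(emb, moa, compound):
    """Not-Same-Compound (NSC) score: first-nearest-neighbor MOA matching,
    where neighbors sharing the query's compound are excluded."""
    emb = np.asarray(emb, dtype=float)
    # Pairwise squared Euclidean distances between all embeddings
    d = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    correct = 0
    for i in range(len(emb)):
        mask = np.asarray(compound) != compound[i]    # exclude same compound
        j = np.where(mask)[0][d[i, mask].argmin()]    # nearest allowed neighbor
        correct += moa[j] == moa[i]                   # does its MOA match?
    return correct / len(emb)
```

The same skeleton with a combined compound-and-batch mask gives the NSCB variant.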

2.5 Best epoch determination

To avoid overfitting, we implemented a best-epoch criterion. For the labeled dataset, the best epoch is selected by calculating the NSC and NSCB for each epoch and taking the epoch with the highest score. For the unlabeled dataset, this strategy cannot be applied; in this case, the TCS is used as the best-epoch criterion.

3 Results

First, we established a baseline model, which combines a multi-scale CNN (Godinez et al., 2017) with a k-means deep clustering backend (Caron et al., 2019), and applied it to the annotated subset of the publicly available cellular dataset BBBC021 (Caie et al., 2010). To evaluate the clustering results on the annotated subset, we used the NSC and SIL scores. After tuning the number of PCA components and the number of k-means seeds on the baseline model (Supplementary Fig. S3), we evaluated the influence of different clustering algorithms (Supplementary Table S1).

Then, we incorporated batch effect removal. To verify that there is a batch effect and that it can be reduced by combining Combat and TVN, we performed a one-way ANOVA on the PCA components of the untreated well embeddings before and after batch correction with the previously mentioned methods. Supplementary Table S2 lists the F-values and P-values of the analysis for the eight PCA components. Before batch correction, all P-values are significant (P < 0.05); we therefore conclude that there are significant differences among batches, consistent with the results of Ando et al. (2017) and Tabak et al. (2020). After batch correction, the F-values are considerably lower and the P-values are no longer significant (P > 0.05): we fail to reject the null hypothesis of equal means across batches, indicating no detectable batch differences in the untreated well embeddings. Supplementary Figure S4 shows that TVN and Combat decrease the batch effect. Supplementary Table S3 shows the clustering results for TVN, Combat and their combinations (TVN followed by Combat, and Combat followed by TVN). Every batch correction method significantly improves both NSC and SIL; Combat followed by TVN achieved the best results overall. Furthermore, we found that dimensionality reduction of the embeddings improved clustering performance (Supplementary Table S4).
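The batch-effect test can be reproduced with a hand-rolled one-way ANOVA F-statistic, computed per PCA component with the untreated-well embeddings grouped by batch; this sketch implements the textbook between/within decomposition rather than any particular library call.

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F-statistic across groups (here: the values of one
    PCA component of the untreated well embeddings, grouped by batch).
    A large F indicates that batch means differ, i.e. a batch effect."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    # Between-group sum of squares (variation of batch means around grand mean)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    # Within-group sum of squares (variation inside each batch)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b = len(groups) - 1
    df_w = len(all_x) - len(groups)
    return (ss_between / df_b) / (ss_within / df_w)
```

Applying this before and after batch correction to each PCA component reproduces the comparison summarized in Supplementary Table S2 (the P-value then follows from the F-distribution with df_b and df_w degrees of freedom).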

In Supplementary Figure S5, a t-SNE visualization of the aggregated well embeddings of the BBBC021 subset is presented at various training epochs. Even with the initial untrained network (Supplementary Fig. S5a), a clear separation between the negative control wells and some of the MOAs can be seen. As the training progresses (Supplementary Fig. S5a–d), we observe a very clear separation of the different MOAs.

In the following sections, we first show the results of the optimized UMM Discovery on the BBBC021 annotated subset and compare our method with other approaches; we then present how our method can be applied to unlabeled datasets for MOA discovery, applying it to the entire BBBC021 dataset. Furthermore, we show UMM Discovery's ability to discover novel MOAs in the 1008 Tales dataset.

3.1 Evaluation on the BBBC021 annotated subset

We first evaluated UMM Discovery on the MOA annotated subset of the BBBC021 (Caie et al., 2010) image dataset.

The best performance was achieved using k-means as the clustering algorithm, adaptive t-SNE for dimensionality reduction and the combination of TVN and Combat for batch correction. In agreement with previous results (Caron et al., 2019), we found that allowing a large number of seeds in k-means is necessary to achieve the best NSC; in our case, we set the number of seeds to the number of treatments. Using the NSC as best-epoch criterion, we achieved an NSC score of 97.1%, an NSCB of 84.8% and a Silhouette score of 0.52, which indicates tight clustering of the same MOA and a clear separation between clusters with different MOAs. The COM score of 0.94 shows that well embeddings from the same treatments receive the same cluster assignment. Figure 3a shows the NSC confusion matrix of this setup: UMM Discovery assigned 100 of the 103 treatments to the correct MOA. The t-SNE visualization of the well embeddings (Fig. 3b) provides visual confirmation of the clustering quality: well embeddings of the same MOA cluster tightly together and are clearly separated from the well embeddings of other MOAs. With the NSCB metric as best-epoch criterion, UMM Discovery reached a slightly lower NSC of 95.1% but improved the NSCB to 89.1%.

Fig. 3.

Evaluation results on the annotated subset of BBBC021. (a) NSC confusion matrix of the best-performing setup. Almost all treatments are classified correctly by first-nearest-neighbor NSC matching with the corresponding MOA label; only three treatments were assigned to the wrong MOA. (b) t-SNE visualization of the well embeddings of the annotated subset of BBBC021 (untreated cells excluded) for the best-performing setup. Each dot corresponds to a well and is colored by MOA

3.1.1 Comparison with supervised training

We compared our method with a supervised training algorithm, training the same adjusted Multi-Scale neural network with ground-truth labels, once with the MOA as in Godinez et al. (2017) and once with the treatment metadata (compound + concentration) as in Godinez et al. (2018). After supervised training, we removed the last classification layer and created embeddings from all the tiles. Additionally, we used the same batch correction as in UMM Discovery during training. Table 1 summarizes the performance of these approaches.

Table 1. Comparison of our work with supervised approaches trained on the annotated BBBC021 subset, once with the MOA as labels and once with the treatment information as labels

Method                                      | NSC  | NSCB | SIL
UMM Discovery with NSC criterion            | 97.1 | 84.8 | 0.52
UMM Discovery with NSCB criterion           | 95.1 | 89.1 | 0.50
Multi-scale with MOA                        | 99.0 | 97.8 | 0.45
Multi-scale with MOA + Combat&TVN           | 98.0 | 95.7 | 0.67
Multi-scale with treatment                  | 61.2 | 54.3 | 0.14
Multi-scale with treatment + Combat&TVN     | 76.7 | 76.1 | 0.45

UMM discovery achieved results very close to supervised training on the MOA labels and outperformed the supervised approach with treatment labels (Table 1).

3.1.2 Comparison with other methods

We compared UMM Discovery with other state-of-the-art methods on the annotated subset of BBBC021. Table 2 shows that UMM Discovery achieved, together with the weakly supervised methods from Caicedo et al. (2018), the best performance in this comparison. One important distinction from the other methods is that UMM Discovery is trained from scratch in a fully unsupervised manner: Ando et al. (2017) and Pawlowski et al. (2016) use neural networks pretrained on other data to create their embeddings, and Caicedo et al. (2018) uses the treatment information as training labels. Another important difference is that UMM Discovery is trained end-to-end on the full resolution field-of-view images, which makes the extraction of single-cell candidates unnecessary. Doing so allows UMM Discovery to capture cell proliferation effects, an important addition to MOA prediction (Supplementary Fig. S6).

Table 2.

Comparison of the results of UMM Discovery with other methods on the annotated subset of BBBC021

Strategy             Method                                              Citation                  Single-cell extraction   NSC   NSCB
Classical features   Cell features from an image analysis pipeline       Ljosa et al. (2013)       Yes                      94%   77%
Classical features   Illumination correction + image analysis pipeline   Singh et al. (2014)       Yes                      90%   85%
Fully supervised     Supervised Multi-scale CNN                          Godinez et al. (2017)     No                       93%   N/A
Weakly supervised    Weakly supervised CNN                               Caicedo et al. (2018)     Yes                      97%   86%
Weakly supervised    Weakly supervised CNN with mixup                    Caicedo et al. (2018)     Yes                      95%   89%
Transfer learning    Pretrained neural network                           Pawlowski et al. (2016)   No                       91%   N/A
Transfer learning    Pretrained deep metric                              Ando et al. (2017)        Yes                      96%   95%
Unsupervised         Variational autoencoder                             Lafarge et al. (2019)     Yes                      93%   82%
Unsupervised         UMM Discovery with NSC criterion                    (this work)               No                       97%   85%
Unsupervised         UMM Discovery with NSCB criterion                   (this work)               No                       95%   89%

Note: Not-Same-Compound (NSC) excludes matches to the same compound; Not-Same-Compound-and-Batch (NSCB) additionally excludes matches to any compound in the same batch. The highest scores for each metric are highlighted in bold.
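The NSC and NSCB matching rules defined in the table note can be sketched in a few lines of NumPy. The data below are synthetic and the function is a simplified illustration of first-nearest-neighbor MOA matching, not the evaluation code used for the reported scores:

```python
import numpy as np

def nn_moa_accuracy(emb, moa, compound, batch=None):
    """First-nearest-neighbor MOA accuracy. Neighbors sharing the query's
    compound (NSC), or its compound or batch (NSCB, when `batch` is given),
    are excluded from matching."""
    emb = np.asarray(emb, dtype=float)
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    hits = 0
    for i in range(len(emb)):
        excluded = compound == compound[i]        # also removes the query itself
        if batch is not None:
            excluded |= batch == batch[i]
        d = dist[i].copy()
        d[excluded] = np.inf
        j = int(d.argmin())
        hits += moa[j] == moa[i]
    return hits / len(emb)

# Toy data: two MOAs, each profiled with two compounds across two batches.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
moa = np.array(["aurora", "aurora", "actin", "actin"])
compound = np.array(["c1", "c2", "c3", "c4"])
batch = np.array(["b1", "b2", "b1", "b2"])

print(nn_moa_accuracy(emb, moa, compound))         # NSC
print(nn_moa_accuracy(emb, moa, compound, batch))  # NSCB
```

NSCB is the stricter score: by excluding the whole batch it prevents a match from being driven by shared batch-level nuisance variation rather than true phenotype.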

The results show that our method performed very well on the small annotated subset of the BBBC021 dataset. However, the real benefit of an unsupervised approach is obtained when applying it to unlabeled datasets. Next, we evaluated UMM Discovery on the full BBBC021 dataset.

3.2 Unsupervised evaluation of the entire BBBC021 dataset

To evaluate whether UMM Discovery performed well on unlabeled datasets, we trained it on the entire BBBC021 dataset. As ground truth MOA labels are only available for a small subset of the images, we evaluated the clustering performance of our method using the available metadata (i.e. treatment metadata). We used the TCS to select the best epoch to avoid overfitting. The method achieved a silhouette score (SIL) of 0.41, an adjusted mutual information score (AMI) of 0.67 and a completeness score (COM) of 0.94. The high completeness score indicates that almost all embeddings with the same treatment are assigned to the same cluster.
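The three clustering scores reported here can be computed with scikit-learn's standard implementations. The embeddings below are synthetic and the perfect agreement between cluster assignment and treatment metadata is purely illustrative:

```python
import numpy as np
from sklearn.metrics import (silhouette_score,
                             adjusted_mutual_info_score,
                             completeness_score)

rng = np.random.default_rng(1)

# Toy well embeddings: three tight treatment groups in 2-D.
emb = np.concatenate([rng.normal(loc=c, scale=0.1, size=(10, 2))
                      for c in ([0, 0], [4, 0], [0, 4])])
treatment = np.repeat([0, 1, 2], 10)   # treatment metadata used as reference
cluster = np.repeat([0, 1, 2], 10)     # cluster assignment (here: a perfect one)

sil = silhouette_score(emb, cluster)                   # cluster separation
ami = adjusted_mutual_info_score(treatment, cluster)   # agreement with metadata
com = completeness_score(treatment, cluster)           # same treatment -> same cluster?
print(round(sil, 2), round(ami, 2), round(com, 2))
```

A high completeness score, as observed for the full dataset, means wells of the same treatment rarely end up split across clusters, even when clusters merge several treatments (which AMI penalizes).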

We visualized the clustering of the well embeddings with t-SNE (see Supplementary Fig. S6) and could identify (i) clusters containing several well embeddings with known MOA, (ii) outlier well embeddings with unexpected clustering behavior based on MOA annotation, as well as (iii) completely unannotated clusters (novel clusters). These novel clusters could represent new MOAs that are absent from the annotated subset.

We performed a detailed evaluation (see Supplementary Chapter: Unlabeled Evaluation) and focused on seven different handpicked groups of well embeddings, denoted with boxes (a–g) in Supplementary Figure S6. The compounds and their concentrations within each box are listed in Supplementary Table S5, where we highlighted the concentrations that belong to the MOA annotations of the BBBC021 subset. Supplementary Table S5 additionally contains the compound target from DrugBank (Wishart et al., 2018).

In summary, UMM Discovery was able to cluster unannotated compounds with annotated compounds, thus enabling MOA assignment. Additionally, we demonstrated the ability of UMM Discovery to identify outliers. Lastly, UMM Discovery is able to identify novel MOAs and, importantly, to distinguish between MOA effects and proliferation effects.

3.3 Evaluation on the 1008 Tales dataset

To further demonstrate the ability to cluster cellular images by phenotype, we trained and evaluated UMM Discovery on two different subsets of the phenotypic screening assay from Cox et al. (2020). The resulting well embeddings were filtered by active calling to keep only the active well embeddings. For identifying the active compounds, we used the same criteria as Cox et al. (2020): high deviation from the control wells and high reproducibility. The method for identifying active treatments is described in the Active Calling chapter of the Supplementary Material.
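A minimal sketch of such an active-calling filter, assuming deviation is measured as a z-scored distance from the control distribution and reproducibility as mean pairwise correlation between replicate profiles. The thresholds and data are arbitrary placeholders, not the criteria or values used by Cox et al. (2020):

```python
import numpy as np

def call_active(treat_reps, controls, dev_thresh=4.0, rep_thresh=0.5):
    """Illustrative active call: a treatment is active when its replicate wells
    deviate strongly from the control distribution AND the replicates agree
    with each other. Thresholds are illustrative placeholders."""
    mu, sd = controls.mean(axis=0), controls.std(axis=0) + 1e-9
    z = (treat_reps - mu) / sd                      # per-feature z-scores
    deviation = np.linalg.norm(z, axis=1).mean()    # mean distance from controls
    corr = np.corrcoef(treat_reps)                  # replicate-vs-replicate profiles
    reproducibility = corr[np.triu_indices_from(corr, k=1)].mean()
    return bool(deviation > dev_thresh and reproducibility > rep_thresh)

rng = np.random.default_rng(2)
controls = rng.normal(size=(24, 8))                 # control (e.g. DMSO) wells
# Three replicates of a strong, consistent phenotype vs. three control-like wells.
active = np.tile(rng.normal(size=8) * 5, (3, 1)) + rng.normal(scale=0.1, size=(3, 8))
inactive = rng.normal(size=(3, 8))

print(call_active(active, controls), call_active(inactive, controls))
```

Requiring both criteria avoids calling wells active on noise alone: a single deviant replicate fails reproducibility, and consistent-but-control-like replicates fail the deviation test.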

In the validation subset, the method achieved an NSC of 35.5% and an NSCB of 36.8% on all treatments, and an NSC of 67.6% and an NSCB of 69.7% on the active treatments only, across all 22 MOAs. Some MOAs showed diverse phenotypes and therefore did not cluster cleanly, which affected the NSC/NSCB scores. Furthermore, we validated the well embeddings by comparing our active treatments with those in Cox et al. (2020) and found them to be very similar: the AMI was 0.80 and the accuracy score was 97.3%.

In the discovery subset of the 1008 Tales dataset, we explored how well UMM Discovery can discover novel MOAs in comparison to the work of Cox et al. (2020). The t-SNE visualization in Figure 4 shows the clustering of the active well embeddings with high reproducibility. Eleven of the 14 MOAs identified by Cox et al. (2020) formed distinct phenotypic clusters, with the PI3K inhibitors clustering together with the mTOR inhibitors. Similar to Cox et al. (2020), the ABL1 inhibitors and VEGFR family inhibitors showed diverse phenotypes and poor clustering, and are not colored in Figure 4. The microtubule stabilizers were not included in the analysis as, in our case, they contained only two co-annotated treatments.

Fig. 4.

t-SNE plot of all active well embeddings with high reproducibility of the 1008 Tales subset. Each dot corresponds to a well and is colored by MOA. The first legend holds the MOAs that match the findings of Cox et al. (2020) and the second legend contains additional MOAs identified by UMM Discovery. Only the MOAs with the most distinct phenotypes are colored. The additionally detected MOA clusters are highlighted by manually drawn circles. *in comparison with Cox et al. (2020)

In addition to finding the MOA clusters detected in Cox et al. (2020), UMM Discovery was able to detect additional MOAs with distinct phenotypic clusters. The most distinctive ones are highlighted in Figure 4, where the translation inhibitors and MEK inhibitors form the most distinct clusters among the additional MOAs. Interestingly, the DNA synthesis inhibitors show two different phenotypes, and the Polo-like kinase inhibitors cluster close to the kinesin inhibitors.

4 Discussion

In this work, we present an unsupervised approach for the analysis of high-content cellular images based on the combination of two state-of-the-art methods: a Multi-Scale neural network architecture (Godinez et al., 2017) and a deep clustering framework (Caron et al., 2019). Our approach can accurately and robustly differentiate across a diverse set of phenotypes without the use of prior phenotypic annotation.

Unlike other approaches in the field, we propose a fully unsupervised method which starts from random initialization of the networks’ weights and learns to capture and distinguish the various drug-induced phenotypes from the intensity values of the cellular images alone. Further significant improvements were obtained by introducing batch correction methods which utilize plate and compound annotation to reduce inter-plate effects. Yet at no stage is prior knowledge of MOA required or utilized.

We see benefits from not using prior MOA annotation. For example, including MOA annotation in the training procedure runs the risk of reducing the space of discoverable MOAs to only those included in the training dataset. Indeed, drug concentrations are typically either ignored (leading to MOA-based clusters with little or no dose gradient) or treated as individual annotations in supervised training (leading to discrete dose-dependent clusters with little overlap between successive doses).

Next, we addressed the problem of batch effects. HCS datasets, like many high-throughput biological datasets, are typically obtained from experiments with systematic and distinct noise introduced by technical heterogeneity (different experiment times, handlers, reagent lots, etc.). The BBBC021 dataset is no different and includes well-annotated batches, which can be modeled using the differential effect observed across batches for the same treatment. In cases where such annotation is lacking, treating individual plates as distinct batches would address the problem similarly. In our work, performing such batch correction during training improved the results significantly. The batch correction not only increases the NSC for all clustering methods but also increases the silhouette score significantly, which implies that the overall clustering of the MOAs is improved. Although ComBat or TVN alone improved the NSC for all methods, the combination of both yielded a significantly higher silhouette score. However, it is important to mention that the batch effect cannot be completely removed from the BBBC021 dataset, owing to insufficient coverage of identical compounds across batches. Applying different batch correction methods, such as correcting for batch effects with the Wasserstein distance as in Tabak et al. (2020), could further improve the results.

Regarding dimensionality reduction, we tested PCA, t-SNE and UMAP, with t-SNE performing best. This indicates the importance of conserving local rather than global structure to achieve higher performance. The relative underperformance of UMAP compared to t-SNE came as a surprise in this context, as both methods share some important characteristics, in particular local structure conservation. Better results might have been achieved with an exhaustive search of the UMAP parameter space.
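The two-step correction discussed above can be roughly sketched as follows, using plain per-batch standardization in place of ComBat's full empirical-Bayes adjustment (Johnson et al., 2007) and PCA-whitening fitted on control wells as a stand-in for TVN (Ando et al., 2017). All data and dimensions are synthetic:

```python
import numpy as np

def batch_standardize(X, batches):
    """ComBat-style location/scale adjustment per batch, without the
    empirical-Bayes shrinkage of the full method."""
    Xc = X.astype(float).copy()
    for b in np.unique(batches):
        m = batches == b
        Xc[m] = (Xc[m] - Xc[m].mean(axis=0)) / (Xc[m].std(axis=0) + 1e-9)
    return Xc

def tvn_whiten(X, control_mask, eps=1e-6):
    """TVN-style whitening: fit a whitening transform on control wells only,
    so the 'typical variation' seen in controls is normalized away."""
    C = X[control_mask]
    mu = C.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov((C - mu).T))
    W = vecs / np.sqrt(vals + eps)     # per-component rescaling after rotation
    return (X - mu) @ W

rng = np.random.default_rng(3)
batches = np.repeat(["b1", "b2"], 20)
X = rng.normal(size=(40, 4))
X[batches == "b2"] += 2.0              # simulated additive batch shift
controls = np.zeros(40, dtype=bool)
controls[::4] = True                   # every 4th well is a control

Xc = tvn_whiten(batch_standardize(X, batches), controls)
print(Xc.shape)
```

After the first step the per-batch means coincide; the second step then equalizes the control-derived variance directions, which is the combination that gave the highest silhouette scores in our experiments.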
The final optimized method not only clusters similar phenotypes together, it also achieves an NSC of 97% on the annotated subset of BBBC021, which is comparable to state-of-the-art methods on this dataset. Furthermore, this result is very close to that of training in a supervised manner with MOA ground-truth labels.

As pointed out in Ando et al. (2017), the NSC can be overestimated due to nuisance variation. The observed difference between NSC and NSCB highlights that UMM Discovery is still affected by some nuisance variation, despite batch correction. A possible explanation for the NSCB drop is the use of linear batch correction methods. Instead of relying on linear batch correction, extending UMM Discovery with an adversarial setup as in Qian et al. (2020) or with regularization methods such as mixup as in Caicedo et al. (2018) could correct the nuisance variation even better. Another reason for the gap between NSC and NSCB is the best-epoch determination and the fact that we optimized parameters for the best NSC instead of the best NSCB. Using the NSCB as criterion, UMM Discovery achieves an NSC of 95.1% and an NSCB of 89.1%.

Another possible reason is our use of full resolution field-of-view images instead of single-cell candidates, which might introduce additional nuisance variation. Training UMM Discovery end-to-end directly from full resolution field-of-view images simplifies the training procedure on large compound assays. In addition, it allows us to capture proliferation and toxicity effects, which can be used for dose response analysis. As the cell count is encoded in the embeddings, this advantage of UMM Discovery might also make it susceptible to nuisance variation. One possibility to reduce this nuisance variation and improve the NSCB could be to train UMM Discovery on cell candidates cropped from cell segmentations. Instead of being read from the embeddings, the proliferation effects could then be determined by the number of cell segmentations per well.

Our method was able to produce novel findings in unannotated datasets. As mentioned, one important aspect of UMM Discovery is the ability to train on unannotated datasets, such as the full BBBC021 or the 1008 Tales datasets. Trained on unannotated datasets, we could show that similar phenotypes cluster together and that our method captures not only the MOA but also the proliferation effect of treatments. Additionally, we show that UMM Discovery can be used for assigning MOAs to unknown compounds, identifying outliers and discovering novel MOAs.

During the development of our method, we tried several different setups. Still, the clustering algorithms and their parameters, the dimensionality reduction methods, the batch correction methods and the hyperparameters of the neural network together represent an immense parameter space, and an exhaustive hyperparameter optimization was outside the scope of this paper. The hyperparameters used in this work can be found in Supplementary Table S6.

While our approach shows promising results, we recognize that there are some limitations. One of the biggest limitations is the training time: at each epoch, before training can start, all images must first be passed through the neural network so that pseudo-labels can be created by clustering their embeddings. We trained our method on an NVIDIA Tesla P100 with 16 GB of memory; training on the annotated subset with batch correction and adjusted t-SNE as dimensionality reduction took ∼23 h, and ∼18 h with PCA instead of t-SNE. To address this issue, we are exploring the use of pretrained early layers instead of random initial weights. This would ease the evaluation of our method on larger datasets such as BBBC036, consisting of five channels and 1553 different compounds, or the complete 1008 Tales dataset.
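The embed-cluster-train alternation responsible for this cost can be sketched with a linear stand-in for the network and a minimal k-means. This is only an illustration of the deep-clustering loop (Caron et al., 2019), not the actual UMM Discovery implementation; the Multi-Scale CNN, the real clustering setup and the dimensionality reduction are all replaced by toy components:

```python
import numpy as np

rng = np.random.default_rng(4)

def kmeans(X, k, iters=20):
    """Minimal k-means used only to generate pseudo-labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def train_epoch(W, X, labels, k, lr=0.1):
    """One softmax-regression step on the pseudo-labels (stand-in for a
    backprop epoch through the CNN)."""
    onehot = np.eye(k)[labels]
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= lr * X.T @ (p - onehot) / len(X)   # cross-entropy gradient step
    return W

X = rng.normal(size=(60, 8))                # 60 "images", 8 features each
k = 4
W = rng.normal(size=(8, k)) * 0.1           # the "network" is one linear map
for epoch in range(3):
    embeddings = X @ W                      # 1) pass every image through the network
    pseudo = kmeans(embeddings, k)          # 2) cluster embeddings -> pseudo-labels
    W = train_epoch(W, X, pseudo, k)        # 3) train on the pseudo-labels
print(len(set(pseudo)))
```

Step 1 is the expensive part on real data: every epoch requires a full forward pass over the whole dataset before any gradient step can be taken, which is why warm-starting from pretrained early layers is attractive.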

Acknowledgements

The authors thank William J. Godinez, Christopher Ball and Mark Bray for fruitful discussions. They also acknowledge the BBBC021 dataset provided by Peter D. Caie via the Broad Bioimage Benchmark Collection as well as the 1008 Tales dataset provided by Michael J. Cox via the Image Data Resource.

Data availability

The BBBC021 data underlying this article are available in the Broad Bioimage Benchmark Collection at https://bbbc.broadinstitute.org/BBBC021. The 1008 Tales data underlying this article are available at the Image Data Resource at https://idr.openmicroscopy.org, and can be accessed with the accession number idr0088.

Funding

This work was supported by the Novartis Institutes for Biomedical Research.

Conflict of Interest: During their involvement related to this reported work, all authors were employees and shareholders of Novartis.

References

Ando, D.M. et al. (2017) Improving phenotypic measurements in high-content imaging screens. bioRxiv, 161422.
Buchser, W. et al. (2004) Assay development guidelines for image-based high content screening, high content analysis and high content imaging. In: Sittampalam, G.S. et al. (eds.) Assay Guidance Manual. Eli Lilly & Company and the National Center for Advancing Translational Sciences, Bethesda, MD.
Caicedo, J.C. et al. (2017) Data-analysis strategies for image-based cell profiling. Nat. Methods, 14, 849–863.
Caicedo, J.C. et al. (2018) Weakly supervised learning of single-cell feature embeddings. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2018, 9309–9318.
Caie, P.D. et al. (2010) High-content phenotypic profiling of drug response signatures across distinct cancer cells. Mol. Cancer Ther., 9, 1913–1926.
Caron, M. et al. (2019) Deep clustering for unsupervised learning of visual features. arXiv, 1807.05520 [cs].
Carpenter, A.E. et al. (2006) CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol., 7, R100.
Cox, M.J. et al. (2020) Tales of 1,008 small molecules: phenomic profiling through live-cell imaging in a panel of reporter cell lines. Sci. Rep., 10, 13262.
Godinez, W.J. et al. (2017) A multi-scale convolutional neural network for phenotyping high-content cellular images. Bioinformatics, 33, 2010–2019.
Godinez, W.J. et al. (2018) Unsupervised phenotypic analysis of cellular images with multi-scale convolutional neural networks. bioRxiv, 361410.
Götte, M. et al. (2010) An imaging assay to analyze primary neurons for cellular neurotoxicity. J. Neurosci. Methods, 192, 7–16.
Johnson, W.E. et al. (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118–127.
Kraus, O.Z. et al. (2016) Classifying and segmenting microscopy images using convolutional multiple instance learning. Bioinformatics, 32, i52–i59.
Lafarge, M.W. et al. (2019) Capturing single-cell phenotypic variation via unsupervised representation learning. In: Proceedings of the 2nd International Conference on Medical Imaging with Deep Learning, Proceedings of Machine Learning Research. PMLR, pp. 315–325.
Lin, F., Cohen, W.W. (2010) Power iteration clustering. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 655–662.
Ljosa, V. et al. (2013) Comparison of methods for image-based profiling of cellular morphological responses to small-molecule treatment. J. Biomol. Screen., 18, 1321–1329.
Maaten, L.v.d., Hinton, G.E. (2008) Visualizing data using t-SNE. J. Mach. Learn. Res., 9, 2579–2605.
McInnes, L. et al. (2017) hdbscan: hierarchical density based clustering. JOSS, 2, 205.
Palacio-Niño, J.-O., Berzal, F. (2019) Evaluation metrics for unsupervised learning algorithms. arXiv, 1905.05667 [cs, stat].
Pawlowski, N. et al. (2016) Automating morphological profiling with generic deep convolutional networks. bioRxiv, 085118.
Qian, W.W. et al. (2020) Batch equalization with a generative adversarial network. Bioinformatics, 36, i875–i883.
Singh, S. et al. (2014) Pipeline for illumination correction of images for high-throughput microscopy. J. Microscopy, 256, 231–236.
Spiegel, S. et al. (2019) Metadata-guided visual representation learning for biomedical images. bioRxiv, 725754.
Stringer, C. et al. (2021) Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods, 18, 100–106.
Sun, B. et al. (2017) Correlation alignment for unsupervised domain adaptation. In: Domain Adaptation in Computer Vision Applications. Advances in Computer Vision and Pattern Recognition. Springer, pp. 153–171.
Tabak, G. et al. (2020) Correcting nuisance variation using Wasserstein distance. PeerJ, 8, e8594.
Wishart, D.S. et al. (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res., 46, D1074–D1082.

Author notes

The authors wish it to be known that, in their opinion, the last two authors should be regarded as Joint Last Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Lenore Cowen