Abstract

While only recently developed, the ability to profile expression data in single cells (scRNA-Seq) has already led to several important studies and findings. However, this technology has also raised several new computational challenges. These include questions about the best methods for clustering scRNA-Seq data, how to identify unique groups of cells in such experiments, and how to determine the state or function of specific cells based on their expression profile. To address these issues we developed and tested a method based on neural networks (NN) for the analysis and retrieval of single cell RNA-Seq data. We tested various NN architectures, some of which incorporate prior biological knowledge, and used these to obtain a reduced dimension representation of the single cell expression data. We show that the NN method improves upon prior methods in both the ability to correctly group cells in experiments not used in the training and the ability to correctly infer cell type or state by querying a database of tens of thousands of single cell profiles. Such database queries (which can be performed using our web server) will enable researchers to better characterize cells when analyzing heterogeneous scRNA-Seq samples.

INTRODUCTION

Single cell RNA-seq (scRNA-seq), which profiles the transcriptome of individual cells (as opposed to ensembles of cells), has already led to several new and interesting findings. These include the level of heterogeneity within a population of cells (1), the identification of new markers for specific types of cells (2) and the temporal stages involved in the progression of various developmental processes (3).

While promising, single cell data have also raised new computational challenges. An important and exciting application of single cell sequencing is the ability to identify and characterize new cell types and cell states (4,5). Recent work has used single cell expression profiles to discover new cells in developing lungs (6), new brain cells (4) and to refine several aspects of cell state transitions in differentiation studies (7,8). A key question that all such studies had to address is how to determine the similarity of the expression profiles of a pair (or larger set) of cells. Another application for which the ability to compare single cell expression data between cells is critical is the retrieval of similar cell types. Consider an experiment in which a population of cells taken from a diseased individual, or from a tumor, is profiled. One question that may be important for such analysis is identifying the specific types of cells that are present in the profiled sample, for example to determine which immune cells may have penetrated the diseased tissue (9). While such analysis is often performed using markers, a much more comprehensive solution is to compare the various cell expression profiles to a set of curated single cells with known types.

In the above examples, comparisons or similarity analysis can either be performed using the measured expression values or after performing dimensionality reduction which may help reduce the noise associated with specific values. Indeed, several methods have been used and developed for performing such comparisons. The simplest, though one of the most popular, is based on principal component analysis (PCA). PCA has been used extensively for clustering single cells (1,10,11). Other groups have developed new methods which extend and improve PCA. These include pcaReduce (12), which uses a novel agglomerative clustering method on top of PCA to cluster the cells. SNN-Cliq (13) constructs a shared k-nearest neighbor (KNN) graph over all the cells with the weight of each edge being the difference between k and the highest averaged ranking of the common KNN between two cells. It then tries to find maximal cliques in that graph in order to cluster the cells. ZIFA (14) uses a dimensionality reduction technique that takes into account the dropout characteristics of single cell sequencing data. SINCERA provides a pipeline for the analysis of single cell gene expression data, one of whose tasks is to identify new cell types (15). Clustering is done via hierarchical clustering using centered Pearson correlation as the similarity measure. SIMLR (16) is another open-source tool that performs dimensionality reduction and clustering based on a cell similarity metric.

While PCA and other unsupervised approaches have been successful, they have mostly been used to analyze datasets generated by a specific group. In contrast, for problems such as retrieval we would like to obtain a reduced dimension representation that applies to all cell types and experiments across different labs. In addition, PCA is an unsupervised method and so it does not directly try to distinguish between specific cell types. Thus, it may not be the best method for discriminative analysis goals such as retrieval of cell types based on their expression.

Here, we propose to replace PCA-based dimensionality reduction with a supervised method based on neural networks (NN). These networks are universal function approximators (17) and, while training such networks takes longer than the unsupervised methods discussed above, they are very scalable in terms of training when using GPUs. Past work has used unsupervised versions of NN for the analysis of bulk expression data. For example, Tan et al. (18) used denoising autoencoders (DAEs) to reconstruct bulk gene expression data. Using these networks, they were able to assign genes to clusters and find clusters enriched for genes that are differentially activated between strains of Pseudomonas aeruginosa during the cellular response to low oxygen. Gupta et al. (19) used DAEs to analyze time series bulk gene expression data. In contrast, while we also use DAEs for pre-training our network, we fine-tune it with a supervised training pass afterwards to learn a latent representation that disentangles cell types.

Neural network models use multiple layers, often with fewer nodes than the number of input values. The networks are trained by maximizing a target function and, in several cases, one of the intermediate layers with small cardinality (compared to the number of inputs) can serve as a reduced dimensionality representation for the input data (20). In our implementation we use gene expression values as the input and formulate a target function whose goal is to identify the correct cell type from the values computed by the intermediate layer nodes (Figure 1). We tested various architectures for such networks, including architectures informed by prior biological data (such as protein–protein (PPI) and protein–DNA interactions (PDI)), and compared their performance to prior methods for analyzing single cell data. As we show, the learned networks captured several important biological aspects of the data. We used the values from the intermediate layer in the neural network (NN) for clustering and retrieval of scRNA-seq data. As we show, using the NN values improved performance when compared to using the measured expression data, various unsupervised methods for reducing the dimension of the data, and prior methods for clustering single cell data.

Figure 1.

Network architectures of NN models used in this paper. (A) Single-layered fully connected network. (B) Two-layered fully connected network. (C) Single-layered network with TF and PPI cluster nodes (connected only to their member genes) and fully connected dense nodes. (D) Two-layered network that is similar to the model in (C) but with an additional fully connected layer. Note that bias nodes are also included in each model.

MATERIALS AND METHODS

Datasets used in our analysis

We collected a total of 33 datasets with more than 17 000 single cell expression profiles from published papers and from the Gene Expression Omnibus (21) (GEO). Supplementary Table S1 and the Supporting Website provide full details on these datasets. We used 3 of these datasets which, combined, profiled 16 different types of cells and a total of 402 single cells for initial training, testing and evaluation of the method. We used 31 of the 33 datasets for the retrieval analysis (to avoid using different datasets from the same lab). All datasets are used in the retrieval application which is available from the Supporting Website. We curated all 33 datasets and assigned cell type labels to all single cell expression profiles.

Normalization and imputation

We tested a number of methods for normalizing data obtained from different labs and different platforms. We initially tested a novel normalization method for single-cell RNA sequencing data which is based on pooling across cells (22). However, while results for the clustering analysis using this method were similar to the results presented below (see Supplementary Figures S1 and S2), the method required us to manually set several parameters (such as pool size), which made it hard to use for the larger retrieval analysis. Instead, following prior work we normalized the data by converting all datasets to the Transcripts Per Million (TPM) format (10,23–25).

To combine PPI, PDI and single cell data (see below), we only used a subset of 9437 genes that were present in the three single cell training datasets and in the PPI and PDI datasets. For the much larger set of profiles used in the retrieval analysis, we used the same set of genes. Since these datasets were generated by different platforms and groups, counts for 2% of the genes were missing (on average) from each dataset. In order to use the NN method for analyzing these datasets, we performed imputation for these missing genes as follows. Missing values for the retrieval analysis were first assigned the median gene expression value for the cell and then imputed with the average expression value for the k-nearest neighbor genes (we used k = 10), where nearest neighbors were computed based on overall correlation (26). Following TPM normalization and imputation, each gene was normalized to the standard normal distribution across samples, since this is an essential step for NN training. We chose not to log-transform the data because we found that it did not improve performance.
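As a concrete illustration, the following is a minimal NumPy sketch of this imputation and scaling procedure; the matrix `tpm` (cells x genes), the boolean mask `missing` and the function name `impute_and_standardize` are assumptions made for this example rather than part of our released code.

```python
import numpy as np

def impute_and_standardize(tpm, missing, k=10):
    """Sketch: median seeding, correlation-based k-NN gene imputation, z-scoring."""
    X = tpm.astype(float).copy()
    # Step 1: seed each missing entry with the median expression of its cell.
    for i in range(X.shape[0]):
        X[i, missing[i]] = np.median(X[i, ~missing[i]])
    # Step 2: replace seeded entries with the mean of the k most correlated genes.
    corr = np.corrcoef(X, rowvar=False)            # gene-by-gene correlation
    for i in range(X.shape[0]):
        for g in np.where(missing[i])[0]:
            neighbors = np.argsort(-corr[g])[1:k + 1]
            X[i, g] = X[i, neighbors].mean()
    # Step 3: standardize every gene across cells before NN training.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    return X
```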

To account for drop-outs in the imputation procedure, we tested a probabilistic model that randomly assigns a value of 0 to a specific fraction $z$ of the imputed genes instead of relying on the nearest neighbors as discussed above. We tested several different values for $z$, including 0 (no drop-outs), 0.01, 0.03 and 0.05, using a cross validation strategy. Our results indicate that the best performance is achieved using $z$ = 0 (see Supplementary Table S7) and so this is what we used for the rest of the paper.
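This drop-out variant can be sketched as a small modification of the imputation step above; the function name and the mask of imputed positions are again illustrative assumptions.

```python
import numpy as np

def zero_out_imputed(X, imputed_mask, z, seed=0):
    """Sketch: set a random fraction z of the imputed entries to 0."""
    X = X.copy()
    rng = np.random.default_rng(seed)
    rows, cols = np.where(imputed_mask)
    n_drop = int(z * len(rows))
    picked = rng.choice(len(rows), size=n_drop, replace=False)
    X[rows[picked], cols[picked]] = 0.0
    return X
```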

Protein–protein and protein–DNA interaction data

We used PPI and PDI data to determine the architecture of some of the NN we tested. We constructed a weighted, partially directed, protein interaction network using several databases including BIOGRID (27), HPRD (28) and also used Post-translational Modification Annotations from the HPRD. PDI were based on data from (29). The data contained 160 000 PPI edges and 60 000 PDI edges between 16 671 genes and 348 TFs. The PPI data were weighted based on the different types of experimental evidence supporting each interaction (30).

A neural network representation of single cell expression data

We evaluated four types of neural network architectures (Figure 1) and trained a total of five models (the detailed numbers of nodes for each model are shown in Table 1). All architectures include an input layer, one or two hidden layers (more hidden layers did not improve performance) and an output layer. The input layer encodes the expression values for the genes that are used in the analysis. The output layer encodes the probability of each cell type. Hidden layers are functions of the input and are used to efficiently encode the inputs such that the output can be the correct cell type given a set of expression values.

Table 1.
The five different types of NN used in the paper

No.  Model          Layer 1 #nodes    Layer 2 #nodes    #Parameters (million)
1    Dense          796               X                 7.52 M
2    Dense          100               X                 0.95 M
3    Dense          796               100               7.59 M
4    PPI/TF+dense   696 + 100 dense   X                 1.01 M
5    PPI/TF+dense   696 + 100 dense   100               1.08 M

Note that No.1 and No.2 represent the same architecture (Figure 1A) using different numbers of hidden layer nodes. The 696 in the PPI/TF models is from 348 TFs + 348 PPI groups. The number of nodes in the Dense models corresponds to the number of nodes in the PPI/TF models for comparison. The additional 100 dense nodes in each model were manually selected.


Specifically, we formulate our neural network model as follows. Let the vector $\boldsymbol{x}^{(i)}$ denote the output of the $i$th hidden layer, and let $\boldsymbol{x}^{(0)}$ represent the input of the NN. To compute $\boldsymbol{x}^{(i)}$, we perform forward propagation:
\begin{equation*} \boldsymbol{x}^{(i)} = a(\boldsymbol{W}^{(i)}\boldsymbol{x}^{(i-1)}+\boldsymbol{b}^{(i-1)}) \tag{1} \end{equation*}
where $a$ is the activation function, $\boldsymbol{b}$ is an intercept term and $\boldsymbol{W}$ is the matrix of edge weights in the neural network. $\boldsymbol{W}$ and $\boldsymbol{b}$ are the parameters we need to learn.
We tested a number of possible activation functions including sigmoid, linear, relu and tanh. Our analysis indicates that the hyperbolic tangent activation function (tanh) leads to the best performance among these, and so we used it in the remainder of this paper. The tanh function is defined as:
\begin{equation*} \tanh (x) = \frac{1-\exp (-2x)}{1+ \exp (-2x)} \tag{2} \end{equation*}
For the output layer, which performs discrete classification, we use the softmax activation function. Let $\boldsymbol{x}$ denote the input of the output layer (which is also the output of the last hidden layer); then we have the following:
\begin{equation*} \mathrm{output}(\boldsymbol{x}) = \mathrm{softmax}(\boldsymbol{x}) = {\left[\frac{\exp (x_1)}{\sum _c \exp (x_c)} \cdots \frac{\exp (x_C)}{\sum _c \exp (x_c)}\right]}^T \tag{3} \end{equation*}
where $C$ indexes the cell types in the training set.
Here, the output $f(\boldsymbol{x}^{(0)})_{c}$ of each node $c$ in the output layer represents the probability
\begin{equation*} f(\boldsymbol{x}^{(0)})_{c} = p(y=c|\boldsymbol{x}^{(0)}) \tag{4} \end{equation*}
that the input sample $\boldsymbol{x}^{(0)}$ belongs to cell type $c$. The loss function is the categorical cross-entropy:
\begin{equation*} \mathrm{loss} = - \log f(\boldsymbol{x}^{(0)})_y \tag{5} \end{equation*}
where $y$ is the true cell type (label) of the input $\boldsymbol{x}^{(0)}$.
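To make this formulation concrete, a minimal Keras sketch of the two-hidden-layer dense architecture (Figure 1B) is shown below; the variable names and the use of the `tf.keras` API (the paper used an earlier Keras release) are assumptions of this illustration rather than our exact code.

```python
import tensorflow as tf

n_genes, n_cell_types = 9437, 16   # sizes of the training data described above

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_genes,)),
    tf.keras.layers.Dense(796, activation='tanh'),              # first hidden layer, Equations (1)-(2)
    tf.keras.layers.Dense(100, activation='tanh'),               # second hidden layer
    tf.keras.layers.Dense(n_cell_types, activation='softmax'),   # output layer, Equation (3)
])
# Training with the categorical cross-entropy loss of Equation (5) is sketched
# below in the 'Learning model parameters' section.
```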

Architectures used in our NN method

While all networks used the same input and output layers, they differed in the structure and number of hidden layers. We tested models with 1 or 2 fully connected hidden layers (for the NN with 2 hidden layers, 796 nodes in the first hidden layer and 100 nodes in the second hidden layer; Figure 1A and B). While these models have at most 2 hidden layers, the number of parameters that are fitted is very large (the architecture with the most parameters had 7.6 million parameters).

The above architectures do not make use of prior biological knowledge. To incorporate such information, we also tested architectures that were based on PPI and PDI. Unlike the fully connected layers in the architectures described above, when using the biological data we connect the nodes in the first hidden layer to only a small subset of the input nodes. We do this by fixing the weight to 0 for all connections from genes to PPI/TF nodes to which the genes do not belong. This greatly reduces the number of parameters that need to be learned, which can help with issues related to overfitting. Specifically, we have 348 hidden layer nodes based on the PDI data (one for each TF, connected only to genes regulated by that TF) and 348 nodes for the PPI. To divide the PPI graph into 348 subnetworks (where each subnetwork is connected to a node in the hidden layer) we used the ClusterOne (31) algorithm, which produces overlapping clusters so that, similar to the dense nodes, a gene can be connected to more than one PPI node. We also added 100 fully connected nodes to the first hidden layer to account for missing data and TFs in the PPI and PDI (for instance, if we do not have a set of targets for a specific TF that should be grouped together, or if the PPI network is missing edges leading to missed clusters (PPI nodes)). Still, the total number of parameters for this architecture is about 1 million, an order of magnitude less than for the fully dense architectures.
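One way to implement the fixed zero weights is with a kernel constraint that multiplies the weight matrix by a binary gene-to-node membership mask after every update. The sketch below, including the mask construction that is only indicated in comments, is an assumption of how this could be coded rather than the exact implementation we used; it shows the one-hidden-layer PPI/TF architecture of Figure 1C.

```python
import numpy as np
import tensorflow as tf

n_genes, n_ppitf, n_dense, n_cell_types = 9437, 696, 100, 16

class ConnectivityMask(tf.keras.constraints.Constraint):
    """Zero out weights from genes that are not members of a TF/PPI node."""
    def __init__(self, mask):
        self.mask = tf.constant(mask, dtype=tf.float32)
    def __call__(self, w):
        return w * self.mask

# mask[g, j] = 1 if gene g is a target of TF j (first 348 columns) or a member
# of PPI cluster j - 348 (last 348 columns); filling it in is omitted here.
mask = np.zeros((n_genes, n_ppitf), dtype=np.float32)

inputs = tf.keras.Input(shape=(n_genes,))
ppitf = tf.keras.layers.Dense(n_ppitf, activation='tanh',
                              kernel_constraint=ConnectivityMask(mask))(inputs)
dense = tf.keras.layers.Dense(n_dense, activation='tanh')(inputs)
hidden = tf.keras.layers.Concatenate()([ppitf, dense])
outputs = tf.keras.layers.Dense(n_cell_types, activation='softmax')(hidden)
ppitf_model = tf.keras.Model(inputs, outputs)
```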

Unsupervised pre-training

The discussion above focused on supervised learning of NN (where the label is the cell type). In addition to supervised learning, NN can also use unlabeled data, a practice that has proven useful in other domains (32). A specific type of such NN is termed a 'denoising autoencoder' (DAE), since the goal is to reconstruct the input layer values using a small number of hidden layer nodes (since the target is the input expression values, no labels are needed to train autoencoders). While unsupervised, autoencoders have been shown to successfully identify input combinations that affect the overall set of values (33–35). Given the large number of parameters in a NN, the ability to train autoencoders and use the learned parameters as priors for a supervised learning procedure improves the initialization of the model and often leads to better generalization performance (32). We have thus tested the use of a 1-layer DAE when testing the method on large datasets (retrieval datasets). We train the DAE to reconstruct the original input from a corrupted input, with the noise sampled from a Normal distribution with zero mean and 0.1 standard deviation. The architecture of the DAE is similar to Figure 1A except that the output layer is changed to be the reconstructed input. All layers in the DAE use the tanh activation function and mean squared error as the loss function. Here we used 100 and 796 nodes in the hidden layer of the DAE, matching the numbers used for the supervised models, to ensure that the weights of the DAE can be used as pre-trained weights of the supervised models. We copy the trained weights of the DAE to the dense nodes of our models.
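A minimal sketch of this pre-training step is given below; `X` (the normalized expression matrix) and the layer/variable names, including `supervised_hidden_layer`, are assumptions, and the weight copy is only indicated for a single layer.

```python
import numpy as np
import tensorflow as tf

n_genes, n_hidden = 9437, 100

# Denoising autoencoder: reconstruct the clean input from a corrupted copy.
encoder = tf.keras.layers.Dense(n_hidden, activation='tanh', name='dae_hidden')
dae = tf.keras.Sequential([
    tf.keras.Input(shape=(n_genes,)),
    encoder,
    tf.keras.layers.Dense(n_genes, activation='tanh'),
])
dae.compile(optimizer='sgd', loss='mse')

X_noisy = X + np.random.normal(0.0, 0.1, size=X.shape)   # Gaussian corruption
dae.fit(X_noisy, X, epochs=100, batch_size=10)

# Use the learned weights to initialize the corresponding dense hidden layer of
# the supervised model before fine-tuning with cell-type labels.
supervised_hidden_layer.set_weights(encoder.get_weights())
```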

Learning model parameters

All models were implemented using Keras (https://github.com/fchollet/keras) with some modifications to accommodate the sparse layer connections for TF and PPI nodes. All NNs are first initialized with Glorot initialization (36). The models are trained using stochastic gradient descent with Nesterov accelerated gradient, a learning rate of 0.1, a decay of $10^{-6}$ and a momentum of 0.9. The learning rate for epoch K is the original learning rate multiplied by 1/(1 + decay*K). We used 100 epochs (which were manually selected and enough for reaching convergence) to train each model with a mini-batch size of 10, which is the number of samples used at each update step. Detailed information about the dimensions of the different architectures is provided in Table 1. It took 30 s to train the largest supervised NN (7.6 million parameters, 402 cells) and 40 min to train the largest unsupervised NN (15 million parameters, 17 000 cells) on a machine with 4 Intel(R) Xeon(R) CPU E5-2620 v3 processors (2.40 GHz per core, 24 cores in total), 4 Nvidia GTX 1080 GPUs and 128 GB RAM. Note that each NN model, including the one used for the final outcome of our method (retrieval analysis in Table 4), is trained using only a single GPU.
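These optimization settings translate into the following sketch; `model`, `X_train` and the one-hot label matrix `y_train` are assumed to exist, and the learning-rate schedule shown decays per update step, which only approximates the per-epoch decay described above.

```python
import tensorflow as tf

# Learning rate 0.1 with 1 / (1 + decay * step) decay and Nesterov momentum 0.9.
schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1, decay_steps=1, decay_rate=1e-6)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9, nesterov=True)

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=10)
```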

Biological analysis of learned models

To determine the biological relevance of the parameters learned by the NN, we analyzed significant gene groups for each cell type in the PPI/PDI model (Figure 1C). For this we identify the top 10 most highly weighted (hidden layer) nodes for each output layer node (corresponding to the different cell types). Some of the selected nodes are explained by the TF or the PPI cluster they represent. For the others (the 100 nodes in the hidden layer initially connected to all input genes) we perform GO analysis based on the set of input genes that are connected to these nodes with a high (absolute value) weight. We used gprofiler (37) to perform the GO analysis because it provides a command-line interface for access.
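The ranking step can be sketched as simple weight-matrix manipulation; `W_hidden` (genes x hidden nodes) and `W_out` (hidden nodes x cell types) are assumed to be the learned weight matrices of the model in Figure 1C, and the function name is illustrative.

```python
import numpy as np

def top_nodes_and_genes(W_hidden, W_out, cell_type_idx, n_nodes=10, n_genes=50):
    """Return the top hidden nodes for a cell type and, for each node, the input
    genes with the largest absolute connection weights."""
    nodes = np.argsort(-np.abs(W_out[:, cell_type_idx]))[:n_nodes]
    gene_sets = {int(n): np.argsort(-np.abs(W_hidden[:, n]))[:n_genes] for n in nodes}
    return nodes, gene_sets   # gene sets for dense nodes are then sent to gprofiler
```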

Comparisons to prior clustering and dimensionality reduction methods

To perform dimensionality reduction based on the NN results we extract the values computed by the last hidden layer of each architecture we tested. We next use a simple clustering method (k-means++ clustering (38)) to perform unsupervised grouping of cells using a test set (not used in the NN training) and the results are compared to prior methods suggested for clustering single cell expression data. For such comparisons, we perform experiments in which we left out 2, 4, 6 or 8 random cell types of the 16 types in our analysis set. We next cluster the left out data using the reduced representation obtained from the last hidden layer of the NN and use the adjusted Rand index (ARI) (39) to compare the clustering results with the true labels. ARI counts the number of agreements and disagreements between two groupings while taking random performance into account. It is defined as follows. Let $X = \{X_1, X_2, ..., X_r\}$ and $Y = \{Y_1, Y_2, ..., Y_s\}$ be two groupings. We can summarize the overlap between X and Y using a table N, where $N_{ij} = |X_i \cap Y_j|$ is the number of objects in common. Let $a_i = \sum_j N_{ij}$, $b_j = \sum_i N_{ij}$ and let n be the total number of samples; then we set:
\begin{equation*} {\rm ARI}=\frac{{\rm Index-ExpectedIndex}}{{\rm MaxIndex-ExpectedIndex}} = \frac{\sum _{ij}\binom{N_{ij}}{2}-\left(\sum _i \binom{a_i}{2}\sum _j \binom{b_j}{2}\right)/\binom{n}{2}}{\frac{1}{2} \left(\sum _i \binom{a_i}{2}+\sum _j \binom{b_j}{2}\right) - \left(\sum _i \binom{a_i}{2}\sum _j \binom{b_j}{2}\right)/\binom{n}{2}} \tag{6} \end{equation*}
We repeat the experiment 20 times with fixed random seeds 0–19, each time with a different random set of left out cell types, and obtain an average ARI value. We also test other scoring methods for clustering: adjusted mutual information (40), Fowlkes–Mallows score (41), and homogeneity, completeness and v-measure scores (42).
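A sketch of one such clustering experiment is shown below; `hidden_test` (last-hidden-layer values for the held-out cells) and `labels_test` are assumed inputs, and for brevity the sketch varies only the k-means seed rather than the left out cell types.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_ari(hidden_test, labels_test, n_types, seed):
    """Cluster the reduced representation with k-means++ and score it with ARI."""
    pred = KMeans(n_clusters=n_types, init='k-means++',
                  random_state=seed).fit_predict(hidden_test)
    return adjusted_rand_score(labels_test, pred)

scores = [clustering_ari(hidden_test, labels_test, n_types=4, seed=s) for s in range(20)]
print(np.mean(scores))
```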

Cell retrieval method

To measure the performance of cell retrieval, the percentage of the desired cell type in the top k nearest neighbors (here, we used k = 100) is calculated for different single cell expression data representations (NN, PCA, measured values). To reduce the ability of the retrieval analysis to rely on artifacts for correctly identifying the cell types (e.g. experiments from the same lab), we only perform this analysis on cell types that were profiled in different datasets out of the 31 used for this analysis. We thus used 9 different cell types for this analysis, though the database itself contains over 100 cell types (including subtypes). Cell types that are not selected can thus still be in the retrieved set of k nearest cells, which makes the analysis more similar to how we envision the usage of such a database. We use mean average precision (MAP) to evaluate the retrieval performance. Average precision corresponds to the area under the precision-recall curve for the retrieved cells. The final result for all cell types is a weighted mean so that every dataset (cell type) has equal weight.
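The retrieval metric can be sketched as follows; `db`, `db_labels`, `queries` and `query_labels` hold the reduced-dimension database and the held-out query cells (all names are assumptions), and the per-cell-type weighting described above is omitted for brevity.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.neighbors import NearestNeighbors

def retrieval_map(db, db_labels, queries, query_labels, k=100):
    """Mean average precision over the k nearest database cells of every query."""
    nn = NearestNeighbors(n_neighbors=k, metric='euclidean').fit(db)
    dist, idx = nn.kneighbors(queries)
    aps = []
    for q in range(len(queries)):
        hits = (db_labels[idx[q]] == query_labels[q]).astype(int)
        if hits.sum() > 0:                  # skip queries with no match in the top k
            aps.append(average_precision_score(hits, -dist[q]))
    return float(np.mean(aps))
```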

Hyper parameter selection

To select hyper parameters, we split the data into a training and test set randomly, with 90% (or 80%) of the data used for training and 10% (or 20%) for testing. We repeat this process 10 times and select the best hyper parameters based on the average test performance in these 10 random experiments. Note that since we evaluate the methods in an unsupervised (clustering) setting, this training uses a dataset completely independent of any of the data we use for evaluation. The learning rate hyper parameter affects the training time but does not greatly affect the performance. We performed additional analysis to test the influence of different rates on accuracy. We explored different hyper parameters for a dense model with combinations of learning rates [0.01, 0.1], momentum [0.5, 0.9] and decay [0.001, 1e-06]. The results, presented in Supplementary Table S9, are very similar for all these values.
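This search can be sketched as a small grid over repeated random splits; `build_model` (assumed to compile a dense model with an accuracy metric for a given learning rate, momentum and decay) and the arrays `X`, `y` are assumptions of the example.

```python
import itertools
import numpy as np
from sklearn.model_selection import train_test_split

grid = itertools.product([0.01, 0.1], [0.5, 0.9], [0.001, 1e-06])
results = {}
for lr, momentum, decay in grid:
    accuracies = []
    for rep in range(10):                      # 10 random 90%/10% splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=rep)
        m = build_model(lr, momentum, decay)
        m.fit(X_tr, y_tr, epochs=100, batch_size=10, verbose=0)
        accuracies.append(m.evaluate(X_te, y_te, verbose=0)[1])
    results[(lr, momentum, decay)] = np.mean(accuracies)
best = max(results, key=results.get)
```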

RESULTS

We learned parameters for the different NN architectures discussed in the ‘Materials and Methods’ section and used the resulting models to test the method and to compare it to prior methods for dimensionality reduction of scRNA-Seq data and for clustering and comparing such data.

Testing and comparing the NN method

To test our method and to compare it to prior methods for clustering single cell time series expression data, we used 3 single cell expression datasets which, combined, included 16 cell types (see Supplementary Table S8 for a complete list). Data were downloaded mostly from GEO and processed as discussed in 'Materials and Methods'. We identified 9437 genes that had expression values in all training datasets and used them to learn the NN using various architectures. For each architecture, input values were directly connected to the first hidden layer and the output layer predicted the true cell type (label) for each of the datasets. Thus, the goal of the NN was to identify a reduced dimension representation of the expression values (where the number of nodes/values is a function of the specific architecture) that leads to the most accurate assignment of cell type for each dataset. Training and testing accuracy of all NN models reached nearly 100%.

Given the ability of the method to accurately assign training data, we next asked how well the resulting representation can be used on test (unobserved) data. For this we performed a number of different analyses in which we divided the single cell datasets, using a subset of the cell types to learn the model and the rest to test it. For testing, we compared the ability of a simple clustering method (k-means++ algorithm with k representing the known number of left out cell types) to accurately sub-divide test data according to their cell types. For this we first learn parameters for the NN using the portion of cells used for training. Next, we run each of the test cells through the NN, extract the values of the smallest hidden layer (depending on the architecture) and use these vectors in the clustering procedure. Thus, clustering is based on gene expression combinations determined by the NN. We also note that the training and test sets consist of different cell types and so this is not a supervised classification task. Instead, the goal is to see if the parameters learned using the labeled data can improve the unsupervised analysis of the other cell types. We also calculate the performance of pre-trained models, and the results are similar to those of the models without pre-training. We used the clustering results to compare the NN-based analysis to a number of unsupervised clustering and dimensionality reduction methods that have been proposed for single cell analysis. Specifically, we compared our method to PCA, pcaReduce, Independent Component Analysis (ICA), Non-negative matrix factorization (NMF), t-distributed stochastic neighbor embedding (tSNE), SINCERA, SIMLR and SNN-Cliq. For the unsupervised dimensionality reduction methods (PCA, ICA, NMF, tSNE) we tested several possible settings (number of components). Note that we were unable to test more than 10 components for tSNE and ICA since these methods generated run time errors (NaN, Inf values) when using more than 10 components. As for the other methods, pcaReduce generates a hierarchical clustering structure based on PCA-reduced dimensions. SINCERA provides a pipeline for analyzing single cell data, including clustering. For clustering, SINCERA uses hierarchical clustering, tight clustering (43) or consensus clustering (44). We only show the results of hierarchical clustering here because it is the default setting of SINCERA and the other clustering methods often generated error messages when applied to our dataset. SNN-Cliq uses shared nearest neighbors (SNN) to define the similarities between cells and then clusters them using graph based methods. SNN-Cliq sometimes generates error messages when the number of cell types (k) is 2, so we left it out of this comparison. We also tried to compare our results to ZIFA. However, ZIFA did not finish after running for two days when trying to cluster 300 cells with 9437 expression values each. To improve its run time we reduced the number of genes to 1356 (selecting only genes that have non-zero values in 90% of samples), but the performance of ZIFA on this data was not good (much lower than the results presented in the comparison table) and so we left it out. Table 2 presents a comparison of the clustering results for all methods we tested and Figure 2 presents a visualization of these results for a few of the methods. As can be seen, in general the variants of the NN we tested outperformed all other methods.
One way to explain this result is the fact that, unlike the other fully unsupervised methods, our method learns the best way to represent a reduced dimension for single cell data (even though the comparison is based on an unsupervised clustering task and learning is done on a completely different set of experiments and cell types). While all NN architectures performed better than the other methods, the best was the PPITF variant that integrates prior biological knowledge with dense nodes, which did slightly better than the Dense 796 node networks. However, as we show below, when testing on a much larger set of data the architectures that incorporate prior biological knowledge do better than the fully dense ones.

Figure 2.

2D visualizations for some of the methods in Table 2. Colors represent cell types as shown below the figures. (A) 2D tSNE of the original data. (B) 2D PCA for SIMLR-transformed data. (C) 2D tSNE of the second layer of the PPI/TF NN. (D) 2D PCA of the second layer of the PPI/TF NN.

Table 2.
Average performance of different scoring metrics for clustering four cell types not used in the NN training

Feature                                  Homo    Comp    Vmes    ARI     AMI     FM      Average
Original                                 0.785   0.875   0.82    0.73    0.775   0.841   0.805
pca 2                                    0.833   0.883   0.854   0.786   0.821   0.873   0.842
tsne 2                                   0.179   0.169   0.172   0.113   0.127   0.391   0.192
ica 2                                    0.833   0.882   0.853   0.785   0.82    0.873   0.841
nmf 2                                    0.659   0.735   0.69    0.592   0.638   0.75    0.677
pca 5                                    0.754   0.863   0.798   0.685   0.741   0.814   0.776
tsne 5                                   0.034   0.035   0.034   -0.009  -0.006  0.332   0.07
ica 5                                    0.662   0.839   0.733   0.615   0.649   0.787   0.714
nmf 5                                    0.707   0.848   0.76    0.693   0.689   0.824   0.753
pca 10                                   0.759   0.873   0.805   0.695   0.747   0.823   0.784
tsne 10                                  0.049   0.043   0.045   0.0     0.004   0.299   0.073
ica 10                                   0.452   0.733   0.545   0.416   0.431   0.706   0.547
nmf 10                                   0.717   0.812   0.751   0.676   0.688   0.812   0.743
pca 50                                   0.692   0.867   0.759   0.637   0.68    0.799   0.739
nmf 50                                   0.481   0.625   0.532   0.427   0.457   0.681   0.534
pca 100                                  0.742   0.879   0.795   0.695   0.731   0.829   0.779
nmf 100                                  0.397   0.635   0.472   0.358   0.37    0.672   0.484
pcaReduce                                0.768   0.89    0.821   0.747   0.754   0.849   0.805
SIMLR_20                                 0.793   0.806   0.799   0.718   0.77    0.82    0.784
SIMLR_30                                 0.806   0.82    0.811   0.747   0.779   0.838   0.8
SIMLR_40                                 0.834   0.831   0.83    0.747   0.797   0.835   0.812
SNN-Cliq                                 0.751   0.905   0.802   0.716   0.726   0.843   0.79
sincera_hc                               0.807   0.929   0.856   0.797   0.794   0.884   0.845
Dense 1 layer 100                        0.905   0.897   0.9     0.872   0.885   0.915   0.896
Dense 1 layer 796                        0.895   0.887   0.89    0.856   0.874   0.904   0.884
Dense 2 layer 796/100                    0.892   0.882   0.886   0.854   0.869   0.903   0.881
PPITF 1 layer 696 + 100                  0.897   0.896   0.896   0.866   0.882   0.911   0.891
PPITF 2 layer 696 + 100/100              0.906   0.902   0.903   0.874   0.89    0.917   0.899
Dense 1 layer 100 pre-train              0.883   0.871   0.877   0.838   0.858   0.892   0.87
Dense 1 layer 796 pre-train              0.887   0.872   0.879   0.843   0.86    0.896   0.873
Dense 2 layer 796/100 pre-train          0.884   0.875   0.879   0.838   0.862   0.894   0.872
PPITF 1 layer 696 + 100 pre-train        0.878   0.873   0.875   0.83    0.858   0.889   0.867
PPITF 2 layer 696 + 100/100 pre-train    0.9     0.894   0.897   0.87    0.881   0.914   0.893

Results are averaged over 20 clustering experiments (using different random initializations). AMI: adjusted mutual information; ARI: adjusted Rand index; Comp: completeness; FM: Fowlkes–Mallows; Homo: homogeneity; Vmes: v-measure. Bold represents the highest value for each column.


Functional analysis of hidden layer nodes

While NN are often treated as 'black boxes', following recent studies we attempted to characterize the input and hidden nodes (genes/TFs) that had the highest impact on the ability to accurately classify various cell types (18). Such analysis can provide both a functional interpretation of the parameters learned by the method and a characterization of the different types of genes, and their regulators, that are most relevant to specific types of cells. To perform this analysis, we analyze the top 10 most highly weighted groups (hidden nodes) for each cell type using the NN in Figure 1C. To analyze the groups we either used the known characteristics of the node (since, as mentioned in 'Materials and Methods', some of the nodes represent groups of genes known to be co-regulated by the same TF or to be interacting (PPI)) or performed GO analysis on densely connected nodes to determine the function of genes associated with that node. Table 3 presents results for a subset of the cell types and nodes. As can be seen, the top scoring nodes were often very relevant for the specific functions that are performed by the cells or the TFs that regulate these functions. For example, the learned NN correctly associated genes related to proliferation and differentiation with ES cells, while nodes that scored high for immune response categories were mainly associated with bone marrow-derived dendritic cells (BMDCs) (45–49). See also the Supporting Website for tables summarizing more than 1000 significant GO categories for combinations of cell types and nodes based on this analysis.

Table 3.
Examples of highly ranked nodes for some of the cell types used for learning of the NN

Cell type          TF/PPI/dense node    Corrected P-value    GO function/reference
Stem cells (ES)    Dense 35             3.25E-07             Cell differentiation
Stem cells (ES)    Dense 24             4.44E-15             Factor: E2F-3; (45)
Stem cells (ES)    Dense 24             5.78E-12             Factor: IRF6; (46)
Stem cells (ES)    Dense 24             2.53E-08             System development
BMDC               Dense 10             1.56E-05             Immune system process
BMDC               Dense 67             1.59E-11             Positive regulation of immune system process
BMDC               Dense 36             2.06E-06             Response to cytokine
Fibroblast         ppi 223              3.16E-06             Focal adhesion (47)
Fibroblast         ppi 301              2.85E-20             Acetyltransferase complex (48)
Zygote             ppi 10               6.85E-07             Regulation of cell proliferation
Zygote             ppi 280              7.36E-09             Cell junction
Zygote             TF: foxd3            X                    TF: foxd3 (49)

Some of the nodes were based on TF–gene interactions and thus represent a specific TF (Foxd3). For these, we rely on the function of the TF to characterize the node in the table. Other nodes are either based on PPI (e.g. ppi 223) or on groupings learned by the algorithm (e.g. Dense 24). For these, we performed GO analysis on the set of highly ranked genes for these nodes. Several other relevant TFs and functions were found for other cell types as well. See Supporting Website for complete list.


In addition to the analysis of connection weights for the NN learned using labeled data, we also analyzed the values for the nodes obtained by the pre-trained models. Recall that for the pre-training we use a fully unsupervised approach where the goal is to reconstruct the input data using a small set of highly connected nodes (100 in our case). We hypothesized that some of the values learned for these nodes may actually reflect key genes that either regulate other genes (so that their values are enough to reconstruct the full expression set) or important groups of co-expressed genes. We thus selected, for each of the 100 dense nodes in the pre-trained model, the set of three most similar genes based on Pearson correlation. The results are presented in Supplementary Table S11. As can be seen, many of these genes are ribosomal genes, which are indeed a large and coherently expressed group that is captured by some of the nodes. GO analysis of the selected nodes (Supplementary Table S12 shows part of the results) indicated that a significant category for these is 'nucleic acid binding' (corrected P-value = $9\times 10^{-9}$), indicating that the model captures some of the TFs that are likely regulating the expression in the different single cell types. See Supporting Results and Website for complete details.

In addition to the edge weight-based analysis discussed above, we have also performed an analysis of the significant input genes for each cell type using DeepLift (https://arxiv.org/abs/1704.02685). Instead of focusing directly on the parameters learned by the model, DeepLift works by comparing the activity in each internal node between a positive input (correct cell type) and a reference input (a vector of zeros in this case) and then propagating these differences to all nodes in the network, including input nodes. Thus, DeepLift can directly assign significance to each of the input nodes (genes). Using DeepLift we selected the top 100 genes for each cell type and performed a similar gprofiler analysis on this set. The results agreed in many cases with the edge weight analysis discussed above. For example, for ES cells both analyses correctly identified E2F3 and IRF6 as two of the top TFs (though the edge weight based method had better P-values for these). Similarly, for BMDCs both identified immune response as the top function (though here DeepLift had slightly better P-values for one function). See Supplementary Table S10 for a comparison of P-values for DeepLift and our edge weight analysis and the Supporting Website for the full DeepLift results.

Retrieval of single cells based on expression

In most single cell studies, including cancer (23), brain studies (50–53) and more, several different types of cells are profiled in the same experiment. In most cases at least some of the cells cannot be fully assigned, since either they do not contain any of the known markers or they contain several subsets of such markers. Such cells are usually analyzed using clustering to identify several groups within the sampled cells. However, such analysis is unsupervised and so it is not always clear what each of the clusters corresponds to. Identifying the composition of cells is important, for example in cancer studies where notable differences between outcomes have been attributed to the amount of immune cells that are present in the tumor. Thus, an important problem in single cell expression analysis is cell type assignment. One way to address this problem is to compare uncharacterized cells to cells that have already been characterized by prior experiments, either through follow-up studies or because of their known origin. To enable such analysis we collected 31 single cell expression datasets with more than 17 000 samples and created a database in which we stored both the expression measurements for these cells and the assignment of each cell from the paper/experiment in which it was profiled (if available). Using such a database we can find, for each new cell profiled in an experiment, the most similar cells in the database and, based on their annotations, annotate the uncharacterized cells.

A key issue for implementing such a strategy is how to identify the most similar cells in the database when given a new query cell. An alternative to using the measured expression values, which is more efficient and may be more robust to noise, is to first reduce the dimension of the database data and the query cell and then perform the retrieval search in the reduced dimensional space. Such a reduction can be done by several methods as discussed above and can cut the storage requirements and run time by over 80%, depending on the architecture used. Specifically, for the 14 000 queries we performed on the 17 000 cells in the database we reduced the run time from 25 min (when using the observed expression values for 9437 genes) to <5 min when using the 796 features obtained from the NN. The datasets of the query cell types are listed in Supplementary Table S2.

In addition, and most importantly, such a reduced dimension greatly improves performance. To test various ways of querying single cell expression data we held out complete datasets for which we had a similar dataset in the database from a different lab/paper. This helps ensure that results are not affected by unique lab methods but are rather related to biological similarities. We identified 31 different datasets that, combined, profiled close to 14K cells. For each of these held out datasets we searched for the most similar cells in our database. As before, we compared the NN results to results obtained when using the measured expression values and PCA with 100 or 796 dimensions. As for the NNs, we tested two variants. The first is similar to the one described above for clustering, while the second is based on using a DAE to initialize the parameters of the network (using unlabeled data) followed by supervised learning as discussed in 'Materials and Methods'.

To evaluate the different methods, for each query cell we identify the top k most similar cells in the database (where similarity is based on Euclidean distance for all genes or for the reduced dimension representation obtained by each method). We use k = 100 in this paper, though results are similar when using k = 10. We next use the top k matches to compute the MAP for the correct cell type for each query (‘Materials and Methods’).

As can be seen in Table 4, all methods that relied on reduced dimensions did much better than the method that used the measured expression values (20% average improvement for PCA and almost 40% improvement for some of the NN methods when compared to using the measured expression values for all genes). Comparing the reduced dimensionality methods themselves, we observe that NNs with more hidden layers (in our case 2) are doing, on average, better than NNs with a single hidden layer, indicating that non-linear relationships may be important for characterizing single cell expression. These multi-hidden-layer NNs also perform better than PCA.

Table 4.
Average retrieval performance across the different cell types

Models                                   HSC     4cell   ICM     Spleen  8cell   Neuron  Zygote  2cell   ESC     Mean
Original                                 0.081   0.361   0.058   0.987   0.279   0.372   0.468   0.556   0.705   0.430
PCA 100                                  0.299   0.508   0.01    0.996   0.351   0.646   0.539   0.616   0.722   0.521
PCA 796                                  0.227   0.548   0.022   0.994   0.242   0.622   0.419   0.642   0.833   0.505
DAE 100                                  0.236   0.411   0.016   0.973   0.503   0.628   0.193   0.728   0.544   0.470
DAE 796                                  0.14    0.423   0.035   0.992   0.399   0.692   0.399   0.743   0.432   0.473
Dense 1 layer 100                        0.102   0.662   0.038   0.953   0.739   0.485   0.522   0.717   0.604   0.536
Dense 1 layer 100 pre-train              0.23    0.49    0.153   0.984   0.463   0.64    0.408   0.734   0.532   0.515
Dense 1 layer 796                        0.082   0.599   0.096   0.988   0.472   0.389   0.563   0.732   0.683   0.512
Dense 1 layer 796 pre-train              0.116   0.515   0.073   0.991   0.463   0.702   0.423   0.714   0.446   0.494
Dense 2 layer 796/100                    0.069   0.77    0.065   0.956   0.896   0.563   0.275   0.673   0.583   0.539
Dense 2 layer 796/100 pre-train          0.164   0.648   0.035   0.987   0.715   0.633   0.498   0.747   0.507   0.548
PPITF 1 layer 696 + 100                  0.078   0.636   0.148   0.965   0.667   0.464   0.202   0.314   0.63    0.456
PPITF 1 layer 696 + 100 pre-train        0.168   0.557   0.028   0.982   0.55    0.647   0.447   0.665   0.569   0.513
PPITF 2 layer 696 + 100/100              0.068   0.771   0.182   0.956   0.849   0.561   0.415   0.553   0.71    0.563
PPITF 2 layer 696 + 100/100 pre-train    0.397   0.614   0.185   0.975   0.725   0.626   0.435   0.688   0.554   0.578

Bold indicates the highest value for each column.


We also observe that both the use of prior biological knowledge to define the NN architectures (PPITF networks) and the use of pre-training with a DAE improve the overall accuracy of the retrieval. Specifically, the best performing method (achieving an improvement of more than 11% over PCA) is the PPITF 2 layer 696 + 100/100 pre-train model, which combines all of these features (2 layers, pre-training and the use of prior knowledge). Other architectures that use prior knowledge are also better than their dense counterparts. In contrast, DAEs on their own (fourth and fifth rows) are not as effective as the supervised models and so they are probably best used for initialization rather than as the final model.

DISCUSSION AND FUTURE WORK

While single cell analysis holds great promise, it also raises new questions. Given the number of cells that are profiled in each experiment, which can reach thousands (51,54), new methods are required for accurately and efficiently clustering these data while overcoming issues related to the stochastic nature of gene expression even in similar cells, noise and missing values. A related problem is the ability to compare expression profiles from cells so that cell type assignments can be determined not just based on a few marker genes but rather based on the overall expression of all genes in the cells profiled.

In this paper, we developed and tested solutions based on deep neural networks for these problems. The advantage of such networks is that they can learn which combinations of gene expression levels define each cell type, and such combinations are usually more robust than the values of individual genes or markers. We tested several NN architectures, including architectures constrained by prior biological knowledge. As we show, the NNs achieve very good classification performance on the training data and improve upon prior methods when used to cluster datasets from experiments that were not used in training. We also performed functional analysis of the set of highly weighted nodes for each cell type and showed that, even though NNs are often described as a 'black box' learning method, many of these nodes are functionally related to the cell type for which they were selected.
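As an illustration of how such node-level functional analysis can be carried out, the hypothetical helper below extracts, for each hidden node, the genes with the largest incoming weights; the resulting gene lists can then be tested for enrichment with a tool such as g:Profiler (37). The weight-matrix orientation, the use of absolute weights and the cutoff of 20 genes per node are assumptions for illustration, not the exact procedure used here.

```python
import numpy as np

def top_weighted_genes(first_layer_weights, gene_names, n_top=20):
    """For each hidden node, return the genes with the largest absolute
    incoming weights (first_layer_weights is assumed to be an
    n_nodes x n_genes array from a trained model)."""
    order = np.argsort(-np.abs(first_layer_weights), axis=1)[:, :n_top]
    return [[gene_names[j] for j in row] for row in order]
```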

Several methods have relied on clustering to annotate genes in single cell studies (3,16). Here we used NNs for this task. Unlike most prior methods, which rely on co-expression for clustering, our method identifies sets of genes that, in combination, discriminate between different cell types. These genes can be co-expressed, co-repressed or not expressed at all, and they are selected as part of the training procedure rather than in a post-processing step. Such analysis may prove beneficial for grouping genes because it is supervised, which allows it to avoid groupings driven by aspects unrelated to cell type (for example, cell cycle phase).

As a final application, we used the reduced representation obtained from the NN to query a large database of over 17K single cell expression profiles in order to determine the cell type of a newly profiled single cell. As we show, using such a representation greatly improved the performance of the retrieval analysis while reducing the overall runtime and storage required. The Supporting Website (http://sb.cs.cmu.edu/scnn/) provides an implementation of the retrieval method, which researchers can use to determine cell types for newly profiled single cells.
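The retrieval step itself amounts to a nearest-neighbour search in the low-dimensional space. The sketch below is a minimal illustration of this idea; the use of cosine similarity and a simple vote over the top k hits are assumptions, and the scoring used by the web server may differ.

```python
import numpy as np

def query_cell_type(query_emb, db_emb, db_labels, k=100):
    """Rank database cells by cosine similarity of their reduced (hidden-layer)
    representations to the query cell and tally the cell-type labels of the
    k most similar cells."""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    top = np.argsort(-(d @ q))[:k]  # indices of the k nearest database cells
    labels, counts = np.unique(np.asarray(db_labels)[top], return_counts=True)
    return sorted(zip(labels, counts), key=lambda lc: -lc[1])
```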

Our supervised training strategy is focused on cell type identification. Such a target function leads the method to ignore similarities between expression profiles that stem from other commonalities shared across cell types, for example cell cycle phase. Researchers interested in such commonalities can either train a separate model whose target reflects them or use the autoencoder version, which is fully unsupervised and therefore likely to identify additional commonalities beyond cell type.

While the results are encouraging, there are several directions for future work that we would like to explore. These include testing more involved (deeper) architectures, integrating additional types of prior biological knowledge into the model, and developing an automated tool that downloads new single cell expression data to expand the set used by the retrieval application. A major challenge for the latter direction is automatically assigning cell types to published expression data, given the varied ways in which such information is defined and encoded.

AVAILABILITY

Supporting website: http://sb.cs.cmu.edu/scnn/.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institutes of Health (NIH) [1R01HL128172, 1U01HL122626-01]; National Science Foundation [DBI-1356505]. Funding for open access charge: NIH [1R01HL128172].

Conflict of interest statement. None declared.

REFERENCES

1. Buettner F., Natarajan K.N., Casale F.P., Proserpio V., Scialdone A., Theis F.J., Teichmann S.A., Marioni J.C., Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 2015; 33:155–160.
2. Jaitin D.A., Kenigsberg E., Keren-Shaul H., Elefant N., Paul F., Zaretsky I., Mildner A., Cohen N., Jung S., Tanay A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014; 343:776–779.
3. Trapnell C., Cacchiarelli D., Grimsby J., Pokharel P., Li S., Morse M., Lennon N.J., Livak K.J., Mikkelsen T.S., Rinn J.L. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 2014; 32:381–386.
4. Darmanis S., Sloan S.A., Zhang Y., Enge M., Caneda C., Shuer L.M., Gephart M.G.H., Barres B.A., Quake S.R. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:7285–7290.
5. Poulin J.-F., Tasic B., Hjerling-Leffler J., Trimarchi J.M., Awatramani R. Disentangling neural cell diversity using single-cell transcriptomics. Nat. Neurosci. 2016; 19:1131–1141.
6. Treutlein B., Brownfield D.G., Wu A.R., Neff N.F., Mantalas G.L., Espinoza F.H., Desai T.J., Krasnow M.A., Quake S.R. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014; 509:371–375.
7. Kolodziejczyk A.A., Kim J.K., Tsang J.C., Ilicic T., Henriksson J., Natarajan K.N., Tuck A.C., Gao X., Bühler M., Liu P. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell. 2015; 17:471–485.
8. Hough S.R., Thornton M., Mason E., Mar J.C., Wells C.A., Pera M.F. Single-cell gene expression profiles define self-renewing, pluripotent, and lineage primed states of human pluripotent stem cells. Stem Cell Rep. 2014; 2:881–895.
9. Hackl H., Charoentong P., Finotello F., Trajanoski Z. Computational genomics tools for dissecting tumour-immune cell interactions. Nat. Rev. Genet. 2016; 17:441–458.
10. Shalek A.K., Satija R., Shuga J., Trombetta J.J., Gennert D., Lu D., Chen P., Gertner R.S., Gaublomme J.T., Yosef N. et al. Single cell RNA Seq reveals dynamic paracrine control of cellular variation. Nature. 2014; 510:363–369.
11. Usoskin D., Furlan A., Islam S., Abdo H., Lönnerberg P., Lou D., Hjerling-Leffler J., Haeggström J., Kharchenko O., Kharchenko P.V. et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 2015; 18:145–153.
12. Yau C. et al. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016; 17:140.
13. Xu C., Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015; 31:1974–1980.
14. Pierson E., Yau C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015; 16:241.
15. Guo M., Wang H., Potter S.S., Whitsett J.A., Xu Y. SINCERA: a pipeline for single-cell RNA-seq profiling analysis. PLoS Comput. Biol. 2015; 11:e1004575.
16. Wang B., Zhu J., Pierson E., Ramazzotti D., Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods. 2017; 14:414–416.
17. Hornik K., Stinchcombe M., White H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989; 2:359–366.
18. Tan J., Hammond J.H., Hogan D.A., Greene C.S. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. mSystems. 2016; 1, doi:10.1128/mSystems.00025-15.
19. Gupta A., Wang H., Ganapathiraju M. Learning structure in gene expression data using deep architectures, with an application to gene clustering. 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2015; Washington, DC: IEEE, 1328–1335.
20. Chopra S., Hadsell R., LeCun Y. Learning a similarity metric discriminatively, with application to face verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1. 2005; San Diego: IEEE, 539–546.
21. Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., Marshall K.A., Phillippy K.H., Sherman P.M., Holko M. et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013; 41:D991–D995.
22. Lun A.T., Bach K., Marioni J.C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016; 17:75.
23. Kimmerling R.J., Szeto G.L., Li J.W., Genshaft A.S., Kazer S.W., Payer K.R., de Riba Borrajo J., Blainey P.C., Irvine D.J., Shalek A.K. et al. A microfluidic platform enabling single-cell RNA-seq of multigenerational lineages. Nat. Commun. 2016; 7:10220.
24. Shin J., Berg D.A., Zhu Y., Shin J.Y., Song J., Bonaguidi M.A., Enikolopov G., Nauen D.W., Christian K.M., Ming G.-l. et al. Single-cell RNA-seq with Waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell. 2015; 17:360–372.
25. Shalek A.K., Satija R., Adiconis X., Gertner R.S., Gaublomme J.T., Raychowdhury R., Schwartz S., Yosef N., Malboeuf C., Lu D. et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013; 498:236–240.
26. Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Botstein D., Altman R.B. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001; 17:520–525.
27. Stark C., Breitkreutz B.-J., Reguly T., Boucher L., Breitkreutz A., Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(Suppl. 1):D535–D539.
28. Prasad T.K., Goel R., Kandasamy K., Keerthikumar S., Kumar S., Mathivanan S., Telikicherla D., Raju R., Shafreen B., Venugopal A. et al. Human protein reference database–2009 update. Nucleic Acids Res. 2009; 37(Suppl. 1):D767–D772.
29. Schulz M.H., Devanny W.E., Gitter A., Zhong S., Ernst J., Bar-Joseph Z. DREM 2.0: improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Syst. Biol. 2012; 6:104.
30. Gitter A., Bar-Joseph Z. Identifying proteins controlling key disease signaling pathways. Bioinformatics. 2013; 29:i227–i236.
31. Nepusz T., Yu H., Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Methods. 2012; 9:471–472.
32. Erhan D., Bengio Y., Courville A., Manzagol P.-A., Vincent P., Bengio S. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 2010; 11:625–660.
33. Hinton G.E., Salakhutdinov R.R. Reducing the dimensionality of data with neural networks. Science. 2006; 313:504–507.
34. Chen Y., Li Y., Narayan R., Subramanian A., Xie X. Gene expression inference with deep learning. Bioinformatics. 2016; 32:1832–1839.
35. Krizhevsky A., Hinton G.E. Using very deep autoencoders for content-based image retrieval. 19th European Symposium on Artificial Neural Networks. 2011; Bruges.
36. Glorot X., Bengio Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS. 2010; 9:249–256.
37. Reimand J., Arak T., Adler P., Kolberg L., Reisberg S., Peterson H., Vilo J. g:Profiler–a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Res. 2016; 44:W83–W89.
38. Arthur D., Vassilvitskii S. k-means++: the advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007; Society for Industrial and Applied Mathematics, 1027–1035.
39. Hubert L., Arabie P. Comparing partitions. J. Classif. 1985; 2:193–218.
40. Vinh N.X., Epps J., Bailey J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010; 11:2837–2854.
41. Fowlkes E.B., Mallows C.L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 1983; 78:553–569.
42. Rosenberg A., Hirschberg J. V-Measure: a conditional entropy-based external cluster evaluation measure. EMNLP-CoNLL. 2007; 7:410–420.
43. Tseng G.C., Wong W.H. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics. 2005; 61:10–16.
44. Monti S., Tamayo P., Mesirov J., Golub T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 2003; 52:91–118.
45. Julian L.M., Blais A. Transcriptional control of stem cell fate by E2Fs and pocket proteins. Front. Genet. 2015; 6:161.
46. Bailey C.M., Hendrix M.J. IRF6 in development and disease: a mediator of quiescence and differentiation. Cell Cycle. 2008; 7:1925–1930.
47. Rustad K.C., Wong V.W., Gurtner G.C. The role of focal adhesion complexes in fibroblast mechanotransduction during scar formation. Differentiation. 2013; 86:87–91.
48. Ghosh A.K., Varga J. The transcriptional coactivator and acetyltransferase p300 in fibroblast biology and fibrosis. J. Cell. Physiol. 2007; 213:663–671.
49. Eckardt S., McLaughlin K.J., Willenbring H. Mouse chimeras as a system to investigate development, cell and tissue function, disease mechanisms and organ regeneration. Cell Cycle. 2011; 10:2091–2099.
50. Hu Y., Huang K., An Q., Du G., Hu G., Xue J., Zhu X., Wang C.-Y., Xue Z., Fan G. Simultaneous profiling of transcriptome and DNA methylome from a single cell. Genome Biol. 2016; 17:1.
51. Zeisel A., Muñoz-Manchado A.B., Codeluppi S., Lönnerberg P., La Manno G., Juréus A., Marques S., Munguba H., He L., Betsholtz C. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015; 347:1138–1142.
52. Li C.-L., Li K.-C., Wu D., Chen Y., Luo H., Zhao J.-R., Wang S.-S., Sun M.-M., Lu Y.-J., Zhong Y.-Q. et al. Somatosensory neuron types identified by high-coverage single-cell RNA-sequencing and functional heterogeneity. Cell Res. 2016; 26:83–102.
53. Kim C.C., Nakamura M.C., Hsieh C.L. Brain trauma elicits non-canonical macrophage activation states. J. Neuroinflammation. 2016; 13:1.
54. Klein A.M., Mazutis L., Akartuna I., Tallapragada N., Veres A., Li V., Peshkin L., Weitz D.A., Kirschner M.W. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161:1187–1201.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
