The Deep Generative Decoder: MAP estimation of representations improves modelling of single-cell RNA data

Abstract Motivation Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models, such as variational autoencoders, which use a variational approximation of the likelihood for inference. Results We here present the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori estimation. The DGD handles complex parameterized latent distributions naturally unlike variational autoencoders, which typically use a fixed Gaussian distribution, because of the complexity of adding other types. We first show its general functionality on a commonly used benchmark set, Fashion-MNIST. Secondly, we apply the model to multiple single-cell datasets. Here, the DGD learns low-dimensional, meaningful, and well-structured latent representations with sub-clustering beyond the provided labels. The advantages of this approach are its simplicity and its capability to provide representations of much smaller dimensionality than a comparable variational autoencoder. Availability and implementation scDGD is available as a python package at https://github.com/Center-for-Health-Data-Science/scDGD. The remaining code is made available here: https://github.com/Center-for-Health-Data-Science/dgd.


Introduction
High-throughput methods in biology and medicine produce vast amounts of high-dimensional noisy data that we seek to extract knowledge from.The first step is often to obtain a lowdimensional representation of the data through clustering, PCA analysis or other similar techniques.A good example is gene expression analysis, where we need to compare counts for each of the 10-20 thousand genes between samples or cells.Here the first analysis is often to visualize the data in two dimensions with UMAP [1] or t-SNE [2].Although these tools are indispensable, they do not model the uncertainty of the count data or reveal any of the underlying structure in the data.Therefore, it has become popular to make use of generative models that give low-dimensional representations mapping to a probability distribution over the data.Some of the most common are variational autoencoders (VAEs) [3], generative adversarial networks (GANs) [4], normalizing flows [5] and energy-based models (EBMs) [6].Even though these approaches have their shortcomings [7], they have shown great success in a multitude of applications such as single-cell RNA-sequencing (scRNA-seq) [8,9], image and surface generation [10,11], and natural language processing [10,12].In addition, diffusion models [13] and Transformers [14] are among the best performing and most popular approaches in generative modeling of image and text data.However, these methodologies do not necessarily include probabilistic latent spaces.
The VAE is the most common type of generative model applied to biological data.In general, VAEs consist of an encoder that maps data to a lower-dimensional probabilistic representation in latent space and a decoder that maps back to the data.The model is trained using maximum likelihood estimation (MLE), which seeks to find a set of parameters that maximize the likelihood of the data.However, the likelihood calculation is intractable due to an integral over the latent space.The solution in the VAE is to approximate the likelihood and maximize a variational lower bound (the "ELBO").This approximation is not always accurate [15] and favours larger latent dimensionalities than necessary [16].
The decoder alone specifies the generative model and the encoder of the VAE is just a convenient tool for finding representations, but it adds a whole new set of modelling choices to be made, which influence the inference gap and expressiveness [15].As a result, encoder-less representation learning has been suggested many times in different ways.There are simple, deterministic approaches that maximise the likelihood over representations of the training data and the decoder parameters at the same time [17][18][19].Here and in our work, the representations are treated like the other model parameters and since gradients can be derived exactly, the model can be trained using standard gradient descent.There are also other encoder-less generative models, such as Gaussian Process Latent Variable Models (GP-LVMs) [20] or PCAbased models [21][22][23].In their vanilla form, they have issues with scaling and limited model complexity and have not quite gained traction in the application to biological data compared to VAEs.However, scalable versions of GP-LVMs have been presented [24][25][26] and PCA-based methods have been applied for the joint modelling of multiple single-cell modalities [27].Scalability in these GP-LVM applications is achieved by applying sparse Gaussian processes and amortised inference [24][25][26].As a result, they are still limited to the same problems of variational inference [15] and sensitivity of latent initialisation [20].The standard VAE and many other models use a fixed Gaussian distribution over latent space.This may result in underexpressive representations, and it has been shown that learning the parameters of a Gaussian mixture leads to improved generalization and clustering [28,29].However, this makes the variational inference even more complex, as it has to be combined with score matching [30], or other approaches.Another alternative approach worth mentioning is the VQ-VAE [31] which learns discrete representations with more powerful priors than a standard Gaussian, but still requires an encoder.
In this work, we present a probabilistic formulation of a generative decoder using maximum a posteriori (MAP) estimation of all parameters.MAP estimation is the Bayesian analog of MLE and seeks to find the set of parameters that maximize the posterior, which is the probability of the representations and model parameters given the observed data.There is no intractable step in this estimation, so both model parameters and representations can be estimated directly.The model we are introducing consists of a decoder (or generator), such as a feed-forward neural network, and a distribution over representations with learnable parameters, which are optional.Estimating the representations of the training samples as well as the parameters is straight-forward and can be done by gradient descent as in a non-probabilistic formulation [32].One of the main advantages of this approach over variational inference is the ease and flexibility with which distributions over representations can be learned.The advantage over GP-LVMs is that the approach is scalable.Our emphasis here is on a model with a parameterized distribution over representations.This is motivated by the intuition that representations are meaningful and are likely to group samples with different properties.We think of this approach as a marriage of generative models and manifold learning, such as UMAP.The beauty of the approach lies in its simplicity and utility.The decoder parameters, latent representations and latent distribution are estimated in the same way.The simplicity makes it easy to extend the model to more complex distributions and losses.Unlike in Gaussian Mixture VAEs [33,34], no additional encoder for the Gaussian components is needed.It also makes it much simpler to understand for non-specialists.This is especially important, as we see great potential for this approach in fields like biology and medicine.
In the following sections we present the theory of the model and experiments on two very different types of data.We first demonstrate the general functionality of the proposed model on the Fashion-MNIST [35] benchmark dataset.This is a typical choice in deep learning and an important step, because it enables us to investigate and understand the model's generative capabilities and potential caveats.We further explore the complexity of the distribution over representations and compare latent space and generative capabilities to models using variational inference.We also showcase the extension of the approach to supervized learning in which the mixtures of the latent distribution represent specific data classes.After establishing the model's functionality, we show an application to single-cell gene expression count modelling.We chose a data set of peripheral blood mononuclear cells (PBMC) [36], for which we compare our approach in terms of data reconstruction and clustering performance to scVI [37] and scVAE [38].The latter is a variational autoencoder also using an adaptive mixture of Gaussians rather than a single fixed Gaussian to model the latent representation.We show that our model learns a representation and Gaussian mixture model which cluster well according to cell type with additional, previously unobserved sub-clustering.Finally we compare the two models on 11 different single-cell gene expression data sets using default settings of scVI and our model.

The model
Consider a model with observed variable x and continuous latent variable z, which we also refer to as a representation.Usually, the x-space (or sample space) is of a higher dimension than the z-space, so the model gives a low-dimensional representation of data.A neural network, the decoder, with parameters θ takes z as input and outputs f θ (z), which are the parameters for the distribution of x.These could be the means and standard deviations of independent normal distributions.Thus, the neural network defines a conditional distribution in sample space, . We assume a distribution over representation space, P (z | φ), with parameters φ.The parameters may be fixed as in a standard VAE with a fixed Gaussian distribution over representations.It can also have adjustable parameters, such as a mixture of Gaussians with trainable means, covariances and mixture coefficients, but essentially any (differentiable) parameterized distribution can be used.When introducing priors over parameters, the joint probability of everything becomes P (x, z, φ, θ) = P (x, z | φ, θ)P (φ)P (θ) = P (x | z, θ)P (z | φ)P (θ)P (φ). ( Again, P (x | z, θ) represents the decoder neural network, P (z | φ) the distribution over representations, P (θ) the prior over the decoder parameters (neural network weights), and P (φ) the prior over the parameters in the representation space distribution.We use maximum a posteriori (MAP) estimation to find optimal parameters and representations.For a data set X = {x 1 , x 2 , . . ., x N } with N samples we thus want to find representations Z = {z 1 , z 2 , . . ., z N } and parameters φ, θ that maximize P (Z, φ, θ | X): ( If we assume independence among training samples, the log of this joint probability can be obtained from (2), with i indexing the samples log P (X, Z, φ, θ) This is the quantity we want to maximise with respect to both the representations (z) for all samples and the model parameters (φ, θ).The training process can be imagined like this: A representation z for each training sample is initialized.This can be a random sample from a chosen distribution or a zero-valued vector.Representations are then passed through the decoder to give P (x | z, θ) and losses are computed from this and P (z | φ).All parameters θ, φ, and z are then updated via back-propagation.The implementation is efficient and straightforward and compatible with standard modules and loss functions in deep learning frameworks, such as PyTorch.Once the model is estimated, prediction on new data points is done by first finding an optimal representation by maximizing P (x | z, θ)P (z | φ) as above while keeping all other model parameters fixed, i.e., do gradient descent in z alone.
In this work we use a parameterized Gaussian mixture for the representations (P (z | φ)) with diagonal covariance matrix and a mollified Uniform ("softball prior") as prior over component means.Hereafter, we will refer to this Gaussian Mixture Model as GMM.We use a Dirichlet prior as a prior on the mixture weights and a Gaussian prior on the log-variances of the components.We use a weight decay in the training of the decoder, which corresponds to a Gaussian prior on the weights.

Demonstration on image data
It is customary to test generative models on image benchmarks, so we begin our experimental analysis by applying our model to the Fashion-MNIST data.It allows us to understand and evaluate model behaviour, latent space and most importantly sampling quality much better than more complex, biological data.The decoder used for the Fashion-MNIST models is based on the DCGAN [39] generator architecture with modifications using common methods in image generation and is described in detail in the Methods section and Supplementary figure 3. We limited the design space [40] to a latent dimension of 20, as interpolation can only occur in low-dimensional representations [41].The remaining parameters and how they were chosen are described in the Methods section.The model was trained for 500 epochs.
In figure 2b) we see that a GMM with 20 Gaussian components distributes well over the complex latent space.The resulting distribution over latent space (the parameterized GMM) models the latent representation much better than for GMMs with less components which fail to pick up on more subtle structures (Supplementary figure 1).This supports the hypothesis stated in many works that overly simple distributions such as standard Gaussians may not be sufficient for describing the probabilistic representations of many data.Another advantage of the GMM is the increase in control over sampling.Figure 2a) shows generated samples from individual components of the GMM.Having more components leads to each of them covering a much smaller area of the latent space, and thus offering less varied but more specific samples.In this case of a 20-dimensional latent space with 20 mixture components, each component mean can be attributed to a (sub-)class of Fashion-MNIST.A relatively high number of components can thus enable more deliberate, more qualitative sampling and a better model of the latent space.While the latent representations in figure 2b) show clear separation for super-classes such as shoes (Ankle boot, Sneaker, Sandal) and some distinct classes like of trousers and bags, other classes are much more mixed.We evaluate the clustering with the adjusted Rand index (ARI) [42].This metric achieves values between 0 and 1 with 1 representing a perfect clustering and is corrected for chance.The ARI of this clustering based on predicted labels derived from the GMM's component probabilities is only 0.295.As an extension of the DGD, it is also possible to perform supervised training, where each GMM component is assigned a data class a priori.In the optimization, each component is only committed to covering the latent space of its assigned classes' representations.When trained with these assigned 10 components, we arrive at the distinct latent clusters seen in figure 2c) with an ARI of 0.995 (which is merely a sanity check, since the ARI is expected to be high).While the quality of random samples from this model is reduced slightly (see Supplementary figure 2) in comparison to those from an equivalent unsupervised model, we achieve perfect control over the sample identity.
We next investigated how the MAP approach compares to variational inference (VI) and amortization.For this purpose, we limit the prior of the DGD to a fixed standard Gaussian.We found that once the encoder of a VAE is removed (using the variational auto-Decoder (VAD) [43]), reconstruction performance improves significantly and is equal to that of the DGD, as seen in Figure 2d) and Tables 1 and S1.Supplementary table 1 also shows that latent space normality is improved by removing the encoder.This supports [18,32,44] in their results stating that the encoder can be a hindrance to achieving good representations.Given the standard Gaussian priors, the structure of the latent spaces is not practically changed between the three models (Supplementary figure 4), even though ARIs vary a lot (Supplementary table 1).
Our demonstration of the DGD on image data comes in very handy when next evaluating the generative modeling capabilities of the different approaches.The benefit of image data is that it comes naturally to us to evaluate whether generated data is sensible or not from a qualitative standpoint.In addition, there are many quantitative measures available.We make use of the Fréchet Inception Distance (FID) [45], a popular metric of the similarity between image distributions.It is, however, a biased metric depending on the number of samples and the generator itself.A qualitative analysis of the samples from all models (see figure 2e)) ranks the DGD samples the highest in terms of realisticness and detail.However, the FID score of the VAE samples are slightly better (Table 1) despite the generated images being very generic and blurry.
The FID score of the DGD improves drastically with learning a more complex parameterized distribution, which lends itself naturally to our approach.When using 20 Gaussian components, The FID decreases to 27.25 ± 0.11 (from three computations using different random seeds, see the methods section), positioning itself between the probabilistic autoencoder (28.0) [46] and PeerGAN (21.73) [47].

Application to single-cell data
In order to assess the model performance on scRNA data, we chose to apply it to the peripheral blood mononuclear cells (PBMC) data from [36].The single cell expression counts are modeled by a Negative Binomial distribution.We achieved best clustering results with a latent dimension of 20 and 3 fully connected hidden layers.The model, which we will refer to as scDGD, Table 1 Comparison of performance metrics for DGD, VAD and VAE with latent dimension 20 and one fixed standard Gaussian as latent prior.Model names are indicated on the left.Performance metrics computed are indicated as the remaining columns.The BCE loss is the average binary cross-entropy loss of the reconstructed images, indicated with its standard error.The FID score is the Fréchet Inception Distance, a metric of the distance between generated and original image distributions.Both metrics are computed with respect to the test images.Reported means and errors stem from repeats with three different random seeds.We additionally report run times based on the same model architectures as an average over runs for three different latent dimensions (20,50,100) and 1000 epochs per run along with the standard error of the mean.The same hardware was used for all runs.was trained for 700 epochs after which it achieved its overall best performance in terms of reconstruction and clustering.The number of GMM components was set to 9, which represents the number of cell types in the data.Details about our implementation and how we arrived at these hyperparameters can be found in the Methods section and Supplements.

Model
The resulting latent space is depicted as a UMAP projection in figure 3. From a visual perspective, the latent space and GMM components cluster well with the cell types.For clustering purposes, a cell is assigned to the component that gives the largest probability to its representation.The adjusted Rand index (ARI) of this clustering on the train set is 0.618.In the work of [38], scVAE was reported to have a higher clustering performance than scVI of around 0.66 compared to 0.53 for scVI.However, scVI clearly outperformed the Gaussian Mixture VAE in terms of reconstructions.Since the scVAE tool could not be retrained, we as well compare our model to scVI and include the Leiden algorithm [48] as a baseline for clustering performance.The clustering performance of scVI is evaluated based on Kmeans clustering of the latent space with 9 components.
As a baseline for clustering performance, we apply the Leiden [48] algorithm commonly used in the field of single-cell data.At 9 components, this achieves an ARI of 0.51.A default scVI model trained on our train set resulted in a clustering performance of 0.566, which is derived from Kmeans clustering with 9 components.The reconstruction performance of scVI is on par with our model trained with 9 components in terms of goodness of fit, but underperforms in reconstruction accuracy (see Table 2).The reconstruction performance is evaluated based on two metrics.Firstly, we compute the negative log likelihood (NLL) of the true held out test counts being drawn from the gene-specific Negative Binomial distributions.These distributions are parameterized by the models' predicted means and learned dispersion parameters.Secondly, we also compute the root mean squared errors (RMSE) as a standard metric for comparability.We include both metrics for a more comprehensive understanding of the reconstruction performance.While the NLL measures the goodness of fit of the probabilistic modeling of counts and leaves more flexibility for genes that are highly variable, the RMSE emphasises the reconstruction accuracy and highlights sensitivities to outliers in the modeling.The clustering of scVI derived from the Kmeans algorithm with 9 components, however, is still lower than that of scDGD (see Table 2).This shows that with scDGD, there is no need for a compromise with respect to built-in, high-performance clustering and count modeling.
We also trained a model with 18 Gaussian components, since we have learned in our explorations on image data that increasing the complexity of the distribution over representation can be beneficial to modeling the substructures in the representation if the problem is complex enough.Increasing the number of components resulted in equal reconstruction performance as seen in Table 2 and highly improved clustering performance with an ARI of 0.684.As we can see in figure 3c), several cell types have one or more components specifically assigned to them.Among these are CD4+ naïve and memory T cells, NK, CD34+ and B cells.The cytotoxic T cell sub-types cannot easily be differentiated by component assignment, but are distinct from all other cell types in the data.The same goes for all CD4+ T cells except memory cells.Since t-SNE distorts the latent space, we also show a graph of the component means in Figure 3d) based on the Euclidean distances between them (Supplementary figure 5).To go one step further in the analysis of our model, we next aimed to get an understanding of the structuring of the learned latent space.In both models with 9 and 18 components, we see that the CD34-positive cells form sub-clusters that are modelled by different Gaussian components.We therefore investigated the representations for CD34-positive cells learned by the 18-component scDGD in terms of some more fine-grained differentiation markers.We were able to determine a sub-cluster best modeled by component 17 (see 4a)) which is primarily occupied by cells expressing markers for the lymphoid lineage of hematopoetic stem cells (HSCs).This cluster further shows a distinction between bone marrow common lymphoid progenitors (CLPs) and cord blood CLPs, indicated by markers CD10 and CD7, respectively.Another sub-cluster, represented by components 8 and 16, is made up of several cells expressing CD18, which is a marker for myeloid cells regulating neutrophil production and a general marker for granulocytes.There is also a strong presence of CD14-positive cells in these clusters, suggesting Table 2 Comparison of performance metrics for scDGD and scVI.Model names are indicated on the left, with c describing the number of Gaussians in the GMM (1 indicates a standard Gaussian).In the row for Leiden clustering, n/a denotes that the metric is not applicable for the given method.Performance metrics computed are indicated as the remaining columns.The NLL refers to the negative log-likelihood of the negative binomial distribution which models the counts in all methods.The best values for each metric are highlighted in bold.an aggregation of monocytes and neutrophils.The biggest sub-cluster, modeled by component 14, suggests the presence of different sub-populations of the myeloid lineage of HSCs.We find gradual levels of CD41 to CD62L, which are markers for megakaryoblasts and myeloblasts, respectively.These are also correlated with component 7. We additionally found a concentrated site showing expression of different dendritic cell markers (CD1c, CD141, CD303), best represented by components 8 and 16.

Model
Lastly, we investigated the merits of scDGD in terms of user experience and applied both default scDGD and default scVI to 10 data sets that had no part in the development of scDGD.Details about these data sets from mouse and human and their preprocessing can be found in the Methods section.Figure 5 shows that scDGD outperforms scVI in both data reconstruction (accuracy measured by RMSE) and latent clustering (in 9 out of 10 and 7 out of 10 cases, respectively).We also trained scVI and scDGD on the extremely large mouse brain data set with roughly 1.3 million cells [].On this large data set, scVI outperforms scDGD in terms of RMSE, but achieves again lower clustering performance and requires more computational resources (see Supplementary table 2).

Discussion
In this work, we have presented the DGD, a Bayesian formulation of a deep generative neural network using MAP estimation for model parameters and representations.This is a fully tractable and simple approach, for which no encoder is needed.This gives the DGD the advantage of having fewer parameters and fewer modelling choices, such as encoder architecture.We have shown that the DGD achieves good reconstruction, clustering and generative performance on much smaller latent spaces than some of the other generative approaches such as VAEs.In this sense it is comparable to flow-based models [49], but with less architectural restrictions.Additionally, we see clear advantages of this approach due to the simplicity of incorporating more complex distributions for modelling the latent representation.Analogous to our results on Fashion-MNIST, others have reported much better clustering and generative performances with a mixture of Gaussians compared to classic VAEs with standard Gaussians as priors over latent space [38].
In our application to single-cell expression counts, we showed that we can achieve reconstruction and clustering performances superior to those of comparable models, but with much smaller and better structured latent spaces, fewer parameters and more detailed distributions.Our choice for comparison are scVI [37] and scVAE [38], two VAE-based methods modelling the representation of single-cell expression data.Overall, we compare scVI and our model on a total of 12 data sets, ranging from 500 to over one million cells.In these experiments, scDGD largely outperforms scVI on both reconstruction and clustering performance.Investigating one of the learned representations closer, we find that the structure shows meaningful relationships and sub-clustering by Gaussian components, which can be linked to different sub-populations in some cell types.
A caveat of this model is the inference time of new data points.The lack of an encoder means that the representations of new data points need to be found by back-propagation (with fixed model parameters).This process is more time-consuming and computationally expensive than passing new data points through an encoder as in the VAE-based model.However, in our experiments this only took 0.93 seconds to 36.24 minutes for 59 to 130613 test cells, respectively.In the case where inference is done repeatedly, one could also train an encoder afterwards as described in [50], which still results in a more expressive representation than the traditional AE setup.
Altogether, we find that the DGD is a simple and efficient approach to learning lowdimensional representations.Unlike some other generative models, it is fully tractable and bears no architectural restrictions.Although the model can be supervised as we showed on Fashion-MNIST, our scRNA model is completely unsupervised and is suitable for data without existing labels, while achieving a meaningful clustering of the representation on its own.We also observed that a more complex distribution relating to more GMM components can be beneficial for discovering more substructures in the data than provided by class labels.This makes the DGD an ideal candidate for modelling a multitude of biological and medical data.For these types of data, interpretability of the latent space is crucial.Several downstream applications such as perturbations and pseudo-time require a low-dimensional and continuous representation.We plan to expand the model to a variety of biological tasks and to include semi-supervised learning and other features that are of interest to generative models for biological and medical data.We further aim to apply this approach to even more extensive single-cell expression data sets and multi-omics data that allow a more thorough investigation of the latent space and can provide meaningful tools for data integration and downstream analysis.

Data sets
In this work we made use of in total 13 publicly available data sets, out of which 12 represent single cell transcriptomics sets.The data set referred to as PBMC was used to develop the application of the DGD to single cell expression data.The remaining single cell data sets were used to evaluate the performance of scDGD with default parameters in comparison to the popular and successful scVI model with default parameters.

Fashion-MNIST
The data set used in our proof of concept is the natural image Fashion-MNIST [35] set.We used the data set's implemented train-test split.

PBMC
The data set used to develop the single-cell application of our model is a single-cell gene expression count data set of peripheral blood mononuclear cells (PBMC) presented by [36].The data is provided by 10x Genomics under "Single Cell 3' Paper: Zheng et al. 2017 (v1 Chemistry)" and consists of data from the following 9 cell types: CD4+/CD45RA+/CD25-naïve T cells, CD4+ helper T cells, CD4+/CD25+ regulatory T cells, CD4+/CD45RO+ memory T cells, CD8+/CD45RA+ naïve cytotoxic T cells, CD8+ cytotoxic T cells, CD56+ natural killer cells, CD34+ cells, and CD19+ B cells.As [38] we used the filtered gene-cell matrices.This data is extremely sparse with 98% of the data being zero [38].Of the 32738 genes covered by this assay, only 21812 are expressed in the whole data set at least once.The 92043 samples are randomly split into train, validation and test set using percentages 81-9-10%.The validation set is used to finding hyperparameters, and the test set is used for final evaluation.

Single cell evaluation sets
The following data sets were taken from 10x Genomics or chosen because of their use in the scVI paper [37].We used provided raw counts of cells that passed quality control filtering and did not apply any feature selection, modelling all transcripts available.Data sets were annotated using the CellTypist [51,52] python package.The resulting cell type labels were used in the DGD for automatic selection of the number of Gaussian components and in the clustering performance evaluation as an approximated ground truth.All data sets were split into 80 % training, 10 % validation and 10 % test data.The raw counts with cell type and data split assignment as observables were exported as Anndata [53] objects which were used to train and evaluate both scDGD and scVI.

PBMC (500)
The smallest of the data sets used for the elaborate analysis of our model's performance and applicability is another data set of PBMCs containing 587 samples and 36601 features from 10x Genomics [54].For cell type annotation, we used the 'Immune All Low' model as reference and majority voting in CellTypist.This resulted in the presence of 11 distinct cell types.

BMMC (2k)
The next data set contains 1985 bone marrow mononuclear cells (BMMCs) with 32738 features from 10x [55].Cell type annotations were approximated using CellTpyist with the 'Immune All Low' reference model and majority voting.This resulted in 15 distinct cell types.

Cortex (3k)
A data set of 3005 cells from mouse cortex and hippocampus with 19972 features was published in [56].Cell type annotations were approximated using CellTpyist with the 'Developing Mouse Brain' reference model and majority voting.This resulted in 14 distinct cell types.

Jejunum (5k)
The 10x data set of human jejunum [57] contained 4392 cells with 36601 features.Cell type annotations were approximated using CellTpyist with the 'Cells Intestinal Tract' reference model and majority voting.This resulted in 24 distinct cell types.

Mouse brain (5k)
This data from 10x is comprised of 7377 cells from the adult mouse brain [58] with 32285 features.Cell type annotations were approximated using CellTpyist with the 'Developing Mouse Brain' reference model and majority voting.This resulted in 7 distinct cell types.

Whole blood (8k)
From the blood of healthy females, a total of 8000 PBMCs, Neutrophils and Granulocytes [59] were extracted by 10x, containing 36601 features.Cell type annotations were approximated using CellTpyist with the 'Immune All Low' reference model and majority voting.This resulted in 10 distinct cell types.

Heart (10k)
The 7713 cells with 31053 features from 10x in this data were extracted from an E18 mouse [60].Cell type annotations were approximated using CellTpyist with the 'Immune All Low' reference model.This resulted in 3 distinct cell types.

PBMC (10k)
This data from 10x contains 11984 PBMCs from healthy female donors between the ages of 25 and 30, with 36601 transcripts covered [61].Cell type annotations were approximated using CellTpyist with the 'Immune All Low' reference model and majority voting.This resulted in 19 distinct cell types.

PBMC (20k)
Another data set from female donors aged 25-30 provided by 10x was used, containing a total of 23837 cells with 36601 features.Cell type annotations were approximated using CellTpyist with the 'Immune All Low' reference model and majority voting.This resulted in again 19 distinct cell types.

Hemato (25k)
After excluding the basal-bm1 library due to poor quality (as presented in [37] by recommendation of the authors), the data set of haematopoietic progenitor cells from mice [62] was comprised of 25050 cells and 28205 features.Cell type annotations were approximated using CellTpyist with the 'Cells Intestinal Tract' reference model and majority voting if cell types represented less than 100 cells.This resulted in 18 distinct cell types.

Mouse brain (1M)
The extremely large data set of 1306127 brain cells from two E18 mice was provided by 10x with samples from cortex, hippocampus and the subventricular zone [63].It features 27998 transcripts.Cell type annotations were approximated using CellTpyist with the 'Developing Mouse Brain' reference model and majority voting.This resulted in 27 distinct cell types.

The DGD
The DGD consists of three components.Representation Z is learned directly as a set of trainable parameters.For each of the N samples of X, there exists an m-dimensional vector z that receives gradients.The values are initialized with zero for training but can be initialized with any other tensor of shape (N, m).The decoder is characterized by the input dimension m and output dimension n, which represents the number of data features.The decoder can be any type of Neural Network that fulfils this constraint.The choice of reconstruction loss is equally flexible but should ideally represent P (x | z).The distribution over latent space, which makes this a generative model, is here presented by a GMM.The mixture model consists of K mixture components each with an m-dimensional mean vector µ µ µ, and diagonal covariance Σ and a mixture coefficient c for each component.The mixture coefficients are transformed into component weights w through softmax activation.Since the diagonal covariance matrix can only take positive values, the according parameter is learned as the negative log-covariance.
The priors over the GMM parameters φ are as follows.The prior on the mixture weights is an m-dimensional Dirichlet distribution with uniform parameters, α.The prior on mixture means is a mollified uniform which we call the "softball" distribution with hyperparameters scale as the radius of the m-ball and sharpness determining the slope of the boundary.Details on initialization and formulation of the log-probability can be found in Supplementary section "The softball prior".For the negative log-variance, we use a Gaussian prior N (−2 log(σ), 1) with the same mean and standard deviation for all dimensions.Mixture coefficients and negative log-variances are initialized with default values 1 (so that components weights are uniformly K −1 ) and −2 log(σ) with σ = 0.2 × scale × K −1 , respectively.The default values for α, scale, and sharpness are all 1.Component means are initialized by sampling from the softball prior.The distributional component of the loss is given as ) as the density of the kth multivariate Gaussian component for representation z i .The training procedure is described in Algorithm 1.Since representations are updated once per epoch and decoder and GMM parameters every batch pass, we use different instances of the optimizer for the different parameter sets.We have also observed that the GMM requires much larger steps than neural network parameters, so we choose to have an optimizer per parameter set with individually selected learning rates.

Algorithm 1 Training
Initialize parameters for representations z i , decoder and GMM for epoch in n epochs: for x i , i in training data: Optimizer step for model and GMM Optimizer step for Representation The inference of new data points is straight-forward as well.For each new data point, one or more new representations are initialized.The type of initialization can be chosen.It can be from zero as done for the training data or from component means.For the latter initialization technique, the best starting point for each new sample is derived from the minimum reconstruction loss from all K component means.This presents our default setting and essentially assigns the optimal component to each new sample.From there on, representations are typically optimized for a very short time (10 epochs) with a batch size of 32.Inference of new representations is of course done with frozen decoder and GMM parameters.

Fashion-MNIST DGD
For the image-generating DGD, we choose a network architecture based on convolutions and tricks from other image generation models and image segmentation techniques [64][65][66].The decoders start with two fully connected hidden layers fed with the latent representations.The first hidden layer has 100 units, the second capacity × 3 × 3 units.The capacity represents the minimum number of channels of the convolutional layers except for the gray-scale output channel and is set to 64.The output of the fully connected hidden layers is reshaped into (batch, 3, 3, capacity) and fed into a series of NN blocks.These blocks consist of a transposed convolutional layer, swish activation and a Squeeze and Excitation layer as applied by [66].Input-and output channels as well as kernel size, strike and padding depend on the position of the block in the scheme and can be seen in Supplementary Figure 3.There are four main blocks and two skip connection blocks.The outputs of combined blocks go through the activation function after summation.The series of these NN blocks is followed by a PixelCNN with 5 layers and mask size 5 [65] and a last simple transposed convolutional layer reducing the number of channels to 1.The output is scaled using sigmoid activation.
Latent dimensionality and convolutional capacity vary depending on the experiment.For the models with 10 and 20 Gaussian components, we use a latent dimension of 20.The standard deviation of the GMM components are calculated as scale components .This ensures that the components are sufficiently separated and alleviates the burden of another hyperparameter to optimize.The scale and hardness of the softball mean prior are 3 and 5, respectively, and the dirichlet alpha is set to 2 since we are dealing with balanced classes.For the models involving a standard Gaussian (including VAD and VAE), we test latent dimensionalities of 20, 50, and 100.Since the prior is not learned, softball and dirichlet prior are not relevant.The supplementaries contain results of a model with latent dimension 2, capacity 32 and varying number of Gaussian components (1, 10, and 20).
For training, we use the binary cross entropy (BCE) loss and Adam optimization with betas 0.5 and 0.7.For decoder, representation and GMM we choose learning rates 1e − 3, 1e − 2 and 1e − 1, respectively.This relationship has evolved as a rule of thumb for the DGD.Learning rates are best chosen with respect to a desired decoder learning rate (which is to be optimized for each task independently).From there, the representation learning rate should be ten times the decoder learning rate, and the GMM learning rate should be between 10 and 20 times that of the decoder.
Hyperparameters such as the number of hidden dimensions, capacity, dropout and the softball prior scale were found through optimization with respect to the validation reconstruction loss.The summary is available as a parallel coordinate plot from weights and biases runs in supplementary figure 1.

VAD and VAE
The VAD and VAE decoder implementations tested here are architecturally identical to the DGDs with a standard Gaussian.The VAE encoder was for simplicity chosen to mirror the decoder, with normal convolutional layers instead of the transposed convolutions.The capacity for all models is 32 and latent dimensions tested are 20, 50, and 100.Instead of the negative log density of the GMM, the loss for the distribution over latent space is given as the Kullback-Leibler divergence as typical for VI [3].

Supervised learning
In the unsupervised model, the log likelihood log P (z i | φ) of an individual representation z i being drawn from the parameterized distribution P (φ) is given by the log sum of the probability densities of z i per component times the corresponding component probabilities, which are given by the softmax of the weights.In the supervised setting, we can ignore all components except the one that has been assigned to the sample's class.We thus only have to calculate the probability density of z i for the given component multiplied with the component probability.The losses for all other components will be zero and they will thus not receive gradients.

Single-cell DGD (scDGD)
The architecture of scDGD is much simpler than that of the Fashion-MNIST DGD.The decoder consists of the input layer with units defined by the latent dimensionality, and 3 hidden layers of each 100 units, connected to the output layer of (in the case of the PBMC data set) 32728 units, representing all transcripts in the data.The number of output units is given by the number of genes with non-zero expression counts in the data.The latent dimensionality was empirically found best to be 20 in terms of validation reconstruction loss and clustering accuracy of the cell types.The depth of the network was also empirically determined.The parallel coordinate summary of the hyperparameter optimization can be found in supplementary figure 6.
The normalized expression counts are modelled with a Negative Binomial distribution.Normalization is achieved by dividing the true expression count with the samples largest count.We can therefore use Sigmoid activation on the output layer.The reconstruction loss is given as the probability density of the re-scaled expression count and a gene-specific, learned dispersion parameter.Since this parameter is positive definite, we learn its logarithmic counterpart.For training, we use Adam optimization with betas 0.5 and 0.7.For decoder, representation and GMM we choose learning rates 1e − 3, 1e − 2 and 1e − 2, respectively.Two models are trained with 9 and 18 Gaussian components, respectively for the PBMC data set.
The component means of the GMM are distributed according to the mollified Uniform, for which we set scale and sharpness to 1 each.This applies to both 9-component and 18component model.The standard deviation of the components is initialized with a mean of 0.02 for the 9-component model and 0.01 for the 18-component model, which roughly corresponds to the formula we have introduced in the previous section, but is modified to a fifth of that in order to improve the component separation.Dirichlet alphas are set to 2 and 1 for models with 9 and 18 components, respectively.This is motivated by a dirichlet alpha of 1 allowing for nonuniform component weights, which is desired in this case of imbalanced cell type classes.In the case of nine Gaussian components, a Dirichlet alpha of 1 resulted in even more components representing T cells, so we decided to distribute the components more uniformly by increasing alpha.
All hyperparameters described above, except for the number of GMM components and of course the output dimensionality, present the empirically determined default parameters of scDGD.We applied this default model to 11 more data sets, in which the output dimensionality was set as the number of transcript found in the data, and the number of Gaussian components was determined by the number of unique cell types identified by CellTypist.Models for all data sets except PBMC (20k) were trained for 800 epochs.PBMC (20k) was trained for 1000 epochs.In all scDGD models, the learning rate of the decoder was reduced to 1e − 4 after 500 epochs.

scVI
An scVI model was trained on our PBMC train set.The model is implemented in Python version 3.8 with the scvi-tools [37] package version 0.17.4.as described in [8] with one hidden layer of 128 hidden units in both encoder and decoder.We chose a latent dimension of 20 for comparability with our model.The dropout rate is set to 0.1.The model was trained for 1000 epochs.For the analysis and comparison to scDGD, the latent space is clustered with Kmeans clustering (k=9) and then evaluated by the ARI.The lower bound is computed on our held out test set.Latent representations and ELBO were computed using scVI's implemented functions.Negative log likelihoods and RMSEs were computed based on the models returned mean and dispersion parameters for the test set.
For the remaining data sets, the scVI models were again initialized with default parameters and trained for up to 400 epochs, which represents the upper bound of the default number of epochs (for large data sets, the automatically calculated number of epochs can be very low).

Clustering metric
We use the adjusted Rand index (ARI) [42] for evaluating and comparing clustering performance of our models.The value of the metric ranges from 0 to 1, representing random and identical clustering, respectively.This metric is relatively robust to different clustering approaches and diverging numbers of clusters, which makes it a good metric for comparing our method to others.
When analyzing models without GMM priors, the latent spaces are clustered using k-means [67] clustering implemented in Python's scikit-learn package.

Reconstruction metrics
For Fashion-MNIST, we use the binary cross-entropy (BCE) loss in order to evaluate the reconstruction of image pixels.
In scDGDs, we calculate the negative log likelihood as the summed log-density of parameterized Negative Binomials over all genes.For each gene, the log density is calculated based on the re-scaled model output and a gene-specific, learned, dispersion factor.

Image generation evaluation metric
For a quantitative evaluation of generated images, we employ the Fréchet Inception Distance (FID) [45].The FID gives the squared Wasserstein distance between two multivariate Gaussian distributions.We used the PyTorch implementation [68] based on the original publication [45].We compute the FID score three times for a given model, with random seeds 0, 21 and 87243.In each run, we randomly generate 10000 samples and compute the FID score with respect to the Fashion-MNIST test set.

Software, statistics and visualization methods
Python 3.8 is used as the programming basis for all methods.Neural Networks are implemented in Pytorch 1. 10 [69] and training progress was monitored and logged using [70].Dimensionality reduction is either done with our described model or common methods such as PCA or UMAP [1].If not stated otherwise, default parameters of 15 neighbors and a minimum distance of 0.1 are used in UMAP fits.Graph visualizations are achieved using the NetworkX package [71].Requirements can be found in the corresponding repository here.

Hardware
Models were trained on NVIDIA TITAN Xp (12GB), except for scVI on the mousebrain (1M) data set, which was trained on NVIDIA TITAN RTX (24GB).

Funding
AK is supported by grants from the Novo Nordisk Foundation NNF20OC0062606, NNF20OC0059939, and NNF20OC0063268.

Acknowledgments
We acknowledge all the great discussions on representation learning and generative models with Wouter Boomsma, Jes Frellsen, Søren Hauberg, Aasa Faragen, Ole Winther and other members of Center for Basic Machine Learning Research in Life Science (MLLS), as well as support from our Center of Health Data Science on discussions about the methods and singlecell data (especially Iñigo Prada Luengo) and for reviewing our manuscript (Jennifer Anne Bartell, Iñigo Prada Luengo and Thilde Terkelsen).We especially thank Sarah Teichmann, her lab and Emma Dann for the insight we have gained into single-cell sequencing data which was invaluable in the reviewing process.

Fig. 1
Fig. 1 The model.a) Graphical model and b) schematic of the deep generative decoder.The DGD consists of a decoder of any desired architecture with parameters θ mapping the latent representation Z to the data space X.The representation is modelled by a probability distribution with parameters φ.N is the number of samples.

Fig. 2
Fig. 2 Fashion-MNIST latent representations and samples.The dimension of all latent spaces included in this figure is 20.The number of components of the GMM are denoted as "c".a) Generated images of samples drawn from each component of the DGD with 20 Gaussian components.The number of the component drawn from is depicted on the x axis.b)-c) Latent representations visualized as UMAP dimensions 1 and 2 colored by data classes.The UMAP was computed with 50 neighbors and a minimum distance of 0.7.b) Latent representation from the model with 20 GMM components.Numbers in the plot show the positions of the corresponding component means, from which samples were drawn in a).c) Latent representation from the supervised DGD with 10 components.d) Test image reconstruction from DGD, VAD and VAE.The top row shows the first 20 original test images.The remaining 3 rows show test image reconstructions.Names of the corresponding models are depicted on the left.The indication of one component refers to a standard Gaussian.e) Randomly sampled images from DGD, VAD and VAE.Corresponding models are indicated by plot titles.

Fig. 3
Fig. 3 Latent spaces of scDGD.The latent spaces are shown colored by cell type (left) and type of latent point (right).Visualizations are achieved using UMAP with a spread of 5 and a minimum distance of 1. 'Representation' refers to learned representation of training data, 'samples' correspond to random samples drawn from the GMM and 'gmm means' represents the component means.The IDs of the component means are shown in black text on their corresponding coordinates.a) Latent space of scDGD with 9 Gaussian components.This model was trained for 700 epochs.b) Latent space of scDGD with 18 components, trained for 600 epochs.c)-e) Spatial relationships between GMM components of the 18-component scDGD.c) Heatmap of the GMM's component assignment to samples given as the percentage of each class with highest probability of a component.d) Graph visualization of the GMM component means with edge lengths correlated with Euclidean distances from a) and edge widths negatively correlated with Euclidean distances.Only the 95th percentile of distances were accepted as graph edges, resulting in a threshold of 0.14.Edge colors are grey or show components belonging to the same cell type (same colors as in 3b), except blue, referring to all CD4 cells minus memory T cells).

Fig. 4
Fig. 4 UMAP projection of CD34-positive cell latent representations learned by scDGD with 18 Gaussian components coloured by expression levels of different genes.a) Heatmap of Spearman correlation coefficients between GMM component probabilities and marker expression counts for all CD34positive samples.b) Representations (green) and component means (white numbers) in UMAP projection of all CD34-positive cells on black background.c) UMAP representation projection colored by expression counts for gene markers indicated by the plot titles.

Fig. 5
Fig. 5 Performance comparison of scVI and scDGD on independent data sets.Error bars indicate the standard error of the mean from three models trained with different random seeds.Asterisks indicate whether differences are significant based on an alpha value of 0.05.a) Reconstruction performance measured as the negative log likelihood (NLL) of the negative binomial distribution and the root mean squared errors (RMSE).b) Clustering performance measured by the adjusted Rand index (ARI).

Fig. 1
Fig. 1 Fashion-MNIST hyperparameter search.The parallel coordinate plot from the corresponding wandb [70] project shows combinations of hyperparameters (all coordinates of the plot except the last) and the resulting models' reconstruction performance on the validation set.Each model is represented by a line colored by the validation reconstruction loss (BCE).A total of 132 different models were tested.

Fig. 3
Fig. 3 Unsupervised and supervised learning of representation and GMM for Fashion-MNIST.DGDs with a 20-dimensional latent space are trained with 10 Gaussian components in a a) unsupervised and b) supervised manner.For each model, 6 reconstructed samples are shown (row-wise) for each of the 10 components, indicated by the component ID below.

Fig. 2
Fig. 2 Latent spaces for varying numbers of Gaussian components and latent dimensionalities trained on Fashion-MNIST.a) DGDs with a 2-dimensional latent space are trained with 1, 10 and 20 Gaussian components (GC).Latent points are colored by their type.This refers to whether they are learned representations, samples drawn from the GMM or component means.b) UMAP projections of 20-dimensional latent spaces with 1, 10 and 20 Gaussian components.The top row is colored by sample class, the bottom by latent point type as in a).

Fig. 6
Fig. 6 Single-cell DGD hyperparameter search.The parallel coordinate plot from the corresponding wandb [70] project shows combinations of hyperparameters (all coordinates of the plot except the last) and the resulting models' reconstruction performance on the validation set, as well as the clustering accuracy.Each model is represented by a line colored by the clustering accuracy with respect to the cell types of the data.A total of 67 different models were tested.

Fig. 7
Fig. 7 Euclidean distances between GMM component means of the 18-component scDGD.Heatmap of the Euclidean distances between GMM component means.