Deep learning in spatially resolved transcriptomics: a comprehensive technical view

Abstract

Spatially resolved transcriptomics (SRT) is a pioneering method for simultaneously studying morphological contexts and gene expression at single-cell precision. Data emerging from SRT are multifaceted, presenting researchers with intricate gene expression matrices, precise spatial details and comprehensive histology visuals. Such rich and intricate datasets, unfortunately, render many conventional methods like traditional machine learning and statistical models ineffective. The unique challenges posed by the specialized nature of SRT data have led the scientific community to explore more sophisticated analytical avenues. Recent trends indicate an increasing reliance on deep learning algorithms, especially in areas such as spatial clustering, identification of spatially variable genes and data alignment tasks. In this manuscript, we provide a rigorous critique of these advanced deep learning methodologies, probing into their merits, limitations and avenues for further refinement. Our in-depth analysis underscores that while the recent innovations in deep learning tailored for SRT have been promising, there remains a substantial potential for enhancement. A crucial area that demands attention is the development of models that can incorporate intricate biological nuances, such as phylogeny-aware processing or in-depth analysis of minuscule histology image segments. Furthermore, addressing challenges like the elimination of batch effects, perfecting data normalization techniques and countering the overdispersion and zero inflation patterns seen in gene expression is pivotal. To support the broader scientific community in their SRT endeavors, we have meticulously assembled a comprehensive directory of readily accessible SRT databases, hoping to serve as a foundation for future research initiatives.


Introduction
In multicellular organisms, tissues contain groups of diverse cells, each conducting a specific function and proliferating through division [1]. A cell's fate and behavior are shaped by communication with its surrounding environment. Awareness of a cell's position and how it is spatially organized within a tissue is critical for understanding not only the function of the tissue but also general mechanisms underlying disease. For example, it has been shown that cell-cell interactions between stromal and infiltrating immune cells play a major role in many physiological and pathological processes, such as autoimmunity and cancer [2]. Single-cell RNA sequencing (scRNA-seq) has become a powerful approach in genomics, capturing the activity of thousands of genes in a biological sample at unprecedented resolution. Currently, scRNA-seq has provided systematic benchmarks for dissecting heterogeneous cell populations across disciplines such as cancer, immunology, developmental biology, etc.
[3]. However, scRNA-seq requires tissue dissociation, leading to loss of cell position within the tissue, which is key to understanding the functionality of complex tissues. Spatially resolved transcriptomics (SRT), selected as Method of the Year in 2020 [4], has enabled researchers to capture gene expression along with corresponding spatial information across tissues [5]. SRT data are generated using different experimental protocols. Generally, SRT technologies can be broadly divided into two leading groups [6]: (I) image-based methods, with high spatial resolution and overall low sensitivity in gene detection, and (II) sequencing-based methods, with limited spatial resolution but high-throughput mRNA capture. In the first group, in-situ hybridization (ISH) methods enable the quantification of gene expression at sub-cellular resolution and the visualization of RNA molecules directly in their original environment. These methods include spatially resolved transcript amplicon readout mapping (STARmap) [7] and single-molecule fluorescent ISH (smFISH) [8]. smFISH was further developed into the sequential hybridization (seqFISH) [9], multiplexed error-robust FISH (MERFISH) [10], and ouroboros smFISH (osmFISH) [8] techniques, each measuring more mRNA species at higher resolution. Although current image-based methods can provide higher gene detection sensitivity than sequencing-based methods, their resolution is inversely related to the number of genes imaged, and they are typically limited to hundreds of preselected genes [11]. Sequencing-based methods depend on prior spatial barcoding to perform in situ capture of transcripts [12] followed by sequencing; examples include spatial transcriptomics (ST)/10x Visium, Slide-seq, and high-definition spatial transcriptomics (HDST). This category enables unbiased profiling of the complete
transcriptome [12]. Therefore, it can capture thousands of genes at specific locations, denoted as spots, at lower cellular resolution than image-based techniques. In sequencing-based methods, hematoxylin and eosin (H&E)-stained histology images provide complementary information about cellular morphology and tissue heterogeneity [13]. Figure 1 summarizes various SRT methods within the image-based and sequencing-based approaches. The histology images, spatial information, and gene expression data generated by SRT methods have added a new dimension to omics research, producing massive and diverse datasets [14]. Like other biological data, SRT data are high-dimensional and implicitly noisy [15], which increases the demand for statistical and machine learning (ML) methods to deal with these challenges. Statistical approaches primarily focus on inference, assuming a probability model on the input data to formalize a hypothesis about the system's behavior. In contrast, ML methods have a long-standing focus on prediction, in which learning algorithms extract robust and rich features from data; consequently, some methods belong squarely to one domain while others are common to both. However, as the number of features and the possible associations among them grow, statistical inferences become less precise [15]. Since SRT data can be categorized as wide data, in which the number of input variables exceeds the number of observations, ML approaches can be more robust and efficient than statistical methods. As a powerful ML approach, deep learning (DL) has proved its efficiency in many biological tasks (both supervised and unsupervised), particularly in various steps of scRNA-seq data analysis such as normalization [16], dimensionality reduction [17,18], clustering [19], cell-type identification [21], data integration [22,23,24] and
perturbation modeling [25,26,27]. However, SRT datasets are more challenging than scRNA-seq data due to their multi-modality and diversity. Conventional ML methods for SRT data analysis are largely rooted in the statistical-inference domain, requiring pre-existing knowledge about the data to estimate the unknown parameters of a model. In contrast, DL methods do not need to know the data-generating process to model the data and are more powerful at extracting complex, high-dimensional features. DL models are also more versatile for integrating histology images, gene expression matrices, and spatial information. Indeed, DL paradigms have facilitated the handling of such complicated datasets and the related downstream analyses. This paper undertakes a comprehensive review of recently developed ML and DL methods for analyzing SRT data (both imaging-based and sequencing-based techniques), focusing on how DL approaches jointly use histology images, gene expression, and spatial coordinates, and how spatial information can provide unprecedented insight into the molecular organization of heterogeneous cellular contexts. Some efforts have been made to review the computational challenges in the SRT domain. Hu et al. [28] focused on statistical and ML methods for analyzing SRT data, emphasizing the capacity of histology images, which applies to both imaging-based and sequencing-based techniques. Zeng et al.
[13] provided a summary of statistical and ML methods in the SRT domain, with more focus on sequencing-based methods. Despite the valuable information in these review papers, they did not provide a detailed discussion of the application of DL models in SRT analysis. Although Heydari and Sindi [29] reviewed the application of DL in SRT analysis, their work mainly focused on sequencing-based approaches. In this technical review, we identified papers published up to June 2022 that apply ML methods, focusing on DL models, to analyze both imaging- and sequencing-based SRT data. We summarize the concepts and common tasks in SRT data analysis into six categories and discuss the DL models and their associated findings in detail. Our study also offers complete information on current SRT datasets and evaluation metrics. We additionally provide the technical details of current DL models and the associated results in the supporting material. We envisage that this paper will serve as a comprehensive reference for further applications of DL in SRT data analysis and facilitate the development of innovative methods in the future.

Overview of common deep learning models for SRT data analysis
According to the existing tasks in SRT exploration, the reviewed papers use different supervised and unsupervised learning methods. For example, gene prediction and cell segmentation are supervised tasks, whereas clustering, imputation, and dimension reduction are unsupervised. For a better understanding of the reviewed methods, we first describe the DL models that have been used to analyze SRT data, their general mathematical formulation, and their training strategies. These models include deep neural networks (DNNs), autoencoders (AEs), variational autoencoders (VAEs), convolutional neural networks (CNNs), and graph convolutional networks (GCNs) [30]. We focus on the distinct attributes of each method, along with its uniqueness and novelty. Figure 2 gives a brief visualization of all surveyed DL models in this paper, along with available pre-processing approaches for gene expression matrices, spatial information, and histology images. The DL model and the pre-processing steps vary depending on the input data and the research objectives.

Deep Neural Networks
DNNs, also called feed-forward neural networks, are the quintessential DL model. These networks consist of multiple hidden (fully connected) layers, each of which includes several neurons. The number of layers determines the depth of the model. Given the input X, a DNN approximates a nonlinear transformation φ for a specific goal (e.g., classification). The neurons of each layer l take the outputs of the neurons in the previous layer l − 1 and feed them to an activation function f, i.e., a_i^(l) = f(Σ_j w_ij a_j^(l−1) + b_i), where w_ij is the weight connecting neuron j of the previous layer to neuron i, and b_i is the bias. The weights and biases of each layer are the parameters that are updated using a predefined optimization algorithm that minimizes a loss function.
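As a concrete illustration, the layer equation can be sketched in a few lines of NumPy; the layer sizes, random weights, and softmax output head below are arbitrary choices for illustration, not taken from any reviewed method:

```python
import numpy as np

def relu(x):
    """ReLU activation: element-wise max(0, x)."""
    return np.maximum(0.0, x)

def dense_layer(a_prev, W, b, activation=relu):
    """One fully connected layer: a = f(W @ a_prev + b)."""
    return activation(W @ a_prev + b)

# Toy two-layer DNN mapping a 4-dimensional input to 3 class scores.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden layer, 8 neurons
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # output layer, 3 classes

h = dense_layer(x, W1, b1)                                  # hidden activations
scores = dense_layer(h, W2, b2, activation=lambda z: z)     # linear output
probs = np.exp(scores) / np.exp(scores).sum()               # softmax over classes
```

In a real classifier, the loss (e.g., cross-entropy on `probs`) would be minimized by backpropagation with an optimizer such as SGD.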

Autoencoders
AE networks are unsupervised deep models, mostly used as a dimensionality reduction method. An AE has an encoder part E and a decoder part D, each with its own neural network architecture. The encoder takes the input data X and generates the latent variable Z by learning the network weights θ, i.e., Z = E_θ(X). The latent representation Z carries the most important information of the input, which can be leveraged in clustering and other downstream analyses. The decoder receives Z as input and reconstructs the input data by updating the decoder network weights φ during training, i.e., X̂ = D_φ(Z). All network parameters (θ and φ) are updated by minimizing the mean squared error (MSE) loss between the input data X and the reconstruction X̂. The goal of AE models is to learn a paired encoder-decoder that keeps the maximum amount of information when encoding and achieves the minimum reconstruction error when decoding.
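The encode-reconstruct-MSE pipeline can be sketched with a purely linear AE, whose optimal solution coincides with a truncated SVD (a trained nonlinear AE generalizes this); the toy "expression matrix" and latent size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix: 100 spots x 20 genes, lying near a 3-dim subspace.
Z_true = rng.normal(size=(100, 3))
X = Z_true @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(100, 20))

# For a linear AE, the optimal encoder/decoder pair can be read off the SVD;
# here we use it to illustrate the encode -> decode -> MSE pipeline.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W_enc = Vt[:3].T          # encoder weights: genes -> 3-dim latent space
W_dec = Vt[:3]            # decoder weights: latent space -> genes

Z = X @ W_enc             # latent representation (used for clustering etc.)
X_hat = Z @ W_dec         # reconstruction X̂
mse = np.mean((X - X_hat) ** 2)   # reconstruction loss being minimized
```

Because the data are nearly rank-3, the 3-dimensional latent code reconstructs X almost perfectly; a deep AE plays the same role for nonlinear structure.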

Variational Autoencoders
AEs encode an input as a single point, without imposing any regularity on the latent space, which deprives the decoder of generative properties. Variational autoencoders (VAEs) are deep generative models that instead encode the input as a distribution over the latent space. They consist of two parts: an inference model and a generative model. The inference model, also called the recognition model or encoder, takes the input data X and learns the latent features Z through multilayer deep neural networks. The inference process makes two assumptions: (1) the latent variable Z is sampled from a prior distribution p(Z) = N(Z; 0, I), and (2) the approximate posterior follows an assumed conditional distribution (e.g., a Gaussian with mean µ and variance σ²), q_θ(Z | X) = N(Z; µ, σ), where θ denotes the encoder network's weights. The decoder is a Bayesian network that accepts the variable Z and models the likelihood p_φ(X | Z), where φ denotes the decoder network's weights. The main aim of the VAE is to find the optimal distribution parameters, by updating θ and φ, that maximize the evidence lower bound (ELBO):

ELBO(θ, φ) = E_{q_θ(Z|X)}[log p_φ(X | Z)] − D_KL(q_θ(Z | X) ‖ p(Z)) ≤ log p_φ(X),   (1)

where log p_φ(X) is the marginal log-likelihood and D_KL is the Kullback-Leibler (KL) divergence between the approximate and true (posterior) distributions. Eq. 1 is optimized by stochastic gradient descent (SGD). The VAE can also be applied to graphs, giving rise to the variational graph autoencoder (VGAE). Besides this general formulation (Eq. 1), each paper proposed a specific data distribution and learning strategy.

Fig. 1: Schematic overview of two SRT approaches. A) Image-based methods, which comprise two categories. i. Fluorescent in situ hybridization (FISH): probes labelled with a set of fluorophores are individually hybridized to predefined RNA targets to visualize gene expression in fixed tissue. This approach was further developed into smFISH, which uses multiple shorter probes and provides quantitative measurements of transcripts. In 2014, smFISH was extended to sequential hybridization (seqFISH). To avoid the extensive imaging time of seqFISH, multiplexed error-robust FISH (MERFISH) was proposed, in which N rounds of fluorescence readouts encode each mRNA with a binary code, and the targeted transcript is identified by decoding. ii. In situ sequencing (ISS): RNA sequencing is performed directly on the RNA content while it remains in its tissue context, mainly using padlock probes to target the genes. For instance, STARmap is an ISS method; it uses six cycles of barcoded padlock probes and adds a second primer to target the site next to the padlock probe. B) Sequencing-based methods. This category provides an unbiased analysis of the complete transcriptome, capturing transcripts in situ and then performing sequencing. First, the tissue of interest is placed on top of a microscope slide containing a barcoded array that captures the spatial information associated with each probe; each probe carries a spatial barcode and RT primers. After removing the tissue, cDNA-mRNA complexes are extracted for library preparation and next-generation sequencing (NGS) readout. Experimentally, the measured gene expression is captured in spots or beads, complemented by a high-resolution histology image obtained by microscopy of the stained section of the same tissue. The dimensions of the probes (spots/beads) vary across technologies: spots can be 100 µm (ST) or 55 µm (10x Visium) in diameter, or an ordered bead array can be used with 2 µm beads (HDST) or 10 µm barcoded beads (Slide-seq).
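As a concrete example of one term in Eq. 1, when q_θ(Z | X) is a diagonal Gaussian and the prior is N(0, I), the KL term has the closed form 0.5 Σ (σ² + µ² − 1 − log σ²), which, together with the reparameterization trick, is how most VAE implementations compute it; the toy values below are illustrative:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) per sample.

    log_var = log(sigma^2), as usually predicted by the encoder for
    numerical stability.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, eps ~ N(0, I): the reparameterization trick
    that makes the sampling step differentiable for SGD."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([[0.0, 0.0], [1.0, -1.0]])
log_var = np.zeros((2, 2))            # sigma = 1 in every dimension
kl = kl_to_standard_normal(mu, log_var)
# First row matches the prior exactly, so its KL is 0; the second gives 1.
z = reparameterize(mu, log_var, np.random.default_rng(0))
```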

Convolutional Neural Networks
CNNs are well-known supervised methods that have proved their efficiency in many areas, especially image processing. Briefly, a CNN has multiple layer types, such as convolutional, pooling, normalization, dropout, and fully connected layers. A multidimensional array of data is usually fed as input for feature extraction. In the convolution layer, the convolution operation is applied to the input matrix X using the kernel W, i.e., S = X ∗ W, where S is often referred to as the feature map. Formally, convolution is commutative because the kernel is flipped relative to the input, although most DL libraries implement the closely related (unflipped) cross-correlation; either way, the kernel learns to respond to the most important local features (e.g., edges in an image) regardless of where they occur. The output is then passed to a non-linear activation function. Typically, the activation output is run through a pooling layer to reduce the spatial dimensions. A pooling operation such as max-pooling or average-pooling replaces the previous layer's output at a particular location with the maximum or mean of a rectangular neighborhood. The pooling layer mainly helps make the representation approximately invariant to small input translations, meaning that small changes in the input do not result in large changes in the output. Because of the available histology images, CNNs are used more than other DL models on SRT data for gene classification, gene prediction, and learning embedded features. The details of the CNN models used in the reviewed papers are discussed later.
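A minimal NumPy sketch of the two core operations: a "valid" convolution implemented as cross-correlation (as in most DL libraries) and non-overlapping max pooling; the edge-detector kernel and toy image are illustrative:

```python
import numpy as np

def conv2d(X, W):
    """Valid 2-D convolution, implemented (like most DL libraries) as
    cross-correlation, i.e. without flipping the kernel."""
    h, w = W.shape
    H, Wd = X.shape
    out = np.empty((H - h + 1, Wd - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + h, j:j + w] * W)
    return out

def max_pool2d(S, k=2):
    """Non-overlapping k x k max pooling (crops any remainder)."""
    H, W = S.shape
    S = S[:H - H % k, :W - W % k]
    return S.reshape(H // k, k, W // k, k).max(axis=(1, 3))

# A vertical-edge detector on a tiny image with a step from 0 to 1.
X = np.zeros((6, 6)); X[:, 3:] = 1.0
W = np.array([[1.0, -1.0]])           # responds where intensity changes
S = conv2d(X, W)                      # feature map: nonzero only at the edge
P = max_pool2d(np.abs(S), k=2)        # pooled, translation-tolerant response
```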

Graph Convolutional Networks
The idea behind GCNs is the same as for CNNs, but GCNs take graphs (irregular data) as input, and each kernel operates on a node and its adjacent nodes rather than on a rectangular neighborhood of pixels. Since the reviewed papers propose different DL architectures and various loss functions, we explain their topologies in the next section.
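A single GCN layer can be sketched with the widely used propagation rule of Kipf and Welling, H' = σ(D̂^{-1/2}(A + I)D̂^{-1/2} H W); the toy spot graph and random weights below are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU( D̂^-1/2 (A + I) D̂^-1/2 H W ).

    A: (n, n) adjacency matrix, H: (n, f_in) node features,
    W: (f_in, f_out) learnable weights.
    """
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)          # symmetric degree normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy spot graph: 4 spots on a line (0-1-2-3), 2 features per spot.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3))
H1 = gcn_layer(A, H, W)   # each spot now mixes its own and its neighbors' features
```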

Survey of deep learning models for SRT analysis
In this section, we review 21 DL methods that have been used to analyze SRT data. To structure the discussion, we divide the available SRT data analysis tasks into six sub-categories: i) identifying spatial domains, ii) identifying spatially variable genes, iii) imputing missing genes, iv) enhancing gene expression resolution, v) cell-cell interactions, and vi) cell-type decomposition. A brief landscape of each sub-category and the corresponding DL methods is shown in Figure 3. To facilitate cross-referencing, we tabulate the metrics and datasets used by each method in Supplementary Tables 1 and 2 of the supporting information, respectively, and summarize the reviewed papers in Table 1.

Identifying Spatial Domain
Identifying spatial domains, i.e., reconstructing tissue architecture, refers to finding spatially coherent spots that share gene expression and histology, and is considered a critical step in spatial transcriptomics analyses. Although many SRT platforms, such as Slide-seq [32], produce both tissue images and gene expression data, most SRT approaches characterize cell types using clustering methods that rely only on gene expression features (e.g., Seurat [33]). Many works utilized traditional ML approaches to incorporate tissue heterogeneity and spatial information when clustering spatial domains.
[34] developed an approach using a hidden Markov random field (HMRF) model, treating each spatial domain of cells as a set of nodes in an undirected graph and clustering each node using gene expression data. BayesSpace [35] used a Bayesian model with a Markov random field (MRF), assuming that the gene expression data follow a multivariate normal distribution whose latent features can be modeled with a spatial prior. Since the spatial prior does not use the exact location of spots, BayesSpace encourages neighboring spots to join the same cluster. Besides the high computational cost of BayesSpace and HMRF on high-resolution SRT data, MRF-based methods rely on a smoothness parameter that strongly determines the influence of neighboring spots; BayesSpace treats it as a fixed value. SC-MEB [36] is another ML approach that is computationally efficient and adaptively learns the smoothness parameter; it uses an HMRF based on empirical Bayes and applies expectation-maximization (EM) to predict the label of each spot or cell. The above ML approaches make assumptions about the input data and therefore require prior knowledge of the data-generating process. As we rarely control the experimental design, DL models can be used as an alternative, since they make minimal assumptions about the data-gathering process. Moreover, given the high dimensionality of SRT data, particularly from sequencing-based methods, DL models can be applied effectively where conventional ML methods are better suited to low-dimensional data. Accordingly, using DL methods for SRT data analysis has significantly improved spatial clustering. Several DL methods have been developed to incorporate spatial data and histology images into spatial domain identification, including SpaCell [38], stLearn [42], SpaGCN [45], SEDR [47], STAGATE [49], RESEPT [51], ECNN [52], JSTA [56], and conST
[58]. Figure 4 illustrates a summary of the DL models used to identify spatial domains in SRT data. Details about some DL models, as well as their results, are provided in the supporting material.

SpaCell.
Tan et al. [38] developed a DL model called SpaCell, which uses gene expression and tissue images. SpaCell was the first method to combine images and gene expression data for cell-type clustering. SpaCell pre-processes the image and the count matrix separately. For histology images, it divides the whole-slide image into small tiles, each resized to 299 × 299 pixels and containing one spot; further normalization, such as random rotation and Z-transformation, is then performed. Gene counts are likewise mapped to each spot. For cell-type clustering, ResNet50 [39] (a CNN) with weights pre-trained on the ImageNet [40] database is applied to each spot to obtain a latent embedding vector representing informative features of the image tile. SpaCell then feeds these features and the corresponding gene count vector into two AE networks and merges the resulting layers into a joint latent embedding, to which k-means clustering is applied. In the disease-stage classification model, ResNet50 is fitted on the same images as in the clustering step, and a DNN with two fully connected layers takes the pixel features and gene count matrix as input; the last layer outputs four probabilities, one for each of the four disease stages (see Supporting material for more information about the results). However, SpaCell has two disadvantages: it learns embeddings without using spatial information, and it relies on a model pre-trained on non-histology images, which can be uninformative.
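The late-fusion idea behind SpaCell (embed each modality, concatenate, then cluster) can be sketched as follows; note this is a simplified stand-in that replaces the two AEs with PCA embeddings and uses random toy data, not SpaCell's actual implementation:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def pca_embed(X, d):
    """Center X and project onto its top-d principal components
    (a linear stand-in for SpaCell's per-modality autoencoders)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

rng = np.random.default_rng(7)
img_feats = rng.normal(size=(120, 64))                    # toy per-spot CNN tile features
gene_counts = rng.poisson(4.0, size=(120, 200)).astype(float)  # toy count matrix

# Late fusion: embed each modality, concatenate, then cluster the spots.
joint = np.hstack([pca_embed(img_feats, 8), pca_embed(np.log1p(gene_counts), 8)])
labels = kmeans(joint, k=4)
```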

stLearn.
Inspired by SpaCell, Pham et al. [42] proposed stLearn to optimally integrate the three data types in SRT: gene expression measurements, spatial distance, and tissue morphology. The most distinctive aspect of stLearn is normalizing the gene expression matrix via the histology image: stLearn injects information about tissue heterogeneity into the gene expression matrix by normalizing it through the morphological distance of neighboring spots, where the latent morphological features are extracted by a CNN (ResNet50 pre-trained on ImageNet) from the H&E image. For each spot, stLearn identifies neighboring spots whose Euclidean distance in spatial coordinates is less than a predefined threshold; in other words, stLearn assumes that neighboring spots (determined by an arbitrary parameter) have similar gene expression and morphology. The method applies global and local unsupervised clustering in two stages on the normalized gene expression data using the SMEclust function. In the global stage, the authors apply PCA or uniform manifold approximation and projection (UMAP) [43] to the normalized data, construct a KNN graph based on Euclidean distance, and apply k-means clustering to the graph. In the local stage, stLearn finds neighboring spots within a minimum distance of 100 for each spot and identifies location-based core spots with more than min_samples neighbors. If a global cluster contains more than one location-based cluster obtained by the DBSCAN algorithm [44] (a density-based clustering algorithm), it is split into sub-clusters (see Supporting material for more information about the method and results). Like SpaCell, stLearn pre-trains its model on an unrelated dataset and also uses a fixed radius to identify neighboring spots.
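The morphological-distance normalization that stLearn describes can be sketched as a neighborhood smoothing: each spot's expression is replaced by a weighted average over spots within a radius, weighted by morphological similarity. The cosine-similarity weighting and all data below are our own illustrative assumptions, not stLearn's exact formula:

```python
import numpy as np

def smooth_expression(expr, coords, morph, radius):
    """Smooth each spot's expression over nearby, morphologically similar spots.

    expr:   (n_spots, n_genes) expression matrix
    coords: (n_spots, 2) spatial coordinates
    morph:  (n_spots, d) morphological features (e.g. from a pre-trained CNN)
    """
    morph_n = morph / np.linalg.norm(morph, axis=1, keepdims=True)
    out = np.empty_like(expr)
    for i in range(expr.shape[0]):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        nbrs = np.where(dist <= radius)[0]              # spots within the radius
        w = np.clip(morph_n[nbrs] @ morph_n[i], 0.0, None)  # cosine similarity
        w = w / w.sum() if w.sum() > 0 else np.ones(len(nbrs)) / len(nbrs)
        out[i] = w @ expr[nbrs]                         # weighted average
    return out

rng = np.random.default_rng(3)
expr = rng.poisson(5.0, size=(50, 10)).astype(float)    # toy counts
coords = rng.uniform(0, 100, size=(50, 2))              # toy spot positions
morph = rng.normal(size=(50, 16))                       # toy CNN features
smoothed = smooth_expression(expr, coords, morph, radius=20.0)
```

Because each output row is a convex combination of nearby spots, the smoothed values stay within the range of the original expression.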

SpaGCN.
Hu et al. [45] noted that, despite stLearn's efforts to perform cell-type clustering using three separate inputs (gene expression, spatial information, and histology image), stLearn could not explicitly link them. They therefore proposed SpaGCN, which constructs an undirected graph in which each node's feature is the gene expression of a spot, and each edge's weight is obtained from the spatial coordinates and histological features of the spots. Unlike stLearn, SpaGCN does not limit neighbors to a predefined radius; it considers all spots simultaneously by weighting them during gene expression aggregation. Given the spatial coordinates (x, y) of a spot, SpaGCN adds a new dimension z using the pixel coordinates and the variation of the RGB channels, without resorting to unsupervised feature learning. The graph is then built from the Euclidean distances between spots in these three dimensions. SpaGCN reduces the gene expression matrix to 50 dimensions via PCA and uses a GCN to link the edge weights and gene expression for node clustering. Next, SpaGCN applies an unsupervised clustering model [46], in which each cluster represents a spatial domain containing spots with highly correlated gene expression and histology (see Supporting material for more information about the method and results). Since SpaGCN uses the RGB channels to add the extra dimension, noisy images may lead to inaccurate results.
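The graph construction described above can be sketched as follows; the variance-weighted colour summary, the rescaling to the spatial axes, and the Gaussian kernel bandwidth are illustrative simplifications of SpaGCN's actual scheme:

```python
import numpy as np

def spagcn_like_graph(coords, rgb, scale=1.0, l=1.0):
    """Build a weighted spot graph from (x, y) coordinates plus histology colour.

    A third coordinate z is derived from each spot's RGB values (weighted by
    each channel's variance, so more informative channels count more) and
    rescaled to the range of the spatial axes; edge weights then decay with
    3-D Euclidean distance via a Gaussian kernel.
    """
    var = rgb.var(axis=0) + 1e-12
    z = (rgb * var).sum(axis=1) / var.sum()        # variance-weighted mean colour
    z = (z - z.mean()) / (z.std() + 1e-12)         # standardize
    z = z * scale * coords.std(axis=0).max()       # match the spatial scale
    xyz = np.column_stack([coords, z])
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    W = np.exp(-(d ** 2) / (2 * l ** 2))           # Gaussian edge weights
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(4)
coords = rng.uniform(0, 10, size=(30, 2))          # toy spot positions
rgb = rng.uniform(0, 255, size=(30, 3))            # toy mean colour per spot
W = spagcn_like_graph(coords, rgb, scale=1.0, l=2.0)
```

A GCN layer applied on `W` would then aggregate expression across all spots, down-weighting those that are far away in space or dissimilar in histology.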

SEDR.
Fu et al. [47] argued that SpaGCN combines histology images with spatial information in an oversimplified way and that more evidence is needed to prove its rationality. Hence, they presented SEDR, which learns a low-dimensional latent representation of gene expression with an AE (two fully connected layers) and then embeds the spatial data with a VGAE (parameterized by a two-layer GCN). For each spot, SEDR considers the ten nearest spots as neighbors. The embeddings obtained from the AE and VGAE are concatenated into the final latent representation, to which an unsupervised clustering method is applied to obtain spatial clusters (see Supporting material for more information about the method and results). While SEDR may have justified excluding histology images from its analysis, both SpaCell and stLearn have demonstrated the valuable insights that histology images can provide, particularly regarding tissue heterogeneity.

STAGATE.
Despite SEDR's recent use of both data types, Dong et al. [49] argued that relying on a predefined similarity measure between neighboring spots needs to be corrected. They proposed a graph attention autoencoder framework, STAGATE, to precisely detect spatial domains in SRT data by incorporating spatial information and gene expression profiles. In the preprocessing step, the authors remove the area outside the tissue and use log-transformed raw gene expression as the input. STAGATE's novelty mainly lies in constructing a spatial neighbor network through two adaptive approaches: the first is a standard adjacency matrix built from the spatial data and a predefined radius parameter, and the second is obtained through a graph attention mechanism (GAT) and a pre-clustered gene expression matrix. These two modules can be adaptively selected as the input of the graph attention layer. STAGATE sets the encoder as two neural network layers, where the first layer feeds the attention layer. It then applies the mclust [50] and Louvain clustering algorithms to the learned features for labeled and unlabeled data, respectively (see Supporting material for more information about the method and results). Despite STAGATE's promising results, it still uses a predefined radius parameter to identify neighboring spots.
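The attention mechanism that lets STAGATE weight neighbors adaptively can be sketched with a single GAT-style head; the scoring function and toy inputs below follow the generic GAT formulation, not STAGATE's exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(h_i, h_nbrs, W, a):
    """Attention of spot i over its neighbors (single GAT-style head).

    e_ij = LeakyReLU( a^T [W h_i || W h_j] ), normalized with a softmax so a
    spot learns how much each neighbor's expression should contribute.
    """
    Whi = W @ h_i
    scores = []
    for h_j in h_nbrs:
        cat = np.concatenate([Whi, W @ h_j])
        e = cat @ a
        scores.append(np.where(e > 0, e, 0.2 * e))   # LeakyReLU, slope 0.2
    return softmax(np.array(scores, dtype=float))

rng = np.random.default_rng(5)
h_i = rng.normal(size=6)                         # spot i's expression features
h_nbrs = [rng.normal(size=6) for _ in range(4)]  # four neighboring spots
W = rng.normal(size=(8, 6))                      # shared linear projection
a = rng.normal(size=16)                          # attention vector
alpha = attention_weights(h_i, h_nbrs, W, a)     # one weight per neighbor
```

The aggregated representation of spot i would then be a weighted sum of the projected neighbor features, with weights `alpha` learned rather than fixed by a similarity measure.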

RESEPT.
Although SpaGCN [45] and stLearn [42] provided helpful information on spatial domain identification, Chang et al. [51] argued that these methods do not reveal the intrinsic tissue architecture because of their limited use of spatial information. Alternatively, they developed RESEPT [51], a DL approach for reconstructing, visualizing, and segmenting an RGB image from spatial transcriptomics data. RESEPT consists of two main steps: (i) converting gene expression or RNA velocity data into an RGB image, preserving the topological relationship between spots by combining spatial expression and spatial coordinates; and (ii) segmenting this image to identify spatial domain boundaries.
For step (i), RESEPT uses a graph autoencoder with a GCN encoder to map the input graph into a three-dimensional latent space representing the RGB channels; the input graph connects each spot to its six nearest neighbors in Euclidean space, based on the spatial information. The segmentation stage (ii) comprises a backbone network (ResNet101), an encoder module (atrous convolutional layers), and a decoder module. The backbone network provides rich visual features and is pre-trained on the Cityscapes dataset.
The encoder module extracts multi-scale contextual information from the backbone network, while the decoder module recovers segment boundaries and spatial domains. The weights in RESEPT are optimized by minimizing the cross-entropy loss between the segmented image and the ground truth (see Supporting material for more information about the results). However, this method is limited by the fixed number of neighbors in the adjacency matrix, which may prevent the graph from capturing more instructive features and biases the results toward the model's parameters.
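Step (i), rendering a three-dimensional embedding as an RGB image over the spot layout, can be sketched as follows; the grid size and min-max scaling are our own illustrative choices, not RESEPT's exact procedure:

```python
import numpy as np

def embedding_to_rgb_image(coords, latent, grid=32):
    """Render per-spot 3-D embeddings as an RGB image (RESEPT-style idea).

    Each latent dimension is min-max scaled to [0, 255] and written to the
    pixel nearest the spot's coordinates; untouched pixels stay black.
    """
    lo, hi = latent.min(axis=0), latent.max(axis=0)
    rgb = np.round(255 * (latent - lo) / (hi - lo + 1e-12)).astype(np.uint8)
    img = np.zeros((grid, grid, 3), dtype=np.uint8)
    # map spot coordinates onto the pixel grid
    c = (coords - coords.min(axis=0)) / (np.ptp(coords, axis=0) + 1e-12)
    ij = np.minimum((c * grid).astype(int), grid - 1)
    img[ij[:, 1], ij[:, 0]] = rgb
    return img

rng = np.random.default_rng(6)
coords = rng.uniform(0, 100, size=(200, 2))   # toy spot positions
latent = rng.normal(size=(200, 3))            # e.g. from a graph autoencoder
img = embedding_to_rgb_image(coords, latent)
```

The resulting image can then be passed to any off-the-shelf semantic segmentation network, which is the role the ResNet101-based pipeline plays in RESEPT.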

ECNN.
Chelebian et al. [52] investigated recent DL attempts to process histology images and found a need for comprehensive methods that connect features extracted from histology images with transcriptomic signatures. They also noted that such models need large amounts of related data for training, whereas current methods such as stLearn pre-train on unrelated data such as ImageNet, which can negatively affect accuracy. They therefore developed a modified version of an ensemble CNN [53], previously trained on prostate needle biopsies, to identify sub-regions with meaningful genetic properties in prostate histology images. As the authors of [52] did not name their model, we refer to it as ECNN throughout this paper. ECNN uses an ensemble of 30 Inception V3 [54] models. The input images are patches of 598 × 598 pixels. ECNN applies the UMAP dimensionality reduction algorithm to the latent feature vector of each model (2048 × 30 dimensions), reducing it to 30 × 10 descriptors per patch. The approach performs unsupervised clustering with a Gaussian mixture model on the ten latent features to show that the extracted features produce biologically relevant clusters consistent with manual annotation. The work also reduces the feature dimensions from ten to three, treated as RGB channels, to visualize the captured heterogeneity as a color map over the original section. A relative mean intensity (RMI) matrix built from the measured gene expression factor signatures in each region shows that the obtained clusters are genetically relevant, appropriately representing different gene expression factor signatures (see Supporting material for more information about the results). Although ECNN is pre-trained on a related dataset, it is blind to the spatial information in SRT data.
3.1.8 JSTA.
However, these methods have no prior information about the cell type from which RNA molecules are captured. Therefore, a comprehensive segmentation algorithm is needed to assign genes to the related cells. Previously, watershed-based algorithms [?] and newer ML approaches were developed to segment images into cells. However, these methods are hampered by the need for expert-labeled segmentations and by the low quality of the training data. Littman et al. accounted for these limitations and proposed JSTA [56], a joint computational DL-based expectation-maximization (EM) approach that enhances the segmentation of RNA hybridization images, specifically at cellular boundaries. JSTA receives two inputs, the gene expression levels of cells and of pixels, described by two matrices E_c and E_p. First, it applies the watershed algorithm to E_p to obtain an initial segmentation. Then, a three-layer DNN is applied to E_c to assign each gene to the most likely cell type. JSTA trains another DNN on E_p to obtain the cell-type probability of each pixel. The two training pipelines constitute the E-step of the EM algorithm, which estimates the cell-type distribution over pixels and gene expression.
The trained pixel classifier is applied to the border pixels (determined based on specific criteria) to reclassify them to the cell type with the higher probability. Once JSTA has re-assigned the pixels at the cell borders, it updates the image segmentation and the cell classifier based on the new assignments; this update constitutes the M-step, the optimization step. The entire process is iterated until convergence (a stable point at the end of the learning process). Both classification models use the cross-entropy loss function, which minimizes the error between the predicted cell type and the ground truth (see Supporting material for more information about the results). However, this method lacks generalizability, since it can be applied only to RNA hybridization-based methods.
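The alternating scheme can be caricatured in a few lines: a toy pixel classifier (class-mean distances standing in for JSTA's DNNs) repeatedly reclassifies border pixels, and the segmentation is updated from the new assignments. The 1-D "tissue", the data, and the nearest-mean classifier are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D "tissue": 60 pixels from two cells whose true border lies at index 30.
pixel_expr = np.vstack([rng.normal(0, 0.5, (30, 5)), rng.normal(3, 0.5, (30, 5))])
labels = np.zeros(60, dtype=int)
labels[40:] = 1                                  # crude initial "watershed" border

for _ in range(25):                              # EM-style alternation
    # E-step stand-in: class means act as a trivial per-pixel classifier.
    means = np.array([pixel_expr[labels == t].mean(axis=0) for t in (0, 1)])
    # Reclassify only border pixels (pixels whose neighbor has a different label).
    border = np.flatnonzero(labels[:-1] != labels[1:])
    for i in np.concatenate([border, border + 1]):
        dists = ((pixel_expr[i] - means) ** 2).sum(axis=1)
        labels[i] = int(np.argmin(dists))        # M-step stand-in: update assignment

print(labels[:30].sum(), labels[30:].sum())      # 0 30: border recovered at index 30
```

Each pass nudges the misplaced border by reclassifying only the pixels adjacent to it, which mirrors the paper's focus on refining cellular boundaries rather than re-segmenting the whole image.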
3.1.9 conST.

Figure 5 shows the overall view of distinguishing SVGs with the DL models reviewed in this paper.

SpaGCN.
Hu et al. [45] proposed SpaGCN to detect SVGs in each obtained cluster (refer to the previous section).
As the second pipeline, SpaGCN identifies SVGs which are highly expressed in each spatial domain.
Considering the spatial domains obtained by clustering, SpaGCN first finds the neighboring domains (clusters) of a targeted domain based on a defined criterion (e.g., distance) and selects SVGs with the Wilcoxon rank-sum test. The authors additionally noted that some genes may be expressed in multiple but scattered domains; they call these metagenes, which are still helpful for understanding the spatial variation of gene expression. Briefly, SpaGCN first identifies genes with a weaker expression level than the SVGs in the target domain by lowering the threshold values of the previous pipeline. SpaGCN then randomly selects genes with a corresponding mean expression value as base genes, and identifies genes with higher expression values in the target and non-target domains as positive and negative genes, respectively. SpaGCN next adds the positive genes to, and subtracts the negative genes from, the base genes to detect metagenes uniquely expressed in the target domain. The entire process iterates over all genes in the target domain. Moreover, SpaGCN offers sub-clustering of the obtained spatial domains, using information from neighboring spots to characterize heterogeneity across a spatial domain. The performance of SpaGCN in SVG detection is measured with the Moran's I [63] statistic, and SpaGCN achieved more coherent SVGs with better biological interpretability than SPARK [64] and SpatialDE (statistical methods). However, SpaGCN identifies SVGs as a downstream task: the deep model is trained for clustering rather than designed for SVG detection, so the obtained marker genes are not necessarily associated with tissue heterogeneity.
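Moran's I, the metric used to score the detected SVGs, is simple to compute directly. The sketch below uses an invented chain of six spots to show that a spatially coherent gene scores higher than a scattered one.

```python
import numpy as np

def morans_i(x, W):
    """Moran's I: spatial autocorrelation of values x under weight matrix W."""
    z = x - x.mean()
    return len(x) / W.sum() * (z @ W @ z) / (z @ z)

# Toy example: 6 spots on a line with chain adjacency (each spot touches its neighbors).
W = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
smooth = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])   # spatially coherent gene
noisy = np.array([1.0, 5.0, 1.0, 5.0, 1.0, 5.0])    # spatially scattered gene
print(morans_i(smooth, W) > morans_i(noisy, W))      # True: coherence scores higher
```

Values near +1 indicate that neighboring spots share similar expression (a candidate SVG), values near -1 indicate alternating patterns, and values near 0 indicate spatial randomness.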

conST.
Zong et al. [58] detected spatial marker genes on the obtained clusters (refer to the previous section), treating this as a downstream task that demonstrates the accuracy of the clustering results. conST applies an approach similar to SpaGCN's to the latent embedding obtained from the main algorithm (refer to the conST method in the spatial domain identification section).
The authors compared the marker genes obtained by conST with those of SpaGCN in terms of Moran's I on the spatialLIBD [65] dataset; conST achieved better performance, particularly at the boundary of the white matter layer.

STAGATE.
STAGATE is another deep model that identifies SVGs in spatial domains, although it was not trained for this task; as with SpaGCN, the obtained SVGs do not necessarily correlate with morphology. STAGATE applies the Wilcoxon test implemented in the SCANPY package to identify spatially variable genes for each spatial domain. STAGATE could also detect SVGs in the Slide-seqV2 dataset from mouse olfactory bulb tissue, and it detects more genes in small tissue structures than the SPARK-X algorithm.

CoSTA.
CoSTA is a cluster-based method that avoids hierarchical clustering and does not treat pixels as independent features. Earlier approaches construct an expression matrix for gene expression analysis and employ PCA as a pre-processing step, which accounts for pixel-level correlations instead of preserving the spatial relationships between neighboring cells [67]. Xu et al. [68] claimed that most spatial patterns are lost by these approaches. They therefore proposed CoSTA, which uses an unsupervised convolutional neural network to learn spatial relationships between genes from the positions of neighboring pixels in spatial transcriptomic images.
In the pre-processing step, every 100 pixels are binned into one pixel and the images are resized to 48×48. After binning, CoSTA normalizes the gene matrices as described in [62] and scales them between 0 and 1 by dividing each gene matrix by the maximum value of its 48×48 matrix. CoSTA consists of two steps: clustering and neural network training. In the first step, CoSTA passes the normalized images through a CNN (ConvNet) consisting of three convolution blocks (each containing a convolution, batch normalization, and max-pooling layer). To cluster the features, the method applies L2-normalization and then UMAP (an unsupervised dimension-reduction method) to the feature vector. The purpose of clustering is to generate labels for training ConvNet; once the labels are obtained, ConvNet is trained on them in the second step. However, this method requires extensive parameter tuning and was only assessed on high-resolution SRT data.
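CoSTA's pre-processing, binning 10×10 pixel blocks (100 pixels into one) and dividing by the matrix maximum, can be sketched as follows; the 480×480 input size and Poisson counts are invented for illustration.

```python
import numpy as np

def bin_and_scale(gene_image, factor=10):
    """Bin factor x factor pixel blocks into one pixel, then scale to [0, 1]."""
    h, w = gene_image.shape
    binned = gene_image.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))
    return binned / binned.max()                # divide by the matrix maximum

rng = np.random.default_rng(0)
raw = rng.poisson(2.0, size=(480, 480)).astype(float)   # toy per-gene count image
img = bin_and_scale(raw)                        # 100 pixels -> 1; 48x48 output
print(img.shape, img.max())                     # (48, 48) 1.0
```

The reshape trick sums each non-overlapping 10×10 block in one vectorized step, which is how binning is usually implemented in numpy.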
3.2.5 ST-Net.
The UMAP visualization showed that the latent feature vectors can distinguish between tumor and non-tumor spots, with potential applications in clustering and cell-type composition (see Supporting material for more information about the results). Despite its robustness and generalizability, this study does not fully exploit the spatial information available in the data.

SPADE.
The combination of image and spatial gene expression data provides complementary information about morphological patterns in tissue. Bae et al. [72] used morphological heterogeneity to detect SVGs, treating each gene as a dependent variable in the model. They proposed SPADE, a convolutional neural network model that extracts features from image patches around each spot and combines them with gene expression data to obtain spatial marker genes.
In contrast with ST-Net, which utilizes tissue images to predict marker genes, SPADE exploits the relationship between morphological and gene features. In the first step, SPADE extracts features from patches with VGG-16, whose weights were pre-trained on the ImageNet dataset. The last layer of VGG-16 outputs a 512-dimensional vector, whose dimension is reduced by PCA; the number of principal components (PCs) varies with the input dataset. After normalizing the genes in each spot, the authors used Limma [73] to discover SVGs, fitting a linear regression model of the normalized genes against the PCs of the latent image features. The linear model ranks the associated genes for each PC by regression coefficient (RC) or corrected P-value. Ultimately, genes are selected as spatial marker genes if they show a false discovery rate (FDR) below 0.05 for PCs that explain more than 2% of the variance in the 512-dimensional image features. SPADE identified marker genes positively associated with the endoplasmic reticulum, synapse organization, and cell adhesion molecule binding in the human breast cancer, olfactory bulb, and prostate cancer datasets, respectively. Nonetheless, the obtained marker genes can be significantly affected by spot density and the distance between spots.
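SPADE's ranking step boils down to regressing each normalized gene on the image-feature PCs and ranking by coefficient. The sketch below uses ordinary least squares on synthetic data; all sizes and the least-squares shortcut are our assumptions (SPADE itself uses Limma with FDR correction).

```python
import numpy as np

rng = np.random.default_rng(0)
n_spots, n_pcs = 100, 5
pcs = rng.normal(size=(n_spots, n_pcs))          # stand-in for PCs of VGG-16 features

# Two toy genes: one driven by the first PC (morphology-linked), one pure noise.
gene_linked = 2.0 * pcs[:, 0] + rng.normal(0, 0.1, n_spots)
gene_noise = rng.normal(size=n_spots)

def pc_coefficients(gene, pcs):
    """Least-squares fit of a gene against image-feature PCs; returns |coefficients|."""
    X = np.column_stack([np.ones(len(gene)), pcs])   # intercept + PCs
    beta, *_ = np.linalg.lstsq(X, gene, rcond=None)
    return np.abs(beta[1:])

print(pc_coefficients(gene_linked, pcs)[0] > pc_coefficients(gene_noise, pcs).max())
# True: the morphology-linked gene dominates the ranking
```

Genes whose coefficients on a variance-rich PC survive multiple-testing correction would then be reported as spatial marker genes.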

HisToGene.
ST-Net showed that tumour-related genes are highly correlated with the histology images; however, that method did not use spatial location information in its CNN-based model. Pang et al. [74] observed that despite the performance of CNNs in image processing tasks, they suffer from an intrinsic bias with respect to SRT patch position, which reduces their effectiveness on SRT data. Consequently, they used an autoencoder model with an attention-based mechanism [75], called HisToGene [74], to predict gene expression values by embedding spatial locations and histology images. The pre-processing comprised the following steps: removing low-expression genes that appeared in fewer than 1000 spots, normalizing the UMI count of each gene by dividing it by the total UMI count across all genes in the spot and multiplying by 1,000,000, and transforming to a natural log scale. HisToGene extracts patches from the histology images and transforms them into a new matrix. The authors noted that, as in natural language processing (NLP) where sentences vary in length, the number of spots within a tissue also varies, making it unsuitable to split tissues into a fixed number of patches. They therefore modified the encoder, encoding the new image matrix and the spatial coordinates through a single-layer encoder. The final embedding matrix, obtained by summing the encoding matrices, is the input to the multi-head attention layers (see [75] for more details about attention models). The multi-head attention module consists of eight multi-head attention layers with 16 attention heads, automatically learning gene expression from the sequenced patches (see Supporting material for more information about the results).
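The core operation of those attention layers, in which every spot attends to every other spot regardless of how many spots the tissue contains, can be sketched for a single head; the spot count, embedding size, and random weights below are toy stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each spot attends to every other spot."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_spots, d = 7, 16                      # variable spot count, like sentences in NLP
X = rng.normal(size=(n_spots, d))       # stand-in for patch + coordinate embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                        # (7, 16): one updated embedding per spot
```

Because the attention weights are computed from the inputs themselves, the same layer handles 7 spots or 7000, which is exactly why the authors preferred it over a fixed-grid CNN.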
The authors compared HisToGene with ST-Net; HisToGene consistently outperformed ST-Net in correlation, but was assessed only on high-resolution SRT data.

CNNTL.
Abed-Esfahani et al. [76] noted that since each image in the ISH method represents a specific gene and only a limited number of images exist per gene, alternatives to classification-based approaches may be more suitable. They therefore proposed a CNN-based method using a contrastive loss to re-identify gene expression and embed gene expression patterns from the human brain. As they did not name the proposed model, we call it CNNTL (CNN with triplet loss) for ease of reference in this study. In the pre-processing step, CNNTL imports a U-Net model with a ResNet34 backbone to separate grey and white matter from the background in the images. Each segmented image is tiled into patches containing at least 90% foreground. In the training step, CNNTL leverages the triplet loss: the loss function ensures that the learned embedding of an input patch (positive image) is closer to that of another patch from the same class (anchor image) than to that of a patch belonging to another class (negative image). The embeddings of the three input images are obtained from three ResNet50 models (with shared weights) pre-trained on ImageNet [40], followed by two fully connected layers of dimensions 1024 and 128, respectively. CNNTL was tested on the Cortex dataset obtained from 42 donors. The metric is image-level rank-1 accuracy, the proportion of images for which the closest image by Euclidean distance between embeddings belongs to the same gene. CNNTL achieved a rank-1 accuracy of 38.3%, outperforming a single ResNet and random models (see Supporting material for more information about the results). Despite its novelty, CNNTL is limited to a small fraction of genes in brain layers and was pre-trained on a dataset unrelated to SRT data.
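The triplet loss itself is a one-line hinge. A minimal sketch with hand-picked 2-D embeddings (CNNTL's actual embeddings are 128-dimensional ResNet50 outputs):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: pull same-gene patches together, push other-gene patches apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])       # embedding of a patch from the same gene
n = np.array([3.0, 4.0])       # embedding of a patch from another gene
print(triplet_loss(a, p, n))   # 0.0: already separated by more than the margin
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training only pushes on triplets that are still confusable.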

DeepSpaCE.
Monjo et al. [77] focused on in situ capturing SRT technology because of its significant impact in oncology. They developed a convolutional neural network model named DeepSpaCE to predict gene expression. DeepSpaCE is a VGG16 network that takes spot images as input and predicts the expression of 24 genes, including breast cancer marker genes and breast cancer-related micro-environment marker genes. DeepSpaCE was tested on a human breast cancer dataset and obtained a correlation coefficient of 0.588 between the measured and predicted values (see Supporting material for more information about the results). However, the model can predict only a limited number of genes.

Imputing Missing Genes
Single-cell RNA sequencing (scRNA-seq) is a sophisticated technique that provides an unrivaled expression profile of a considerable number of genes across tissues at the resolution of an individual cell [78]. However, these methods require tissue dissociation, which destroys the spatial context that can be crucial to understanding cellular attributes [79]. In contrast, SRT data capture cell location but are limited by the resolution of the SRT technology. Recent papers have proposed computational approaches that integrate scRNA-seq data and spatial transcriptomics to predict unmeasured genes and impute gene expression in spatial data. Many machine learning methods employ joint dimension reduction and joint embedding projection to integrate the scRNA-seq and spatial transcriptomics data, followed by the K-Nearest Neighbor (K-NN) algorithm to predict the missing (unmeasured) genes in the spatial transcriptomics data. LIGER [89] and SpaGE [88] are two ML methods that use joint dimension-reduction approaches (NMF and PCA, respectively) and then learn a joint embedding with linear models. Lopez et al. [80] demonstrated that a portion of genes can be found in both scRNA-seq and spatial transcriptomics data, so domain adaptation methods can be a solution; they proposed gimVI, a joint non-linear model based on deep generative models. Shengquan et al. [11] critiqued recent approaches that utilize only the genes shared between both datasets and that employ unsuitable evaluation metrics such as the Spearman correlation coefficient. This metric can be misleading as an indicator of a method's actual performance: even when Spearman correlation coefficients are low, visual exploration may reveal improved patterns for known genes [80]. They therefore developed an AE model called stPlus [11] to predict gene expression using the learned embedding and the k-NN algorithm. stPlus applies Louvain clustering to the predicted genes and was compared with current ML methods using four clustering metrics, AMI, ARI, Homo, and NMI (see Supplementary Table 1); it outperformed SpaGE, Seurat, LIGER, and gimVI in all four metrics. In the following, we investigate three DL models in detail. Figure 6 represents the process of gene imputation along with cell-type decomposition (refer to the next section) in SRT data.
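The shared recipe, embedding both datasets on their common genes and predicting each spot's unmeasured genes from its k nearest scRNA-seq cells, can be sketched as follows. This is plain Euclidean k-NN on toy data; the real methods operate in a learned joint embedding rather than raw expression space.

```python
import numpy as np

def impute_knn(spatial_shared, sc_shared, sc_unmeasured, k=3):
    """Predict unmeasured genes per spot as the mean over its k nearest sc cells."""
    d = ((spatial_shared[:, None, :] - sc_shared[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d, axis=1)[:, :k]        # k nearest cells per spot
    return sc_unmeasured[nearest].mean(axis=1)

rng = np.random.default_rng(0)
sc_shared = rng.normal(size=(50, 20))             # 50 cells x 20 shared genes
sc_unmeasured = rng.normal(size=(50, 5))          # 5 genes absent from the SRT data
spatial_shared = sc_shared[:4] + rng.normal(0, 0.01, (4, 20))  # 4 toy spots
imputed = impute_knn(spatial_shared, sc_shared, sc_unmeasured)
print(imputed.shape)                              # (4, 5): 5 imputed genes per spot
```

The choice of k and of the embedding space is exactly where methods such as SpaGE, stPlus, and gimVI differ.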

gimVI.
gimVI specifies a binary variable s_n for each cell, denoting whether the cell was captured by the scRNA-seq or the SRT experiment, and models the gene expression matrix with either an NB or a zero-inflated NB (ZINB) distribution. The generative model is a VAE that takes the input cells and s_n; the encoder outputs a latent vector z_n describing cell n. gimVI then measures the probability of each gene g in an individual cell from the decoder. Finally, missing genes are imputed by applying the K-NN algorithm to the obtained latent space. Lopez et al. evaluated gimVI on two paired scRNA-seq/SRT datasets. They calculated the Spearman correlation to assess the gene imputation performance on 20% of the genes in the SRT dataset. The results showed that gimVI imputes much more efficiently than LIGER and Seurat. However, the reported results can vary with the choice of K, and the model was evaluated on only a small fraction of genes.

Tangram.
The authors of [90] developed the Tangram model to map spatial information into scRNA-seq data and to align histological data to anatomical positions via a DL framework. Technically, Tangram is a DL tool for aligning sc/snRNA-seq data to spatial data using nonconvex optimization. The inputs of Tangram are sc/snRNA-seq data and SRT data from the same region or tissue type, and the output is a matrix that contains the probability of assigning each cell in the sc/snRNA-seq data to each voxel of the SRT data.
Tangram first randomly maps the sc/snRNA-seq data into space, then updates the alignment through an objective function. In the nonconvex optimization process, Tangram compares the cell-density distributions of the sc/snRNA-seq and SRT data using KL divergence, while gene expression is assessed by cosine similarity. Although Tangram was developed specifically for the reconstruction of spatial maps, the imputation task is effectively performed as an intermediate step; therefore, the original paper has no quantitative comparison between Tangram and other imputation methods. A recent comparative performance evaluation of imputation methods [91] placed Tangram third-best, after stPlus and gimVI. However, Tangram has been shown to have the highest running time among the compared imputation methods.
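A toy version of that objective, a KL-divergence term on cell densities plus a per-gene cosine-similarity term, can be written directly. The soft-assignment matrix, data sizes, and the equal weighting of the two terms below are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def tangram_objective(M, sc_expr, sp_expr, target_density):
    """Toy Tangram-style loss: KL on cell densities plus per-gene cosine fit.
    M (cells x voxels) holds soft assignment probabilities; rows sum to 1."""
    density = M.sum(axis=0) / M.sum()               # predicted cell density per voxel
    kl = float(np.sum(target_density * np.log(target_density / density)))
    pred = M.T @ sc_expr                            # expression projected onto voxels
    cos = np.sum(pred * sp_expr, axis=0) / (
        np.linalg.norm(pred, axis=0) * np.linalg.norm(sp_expr, axis=0))
    return kl - cos.mean()                          # minimize KL, maximize cosine

rng = np.random.default_rng(0)
sc_expr = rng.random((8, 6)) + 0.1                  # 8 cells x 6 genes
M = rng.dirichlet(np.ones(4), size=8)               # soft mapping onto 4 voxels
sp_expr = M.T @ sc_expr                             # "spatial" data consistent with M
uniform = np.full(4, 0.25)
loss = tangram_objective(M, sc_expr, sp_expr, uniform)
print(round(loss, 4))
```

With spatial data generated from the mapping itself, the cosine term is exactly 1 per gene, so the loss reduces to the density mismatch alone; optimizing M by gradient descent on this loss is the nonconvex step the text describes.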

Cell-type Decomposition
In the spatial transcriptomics method, transcripts are captured at spatial locations called spots [99], which often consist of a mixture of cells at low resolution (as in sequencing-based and ST/Visium technologies). The number of cells per spot differs due to tissue heterogeneity or the SRT technology [100]. It is therefore important to identify the cell composition of SRT data at the spot level (Figure 6). Various computational methods have recently been developed for this purpose, grouped into three main categories: 1) inference-based methods, 2) multivariate analysis and linear algebra-based methods such as SPOTlight [101] and SpatialDWLS [102], and 3) deep learning-based methods. Inference-based methods, including Stereoscope [103], RCTD [104], cell2location [105], DestVI [106], and STdeconvolve [107], utilize likelihood-based approaches and explicitly or parametrically assume a distribution for the input data. These methods belong to both the machine learning and statistical approaches, with the limitations discussed in Section 1. Meanwhile, deep learning-based methods such as GIST, DSTG, and Tangram estimate cell-type proportions using deep learning models. Some of these, such as VAE-based models, also rest on probabilities but are still considered deep learning-based methods in this paper. However, these deep learning methods have limitations in real-world applications.

DSTG.
The DSTG method, proposed by Song et al. [108], is a graph convolution network that uses a semi-supervised approach to decompose cell mixtures in SRT data. It creates pseudo-ST data from scRNA-seq data, projects the data into a 20-dimensional space, and builds a linked graph. The linked graph, represented as an adjacency matrix A, and the data matrix X are fed into a GCN network consisting of three convolution layers. The output of the network is the predicted proportions of different cell types in the pseudo and real SRT data, learned by minimizing the cross-entropy loss against the ground truth. The evaluation shows that DSTG outperforms SPOTlight on both synthetic and real SRT datasets. However, using the Euclidean distance to show the similarity between the pseudo-ST and real-ST data is not a fair comparison.

Fig. 6: Imputing missing genes and cell-type decomposition with deep learning models on a synthetic tissue. In the sequencing-based approaches, the transcriptomes are captured on spots, which cannot measure all genes inside a tissue. On the other hand, single-cell sequencing can measure them but requires tissue dissociation, leading to spatial information loss. Thus, using SRT and scRNA-seq data from the same tissue, deep learning models can measure the missing genes and the proportion of cells in each spot.
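The graph-convolution propagation at the heart of DSTG can be sketched with one symmetrically normalized layer followed by a softmax over cell types. The toy graph and random weights below are our stand-ins; DSTG itself stacks three such layers and trains them against the pseudo-ST labels.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(len(A))                      # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_spots, n_feats, n_types = 6, 20, 3
A = np.triu((rng.random((n_spots, n_spots)) > 0.5).astype(float), 1)
A = A + A.T                                          # symmetric linked graph
X = rng.normal(size=(n_spots, n_feats))              # 20-dimensional projections
H = gcn_layer(A, X, rng.normal(size=(n_feats, 8)))
props = softmax(H @ rng.normal(size=(8, n_types)))   # predicted cell-type proportions
print(props.shape)                                   # (6, 3); each row sums to 1
```

Because each layer mixes every spot's features with those of its linked neighbors, label information from the pseudo-ST spots propagates to the unlabeled real spots, which is what makes the approach semi-supervised.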

Tangram.
In Tangram, the authors demonstrated that inference-based deconvolution methods can be limited by their lack of spatial information, resulting in inaccurate detection of cell types defined by sparse markers. Tangram performs deconvolution on the ST/Visium technology, considered a low-resolution SRT method. Tangram first estimates the number of cells by an initial segmentation, and then passes the segmentation results to the Tangram model (see the previous section) to calculate the cell fraction per spot. Results on three Visium datasets showed that the mapped cell-type ratios were consistent with those from the snRNA-seq data. However, Tangram requires prior knowledge of cell numbers to perform the segmentation before deconvolution, which may not be easily obtained in higher-density tissues, such as tumors [90].

GIST.
In the study by Zubair et al. [109], a joint model was presented to improve cell-type decomposition by integrating gene expression data from spatial transcriptomics and image-derived data from the same tissue. The main objective of the model, named Guiding Image-based ST (GIST), is to use deep learning (DL) on images as prior information in a Bayesian probabilistic model for cell-type identification. GIST estimates the abundance of a cell type of interest (e.g., immune cells) at a given spot using a convolutional neural network (CNN). The JPEG images are first converted to an encoded tiled TIFF format and then fed to a VGG16 model pre-trained on the TCGA dataset, which produces a probability for each 50×50-micron patch. The spot-level probability is then obtained as a weighted sum over the patches overlapping the spot. GIST uses the estimated cell-type proportion from DL as an informative prior distribution and maps it onto the first round of model fitting in the GIST base model, which is trained on SRT data without prior information. The results showed that the GIST model improved the identification of immune cells in pathologist-annotated regions compared with the GIST base model using expression data only. The performance of GIST was demonstrated in breast cancer pathology, and it may generalize to immunofluorescence data. However, a comparison with current deep learning-based methods, such as Tangram and DSTG, is still needed.

Enhancement of Gene Expression Resolution
Sequencing-based spatially resolved transcriptomics data often lack single-cell resolution. To improve gene expression resolution in SRT data, various deep learning methods have been proposed that borrow information from neighboring areas to fill the gaps between spots. The overall process of this approach, including cell-cell interactions, is depicted in Figure 7. High-resolution information on cell morphology is available in some popular SRT technologies, such as Visium and SLIDE-seq [32], as histology images from H&E-stained tissue sections. However, statistical methods like RCTD that estimate cell-type-specific gene expression for each spot from the cell probabilities obtained in the deconvolution process are unreliable, as their results depend on the accuracy of the deconvolution step. BayesSpace avoids this issue by dividing each spot into multiple equal-size sub-spots and inferring their gene expression while keeping the total expression of the original spot constant; however, different splitting schemes may produce different results, making it challenging to determine the optimal solution. Owing to the capability of DL methods to integrate multiple data types, the following DL methods utilize histology images to enhance gene expression resolution, an advantage that none of the above methods exploits.

XFuse.
Bergenstrahle et al. [93] developed XFuse to infer high-resolution spatial gene expression by integrating low-resolution gene expression from in situ sequencing with high-resolution histology images. XFuse assumes that the conditional distributions of the gene expression data and the histology images (I) follow negative binomial (NB) and Gaussian distributions, respectively. It maps the parameters of these distributions from the latent tissue state (Z) via a convolutional generator network G, and uses variational inference to estimate the posterior of the latent variable given the observed expression data X at each location and the image I. It updates the variational and network parameters by minimizing the Kullback-Leibler divergence, which measures the difference between two distributions. Simultaneously, XFuse amortizes the inference through a convolutional recognition network that maps the histology images to the latent tissue state. The authors assessed XFuse on mouse olfactory bulb and human breast cancer datasets. The results revealed that XFuse uncovers distinct patterns in both datasets, outperforming a baseline that fills in missing data using non-missing neighbors' information. XFuse also achieved a lower median root-mean-square error (RMSE) and accurately predicted unseen samples. Compared with in situ hybridization data, XFuse better predicted gene expression patterns in the tissue. However, the model can only detect genes whose spatial patterns resemble the histology images.

HisToGene*.
The resulting super-resolution prediction pipeline in HisToGene (see the identifying SVGs section) was named HisToGene*. HisToGene* presents an approach for super-resolution gene expression prediction by averaging the predicted gene expression from dense histology image patches. The authors applied the trained HisToGene model to images and estimated gene expression at spot-level resolution. Treating spots as sentences in NLP, they created sub-patches covering four patches each to predict gene expression at a higher resolution than the original spots. The experiments from the HisToGene study were repeated for HisToGene*. The results showed that the HisToGene*-predicted spot-level gene expression had higher correlations with the observed spot-level gene expression in 19 sections compared to HisToGene; in the remaining six sections, HisToGene showed higher correlations. The study also observed a direct link between thyroid hormones and breast cancer risk among the top enriched pathways of HisToGene* [94], indicating that the gene sets predicted by HisToGene* contain more biologically meaningful information.

DeepSpaCE.
DeepSpaCE also performs super-resolution of spatial gene expression and imputation of tissue sections: it uses a trained model to estimate unmeasured genes in images with insufficient gene expression, and leverages semi-supervised learning (SSL) to improve its performance. DeepSpaCE was tested on a human breast cancer dataset consisting of three tissue sections (A, B, C) and related consecutive sections (D1-D3). The model was trained on sections C and D2 for the super-resolution step; sections D1 and D3 served as the training set for the section imputation step, with section D2 as the test set. In this setting, the model acted as a "teacher" model in SSL. Using sections A, B, and C as unlabeled data, the Pearson correlation coefficients (PCC) between actual and predicted expression increased. The SSL approach was also applied to cat and dog images from ImageNet and to the Cancer Genome Atlas (TCGA) dataset, but no improvement was observed.

Cell-Cell Interactions
Cell-cell interaction refers to the communication between cells through the binding of a ligand to its complementary receptor, a process known as ligand-receptor interaction (LRI). LRIs are crucial for extracellular communication and have been studied as a way to understand cell-cell interactions [112]. However, most current computational methods in this area either focus on intracellular interactions or are limited to small-scale experiments. Spatial transcriptomics provides gene expression profiles at spatial coordinates within individual cells, making it a potentially valuable tool for predicting LRIs (see Figure 7). The Giotto [48] toolbox is a comprehensive framework for analyzing spatial transcriptomics data, including a module for cell-cell interactions; Giotto and other statistical methods identify interactions within a cellular niche by modeling the expression of markers. MISTy [57] is a scalable machine learning framework that can identify a range of cell-cell interactions in spatial transcriptomics data by generating pairwise distances, but it is computationally expensive. A unique feature of MISTy is identifying CCIs within specific regions of interest, which facilitates understanding of marker interactions and has not yet been considered by deep learning-based methods. Additionally, the performance of non-DL models degrades as SRT data from different species and tissues grow in size and resolution. Here, DL methods can provide a better solution for identifying cell interactions in large SRT data.
stLearn presents a method that analyzes cell-type diversity and LRIs separately and then combines them into an interaction measure across a whole tissue section. The algorithm has two steps: i) quantifying cell-type diversity by dividing the tissue into windows and counting the cell types of interest, and ii) finding LRIs by calculating the co-expression of ligand and receptor pairs in the central spot and its neighbors using the CellPhoneDB algorithm [113]. Given an expression threshold, the co-expressed L-R pairs for the central spot are calculated by Eq. (see supporting material). A CCI matrix is then generated to show the significant L-R pairs for each spot, and clustering is performed to identify tissue regions with the most similar L-R co-expression values. Finally, incorporating both the cell density and CCI measures, stLearn identifies tissue regions with a high likelihood of cell-cell interaction. The CCI algorithm was tested on a breast cancer dataset and detected high interaction between tumour and immune cells.
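The neighborhood co-expression step can be sketched as a simple thresholded count over a central spot and its neighbors. The toy expression matrix and gene indices below are invented, and CellPhoneDB's actual scoring is permutation-based rather than a plain threshold.

```python
import numpy as np

def lr_coexpression(expr, neighbors, lig_idx, rec_idx, threshold=1.0):
    """Count L-R pairs whose ligand and receptor are both expressed above
    `threshold` somewhere in a central spot or its neighbors (simplified score)."""
    window = expr[neighbors]                          # central spot + its neighbors
    lig_on = (window[:, lig_idx] > threshold).any(axis=0)
    rec_on = (window[:, rec_idx] > threshold).any(axis=0)
    return int((lig_on & rec_on).sum())

# Toy data: 5 spots x 6 genes; candidate pairs (gene0, gene1) and (gene2, gene3).
expr = np.zeros((5, 6))
expr[0, 0] = 2.0    # ligand of pair 0 in the central spot
expr[1, 1] = 2.0    # its receptor in a neighboring spot
expr[2, 2] = 2.0    # ligand of pair 1 expressed, but its receptor (gene 3)
                    # is silent everywhere, so pair 1 does not count
score = lr_coexpression(expr, neighbors=[0, 1, 2], lig_idx=[0, 2], rec_idx=[1, 3])
print(score)        # 1: only the first pair is co-expressed in the neighborhood
```

Repeating this count for every spot's neighborhood yields the per-spot CCI matrix that stLearn then clusters into interaction hotspots.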

GCNG.
Yuan and Bar-Joseph [114] reported that recent models, such as Giotto, have mainly concentrated on unsupervised correlation-based analysis for detecting extracellular interactions, which fails to predict interactions with complex patterns. To address this, they proposed the graph convolutional network for gene expression (GCNG) model, which predicts extracellular interactions from spatial transcriptomics data.
GCNG is a five-layer graph convolutional network consisting of two convolution layers, a flattening layer, and a sigmoid activation layer that outputs the probability of ligand-receptor interactions within cells. The network takes two inputs: the spatial locations of cells (a neighborhood graph) and gene expression pairs in each cell. First, it constructs an adjacency matrix A of size R×R from the total cell number R, in which an element is 1 if the Euclidean distance between the two cells is smaller than a predefined threshold, and 0 otherwise. Second, the input matrix X of size R×2 is built from the paired candidate genes in each cell. The two matrices are multiplied in the first layer and then mapped to an embedding vector, allowing the investigation of interactions between cells without a direct link. Finally, the output predicts the probability of interaction between the two paired genes. The model was tested with various normalizations of matrix A, and GCNG reached median AUROC/AUPRC of 0.99/1.0 and 0.99/1.0 for seqFISH+ and MERFISH, respectively. The model outperformed recent intercellular communication models such as Giotto and the single-cell Pearson correlation between ligands and receptors. However, GCNG uses a predefined distance criterion to select neighboring cells, which may bias the obtained results.
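Constructing the distance-thresholded adjacency matrix is direct; the toy coordinates below are invented, and the threshold is exactly the user-chosen criterion the authors caution about.

```python
import numpy as np

def neighbor_adjacency(coords, threshold):
    """A[i, j] = 1 if cells i and j are closer than `threshold`, else 0 (no self-loops)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    A = (dist < threshold).astype(float)
    np.fill_diagonal(A, 0.0)                 # a cell is not its own neighbor
    return A

coords = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])  # toy cell positions
A = neighbor_adjacency(coords, threshold=2.0)
print(A)    # cells 0 and 1 are neighbors; cell 2 is isolated
```

Raising or lowering `threshold` changes which cells count as neighbors and hence which interactions the network can even consider, which is the source of the bias noted above.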

conST.
conST leverages the advantages of clustering, SVG detection, and trajectory inference for the identification of target receptors on breast cancer cells and analysis of their microenvironment in IDC regions. To do this, conST first obtains latent features from the breast cancer dataset and clusters them into 20 clusters. It then identifies the three clusters containing the prominent lesion areas and applies trajectory inference to obtain a pseudotime ordering. The SVG detection algorithm (see conST in the SVG detection section) is applied to detect marker genes responsible for the tumour microenvironment. Finally, cross-cluster CCI analysis is performed using TraSig [116], and within-cluster analysis is done by label transfer from Seurat to detect active ligand-receptor pairs. The results demonstrated that conST can successfully detect IDC, DCIS, and edge tumour cell regions in the cross-cluster analysis, and active L-R pairs inside IDC regions in the within-cluster analysis.

Discussion and Future Look
We outlined the advantages and disadvantages of available DL algorithms for analyzing imaging- and sequencing-based SRT data (Table 1). Additionally, we provided a comprehensive technical overview to understand the performance of each method (see Supplementary Table 1). We further contrasted non-DL and DL methods to emphasize how DL methods can improve the analysis of SRT data. As many downstream tasks rely on individual components of the overall workflow, the downstream analyses will be negatively affected if any component does not work optimally. For example, the identification of SVGs relies on the clustering algorithm.
The downstream analysis will be impacted if the clustering algorithm has neglected a biologically relevant feature. Phylogeny-aware clustering is increasingly used for biological datasets by incorporating the phylogeny of the organisms into their effect size when building the resemblance matrix. We therefore suggest using phylogeny-aware clustering techniques by incorporating pathway information into SRT data [117]. One way to do this is to fold KEGG-level pathways or other reference assignments into the effect size of a gene: if two genes produce proteins that function within the same pathway, both of them changing is less impactful to the overall picture than changes in two genes from completely different pathways. Furthermore, the reviewed techniques could not leverage the full advantages of the rich information in SRT data, so there is still a need for more robust DL methods that jointly use spatial data, scRNA-seq, and high-resolution histology images. Most of the techniques reviewed in this study were developed around histology images. As CNNs have mainly obtained promising results in image processing, these networks were used more extensively in SRT data analysis than other deep models. However, the unique characteristics and complex structure of histology images (e.g., irregularity and large scale) pose challenges for DL algorithms, particularly when integrating with spatial data. We believe there is a substantial need to prove that the features extracted by deep models are biologically meaningful; for example, ECNN [52] used the CNN intermediate layers to plot the features, and ST-NET [71] used features from the latent vector to plot a 2D UMAP. Additionally, due to the size of histology images, all methods in this review tiled the input images into small patches. Effective analysis of histology images requires new methods that treat the extracted patches as unified data, i.e., by treating each patch as a word is treated in NLP techniques
[74]. Although HisToGene [74] can link patches through an attention-based model, other techniques could not consider the relation between patches in their deep algorithms. This problem is most pronounced in CNN-based models, making them more sensitive to the influence of batch effects [74]. Thus, it would be desirable to treat patches as a sequential problem and develop DL methods for sequential data, such as recurrent neural networks (RNNs), long short-term memory (LSTM) [118], and transformers. Another essential issue in SRT data processing is the batch effect, which becomes more apparent with the growing abundance of spatial transcriptomic datasets.
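To make the patch-as-sequence idea concrete, here is a minimal, weight-free scaled dot-product self-attention over patch embeddings, the core operation that lets a transformer-style model relate tiles to one another; this is a didactic sketch in numpy, not code from HisToGene.

```python
import numpy as np

def patch_self_attention(patches):
    """Single-head scaled dot-product self-attention over a sequence of
    patch embeddings (n_patches x dim). Learned query/key/value weight
    matrices are omitted for brevity: the embeddings themselves serve as
    queries, keys, and values, so each output row is a similarity-weighted
    mixture of all patches rather than an isolated tile."""
    d = patches.shape[1]
    scores = patches @ patches.T / np.sqrt(d)        # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ patches                         # context-mixed patches

# four hypothetical 8-dimensional patch embeddings from one tissue image
patches = np.random.default_rng(0).normal(size=(4, 8))
mixed = patch_self_attention(patches)
```

Each row of `mixed` now carries information from every other patch, which is precisely the cross-tile context that independent CNN processing of patches discards.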
Although several DL methods have been developed for batch effect correction in scRNA-seq [46,120], the SRT domain still lacks a dedicated deep model addressing its batch effect challenges. Importantly, the problem is even more complex in SRT data because of spatial dependency and the association with histology images, so batch effects can affect both gene expression and histology images. SEDR [47] was the first deep model for batch effect correction, using the SEDR-derived embedding together with the Harmony algorithm [121] for batch effect removal. STAGATE [49] was another attempt, extracting the 3D expression domain and reducing the batch effect between consecutive tissue sections. However, neither of these algorithms accounts for histology images.
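As a deliberately minimal illustration of batch effect removal on embeddings, the sketch below simply re-centers each batch on the global centroid. Dedicated tools such as Harmony are far more sophisticated (iterative soft clustering with mixture models), so this is only a conceptual stand-in with names of our own choosing.

```python
import numpy as np

def center_batches(embeddings, batch_labels):
    """Remove per-batch mean shifts from spot embeddings: shift each
    batch so its centroid coincides with the global centroid. `embeddings`
    is (spots x dims); `batch_labels` assigns each spot to a batch.
    Illustrative only -- real correction must preserve biological
    variation, which this naive centering cannot distinguish from a
    technical shift."""
    corrected = embeddings.astype(float).copy()
    overall_mean = corrected.mean(axis=0)
    for b in np.unique(batch_labels):
        mask = batch_labels == b
        corrected[mask] += overall_mean - corrected[mask].mean(axis=0)
    return corrected

# two batches of spots with an artificial offset of +5 on every dimension
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 3))
emb[5:] += 5.0
labels = np.array([0] * 5 + [1] * 5)
fixed = center_batches(emb, labels)
```

After correction the two batch centroids coincide, removing the global shift between tissue sections.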
Since the batch effect may also occur in the associated histology images, methods that jointly evaluate gene expression and histology images are required to alleviate batch effects both within and between tissue sections simultaneously. Histopathology has provided a comprehensive perspective across various domains of medicine, such as disease staging and tracking cancer development in tissues.
As such, it is regarded as the gold standard for diagnosing almost all types of cancer [122]. Interestingly, spatially resolved transcriptomics has enabled the analysis of both imaging and molecular features. SpaCell was the first DL model for cancer stage classification using both images and gene expression. CNNTL [76] was another deep model, classifying schizophrenia patients into case and control groups using image-based SRT data. With the development of SRT technology and the decreasing cost of generating SRT data, routinely diagnosing diseases with SRT data would be a cutting-edge application. It would also be desirable to record biological variables such as sex, race, and age while capturing gene expression in parallel. We envisage that accounting for such variation across individuals in SRT data and the associated histology images will revolutionize the future of disease identification.
In analyzing SRT data, the pre-processing step can dramatically affect the results. The captured locations undergo sequencing to generate the count matrix known as the gene expression matrix. However, although rarely acknowledged explicitly, count data obtained from sequencing machines have a compositional nature, in which the abundance of each gene can only be described as a proportion or probability relative to the other genes in that sample [123]. Consequently, the gene expression matrix in SRT data, as compositional data, does not live in Euclidean space but rather in a sub-space known as the simplex [124]. In simplex space, the proposed alternative metric is the Aitchison distance. Alternatively, transformation methods such as the log-ratio transformation map compositional data into real space [124], making the Euclidean distance meaningful. Thus, applying conventional analyses, including dimension reduction and statistical methods, without appropriate transformation and normalization strategies can lead to misinterpretation of the data (refer to [123] for more on compositionality in sequencing data). Although the techniques reviewed in this study mainly apply the Euclidean distance to the spatial coordinates, which is entirely meaningful, using PCA or clustering algorithms (e.g., K-means) on the untransformed count data is at odds with the compositional data hypothesis. Among the reviewed techniques, 13 methods applied a log-transformation to the gene expression matrix, and the remainder performed only normalization. For example, in the stLearn approach, the authors proposed SMEClust normalization, which performs PCA and UMAP on the normalized genes without any transformation. We therefore suggest that future work take account of compositionality in the gene expression matrix and investigate other transformation and normalization approaches. Regardless of the technology that
provides the SRT data, the gene expression matrix is sparse, containing an excessive number of zero values and displaying appreciable overdispersion, which makes it challenging to find an appropriate model for the count data. Since many statistical and DL models (e.g., VAE models) directly model the count data, understanding the overdispersion and zero inflation patterns of gene expression is essential. The fitted model can also indicate whether this sparsity is caused by the platform (in which case an imputation method is needed) or by gene expression heterogeneity across tissue locations (in which case the model should account for overdispersion and zero inflation) [125]. Zhao et al. [125] showed that the preferable models across most SRT technologies are the Poisson or negative binomial (NB) models with direct modeling of overdispersion and without an additional zero inflation term. Additionally, they noted that the excessive zero counts in SRT data potentially reflect biological variation, so imputation methods may introduce substantial noise by adding non-zero values, negatively impacting the analysis. We therefore strongly suggest further assessment of existing imputation methods such as gimVI (which also uses a ZINB model for the count data) and Tangram with respect to the limitations mentioned above. In addition, all of the deconvolution methods mentioned are reference-based, meaning they rely on an external scRNA-seq dataset from the same tissue to estimate cell proportions. Chen et al.
conducted a comprehensive analysis of computational methods for cell-type deconvolution using internal inference (using a single-cell resolution SRT dataset) and external inference (using an scRNA-seq dataset from the same tissue as reference), while also investigating the impact of gene-subset selection. The results showed that Tangram and DSTG performed best with perfectly matched internal references, and that deconvolution performance can be affected by the selection of genes: with an external reference, most methods performed better with top cell-type marker genes than with highly variable gene (HVG) subsets. In analyzing single-cell resolution transcriptomics datasets, dimension reduction techniques such as PCA are a crucial step, followed by clustering algorithms that optimize different loss functions. The techniques reviewed in this study treated dimension reduction as an error-free route to low-dimensional features; it would be valuable for future studies to develop methods that combine dimension reduction and clustering under a unified loss function, and to evaluate the performance of dimension reduction approaches [127]. Assay for Transposase-Accessible Chromatin with high-throughput sequencing (ATAC-seq) is a technology designed for genome-wide profiling of chromatin accessibility [128]. With the emergence of single-cell biology, single-cell ATAC sequencing (scATAC-seq) has provided chromatin accessibility at single-cell resolution. Because current epigenomic profiling methods lack spatial resolution, several methods [129,130] have been developed to perform spatially resolved chromatin accessibility profiling (spatial ATAC) in animal and human tissue sections; however, these technologies still require higher resolution for their spatial information. Recently, VAEs have been utilized to learn a joint latent space for gene imputation tasks, such as gimVI, and to understand the phenotypic interplay between gene
expression and TCR sequence [131]. Moreover, integrating multi-omics into a unified view has received considerable attention, with researchers jointly profiling the transcriptional and chromatin landscape of single cells [132,133,134]. Thus, developing DL models that integrate scATAC-seq and spatial ATAC-seq to jointly learn a latent embedding would be innovative [135].
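Two of the count-matrix issues raised above, compositionality and overdispersion, can be checked in a few lines of numpy. The sketch below (illustrative, not drawn from any reviewed tool) implements the centered log-ratio transform with the associated Aitchison distance, and a per-gene variance-to-mean dispersion index.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform: maps compositional counts off the
    simplex into real space, where Euclidean-based tools (PCA, K-means,
    UMAP) become meaningful. The pseudocount guards against the zeros
    that dominate SRT count matrices."""
    logx = np.log(counts + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)

def aitchison_distance(c1, c2, pseudocount=1.0):
    """Aitchison distance: Euclidean distance between CLR-transformed
    compositions, the natural metric in simplex space."""
    z = clr(np.vstack([c1, c2]), pseudocount)
    return float(np.linalg.norm(z[0] - z[1]))

def dispersion_index(counts):
    """Per-gene variance-to-mean ratio across spots: ~1 under a Poisson
    model, well above 1 when the overdispersion motivating negative
    binomial modeling is present. `counts` is (spots x genes)."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(mean > 0, var / mean, np.nan)

# two spots with identical composition but different sequencing depth:
# with no zeros and no pseudocount, their Aitchison distance is zero
spot_a = np.array([10.0, 5.0, 5.0])
spot_b = np.array([20.0, 10.0, 10.0])
d = aitchison_distance(spot_a, spot_b, pseudocount=0.0)

# a Poisson gene vs. an overdispersed gene with the same mean (5)
rng = np.random.default_rng(1)
counts = np.column_stack([
    rng.poisson(5.0, size=2000),
    rng.negative_binomial(n=2, p=2 / 7, size=2000),  # mean 5, variance 17.5
])
di = dispersion_index(counts)
```

The two spots above differ only in depth, so the Aitchison distance between them is zero, while the dispersion index cleanly separates the Poisson gene (ratio near 1) from the overdispersed one.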

Conclusion
In conclusion, several innovative deep learning (DL) approaches have been developed to address the challenges of analyzing spatially resolved transcriptomics (SRT) data. DL algorithms are well-suited to these challenges because they can uncover complex patterns and efficiently analyze large, multi-modal data. As SRT data grow and diversify, software and DL approaches that can effectively interrogate multimodal SRT data will be in high demand. In this paper, we thoroughly reviewed all DL methods and the challenges they address in analyzing spatially resolved transcriptomics data. We categorized these methods into six main categories based on their main task and downstream analysis: identifying the spatial domain, identifying spatially variable genes, imputing missing genes, improving gene expression resolution, analyzing cell-cell interactions, and performing cell-type decomposition. We hope this review serves as a comprehensive reference for guiding the usage of DL methods in SRT data analysis and encourages scholars with complementary expertise to collaborate and develop new methods that integrate gene expression, spatial information, single-cell data, and digital pathology to drive innovation in these spatial technologies.

Fig. 3 :
Fig. 3: Spatially resolved transcriptomics and its six sub-categories with corresponding applications. Spatially resolved transcriptomics provides gene expression profiling with spatial information in tissues. Experimentally, the measured gene expressions are captured in spots, complemented by a high-resolution histology image of the same tissue section. The resolution of spots differs among spatially resolved transcriptomic techniques, from cellular resolution (spots containing multiple cells) down to subcellular resolution. Deep learning approaches have been leveraged in spatially resolved transcriptomics data analysis to address six domains: 1) identifying the spatial domain, 2) identifying spatially variable genes (SVG), 3) imputing missing genes, 4) enhancement of gene expression resolution (GER), 5) cell-cell interactions, and 6) cell-type decomposition. The small colored circles beside each model represent which SRT data are used in the model (blue: histology image, red: gene expression, green: spatial information).

Fig. 4 :
Fig. 4: Identifying spatial domains with deep learning algorithms on a synthetic tissue. The reviewed papers mainly leveraged deep models to learn a latent embedding and then pass it into an unsupervised clustering algorithm, or leveraged deep learning models to segment the spatial domains directly.

Fig. 5 :
Fig. 5: Identifying spatially variable genes with deep learning methods on a synthetic tissue containing four layers (i.e., spatial domains). A) Some of the reviewed papers considered this a primary task and trained a deep model to predict the value of marker genes. B) Other approaches applied ML or statistical methods to identify SVGs in each spatial domain obtained by clustering algorithms.

Fig. 7 :
Fig. 7: Enhancing gene expression resolution and cell-cell interactions in SRT data. A) Since the distances between spots differ among the sequencing-based approaches, borrowing information from neighboring spots makes it possible to enhance the gene expression resolution in the empty areas between spots. B) The spatial location of each spot facilitates the identification of ligand-receptor interactions for each cell in SRT data.
After performing pre-processing on the spatial information and gene expression (see the pre-processing step in Figure 2), conST extracts morphological features with a masked AE (MAE) model [59] pre-trained on the ImageNet dataset. conST uses a deep AE to obtain a latent embedding from the gene expression, while the spatial information is passed into the VGAE to encode the position of spots into node attributes. In the main training stage, conST uses contrastive learning to learn more robust embeddings: the spots in SRT data are treated as the graph's vertices, with each node carrying multiple attributes, including the gene expression matrix and the morphological features.

RayleighSelection calculates each gene's combinatorial Laplacian score (CLS), constructing the Laplacian matrix from the degree and adjacency matrices (refer to the GCN section); the genes with the lowest CLS are the most spatially variable. RayleighSelection was tested on the FISH dataset, and genes with high consistency to the spatial expression pattern were obtained. Nevertheless, spectral-based methods are computationally expensive and must be scaled up for large datasets. Since H&E-stained histology images are cheaper and more accessible than spatial transcriptomics data, it is desirable to leverage them jointly with SRT data. Fluorescence in-situ hybridization and in-situ sequencing techniques complement spatial transcriptomics, capturing more genes with low spatial resolution along with tissue images; in-situ hybridization (ISH) in particular captures high-resolution images of spatial gene expression at cellular resolution. Due to the complex nature of these images and the need to exploit them fully, deep learning-based methods have performed better than spectral methods. Concerning these limitations and the influence of histology images, DL methods have been proposed for SVG detection in both the cluster-based domain, such as CoSTA, and the deep-based domain, including ST-NET, SPADE, HisToGene, CNNTL, and DeepSpaCE.

Table 1:
Deep Learning Algorithms.

Key points:
1. Spatially resolved transcriptomics (SRT) is a new technology providing the position of captured expression across the tissue at single-cell resolution.
2. Twenty-one deep learning-based methods are systematically reviewed in this paper and categorized into six main groups based on their tasks and downstream analyses.
3. A brief discussion of current machine learning approaches is presented for each category to assess the advantages of the deep learning models proposed for that category relative to traditional machine learning models.
4. A unified description of the model and results corresponding to each deep learning method is presented, and the mathematical models are discussed in the supplementary section.
5. Lastly, a comprehensive summary of the deep learning algorithms, evaluation metrics, and datasets used by each approach is tabulated.

Author contributions: HAR designed the study; RZN wrote the manuscript; HAR, RZN, RE, CM, AB, AA, NL, and ML edited the manuscript; RZN generated all figures and tables. HAR provided advice on generating the figures and tables. All authors have read and approved the final version of the paper.