EAGS: efficient and adaptive Gaussian smoothing applied to high-resolved spatial transcriptomics

Abstract Background The emergence of high-resolved spatial transcriptomics (ST) has facilitated the research of novel methods to investigate biological development, organism growth, and other complex biological processes. However, high-resolved and whole transcriptomics ST datasets require customized imputation methods to improve the signal-to-noise ratio and the data quality. Findings We propose an efficient and adaptive Gaussian smoothing (EAGS) imputation method for high-resolved ST. The adaptive 2-factor smoothing of EAGS creates patterns based on the spatial and expression information of the cells, creates adaptive weights for the smoothing of cells in the same pattern, and then utilizes the weights to restore the gene expression profiles. We assessed the performance and efficiency of EAGS using simulated and high-resolved ST datasets of mouse brain and olfactory bulb. Conclusions Compared with other competitive methods, EAGS shows higher clustering accuracy, better biological interpretations, and significantly reduced computational consumption.

Secondly, it is worth pointing out that the reviewers' comments and suggestions have constructively helped us improve the quality and presentation of our manuscript. In light of their comments and suggestions, we have carefully revised the manuscript, with the main changes highlighted in red in the revised manuscript.
Thirdly, with many thanks to the reviewers, we address their comments below.

Authors Response to Comments of Editor
Comment: In addition, please register any new software application in the bio.tools and SciCrunch.org databases to receive RRID (Research Resource Identification Initiative ID) and biotoolsID identifiers, and include these in your manuscript (in the availability section).
Response: Many thanks to the editor for this comment. We have registered the software application on bio.tools and SciCrunch.org, respectively, and added the RRID and biotoolsID to the "Availability of Source Code and Requirements" section of the revised manuscript on Page 24, Lines 480-481.

Authors Response to Comments of Reviewer #1
Major comments:
Comment 1: More comprehensive and systematic benchmarking on large-scale synthetic dataset. In current manuscript, the comparisons were mainly performed on datasets where cell type annotations in previous researches were utilized. While the comparison results were appealing, to further demonstrate the strength of EAGS, it would be very insightful to incorporate more systematic benchmarking on synthetic data where "ground truth counts" are ready and more controllable. Recently, tools have been proposed to generate large-scale spatiomics data, e.g. scDesign3 (Nature Biotech 2023). In addition, the authors could also consider benchmarking with more spatial imputation methods such as SpaGE (NAR 2020) and classical scRNA-seq methods, for example kNN-smoothing.
Response 1: Many thanks to the reviewer for the comment. We have used scDesign3 to generate a simulated dataset of the spatial transcriptomics mouse olfactory bulb. In addition to the comparison algorithms in the previous manuscript, we have added the spatial imputation method SPCS [22] and the classical scRNA-seq method kNN-smoothing [41] for comparison in the experiments on simulated and high-resolved datasets. Since SpaGE requires a single-cell transcriptome dataset as a reference and the datasets we used lack such a reference, we have used the competitive method SPCS as an alternative. The details can be seen on Page 1, Lines 23
Comment 2: From Fig2-5, legend font can be increased. It's blurry and not visible properly.
Response 2: We thank the reviewer for this comment. We have enlarged the font of the legend in Figures 2-5 in this revision. In addition, we have provided vector figures in PDF format to clearly show the details.
Comment 3: Why only ISH data is analyzed when lots of ISS datasets are also available. Would EAGS perform equally well on ISS data like from 10X Xenium?
Response 3: Many thanks to the reviewer for raising this comment. We have used ISS datasets based on in situ capture as the test set in our experiments, such as the spatial transcriptomics datasets of mouse brain and olfactory bulb. Since these ISS datasets lack ground truth, and Allen's ISH dataset is a published standard dataset, we have used Allen's ISH dataset as a reference for evaluating the different imputation methods. EAGS is an imputation method for high-resolved spatial transcriptomics data. Given that 10X Xenium has high resolution, EAGS has the potential to be applied to such ISS data. We will also explore improvements and more applications of EAGS in future work.
Finally, we (the authors) would like to sincerely thank the editor and anonymous reviewers again for their time and effort spent in handling the manuscript, as well as for providing many constructive comments that further improved the presentation and quality of this manuscript.

Introduction
Recent advances in barcode-based spatial transcriptomics (ST) technology include 10X Visium [1], Slide-Seq [2,3], and high-definition spatial transcriptomics [4]. These advances made it feasible to obtain expression profile information for entire genes, which is extremely important for comprehending biological functions and interaction networks [5,6]. High-resolved ST is an essential technical support for analyzing complex biological problems, as the function of complex biological tissues is closely related to the location of the transcriptional expression events within the tissue. Barcode-based high-resolved ST technology captures fewer genes at a single sequencing site (spot) than low-resolution ST technologies, such as 10X Visium [1], leading to high sparsity of the complete gene expression profile. In certain cell cycle phases, some cells do not express a set of genes, whose expression thus appears to be null. In addition, amplification bias, cell cycle, library creation, and poor RNA capture rates cause some genes to be expressed but not captured by DNA nanoballs; such genes are called "dropout" [10]. Such biases adversely affect downstream analyses, such as clustering, cellular interaction analyses, and pseudo-temporal reconstructions [11,12], when the raw data are directly processed.
Various imputation methods have been proposed to solve the "dropout" in gene expression for scRNA-seq datasets [13]. These imputation methods can be broadly classified into 3 categories according to their principles. The first category smooths or diffuses the levels of gene expression in cells with comparable expression patterns to correct (typically) all values (zero and non-zero). MAGIC imputes the missing data on scRNA-seq datasets based on the Markov chains of adjacent domains and recovers gene expression of the characterized cells by data diffusion [14]; DrImpute finds similar cells by consensus clustering and pools their gene expression values to estimate the loss [15]. The second category models the gene expression profile with an existing probabilistic statistical model to simulate the distribution of genes. SAVER assumes that each gene in each cell follows a Poisson-Gamma distribution (a negative binomial distribution) and estimates prior parameters to recover the expression of the missing genes using Poisson LASSO regression methods [16]. scImpute constructs a mixed Gamma-Normal distribution based on the gene expression profile and uses a non-negative least squares regression model, sc-transform (R package), to perform the imputation [17]. The third category uses deep learning methods to capture the latent spatial representation of cells and reconstruct the expression matrix. DCA is an auto-encoder that predicts the parameters of the selected distribution to generate estimates [18]. These methods offer practical recommendations for single-cell imputation; however, they do not account for spatial information in ST datasets, and the methods based on specialized statistical models cannot cope with the high sparsity of high-resolved ST datasets.
In recent years, ST-based imputation methods have been presented. Sprod first projects gene expression onto a latent space, connects nearest neighbor cells to construct patterns, and then learns the denoising matrix using a shared minimization of the graph's Laplacian smoothing term and reconstruction errors [19]. For ST data without pathology images, Sprod provides cluster-based pseudo-images, but these do not accurately reflect the actual cell clustering situation. STAGATE introduces a graph attention auto-encoder to construct a spatial neighbor network based on sequencing spots; it then introduces a distribution of the spatial neighbor network in the middle layer of the auto-encoder to learn the correlation of neighboring sequencing spots and subsequently obtains the recovered gene expression profile from the decoder [20]. However, the labels processed based on a specific clustering method are not completely consistent with the reality of the biological organization. It has been noted that the self-attention layer of the network does not consider the interaction between spot pairs and the information about the graphical structure of the spots [21].
To address these problems, we propose an efficient and adaptive Gaussian smoothing (EAGS) method, which is applied to high-resolved ST data. EAGS is derived from the fact that the spatial location of cells in biological tissues has a close relationship with their microenvironment, and the gene expression levels of cells within the same microenvironment are similar [17,22]. EAGS constructs different patterns based on cell expression profiles and cell location information to generate a similarity matrix. The similarity matrix then assesses cellular similarity within expression profiles to recover true biosignatures. By refining the information from proximal cells using adaptive smoothing weights and generating new gene expression profiles, the "dropout" is reduced.
The resulting dataset provides RNA abundances more accurately than the original gene expression profile and preserves more of the true biological signal. EAGS enables the usage of high-sparsity ST datasets since it does not depend on a prior statistical model of the gene expression profiles. More crucially, EAGS can be used for large-scale ST datasets without requiring a lot of operating memory, since it does not call for the computation of parameters for a pre-defined model and therefore skips most of the iterative process. We applied EAGS to the simulated and high-resolved ST datasets of mouse brain and olfactory bulb, and compared it with widely used imputation methods to evaluate its efficacy in terms of fewer "zeros" in the gene expression profiles, improved cell annotation, and spatial organization replication.

The workflow of EAGS
In EAGS, the original expression matrix with single-cell resolution was first used to generate patterns based on expression and spatial information. Then, the tight relationship between cells was established using two distinct patterns. Finally, the smoothing weights calculated from the patterns were used to define the level of smoothing for each cell, and were applied to recalculate the gene expression.

Datasets
There were two methods to generate gene expression profiles from Stereo-seq in situ captured data. One was to acquire the spatial location information of various cells by conducting cell identification and segmentation on the optical stained image, and then match the cells in the image to the sequencing spots with spatial coordinates [7,23]. The other was to take consecutive X×X bins as units (considered as cells), where each bin (binX) contains the total gene expression of X×X spots [7]. We used the mouse brain [24] and olfactory bulb [23] datasets at single-cell resolution, which were generated by the first method and included 61,857 and 33,272 cells, respectively. The In Situ Hybridization (ISH) images of the signature genes from the mouse brain were obtained to help compare the impacts of smoothing [25,26]. We also used another mouse olfactory bulb dataset generated by the second method, which contained 812 units of Bin140 [27].
The above two categories of gene expression profiles with spatial information were preprocessed with the Scanpy toolbox (V1.9.1; RRID:SCR_018139) to remove low-quality signals that might be blended into the gene expression data [28,29]. For the first category, we first filtered genes by expression: genes expressed in at least 10 cells were kept. Next, cell outliers were filtered using gene expression: cells expressing at least 300 MID counts were kept. The 2% highest MID counts in all cells were subtracted from the overall number of MIDs across all cells in the gene expression profile. Finally, the coordinates of the spatial position information of the cells and the log-transformed and normalized gene expression profiles were employed as input to EAGS. For the second category, genes expressed in at least 10 cells were kept. Next, cell outliers were filtered using gene expression: cells expressing at least 300 MID counts were kept.
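The filtering thresholds described above can be sketched in plain NumPy. This is a minimal illustration, not the authors' pipeline: the function name `filter_and_normalize` and the normalization target `target_sum=1e4` are assumptions, and the 2% MID adjustment is omitted because its exact form is not specified here. In Scanpy, the corresponding steps would typically use `sc.pp.filter_genes`, `sc.pp.filter_cells`, `sc.pp.normalize_total`, and `sc.pp.log1p`.

```python
import numpy as np

def filter_and_normalize(counts, min_cells=10, min_counts=300, target_sum=1e4):
    """Filter a (cells x genes) matrix of raw MID counts, then log-normalize.

    target_sum is an assumed library-size target; the text does not state one.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep genes detected (count > 0) in at least `min_cells` cells.
    gene_mask = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, gene_mask]
    # Keep cells with at least `min_counts` total MID counts.
    cell_mask = counts.sum(axis=1) >= min_counts
    counts = counts[cell_mask]
    # Library-size normalization followed by log(1 + x).
    totals = counts.sum(axis=1, keepdims=True)
    normalized = np.log1p(counts / totals * target_sum)
    return normalized, cell_mask, gene_mask
```

The returned masks make it possible to subset the spatial coordinates with the same cells that survive filtering.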

Pattern construction
Since "similar cells" in organisms with comparable molecular microenvironments express their genes similarly, the regions with identical expression patterns may originate from the same cell type or from the same biological tissue location [17,22]. Using "similar cells" to supplement the information of a particular spot is therefore feasible. Based on spatial location data and gene expression profiles, we constructed two patterns to divide the cells on an ST slice's gene expression profile into several clusters. A comprehensive description of these two pattern styles is given below.
Balltree is a binary tree data structure that performs well on high-dimensional datasets, especially for fast nearest-neighbor search [30,31]. Different definitions of the Expression Neighbor Matrix (ENM) entry are given depending on whether $Cell_j$ can be attributed to the gene expression pattern of $Cell_i$: where $ENM_{i,j} \neq 0$, it does; where $ENM_{i,j} = 0$, it does not.

Algorithm 1. Builds the tree structure of Balltree
Balltree is built using a divide-and-conquer method. Initially, Balltree has only one (root) node and all data points are assigned to it. At each step, the partition corresponding to each node is split into two sub-partitions. For a partition $p_i$, the splitting procedure is as follows:
Step 1: Find the centroid of the node points in LDIM. Reducing an n-dimensional matrix to a two-dimensional plane, the centroid of the node is centroid 1.
Step 2: Select the farthest point from centroid 1 in $p_i$ as the first (left) child pivot $p_i^L$.
Step 3: Select the farthest point from $p_i^L$ as the second (right) child pivot $p_i^R$.
Step 4: Assign each data point in $p_i$ to the partition whose pivot is closer.
Step 5: Assign the new sub-partitions as children of $v_i$ in Balltree, i.e., $v_i^R$ and $v_i^L$.
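The five splitting steps can be illustrated with a short NumPy sketch. It is a simplified reading of Algorithm 1 rather than the scikit-learn implementation: the names `split_partition` and `build_balltree`, the dictionary node layout, the `leaf_size` stopping rule, and the tie-breaking (ties go to the left child) are all assumptions made for illustration.

```python
import numpy as np

def split_partition(points):
    """Split one partition into left/right index sets (Steps 1-4)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)                          # Step 1
    d_centroid = np.linalg.norm(points - centroid, axis=1)
    left_pivot = points[np.argmax(d_centroid)]              # Step 2
    d_left = np.linalg.norm(points - left_pivot, axis=1)
    right_pivot = points[np.argmax(d_left)]                 # Step 3
    d_right = np.linalg.norm(points - right_pivot, axis=1)
    to_left = d_left <= d_right                             # Step 4 (ties go left)
    return np.flatnonzero(to_left), np.flatnonzero(~to_left)

def build_balltree(points, indices=None, leaf_size=2):
    """Recursively attach sub-partitions as children (Step 5)."""
    points = np.asarray(points, dtype=float)
    if indices is None:
        indices = np.arange(len(points))
    node = {"indices": indices, "left": None, "right": None}
    if len(indices) > leaf_size:
        left, right = split_partition(points[indices])
        # If all points coincide, one side is empty and the node stays a leaf.
        if len(left) and len(right):
            node["left"] = build_balltree(points, indices[left], leaf_size)
            node["right"] = build_balltree(points, indices[right], leaf_size)
    return node
```

Because each pivot always falls into its own sub-partition, both children are strictly smaller than the parent whenever the points are not all identical, so the recursion terminates.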

Algorithm 2. Using Balltree to find the nearest Neighbor of each cell
Input: Balltree structure nbrs , nearest neighbor num k , test point t,

ENM
The difference between ST and scRNA-seq datasets is that an ST dataset provides the spatial coordinate position of each sequencing site (spot). After StereoCell processing, ST data are spots with single-cell resolution, where every spot corresponds to a single physical cell with spatial coordinates [23]. Cells in adjacent regions of histological sections are more likely to come from the identical microenvironment and belong to similar or identical cell types than cells from other areas. Therefore, we offer the spatial neighborhood pattern as a reference and classify the cluster of cells that are physically adjacent to a specific cell as its "spatial neighborhoods".

The nearest neighbor contribution matrix (NCM) for an ST dataset containing m cells is the element-wise product obtained by multiplying the corresponding elements of the ENM and the Spatial Distance Matrix. In the percentile calculation on the nonzero NCM, G represents the number of vectors of the nonzero NCM, and c and t represent the integer and fractional parts of the calculation result, respectively. The Distance Distribution Threshold (DDT) is defined from the calculated c and t as the p-th percentile along the specified axis of the nonzero NCM.
The calculation of the adaptive weights is based on the DDT, where $GS_{new}$ is the degree of smoothing information and is an adaptive weight determined by the degree of similarity between the cells in the pattern's framework, and GS() is used to calculate the adaptive weights. In the precise smoothing weight contribution between cells, gs is a hyperparameter that characterizes the overall smoothness of the reference gene expression profile, that is, of the entire chip. DDT in Eq. (6) refers to the similarity distance between cells generated based on the Gene Expression Pattern and the Spatial Neighbor Pattern in the entire gene expression distribution matrix; it serves as a global benchmark reference for information distribution and characterizes the distribution of the gene expression matrix at the overall level. $\sigma$ calculated in Eq. (8) refers to the standardized parameter of the Gaussian model. The value of $\sigma$ calculated from DDT makes the smoothed gene expression matrix more consistent with the preset distribution, so that $GS_{new}$ can be measured with the help of existing expression quantities and genes are complemented without changing the overall expression profile.
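The roles of c, t, DDT, gs, and $\sigma$ can be made concrete with a small sketch. Two assumptions are made here and are not confirmed by the text: the p-th percentile uses linear interpolation at position (G - 1)·p/100 = c + t (the default convention of `numpy.percentile`), and the Gaussian weight has the unnormalized form exp(-d²/(2σ²)), so that requiring a weight of gs at distance DDT gives σ = DDT / sqrt(-2 ln gs). The function names are hypothetical.

```python
import math
import numpy as np

def distance_distribution_threshold(nonzero_ncm, p):
    """p-th percentile (0-100) of the nonzero NCM values by linear interpolation."""
    values = np.sort(np.asarray(nonzero_ncm, dtype=float).ravel())
    G = values.size
    pos = (G - 1) * p / 100.0
    c = int(math.floor(pos))          # integer part
    t = pos - c                       # fractional part
    upper = values[min(c + 1, G - 1)]
    return (1.0 - t) * values[c] + t * upper

def sigma_from_gs(ddt, gs=0.95):
    """Back-calculate sigma so that the assumed Gaussian weight equals gs at DDT."""
    return ddt / math.sqrt(-2.0 * math.log(gs))

def gaussian_weight(d, sigma):
    """Assumed unnormalized Gaussian weight at distance d."""
    return np.exp(-np.square(d) / (2.0 * sigma ** 2))
```

Under these assumptions, distances below DDT receive weights above gs and larger distances decay smoothly toward zero, which is one way a single global threshold can set the overall degree of smoothing.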

Smooth
The raw gene expression profile can be processed after the adaptive weights are obtained, where $E_{GS}$ represents the level of gene expression after adaptive weight smoothing and $P_A(i)$ represents all cells in the region where cell x is smoothed.
Step 1: Calculate the K-nearest-neighbor cell Euclidean distance distribution.
Step 2: Take the percentile value x of the distance distribution as the smooth threshold; x requires a value from 0.2 to 1.
Step 3: Use Eq. (6) to back-calculate the magnitude of $\sigma$ at this point; preset gs = 0.95.
Step 4: Calculate the Gaussian weights at other distances by substituting the $\sigma$ value into Eq. (7).
Step 5: Perform a re-weighted summation based on the newly calculated Gaussian weights and the original expressions.
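The steps above can be sketched end to end on a toy dataset. This is an illustrative approximation under several assumptions, not the EAGS implementation: neighbors are found by brute force on the raw expression matrix instead of PCA plus Balltree, the ENM is taken as a binary k-nearest-neighbor matrix, the Gaussian weight is assumed to be exp(-d²/(2σ²)), the percentile `p` is given on a 0-100 scale, and the re-weighted sum is normalized by the total weight with the smoothed cell's own expression given weight 1. The function name and parameters are hypothetical, and the dense pairwise arrays only suit small inputs.

```python
import numpy as np

def adaptive_gaussian_smooth(expr, coords, k=3, p=50.0, gs=0.95, include_self=True):
    expr = np.asarray(expr, dtype=float)      # (cells x genes)
    coords = np.asarray(coords, dtype=float)  # (cells x 2)
    m = expr.shape[0]

    # Expression pattern: binary k-nearest-neighbour matrix (stand-in for ENM).
    d_expr = np.linalg.norm(expr[:, None, :] - expr[None, :, :], axis=2)
    np.fill_diagonal(d_expr, np.inf)
    enm = np.zeros((m, m))
    rows = np.repeat(np.arange(m), k)
    enm[rows, np.argsort(d_expr, axis=1)[:, :k].ravel()] = 1.0

    # Spatial pattern: pairwise Euclidean distances between cell coordinates.
    sdm = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)

    # Element-wise product: spatial distance to each expression neighbour.
    ncm = enm * sdm
    nonzero = ncm > 0

    ddt = np.percentile(ncm[nonzero], p)                    # Steps 1-2
    sigma = ddt / np.sqrt(-2.0 * np.log(gs))                # Step 3
    weights = np.where(nonzero, np.exp(-ncm ** 2 / (2.0 * sigma ** 2)), 0.0)  # Step 4

    # Step 5: normalized re-weighted sum, optionally including the cell itself.
    self_w = 1.0 if include_self else 0.0
    numer = self_w * expr + weights @ expr
    denom = self_w + weights.sum(axis=1, keepdims=True)
    return numer / np.maximum(denom, 1e-12), sigma
```

Setting `include_self=False` corresponds to the variant in which the smoothed value is computed only from the neighboring cells, without the original expression of the smoothed cell.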

If relying entirely on the cells in $P_A(i)$ as smoothing factors without using the original gene expression of the smoothed $Cell_i$, Eq. (10) can be further streamlined so that $E_{GS}$ is calculated entirely from the expression levels of the cells in $P_A(i)$, regardless of the gene expression of $Cell_i$.

Evaluation method
We measured the imputation error by calculating the L2 norm of the difference between the smoothed matrix and the ground truth (L2-error) [38]. We used the Calinski-Harabasz Index (CHI) and the Davies-Bouldin Index (DBI) to evaluate the significance of the differences in intra-class and inter-class similarity of the clustering results. We used Moran's I and Geary's C to calculate the correlation of cellular marker genes in the gene expression space of the data before and after smoothing [33].

Imputation error by calculating the L2 norm
L2-error measures the difference between two matrices by calculating the Euclidean distance between their corresponding elements. A lower L2-error represents a higher degree of similarity between the two matrices, indicating that the method performs better. It is defined as follows:

$$\text{L2-error} = \sqrt{\sum_{i}\sum_{j}\left(Y_{i,j} - X_{i,j}\right)^{2}}$$

where $Y_{i,j}$ represents the reference gene expression matrix, and $X_{i,j}$ represents the smoothed gene expression matrix. L2-error is mainly used to compare the difference between the reference expression matrix with "ground truth counts" and the smoothed expression matrix.
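As a quick illustration, the L2-error can be computed as the L2 (Frobenius) norm of the element-wise difference. The function name `l2_error` is chosen here for illustration only.

```python
import numpy as np

def l2_error(reference, smoothed):
    """L2 norm of the element-wise difference between two equally shaped matrices."""
    reference = np.asarray(reference, dtype=float)
    smoothed = np.asarray(smoothed, dtype=float)
    if reference.shape != smoothed.shape:
        raise ValueError("matrices must have the same shape")
    return float(np.sqrt(np.sum((reference - smoothed) ** 2)))
```

For two matrices that differ by 2 in a single entry, the function returns 2.0; identical matrices give 0.0.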

Calinski-Harabasz Index
The CHI computes the sum of squares of the distances between points in the class and the class center to determine how closely a class is related [34]. The higher the CHI, the higher the similarity between cells of the same type in the cell population, indicating that the method performs better. It is defined as:

$$\mathrm{CHI} = \frac{\mathrm{tr}(B_q)}{\mathrm{tr}(W_q)} \times \frac{h - q}{q - 1}$$

where h is the number of training samples, q is the number of categories, $B_q$ is the between-category covariance matrix, $W_q$ is the within-category data covariance matrix, and tr() is the trace calculation function.
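A minimal NumPy sketch of this index follows, using the standard identity that tr(B_q) is the size-weighted sum of squared distances from the class centers to the overall center and tr(W_q) is the sum of squared distances from points to their own class center. The function name is illustrative; scikit-learn provides `sklearn.metrics.calinski_harabasz_score` for the same quantity.

```python
import numpy as np

def calinski_harabasz(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    h = X.shape[0]                      # number of samples
    classes = np.unique(labels)
    q = classes.size                    # number of categories
    overall = X.mean(axis=0)
    tr_b = 0.0                          # between-category dispersion
    tr_w = 0.0                          # within-category dispersion
    for c in classes:
        members = X[labels == c]
        center = members.mean(axis=0)
        tr_b += len(members) * np.sum((center - overall) ** 2)
        tr_w += np.sum((members - center) ** 2)
    return (tr_b / tr_w) * (h - q) / (q - 1)
```

The index is undefined for a single category (q = 1) or when every point coincides with its class center (tr(W_q) = 0); this sketch does not guard against those cases.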

Davies-Bouldin Index
For each class, the DBI takes the maximum, over all other classes, of the quotient of the sum of the two classes' average intra-class distances and the distance between their centers, and then averages these maxima over all classes [35]. The lower the DBI, the higher the similarity between cells of the same type in the cell population, indicating that the method performs better. It is defined as:

$$\mathrm{DBI} = \frac{1}{n}\sum_{i=1}^{n}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$$

where n is the number of categories, $c_i$ is the center of the ith category, $\sigma_i$ is the average distance from all points of the ith category to the center, $d(c_i, c_j)$ is the distance between the center points $c_i$ and $c_j$, and max() is the maximum function.
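The definition can be checked with a short NumPy sketch. The function name is illustrative, and Euclidean distance is assumed for both the intra-class spread and the distance between centers; scikit-learn offers `sklearn.metrics.davies_bouldin_score` for the same index.

```python
import numpy as np

def davies_bouldin(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n = len(classes)
    # Class centers and average distance of members to their center.
    centers = np.array([X[labels == c].mean(axis=0) for c in classes])
    spreads = np.array([
        np.linalg.norm(X[labels == c] - centers[k], axis=1).mean()
        for k, c in enumerate(classes)
    ])
    total = 0.0
    for i in range(n):
        ratios = [
            (spreads[i] + spreads[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(n) if j != i
        ]
        total += max(ratios)            # worst-case overlap for class i
    return total / n
```

Two tight, well-separated classes give a value close to zero, while overlapping classes push the index up.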

Moran's I
Moran's I is a global autocorrelation statistic for certain metrics on a graph. It is commonly used in spatial data analysis to evaluate autocorrelation on two-dimensional grids [36]. The higher the Moran's I, the stronger the spatial autocorrelation of the cell population, indicating that the method performs better. It is defined as:

$$I = \frac{N}{W} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^{2}}$$

where N is the number of spatial units indexed by i and j, x is the variable of interest, $\bar{x}$ is the mean of x, $w_{ij}$ are the elements of a matrix of spatial weights with zeros on the diagonal, and W is the sum of all $w_{ij}$.
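A small sketch of the statistic on a dense weight matrix is given below. The function name and the toy chain-shaped neighborhood used in the usage note are illustrative assumptions; in practice the spatial weights usually come from a k-nearest-neighbor graph of the cell coordinates.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I for values x (length N) and spatial weights w (N x N, zero diagonal)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                          # deviations from the mean
    cross = np.sum(w * np.outer(z, z))        # sum_ij w_ij z_i z_j
    return (x.size / w.sum()) * cross / np.sum(z ** 2)
```

For four cells on a line with neighboring cells connected, the clustered values [1, 1, 0, 0] give I = 1/3, whereas the alternating values [1, 0, 1, 0] give I = -1.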

Geary's C
Geary's C is a measure of spatial autocorrelation that attempts to determine if observations of the same variable are spatially autocorrelated globally (rather than at the neighborhood level) [37]. The lower the Geary's C, the stronger the spatial autocorrelation of the cell population, indicating that the method performs better. It is defined as:

$$C = \frac{(N - 1)\sum_{i}\sum_{j} w_{ij}\,(x_i - x_j)^{2}}{2\,S_0\sum_{i}(x_i - \bar{x})^{2}}$$

where N is the number of spatial units, x is the variable of interest with mean $\bar{x}$, $w_{ij}$ is the element in the ith row and jth column of the spatial weight matrix with zeros on the diagonal, and $S_0$ is the sum of all the weights.
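The statistic can be sketched in the same style as Moran's I. The function name and the chain-shaped toy neighborhood in the usage note are illustrative assumptions.

```python
import numpy as np

def gearys_c(x, w):
    """Geary's C for values x (length N) and spatial weights w (N x N, zero diagonal)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    diff_sq = (x[:, None] - x[None, :]) ** 2      # (x_i - x_j)^2 for all pairs
    numer = (x.size - 1) * np.sum(w * diff_sq)
    denom = 2.0 * w.sum() * np.sum((x - x.mean()) ** 2)
    return numer / denom
```

On four cells connected in a line, the clustered values [1, 1, 0, 0] give C = 0.5 (positive autocorrelation), while the alternating values [1, 0, 1, 0] give C = 1.5.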

Overview of EAGS
We collect the datasets of mouse brain and olfactory bulb as inputs to EAGS [23,24]. These data are acquired as follows: Stereo-seq [7] is used to capture the ST data of the mouse brain and mouse olfactory bulb in situ and record the position information of the sequencing spots, as in the data generation process described in the "Datasets" subsection, and then StereoCell [23] is used to generate ST data at single-cell resolution with spatial information. After obtaining the ST dataset at single-cell resolution, the entire gene expression profile is normalized and smoothed [39], as shown in Fig. 1A.
EAGS constructs two styles of patterns based on the input gene expression information and spatial information, respectively. These two patterns are used to identify similar cells within the pattern, as shown in Fig. 1B. Next, EAGS adaptively generates smoothing weights based on the difference between similar cells and their genes' expression, then utilizes these weights as a reference to complement the expression of similar cells.

EAGS performs better smoothing by adaptive weighting
We use the mouse brain dataset to evaluate EAGS with an adaptive weight. The results are compared with the outputs of EAGS with fixed weights. As the mouse brain dataset's adaptive weight value is 19,001, the fixed weights are set to 25,000 and 15,000. We use Spatial-ID to annotate cell types in order to assess the potential of EAGS to improve the cell annotation power and restore the true levels of gene expression [24]. Fig. 2 shows all the results of the subsequent analysis with the adaptive and the fixed weights. Compared with fixed weights, cell annotation based on EAGS with an adaptive weight generated a cell-type spatial map with clearer tissue outlines and more annotated cell subtypes (Fig. 2A).
Based on our cell annotation results, the CHI of the EAGS smoothing results with adaptive and fixed weights is calculated. Next, Geary's C and Moran's I of the common cell types in the annotation results are calculated (Fig. 2B and C). The results based on the adaptive-weight cell annotation show a significant improvement in spatial autocorrelation compared with the others. Also, within the same type of cell annotation, the level of intra-class autocorrelation is higher.

EAGS smooths gene expression with better performance on simulated ST dataset
We collected the Bin140 mouse olfactory bulb ST dataset [27] and selected the top 2000 highly variable genes as the reference input of scDesign3 to construct a simulated ST dataset with "ground truth counts" [40]. To simulate the "dropout" phenomenon during the sequencing process, we randomly drop the expression of the simulated ST dataset to varying degrees and add different proportions of noise. EAGS, MAGIC [14], kNN-smoothing [41], SPCS [22] and STAGATE [20] are used to impute the processed ST dataset; then the L2-error with respect to the "ground truth counts" matrix and the DBI are calculated. The results are shown in Table 1. In the simulated datasets with 30% and 50% dropout, the L2-error and DBI values obtained by EAGS are always the lowest, regardless of the proportion of noise. When the proportion of dropout is 70%, the DBI obtained by EAGS is suboptimal with 10% noise (only higher than that of SPCS), and the results obtained by EAGS are the best in the other cases. In general, EAGS performs better on the different simulated ST datasets and shows clear advantages over the other methods in improving intra-class similarity and consistency with the "ground truth counts".
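The corruption step can be illustrated with a small sketch. The exact dropout and noise models are not specified in the text, so this example assumes independent per-entry dropout at a given rate and additive Poisson(1) noise on a given fraction of entries; the function name `corrupt_counts` and its parameters are hypothetical.

```python
import numpy as np

def corrupt_counts(truth, dropout_rate, noise_rate, seed=0):
    """Return a corrupted copy of a ground-truth count matrix and the dropout mask."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth, dtype=float)
    noisy = truth.copy()
    # Assumed noise model: add Poisson(1) counts to a random fraction of entries.
    noise_mask = rng.random(truth.shape) < noise_rate
    noisy[noise_mask] += rng.poisson(1.0, size=int(noise_mask.sum()))
    # Assumed dropout model: set a random fraction of entries to zero.
    drop_mask = rng.random(truth.shape) < dropout_rate
    noisy[drop_mask] = 0.0
    return noisy, drop_mask
```

An imputation method can then be scored by comparing its output on `noisy` against `truth`, for example with the L2-error at dropout rates of 0.3, 0.5, and 0.7.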

EAGS smooths gene expressions for better characterizing the spatial expression patterns of mouse brain
We perform cell annotation on mouse brain data before and after EAGS smoothing using Spatial-ID [24]. The annotation results are shown in Fig. 3A. The mouse brain cell annotation based on data smoothed by EAGS returns a clearer tissue structure, and more cell types can be annotated.
To further assess the improvement provided by EAGS in cell annotation, we also perform cell annotation with Tangram [42], a technique for merging spatial data types with single-cell/single-nucleus RNA sequencing data and for cell type annotation. As shown in Fig. 3B, the CHI and DBI are calculated for the spatial autocorrelation of cell types with the gene expression profiles after Tangram and Spatial-ID cell annotation. These results show that EAGS smoothing provides significantly better results in cell-type annotations.
Fig. 3C shows the results of cell annotation using Spatial-ID and the spatial map of the Allen Mouse Brain Atlas for the corresponding cell types [25,26]. TEGLU24, TEGLU7 and MEINH2 are important cell types in the Hippocampus, Cortex and Midbrain dorsal, respectively, and DGGRC2 is the important cell type in the Midbrain ventral and Dentate gyrus. These cell types are more consistent with Allen's spatial expression map of cell types after EAGS smoothing. To verify the smoothing effect, Moran's I and Geary's C are calculated for cells with different cell number ratios using the raw or the EAGS-smoothed dataset (Fig. 3D). To determine whether the correlation between the above cell types and their marker genes improved after smoothing, the ratios of the number of annotated cell types to their corresponding non-zero marker gene expressions are computed. The ability of EAGS to restore true biological signals is shown in Fig. 3E. Our results show that EAGS contributes to enhancing the cellular features of the mouse brain as well as the spatial autocorrelation and intra-class similarity of the gene expression patterns.

EAGS improves spatial patterns and downstream analyses of gene expression data
EAGS is compared with the imputation methods MAGIC [14], STAGATE [20] and kNN-smoothing [41] on the ST mouse brain dataset (SPCS could not be executed successfully because the high sparsity of this dataset caused it to run out of memory; thus, no SPCS results were obtained). The cell-type spatial maps of the different imputation methods, using Spatial-ID as reference, are shown in Fig. 4A (left). EAGS returns more cell types and more prominent outlines than the other methods. The results of MAGIC are very unbalanced in terms of the number of cell types, with a large number of cell annotations that did not match the true values [25,26]. Fig. 4B shows the CHI derived from data processed using each of the methods. After cell annotation, the CHI [34] calculated from the cell annotation labels is higher for EAGS than for the other three methods. EAGS obtains higher Moran's I and Geary's C than the other methods (Fig. 4C). Additionally, the spatial maps of a few marker genes based on their expression are generated (Fig. 4D). The gene expression profiles smoothed by EAGS agree with Allen's ISH images better than those of the other methods.

Discussions
EAGS defines patterns based on expression and spatial information. Specifically, it selects similar elements from the intersection between cells of two patterns, ensuring that information is borrowed from a reliable source between smoothed cells and similar cells. The main source of smoothing information for EAGS is the smoothing weights adaptively generated based on gene expression profiles. EAGS considers the overall expression level to generate weights, avoids the appearance of a single edge value, and effectively ensures the reliability of information borrowed between cells. This allows EAGS to recover authentic cellular signals with improved intra-class similarity and spatial autocorrelation. For example, the expression of the Cartpt gene in Fig. 4D is scattered in the original data heatmap, with more noise appearing, and the matching degree with Allen's ISH image is low.
EAGS smoothing considers the reliability of adjacent information. After EAGS smoothing, a lot of noise is eliminated and Cartpt is expressed in more of the correct cells, which agrees better with Allen's ISH image, and the aggregation of Cartpt expression is significantly improved. It should be noted that the EAGS model is based on the premise that "neighboring" cells in the spatial microenvironment of biological tissues are more similar, which is applicable to most developmental tissue systems. However, for complex microenvironments with high biological heterogeneity (such as the tumor microenvironment), this assumption will be challenged, and EAGS may produce many false positive signals. When it is necessary to apply EAGS to complex tumor microenvironment samples, the sample may need to be partitioned according to the situation when calculating the adaptive Gaussian smoothing weight, with a Gaussian weight calculated separately for each area.

Conclusions
We propose EAGS, a method for smoothing high-resolved ST datasets that performs two-factor smoothing and adaptive weighting on raw gene expression profiles. EAGS significantly improves computing efficiency, reduces "dropout" in ST data, recovers the expression of true biological signals, and restores the spatial patterns of tissues. In the future, we will explore the false positive signals produced by EAGS imputation strategies, as well as downstream analyses of datasets after imputation.
However, cell localization and identification are limited by technical factors, such as the chip capture area, the sequencing depth, and the resolution. Spatially enhanced resolution transcriptome sequencing (Stereo-seq) [7] is a new ST technology based on DNA nanoballs. Stereo-seq provides the highest resolution (500 nm) among all currently available ST technologies. Such a breakthrough in resolution allows researchers to perform genome-wide analyses of gene expression at the capture site (spot) with single-cell or even sub-cellular resolution. Wang et al. [8] applied Stereo-seq to the 3D reconstruction of the ST of Drosophila embryos and larvae, providing a spatially and temporally resolved transcriptomic map of the whole organism across the developmental stages for Drosophila research. Liu et al. [9] reconstructed the developmental trajectory of zebrafish embryos by analyzing Stereo-seq and scRNA-seq datasets from different time points.
The complete gene expression profile is separated into many different subspaces by a ball tree. Then, the Euclidean distances between cells are calculated separately within each subspace. Assuming the pre-normalized gene expression profile still contains m cells, the unsupervised nearest neighbor toolkit of scikit-learn is used to extract the n-dimensional principal component data and create the low-dimensional information matrix of the gene expression profile, as shown in Algorithm 1 [32]. Then, the neighboring cell matrix is constructed based on the K-Nearest Neighbors network as in Algorithm 2, forming the Expression Neighbor Matrix (ENM) of size (m, m). Because the spatial distribution of an ST dataset is a two-dimensional plane, the Euclidean distance can serve as a useful measure of spatial location between cells in a low-dimensional environment. Therefore, the Spatial Distance Matrix is obtained by computing the Euclidean distance. Furthermore, since ST chips of the Stereo-seq platform vary in size, EAGS fine-tunes the weight value for different chip sizes while calculating Euclidean distances.

Adaptive weight calculation
Cells that can be used as smoothing factors for Cell_i must satisfy both the gene expression pattern and the spatial neighbor pattern belonging to Cell_i. A cell acting as a smoothing factor is more similar in gene expression to the smoothed cell than to other cells in the overall expression profile. EAGS defines the nearest neighbor contribution matrix (NCM) as a parameter for the smoothing weights, and the nonzero NCM is a G-dimensional row vector, where the condition G ≤ M × M is satisfied. The p-th percentile of the nonzero NCM along the specified axis is calculated by the following method:
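As a rough illustration of the neighbor construction described above, the following Python sketch reduces the expression profile with PCA, finds expression neighbors with a ball-tree K-nearest-neighbor search in scikit-learn, and computes pairwise Euclidean distances between spatial coordinates. The function names, the values of `n_pcs` and `k`, the boolean encoding of the ENM, and the use of NumPy's default percentile interpolation are assumptions made for this example and are not taken from the EAGS implementation; the chip-size fine-tuning is omitted.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors


def build_neighbor_patterns(expr, coords, n_pcs=50, k=15):
    """Hypothetical helper, not the EAGS code.

    expr:   (m, genes) normalized expression matrix
    coords: (m, 2) spatial coordinates of the same m cells
    Returns a boolean (m, m) expression neighbor matrix and an
    (m, m) matrix of Euclidean spatial distances.
    """
    m = expr.shape[0]
    # Low-dimensional representation of the expression profile.
    pcs = PCA(n_components=n_pcs, random_state=0).fit_transform(expr)
    # Ball-tree KNN search; calling kneighbors() without arguments
    # excludes each cell from its own neighbor list.
    nn = NearestNeighbors(n_neighbors=k, algorithm="ball_tree").fit(pcs)
    _, idx = nn.kneighbors()
    enm = np.zeros((m, m), dtype=bool)
    enm[np.repeat(np.arange(m), k), idx.ravel()] = True
    # Pairwise Euclidean distances on the two-dimensional chip plane.
    sdm = cdist(coords, coords)
    return enm, sdm


def nonzero_percentile(mat, p):
    """p-th percentile of the strictly positive entries of mat
    (assumed reading of 'percentile of the nonzero NCM')."""
    values = mat[mat > 0]
    return np.percentile(values, p)
```

The dense (m, m) matrices keep the sketch short; a dataset with many cells would need sparse storage instead.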

represents the original gene expression of the smoothed cell. The whole process can be represented by Algorithm 3.

Algorithm 3. Calculate weights and perform smoothing
Input: Expression Neighbor Matrix ENM, Spatial Distance Matrix
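The full listing of Algorithm 3 is not available in this text, so the following Python sketch only illustrates one plausible reading of the weighting step. It assumes that, for each cell, the Gaussian bandwidth is chosen so that a neighbor at the p-th percentile of that cell's nonzero neighbor distances receives the weight gs, that only expression neighbors contribute, and that the cell's own expression enters with weight 1 before row normalization. These choices, the function names, and the parameters `p`, `a`, and `b` are assumptions and are not confirmed by the source.

```python
import numpy as np


def gaussian_bandwidth(ref_dist, gs=0.95, a=1.0, b=0.0):
    """Bandwidth c such that a * exp(-(ref_dist - b)**2 / (2 * c**2)) == gs.
    Requires 0 < gs < a so that log(gs / a) is negative."""
    return np.sqrt(-((ref_dist - b) ** 2) / (2.0 * np.log(gs / a)))


def adaptive_gaussian_smooth(expr, neighbor_mask, sdm, p=50, gs=0.95):
    """Hypothetical sketch of adaptive Gaussian smoothing.

    expr:          (m, genes) expression matrix
    neighbor_mask: (m, m) boolean expression neighbor matrix
    sdm:           (m, m) spatial distance matrix
    """
    m = expr.shape[0]
    weights = np.zeros((m, m))
    for i in range(m):
        nbrs = np.flatnonzero(neighbor_mask[i])
        d = sdm[i, nbrs]
        d_pos = d[d > 0]
        if d_pos.size == 0:
            continue  # no usable neighbors: the cell keeps its own profile
        # Per-cell bandwidth derived from the percentile distance.
        c = gaussian_bandwidth(np.percentile(d_pos, p), gs=gs)
        weights[i, nbrs] = np.exp(-(d ** 2) / (2.0 * c ** 2))
    # The original expression of the smoothed cell enters with weight 1.
    np.fill_diagonal(weights, 1.0)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ expr
```

Because each row of the weight matrix is normalized to sum to 1, a cell whose neighbors all share its expression profile is left unchanged.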

Figure 1: (A) Data generation process for the input of EAGS. (B) The EAGS method calculates the nearest neighbor information based on the gene expression pattern and spatial information. Then, EAGS adaptively generates smoothing weights and outputs the smoothed results.

Figure 2: Results of EAGS with adaptive and fixed weights. (A) Spatial cell type maps obtained by annotating the smoothing results of different weights with Spatial-ID. (B) Calinski-Harabasz Index calculated from the cell labels obtained by annotating the smoothing results of different weights with Spatial-ID. (C) Geary's C and Moran's I calculated from the Spatial-ID annotation results for the different weights.

Figure 3: Comparisons between the analysis results obtained from data before and after EAGS smoothing. (A) Spatial cell type maps of the mouse brain using Spatial-ID cell annotation of raw and EAGS-smoothed data. (B) Davies-Bouldin and Calinski-Harabasz Indexes calculated using Spatial-ID and Tangram annotation results obtained from raw and EAGS-smoothed data. (C) Comparison of the spatial maps obtained from the raw and EAGS-smoothed datasets with the Allen Mouse Brain Atlas. (D) Comparison of Moran's I and Geary's C of the annotated cell types obtained from the raw and EAGS-smoothed datasets. (E) Heatmap of the non-zero ratio between cell types and their marker genes obtained from the raw and EAGS-smoothed datasets.
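For reference, the indices named in these captions can be computed as in the short Python sketch below. The spatial weight matrix `w` (for example, a binary spatial neighbor matrix) and the function names are assumptions for illustration. The source does not state how spatial weights were built or how categorical annotations were encoded; one option, assumed here, is to evaluate a 0/1 indicator vector for each cell type.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score


def morans_i(x, w):
    """Moran's I of values x (n,) under spatial weights w (n, n)."""
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)


def gearys_c(x, w):
    """Geary's C of values x (n,) under spatial weights w (n, n)."""
    z = x - x.mean()
    diff2 = (x[:, None] - x[None, :]) ** 2
    return ((len(x) - 1) / (2.0 * w.sum())) * (w * diff2).sum() / (z @ z)


def cluster_scores(features, labels):
    """Calinski-Harabasz (higher is better) and Davies-Bouldin
    (lower is better) indices for a labeled feature matrix."""
    return (
        calinski_harabasz_score(features, labels),
        davies_bouldin_score(features, labels),
    )
```

Positive spatial autocorrelation corresponds to Moran's I above 0 and Geary's C below 1, so spatially coherent annotations move the two indices in opposite directions.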
The annotations of the Midbrain dorsal, the Midbrain ventral, and the Dentate gyrus are mixed using MAGIC. The results of STAGATE show fewer cell types, and STAGATE does not produce well-organized cell type distributions in the Hippocampus and Cortex. After kNN-smoothing, the cell type boundaries of the cell annotation are blurred, and different types of cells are mixed. To avoid the impact of data sparsity on the interpretability of the results, the input to the cell annotation is the 50-dimensional principal components of the different imputation results; the Uniform Manifold Approximation and Projection (UMAP) of the annotated results is shown in Fig. 4A (middle). The cell type spatial maps, consisting of cell types that are highly represented and annotated by the three methods, are shown in Fig. 4A (right). Fig.

Figure 4: Comparison of different imputation methods. (A) Left: Spatial maps of cell types using Spatial-ID cell annotations and three different imputation methods. Middle: UMAP dimensionality reduction using Spatial-ID cell annotation and different imputation methods. Right: Individual cell type spatial maps after cell annotation and different imputation methods. (B) Calinski-Harabasz Index calculated using cell labels after Spatial-ID cell

Figure 5: EAGS application to mouse olfactory bulb data. (A) Cell-annotated spatial map of data before and after EAGS smoothing. (B) Cell-annotated UMAP of the dataset before and after EAGS smoothing. (C) Davies-Bouldin and Calinski-Harabasz Indexes of mouse olfactory bulb data. (D) Differences in the annotation results of the main cell types of the mouse olfactory bulb between data without and with EAGS smoothing. We also show the heatmap of the marker genes of different cell types, and the Moran's I and Geary's C indexes of the corresponding types. Cells annotated both before and after smoothing (grey), cells annotated only after EAGS smoothing (purple), and cells annotated only in the pre-treatment data (orange) are displayed on the left; the expression heatmap of marker genes corresponding to different cell types is shown in the middle; the Moran's I and Geary's C indices are shown on the right.

Furthermore, EAGS improves the quality of raw data, as it recovers the original biological signals by smoothing cell expression information. Because it does not depend on a specific statistical model, EAGS smooths the expression profile directly rather than adjusting it from a low-dimensional space, thus preserving the hidden correlation between cells. More importantly, EAGS does not require pre-defined expression models, numerous iterations to obtain the model parameters, or multiple training sessions of a deep learning framework on a GPU platform. Consequently, EAGS significantly reduces computational costs and offers a clear execution advantage over other methods. Finally, because of the general applicability of smoothing, EAGS is suitable for different ST data.

For the 1 cm ST chip of the Stereo-seq platform, gs is set to 0.95. The smoothing weight varies around gs and characterizes the overall contribution level of cells in both the gene expression pattern and the spatial neighbor pattern of Cell_i.

Table 1: Results on simulated ST dataset with different proportions of dropout and noise