vissE.cloud: a webserver to visualise higher order molecular phenotypes from enrichment analysis

Abstract
Gene-set analysis (GSA) dominates the functional interpretation of omics data and downstream hypothesis generation. Despite its ability to summarise thousands of measurements into semantically interpretable components, GSA often results in hundreds of significantly enriched gene-sets. However, summarisation and effective visualisation of GSA results to facilitate hypothesis generation are still lacking. While some webservers provide gene-set visualisation tools, there is still a need for tools that can effectively summarise and guide exploration of GSA results. To enable versatility, webservers accept gene lists as input; however, none provide end-to-end solutions for emerging data types such as single-cell and spatial omics. Here, we present vissE.Cloud, a webserver for end-to-end gene-set analysis, offering gene-set summarisation and highly interactive visualisation. vissE.Cloud uses algorithms from our earlier R package vissE to summarise GSA results by identifying biological themes. We maintain versatility by allowing analysis of gene lists as well as analysis of raw single-cell and spatial omics data, including CosMx and Xenium data, making vissE.Cloud the first webserver to provide end-to-end gene-set analysis of sub-cellular localised spatial data. Structuring the results hierarchically allows swift interactive investigation of results at the gene, gene-set and cluster levels. vissE.Cloud is freely available at https://www.vissE.Cloud.


Nucleic Acids Research, 2023, Vol. 51, Web Server issue

INTRODUCTION
High-throughput molecular technologies, such as RNA-seq, single-cell RNA-seq, spatial transcriptomics and proteomics, have unlocked new avenues of modelling and understanding the complexity of biological systems. However, this empowerment is dependent on appropriate analysis and interpretation of large high-dimensional data. Many statistical and computational approaches have been developed to address the challenge of prioritising genes/proteins based on some statistic (1-3); however, these lists are not easily interpretable biologically. Gene-set analysis (GSA) approaches have been developed to address the problem of biological interpretation (4). These approaches use existing functional knowledgebases such as Gene Ontology (5,6), Kyoto Encyclopedia of Genes and Genomes (KEGG) (7) and Reactome pathways (8), which are commonly represented as sets of genes, to infer biological functions by assessing the enrichment of prioritised genes.
Though GSA can summarise thousands of measurements into semantically interpretable components, it still presents experimental researchers with two major challenges. First, GSA often results in hundreds of significantly enriched gene-sets, primarily due to redundancy within and between knowledge bases (9-11). Since not all of these hypotheses can be tested, experimental researchers are faced with the decision to prioritise a subset to follow up; however, tools that facilitate navigating through different hypotheses are still lacking. Second, the application of GSA to new technologies such as single-cell and spatial omics requires specialized analytical approaches that can account for the unique features of these data (12-15). Deployment of such rapidly evolving approaches in a scalable manner to web-based applications remains challenging because of the computational and software engineering requirements.
Existing popular GSA web servers have attempted to partially address the issues of result summarisation and broad applicability (Table 1) (16-25). For example, WebGestalt (18), g:Profiler (22) and Enrichr (19) provide gene-set visualisation tools; however, tools to summarise and explore the results in a guided manner are still lacking. Additionally, web servers provide a limited set of methods, often only over-representation analysis (ORA) and gene-set enrichment analysis (GSEA) (26). To enable versatility of analyses, they require gene lists or ranked gene lists as input. However, they do not provide end-to-end solutions for emerging data types such as single-cell and spatial omics.
We have previously addressed the issue of summarising GSA results in our R/Bioconductor package, vissE (27), where we identify gene-set clusters with common biological themes. To further empower biologists/scientists with limited coding experience and to enhance result interpretation with interactive visualisations, we present vissE.Cloud. vissE.Cloud provides end-to-end gene-set analysis and offers within-browser gene-set summarisation and highly interactive visualisations. Built upon a job queuing architecture, an R-based analysis core and a Single-Page App frontend, vissE.Cloud offers a robust and easily scalable solution for running computationally intensive workflows while providing streamlined deployment of rapidly evolving methodologies for newer omics technologies. Unlike many existing webservers, vissE.Cloud supports both ORA and GSEA methods. We maintain versatility by allowing analysis with lists of genes but, in addition, we also support the analysis of single-cell and spatial omics data from the raw data, including pre-processing, factor analysis and factor interpretation. Our easily extensible design allows the workflow to be deployed to the latest sub-cellular spatial molecular technologies such as CosMx (28) and Xenium (29), making vissE.Cloud the first webserver to enable end-to-end gene-set analysis of sub-cellular localised data. Hierarchical structuring of GSA results, coupled with highly interactive visualisations, allows biologists to conduct swift interactive investigations of results at multiple levels/scales, including the gene, gene-set and cluster levels, while allowing seamless linkage across all three levels. To this end, this framework enables a holistic interpretation of biological systems that is intuitive, easily accessible and interactive for any biologist/scientist to use. vissE.Cloud is freely available at https://www.vissE.Cloud.

Overview of vissE.cloud workflow
The vissE.Cloud workflow consists of three main steps (Figure 1): (i) input data processing, (ii) identification of enriched gene-sets and (iii) identification of biological themes/clusters. We describe each step briefly below and refer readers to the full methodology provided on the help pages of the website.

Input data processing
To maintain versatility, vissE.Cloud accepts a wide range of inputs, allowing the integration of the workflow with differential analysis of bulk transcriptomics, proteomics, single-cell and spatial transcriptomics data.
For bulk transcriptomics, users can choose from two input options: (i) a list of genes of interest, such as those found to be significant in a differential analysis and (ii) genes with their associated statistics, commonly log-fold change or P-value. vissE.Cloud supports seven different gene ID types, including UniProt, that are then mapped to their corresponding gene-sets. To facilitate proteomics analysis, vissE.Cloud can handle protein groups commonly produced by proteomics search tools such as MaxQuant (30) and DIA-NN (31).
For single-cell and spatial transcriptomics, vissE.Cloud accepts raw files as input and provides an end-to-end factor analysis and interpretation workflow. Pre-processing of raw data follows the Orchestrating Single-Cell Analysis (OSCA) workflow (12), where poor quality cells and lowly variable genes are removed, and data are normalised for compositional biases (32), followed by feature extraction of highly variable genes using the scran R package (33). For the panel-based sub-cellular spatial molecular datasets from the CosMx and Xenium technologies, pre-processing follows the pipeline described in (34). Cell-level quality control is performed using spike-in probes that are present in the standard panel. Finally, factor analysis is conducted on log-transformed pre-processed data using principal components analysis (PCA) or non-negative matrix factorisation (NMF) with methods implemented in the scater R package (35). Users are able to fully control the pre-processing parameters from the vissE.Cloud interface.

Identification of enriched gene-sets
vissE.Cloud compiles gene-sets from the Molecular Signatures Database (MSigDB v7.5) (26,36), which includes 31 508 gene-sets split into nine categories and 23 subcategories. This comprehensive compendium of biological knowledge organised into gene-sets is suitable for a wide range of applications, including functional enrichment, regulome analysis and cell type annotation. Users can select a subset of (sub-)collections depending on the biological hypothesis of interest.
Table 1. Comparison of functionality in gene-set analysis web servers. Filled boxes indicate that the web server caters for the specific function, while blank boxes indicate lack thereof. Coloured bars label different aspects of the analysis, such as support of different input types, available gene-set databases, and representation and visualisation of gene-set analysis results. [Table body not recovered. Column labels: ORA, GSEA, SC, Spatial, Proteomics, Pathways, GO, Empirical, Summarization, Gene-level; column groups: Input, Gene-sets, Output.]
The methodology for identifying enriched gene-sets depends on the user input data. ORA implemented in clusterProfiler (37) is used to identify enriched gene-sets from a list of genes. Alternatively, where a gene-associated statistic is provided, genes are ranked and GSEA is performed using the fgsea R package (38). In both cases, users are able to set a P-value threshold or filter gene-sets by their size. In factor analysis, gene loadings for each factor are used as gene weights that are subsequently used to score gene-sets using the singscore R package (39,40).
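As a rough illustration of the over-representation test that underlies ORA, the sketch below computes a one-sided hypergeometric P-value in plain Python. It is not the clusterProfiler implementation; the function name `ora_pvalue`, its arguments and the example numbers are assumptions made purely for illustration.

```python
from math import comb


def ora_pvalue(universe_size, set_size, list_size, overlap):
    """One-sided hypergeometric P-value for over-representation.

    universe_size: number of genes in the background (assumed known)
    set_size:      number of background genes belonging to the gene-set
    list_size:     number of genes in the user's gene list
    overlap:       number of listed genes that belong to the gene-set
    Returns P(X >= overlap) when list_size genes are drawn without replacement.
    """
    total = comb(universe_size, list_size)
    upper = min(set_size, list_size)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 15 of 500 listed genes fall in a 200-gene set,
# against a background of 20 000 genes.
p_value = ora_pvalue(20000, 200, 500, 15)
```

In practice such P-values would be computed for every gene-set and adjusted for multiple testing before applying the user-selected threshold.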

Identification of biological themes / clusters
The core analysis of biological theme identification is performed using algorithms developed in the vissE R/Bioconductor package (27). Given the results of a GSA, vissE first generates a gene-set network by computing gene-set similarity using the Adjusted Rand Index (ARI) or other user-defined similarity measures. Pairwise similarity reflects the number of genes that a pair of gene-sets have in common. Gene-set clusters, also referred to as 'biological themes', are then identified using a random-walks based graph clustering algorithm (41). These gene-set clusters are ranked based on a combination of the gene-set cluster size and the average of the gene-set statistic for the gene-sets in each cluster. Specifically, a product of ranks statistic (42) is computed using these two metrics such that gene-set clusters having many gene-sets and highly significant gene-sets are prioritised (27).
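The following minimal Python sketch illustrates two of the ideas described above: an overlap-based similarity between a pair of gene-sets, and prioritisation of clusters by a product of ranks. It is not the vissE code. The Jaccard index stands in for a user-defined similarity measure, the per-gene-set statistic is assumed to be oriented so that larger values mean stronger significance (e.g. -log10 FDR), and all function names and example values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index between two gene-sets given as Python sets of gene IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_desc(values):
    """Rank values so that the largest gets rank 1 (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def prioritise_clusters(clusters):
    """Order clusters by the product of their size rank and mean-statistic rank.

    clusters: dict mapping a cluster name to the list of statistics of its
    member gene-sets (assumed: larger statistic = more significant).
    """
    names = list(clusters)
    sizes = [len(clusters[n]) for n in names]
    means = [sum(clusters[n]) / len(clusters[n]) for n in names]
    size_rank = rank_desc(sizes)
    stat_rank = rank_desc(means)
    product = {n: size_rank[i] * stat_rank[i] for i, n in enumerate(names)}
    # A small product means the cluster is both large and highly significant.
    return sorted(names, key=lambda n: product[n])


similarity = jaccard({"CD3E", "CD4", "CD8A"}, {"CD4", "CD8A", "FOXP3"})
ordering = prioritise_clusters({
    "metabolism": [1.0, 1.2],
    "immune": [3.0, 3.2, 2.9, 3.1],
    "cell_cycle": [2.0, 2.1, 1.9],
})
```

Multiplying ranks rather than raw values keeps the two metrics on a comparable scale, so neither cluster size nor significance dominates the ordering.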
A semantic meaning is generated for each cluster using natural language processing by performing term frequency analysis on the gene-set names, treating each gene-set name as a document. The term frequency-inverse document frequency (TF-IDF) is computed for all words within the gene-set cluster. These results are then presented as word clouds with the TF-IDF score determining the size of each word. To link gene-set clusters to their member genes, gene-level statistics are projected on a protein-protein interaction network (43) and are used to generate gene-statistic scatter plots. Details of these methods are described in Bhuva et al. (27).
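To make the word-cloud scoring concrete, the sketch below computes a simple TF-IDF score for the words in one cluster, treating each gene-set name as a document as described above. The exact weighting and text pre-processing used by vissE may differ; the tokenisation rule, the function names and the example gene-set names are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenise(name):
    """Split a gene-set name such as 'GOBP_T_CELL_ACTIVATION' into lower-case words."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", name.lower()) if w]


def cluster_tfidf(cluster_names, all_names):
    """Score each word in a cluster by term frequency within the cluster
    multiplied by inverse document frequency over all gene-set names."""
    docs = [set(tokenise(n)) for n in all_names]
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in d)
    term_freq = Counter(w for n in cluster_names for w in tokenise(n))
    return {
        w: tf * math.log(n_docs / doc_freq[w])
        for w, tf in term_freq.items()
        if doc_freq.get(w)  # skip words never seen in the reference names
    }


all_names = [
    "GOBP_T_CELL_ACTIVATION",
    "GOBP_T_CELL_PROLIFERATION",
    "GOBP_DNA_REPAIR",
    "HALLMARK_DNA_REPAIR",
]
scores = cluster_tfidf(all_names[:2], all_names)
```

Words that occur in most gene-set names, such as a collection prefix, receive a low inverse document frequency and would therefore be drawn small in the word cloud.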

Web server design and architecture
The overall architecture for vissE.Cloud is shown in Figure 2. The client-side is implemented with highly responsive ReactJS. When users submit analysis jobs, each job is assigned a human-readable job ID that can be easily shared with collaborators or bookmarked. A Python Flask backend passes all job parameters to a Redis-based job queue, from which they are dispatched to 'worker' processes that perform the analysis in isolated R environments. Finished job results are then formatted as JSON and passed back to the client-side for rendering and visualisation. To achieve separation of concerns and smooth deployment, each of the server components is containerised and the full setup is deployed using docker-compose container orchestration. This modularised structure, where modules can be deployed on several compute instances, ensures future scalability for computationally intensive analyses of single-cell and spatial transcriptomics datasets. Currently, the end-to-end analysis of such data takes between five and thirty minutes, providing a quick turnaround time.
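The submit, queue, work and return pattern described above can be sketched with the Python standard library alone. This is a conceptual stand-in, not the vissE.Cloud code: an in-process `queue.Queue` replaces the Redis-based queue, a thread replaces the isolated R worker, and the job ID scheme, word lists and `run_analysis` placeholder are all hypothetical.

```python
import json
import queue
import random
import threading

ADJECTIVES = ["brave", "calm", "eager"]   # hypothetical word lists for job IDs
NOUNS = ["otter", "falcon", "maple"]

jobs = queue.Queue()   # stand-in for the Redis-based job queue
results = {}           # stand-in for the store the backend reads results from


def submit(params):
    """Assign a human-readable job ID and enqueue the job parameters."""
    job_id = f"{random.choice(ADJECTIVES)}-{random.choice(NOUNS)}-{random.randint(100, 999)}"
    jobs.put((job_id, params))
    return job_id


def run_analysis(params):
    """Placeholder for the R-based analysis core; it only counts input genes."""
    return {"n_genes": len(params["genes"])}


def worker():
    """Process jobs until a None sentinel arrives, storing results as JSON."""
    while True:
        item = jobs.get()
        if item is None:
            break
        job_id, params = item
        results[job_id] = json.dumps(run_analysis(params))


thread = threading.Thread(target=worker)
thread.start()
job_id = submit({"genes": ["TP53", "BRCA1", "MYC"]})
jobs.put(None)   # ask the worker to stop once earlier jobs are done
thread.join()
print(job_id, results[job_id])
```

Decoupling submission from execution in this way lets the web-facing process stay responsive while long-running analyses proceed elsewhere, and workers can be added without changing the submission code.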

Interactive results visualisation
vissE.Cloud presents analysis results using three main views that the user can access from the side panel (Figure 3A-C): (i) a GSA overview and summary statistics panel, which includes detailed information on the number of mapped genes and tested gene-sets (Figure 3A), (ii) a global gene-set network view of the graph along with the identified clusters, associated word cloud and gene/protein statistic plot (Figure 3B) and (iii) a cluster gallery view, where identified clusters are represented semantically using word clouds (Figure 3C). An additional detailed cluster view (Figure 3D) is presented where users can select specific themes/clusters to explore depending on the hypotheses of interest. Results are hierarchically structured to allow users to traverse the gene, gene-set and cluster levels seamlessly. We describe each of these views below.

GSA overview and summary statistics
In this view, important GSA summary statistics associated with genes, gene-sets and clusters are displayed to allow users to validate the GSA step and identify any issues that may have occurred in it. For example, vissE.Cloud shows the number of genes that were mapped to gene-sets in the database used, and that subsequently contributed to the GSA results. This information can reveal mismatching gene identifiers, serving as a useful quality control step. At the gene-set level, the distributions of gene-set sizes and categories can uncover potential analysis bias. Additionally, these results can inform the feasibility of theme identification: if the number of significant gene-sets is very small (in the 10s), further result summarisation may not be necessary and may not reveal much more than a classic GSA. Collectively, these summary statistics can guide users to understand the overall known functional information content of their data and the status of the GSA performed.

Global gene-set network view
Here, users can investigate GSA results holistically by visualising the relationships between all significant gene-sets as a gene-set network, potentially uncovering consistent yet previously unknown patterns across the experiment. Different gene-set statistics and annotations such as (sub-)category, false discovery rate (FDR), gene-set size, enrichment score and node degree (a statistic representing the connectivity of a gene-set) can be used to annotate the colour and size of nodes interactively. Users can change the colour scheme, choosing from 47 different palettes for visualisation. To enhance the user experience, site-wide preferred palettes can be specified for categorical, sequential and diverging data types. Furthermore, this view offers agile network-based navigation between clusters. Hovering over a gene-set (node) in the network highlights its gene-set cluster, shows the corresponding word cloud, and displays gene-level statistics and gene-gene interactions inferred from known protein-protein interaction networks (43).

Cluster gallery view
Since gene-set clusters are composed of numerous gene-sets, vissE.Cloud describes them semantically as word clouds to facilitate user exploration. The interactive 'WordCloud Gallery' panel renders word clouds progressively as a continuous feed, allowing optimised visualisation of even hundreds of clusters. The biological terms presented by the word clouds augment the user's domain knowledge of the biological systems of interest, relating the identified terms to the biology. As such, the relatively loose semantic definitions afforded by word clouds can be combined with expert domain knowledge to arrive at a more complete semantic interpretation. The 'WordCloud Gallery' panel provides users with an entry point into the more detailed cluster exploration views provided by vissE.Cloud.

Cluster details view
This view allows thorough investigation of the composition and semantics of a gene-set cluster. Focusing on a selected cluster, vissE.Cloud displays four panels encompassing information across the three hierarchical levels. These panels are: (i) cluster-level word cloud description, (ii) gene-set level similarity network, (iii) gene-level statistics scatter plot and (iv) protein-protein interaction (PPI) network for the genes in the cluster. Collectively, presenting this multi-level hierarchical information as a single inter-linked view greatly enhances the interpretation of results in the context of the biological system being studied.

Factors summary view (single-cell and spatial transcriptomics)
The results of the factor analysis of single-cell and spatial transcriptomics data performed using PCA are displayed in the 'Factors Summary' view. This view presents the top factors as word cloud panels, showing four gene-set clusters for each factor. This word cloud panelled view is coupled with the corresponding dimension reduction plots visualising the selected factor of interest. In the case of single-cell transcriptomics data, uniform manifold approximation and projection (UMAP) (44), t-distributed stochastic neighbour embedding (t-SNE) (45) and PCA projections can be visualised. In each of these factor visualisations, data points (cells/loci) can be coloured with user-defined palettes by one of the following data annotations: (i) the top 5 principal components (PCs), (ii) the first two dimensions of the UMAP, (iii) the first two dimensions of the t-SNE or (iv) quality control statistics such as the library size, percent mitochondrial transcripts and the total number of genes detected. For Visium data, an additional 'tissue' dimension allows visualisation of the data in the context of the tissue location of each spot in the spatial data. If histology images (e.g. hematoxylin and eosin (H&E) stained images) are available for the Visium data, they can be uploaded using this view and used as underlays for the tissue plot. Upon selecting a factor, users can access all of the views mentioned above to explore and characterise the biological functions represented by the factor.

DISCUSSION
To generate biological hypotheses from 'omics data, should researchers focus on a handful of important genes or on overall trends and hallmarks? This dilemma of determining the appropriate scale and scope of bioinformatics analysis has been a challenging issue preventing researchers from fully unleashing the potential of high-dimensional and high-resolution molecular data. On the one hand, gene function is dependent on molecular contexts and interactions that are often omitted if hypotheses are developed solely by focusing on a few important genes. On the other hand, it is also important to select a distinct subset of genes that represent observed trends to generate more specific testable hypotheses (e.g. using perturbation models).
The solution presented in vissE.Cloud enables researchers to toggle between very broad trends represented in biological themes or gene-set clusters, and narrower contexts at the gene level. Empowered by highly interactive visualisations, researchers can generate hypotheses using both top-down and bottom-up approaches. In the top-down approach, users start with a biological theme of interest and investigate changes in its member gene-sets and genes. Alternatively, in the bottom-up approach, users can start with genes of interest and explore how their expression contributes to observed overall trends.
Having an analysis core based on the R framework ensures that vissE.Cloud remains versatile and abreast of newly developed methods. Supporting different input types including proteomics, single-cell and spatial data, vissE.Cloud offers no-code access to methods that would otherwise require high technical expertise. The software engineering challenges of performing R-based long-running analyses while maintaining interactivity were addressed by coupling the backend server with a job queueing service and resolving the architectural complexity with container orchestration. This modular architectural design not only enables rolling out new methods and algorithms for emerging data types but can also be adapted to deploy a wide range of R-based bioinformatics pipelines beyond the scope of vissE.Cloud.
This no-code access and intuitive interactive interface via the cloud will enable widespread uptake of vissE.Cloud by researchers, even those with limited bioinformatics experience. The ability to quickly deploy new methods and algorithms will enable high quality research with quality GSA results and visualisations.

DATA AVAILABILITY
The analysis core for the web server, including example data, is available as an R package on GitHub at https://github.com/ahmohamed/vissEServerRpkg and on Zenodo at https://doi.org/10.5281/zenodo.7841244. The docker-compose setup for the full architecture is available upon request.