SubcellulaRVis: a web-based tool to simplify and visualise subcellular compartment enrichment

ABSTRACT
Cells contain intracellular compartments, including membrane-bound organelles and the nucleus, and are surrounded by a plasma membrane. Proteins are localised to one or more of these cellular compartments; the correct localisation of proteins is crucial for their correct processing and function. Moreover, proteins and the cellular processes they partake in are regulated by relocalisation in response to various cellular stimuli. High-throughput ‘omics experiments result in a list of proteins or genes of interest; one way in which their functional role can be understood is through the knowledge of their subcellular localisation, as deduced through statistical enrichment for Gene Ontology Cellular Component (GOCC) annotations or similar. We have designed a bioinformatics tool, named SubcellulaRVis, that compellingly visualises the results of GOCC enrichment for quick interpretation of the localisation of a group of proteins (rather than single proteins). We demonstrate that SubcellulaRVis precisely describes the subcellular localisation of gene lists whose locations have been previously ascertained. SubcellulaRVis can be accessed via the web (http://phenome.manchester.ac.uk/subcellular/) or as a stand-alone app (https://github.com/JoWatson2011/subcellularvis). SubcellulaRVis will be useful for experimental biologists with limited bioinformatics expertise who want to analyse data related to protein (re)localisation and location-specific modules within the intracellular protein network.


INTRODUCTION
The localisation of proteins within cells is critical for their processing, function and regulation. Cellular homeostasis and response to environmental signals are also dependent on sequestering or dynamic re-localisation of proteins in specific intracellular compartments (1). Regardless of whether a protein is relocated in the cell or remains in a single compartment for the duration of its life, spatial regulation is critical for protein function. For example, upon translation, proteins bound for secretion will be trafficked through multiple compartments that compose the endomembrane system; contained within these compartments are 'quality control' proteins that mediate correct folding or degradation for malformed proteins (2). Cellular processes also occur in discrete, non-membrane bound locations, such as degradation within the proteasome, or transcriptional regulation in stress granules (3,4). It is also increasingly appreciated that proteins which are dynamically recruited to different subcellular locations form functionally active modules that may be location specific, such as those that form and disperse at the lysosomal surface in response to growth factor or nutrient sensing (5). Moreover, proteins can perform different functions dependent on their locations (referred to as moonlighting proteins) (6,7). The change in protein localisation introduces them to different protein partners that may differentially regulate the function of the protein, as exemplified by STAT3, which has independent roles in transcriptional regulation and oxidative phosphorylation (8). Proteins can also be pathologically mislocated, as demonstrated by mislocalisation of liver-specific peroxisomal enzyme alanine:glyoxylate aminotransferase (AGT) to the mitochondria in patients diagnosed with primary hyperoxaluria type 1 (9).
Co-regulated proteins or genes are commonly identified using high-throughput 'omics experiments, such as those based on transcriptomics, proteomics and phosphoproteomics. The regulated proteins or genes identified in these experiments are often assessed based on enrichment for annotations to particular biological characteristics or participation in biological pathways. Annotations based on subcellular localisation are stored in the Gene Ontology aspect Cellular Component (GOCC) or the Jensen COMPARTMENTS database (10,11). Enrichment for these annotations can provide initial indications of location-specific roles for a protein or gene list. Moreover, when analysing spatially resolved data, annotations describing subcellular localisation also provide an important data quality control. This is exemplified in the analysis of spatial proteomics data that utilise proximity-dependent biotinylation, in which bait proteins (known, characterised proteins tagged with a biotin ligase or peroxidase) covalently modify neighbouring proteins (termed 'prey') through the addition of biotin (12,13). Isolating the biotinylated proteins generates a spatial interactome of the bait protein. Recently, BioID-based proximity-dependent biotinylation was used to generate cell-wide maps of proteins localised to particular organelles (14). As demonstrated in this work, calculating enrichment for GOCC terms is a useful method for understanding whether the bait has been correctly tagged, in which case one would expect enrichment for the subcellular compartment the bait is commonly found in.
Several web-based tools, such as EnrichR (15), can be used to perform enrichment analyses on protein or gene lists based on the GOCC or COMPARTMENTS databases. These analyses are accessible to biologists with minimal bioinformatics skills and are useful for understanding the subcellular compartments overrepresented in a list of proteins or genes of interest. However, a significant barrier in the interpretation of the results is that annotations are often highly specific, meaning it can be difficult to extract general trends from the potentially long list of returned annotations. For example, the enrichment for GOCC terms, calculated using EnrichR, on a list of GPI-anchored proteins (extracted from UniProt, Supplementary Table S1), results in a list of terms that are difficult to immediately generalise (Table 1), despite useful visualisation tools included in the EnrichR web app. In contrast, certain cellular compartments are poorly characterised, for example the individual compartments of the endosomal system (16), and therefore analyses of proteins localised in such compartments may be noisy and incomplete.
We have implemented a Shiny web app, named SubcellulaRVis, that provides graphical visualisation of subcellular compartment enrichment from a list of multiple genes or proteins. By performing the enrichment test using condensed annotations, so that there is a single term for each major intracellular compartment in the results, the analyses can be concisely visualised and interpreted. SubcellulaRVis represents an alternative to the visualisation provided on the web interfaces of UniProt and COMPARTMENTS which only visualise the localisation of a single protein (11,17). A major novelty of our approach is the interactive visualisation on a schematic of the eukaryotic cell of GOCC enrichment (Figure 1), allowing for user-friendly interpretation, exploration, and presentation of the data. The visualisation can be exported as a static image or as a table of the data behind the visualisation. SubcellulaRVis provides a solution for non-bioinformaticians to investigate the subcellular localisation of the proteins within a dataset of interest and standardises the visualisation of these results. The app is available at: http://phenome.manchester.ac.uk/subcellular/.

MATERIALS AND METHODS
SubcellulaRVis aims to simplify the interpretation of GOCC enrichment analyses through visualisation of a standardised schematic of the cell. We selected subcellular compartments that are non-transient and common to eukaryotes, though it is a non-exhaustive list.
We first defined each cellular compartment visualised on the schematic (as shown in Figure 2) with respect to the GOCC terms that describe it. We did this by utilising the hierarchical structure of GO. The GO terms are hierarchically organised, with GO terms towards the 'top' of the hierarchy being more general descriptors (e.g. GO:0005886, plasma membrane) whilst terms towards the 'bottom' of the hierarchy are more specific (e.g. GO:0005901, caveola) (18). Each term in the GO hierarchy has parent terms (those that are higher in the hierarchy) and may also have child terms (those that are lower in the hierarchy). We identified the highest level GOCC term that would best describe each of the compartments visualised on the schematic (Table 2). Then, using the GO.db package in Bioconductor (version 3.14) (19), the child terms of the high-level GOCC terms were extracted and associated together. We call these grouped, compartment-specific terms the SubcellulaRVis compartment and use the high-level parent term to describe them, as shown in Table 1, hereafter. We extracted all the genes annotated to the terms in each SubcellulaRVis compartment using the species-related AnnotationData libraries in Bioconductor (version 3.14), to create gene sets against which to calculate enrichment (Table 2). The annotation lists were generated for multiple species: H. sapiens, M. musculus, D. melanogaster, S. cerevisiae, R. rattus and X. laevis.
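The grouping step described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the authors' implementation, which uses GO.db in R. The hierarchy below is a toy: only GO:0005886 and GO:0005901 come from the text, while the "GO:TOY…" identifiers, the gene symbols and the function names are hypothetical placeholders.

```python
from collections import deque

# Toy parent -> children map (assumption). Only GO:0005886 (plasma membrane)
# and GO:0005901 (caveola) are taken from the text; "GO:TOY..." IDs are made up.
CHILDREN = {
    "GO:0005886": ["GO:TOY0001", "GO:TOY0002"],
    "GO:TOY0001": ["GO:0005901"],
    "GO:TOY0002": [],
    "GO:0005901": [],
}

# Hypothetical term -> gene annotations (placeholder gene symbols).
TERM_TO_GENES = {
    "GO:0005886": {"GENE_A"},
    "GO:0005901": {"GENE_B", "GENE_C"},
    "GO:TOY0002": {"GENE_A", "GENE_D"},
}


def descendants(root, children):
    """Return the root term plus every term reachable below it."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            # GO is a directed acyclic graph, so a term may be reached
            # through more than one parent; visit each term only once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def compartment_gene_set(root, children, term_to_genes):
    """Union of the genes annotated to any term grouped under the root."""
    genes = set()
    for term in descendants(root, children):
        genes |= set(term_to_genes.get(term, ()))
    return genes


plasma_membrane_terms = descendants("GO:0005886", CHILDREN)
plasma_membrane_genes = compartment_gene_set("GO:0005886", CHILDREN, TERM_TO_GENES)
```

Collapsing every descendant term into its high-level parent is what yields one gene set per compartment, at the cost of losing the finer distinctions between the child terms.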
Enrichment is calculated on the user-supplied list of HUGO Gene Nomenclature Committee (HGNC) symbols or UniProt IDs (Figure 1). Firstly, duplicate and non-valid entries are removed from the list. Then, the genes are annotated to the GOCC terms based on the annotation lists generated for each species, and the annotations are then grouped based on the SubcellulaRVis compartment. A standard enrichment test is performed using the hypergeometric probability function, phyper, in R (version 4.1.0), to calculate the enrichment for each SubcellulaRVis compartment in the user-supplied gene list. The resulting P-values are corrected for multiple testing by the false discovery rate (FDR), using the p.adjust function in R with the 'method' argument set to 'fdr'. Though it has limitations (20,21), the hypergeometric test was selected as values associated to the genes or proteins (e.g. expression or abundance values) do not need to be supplied in order to perform the test. To improve the accuracy of this calculation, the user can also input a background population of expressed genes or proteins in their sample, if known (as discussed in (22)).
In the absence of the user-supplied background population list, the calculation is performed based on all the genes in the reference genomes (from the Bioconductor AnnotationData databases, as described above).
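As a rough illustration of the enrichment calculation, the sketch below re-expresses an upper-tail hypergeometric test and a Benjamini-Hochberg adjustment in plain Python. It is an assumption-laden stand-in for the R calls named above (phyper and p.adjust with method 'fdr'), not the tool's code; the function names and the `universe` argument are hypothetical. Here `universe` is either the user's background list or, if none is given, all annotated genes for the species.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k compartment genes when n genes
    are drawn from a universe of N genes, K of which are in the compartment."""
    if k <= 0:
        return 1.0
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(query, compartments, universe):
    """Return {compartment: (p_value, adjusted_p_value)} for a gene list."""
    universe = set(universe)
    # Set intersection removes duplicates and identifiers not in the universe.
    query = set(query) & universe
    names = list(compartments)
    pvals = []
    for name in names:
        members = set(compartments[name]) & universe
        overlap = len(query & members)
        pvals.append(
            hypergeom_upper_tail(overlap, len(members), len(query), len(universe))
        )
    adjusted = benjamini_hochberg(pvals)
    return {name: (p, q) for name, p, q in zip(names, pvals, adjusted)}


# Toy data (placeholders, not real annotations).
toy_universe = [f"g{i}" for i in range(10)]
toy_compartments = {"A": {"g0", "g1", "g2", "g3"}, "B": {f"g{i}" for i in range(4, 10)}}
toy_result = enrich(["g0", "g1", "g1", "g5", "not_a_gene"], toy_compartments, toy_universe)
```

Restricting both the query and the compartment gene sets to the same universe is the point at which a user-supplied background changes the result: a smaller universe shrinks N and K and therefore shifts the p-values.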
The SubcellulaRVis tool has been implemented as a Shiny (version 1.7.1) web app (http://phenome.manchester.ac.uk/subcellular/) and as a standalone app with R package (https://github.com/JoWatson2011/subcellularvis). The R package will be submitted to Bioconductor upon acceptance of the article. The visualisation within the app has been implemented using ggplot2 (version 1.7.1) and Plotly (version 4.10.0), allowing interactive exploration of the results when viewed on a web browser. All dependencies are described in the DESCRIPTION file of the R package and will be installed along with the package, as described in the README file.

Features of SubcellulaRVis
The key feature of the SubcellulaRVis app is the visualisation of GOCC enrichment results on a graphical representation of a eukaryotic cell (as seen in Figure 2), allowing for rapid and simple interpretation of the predominant localisation of proteins of interest. By providing visualisation for a group of proteins, we provide an alternative to similar tools that visualise the localisation of single proteins (11,17). The app provides an interactive view of protein localisation, and a static image of the plot can be exported; the characteristics of both the image and the plot (such as text size or colour scale) can be chosen by the user. This is demonstrated by the test data, available in the app, which is a list of 139 proteins which are annotated as GPI-anchored in UniProt (Supplementary Table S1) (17). Figure 2 visualises the enrichment of these proteins for the membrane bound organelles of the secretory pathway, plasma membrane and for the extracellular space (Table 3). Cellular compartments shown in white are not enriched in the gene list.
Nucleic Acids Research, 2022, Vol. 50, Web Server issue W721
Using SubcellulaRVis, the user can view the results of the enrichment analysis graphically, as discussed above, or in tabular form. A standard enrichment including all the GOCC terms (rather than the summarised terms) can be calculated and can be viewed in a separate tab on the app; the table also specifies the SubcellulaRVis compartment each of these terms is assigned to, allowing the user full understanding of the way their data has been analysed and visualised. To evaluate genes or proteins that have been annotated to multiple categories, the user can use the results found in the 'overlap' tab, which contains an UpSet plot and the results in tabular form. Finally, we have also implemented a feature to allow users to compare multiple gene lists. We anticipate this being useful to users who have multiple experimental conditions or want to quickly perform and visualise multiple comparisons.
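The kind of tally that underlies an UpSet plot can be sketched as follows: each gene is assigned to the exact combination of compartments it is annotated to, and the combinations are counted. This is an illustrative Python sketch under assumed inputs, not the code behind the 'overlap' tab; the function name, gene symbols and annotations are hypothetical placeholders.

```python
from collections import Counter


def membership_counts(query, compartments):
    """Count query genes by the exact set of compartments they belong to."""
    counts = Counter()
    for gene in set(query):
        combo = frozenset(
            name for name, genes in compartments.items() if gene in genes
        )
        if combo:  # genes with no compartment annotation are skipped
            counts[combo] += 1
    return counts


# Toy annotations (placeholders): two genes shared by two compartments.
toy_compartments = {
    "Cytoplasm": {"G1", "G2", "G3"},
    "Peroxisome": {"G2", "G3"},
    "Nucleus": {"G4"},
}
overlap = membership_counts(["G1", "G2", "G3", "G4", "G5"], toy_compartments)
```

Each key of the returned counter corresponds to one bar of an UpSet plot, and its value to the height of that bar.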

Validation of GO cellular compartment summarisation
To assess the precision of the SubcellulaRVis compartment assignments, we validated the tool using various data sets describing subcellular localisation of proteins.
The first validation data set came from The Human Protein Atlas (23). The subcellular section of the database is considered a gold-standard resource for the assignment of protein localisation. Each entry in the database is a protein whose spatial distribution has been investigated with immunofluorescence, confocal microscopy and proteomics in up to three different cell lines (23). The flat file of subcellular location data was downloaded from the website (https://www.proteinatlas.org/about/download) and filtered for proteins that were associated to only one subcellular compartment, and a protein list was created for each subcellular compartment. Analysis of proteins with limited spatial distribution allowed for simpler validation, as we could perform a one-to-one comparison between the true localisation and the highest enriched SubcellulaRVis compartment. We found that the highest enriched compartment matched the true localisation for many of the protein lists (Table 4, Supplementary Table S2). Encouragingly, sub-compartments were resolved to their parent compartments; for example, proteins associated to actin filaments, intermediate filaments and microtubules were all associated to the cytoskeleton. Likewise, cell junctions and focal adhesion sites were associated to the plasma membrane. Moreover, the parent compartments were correctly associated, for example the endoplasmic reticulum and mitochondria. Only the peroxisome proteins were not correctly associated by the most enriched compartment; however, the second most enriched compartment was the peroxisome (Supplementary Table S2). This is due to overlap between the genes annotated to the peroxisome and the cytoplasm.
We then validated SubcellulaRVis with two experimental datasets. The first, from Go et al. (14) (Supplementary Table S8 of original publication), used proximity-dependent biotinylation to label proteins in the vicinity of different organellar markers which were tagged with the biotin ligase enzyme BioID. In total, 4424 proteins were identified and associated to different subcellular compartments. We calculated the enrichment using SubcellulaRVis of the proteins associated to the different organelles in the datasets, treating them as organelle-specific protein lists. The most significantly enriched SubcellulaRVis compartment for each of the twenty experimentally determined, organelle-associated protein lists directly or closely matched the experimental assignment (Table 5, Supplementary Table S3). For example, proteins experimentally associated to the cell junction, nucleolus and microtubules were enriched for the plasma membrane, nucleus, and cytoskeleton compartments, respectively, by the SubcellulaRVis tool. This case study provides a useful example of how analysis based on standardised, generic cellular compartment names, such as the SubcellulaRVis compartments, could be useful for comparison between different spatial studies.
To assess the precision of the SubcellulaRVis tool for the analysis of a non-human experimental data set we extracted the data from Nightingale et al. (24) (from Supplementary Table S3 of original publication) which used the hyperplexed Localisation of Organelle Proteins by Isotope Tagging (hyperLOPIT) method (25) to reconstruct spatial profiles of organelle proteomes in Saccharomyces cerevisiae. SubcellulaRVis found the highest enrichment for the compartment most closely describing the experimentally associated cellular compartment for the cytosol, nucleus, Golgi apparatus and plasma membrane (Table 6) and the second highest enriched compartment for the mitochondrion and endoplasmic reticulum (Supplementary Table S4). The latter is explained by the significant overlap in genes annotated to the cytoplasm and associated to the mitochondria (183/186) and endoplasmic reticulum (165/170) by the hyperLOPIT experiment. We concluded that SubcellulaRVis accurately replicates the protein localisation determined in S. cerevisiae. Moreover, the SubcellulaRVis visualisation (as demonstrated in Figure 1) would be an impactful way to display the enrichment results for each of the organelle-specific lists.
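The comparison used throughout this validation, asking whether the expected compartment is the most enriched or the second most enriched, can be expressed as a small ranking check. The Python sketch below is illustrative only; the function name is hypothetical and the adjusted p-values are invented placeholders, not results from any of the data sets analysed here.

```python
def rank_of_expected(adjusted_pvalues, expected):
    """1-based rank of the expected compartment when compartments are
    ordered from the smallest to the largest adjusted p-value."""
    ranked = sorted(adjusted_pvalues, key=adjusted_pvalues.get)
    return ranked.index(expected) + 1


# Placeholder values (assumption): a broad compartment outranks the
# more specific expected one, as described for overlapping gene sets.
toy_enrichment = {"Cytoplasm": 1e-30, "Mitochondrion": 1e-20, "Nucleus": 0.5}
toy_rank = rank_of_expected(toy_enrichment, "Mitochondrion")
```

A rank of 1 corresponds to a direct match, while a rank of 2 corresponds to cases such as those attributed above to gene overlap with the cytoplasm.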

DISCUSSION
We have introduced an app named SubcellulaRVis, which provides easy visualisation of the enriched subcellular locations from a user-supplied list of HGNC symbols or UniProt IDs. The visualisation is more biologically insightful than a bar chart and will aid in efficient interpretation of the characteristics of gene lists and standardisation of subcellular localisation analyses. Moreover, SubcellulaRVis complements existing visual tools such as COMPARTMENTS and UniProt which use similar schematics to visualise the known locations of individual protein species. We anticipate our software will be particularly useful for cellular and molecular biologists with limited bioinformatics expertise wanting to perform precise and quick enrichment analysis and immediate visualisation of gene lists. SubcellulaRVis will be beneficial in analysing data from spatial proteomics, as we have demonstrated here, which is increasingly used to resolve organelle proteomes and location-specific protein interaction networks. In this context, our tool will provide impactful visualisation of how protein localisation varies upon different extracellular perturbations or for providing confirmation of correct tagging of proteins for spatial proteomics experiments. Moreover, the method is useful for analysing high-throughput data from all 'omics contexts. For example, RNA transcripts can be localised in order to be translated at the location in which the protein will be active (26). This information is captured in the database RNALocate but, as the authors note, RNA localisation is not well-captured in the Gene Ontology resource (on which our tool is based) (27). Therefore, SubcellulaRVis is well suited for analysing protein lists or protein products from genes or transcript lists.
There are some features of SubcellulaRVis that will be developed further based on increasing and more resolved information on the subcellular localisation of proteins. We have included the option to calculate enrichment within human protein or gene lists based on predicted subcellular localisations from the Human Protein Atlas rather than GOCC annotations. At the time of writing, the subcellular map of the Human Protein Atlas has coverage of 65% of human protein-coding genes. This makes it an invaluable resource for researchers and a good fit for SubcellulaRVis. SubcellulaRVis also provides an alternative view for the visualisation of the endosomal system (Figure 3). The endosomal system is notoriously hard to characterise (16), therefore our tool will be a valuable resource for distinguishing gene lists based on enrichment for different endosomal compartments. The endosomal view of SubcellulaRVis has recently been effectively utilised for visualisation of spatially resolved phosphoproteomics data describing intracellular signalling modules following dynamic cell surface receptor internalisation (28). In the future, SubcellulaRVis can be further expanded to provide more detailed views of subcompartments of, for example, the nucleus or endomembrane systems.
In summary, we have described an easy-to-use and powerful tool for the visualisation and simplification of GOCC enrichment analyses on gene or protein lists.