OpenXGR: a web-server update for genomic summary data interpretation

Abstract How to effectively convert genomic summary data into downstream knowledge discovery represents a major challenge in human genomics research. To address this challenge, we have developed efficient and effective approaches and tools. Extending our previously established software tools, we here introduce OpenXGR (http://www.openxgr.com), a newly designed web server that offers almost real-time enrichment and subnetwork analyses for a user-input list of genes, SNPs or genomic regions. It achieves this by leveraging ontologies, networks, and functional genomic datasets (such as promoter capture Hi-C, e/pQTL and enhancer-gene maps for linking SNPs or genomic regions to candidate genes). Six analysers are provided, each performing specific interpretations tailored to genomic summary data at various levels. Three enrichment analysers are designed to identify ontology terms enriched for input genes, as well as genes linked from input SNPs or genomic regions. Three subnetwork analysers allow users to identify gene subnetworks from input gene-, SNP- or genomic region-level summary data. With a step-by-step user manual, OpenXGR provides a user-friendly and all-in-one platform for interpreting summary data on the human genome, enabling more integrated and effective knowledge discovery.


INTRODUCTION
Human genomics research produces complex raw genomic data that can be simplified into summary-level data that capture essential information ready for sharing and mining. Without loss of generality, we define genomic summary data as a list of genes, SNPs or genomic regions, along with their summary statistics about the significance level (e.g. P-values) (1). Gene-level summary data are often generated from differential expression studies (2), SNP-level summary data from genome-wide association studies (3), and genomic region-level summary data from epigenomic studies (4,5). This simplification of data allows for more straightforward analyses, but how to effectively convert genomic summary data into downstream knowledge discovery remains one of the major challenges in human genomics research.
To address the challenges described above, we have developed eXploring Genomic Relations or XGR (1) by demonstrating how ontologies enhance genomic summary data interpretation and how to enable insights at the gene subnetwork level. A dozen ontologies have been created to annotate genes regarding functions (6), phenotypes (7,8), diseases (9,10) and other attributes. By integrating a reference gene network that consolidates interaction knowledge (11) with genomic summary data, a subset of the gene network can be identified to best explain the data, thereby gaining insights at the gene subnetwork level. Interpreting non-coding SNPs or genomic regions, however, requires additional use of functional genomic datasets, due to the inherent difficulty in linking them to candidate genes. This difficulty can be resolved by leveraging information from promoter capture Hi-C (PCHi-C) datasets that capture physical interactions with gene promoters (12), quantitative trait loci (QTL) datasets that capture genetic regulation of gene expression (eQTL) (13,14) or protein abundance (pQTL) (15), and datasets about enhancer-gene maps that are constructed using the activity-by-contact (ABC) model (16,17).
Substantially extending our XGR software since its first release (1) and incorporating verified approaches and tools (18–25), in this study, we introduce a newly designed web server 'OpenXGR' (Figure 1), which is available at http://www.openxgr.com. Overall, the server is designed to be scalable, efficient and effective, enabling almost real-time enrichment and subnetwork analyses for user-input lists of three different entities: genes, SNPs and genomic regions. It is not limited to gene- or SNP-centric data types but is also capable of interpreting genomic regions. This generality of capacity for interpreting different entities on the fly is not available elsewhere, thus complementing other popular web servers such as DAVID (26), Enrichr (27) and GREAT (28) that are the most relevant to OpenXGR, and also competitive with standalone tools such as DEPICT (29), MAGMA (30) and jActiveModule (31). OpenXGR achieves this capacity by leveraging increasingly available ontologies, networks, and functional genomic datasets (i.e. PCHi-C, e/pQTL and ABC). Along with a user manual with step-by-step instructions, it offers a user-friendly and all-in-one way to interpret genomic summary data for more integrated and effective knowledge discovery.
In the remaining sections, we will provide an overview of the OpenXGR web server implementation, its six analysers, and the underlying knowledgebase. We will then delve into each analyser that may be of interest to users, with utilities illustrated using practical examples from real-world scenarios, including ageing-related genes (32), gene-level summary data for early human organogenesis (33), SNP-level summary data for chronic inflammatory diseases (34), and genomic region-level summary data for innate immune activation and tolerance (4). Finally, we will conclude with the discussion on the limitations of OpenXGR and the directions for future developments.

Implementation of the OpenXGR web server
The OpenXGR web server (Figure 1) was newly implemented using the Perl real-time web framework 'Mojolicious' (https://mojolicious.org) and the widely-used 'Bootstrap' (https://getbootstrap.com) to create a mobile-first and responsive design that ensures fast and responsive performance across all major web browsers and mobile devices. All backend computations can be completed within three minutes on the server side to ensure timely delivery of outputs to users. All outputs displayed on the results page are generated using the R package 'bookdown' (https://bookdown.org), providing users with a self-contained dynamic HTML file for download and exploration. Additionally, a user manual with step-by-step instructions is made available where needed to facilitate ease of use and provide guidance for users.

Two types of analysers supported by OpenXGR
The OpenXGR web server offers a range of analysers for conducting enrichment and subnetwork analyses leveraging ontologies and networks. Presently, two types of analysers are supported: one for enrichment analysis designed to identify ontology enrichments, and the other for subnetwork analysis designed to identify gene subnetworks.
Enrichment analysis comprises three analysers that identify enriched ontology terms. These analysers take as input a list of genes, SNPs or genomic regions. One-sided Fisher's exact test is used to calculate Z-scores, odds ratio with its 95% confidence interval (CI), and false discovery rate (FDR) for measuring the significance of enrichments. The following are the three analysers supported by OpenXGR: (i) Enrichment analyser for genes (EAG), which uses gene-centric ontology annotations to perform enrichment analysis. (ii) Enrichment analyser for SNPs (EAS), which identifies genes linked from input SNPs (alongside the significance information) and conducts ontology enrichment analysis for the linked genes. Linking SNPs to genes is enabled by genomic proximity or using functional genomic datasets about PCHi-C and e/pQTL. (iii) Enrichment analyser for genomic regions (EAR), which, similar to EAS, first identifies genes linked from input genomic regions using functional genomic datasets about PCHi-C and enhancer-gene maps and then conducts ontology enrichment analysis based on the linked genes.
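The enrichment statistics described above can be illustrated for a single ontology term with a minimal Python sketch. It assumes a simple 2×2 table of input genes versus term members, reports the sample (cross-product) odds ratio, and derives the Z-score from the hypergeometric mean and variance; these definitions, and the function name `enrichment`, are assumptions made for illustration and may differ from the exact formulas used on the server.

```python
from math import comb, sqrt


def enrichment(n_overlap, n_input, n_term, n_background):
    """One-sided (over-representation) Fisher's exact test for one term.

    n_overlap: input genes annotated to the term
    n_input: input genes, n_term: genes annotated to the term,
    n_background: all genes considered (every count is within this universe).
    """
    # 2x2 table: a = in input and in term, b = in input only,
    # c = in term only, d = in neither.
    a = n_overlap
    b = n_input - n_overlap
    c = n_term - n_overlap
    d = n_background - n_input - n_term + n_overlap

    # Upper-tail hypergeometric probability P(X >= a).
    max_k = min(n_input, n_term)
    total = comb(n_background, n_input)
    p_value = sum(
        comb(n_term, k) * comb(n_background - n_term, n_input - k)
        for k in range(a, max_k + 1)
    ) / total

    # Sample odds ratio (assumption: not the conditional estimate).
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")

    # Z-score from the hypergeometric mean and variance (assumption).
    mean = n_input * n_term / n_background
    var = mean * (1 - n_term / n_background) * (n_background - n_input) / (n_background - 1)
    z_score = (a - mean) / sqrt(var) if var > 0 else 0.0
    return p_value, odds_ratio, z_score


# Toy numbers: 5 input genes, 4 term members, 3 shared, 20 background genes.
p_value, odds_ratio, z_score = enrichment(3, 5, 4, 20)
```

In practice the same function would be applied to every term in the chosen ontology, followed by multiple-testing correction of the resulting P-values.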
Subnetwork analysis is performed using three analysers that identify gene subnetworks from input gene-, SNP- or genomic region-level summary data. All subnetwork analysers require the input of the information about the significance level (e.g. P-values). The subnetwork identification is done via a heuristic solver for the prize-collecting Steiner tree problem, demonstrated to be competitive with other state-of-the-art algorithms (1,24). The significance (P-value) of the identified gene subnetwork can be estimated using a degree-preserving node permutation test to count how often it would be expected by chance. The following are the three subnetwork analysers supported by OpenXGR: (i) Subnetwork analyser for genes (SAG), which takes as input gene-level summary data to identify a subset of the gene network in a manner that the resulting subnetwork contains a desired number of highly scored and interconnected genes. (ii) Subnetwork analyser for SNPs (SAS), which identifies a gene subnetwork from input SNP-level summary data. It first uses genomic proximity, e/pQTL or PCHi-C to link SNPs to genes, and then uses information on the linked genes to identify the gene subnetwork. (iii) Subnetwork analyser for genomic regions (SAR), which, similar to SAS, first identifies genes linked from input genomic regions using PCHi-C datasets or enhancer-gene maps, followed by subnetwork analysis based on the linked genes.
Figure 1. The server at http://www.openxgr.com offers six analysers that are designed to interpret various genomic summary data related to genes (G), SNPs (S) and genomic regions (R). By leveraging built-in knowledgebase on ontologies, networks and functional genomics, these analysers allow almost real-time enrichment and subnetwork analyses, enabling identification of ontology enrichments and gene subnetworks. A user manual is made available to provide step-by-step instructions on the use.
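To make the idea of a degree-preserving node permutation test concrete, the following Python sketch shuffles node scores only among nodes that share the same degree and then computes an empirical P-value. It is a simplified illustration: the test statistic (total score of a fixed node set), the toy degrees and scores, and the function names are all assumptions, whereas the actual test may re-run subnetwork identification on each permuted network.

```python
import random
from collections import defaultdict


def degree_preserving_permutation(degree, score, rng):
    """Shuffle node scores only among nodes that have the same degree."""
    by_degree = defaultdict(list)
    for node in sorted(degree):
        by_degree[degree[node]].append(node)
    permuted = {}
    for nodes in by_degree.values():
        shuffled = [score[n] for n in nodes]
        rng.shuffle(shuffled)
        permuted.update(zip(nodes, shuffled))
    return permuted


def empirical_p(observed, null_values):
    """Share of null statistics >= observed, with the usual +1 correction."""
    hits = sum(1 for v in null_values if v >= observed)
    return (hits + 1) / (len(null_values) + 1)


# Toy network summary (hypothetical): node degrees and node scores.
degree = {"A": 2, "B": 2, "C": 1, "D": 1, "E": 3, "F": 3}
score = {"A": 5.0, "B": 0.1, "C": 4.0, "D": 0.2, "E": 6.0, "F": 0.3}
subnetwork = ["A", "C", "E"]

rng = random.Random(0)
observed = sum(score[n] for n in subnetwork)
null = []
for _ in range(999):
    perm = degree_preserving_permutation(degree, score, rng)
    null.append(sum(perm[n] for n in subnetwork))
p_value = empirical_p(observed, null)
```

Restricting the shuffle to nodes of equal degree keeps highly connected genes from being favoured simply because they have more neighbours.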

Leveraging knowledgebase on ontologies, networks and functional genomic datasets
Enrichment analysis in OpenXGR is supported by a variety of ontologies that span a wide range of knowledge contexts, ranging from functions and pathways to regulators, from diseases and phenotypes to drugs, and from protein domains and disorders to hallmarks and evolution. Ontologies currently supported are: (a) functions: Gene Ontology (GO) ( (48) and Phylostratigraphy (49).
Subnetwork analysis in OpenXGR leverages the knowledge of functional or pathway interaction networks. Functional interaction networks are sourced from the STRING database (11) (version 11.5), with only the 'experiments' and 'databases' source codes considered. Functional interactions are classified as having the highest confidence (≥0.9), high confidence (≥0.7), and medium confidence (≥0.4). Pathway interaction networks are sourced from the KEGG database (35) (105.0 release), with all individual pathways being merged into a gene network.
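As an illustration of the confidence tiers described above, the short Python sketch below filters a list of scored interactions by the chosen tier and reports the size of the resulting network. The edge list, gene labels and scores are hypothetical toy values on a 0 to 1 scale, and the function names are assumptions made for this example.

```python
# Confidence cutoffs as stated in the text (scores on a 0-1 scale).
CONFIDENCE_CUTOFFS = {"highest": 0.9, "high": 0.7, "medium": 0.4}


def filter_interactions(edges, level="high"):
    """Keep (gene_a, gene_b, score) edges meeting the cutoff for `level`."""
    cutoff = CONFIDENCE_CUTOFFS[level]
    return [(a, b, s) for a, b, s in edges if s >= cutoff]


def network_size(edges):
    """Return (number of genes, number of interactions)."""
    nodes = {gene for a, b, _ in edges for gene in (a, b)}
    return len(nodes), len(edges)


# Hypothetical toy edges; labels and scores are placeholders.
toy_edges = [
    ("G1", "G2", 0.95),
    ("G2", "G3", 0.72),
    ("G3", "G4", 0.45),
    ("G4", "G5", 0.20),
]

high_confidence = filter_interactions(toy_edges, "high")
n_genes, n_interactions = network_size(high_confidence)
```

Choosing a stricter tier trades network coverage for reliability, which is why the number of genes and interactions shrinks as the cutoff rises.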

Capabilities of enrichment analysers in identifying enriched ontology terms from input genes, SNPs or genomic regions
Enrichment Analyser (Genes) - EAG conducts enrichment analysis for genes leveraging ontologies. EAG is designed to leverage gene-centric ontology annotations to identify enriched ontology terms from input genes. The tool comprises two major steps, which are outlined in the user-request interface (Figure 2A). The interface takes a list of genes as input, such as ∼300 ageing-related genes as an illustrative example (32). Available ontologies are organised by category (Table 1). Additional parameters can be specified to control the enrichment analysis and results. The interface features a toggle button to show/hide information on the use, including details on input, output and other useful information, as well as a key icon that provides an example input/output showcase. In the results page (Figure 2B), the 'Input Gene Information' tab lists the input genes and hyperlinks to their corresponding GeneCards pages for additional information and displays the server-side runtime. The 'Output: Enriched Terms' tab features an interactive table that displays enriched ontology terms, along with their significance information such as Z-scores, FDR, odds ratio and its 95% CI. It also shows member genes that overlap with the input genes. The results are also illustrated in the 'Output: Dotplot' tab, displaying the top five terms with their respective Z-scores and FDR (Figure 2C), and in the 'Output: Forest Plot' tab, listing the top enriched terms ordered by odds ratio (Figure 2D). As expected, most enrichments are ageing-related, such as FoxO signaling, longevity regulating pathways, and ageing. It is worth noting that all enrichment results are embedded into a self-contained dynamic HTML file that can be downloaded and explored interactively in a new browser window. We highly recommend that users download this file for subsequent exploration.
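The FDR values reported alongside enriched terms can be illustrated with a short Python sketch. It assumes the Benjamini-Hochberg procedure, which is a common way to control the FDR but is not named explicitly in the text; the term labels and P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def top_terms(terms, pvalues, n=5):
    """Return up to n (term, FDR) pairs, most significant first."""
    fdr = benjamini_hochberg(pvalues)
    ranked = sorted(zip(terms, fdr), key=lambda pair: pair[1])
    return ranked[:n]


# Hypothetical terms and P-values for illustration only.
terms = ["term_A", "term_B", "term_C", "term_D"]
pvalues = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvalues)
best = top_terms(terms, pvalues, n=2)
```

A ranking of this kind is what a display of the top few enriched terms would be built from.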
Enrichment Analyser (SNPs) - EAS links SNPs to candidate genes for enrichment analysis. EAS achieves this by linking input SNPs to candidate genes in three steps. The user-request interface requires two pieces of information as input: SNPs and their significance information (P-values). For example, the interface presents an illustrative example of ∼210 SNPs and their reported P-values for chronic inflammatory diseases (34). By default, this analyser considers input SNPs with a P-value threshold of < 5 × 10⁻⁸, and additional SNPs in linkage disequilibrium (R² ≥ 0.8) according to the European population, though other populations are also supported (56). Input and additional SNPs are then linked to genes based on genomic proximity, PCHi-C or e/pQTL (see Table 1). The linked genes are scored based on P-values, threshold and R² for SNPs, the distance window for genomic proximity, the strength of gene promoters physically interacting with SNP-harbouring genomic regions for PCHi-C datasets, and the significance level defining e/pQTL, as previously described (1,23). Enriched ontology terms are identified based on enrichment analysis of the linked genes. In addition to a dot plot and a forest plot, the output also includes two tabular displays under the 'Output: Linked Genes' tab. One lists the linked genes and their scores, which range from 1 to 10. The other is an evidence table showing which SNPs are used to define the linked genes based on which datasets.
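The simplest of the linking strategies above, genomic proximity, can be sketched in Python as follows. The sketch keeps SNPs below the default P-value threshold and links each to genes within a distance window. The 50 kb window, the record layouts and all identifiers and coordinates are hypothetical; linkage disequilibrium expansion and gene scoring are deliberately left out.

```python
from collections import namedtuple

Snp = namedtuple("Snp", "rsid chrom pos pvalue")
Gene = namedtuple("Gene", "symbol chrom start end")


def link_by_proximity(snps, genes, p_threshold=5e-8, window=50_000):
    """Link significant SNPs to genes lying within `window` base pairs."""
    links = []
    for snp in snps:
        if snp.pvalue >= p_threshold:
            continue  # not genome-wide significant
        for gene in genes:
            if gene.chrom != snp.chrom:
                continue
            # Distance is 0 when the SNP falls inside the gene body.
            distance = max(gene.start - snp.pos, 0, snp.pos - gene.end)
            if distance <= window:
                links.append((snp.rsid, gene.symbol, distance))
    return links


# Hypothetical SNPs and genes with made-up coordinates.
snps = [
    Snp("rsA", "chr1", 100_000, 1e-9),
    Snp("rsB", "chr1", 500_000, 1e-3),
    Snp("rsC", "chr2", 40_000, 2e-10),
]
genes = [
    Gene("GENE1", "chr1", 120_000, 130_000),
    Gene("GENE2", "chr1", 300_000, 310_000),
    Gene("GENE3", "chr2", 30_000, 45_000),
]
links = link_by_proximity(snps, genes)
```

The returned triples resemble an evidence table in miniature, recording which SNP supports which gene and at what distance.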
Enrichment Analyser (Genomic Regions) - EAR links genomic regions to candidate genes for enrichment analysis. EAR works similarly to EAS, but instead of input SNPs, it links input genomic regions to candidate genes and performs enrichment analysis on them. Users specify the genomic coordinates of the input regions, including the chromosome, start, and end positions. The genome build for the input regions is also required, with hg19 used internally as a default and automatically converted if a different build is provided. For example, an input of ∼380 differentially expressed enhancer RNAs (non-coding regions) involved in innate immune activation and tolerance is used as an illustration (4). The linked genes are identified and scored based on genomic proximity, PCHi-C, or enhancer-gene maps (see Table 1). The output includes tabular displays of the linked genes and graphical plots of enriched ontology terms. The linked gene table under the 'Output: Linked Genes' tab displays information on genes linked from input genomic regions, including scores that quantify the degree to which genes are likely modulated by input genomic regions. The evidence table displays information on which regions are linked to genes based on which evidence. Taken together, EAR can handle various genomic regions, such as differentially expressed regions, differentially methylated DNA regions, transcription factor binding sites, and epigenetic marks from epigenomic experiments. It assists in the interpretation by identifying ontology enrichments and candidate genes associated with input genomic regions.
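Linking input regions to genes through an enhancer-gene map can be sketched in Python as an interval-overlap lookup. The `chr:start-end` input format, the map layout `(chrom, start, end, gene, score)`, the rule of keeping the maximum score per gene, and all coordinates are assumptions made for illustration; genome build conversion is not shown.

```python
import re

REGION = re.compile(r"^(chr(?:\d+|X|Y|M)):(\d+)-(\d+)$")


def parse_region(text):
    """Parse 'chr:start-end' (commas allowed in numbers) into a tuple."""
    match = REGION.match(text.replace(",", "").strip())
    if match is None:
        raise ValueError(f"not a genomic region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start exceeds end: {text!r}")
    return chrom, start, end


def link_regions(regions, enhancer_map):
    """Return {gene: best score} for enhancers overlapping any input region."""
    best = {}
    for chrom, start, end in regions:
        for e_chrom, e_start, e_end, gene, score in enhancer_map:
            # Closed-interval overlap on the same chromosome.
            if chrom == e_chrom and start <= e_end and e_start <= end:
                best[gene] = max(score, best.get(gene, 0.0))
    return best


# Hypothetical regions and enhancer-gene records.
regions = [parse_region("chr1:1,000-2,000"), parse_region("chr2:500-900")]
enhancer_map = [
    ("chr1", 1500, 2500, "GENE1", 0.3),
    ("chr1", 1900, 2100, "GENE1", 0.6),
    ("chr1", 5000, 6000, "GENE2", 0.9),
    ("chr2", 100, 500, "GENE3", 0.2),
]
linked = link_regions(regions, enhancer_map)
```

For inputs of a few hundred regions a nested loop is adequate; an interval tree or sorted sweep would be the natural replacement at larger scale.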

Capabilities of subnetwork analysers in identifying gene subnetworks from input gene-, SNP- or genomic region-level summary data
Subnetwork Analyser (Genes) - SAG performs subnetwork analysis for gene-level summary data leveraging networks. SAG is an analyser designed to exploit knowledge of protein interactions or pathway-derived gene interactions to identify gene subnetworks from input gene-level summary data (Figure 3). A typical example would be a list of differentially expressed genes with their corresponding significance information. An illustrative example provided in the user-input interface is the list of stage-transitive differential genes between Carnegie stages 9 and 10 during early human organogenesis (33). Functional interaction networks are sourced from the STRING database (11), and by default, the high-confidence interactions are used, corresponding to ∼14 800 genes and ∼203 900 interactions. Pathway interaction networks are derived by merging pathways from the KEGG database (35), collectively forming a gene network with ∼6000 genes and ∼59 000 interactions. SAG aims to identify a subset of the gene network that contains highly scored and interconnected genes. Users can specify the desired number of nodes/genes in the resulting subnetwork, and the output is returned via a well-established iterative search procedure (1,24). In summary, SAG takes a list of genes along with their significance information, such as differential genes showcased here, and returns a tabular display of the subnetwork genes and a network-like visualisation of the subnetwork (with nodes/genes colour-coded by input gene significance information).
Subnetwork Analyser (SNPs) - SAS links SNPs to candidate genes for subnetwork analysis. SAS is designed to perform subnetwork analysis using input SNP-level summary data, with the goal of linking SNPs to genes for subsequent analysis. The first three steps in the user-request interface are identical to those in EAS. Instead of specifying which ontology to use, users must indicate which gene network to use and provide specifications to control the desired number of the resulting subnetwork genes. Using the same example as in the previous section for EAS, under the 'Output: Gene Subnetwork' tab in the subnetwork results page, the output subnetwork is visualised, with genes/nodes colour-coded by linked gene scores. Interestingly, most of the subnetwork genes are involved in C-type lectin receptor signaling (CARD9, CYLD, IL10, IL12B, IL2, IRF1, NFKB1 and RHOA), JAK-STAT signaling (IL10, IL12B, IL19, IL2, IL23R, JAK2, PDGFB, PTPN2 and SOCS1), and TNF signaling (CCL2, FOS, IRF1, NFKB1, NOD2 and TNFRSF1A). These findings are consistent with the importance of these pathways in inflammation and inflammatory diseases (57–59). In summary, SAS is a valuable online tool that links SNPs to genes, enabling the identification of subnetworks that are critical to the understanding of the genetic basis of complex diseases. The resulting gene subnetwork is returned in a tabular display and a network-like visualisation, which facilitates further analysis of candidate genes, particularly enrichment analysis of subnetwork genes to identify enriched pathways.
W394 Nucleic Acids Research, 2023, Vol. 51, Web Server issue
Subnetwork Analyser (Genomic Regions) - SAR links genomic regions to candidate genes for subnetwork analysis. Similar to SAS, this analyser is specially designed for subnetwork analysis using input summary data but at the genomic region level. The first three steps in the user-request interface are identical to those in EAR. Users need to indicate gene networks to use and specify the desired number of the resulting subnetwork genes. Using real-world summary data on non-coding enhancer RNAs differentially expressed upon innate immune activation and tolerance (4), the output subnetwork is returned under the tab 'Output: Gene Subnetwork' in the results page. Enrichment analysis of the resulting subnetwork genes via EAG identifies JAK-STAT signaling (CDKN1A, CISH, IL10, IL10RA, IL19, IL20, IL7, IL7R and JAK2) as the most significant pathway (FDR = 2.0 × 10⁻⁶; odds ratio = 16.0; 95% CI = [6.07, 39.7]), highlighting its crucial role in mediating innate immune activation and tolerance (57).
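The goal shared by the subnetwork analysers, a connected set of a desired number of highly scored genes, can be illustrated with a deliberately simple greedy search in Python. This is not the prize-collecting Steiner tree heuristic used by OpenXGR; it is a stand-in that grows a connected node set from the top-scoring seed, and the toy network, scores and function name are hypothetical.

```python
def greedy_subnetwork(adjacency, score, size):
    """Grow a connected node set by repeatedly adding the best neighbour.

    adjacency: {node: [neighbours]} for an undirected network
    score: {node: score}, higher meaning more significant
    size: desired number of nodes in the subnetwork
    """
    seed = max(score, key=score.get)
    chosen = [seed]
    frontier = set(adjacency.get(seed, ()))
    while len(chosen) < size and frontier:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(frontier), key=lambda n: score.get(n, 0.0))
        chosen.append(best)
        frontier.discard(best)
        frontier.update(n for n in adjacency.get(best, ()) if n not in chosen)
    return chosen


# Hypothetical toy network: a-b, b-c, a-d, d-e.
adjacency = {
    "a": ["b", "d"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["a", "e"],
    "e": ["d"],
}
score = {"a": 9.0, "b": 1.0, "c": 8.0, "d": 7.0, "e": 0.5}
subnetwork = greedy_subnetwork(adjacency, score, size=3)
```

In this toy case the greedy search never reaches the highly scored node `c`, which sits behind the weakly scored `b`. Prize-collecting formulations are designed to weigh such connector nodes against the scores they unlock, which is one reason to prefer them over a purely greedy search.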

DISCUSSION
We have developed OpenXGR to meet the growing demand for efficient and effective interpretation of the ever-increasing volume of summary-level data in genomics. Designed as a versatile and user-friendly web server, it can interpret a wide range of genomic summary data related to three different entities (namely, genes, SNPs and genomic regions). This represents a significant development in human genomics research, as it has the potential to facilitate a more comprehensive understanding of genomic summary data and more effective downstream knowledge discovery.
One of the unique features of OpenXGR is its ability to identify gene subnetworks from input summary data at the gene, SNP and genomic region levels, providing valuable insights into the functional relationships between genes (or linked genes) and helping researchers to identify potential pathways or networks that best explain specific biological processes or diseases. Another significant advancement offered by OpenXGR is its use of functional genomic datasets, such as PCHi-C, e/pQTL and enhancer-gene maps, to link non-coding SNPs or genomic regions to candidate genes. This enables interpretation of input SNPs and genomic regions, regardless of their location in the genome, a capability that is often limited or lacking in existing tools for interpreting non-coding entities.
However, we recognise that there are limitations to OpenXGR regarding the availability of functional genomic datasets, which currently primarily support blood- and brain-related contexts. Thus, our first aim for future developments is to expand the supporting functional genomic datasets to include a diverse range of cell types, states and tissues. Additionally, enrichment and subnetwork analyses are currently limited to the human genome, so our second aim is to support model organisms, with the extension to the mouse genome already on the agenda. This will expand the capacity of OpenXGR in interpreting genomic summary data beyond human. Looking further ahead, we are excited about the opportunity of employing large language models (60) to support genomic summary data interpretation, either in generating ontology annotations and gene networks or in providing outputs in a conversational way similar to ChatGPT. Other future efforts will focus on improving the selection panel of available options (e.g. cell type-specific information on PCHi-C, eQTLs and enhancer-gene maps), supporting enrichment and subnetwork analyses for protein structural domains taken from the dcGO resource (25,61), increasing the user base, and committing to the OpenXGR web server update twice a year. In the long run, OpenXGR will function as an interactive, user-friendly and all-in-one platform that accelerates genomic summary data interpretation by leveraging ontologies, networks and functional genomics.

DATA AVAILABILITY
The OpenXGR web server is easily accessible at http://www.openxgr.com, where a user manual providing step-by-step instructions on how to get started is also available (http://www.openxgr.com/OpenXGRbooklet/index.html). The source code is made available on GitHub at https://github.com/hfang-bristol/OpenXGR-site and Figshare at https://doi.org/10.6084/m9.figshare.22679284.v1. For added convenience, OpenXGR can also be accessed through the mirror site at http://www.genomicsummary.com/OpenXGR, along with the user manual at http://www.genomicsummary.com/OpenXGRbooklet/index.html.