MyGeneset.info: an interactive and programmatic platform for community-curated and user-created collections of genes

Abstract Gene definitions and identifiers can be painful to manage, more so when trying to include gene function annotations, as these can be highly context-dependent. Creating groups of genes, or gene sets, can help provide such context, but it compounds the issue, as each gene within the gene set can map to multiple identifiers and have annotations derived from multiple sources. We developed MyGeneset.info to provide an API for integrated annotations for gene sets suitable for use in analytical pipelines or web servers. Leveraging our previous work with MyGene.info (a server that provides gene-centric annotations and identifiers), MyGeneset.info addresses the challenge of managing gene sets from multiple resources. With our API, users readily have read-only access to gene sets imported from commonly used resources such as WikiPathways, CTD, Reactome, SMPDB, MSigDB, GO, and DO. In addition to supporting the access and reuse of approximately 180k gene sets from humans, common model organisms (mice, yeast, etc.), and less-common ones (e.g. black cottonwood tree), MyGeneset.info supports user-created gene sets, providing an important means for making gene sets more FAIR. User-created gene sets can serve as a way to store and manage collections for analysis or easy dissemination through a consistent API.


INTRODUCTION
Different types of gene-centric information can be accessed through numerous resources, many of which have their own identifiers for accessing information on each gene or gene product. For example, the identifiers for 'N-glycanase 1' include 55768 (NCBI Gene) (1), 610661 (OMIM) (2), Q96IV0 (UniProt) (3), etc. Further, each resource may include different annotations describing this gene or gene product, such as information on its molecular function (Gene Ontology) (4), chromosomal location (UCSC Genome Browser) (5), clinical significance (ClinGen) (6), and more (7–11). Ensuring up-to-date information on a specific gene can be time-consuming, as users would need to either continuously download and merge data from different resources or ensure up-to-date mappings of resource-specific identifiers in their maintenance pipelines.

Nucleic Acids Research, 2023, Vol. 51, Web Server issue W351
We have previously reported the creation of MyGene.info, an up-to-date RESTful gene annotation as a service API (Application Programming Interface), which helps simplify the maintenance of pipelines that pull and/or analyze data from gene-specific resources (12, 13). However, the functional effects of a gene are often context-dependent, and context can be difficult to capture, let alone make Findable, Accessible, Interoperable, and Reusable (FAIR) (14). A previous server, Tribe, allowed the capturing and sharing of collections of genes (gene sets) (15). However, it supported a subset of identifiers and did not plug into a widely used ecosystem of APIs. To create a FAIR gene-function-focused web server, we used the BioThings SDK (16) to create MyGeneset.info, which supports collaborative, context-storing, user-friendly access to collections of genes both via web and application programming interfaces.
MyGeneset.info can link both software and web-based analytical pipelines. A user might run an analysis on a BioThings-supported web server (16, 17) to generate a set of genes and then access that set through an R analysis. For example, a user could annotate a set of genes that are frequently mutated in a rare cancer. They could then query MyChem.info (https://mychem.info) (16) to identify drugs that target those genes. In parallel, they could analyze RNA-seq data using an R script and then rapidly examine the expression levels of genes within the gene set. Developers of curated resources can use the MyGeneset.info infrastructure to distribute collections within a well-used and highly interoperable ecosystem.

Public gene set data sources
MyGeneset.info aggregates multiple publicly available gene sets and provides one-stop access to these built-in, curated gene sets. The current list of data sources, as of March 2023, includes CTD (Comparative Toxicogenomics Database) (18), DO (Disease Ontology) (19), GO (Gene Ontology) (4), MSigDB (Molecular Signatures Database) (20), Reactome (7), SMPDB (Small Molecule Pathway Database) (21, 22) and WikiPathways (23). In total, 188650 curated gene sets (as of February 2023) have been collected from these seven data sources. As with other biomedical web service APIs we have built (e.g. the MyGene.info API), the BioThings SDK (16) package was used to build a data source 'hub' to monitor, download and aggregate underlying data (Figure 1). All curated gene sets and their associated gene information are updated in MyGeneset.info on a weekly to monthly basis, depending on their respective sources. During each data update, individual gene annotations from each associated gene set are normalized using the MyGene.info API (12, 13), with commonly used identifiers included. When the original data source uses a retired gene identifier, the identifier is either replaced, if the gene entity has been superseded by a new entity, or removed, if the gene entity has been retired permanently.
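The replace-or-remove rule for retired identifiers can be illustrated with a short Python sketch. This is not the hub's actual implementation: the lookup table `CURRENT_ID`, the placeholder identifiers `OLD_1` and `OLD_2`, and the choice to drop identifiers that are unknown to the table are all assumptions made for illustration. In the service, the mapping information is obtained from MyGene.info.

```python
from typing import Dict, List, Optional

# Hypothetical lookup table: maps an identifier found in a source file to its
# current identifier, or to None when the gene record was retired permanently.
CURRENT_ID: Dict[str, Optional[str]] = {
    "55768": "55768",  # still current (N-glycanase 1, NCBI Gene)
    "1017": "1017",    # still current
    "OLD_1": "1017",   # illustrative: old record replaced by a newer one
    "OLD_2": None,     # illustrative: retired with no replacement
}


def normalize_gene_ids(source_ids: List[str]) -> List[str]:
    """Replace superseded IDs, drop retired ones, and de-duplicate in order."""
    normalized: List[str] = []
    seen = set()
    for gene_id in source_ids:
        current = CURRENT_ID.get(gene_id)
        if current is None:
            # Retired permanently (or unknown to the table): remove it.
            continue
        if current not in seen:
            seen.add(current)
            normalized.append(current)
    return normalized
```

De-duplication matters here because an old and a new identifier for the same gene can both appear in one source file and would otherwise yield two entries after replacement.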

User submitted gene sets and annotations
MyGeneset.info also allows authenticated users, via their existing GitHub (https://github.com) or ORCID (24) account, to submit their own gene sets, e.g. from their own research studies or relevant literature. Users may set their gene sets to be either private or public, so that they can be shared when they are ready. The MyGeneset.info web application provides a user interface to help users annotate and standardize the individual gene annotations within a user gene set using the MyGene.info API. This is the same process by which curated public gene sets are annotated and updated, so users do not need to handle identifier conversion or the maintenance of up-to-date gene annotations.

Programmatic access via a web-service API
MyGeneset.info provides two primary ways to interact with the service to read and write data: a web-service API and its web application. The API is intended for users who require or prefer programmatic access to the service. The API can be used from a command-line interface or any programming language capable of standard HTTP requests. API access is useful for performing large-scale, batch, or complex analyses in an automated way. Developers may also utilize the API to build web applications that import gene sets and metadata automatically. The same API powers MyGeneset.info's own web application. The MyGeneset.info API is built upon the BioThings SDK package (16), which allows us to quickly build a set of web API endpoints with rich query features and scalable query performance. The BioThings SDK uses the Tornado web framework (http://tornadoweb.org) to create API query endpoints and process the queries passed to an underlying Elasticsearch (https://www.elastic.co/elasticsearch/) cluster, where all gene set data are stored and indexed.
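As a minimal sketch of what programmatic access over standard HTTP might look like, the Python example below builds a query URL and parses the JSON response using only the standard library. The base path `https://mygeneset.info/v1`, the `/query` endpoint, the field names `name`, `source` and `taxid`, and the `hits` key in the response are assumptions based on the conventions of other BioThings APIs. They should be checked against the service documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://mygeneset.info/v1"  # assumed base path of the public API


def build_query_url(term: str, fields: str = "name,source,taxid", size: int = 10) -> str:
    """Build a GET URL for the (assumed) /query endpoint."""
    params = {"q": term, "fields": fields, "size": size}
    return f"{BASE_URL}/query?{urlencode(params)}"


def search_genesets(term: str, **kwargs) -> list:
    """Run the query and return the list of hits (requires network access)."""
    with urlopen(build_query_url(term, **kwargs), timeout=30) as response:
        payload = json.load(response)
    # BioThings-style query responses are assumed to carry results under "hits".
    return payload.get("hits", [])
```

Calling `search_genesets("glycolysis", size=5)` would then return up to five matching gene set records as Python dictionaries. Separating URL construction from the network call keeps the request easy to inspect and to reuse from a command line tool such as curl.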

Web application
The MyGeneset.info web application is intended for users who prefer web-based access to the service. It provides all the same features as the API, but with a user-friendly graphical interface. It allows users to quickly and conveniently search, view and manage gene sets from any connected device. The web application is built as a standard single-page application (SPA) in Vue 3 (https://vuejs.org) and TypeScript (https://www.typescriptlang.org). Its implementation follows modern web development best practices, such as responsive design (i.e. works on desktop and mobile devices) and accessibility, as well as modern Vue-specific best practices, such as using the Composition API and script setup syntax.

Collecting and sharing publicly available annotated gene sets
MyGeneset.info currently provides read access to about 180k gene sets, improving the findability and reusability of gene sets from humans and from common and uncommon model organisms. As seen in Table 1, different resources have grouped genes into sets based on which chemicals they interact with (CTD), their involvement in specific diseases (DO), their co-localization (GO), their contribution or exclusion in biological processes or molecular functions (GO), their membership in pathways (Reactome, WikiPathways, SMPDB), and more. Users can search for gene sets based on keywords describing the gene or context (Figure 2A), or species from which the gene set was derived (Figure 2B). Although all publicly available gene sets are included by default, the user can also limit their search by type of gene set to include user-generated gene sets, anonymously submitted gene sets, or select from curated gene sets by source (Figure 2C). The resulting gene sets can be sorted based on the author, source, or number of genes in the gene set (Figure 2D). Users can further filter gene sets by number of genes in the gene set (Figure 2E) or by source by selecting 'Curated' (Figure 2C) and then unchecking sources to be removed from the results (Figure 2F). Hovering over a gene within a gene set will display additional information on the gene, including links out to other resources.

Figure 1. Publicly accessible, curated gene sets are structured as resource plugins in MyGeneset.info, which was built using the BioThings SDK (dotted box). Each plugin retrieves gene annotations from MyGene.info before uploading and merging the data. Users can submit queries via clients or RESTful queries to the API, or via the Vue-based browser interface. Users can build and submit gene sets via the browser interface (dashed arrows).

Collecting and sharing user-annotated gene sets
MyGeneset.info's user-friendly, browser-based gene set builder improves the accessibility of the web server to researchers with less programming experience while ensuring interoperability of the underlying data. The builder page guides users to generate important metadata for making their gene set findable (Supplemental Figure 1A), and researchers must search for genes to add, which enforces mapping of the genes included in the user-generated gene set (Supplemental Figure 1B). The interface allows users to search a batch of up to 100 genes at once for inclusion in the gene set. If a user logs in before submitting their gene set, they can edit it later. Otherwise, a user may submit gene sets without logging in (anonymously), or simply download their gene set locally (if they wish to keep the gene set private) (Supplemental Figure 1C).
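A researcher starting from a longer gene list could prepare it for the builder by splitting it into batches that respect the 100-gene search limit. The helper below is a hypothetical convenience function written for this illustration and is not part of MyGeneset.info.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 100  # per-search limit stated for the gene set builder


def batch_genes(genes: List[str], batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `genes`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(genes), batch_size):
        yield genes[start:start + batch_size]
```

Each yielded batch can be pasted into the multi-line search box in turn, so that a list of 250 symbols is handled as three searches of 100, 100 and 50 genes.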

Value proposition of MyGeneset.info for targeted user types
MyGeneset.info was first designed to address the needs of three types of users: biomedical researchers, bioinformatics (core) analysts, and bioinformatics resource developers (Table 2). MyGeneset.info's user-friendly web browser interface is meant to improve the findability, accessibility, and reusability of public gene sets, while the gene set builder is meant to improve the interoperability of user-generated gene sets and provide a GUI-based approach for sharing and maintaining user-generated gene sets. For example, a biomedical researcher may be interested in key genes suspected to be important to a biological process but lack the technical expertise to analyze these genes in the context of high-throughput genomic data. To ensure that their collaborator or bioinformatics core analyzes the correct genes, the researcher can leverage the MyGeneset.info website to search for genes (and filter by an organism of interest), add those genes into a collection, and share that collection either by submitting it to MyGeneset.info or downloading the gene set file in a format acceptable to the collaborator or bioinformatics core. A gene set built via MyGeneset.info will include metadata describing the gene set, an identifier for the gene set itself, and identifiers for each gene in the gene set. Because the gene set builder leverages the MyGene.info API, users can build gene sets from any species with genes annotated by NCBI Gene and Ensembl.
In contrast, a bioinformatics (core) analyst may be tasked with summarizing information for collections of genes from different researchers studying different biological processes in different animal models. Mapping and keeping track of each gene in each collection can be time-consuming, especially if different biomedical researchers provide these gene sets with different types of identifiers (or no identifiers altogether), and in different formats. By having the biomedical researchers use MyGeneset.info, the bioinformatics (core) analyst will be able to receive the gene sets in a consistent, easy-to-read, machine-friendly format either via a file sent from the biomedical researcher or by leveraging the API. For example, if a bioinformatics analyst needs Gene Ontology annotations for all genes in a gene set, they could easily access the gene set via MyGeneset.info's RESTful API using simple Python code, pull the identifiers for each gene in the gene set, and then query MyGene.info for additional gene-specific annotations (like GO annotations) (Supplemental Figure 2). Analysts concerned about version changes from the weekly updates of MyGeneset.info can cache the results locally using the BioThings Python client (Supplemental Figure 3). Similar to a bioinformatics (core) analyst, a bioinformatics resource developer can offload the gene search and identifier mapping of their resource to MyGene.info. With MyGeneset.info, the resource developer can enable users to query for collections of genes, and allow users to create, append, and maintain these collections of genes. Further, if the resource is implemented with OAuth (25), the bioinformatics resource developer could largely offload the implementation of storing and maintaining gene sets and users.
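The analyst workflow described above (retrieve a gene set, pull its gene identifiers, then request GO annotations from MyGene.info) can be sketched in Python as follows. This is an illustration and does not reproduce the code in the Supplemental Figures. The layout of the gene set document (a `genes` list whose entries carry an `ncbigene` field) is an assumption, and the batch request relies on the MyGene.info v3 gene endpoint accepting POSTed `ids` and `fields` parameters.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

MYGENE_URL = "https://mygene.info/v3/gene"  # MyGene.info batch annotation endpoint


def extract_gene_ids(geneset: Dict, id_field: str = "ncbigene") -> List[str]:
    """Collect one identifier type from the (assumed) `genes` list of a gene set."""
    ids = []
    for gene in geneset.get("genes", []):
        value = gene.get(id_field)
        if value is not None:
            ids.append(str(value))
    return ids


def build_go_request(gene_ids: List[str]) -> Request:
    """Prepare a POST request asking MyGene.info for the `go` field of each gene."""
    body = urlencode({"ids": ",".join(gene_ids), "fields": "go"}).encode("utf-8")
    return Request(MYGENE_URL, data=body, method="POST")


def fetch_go_annotations(gene_ids: List[str]) -> list:
    """Send the request and return the parsed JSON (requires network access)."""
    with urlopen(build_go_request(gene_ids), timeout=30) as response:
        return json.load(response)
```

Genes that lack the requested identifier type are skipped instead of raising an error, which suits gene sets in which some members could not be mapped to that identifier.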

DISCUSSION AND CONCLUSION
Biomedical researchers are constantly studying the functionality of genes and sets of genes, which require context to interpret; however, sharing context has traditionally been done via free text descriptions (i.e. publications). Given the current volume of scientific literature and its rate of growth, it is difficult to make gene function annotations Findable, Accessible, Interoperable, and Reusable. To do so, some resources have captured contextual information in the form of pathways or as collections of genes. Traditional resources that capture this context (like WikiPathways, CTD, DO and others) improve the FAIRness of gene function annotation by making their annotations available and/or allowing community curation of gene function annotations. Sacrificing some FAIRness for quality, community curation of gene function annotations by different resources has been limited in scope/topic and hampered by the curation process. For example, WikiPathways enables anyone to submit pathway-based gene sets, which may be useful for researchers studying specific pathways, but less useful for researchers studying polygenic diseases where a common pathway has not yet been elucidated. Other resources may curate based on other features but have a lengthier or more restrictive curation process. MyGeneset.info empowers users to share the context of their research on gene function as gene sets, improving the FAIRness of functional gene annotation.
In contrast to community curation, resources have done a phenomenal job making curated gene sets available; however, improvements in FAIRness can still be made. For example, Disease Ontology does not directly provide gene sets on its site; rather, DO curates diseases and xrefs to other resources for diseases. Generating disease-based gene sets from Disease Ontology requires traversing from Disease Ontology to OMIM, which can be challenging without programmatic expertise. Furthermore, resources that do provide gene sets may use different identifiers for the genes in the gene sets. For example, SMPDB uses UniProt IDs, while WikiPathways uses NCBI Gene IDs. By providing a unified and harmonized resource for gene sets, MyGeneset.info makes gene sets more Findable (users only have to search one resource as opposed to several), Accessible (a single user-friendly interface is available for interacting with gene sets), and Interoperable (the genes in the gene set are mapped and linked to multiple commonly used identifiers and can be exported using commonly used identifiers). By improving these three properties, MyGeneset.info aims to increase the Reusability of gene sets, thereby making them more FAIR.
MyGeneset.info is integrated with the BioThings API ecosystem (https://biothings.io), which has served >45 million requests in the past 30 days and allows for programmatic traversal from gene sets to genes to variants in those genes or drugs targeting those genes. For example, a user can now easily retrieve a DO-sourced gene set from MyGeneset.info, use the identifiers for the genes in the gene set to pull Gene Ontology annotations from MyGene.info, and perform an overrepresentation analysis using their preferred libraries to investigate processes involved in disease pathology. The user can also use MyGene.info to map the genes to PharmGKB identifiers to find chemicals which might perturb each gene, allowing the user to search for compounds that might treat the disease. These compounds could be searched in MyChem.info for additional properties which could affect their appeal as a potential treatment candidate. While not designed for extensive analysis on its own, MyGeneset.info's integration in the BioThings API ecosystem (including the BioThings Python client) makes it suitable for use in downstream analytical workflows or other web servers. Using the client, users can quickly fetch all gene sets available from MyGeneset.info for downstream processing.
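The overrepresentation analysis mentioned above is commonly carried out with a one-sided hypergeometric test, although any preferred library or statistic could be substituted. The Python sketch below computes the upper-tail probability using only the standard library. The function name and its parameter names are illustrative choices, and a real analysis would also correct for testing many GO terms at once.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, overlap: int) -> float:
    """Return P(X >= overlap) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background
    annotated:  background genes carrying the annotation (e.g. one GO term)
    sample:     number of genes in the gene set
    overlap:    gene set members carrying the annotation
    """
    lower = max(overlap, 0)
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(lower, upper + 1)
    )
    return tail / total
```

With a background of 10 genes, 4 of them annotated, and a gene set of 3 genes containing 2 annotated members, the tail sum is (6 × 6 + 4 × 1) / 120, so the function returns 1/3.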
MyGeneset.info is fully functional, has documentation, and both the API and front end are fully open source under a permissive Apache 2.0 license. Currently, the gene set creation interface supports the search/upload of up to 100 genes (via the multi-line input mode) in the gene addition process. While this may be sufficient for many biomedical researchers, increasing the batch size and including a means of uploading a list of >100 genes will improve the value of the resource for biomedical researchers and bioinformaticians alike. In the future, we hope to improve the gene set creation process by integrating similarity scores with existing gene sets, to minimize redundancy in user-submitted gene sets. Furthermore, we hope to incorporate features for discussing user-submitted gene sets, enhancing collaboration within teams of users.
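The similarity measure for this planned feature is not specified here. One simple candidate is the Jaccard index over gene identifiers, sketched below in Python. The function names and the 0.5 threshold are assumptions made for illustration and do not describe a planned design.

```python
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def most_similar(
    candidate: Iterable[str],
    existing: Dict[str, Iterable[str]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs for existing gene sets scoring at or above threshold."""
    candidate_set = set(candidate)
    scored = [(name, jaccard(candidate_set, set(genes))) for name, genes in existing.items()]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

A builder could show the returned matches before submission so that a user can reuse an existing gene set instead of creating a near-duplicate.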

Figure 2. The search and browse interface of MyGeneset.info allows users to search for gene sets by keyword (A) or species (B). Users can also search by type (C), including gene sets generated by the User, publicly accessible Curated gene sets filterable by source, Anonymously submitted gene sets, and All gene sets. The resulting gene set table can be sorted by author, source, and number of genes in the gene set (D). Gene sets can be filtered by the number of genes in the set (E) or by the source (F) if 'Curated' is selected in (C).

Table 1. Built-in gene sets from publicly available resources (as of 2023.03.24)
https://ctdbase.org/; (see site); chemical interactions; 22690
Disease Ontology (DO); https://disease-ontology.

Table 2. Feature list and value proposition of each feature for different types of users. Value of the feature if the user is a: