Fission stories: using PomBase to understand Schizosaccharomyces pombe biology

Abstract PomBase (www.pombase.org), the model organism database (MOD) for the fission yeast Schizosaccharomyces pombe, supports research within and beyond the S. pombe community by integrating and presenting genetic, molecular, and cell biological knowledge into intuitive displays and comprehensive data collections. With new content, novel query capabilities, and biologist-friendly data summaries and visualization, PomBase also drives innovation in the MOD community.


Introduction
Over the past decade, PomBase (www.pombase.org), the authoritative model organism database (MOD) for the fission yeast Schizosaccharomyces pombe, has supported the fission yeast research community by integrating and presenting all types of genetic, molecular, cell biological, and systems-level knowledge relevant to S. pombe. In addition to about 200 laboratories dedicated to fission yeast research, PomBase serves a growing number of users who focus on other organisms but rely on data generated in S. pombe for inferences about orthologous genes and conserved eukaryotic cell biology. As fission yeast research has grown in complexity and breadth of relevance, PomBase has kept pace, supporting emerging experimental and data-handling techniques, enabling researchers to ask novel questions, and driving innovation in the MOD community.
PomBase's primary aims remain to standardize, integrate, and display fission yeast research, to disseminate datasets and new knowledge to the wider scientific community, and to highlight the added value these efforts bring.
The core of PomBase consists of a growing body of comprehensive, reliable knowledge derived by manual curation of the fission yeast literature. Manual curation covers a wide range of data types, including molecular functions, biological processes, cellular locations, macromolecular complexes, phenotypes, alleles and genotypes, protein modifications, physical interactions, genetic interactions, DNA and protein sequence features, and orthologs in human and budding yeast (Saccharomyces cerevisiae). Annotations generated by computational methods supplement manually curated data for function, process, and location data represented using the Gene Ontology (GO; The Gene Ontology Consortium 2000Consortium , 2021. Table 1 summarizes annotation types and totals. Literature curation makes extensive use of ontologies, including GO, the Sequence Ontology (Eilbeck et al. 2005), the Protein Ontology (Natale et al. 2017), the Mondo Disease Ontology (developed by the Monarch Initiative; Mungall et al. 2017;Shefchek et al. 2020), and the protein modification ontology PSI-MOD (Montecchi-Palazzi et al. 2008). We have contributed many new classes, corrections, and other revisions to all of these ontologies. For example, since 2018 PomBase curators have raised over 500 issues in the GO Consortium (GOC)'s GitHub tracker for ontology structure and content (https://github.com/geneontology/go-ontol ogy/issues) and over 80 issues on the Mondo tracker (https:// github.com/monarch-initiative/mondo/issues). We have also pioneered annotation quality control methods that are being adopted throughout the GOC. Notably, we developed a method to use observed coannotation patterns to identify annotation outliers and to build rules that allow automated outlier detection ). Using this system, we have corrected thousands of annotations, and we collaborate with the GOC to continue rule development and to deploy a pipeline for error detection and reporting.
PomBase curators develop and maintain the Fission Yeast Phenotype Ontology (FYPO), a logically robust vocabulary that is designed for fission yeast but is also the leading candidate for further development into an ontology of cell-level phenotypes for all eukaryotes (Harris et al. 2013). FYPO currently comprises over 7500 terms, used in almost 97,000 annotations. FYPO development now uses the Ontology Development Kit (ODK; Matentzoglu 2021) for releases, which provides a release pipeline that seamlessly incorporates ontology reasoning, continuous integration checks, and generation of production files in Web Ontology Language (OWL) and Open Biomedical Ontologies (OBO) formats.
Our broad and deep ontology-based curation standardizes data from large-and small-scale publications, making a wide range of published data compatible with FAIR (Findable, Accessible, Interoperable, and Reusable; Wilkinson et al. 2016) data sharing principles. Furthermore, our field-leading community curation project actively engages researchers in building the PomBase collection of FAIR-shared biological knowledge. We have recently described the insights gained from the project, including our experience with approaches that maximize participation, and the unanticipated added value that arises from cocuration by publication authors and professional curators . To date, we have assigned over 1800 publications to authors for curation and have received over 975 submissions in response (54% response rate). Our online curation tool, Canto (Rutherford et al. 2014), has also been deployed for several other communities, including PHI-base (Urban et al. 2020), FlyBase (Larkin et al. 2021), and the new Schizosaccharomyces japonicus MOD JaponicusDB (see Rutherford et al. 2022).
Alongside its established, ongoing data stewardship activities, PomBase has introduced new features that enable biologists to integrate diverse molecular data into human-friendly summaries of biology. Below, we provide an overview of the new and updated features and describe how biologists can use new and existing data and tools to place their results into broader contexts. Taken together, PomBase activities give rise to a long-term collaboration with users to curate the knowledge gained from fission yeast research into an integrated overview of conserved cell biology to ensure that the resulting knowledge can be used to its full potential.

Querying PomBase
PomBase provides robust, intuitive search interfaces to enable biologists to easily retrieve and combine data of diverse types from multiple sources.

Simple search
A simple search, available in the header of every PomBase page, offers quick access to several commonly used PomBase pages.
The search finds gene pages matching fission yeast names, synonyms, systematic IDs, UniProtKB accessions (The UniProt Consortium 2020), or gene product descriptions. Human gene symbols (Tweedie et al. 2021) or S. cerevisiae gene names, ORF names, or IDs (Cherry et al. 2012;Engel et al. 2021) can also be used to find curated orthologs. Publication pages and ontology term pages can be retrieved using relevant IDs (PubMed or ontology IDs). Text searches generate autocomplete suggestions from matching gene product descriptions, ontology term names, and publication titles, author names, and dates.

Advanced search
Since the reimplementation of the website in 2017, PomBase has provided an advanced search facility (https://www.pombase.org/ query) that supports querying for gene sets based on a wide variety of criteria, including ontology annotations, gene product attributes (e.g., protein length, mass, domains, or modifications), genomic location, conservation, etc. Complex queries can be constructed by combining single queries in the query history. The history, with links to result sets, is stored locally in the user's web browser, and out-of-date results are highlighted and refreshed upon following result links. We have previously described links between ontology terms in query results, ontology term pages, and gene lists, which support data integration throughout the PomBase website .
Recent enhancements include improved phenotype querying, enhanced options for display and download of search results, and new tools for saving and sharing queries.
The search now provides an expanded set of query parameter options for phenotypes: experimental conditions can be used as search criteria for phenotype annotations, using the same condition descriptors as shown on PomBase web pages and in Canto. For example, a query can retrieve genes that show abnormal chromosome segregation mutant phenotypes specifically at high or low temperatures. Genes can also be selected based on the phenotypes of haploid or diploid, or single-or multilocus, genotypes. For single-locus haploids, the expression level can be specified.
The display of search results is now highly customizable. Each query in the history has a link to a page of results that shows the count, query details, and a list of matching genes. By default, systematic IDs, names, and gene product descriptions are shown. The display can be customized by choosing additional columns (including a selection of gene expression data), and the list can be sorted on any column.
Similarly, selected data can be downloaded for genes in the list, via a popup that offers three sets of options: • A "Tab delimited" option offers the same data as the results web page, for inclusion in a downloaded text file; • A "Sequence" tab retrieves amino acid or nucleotide sequences in FASTA format, with checkboxes to select which items are included in the headers. When "Nucleotide" is selected, flanking sequence options similar to those on the gene page are available; • A "GO annotation" tab downloads a file in GAF2.1 format (http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/) that includes direct annotations (but not those inferred by transitivity) for the selected branch(es) of GO.
To facilitate query reuse and sharing, entries in the PomBase advanced search query history now show brief, user-editable query descriptions, and a toggle to show or hide additional details. All result pages from the advanced search now have a unique permanent URL that can be bookmarked and shared with colleagues. A set of "Commonly used queries" uses this system to provide convenient shortcuts to frequently sought data, such as all genes with disease associations (see below) or all proteincoding genes of unknown biological role.

ID mapper
On a related note, we have developed an identifier mapper that retrieves S. pombe genes for a selection of different input ID types.
Users can now find S. pombe genes using UniProtKB accessions, and retrieve manually curated orthologs for S. cerevisiae using standard gene names or ORF names, and for human using standard gene names or HUGO Gene Nomenclature Committee (HGNC; Tweedie et al. 2021) identifiers. S. pombe genes found via the ID mapper can be sent to the advanced search with a single click.

From data to biological narratives
Much of PomBase's recent work helps further our goal of enabling researchers to take data found in PomBase-from their own results combined with those of others-and build narratives that elucidate a biological topic with broad applicability.

Intuitive annotation displays
In the PomBase reimplementation, we introduced display filtering on gene, publication, and ontology term pages for phenotype annotation tables, to narrow the list by broad phenotypic category, and (in the detailed view) by evidence. We have now added filtering for more annotation types and expanded the set of filtering options. Display filtering is now available for all three branches of GO, and for qualitative or quantitative gene expression annotations. For GO and gene expression annotations, detailed views now offer filters for observations made during specific cell cycle phases, which use annotation extensions (Huntley et al. 2014) specifying the relevant phase. All of the above annotation types, as well as genetic and physical interactions, can also be filtered to distinguish between high-vs lowthroughput experiments. The display of protein features on gene pages has been updated to improve legibility and provides an interactive graphical view in which mousing over any feature shows additional details in a tooltip and highlights the full entry in the accompanying table.
For ease of use, we have improved the appearance of PomBase pages on small screens such as tablets and smartphones.

Comprehensive knowledge representation
To ensure that coverage of S. pombe research is current and complete, we organize annotation reviews by biological process. We have recently comprehensively reviewed and updated annotations in three broad areas of biology: tRNA metabolism, mitochondrial biology, and transmembrane transport. Because these topics are not intensively studied in fission yeast, the updates primarily affect GO annotations inferred from orthologs. We have also reviewed a fourth topic, chromatin silencing, to bring annotations into alignment with substantial, ongoing revisions to chromatin-mediated transcriptional regulatory functions and processes in GO.
The S. pombe GO annotation corpus also now includes a set of annotations from the GOC's Phylogenetic Annotation and INference Tool (PAINT) pipeline (Gaudet et al. 2011), which propagates GO annotation across all species based on protein family membership. Before incorporating PAINT annotations into PomBase, we reviewed the predictions made for S. pombe and reported errors to the GOC for over 500 protein families, improving the accuracy of the PAINT annotation corpus for all species. Like all GO annotations imported into PomBase from external sources, PAINT annotations are filtered for redundancy with experimentally supported annotations .
Qualitative gene expression annotations now use an expanded set of descriptors, accommodating ribosomal density data as well as more precisely defined RNA and protein level terms.
To ensure coverage of high-throughput experiments, we have continued to build the collection of data tracks, and associated well-curated metadata, in the PomBase JBrowse (Buels et al. 2016) instance; the browser now hosts 330 tracks from 28 publications. To improve browser data visibility, gene pages now display JBrowse images that have clickable features and a link to open the fully functional browser in a new tab or window. Data tracks from datasets hosted in the PomBase genome browser can also now be browsed, selected, and loaded from their respective publication pages.
PomBase is now a Partner Database with microPublication, whose remit is to publish "brief, novel findings, negative and/or reproduced results, and results which may lack a broader scientific narrative" (https://www.micropublication.org/; Raciti et al. 2018). This partnership thus enables fission yeast researchers to publish individual experimental results that have not been included in traditional publications, thereby building a collection of S. pombe data that fill gaps in available datasets.
Publication pages are now also available for reviews and methods papers as well as microPublications.

Ontology slims
Ontology subsets can provide overviews of annotation across a gene set or the whole genome. PomBase has long provided the fission yeast Biological Process (BP) GO slim, a subset of the GO BP ontology designed to cover as many annotated gene products as possible, while remaining informative about the gene product's physiological role in the cell Wood et al. 2019). To complement the BP slim, we have now created three new ontology subsets. The fission yeast Molecular Function (MF) and Cellular Component (CC) slims summarize the MF and CC branches of GO, respectively; together, all three GO slims provide a simple yet comprehensive summary of S. pombe's biological capabilities by grouping gene products using broad classifiers from the full breadth of GO.
The third new ontology subset is drawn from Mondo and gives an overview of genes with human orthologs implicated in disease. To accompany the Mondo slim, we have improved coverage for gene curation that associates disease descriptors with fission yeast orthologs of human disease-causing genes; over 1400 S. pombe genes now have curated disease associations. PomBase curators also collaborate with Mondo to improve its disease classification, especially in areas relevant to fission yeast diseasegene associations.
All of the PomBase ontology slims are integrated into annotation displays and querying. For each slim, a summary page lists the terms and IDs with links to ontology term pages and provides a genome-wide overview of annotated genes. In addition to the number of genes annotated to each slim term, the slim page identifies sets of genes that are not annotated to any term in the ontology or are annotated only to terms that do not have paths in the ontology to any slim term. On gene pages, annotation tables show slim terms applicable to a gene in the headers. On all pages where they appear, gene lists are linked to the advanced search, allowing the gene list to be combined with any other results in a new query. Slim annotations can also be retrieved for any advanced search result list.
We also maintain an up-to-date list of protein-coding S. pombe genes that are broadly conserved in eukaryotes (present in vertebrates), but have not been assigned an informative role from the GO BP slim (https://www.pombase.org/status/priority-unstudiedgenes; Wood et al. 2019).

Quick Little Tool
The Quick Little Tool (QuiLT) is a new feature that allows users to view multiple types of annotation in a single graphical display. Inspired by our analysis of conserved unstudied proteins (Wood et al. 2019), QuiLT generates a graphic for any gene list uploaded or obtained in advanced search results. QuiLT visualization is also linked to PomBase pages that list genes annotated to an ontology term, and on the "Priority unstudied genes" page (see above). QuiLT has display options for deletion viability, presence or absence of budding yeast orthologs, presence or absence of human orthologs, annotation from each branch of GO, characterization status for protein-coding genes, taxonomic distribution, and protein length. The display is interactive, allowing users to highlight subsets of the list, filter the display, toggle annotation types on and off, reorder the list to focus on features of most interest, and download the image. Figure 1 shows QuiLT visualization of genes that are conserved in vertebrates and were found to be associated with chronological lifespan in a recent study (Romila et al. 2021). The authors compared their list of long-lived mutants with all genes previously associated with the phenotype 'increased viability in stationary phase' (FYPO:0001309) to uncover novel aging-associated genes; among the latter genes, they then readily identified conserved genes using the catalog of conserved unstudied genes described above, as well as genes associated with diseases in human.

Pathways
A new gene page section, "Molecular pathway," connects genes in PomBase to depictions of biochemical and signaling pathways. At present, this section is shown for any gene that appears in a pathway entry in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa and Goto 2000;Kanehisa 2019;Kanehisa et al. 2021), linking to the relevant KEGG page(s) and to a PomBase page listing all genes connected to the pathway. The Molecular pathway section will figure prominently in the future development of data integration in PomBase (see below).

Future directions
We will continue all of the well-established activities described above, ensuring that PomBase presents comprehensive, qualitycontrolled, curated knowledge from large-and small-scale S. pombe experiments. In addition, a diverse set of enhancements is under development.
To improve the accuracy and coverage of sequence feature annotation, a comprehensive overhaul of 5 0 and 3 0 untranslated region lengths is in progress, and we will gradually add observed transcript isoforms to more genes. The whole-genome sequence will also be updated to fill gaps, correct errors, and incorporate new telomeric sequence data, and resubmitted to the International Nucleotide Sequence Database Collaboration (INSDC; https://www.insdc.org/) databases.
We will continue to enhance PomBase's simple and advanced search tools with new features and capabilities, such as better handling of genes with multiple transcript isoforms, improved querying for phenotype conditions, and import/export of query histories. We also plan to investigate options, such as user accounts, to enable researchers to configure PomBase page displays, browser settings, etc. and to make saved settings and query histories more portable.
Most importantly, we will focus effort on building richer, more comprehensive connections among curated data of all types. Notably, we will adapt S. pombe GO annotations to follow the "GO Causal Activity Modeling" principles established by the GOC for GO-CAMs (Thomas et al. 2019), using our large, detailed body of existing GO annotations. Manually curated S. pombe GO annotations already include a rich set of extensions that capture reaction substrates, effector-target connections, high-confidence physical interactions, complexes, and regulatory effects. Converting these extended annotations to GO-CAM models will improve PomBase's representation of how protein activities are connected into pathways and how pathways are connected to each other. Pathway diagrams generated by software using GO-CAM data will be shown in the Molecular pathway section of gene pages. Furthermore, by integrating GO-CAM-based pathways with curated information about which genes and pathways relate to human diseases, and what proteins remain unstudied across species, PomBase will enable users to develop an emergent understanding of cell biology relevant throughout the eukaryotes. Biological stories crafted from fission yeast data will thus shed light on all biological systems.

Data availability
PomBase has used open-source tools and code to develop a modular, customizable, and readily reusable system that supports daily data updates and an intuitive web-based user interface . All PomBase code is available from the PomBase GitHub organization (https://github.com/pombase), where each major aspect of the project-including curation, website, Chado database, FYPO, and Canto-has a dedicated repository.
In addition to the web page displays and query result downloads described above, PomBase data can be downloaded in bulk from the website (see https://www.pombase.org/datasets and https://www.pombase.org/data). We have recently switched all downloads to use the HTTPS protocol, superseding FTP. Available downloads include nightly dumps and monthly snapshots of the entire Chado curation database, and a range of specific curated datasets, including GO annotations, single-allele phenotypes, protein modifications, high-confidence physical interactions, and manually curated ortholog lists.