Protein Ontology (PRO): enhancing and scaling up the representation of protein entities

The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.


OVERVIEW
It has long been known that the final product of a gene is inherently more complex than the gene itself (1). A single gene--even within an individual organism--can yield multiple possible specific proteoforms (the precise molecular form of a protein arising from alternative splicing events and post-translational modifications) with multiple possible functions. Proteins can also be combined in a variety of different ways within the context of multisubunit complexes. In contrast to this increase in specificity, we can consider, more generally, cases of related sets of proteins--groups of proteins that evolved via duplication and/or speciation events of a common ancestral gene (homologs).
The Protein Ontology (PRO; see Table 1 for additional abbreviations) provides a flexible way to refer to protein entities at any such level of specificity, as generically or as precisely as needed (2). PRO organizes these entities into classes describing proteins derived from homologs ('family level' classes), from a single gene ('gene level' classes), from a single transcript ('sequence level' classes), or from a set of modifications ('modification level' classes). Each of these categories of classes is neutral with respect to taxonomy, but there are also taxon-specific versions (e.g. 'organism-gene level'), thus allowing PRO to highlight connections and differences within and across species. In this pursuit, PRO obtains knowledge from existing resources (e.g. databases and literature), normalizes the information contained within, and provides mechanisms for information query and analysis. The normalized identifiers for protein entities at multiple levels of specificity provide appropriate targets for annotation of scientific papers. PRO is designed to facilitate the discoverability and aggregation of data relating to protein entities. Here, we describe new developments designed to further these goals, including steps taken to improve expert and automated interaction with the data, and steps taken to improve coverage and ensure scalability.

Standardized representation of proteoforms
Comparison of information from disparate resources requires the ability to discover entries that are meant to refer to identical entities. These are sometimes described in different ways. For example, Reactome (3,4) specifies the p25 form of CDK5 regulatory subunit 1 (CDK5R1(99-307), http://www.reactome.org/content/detail/R-HSA-6805262) by referring to the coordinates on the sequence given in the corresponding UniProt KnowledgeBase (UniProtKB) (5) entry (http://www.uniprot.org/uniprot/Q15078). However, since UniProtKB also assigns a 'feature identifier' to subsequences when a protein is cleaved in some way, it is possible to refer to that identifier instead (identifiers for subsequences are of the form 'PRO_<10 digits>' which is not to be confused with Protein Ontology identifiers, some of which are 'PR:<9 digits>' in the OBO representation and 'PR_<9 digits>' in the OWL representation). Unlike Reactome, the IntAct Complex Portal (6) specifies that same p25 form by referring to the UniProtKB feature identifier (Q15078-PRO_0000004795, cited in http://www.ebi.ac.uk/intact/complex/details/EBI-9633559). For PRO to map or import such entries, we need first to normalize these descriptions. Compounding the issue is the need to account for position-specific post-translational modifications (PTMs). We thus developed a standard syntax to indicate positions of post-translational amino acid modification on sequences or subsequences, and for specific isoforms. Our standard is similar to that used by Reactome (7) but differs in some minor details. We use UniProtKB as a source of sequence information, indicating the accession, the isoform identifier, and the subsequence range (e.g. when a proteoform has been generated via signal peptide removal or other chain cleavage). We then add the PTM information in a specific order based on the position of the modified residue nearest the N terminus, but grouped by type of modification given as a PSI-MOD (8) identifier.
Additional types of modifications are then listed after a pipe character, again in sequence order, as shown in Figure 1. Having a standard representation enables us to map existing PRO terms to terms from other resources, a necessary first step to integrating the salient information in such resources into PRO.
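To make the syntax concrete, the following is a minimal Python sketch that assembles a proteoform string from the parts named in the Figure 1 legend: accession, optional isoform, optional subsequence range, and modification blocks ordered by their N-terminal-most residue. It is not the PRO implementation. The function name format_proteoform, the absence of spaces around delimiters, the choice to keep an empty range field when modifications follow, the placeholder accession XXXXXX, and the pairing of example sites with the PSI-MOD identifiers shown are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Site = Tuple[str, int]             # (residue, position), e.g. ("Ser", 15)
ModBlock = Tuple[str, List[Site]]  # (PSI-MOD identifier, modified sites)


def format_proteoform(
    accession: str,
    isoform: Optional[int] = None,
    subseq: Optional[Tuple[int, int]] = None,
    mods: Sequence[ModBlock] = (),
) -> str:
    """Build a proteoform string (illustrative delimiters, see lead-in)."""
    # Sequence block: accession, with the optional isoform after a dash.
    seq = accession if isoform is None else f"{accession}-{isoform}"
    fields = [seq]
    if subseq is not None or mods:
        # Assumption: the range field is kept (possibly empty) whenever
        # modification blocks follow, so field positions stay unambiguous.
        fields.append("" if subseq is None else f"{subseq[0]}-{subseq[1]}")

    blocks = []
    # Blocks are ordered by their N-terminal-most modified residue.
    for psi_mod, sites in sorted(mods, key=lambda m: min(p for _, p in m[1])):
        ordered = sorted(sites, key=lambda s: s[1])
        residues = "/".join(f"{aa}-{pos}" for aa, pos in ordered)
        blocks.append(f"{residues},{psi_mod}")

    text = ",".join(fields)
    if blocks:
        # First block follows a comma; later blocks follow a pipe.
        text += "," + "|".join(blocks)
    return text


# The p25 form of CDK5R1 discussed above: residues 99-307 of Q15078.
p25 = format_proteoform("Q15078", subseq=(99, 307))

# A hypothetical modified proteoform (placeholder accession and sites).
example = format_proteoform(
    "XXXXXX",
    isoform=1,
    subseq=(1, 100),
    mods=[
        ("MOD:00046", [("Ser", 20), ("Ser", 15)]),
        ("MOD:00064", [("Lys", 5)]),
    ],
)
```

Because two resources that describe the same proteoform would yield the same string under such a scheme, mapping between them reduces to a direct string comparison.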

Addressing scalability
An ongoing problem with any curated resource is the ability to keep up with advances of scientific knowledge. This has led to increased reliance on computational means to supplement or aid expert curation. For example, PRO imports information from UniProtKB to create terms for selected species (2), and integrates information from the OMA (9) and PANTHER (10) databases to find connections between such terms. However, while such resources are well-suited for gene-level terms, we must turn to other resources to obtain increased information about specific proteoforms containing post-translational modifications. We previously described (2) the import of data from the Top-Down Proteomics Repository (11) and other resources. We now add the following data integration pipelines:
1) iPTMnet: We recently developed iPTMnet, a resource that integrates PTM information from several expert curated databases and the results of full-scale text mining of PubMed abstracts with RLIMS-P (12) that have been linked to a sequence and that have a validated phosphorylation site (Ross, K.E., Huang, H., Ren J. et al. 2016, in press). iPTMnet also includes phosphorylation dependent protein-protein interactions detected by a second text mining tool, eFIP (13). Taking advantage of iPTMnet, we implemented a fully automated workflow for creation of PRO terms for proteoforms where single phosphorylation sites are described by multiple sources (Ross, K.E., Natale, D.A., Arighi, C.N. et al. 2016, in press). We focused on forms phosphorylated on a single site only in our automated workflow because it is challenging, even manually, to curate proteoforms phosphorylated on combinations of multiple sites. The text mining results provided the literature evidence to distinguish between forms with single and multiple phosphorylation sites. We filtered out cases where the abstract mentioned multiple modification sites or PTM types in addition to phosphorylation (e.g. acetylation). We also filtered out cases that were already curated by PRO. After these filtering steps, ∼820 substrate-site pairs remained. These were used to automatically generate PRO entries using a template; these entries were flagged with an 'unreviewed' status. Annotation, including kinases and interacting partners that were extracted from iPTMnet, will be added to the terms only after expert review. The iPTMnet pipeline increased the number of organism-specific PTM proteoforms in PRO by 50%.

Figure 1. Standard representation of a proteoform. Proteoforms are represented using a standard format as annotated, consisting of a sequence block and one or more optional modification blocks. Sequence blocks consist of a UniProtKB accession with an optional isoform indicator separated by a dash, followed by a comma (first arrow) and optional subsequence range. Modification block 1, if specified, will follow a comma, and all other modification blocks will follow a pipe (second arrow). Each modification block is presented in order based on the N-terminal-most amino acid modified. Within a modification block are one or more amino acids listed by type and position, with multiples separated by slashes, followed by the PSI-MOD identifier specifying the type of modification. When an isoform is specified, N-terminal and C-terminal positions of subsequences as well as positions of modification are relative to the full length of that isoform; otherwise the numbering for the representative sequence is assumed. Only the accession is required. Missing subsequence indicates that the class encompasses either multiple species or multiple isoforms. Missing modification blocks with a subsequence indicates that the class is defined by subsequence only (such as when the only distinction is that a signal peptide has been removed).

2) Reactome: Reactome is a pathways resource referenced by a number of projects such as Open Targets (https://www.opentargets.org/), Chemical Entities of Biological Interest (ChEBI) (14), and UniProt. Mapping the reactants described in PRO to those in Reactome (and vice versa) will afford far greater impact than each in isolation. Using the standardized representation for proteoforms described above, we 'translated' proteoforms described in Reactome to the same representation by automated means. We made direct string-match comparisons to do an initial mapping (∼6300 Reactome proteins were already in PRO; most mapped to the gene level). We then created a limited set of new terms, the majority of which were proteoforms of the post-translationally processed sort (for example, amino acid modification, removed signal peptide). This resulted in a total of nearly 12 000 PRO-Reactome mappings (covering ∼60% of Reactome proteoforms).
3) HIstome (15): HIstome provides a compendium of modifications to human histones, and includes the following minimum information needed to create a PRO term: UniProtKB accession (including isoform, if known), modification type (e.g. acetylation), modified amino acid and position, and experimental evidence (in the form of a PubMed reference). The data were downloaded and converted by a script to generate PRO stanzas defining each term.
In a first pass, we generated 468 HIstome-evidenced PRO terms asserting that there is at least one modification on the molecule with a known position.
4) Dynamic generation of terms: A number of projects have need for PRO terms. Up to now such requests were filled fully manually. To keep up with growing demand for term requests, we have added the capability for terms of a certain type (specifically, gene-level and sequence-level terms) to be generated dynamically. We previously described our reuse of UniProtKB accessions whenever we provide an ontological representation of proteins described in that database (2). UniProtKB-derived terms can often be defined as 'a protein that is a translation product of <some specific gene> in <some specific organism>.' Since such definitions require only information that is available from the relevant UniProtKB entry, they can be generated by using the UniProt web service to return data on a single entry, followed by processing of that data to create a PRO term. It is important to note that while it is possible to use a persistent URL (for example, http://purl.obolibrary.org/obo/PR_E1BE92) or the specific-entry retrieval service on the main PRO web page to reference or find such terms, they will not be stored in the underlying PRO database. Thus, dynamically generated terms cannot be searched on the main page, will not be present in the downloadable PRO files or visualization of the PRO hierarchy, and will not be obtainable via SPARQL query. Nonetheless, often only a landing page is needed, and full integration of a term can be requested as described below. We plan to enhance this service with the ability to quickly request permanence of generated terms directly from the landing page, and the ability to use accessions from protein databases other than UniProtKB. We will also integrate ortholog or family information into the results.
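As an illustration of the dynamic generation step, the sketch below fills the definition template quoted above from an already-retrieved entry and derives the persistent URL from the accession. It is a hypothetical outline only: the retrieval from the UniProt web service is omitted, and the function names, the dictionary keys, and the gene and organism placeholders are assumptions rather than part of the PRO service.

```python
from typing import Dict

PURL_BASE = "http://purl.obolibrary.org/obo/"


def pro_purl(accession: str) -> str:
    """Persistent URL for a UniProtKB-derived PRO term."""
    return f"{PURL_BASE}PR_{accession}"


def make_dynamic_term(entry: Dict[str, str]) -> Dict[str, str]:
    """Create a minimal term record from one UniProtKB-like entry.

    `entry` is assumed to hold 'accession', 'gene' and 'organism' keys
    that were extracted beforehand from the database record.
    """
    accession = entry["accession"]
    return {
        "id": f"PR:{accession}",
        "purl": pro_purl(accession),
        "definition": (
            "A protein that is a translation product of "
            f"{entry['gene']} in {entry['organism']}."
        ),
    }


# Placeholder values; not taken from any real database entry.
term = make_dynamic_term(
    {"accession": "XXXXXX", "gene": "GENE1", "organism": "Example organism"}
)
```

Since every field is derived from the entry itself, nothing needs to be stored between requests, which is consistent with generated terms not being present in the underlying database.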

Addressing community needs
Community collaboration is an essential feature of PRO, as it can point to new areas of development or to specific targets for curation. Below we describe a few examples.
1) ImmPort (16): The Protein Ontology has been collaborating with the Immunology Database and Analysis Portal (ImmPort) project on the development of the ImmPort Antibody Ontology (AntiO). AntiO is an ontology that represents monoclonal antibodies, in particular those in common use in immunology research and ImmPort clinical studies. In AntiO, targets of monoclonal antibodies are identified via Protein Ontology terms, including proteoform terms for phosphorylated proteins, and protein isoforms. Over 900 antibodies are represented in AntiO, linked to their respective protein targets via expert curation. AntiO has been loaded into a triple store to allow complex queries for antibodies and antibody products based on the targeted proteoforms, antibody names, species specificity, experimental usage, and antibody product vendors, catalog numbers, and fluorochrome conjugations. We plan to use this information to make links between PRO proteoforms and the antibodies that specifically bind those proteoforms.
2) MGI: The Mouse Genome Informatics (http://www.informatics.jax.org/) group will henceforth be using Noctua (http://noctua.berkeleybop.org/), a graphical common annotation tool for gene product curation. As part of that migration, all PRO entities mapped to mouse genes will be loaded into the tool's database and made available as direct annotation objects. These mappings will be updated on a daily basis. In addition, any isoform identifiers associated with annotations imported into MGI from other sources will be converted into their PRO equivalents, and thus be normalized across all levels of specificity. This also benefits PRO in that MGI curators will be able to flag a paper for potential use in

Enhanced web pages for PRO terms
We previously described a web-based comparative view for PRO terms corresponding to all proteins derived from a single gene in a single organism (2). The display contained information not only about the term itself, but also about the term's subclasses and any associated annotation. We now apply that same type of view to terms that correspond to a given gene across multiple species. Furthermore, we have enhanced the display by adding new features. To facilitate a more direct comparison between proteoform sequences and positions of post-translational modification, we now include an annotated sequence alignment. An example featuring the mitotic checkpoint protein BUB1B (PR:000004855) is shown in Figure 2. The Interactive Sequence View panel (Figure 2A) displays a multiple sequence alignment of BUB1B proteoforms across organisms. Experimentally determined modifications are highlighted in color (e.g., phosphorylation sites are pink) and potentially modifiable sites in other sequences that align with the experimentally verified sites are highlighted in gray. For example:
• Thr-608 of human BUB1B (Figure 2A, blue rectangle at right) is phosphorylated in the proteoform hBUB1B/Phos:3 (PR:000035432) but not in the two other human BUB1B phosphorylated proteoforms shown. The aligned residue in frog BUB1B, Thr-593, is phosphorylated in the proteoform frogBUB1B/Phos:3 (PR:000035433).
clicking on the magnifying glass icon, and customize which sequences are shown by clicking on the 'Select/align proteoforms across species' link.
The Protein Forms table--previously just an unsorted flat list--has been enhanced (Figure 2B). Terms in the table are now organized according to their positions in the PRO hierarchy, and branches of the hierarchy are expandable/collapsible so that users can focus on terms of interest. Clicking on a term's PRO identifier takes the user to the web page for that term. An orange square next to a PRO identifier indicates that the term has functional annotation, and clicking on the square will take users to the section of the Functional Annotation table showing the GO terms associated with that proteoform (not shown, but described in (2)). A green square next to a PRO ID in the Protein Forms table indicates that the proteoform is found in one or more protein complexes. Clicking on the green square will take users to a table (Figure 2C) that lists the proteoforms and their corresponding complexes.
We have also developed new web pages for organism-specific complexes that provide detailed information about the complex subunits. As shown in Figure 3

PRO OWL
From its inception PRO has been distributed using the standard OBO Foundry 'OBO' format (http://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html). This format has the benefit of being human readable. However, for purposes of reasoning and W3C conformity, we have added a distribution file in the OWL format (https://www.w3.org/OWL/) and are transitioning to using the OWL version as the main distributable. That format is more readily consumed by ontology query resources such as BioPortal (http://bioportal.bioontology.org/) and OntoBee (http://www.ontobee.org/).

Pre-reasoned PRO
Two of the principal benefits of ontologies are the ability to perform internal consistency checks and to perform automated classification of terms. Each of these benefits is afforded by reasoning--an analysis of the assertions made within an ontology to make inferences about additional relationships. To use a simple example, if a protein is defined as having a phosphorylated serine, it would be classified after reasoning as a phosphoprotein, even if such was not directly stated. Reasoning can thus add additional parents to each term as appropriate. Conversely, if a protein was defined as specifically lacking any phosphorylation, yet was mistakenly classified as a phosphoprotein, reasoning would point out this inconsistency. To make such benefits available to users, we provide a version of PRO that has gone through the reasoning process. Reasoning over the latest release of PRO (v50.0) indicates there are 20 equivalencies, >80 000 new axioms and no logical inconsistencies. The reasoned version is now the default download using the links http://purl.obolibrary.org/obo/pr.obo and http://purl.obolibrary.org/obo/pr.owl.
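The two effects of reasoning described above, added parents and detected inconsistencies, can be mimicked with a toy example. The Python sketch below is not an OWL reasoner and does not reflect how PRO is reasoned over; the single hard-coded rule, the data layout, and the term names are hypothetical and serve only to show the kind of inference involved.

```python
from typing import Dict, List, Set, Tuple

Term = Dict[str, object]


def classify(terms: Dict[str, Term]) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Apply one toy rule: a phosphorylated protein is a phosphoprotein.

    Returns the parents of each term after inference, plus the names of
    terms whose asserted parents contradict their definition.
    """
    inferred: Dict[str, Set[str]] = {}
    conflicts: List[str] = []
    for name, term in terms.items():
        parents = set(term.get("parents", ()))
        phospho = term.get("phosphorylated")
        if phospho is False and "phosphoprotein" in parents:
            # Defined as lacking phosphorylation, yet asserted otherwise.
            conflicts.append(name)
        if phospho is True:
            # Parent added by inference even though it was never stated.
            parents.add("phosphoprotein")
        inferred[name] = parents
    return inferred, conflicts


# Hypothetical terms, not taken from PRO.
toy_terms = {
    "protein A phosphorylated form": {
        "parents": ["protein A"],
        "phosphorylated": True,
    },
    "protein B unmodified form": {
        "parents": ["protein B", "phosphoprotein"],
        "phosphorylated": False,
    },
}
inferred_parents, inconsistent = classify(toy_terms)
```

In a real ontology the same outcome follows from logical definitions and disjointness axioms evaluated by a reasoner rather than from a hand-written rule.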

SPARQL endpoint
We have built a resource description framework (RDF) linked data repository that includes the information from the current principal PRO ontology file pro_reasoned.obo and the associated PRO annotation file PAF.txt. On this basis we have developed a SPARQL (http://www.w3.org/TR/sparql11-overview/) endpoint server for PRO using the open source edition of OpenLink Virtuoso (http://virtuoso.openlinksw.com). This allows our users to query against PRO data following the W3C SPARQL specification (http://www.w3.org/TR/sparql11-query/). A user can retrieve terms, subclass, and functional annotation from PRO, with or without inference. The PRO SPARQL endpoint is accessible from http://proconsortium.org/pro/pro_sparql.shtml. Query results are reported in a user-selectable common data exchange format. We have developed sample queries (shown on the web page referenced above) to guide our users in building their own queries. The queries include those that return either direct or all subclasses (7 or 85 terms, respectively) of TGF-β superfamily receptor type-1 (the latter without or with (180 terms) HermiT reasoner-based (http://www.hermit-reasoner.com) forward chaining), one that returns the functional properties of a PRO term, and one that takes advantage of a federated query to retrieve information from PRO and UniProtKB. Figure 4 presents the query and results of one of the samples.
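For readers building their own queries, the following Python sketch shows how the distinction between direct and all subclasses might be written using a SPARQL 1.1 property path. It is not one of the sample queries from the PRO web page. The assumptions are that subclass links are stored as rdfs:subClassOf triples between OBO-style IRIs, that labels use rdfs:label, and that the endpoint address passed to request_url is a placeholder; the BUB1B identifier is used only because it appears earlier in the text.

```python
from urllib.parse import urlencode

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OBO = "http://purl.obolibrary.org/obo/"


def subclass_query(term_id: str, transitive: bool = False) -> str:
    """Build a query for the direct or all subclasses of a PRO term.

    `term_id` uses the OWL form of the identifier, e.g. 'PR_000004855'.
    """
    # 'rdfs:subClassOf+' follows one or more subclass links
    # (a SPARQL 1.1 property path); the plain predicate follows one.
    path = "rdfs:subClassOf+" if transitive else "rdfs:subClassOf"
    return (
        f"PREFIX rdfs: <{RDFS}>\n"
        f"PREFIX obo: <{OBO}>\n"
        "SELECT DISTINCT ?term ?label WHERE {\n"
        f"  ?term {path} obo:{term_id} .\n"
        "  OPTIONAL { ?term rdfs:label ?label }\n"
        "}"
    )


def request_url(endpoint: str, query: str) -> str:
    """Encode a query as a GET request, as in the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})


direct = subclass_query("PR_000004855")
everything = subclass_query("PR_000004855", transitive=True)
# Placeholder endpoint; substitute the address given on the PRO page.
url = request_url("http://example.org/sparql", direct)
```

The property path only traverses asserted subclass links; results that depend on reasoner-based inference, as in the 180-term case above, would require the endpoint's own inference settings.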