HAMAP as SPARQL rules—A portable annotation pipeline for genomes and proteomes

Abstract

Background
Genome and proteome annotation pipelines are generally custom built and not easily reusable by other groups. This leads to duplication of effort, increased costs, and suboptimal annotation quality. One way to address these issues is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation.

Results
Here we demonstrate one approach to generate portable genome and proteome annotation pipelines that users can run without recourse to custom software. This proof of concept uses our own rule-based annotation pipeline HAMAP, which provides functional annotation for protein sequences to the same depth and quality as UniProtKB/Swiss-Prot, and the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language). We translate complex HAMAP rules into the W3C standard SPARQL 1.1 syntax, and then apply them to protein sequences in RDF format using freely available SPARQL engines. This approach supports the generation of annotation that is identical to that generated by our own in-house pipeline, using standard, off-the-shelf solutions, and is applicable to any genome or proteome annotation pipeline.

Conclusions
HAMAP SPARQL rules are freely available for download from the HAMAP FTP site, ftp://ftp.expasy.org/databases/hamap/sparql/, under the CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license. A tutorial and supplementary code to use HAMAP as SPARQL are available on GitHub at https://github.com/sib-swiss/HAMAP-SPARQL, and general documentation about HAMAP can be found on the HAMAP website at https://hamap.expasy.org.


Introduction
Continuing technological advances have reduced the costs of DNA sequencing enormously in recent years, leading to an explosion in the number of available whole genome and metagenome sequences from all branches of the tree of life [1,2,3,4,5]. This wealth of sequence data presents exciting opportunities for experimental and computational research into the evolution and functional capacities of individual organisms and the communities they form, but fully exploiting this data will require complete and accurate functional annotation of these genome and metagenome sequences. Resources for genome annotation such as RAST/MG-RAST [6,7], IMG/M [8], the NCBI genome annotation pipeline [9], InterPro [10], TIGRFAMs [11], and HAMAP [12] exploit information from experimentally characterized sequences to infer functions for uncharacterized homologs. While the underlying principles of these resources are undoubtedly very similar, a lack of shared annotation standards and a suitable shared technical framework for annotation hamper efforts to use and combine them.
In this work, we use the HAMAP system (https://hamap.expasy.org) to demonstrate technical solutions that could facilitate the combination and reuse of functional genome annotation systems from any provider. HAMAP classifies and annotates protein sequences using a collection of expert-curated protein family signatures and annotation rules. Swiss-Prot curators build HAMAP rules as part of an integrated workflow that includes curation of experimentally characterized template entries in UniProtKB/Swiss-Prot, as well as curation of the associated rule and protein family signature (encoded as a generalized profile). HAMAP rules annotate family members to the same level of detail and quality as the expert curated UniProtKB/Swiss-Prot records on which they are based, combining family membership and residue dependencies to ensure a high degree of specificity [12].
The current implementation of HAMAP uses a custom rule format and annotation engine that are not easy to integrate into external pipelines. The HAMAP-Scan web service (https://hamap.expasy.org/hamap_scan.html) is a good alternative for small research projects, but large genome sequencing projects cannot depend on external web services to process large amounts of data. Our goal here was to develop a generic HAMAP rule format and annotation engine that is easily portable by external HAMAP users, using standard technologies that developers of other genome annotation pipelines could also adopt. To achieve this we have developed a representation of HAMAP annotation rules using the World Wide Web Consortium (W3C) standard SPARQL 1.1 syntax. SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language) is a query language for RDF (Resource Description Framework), a core Semantic Web technology from the W3C (see https://www.w3.org/RDF/ for more details). Our implementation allows users to apply HAMAP rules in SPARQL syntax to annotate protein sequences expressed as RDF using off-the-shelf SPARQL engines, without any need for a custom pipeline. If other annotation system providers were to adopt the same approach, it would then be possible to share and combine the annotation rules from multiple systems, execute them with any SPARQL engine, and compare the results.

Methods
To use a generic SPARQL engine to execute rule-based protein sequence annotation, we need the following input data: a) annotation rules in SPARQL syntax, b) protein sequence records in RDF syntax, and c) protein sequence/signature matches in RDF syntax, including alignment information for positional annotations.
To keep the examples given in the Figures short, we provide all RDF namespace prefix declarations in Figure 1 and omit these from subsequent Figures. We use the UniProt core ontology and other ontologies used by UniProt, such as FALDO [13], which is also used in the RDF of Ensembl [14] and Ensembl Genomes [15], to describe sequence positions, and the EDAM ontology [16] to describe sequence/signature matches.

HAMAP annotation rules in SPARQL syntax
A HAMAP annotation rule consists of two parts: 1) the annotations, and 2) a set of conditions that must be satisfied in order to apply those annotations. The rule annotations can be expressed either by a CONSTRUCT block that returns the annotations as RDF triples or by an INSERT block that inserts these triples directly into an RDF store, while the rule conditions can be expressed by the WHERE clause of a SPARQL query. Figure 2 shows part of the HAMAP rule for the signature MF_00005 as a SPARQL query. The CONSTRUCT block generates two annotations consisting of RDF triples for two Gene Ontology (GO) terms, provided that all conditions defined in the WHERE clause are satisfied. The conditions here are that the target must be a complete protein sequence, of bacterial or archaeal origin, and a member of the HAMAP family MF_00005 (i.e. matching the corresponding family signature). Figure 3 shows how the CONSTRUCT block of Figure 2 can be extended to generate metadata for provenance and evidence for each annotation that the rule generates. We attribute the annotations to the HAMAP rule (MF_00005) and describe the type of the evidence with a value from the Evidence Code Ontology (ECO) [17]. We link the attribution to the annotations via RDF reification quads, which is verbose but is understood by all RDF syntaxes and data stores.
The original HAMAP rule implementation has two features that we have not yet implemented in this work. The first is the ability to call sequence analysis methods such as SignalP [18] and TMHMM [19] for the annotation of signal and transmembrane regions, which is not implemented here as these methods may not be available to external users. The second is precedence relationships between HAMAP rules, which are complex and apply to relatively few rules.
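The split of a rule into conditions and annotations can be illustrated outside of SPARQL. The following is a minimal Python sketch, not the authors' implementation: the `Protein` class, the function names and the `up:classifiedWith` predicate label are assumptions made for illustration, and the GO identifiers are placeholders rather than the terms of the real MF_00005 rule. Only the NCBI taxonomy identifiers for Bacteria (2) and Archaea (2157) are real values.

```python
from dataclasses import dataclass

# NCBI taxonomy identifiers for the two superkingdoms named in the rule conditions.
BACTERIA = 2
ARCHAEA = 2157


@dataclass(frozen=True)
class Protein:
    """Hypothetical in-memory stand-in for a protein record in an RDF store."""
    iri: str
    is_fragment: bool
    lineage: frozenset      # taxon ids of the organism and its ancestors
    signatures: frozenset   # identifiers of the signatures the sequence matches


def where(protein):
    """Rule conditions, playing the role of the WHERE clause."""
    return (
        not protein.is_fragment                          # complete sequence
        and bool(protein.lineage & {BACTERIA, ARCHAEA})  # bacterial or archaeal
        and "MF_00005" in protein.signatures             # family member
    )


def construct(protein):
    """Rule annotations, playing the role of the CONSTRUCT block.

    The GO identifiers are placeholders, not the terms used by the real rule.
    """
    go_terms = ["GO:placeholder_1", "GO:placeholder_2"]
    return [(protein.iri, "up:classifiedWith", go) for go in go_terms]


def apply_rule(proteins):
    """Return annotation triples for every protein that satisfies the conditions."""
    triples = []
    for protein in proteins:
        if where(protein):
            triples.extend(construct(protein))
    return triples
```

In this sketch a bacterial, complete protein matching MF_00005 receives two triples, while a eukaryotic protein matching the same signature receives none, which mirrors how a CONSTRUCT query returns nothing for targets that fail the WHERE clause.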

Protein sequence records in RDF syntax
HAMAP SPARQL rules require protein sequence records in a simple RDF format. Figure 4 shows an example protein record with the identifier 'P1' (example:P1). The rules require an identifier for the sequence (example:P1-seq) and the organism as an NCBI taxonomy identifier (taxon:83333). The actual protein sequence is provided as an IUPAC amino acid encoded string (in the rdf:value predicate of example:P1-seq) for positional annotations.
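To make the shape of such a record concrete, the following Python sketch writes one protein as N-Triples. It is an illustration under assumptions: the `http://example.org/` namespace and the function name are hypothetical, the predicate and class names (`up:organism`, `up:sequence`, `up:Protein`, `up:Simple_Sequence`) reflect one reading of the UniProt core ontology and may differ from the exact triples in Figure 4, and the sequence used below is a toy string.

```python
UP = "http://purl.uniprot.org/core/"
TAXON = "http://purl.uniprot.org/taxonomy/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EXAMPLE = "http://example.org/"  # hypothetical namespace for the user's own records


def protein_to_ntriples(protein_id, taxon_id, sequence):
    """Serialize a minimal protein record as N-Triples (assumed vocabulary)."""
    # Restricting the sequence to uppercase letters also means that no
    # escaping is needed inside the N-Triples string literal.
    if not (sequence.isalpha() and sequence.isupper()):
        raise ValueError("sequence must consist of uppercase amino acid letters")
    protein = f"<{EXAMPLE}{protein_id}>"
    seq = f"<{EXAMPLE}{protein_id}-seq>"
    lines = [
        f"{protein} <{RDF}type> <{UP}Protein> .",
        f"{protein} <{UP}organism> <{TAXON}{taxon_id}> .",
        f"{protein} <{UP}sequence> {seq} .",
        f"{seq} <{RDF}type> <{UP}Simple_Sequence> .",
        f'{seq} <{RDF}value> "{sequence}" .',
    ]
    return "\n".join(lines) + "\n"
```

Calling `protein_to_ntriples("P1", 83333, "MTKILV")` yields five triples that any RDF store can load; a real pipeline would more likely use an RDF library than string formatting.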

Protein sequence/signature matches in RDF syntax
HAMAP SPARQL rules require sequence/signature match data in an RDF format. Figure 5 shows an RDF representation of the sequence/signature match of the example protein 'P1' (Figure 4) and the HAMAP signature MF_00005. The core information is a triple that states that the protein (example:P1) matches the signature (signature:MF_00005).
For positional annotations, the rule needs the start and end positions of the match region on the sequence, as well as the alignment between sequence and signature. We describe this information with the EDAM and FALDO ontologies and use the alignment format returned by the pfsearchV3 [20] and InterProScan [10] software.
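How start positions and alignments support positional annotation can be sketched in Python: a residue position on an annotated sequence is first mapped to a signature column, and that column is then mapped to a position on the target sequence. The sketch assumes a simplified alignment string in which uppercase letters are residues aligned to a signature column, lowercase letters are insertions, and '-' marks a deleted column. This convention and all function names are assumptions for illustration, not the exact format or logic of the HAMAP rules.

```python
def column_of_position(alignment, start, position):
    """Return the 1-based signature column aligned to a 1-based sequence
    position, or None if that residue is an insertion or lies outside the match."""
    column = 0
    seq_pos = start - 1
    for symbol in alignment:
        if symbol == "-":        # deletion: consumes a column, no residue
            column += 1
        elif symbol.islower():   # insertion: consumes a residue, no column
            seq_pos += 1
            if seq_pos == position:
                return None
        else:                    # match: consumes both
            column += 1
            seq_pos += 1
            if seq_pos == position:
                return column
    return None


def position_of_column(alignment, start, column):
    """Return the 1-based sequence position aligned to a signature column,
    or None if the column is deleted in this sequence or not covered."""
    col = 0
    seq_pos = start - 1
    for symbol in alignment:
        if symbol == "-":
            col += 1
            if col == column:
                return None
        elif symbol.islower():
            seq_pos += 1
        else:
            col += 1
            seq_pos += 1
            if col == column:
                return seq_pos
    return None


def transfer_position(source_aln, source_start, target_aln, target_start, source_pos):
    """Map a position on the source sequence to the target via the signature."""
    column = column_of_position(source_aln, source_start, source_pos)
    if column is None:
        return None
    return position_of_column(target_aln, target_start, column)


def residue_matches(sequence, position, expected):
    """Check that the target carries the expected amino acid at a position."""
    return (
        position is not None
        and 1 <= position <= len(sequence)
        and sequence[position - 1] == expected
    )
```

With a source alignment `"ACDEF"` starting at position 1 and a target alignment `"AgC-EF"` starting at position 11, source position 4 maps to target position 14, whereas source position 3 maps to nothing because the corresponding column is deleted in the target.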
A HAMAP rule specifies the sequence positions of feature annotations (such as active sites or binding regions) with respect to one or more experimentally characterized 'template' sequences in UniProtKB/Swiss-Prot. The rule engine therefore requires the alignment(s) of the template sequence(s) to the rule's signature as input, and uses these to determine the corresponding positions on the template(s) and target sequence. A HAMAP rule may additionally require that the matching discrete position or range on the target sequence correspond to a specified amino acid or sequence motif, e.g. to check that an active site has the expected amino acid. This functionality can be implemented either in standard SPARQL 1.1 using the REPLACE, STRLEN and CONCAT functions (see Supplementary Figure 1 for an example), or via a custom SPARQL function (an example Java function for RDF stores that extends the Apache Jena ARQ SPARQL engine is given in Supplementary Figure 2). We distribute the template sequence/signature alignments that are required for rule application together with the rules on our FTP site (at ftp://ftp.expasy.org/databases/hamap/hamap_sparql.tar.gz).

Simplifying the output from HAMAP rules for other annotation pipelines
HAMAP rules provide functional annotation in the form of free text and using controlled vocabularies and ontologies developed by UniProt and others. These include the Gene Ontology (GO) [21], the Enzyme Classification of the IUBMB ("EC numbers") [22] represented by the ENZYME database [23], and the Rhea database of biochemical reactions [24] based on the ChEBI ontology [25]. Each HAMAP rule provides all annotation fields required in UniProtKB. For users requiring only a subset of these annotations (such as enzymatic reactions described using Rhea, or protein functions, processes and cellular components described using GO) it is possible to translate only the desired annotation types into SPARQL queries.
We can also modify the CONSTRUCT/INSERT block of the queries to return the results as simple protein-annotation associations (see Table 1). This tabular result format can easily be loaded into a relational database or JSON-based document store and requires no further investment in a Semantic Web technology stack.
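As an illustration of how such simple associations can be consumed without any RDF tooling, the Python sketch below turns (protein, annotation type, annotation) tuples into CSV or JSON. The column names and function names are hypothetical and are not taken from Table 1.

```python
import csv
import io
import json

# Hypothetical column names; Table 1 may use different headers.
COLUMNS = ["protein", "annotation_type", "annotation"]


def to_rows(associations):
    """Turn (protein, annotation_type, annotation) tuples into dict rows."""
    return [dict(zip(COLUMNS, association)) for association in associations]


def to_csv(rows):
    """Serialize rows as CSV text, ready for a relational bulk loader."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize rows as a JSON array, ready for a document store."""
    return json.dumps(rows)
```

Either output can be loaded with the standard import tools of most relational databases or document stores.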

Validation
We have tested the approach of executing rule-based annotation with a generic SPARQL engine with the data from the HAMAP and UniProtKB/Swiss-Prot releases 2019_10. We translated the HAMAP rules into SPARQL CONSTRUCT queries and the protein sequences into the RDF format described in Figure 4.
We generated the RDF representation of the sequence/signature matches, as illustrated in Figure 5, directly from a relational database containing the results of pfsearchV3 scans of UniProtKB/Swiss-Prot versus HAMAP for our internal HAMAP release pipeline.
Other groups could achieve the same result by scanning their protein sequences with InterProScan and converting the XML result files into the described RDF format for sequence/signature matches. We provide an XSLT stylesheet for this conversion at https://github.com/sib-swiss/HAMAP-SPARQL/blob/master/src/main/xlst/interproToRdf.xslt. Alternatively, they can use pftoolsV3.2, which offers the required RDF format as one of its output options.
We tested two different open-source SPARQL engines (Virtuoso RDF 7.2 and Apache Jena TDB-2 3.13.1) to execute our rules and validated the generated annotations by comparing them to those obtained from our custom platform. This platform, implemented in Scala/Java, uses as input files protein entries in FASTA format and HAMAP rules in their custom text format to generate annotations in UniProtKB format (text, XML or RDF). The RDF data generated by the different systems was loaded into separate named graphs of an RDF database for comparisons using SPARQL queries to search for annotations unique to any of the three runs (see example query in Figure 6). The existing custom HAMAP annotation pipeline and each of the two SPARQL engines generated identical annotations, except for those that depend on external sequence analysis methods and the evaluation of HAMAP rule precedence, which we did not implement here as described in section 2.1.
On a laptop with 8 cores, it takes about 4 minutes to scan a small E. coli proteome with the HAMAP signatures using pftoolsV3.2, and 1 minute to execute the HAMAP rules with Apache Jena TDB2 (see the tutorial for instructions). This shows that the sequence/signature scanning step is the bottleneck in our system. Both steps, scanning and rule execution, could be run in an embarrassingly parallel fashion. A further optimization in an HPC setting would be to avoid HTTP communication by running the SPARQL query reader and processor in the same process.
An additional small benefit of the SPARQL representation is that SPARQL queries can be serialized in RDF and loaded into a SPARQL engine. We set up a server with our rules at https://hamap.expasy.org/sparql. This allows us to perform quality assurance on our rules by running analytical queries across them and SPARQL endpoints of other life science resources.
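The comparison logic can be illustrated with plain Python sets standing in for the named graphs. This sketch is not the query of Figure 6: it reports, for each run, every annotation triple that is not shared by all runs, which is a slightly broader check than "unique to one run". The graph names and the triples in the usage example are hypothetical.

```python
def discrepancies(named_graphs):
    """For each named graph, return the triples that are not present in
    every graph. Empty sets everywhere mean that all runs agree."""
    if not named_graphs:
        return {}
    graphs = {name: set(triples) for name, triples in named_graphs.items()}
    common = set.intersection(*graphs.values())
    return {name: triples - common for name, triples in graphs.items()}


def runs_agree(named_graphs):
    """True if every run produced exactly the same set of triples."""
    return all(not extra for extra in discrepancies(named_graphs).values())
```

For example, if the graphs `custom` and `virtuoso` both contain triples `a` and `b` while `jena` contains only `a`, the function returns `{b}` for the first two graphs and an empty set for `jena`, pointing directly at the annotation that one run failed to produce.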

Protein function annotation pipelines based on SPARQL
Here we have developed a SPARQL representation of HAMAP annotation rules that allows other groups with basic knowledge of this widespread standard technology to incorporate HAMAP in their own genome and proteome annotation pipelines. SPARQL can express all features of complex HAMAP rules, including the logic required for positional annotations, while freely available SPARQL engines provide a means to execute HAMAP rules without recourse to specialized software. This work demonstrates the feasibility of adopting SPARQL as a means to integrate existing functional annotation pipelines for genome sequencing projects. This applies not only to expert curated rules from HAMAP and other systems, but also annotation rules generated by automated approaches such as deep learning [26,27], which require a feature vector to be expressed as an RDF triple as shown by LOD4ML (http://lod4ml.org). SPARQL can also be adopted by those without access to specialized RDF triple stores by using a SPARQL to SQL mapping (such as that provided by any of the R2RML tools, see https://www.w3.org/TR/r2rml/) to execute SPARQL rules directly against data stored in a relational database. The main weakness of SPARQL is that, like many generic query engines, it tends to be computationally more expensive than a custom solution, but we have seen significant progress in the optimization of SPARQL engines over the past years [28].

An approach that is extensible to any domain of biology
While we have limited our demonstration to the use of SPARQL queries to formalize and execute protein annotation rules from HAMAP, there is nothing that ties the SPARQL approach to a particular domain of biology. Complete genome annotation requires identification and functional annotation of RNAs as well as proteins, and Figure 7 provides a demonstration of how that annotation could be provided by SPARQL. Here a hypothetical SPARQL rule specifies functional (GO) annotation for an RNA sequence of RNAcentral [29] that is a member of the U1 spliceosomal RNA family as defined by Rfam [30].
The development of annotation rules for a given domain across different groups will require community standards for the representation of the relevant domain-specific annotation types. In this work we have used the RDF vocabularies of UniProt, which allowed us to easily compare the results of the SPARQL approach to those of our existing HAMAP rule annotation pipeline. As other appropriate community ontologies become available, our queries and SPARQL rules can be easily adapted.

Further work
We plan to further extend our implementation of HAMAP rules using SPARQL to include external method calls and deal with rule precedence (see Section 2.1), and also develop a SPARQL representation for PROSITE, which provides protein domain annotation via a custom pipeline, ScanProsite (at https://prosite.expasy.org/scanprosite/) [31]. HAMAP and PROSITE are two of the main components of the UniRule system of UniProt, which provides automatic annotation for unreviewed entries of UniProtKB/TrEMBL [32], and the approach described here could be extended to the entire UniRule system. The UniProt data model was recently extended to allow enzyme annotation using biochemical reaction data from the Rhea database [33], which will further extend the scope of HAMAP SPARQL rules to more specialized applications, such as the creation and annotation of draft networks of metabolic reactions [34,35].