APRICOT: an integrated computational pipeline for the sequence-based identification and characterization of RNA-binding proteins

Abstract RNA-binding proteins (RBPs) have been established as core components of several post-transcriptional gene regulation mechanisms. Experimental techniques such as cross-linking and co-immunoprecipitation have enabled the large-scale identification of RBPs, RNA-binding domains (RBDs) and their regulatory roles in eukaryotic species such as human and yeast. In contrast, our knowledge of the number and potential diversity of RBPs in bacteria is poorer due to the technical challenges associated with the existing global screening approaches. We introduce APRICOT, a computational pipeline for the sequence-based identification and characterization of proteins using RBDs known from experimental studies. The pipeline identifies functional motifs in protein sequences using position-specific scoring matrices and Hidden Markov Models of the functional domains and statistically scores them based on a series of sequence-based features. Subsequently, APRICOT identifies putative RBPs and characterizes them by several biological properties. Here we demonstrate the application and adaptability of the pipeline on large-scale protein sets, including the bacterial proteome of Escherichia coli. APRICOT showed better performance on various datasets compared to other existing tools for the sequence-based prediction of RBPs by achieving an average sensitivity and specificity of 0.90 and 0.91 respectively. The command-line tool and its documentation are available at https://pypi.python.org/pypi/bio-apricot.


INTRODUCTION
Ribonucleoproteins and RNA-binding proteins (RBPs) are important post-transcriptional regulators in several processes such as RNA splicing, transport, localization, translation and stabilization. Such regulatory mechanisms involve brief interactions or stable binding of regulatory RNAs with RBPs, which are structurally and functionally important for various cellular processes. Due to developments in high-throughput mass-spectrometry and sequencing approaches, it is technically possible to perform global analyses to comprehensively catalog RBPs in an organism. Several studies have been conducted to identify and characterize RBPs as post-transcriptional regulators in human, mouse and yeast (1-6). More than 1000 eukaryotic RBPs have been described to contain conserved amino-acid motifs or RNA-binding domains (RBDs), which serve as RNA binding sites (1,7). A large number of these RBDs are classified by their RNA-binding characteristics as classical or non-classical RBDs (1,8), according to whether they were identified in several RBPs or in a few well-characterized ribonucleoproteins, respectively. Additionally, a small number of RBPs lacking known RNA-binding motifs have been identified, which in most cases rely on intrinsically disordered domains for their interaction with RNAs (1). Moreover, numerous structures of protein-RNA complexes have also been solved experimentally, providing biophysical information on the interaction between nucleic acids and amino acids.
The developments in RNA and RBP research have provided reliable resources for the advancement of computational methods for the identification of similar RBPs in different genomes. Bioinformatic approaches have been established to predict and characterize known RBPs using sequence-based features, such as biochemical properties, structural properties and their evolutionary relationship (9-11). A few computational tools such as SPOT-Seq (10), RNApred (12) and catRAPID signature (13) allow the identification of RBPs directly from the primary sequences of proteins. Other computational methods, such as RNAProB (14), BINDN+ (15) and RNABindRPLUS (16), have been developed to characterize RBPs by predicting RNA-binding residues derived from the known protein-RNA structures. Such tools can also be used to identify RBPs when RNA-binding residues in the query proteins are recognized. Since these methods are computationally expensive and have been trained on specific subsets of RBP structures, they do not perform equally well on heterogeneous datasets (17). For example, RBRIdent (18) is a recent approach that utilizes several biological features for an improved sequence-based prediction of RNA-binding residues, which, like many other tools, performs well only on specific datasets (17).
Since the experimental techniques established for eukaryotic systems cannot be directly applied to bacterial systems without intensive optimization, a system-wide study of RBPs in bacteria is lacking (19). Current knowledge of the RBPs in bacterial species is restricted to only a few proteins such as Hfq and CsrA, which together with their targets are an integral part of large post-transcriptional regulons (20-24). In contrast to the limited number of RBPs in bacteria, several hundreds of noncoding RNAs have been discovered that are linked to various regulatory processes such as expression of specific regulons and transcription factors via interactions with mRNAs and proteins (20). In order to understand the mechanisms involved in such RNA-regulated events, it is crucial to quantify and characterize the proteins that interact with these regulatory RNAs. Based on the experimentally derived RBPs from all the domains of life, computational methods can be developed that are capable of screening large protein sets.
We report APRICOT, an integrated pipeline for the sequence-based identification of RBPs in complete proteome sets of both eukaryotic and bacterial species. The pipeline characterizes a protein as an RBP on the basis of experimentally annotated functional motifs and domain families such as RBDs. APRICOT measures the similarity between the predicted RNA-binding sites in the query proteins and their corresponding reference domains based on sequence-based features and performs statistical analyses. This tool is built upon broad knowledge and sophisticated computational approaches in the field of functional motif discovery and our experience of working with RBPs in bacteria. The pipeline has been trained and tested on several test sets from protein databases and compared with previously described tools for RBP prediction. By analyzing the complete proteomes of human and Escherichia coli (strain K-12) we demonstrate the ability of the pipeline to process large datasets including bacterial proteomes. Additionally, by easily adapting the pipeline for the identification of kinases, we demonstrate its application in the characterization of proteins by other functional classes as well.

Databases and the tools
APRICOT requires as input a set of query proteins in which the presence of RBDs is to be determined. Basic information, e.g. amino acid sequences and taxonomy data, is retrieved from the UniProt Knowledgebase (25). In addition, a reference domain set is collected from domain databases based on functional classes specified by the users.
The domain resources used in this study are the Conserved Domain Database (CDD) (26) and InterPro (27), which consist of predictive models and signatures representing protein domains, families and functional sites from multiple publicly available databases. CDD includes domain entries as position-specific score matrices (PSSM) that are generated from multiple sequence alignments of representative amino-acid sequences obtained from several domain databases, namely Pfam (28,29), TIGRFAM (30), SMART (31), COGs (32), several NCBI-curated domains like PRK or Protein Clusters (33) and multi-model superfamilies of proteins (26). For the identification of domains in a given protein sequence, the PSSM entries in CDD are queried via reverse position-specific basic local alignment search tool (RPS-BLAST), a variant of the popular position-specific iterative BLAST (PSI-BLAST) (34). CDD (v3.14) contains annotations for 50 648 domains, where entries from every domain resource are assigned an individual PSSM identifier (id), allowing redundant entries of domains.
InterPro is a similar consortium that consists of domain entries as predictive models and signatures obtained from different databases, namely Pfam (28,29), TIGRFAMs (30), SMART (31), PROSITE patterns and profiles (35), HAMAP (36), PRINTS (37), PIRSF (38), ProDom (39), PANTHER (40), GENE3D (41) and SUPERFAMILY (42). Most of these databases contain domain entries as Hidden Markov Models (HMM) (43), probabilistic models derived from sequence alignments, which capture information on both substitution and indel frequencies. These domains can be queried using tools like HMMER3 (44). A few member databases contain PSSM domain models built from the multiple alignments of representative amino-acid sequences from the UniProt protein database, which can be queried by BLAST-based methods or a single model search algorithm (45), which have been integrated into InterProScan 5 (45). As of May 2016, InterPro (v.57) contained 29 175 domain models of which several are annotated with gene ontology (GO) terms (46).
The InterPro and CDD consortiums have only three databases in common (Pfam, TIGRFAM and SMART), which account for about 20 000 domains. Technically, the PSSM-based approach of CDD is built upon ungapped motifs, whereas the HMM probabilistic models of InterPro can handle motifs with insertions and deletions. By combining the predictive abilities of the CDD and InterPro consortiums, APRICOT provides a broader scope for domain characterization.

Workflow
APRICOT involves different modules for the identification and characterization of RBPs, which can be explained by its program input, analysis modules and program output (Figure 1). These modules are assembled into a command-line tool, for which the individual modules accessible through subcommands are specified below.
Program input. APRICOT requires two inputs for its execution: query proteins and the functional class of interest (Figure 1). The query proteins can be provided as a list of gene ids, protein ids or amino acid sequences. The query search can be limited to a specific species by providing a corresponding taxonomy identifier. Since APRICOT has been designed to process multiple queries, the motif prediction can be carried out for the functional characterization of an entire proteome set corresponding to a taxonomy id. As the second input, users must provide a list of terms or keywords, such as names of domain families, Pfam ids or MeSH terms, depending on the functional classes of interest; these are referred to hereafter as domain selection keywords. APRICOT uses a string-based search to select relevant entries from the domain resources, which are further utilized for identifying proteins that contain these domains. Optionally, a set of terms called result classification keywords can be provided for the classification of predicted domains into smaller subsets in order to help users navigate large datasets or classify proteins by functional similarity.
Modules for domain prediction and annotations. The core functionalities of APRICOT involve a multi-step process for the selection of proteins by identifying functional sites or domains of interest in their sequences followed by their annotations by various biological features. We have used a multifunctional human protein PTBP1 (47) as an example in order to describe the different modules involved in domain prediction and annotations in Figure 2. PTBP1 is an mRNA regulator that contains several repeated RBDs, specifically a highly abundant eukaryotic domain called RNA Recognition Motifs or RRMs (48).

Selection of reference domain set.
A string-based selection of domain families and functional motifs is carried out using the domain selection keywords to create a reference domain set. For this purpose, APRICOT scans each domain entry in the CDD and InterPro consortiums and selects those domains that contain at least one of the user-provided terms in their annotations, such as descriptions and GO terms. If a term comprises multiple words, only those domains that have all words co-occurring in the same context are selected. APRICOT also allows the usage of regular expressions for the domain selection (see online documentation for details).
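The keyword-based selection described above can be illustrated with a minimal sketch. It is not APRICOT's own code: the function names, the toy domain ids and the reading of "same context" as "same annotation string" are assumptions made for illustration only.

```python
def matches_term(term, annotation):
    """True if every word of a (possibly multi-word) term occurs in the annotation.

    Assumption: 'co-occurring in the same context' is approximated here as
    'all words present in the same annotation string' (case-insensitive).
    """
    text = annotation.lower()
    return all(word in text for word in term.lower().split())


def select_reference_domains(domains, keywords):
    """Return ids of domains whose annotation matches at least one keyword."""
    return {
        dom_id
        for dom_id, annotation in domains.items()
        if any(matches_term(keyword, annotation) for keyword in keywords)
    }


# Hypothetical domain entries (ids and descriptions are made up).
domains = {
    "dom1": "RNA recognition motif",
    "dom2": "Protein kinase domain",
    "dom3": "K homology domain, RNA-binding",
}
selected = select_reference_domains(domains, ["RNA-bind", "RNA recognition motif"])
```

Substring matching is what lets a truncated term such as 'RNA-bind' pick up annotations containing 'RNA-binding'; a regular-expression variant would replace `word in text` with `re.search`.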
In this analysis, we considered the domains obtained from the human interactome study (1,4) as the comprehensive resources for building a reference RBD set. To report high confidence RBPs by avoiding the selection of ambiguous and functionally irrelevant domains, we included all the classical RBDs in the domain selection keywords (Figure 2A). In order to account for ribosomal proteins, 109 terms related to RNA-binding ribosomal domains (4) were included in the domain selection keywords. An additional term 'RNA-bind' was introduced to include any further RBDs in the reference set that are well described as RBDs in databases but are not classified under classical RBDs (Figure 2B). Using these domain selection keywords, a total of 4797 unique RBD entries were curated from CDD (1995 entries) and InterPro (2802 entries). This set, referred to as the reference domain set, was used for filtering domain predictions in the downstream analysis (Supplementary Table S6).

Domain prediction.
In this step query amino acid sequences are characterized with all the possible domains from the databases without filtering a certain functional class. The sequences are subjected to domain prediction using RPS-BLAST and InterProScan to query their CDD and InterPro respectively (Figure 2C). By default, APRICOT uses both CDD and InterPro for the domain predictions, however users can choose one of the databases to reduce the run-time. Since the primary requirement of this module is the amino acid sequences of the query proteins in FASTA format, users can analyze novel sequences even when the gene/protein ids are unknown or lacking.

Nucleic Acids Research, 2017, Vol. 45, No. 11 e96

Selection of proteins by functional domains of interest.
This module allows the selection of relevant proteins from the query sets based on the predicted domains obtained in the previous step. The proteins are considered as candidates if they contain one of the domains of interest. Cut-offs for various statistical parameters (discussed below) can be defined for the selection of the predicted domains to identify such candidates, which are further annotated with additional information, such as ontology, pathway and cross-references to different databases.

Feature-based scoring.
This module ranks the domain predictions by their relevance. For this purpose, a comparative analysis is carried out between the protein regions that are predicted in the candidate proteins as domains of interest and the corresponding fragments of their reference consensus sequences. This comparison is done for a number of sequence-based features, namely chemical properties (average mass, pKa and pI), alignment scores calculated by the Needleman-Wunsch algorithm (primary sequence and secondary structure), Euclidean distance of protein compositions (di-peptides, tri-peptides and physico-chemical properties) and a measure of similarity between predicted sites and reference domains (for details see Supplementary Material S1A). A relative similarity between the predicted functional site and the reference domain consensus is calculated for each of these sets of features. We use a Bayesian probabilistic score in a range from 0 to 1 to represent the functional potential of the predicted motifs, where 1 indicates the highest probability (Figure 2D). To further estimate the statistical significance of a predicted domain, P-values are calculated for the sequence-based features except for the chemical properties. These probabilistic scores and P-values allow users to select proteins with high confidence motif predictions.
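Two of the features named above, the Needleman-Wunsch alignment score and the Euclidean distance of di-peptide compositions, can be sketched as follows. This is a simplified illustration and not the pipeline's implementation: the unit match/mismatch/gap scores are an assumption (a substitution matrix would normally be used), and the way the individual features are combined into the Bayesian score is described in the Supplementary Material and is not reproduced here.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not APRICOT's settings.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dipeptide_composition(seq):
    """Relative frequency of each of the 400 di-peptides, in a fixed order."""
    counts = {"".join(pair): 0 for pair in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [count / total if total else 0.0 for count in counts.values()]


def euclidean_distance(u, v):
    """Euclidean distance between two equally long composition vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Toy comparison of a predicted site against a reference consensus fragment.
predicted_site = "MKVLAAGIVK"
reference_fragment = "MKVLSAGIVR"
alignment_score = needleman_wunsch_score(predicted_site, reference_fragment)
composition_distance = euclidean_distance(
    dipeptide_composition(predicted_site),
    dipeptide_composition(reference_fragment),
)
```

A higher alignment score and a smaller composition distance both indicate that the predicted site resembles its reference domain more closely.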

Additional annotations of the selected proteins.
Upon selection of proteins of functional relevance, users can choose to further annotate these proteins by information like subcellular localization by PSORTb (49).
Program output. A comprehensive result is returned by APRICOT at each step of analysis and stored with relevant information that serves as the input for the subsequent steps. For example, the data for predicted domains can be repeatedly used for extracting proteins of different functional classes. The selected proteins are provided in a tabular format with the statistics on domain prediction and corresponding annotations obtained from UniProt and the comparative analysis (Supplementary Figure S2). To provide an easy navigation through the large-scale analysis data, the results can be classified using result classification keywords, into smaller subsets of proteins with enzymatic activities or specific functional aspects of proteins. Additionally, graphs and charts are provided to aid the visualization of the resulting data.

Training sets
For the identification of the most suitable parameters and their corresponding cut-offs for domain selection, training sets were collected from the manually curated and reviewed subset of the UniProt Consortium, SwissProt (51). A positive set of proteins was selected by using the keyword 'RNA-binding'. A second set of proteins was selected by using all the terms indicating functional association of proteins with nucleic acid. A third set comprising all the uncharacterized and hypothetical proteins from the database was selected. All these sets of proteins were subtracted from the SwissProt data and the remaining data, consisting of 271 219 proteins, were considered as the resource for the negative set. All the redundant protein sequences from both positive and negative sets were removed by clustering the sequences with BLASTclust (52) at 90% sequence identity. A total of 4779 non-redundant (nr) proteins were compiled in the positive set and a set of 5834 proteins was selected for the negative set, referred to henceforth as SwissProt-positive and SwissProt-negative respectively (Supplementary Table S4).

Test sets
To consistently evaluate the sensitivity (SN), specificity (SP) and accuracy (ACC) of APRICOT, a pair of positive and negative sets was obtained from NCBI Reference Sequence (RefSeq), a nr database (53), using the terms 'RNA-bind' and 'periplasmic' respectively. The former term retrieved 4470 RBPs from various organisms. The term 'periplasmic', which retrieved 5836 bacterial periplasmic proteins, was considered as a resource for non-RBPs based on the assumption that the majority of periplasmic proteins lack RBDs. Using BLASTclust from the NCBI-BLAST package (52), the proteins in each set were clustered by 75% sequence similarity, which resulted in 687 proteins in the positive set and 1199 proteins in the negative set, henceforth referred to as nr-positive and nr-negative respectively. Importantly, these datasets did not contain proteins that were included in the training sets (SwissProt-positive and SwissProt-negative).
An additional pair of positive and negative sets was obtained from the RNApred web server (12), which will be referred to as RNApred-positive (377 proteins) and RNApred-negative (355 proteins). The SN of the pipeline was tested on other positive datasets collected from various resources, which are RBPDB (54), RNAcompete (7), RBRIdent (18), rbp86 (55), rbp109 (55) and rbp107 (55). Escherichia coli (strain K-12) was used as an example for bacterial species and Homo sapiens (taxonomy id: 9606) was used as an example for eukaryotic species, consisting of 4479 and 70 076 protein entries in the UniProt database respectively. The positive RBP sets were selected from both the proteomes to quantify the ACC with which APRICOT identifies RBPs in these genomes. We considered 1535 nr human proteins as the positive set (Supplementary Table S6), which were proposed as RBPs in the global experiment-based studies or were reported by independent publications (1-4). So far no global study has been reported for the genome-wide identification of RBPs in bacteria. Besides ribosomal proteins, only a few proteins such as Hfq (20), CsrA (22), YhbY (56), SmpB (57), ProQ (58,59), CspA (60) and CspB (60) have been reported as RBPs in E. coli. Hence, a larger RBP reference of E. coli K-12 was retrieved from the UniProt database using the GO term (GO:0003723) for RNA binding, which comprised 160 proteins including the known RBPs (Supplementary Table S7).

Assessment criteria
The statistical parameters for domain predictions in the training set and the performance of the tool on the test sets were evaluated by using standard binary criteria of SN, SP, ACC, Matthews Correlation Coefficient (MCC) and

Parameter optimization for the selection of predicted domains
The training sets, SwissProt-positive (4779 proteins) and SwissProt-negative (5834 proteins), were analyzed in order to evaluate the ability of the method to accurately differentiate RBPs from non-RBPs. For this evaluation, we used statistical parameters of sequence similarity, residue identity, residue gap and E-value of the domain prediction to describe the similarity between a query and its corresponding reference. Unlike residue identity, sequence similarity accounts for the edit operations like positive substitutions, thereby capturing the secondary structure information at a better resolution. An E-value for searches of homologs against a database represents the number of times a given match in a sequence is obtained purely by chance, meaning that a low E-value reflects a higher significance of database match. We describe an additional parameter namely the domain coverage, which is the percentage of the length predicted as domain in the query compared to the original length of reference domain. Generally, lower domain coverage suggests a random similarity of the predicted domain, whereas higher domain coverage reflects a higher potential of a domain to be functionally relevant.
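The domain coverage parameter and its use as a filter together with sequence similarity can be sketched as below. This is an illustrative reading rather than the pipeline's code: the record layout, the field names and the choice to measure the predicted length as the aligned span on the query are assumptions.

```python
def domain_coverage(query_start, query_end, reference_length):
    """Percentage of the reference domain length covered by the predicted site.

    Assumption: the predicted length is the aligned span on the query,
    with 1-based inclusive coordinates.
    """
    predicted_length = query_end - query_start + 1
    return 100.0 * predicted_length / reference_length


def passes_cutoffs(hit, min_coverage, min_similarity):
    """Keep a predicted domain only if it reaches both minimum cut-offs."""
    coverage = domain_coverage(hit["start"], hit["end"], hit["reference_length"])
    return coverage >= min_coverage and hit["similarity"] >= min_similarity


# Hypothetical domain predictions for one query protein.
hits = [
    {"domain": "domA", "start": 11, "end": 50, "reference_length": 80, "similarity": 31.0},
    {"domain": "domB", "start": 5, "end": 10, "reference_length": 90, "similarity": 40.0},
    {"domain": "domC", "start": 1, "end": 70, "reference_length": 75, "similarity": 12.0},
]
# 39% coverage and 24% similarity are the minimum cut-offs reported in this section.
kept = [h["domain"] for h in hits if passes_cutoffs(h, min_coverage=39.0, min_similarity=24.0)]
```

In this toy example only the first hit survives: the second covers under 7% of its reference domain and the third falls below the similarity cut-off.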
Initially we analyzed the training sets by a naïve approach, which involved InterProScan and CDD-based batch-search methods in their default settings. Analysis by InterProScan achieved a TPR of 0.77 and CDD achieved a TPR of 0.79. Several queries in the CDD-based method were annotated as RBD-containing proteins with coverage lower than 10% and sequence similarity lower than 5%, which indicated poor conservation of the functional domains. Similarly, InterProScan failed to characterize several RBPs due to its stringent filtering criteria. Interestingly, several RBPs were reported by only one of the methods; hence, when the results from both analyses were combined, an increased TPR of 0.82 was achieved. This clearly showed the potential to achieve higher SN by the combined approach, which is implemented in APRICOT. We further analyzed the training datasets by APRICOT, which predicted thousands of RBD entries in both positive and negative sets. These were evaluated using systematically varying cut-offs of each parameter to optimize the identification of RBPs. The corresponding ROC curves were generated and optimal cut-off ranges were defined by identifying the values of the parameters that show an optimal TPR (closer to one) and FPR (closer to zero) with high ACC (closer to one), resulting in statistically significant AUC, MCC and F-measure (Figure 3A and Supplementary Table S4).
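The cut-off sweep behind such ROC curves can be sketched as follows. The data are made-up coverage values, the function names are hypothetical, and choosing the cut-off with the highest accuracy is only one possible selection rule, assumed here for illustration.

```python
def roc_points(positive_values, negative_values, cutoffs):
    """For each minimum cut-off, return (cutoff, TPR, FPR, ACC).

    A protein is called an RBP when its parameter value is >= cutoff.
    """
    n_pos, n_neg = len(positive_values), len(negative_values)
    points = []
    for cutoff in cutoffs:
        tp = sum(value >= cutoff for value in positive_values)
        fp = sum(value >= cutoff for value in negative_values)
        tn = n_neg - fp
        points.append((cutoff, tp / n_pos, fp / n_neg, (tp + tn) / (n_pos + n_neg)))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule, anchored at (0,0) and (1,1)."""
    xy = sorted({(0.0, 0.0), (1.0, 1.0)} | {(fpr, tpr) for _, tpr, fpr, _ in points})
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(xy, xy[1:]))


# Hypothetical domain-coverage values (%) for a positive and a negative set.
positives = [80, 60, 50, 30]
negatives = [40, 20, 10, 5]
points = roc_points(positives, negatives, cutoffs=[0, 25, 45, 100])
auc = auc_trapezoid(points)
best_cutoff = max(points, key=lambda p: p[3])[0]  # first cut-off with maximal ACC
```

Sweeping a maximum cut-off such as the E-value works the same way with the comparison reversed (`value <= cutoff`).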
For the coverage of the predicted domains, the optimal minimum cut-off was recorded to be 39%, which attained an ACC, TPR, FPR, MCC and F-measure of 0.81, 0.87, 0.24, 0.63 and 0.81 respectively. Using a higher cut-off of 60%, a lower TPR of 0.81 but a better FPR of 0.16 was obtained, which consequently gives a better ACC and F-measure. Similarly, the optimal minimum cut-off for sequence similarity was recorded to be 24%, which attains an ACC, TPR, FPR, MCC and F-measure of 0.81, 0.83, 0.20, 0.63 and 0.81 respectively. As shown in the ROC curve, high accuracies of 0.81 and 0.82 were likewise achieved by using a minimum cut-off of 15% for the residue identity and a maximum E-value cut-off of 0.01. The decision values of the parameters were further ranked, individually and in combinations, for all the predicted RBD entries in the training sets, and ROC curves and AUCs were generated to identify their marginal contributions to the overall ACC in detecting RBDs (Supplementary Figure S5).
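The reported metrics can be cross-checked from the rounded TPR and FPR values and the sizes of the training sets (4779 positives, 5834 negatives). The sketch below uses the standard definitions of ACC, F-measure and MCC; the function name is hypothetical, and because the inputs are rounded rates, recomputed values may differ from the published ones in the last digit.

```python
from math import sqrt


def metrics_from_rates(tpr, fpr, n_pos, n_neg):
    """Derive ACC, precision, F-measure and MCC from TPR, FPR and set sizes."""
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    tn = n_neg - fp
    precision = tp / (tp + fp)
    return {
        "acc": (tp + tn) / (n_pos + n_neg),
        "precision": precision,
        "f_measure": 2 * precision * tpr / (precision + tpr),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


N_POS, N_NEG = 4779, 5834  # SwissProt-positive and SwissProt-negative
coverage_39 = metrics_from_rates(tpr=0.87, fpr=0.24, n_pos=N_POS, n_neg=N_NEG)
coverage_60 = metrics_from_rates(tpr=0.81, fpr=0.16, n_pos=N_POS, n_neg=N_NEG)
similarity_24 = metrics_from_rates(tpr=0.83, fpr=0.20, n_pos=N_POS, n_neg=N_NEG)

for name, m in [("coverage >= 39%", coverage_39),
                ("coverage >= 60%", coverage_60),
                ("similarity >= 24%", similarity_24)]:
    print(f"{name}: ACC={m['acc']:.2f} F={m['f_measure']:.2f} MCC={m['mcc']:.2f}")
```

The recomputation is consistent with the statement that the 60% coverage cut-off trades a lower TPR for a lower FPR and ends up with a higher ACC and F-measure than the 39% cut-off.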
This evaluation led to the selection of domain coverage and sequence similarity as the default parameters for the APRICOT analysis with their minimum cut-offs of 39 and 24% respectively. The analysis by APRICOT using the selected parameters with their defined cut-offs achieves a TPR

Assessment of the pipeline performance
A variety of positive datasets were analyzed by APRICOT, on which the pipeline achieved SN in a range of 0.81-1 (Figure 3B) demonstrating its high efficiency in domain-based characterization of RBPs. A more detailed evaluation of the pipeline performance was carried out on the paired dataset of nr-positive and nr-negative, and RNApred-positive and RNApred-negative (Table 1).
To demonstrate the efficiency of APRICOT on large-scale data, the complete proteomes of H. sapiens and E. coli K-12 were analyzed. The human proteome set containing 70 076 UniProt protein entries was subjected to domain prediction. A known set of 1535 nr RBPs was used as the positive reference set (4), of which 25 RBPs have not been defined with any RBDs. The reference domain set was considered for the initial identification of RBPs using pre-defined cut-offs for the aforementioned default parameters. Upon filtering of proteins by predicted domains, 1091 proteins from the reference RBP set were reported with at least one RBD from the reference domain set, showing an SN of 0.71. By including the non-classical RBDs in the reference domain set, 68 more proteins could be recognized as RBPs, and 201 RBPs could be recognized additionally by further including domains listed as RBDs unknown (Supplementary Table S6). The remaining 180 proteins that are not identified as RBPs by APRICOT do not contain RBDs and are listed as RNA-related proteins by Gerstberger et al. (4). The data for this analysis are provided in Supplementary Table S6. A similar analysis of the complete proteome of E. coli K-12 was carried out by APRICOT using the default parameters with the reference domain set (Figure 2A and B). In the initial characterization of RBPs, 673 sequences were selected as RBP candidates by RPS-BLAST and 502 sequences by InterProScan analysis. These proteins account for 806 RBP candidates, of which 369 proteins were identified as putative RBPs by both methods. From the full proteome set, APRICOT could successfully identify all the known E. coli RBPs. Specifically, Hfq, CsrA, YhbY, SmpB, ProQ, CspA and CspB were identified due to highly conserved RBDs in their sequences, which have been previously reported and characterized for their regulatory roles (Table 2). Furthermore, of the 160 GO term-derived RBPs from E. coli K-12, 129 were identified correctly by APRICOT, demonstrating an SN of 0.80. APRICOT failed to identify the remaining 31 proteins as RBPs because either the predicted RBDs could not pass the parameter filters or the reference domain set lacks specific domains associated with these proteins. These unidentified RBPs included Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) system Cascade subunits, toxic proteins and several enzymatic proteins like ribonucleases, tRNA-dihydrouridylases and mRNA interferases.
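The candidate counts above follow from combining the two prediction methods as a set union. A small sketch, using made-up protein ids for the set operations and the counts quoted in the text for the arithmetic check:

```python
def combine_candidates(rps_blast_hits, interproscan_hits):
    """Union and intersection of candidate sets from the two domain searches."""
    return rps_blast_hits | interproscan_hits, rps_blast_hits & interproscan_hits


# Hypothetical candidate ids, for illustration only.
union, both = combine_candidates({"P1", "P2", "P3"}, {"P3", "P4"})

# Inclusion-exclusion with the E. coli K-12 counts quoted in the text:
# candidates from either method = RPS-BLAST + InterProScan - shared.
ecoli_candidates = 673 + 502 - 369

# Sensitivity on the human reference set: identified RBPs / reference RBPs.
human_sensitivity = 1091 / 1535
```

The union is what lets the combined approach recover proteins reported by only one of the two methods.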
The feature-based scores were calculated for each domain selected from the predicted data, which helps to differentiate highly reliable RBD predictions from low confidence RBD predictions. Query proteins that contain high confidence RBDs were further annotated with additional information, namely subcellular localization, secondary structures, GO terms and tertiary structures (Supplementary Table S7). These proteome-wide analyses clearly demonstrate a high SN of the pipeline in identifying RBPs based on functional domains. However, they also show a limitation: the characterization of a query depends on the functional domains and motifs selected from the databases based on the user-provided terms.

Identification of other functional classes by APRICOT
Importantly, in addition to the application for the functional identification of RBPs, APRICOT modules can be easily adapted for one or multiple other functional classes. As a part of the critical assessment of function annotation, a project to assess the methods for computational annotation of protein functions (61), APRICOT was successfully used to annotate bacterial datasets comprising more than 1 million proteins by a wide number of biological functions (arXiv: 1601.00891 [q-bio.QM]). In order to emphasize the aspect of APRICOT as a tool for the characterization of other functional classes of proteins, we chose kinase proteins from E. coli (strain K-12) as the reference set. Kinases are known to catalyze the transfer of phosphate groups to a substrate molecule using adenosine triphosphate as a phosphate donor. In the UniProt database, 110 proteins from E. coli (strain K-12) are annotated with various kinase activities (e.g. Serine/threonine-protein kinase, Signal histidine kinase and Shikimate kinase) and are tagged by the GO term (GO:0016301) for kinase activity.
The APRICOT pipeline was supplied with the term 'kinase' for the selection of the reference domain set and the pipeline was applied to the kinase proteins (Supplementary Table S8). Out of 110, 106 kinase proteins were identified correctly by APRICOT, achieving an SN of 0.96. The proteins that were not selected by APRICOT contain kinase-associated domains that were not present in the reference domain set due to the pipeline's domain selection constraints. This analysis suggests that APRICOT is efficient in the characterization of proteins based on a pre-defined set of domains associated with functional classes other than RBPs as well. However, it should be noted that the ACC of the results depends on the choice of terms for the domain selection.

Comparative assessment of APRICOT with other RBP prediction tools
Although there are several approaches developed for the prediction of nucleic acid binding sites, we could compile only four tools whose original aim is the prediction of RBPs, namely SVMprot (62), RNApred (12), SPOT-Seq-RNA (63,64) and catRAPID signature (13). SVMprot was designed to predict RBPs by Support Vector Machine (SVM)-based classification of protein primary sequences into functional families (54 Pfam families) and it was made available as a web server. Since the tool is no longer available, we could not include it in our comparative analysis. RNApred uses SVM models that are developed with amino-acid compositions and PSSMs. SPOT-Seq-RNA uses structure homology-based predictions of the RBPs and also allows the identification of the binding residues and binding affinities using SPARKS X (65) and DRNA tools (64) respectively. The fourth tool, catRAPID signature, is an SVM-based method to identify RBPs and their binding regions based on physico-chemical properties.

We conducted a comparative assessment of APRICOT's capabilities with these tools (Table 3). Unlike the other tools, which have been trained or constructed on a fixed reference set, APRICOT is independent of any fixed reference because it selects reference domains for each analysis based on the user-provided keywords. Therefore, it is capable of using any new RBDs that might be added to the integrated domain sources in the future. APRICOT takes proteins that are predicted with statistically significant RBDs and scores them in comparison with their reference consensus sequence for various features using Needleman-Wunsch alignment scores, Euclidean distance and similarity-based scores. At the end, the scores for each property are combined to obtain a Bayesian probabilistic score in a range of 0-1, where 1 indicates the best hits. The results from all the intermediate steps are provided to allow users to evaluate different statistical aspects of their study.
For an unbiased evaluation of the performance of APRICOT relative to RNApred, SPOT-Seq-RNA and catRAPID signature, we used two datasets, RBscore R130 (130 RBPs) and RBscore R116 (116 RBPs), which are the training and test sets created for the RBscore SVM approach in NBench (17). On RBscore R130, APRICOT achieved a TPR of 0.88, whereas RNApred, SPOT-Seq-RNA and catRAPID signature attained lower TPRs of 0.79, 0.82 and 0.55, respectively. On RBscore R116, which is indicated as a challenging set in NBench, APRICOT achieved a comparatively low TPR of 0.67; however, this was still higher than the TPRs achieved by RNApred (0.66), SPOT-Seq-RNA (0.51) and catRAPID signature (0.47). We also checked the performance of naïve RPS-BLAST, which is used for the batch search of domains in CDD, and InterProScan, which is used for motif prediction in the InterPro consortium. On both datasets, these naïve approaches for domain identification performed worse individually than in combination. In their default settings, both methods achieved a TPR of 0.82 on RBscore R130 by identifying 107 RBPs. On RBscore R116, RPS-BLAST and InterProScan performed better than SPOT-Seq-RNA but worse than APRICOT and RNApred, achieving TPRs of 0.55 and 0.57, respectively.
APRICOT performed better than the other tested tools in all assessment metrics when evaluated with RBscore R246 (the RBPs from both datasets) as the positive set and RNApred-negative (355 proteins) as the negative set, achieving the highest ACC, MCC and F-measure of 0.88, 0.75 and 0.86, respectively (Table 3).
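For readers who wish to recompute the assessment metrics from confusion-matrix counts, the sketch below uses the standard definitions of TPR (sensitivity), specificity, ACC, F-measure and MCC. The function names are illustrative and are not part of APRICOT; returning 0.0 for undefined ratios is a convention chosen here.

```python
import math


def sensitivity(tp, fn):
    """True positive rate (TPR, SN): fraction of true RBPs that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """True negative rate (SP): fraction of non-RBPs that are rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def accuracy(tp, fn, tn, fp):
    """ACC: fraction of all proteins that are classified correctly."""
    total = tp + fn + tn + fp
    return (tp + tn) / total if total else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

As a check against the text, identifying 107 of the 130 RBPs in RBscore R130 corresponds to `sensitivity(107, 23)`, which rounds to the reported TPR of 0.82.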

APRICOT versus tools for the prediction of RNA-binding residues
A comparative assessment of the programs developed for the prediction of nucleic acid binding sites was carried out in the Nucleic Acid Binding prediction Benchmark, NBench (17). In total, 16 tools for the prediction of RNA-binding residues and five tools for the prediction of DNA-binding residues, along with several datasets obtained from the structures of protein-nucleic acid complexes, were included in that study (available at http://ahsoka.u-strasbg.fr/nbench/index.html). The motivation behind developing APRICOT differs noticeably from that of the tools included in NBench. APRICOT identifies RBPs among large-scale query sets and further characterizes them by biological functions, whereas the 16 tools in NBench predict RNA-binding residues in pre-defined RBPs. In practice, APRICOT and these tools can complement each other: APRICOT is used first to identify RBPs and their corresponding RBDs, and the best performing NBench tools are then applied to obtain a high-resolution annotation by identifying RNA-binding residues. To evaluate the potential of this idea, we acquired 3657 PDB entries, comprising 24 different RNA-related datasets in NBench, selected at a resolution cut-off of 3.5 Å. This dataset was analyzed with APRICOT, and the identified RBD sites were compared with the nucleic acid binding residues at a distance cut-off of 3.5 Å in each PDB entry (Supplementary Table S9).
We observed that the RNA-binding residues of 3340 (91%) PDB entries overlap with the APRICOT-predicted RBD sites, giving an overall SN of 0.91 (Figure 4A and B). The NBench tools were ranked by their SNs in identifying RNA-binding residues, together with APRICOT for its ability to identify RNA-binding sites, on 24 datasets. As shown in Figure 4C, APRICOT was among the best performing tools compared to the other tools in NBench across the 21 diverse datasets. In agreement with the observations made for the other tools, APRICOT showed a lower SN on the New R15 set (15 new structures) and RBscore R116 (116 proteins, described as a difficult set). Furthermore, unlike most of the tools, which do not discriminate between RNA-binding and DNA-binding residues, APRICOT showed a high SP (0.70) when 1374 DNA-binding proteins were included in the analysis. This evaluation demonstrates that APRICOT's domain-prediction-based analysis is an efficient approach for identifying RBPs and their corresponding potential RNA-binding regions in the query sequences. It also implies that the resolution of RBP studies could be enhanced significantly by first identifying the RBPs using APRICOT and then analyzing the predicted RBD sites with tools for the identification of RNA-binding residues.
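The overlap analysis described above can be sketched as follows. The exact overlap criterion used in the study is not restated here, so this sketch assumes that an entry counts as covered when at least one known binding residue lies inside a predicted domain span. The data layout (1-based inclusive spans, sets of residue positions) and all function names are hypothetical.

```python
def residues_in_domains(binding_residues, domain_spans):
    """Return the binding residue positions that fall inside any domain span.

    binding_residues: iterable of 1-based residue positions.
    domain_spans: iterable of (start, end) tuples, 1-based and inclusive.
    """
    return {
        pos
        for pos in binding_residues
        if any(start <= pos <= end for start, end in domain_spans)
    }


def entry_is_covered(binding_residues, domain_spans):
    """Assumed criterion: at least one binding residue lies in a predicted site."""
    return bool(residues_in_domains(binding_residues, domain_spans))


def overlap_sensitivity(entries):
    """Fraction of entries whose binding residues overlap a predicted site.

    entries: iterable of (binding_residues, domain_spans) pairs.
    """
    entries = list(entries)
    if not entries:
        return 0.0
    covered = sum(entry_is_covered(res, spans) for res, spans in entries)
    return covered / len(entries)
```

With the counts reported above, 3340 covered entries out of 3657 give 3340/3657, which rounds to the stated SN of 0.91.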

CONCLUSIONS
APRICOT is an integrated pipeline for the sequence-based identification and annotation of the query proteins based on the functional motifs and domains of interest known from the experimental data. Notably, here we report APRICOT primarily as a tool for the sequence-based identification of RBPs, which uses a consistent set of reference RBDs derived from large-scale experimental studies. Using several domain data-resources and associated tools, the domains are predicted in the queries and only those proteins that contain domains of interest are further characterized. By involving a wide range of biological features for the characterization of functional motifs, the pipeline carries out an intensive comparative analysis between the predicted domains and their respective reference consensus. This comparison is translated into statistical scores that enable users to differentiate proteins that are predicted to harbor domains of high similarity with their reference sequences from proteins that have poorly conserved domains. The proteins are subjected to annotation by additional biological properties, such as subcellular localization and secondary structure to get further insight into their functional relevance.
The pipeline has been extensively tested on several RBPs and is optimized for the identification of RBPs in large datasets, such as the complete proteomes of human and E. coli. For instance, APRICOT successfully identified the respective motifs of CsrA, ProQ, YhbY and SmpB in E. coli with domain coverage higher than 80% and residue similarity close to 70%. In addition to these previously characterized RBPs, APRICOT predicted a number of E. coli proteins that can potentially interact with RNAs via RBDs and hence could be further validated by experimental studies.
A thorough comparison between APRICOT and the other RBP prediction tools demonstrated its superior performance and efficiency in identifying RBPs across a wide range of datasets. Furthermore, we showed that the RBD sites obtained from APRICOT analysis overlap substantially with the known RNA-binding residue sites in RBPs. Hence, we suggest that APRICOT analysis can be complemented with RNA-binding residue prediction tools to obtain high-resolution binding information for RBPs. Owing to the automated framework and the accessibility of the different modules of the pipeline, APRICOT can be conveniently adapted for the characterization of other functional classes. Accordingly, by applying the pipeline to the identification of kinase proteins in E. coli, we demonstrate that the tool is not built on a fixed set of domain information but instead allows users to characterize proteins based on the functional classes of their interest.

AVAILABILITY
APRICOT is implemented in Python as a standalone command-line program, which can be executed on Unix systems. The tool has been extensively refined based on the requirements and suggestions of experimental researchers. The source code for the command-line tool is available under the ISC license at https://pypi.python.org/pypi/bio-apricot and the releases are automatically submitted to Zenodo (DOI: https://doi.org/10.5281/zenodo.322677 for the current version 1.2.7). A Docker image of the software is available at https://hub.docker.com/r/malvikasharan/apricot/. Instructions for the usage of this pipeline are provided in its comprehensive documentation, including test cases and online video tutorials.