Improving the consistency of domain annotation within the Conserved Domain Database

When annotating protein sequences with the footprints of evolutionarily conserved domains, conservative score or E-value thresholds need to be applied to RPS-BLAST hits to avoid many false positives. We observed that manual inspection and classification of hits gathered at a higher threshold can add a significant amount of valuable domain annotation. We report an automated algorithm that 'rescues' valuable borderline-scoring domain hits that are well supported by domain architecture (DA, the sequential order of conserved domains in a protein query), including tandem repeats of domain hits reported at a more conservative threshold. This algorithm is now available as a selectable option on the public conserved domain search (CD-Search) pages. We also report on the possibility of 'suppressing' domain hits close to the threshold based on a lack of well-supported DA, and of implementing this conservatively as an option in live conserved domain searches and for pre-computed results. Improving domain annotation consistency will in turn reduce the fraction of NR sequences with incomplete DAs. URL: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi


Introduction
The Conserved Domain Database (CDD) (1) consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These models are available as position-specific score matrices (PSSMs) for identifying conserved domains in protein sequences via Reverse Position-Specific (RPS)-BLAST. CDD is a redundant collection; it includes models imported from SMART (2), Pfam (3), COG (4), NCBI Protein Clusters (5) and TIGRFAMS (6), as well as NCBI-curated fine-grained hierarchical classifications for selected domain families based on phylogenetic analysis. Domain models that have overlapping annotation on the same protein sequences are clustered into CDD superfamilies (7). CDD provides pre-computed domain and site annotation for the majority of protein sequences tracked by NCBI's Entrez system. Two CDD search services are available: CD-Search (8), for protein and nucleotide queries, and Batch CD-Search (9), for multiple protein queries. The default E-value threshold for the pre-computed annotation and for these search services is 0.01. NCBI's Conserved Domain Architecture Retrieval Tool (CDART) (10) carries out similarity searches of Entrez Protein based on domain architecture (DA). For a protein query, it returns the footprints of the highest-scoring CDD superfamilies on that protein sequence and a list of proteins with similar DAs, grouped according to DA. CDART output for each DA includes its taxonomy span and the total number of NR sequences having that DA.

Methods
Estimating the frequency of domain hits to 'rescue'
It has been noted earlier that profile-based annotation of domain footprints can benefit from considering domain co-occurrence (11). To determine whether developing an algorithm to 'rescue' domain hits with E-values above the default reporting threshold of 0.01 would uncover a significant number of additional annotations, we estimated the frequency of domain hits we would 'rescue' by manually inspecting randomly picked sets of sequences. As test sets, we chose SwissProt (12) as represented in NCBI's Entrez/protein database (542 902 sequences) and a representative human proteome (19 856 sequences, with 'NP' and 'XP' accession prefixes), both as of 5 February 2014. The human proteome comprised essentially one representative protein sequence for each currently known human gene. To generate the set, we parsed the annotation files for the human reference assembly GRCh38 for GeneID and protein accessions and picked one protein per gene based on the longest annotated CDS per GeneID. We obtained three random subsets of 4000 sequences each from SwissProt and two random subsets of 2000 sequences each from the representative human proteome and, for these smaller subsets of proteins, compared the domain hits reported at an E-value threshold of 1.0 with those reported at the default threshold of 0.01. An in-house version of CDART was generated that contains pre-computed sets of RPS-BLAST results for all sequences in NCBI's NR database reported at an E-value threshold of 1.0, above the default reporting threshold of 0.01. We asked how often new or additional domain hits were encountered and how often this happens when other non-overlapping domain footprints are present. We manually inspected subsets of the protein sequences that had new hits to determine whether these hit(s) appeared meaningful and should be 'rescued'.
In making that decision, we considered the frequency and taxonomy of the new DA, completeness of the 'new' domain hit and overlap between the 'new' domain hit and other well-supported domain annotations.
Consequently, we generated a set of protein sequences having valuable domain hits to 'rescue' and used it to later refine the following 'rescue' algorithm based on well-supported DA or tandem repeats.
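The comparison between the two E-value thresholds described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Hit` record, its field names and the helper functions are hypothetical, and treating both thresholds as inclusive upper bounds is an assumption.

```python
from dataclasses import dataclass

REPORT_E = 0.01   # default reporting threshold stated in the text
RELAXED_E = 1.0   # relaxed threshold used to collect additional hits


@dataclass(frozen=True)
class Hit:
    """One domain hit on a query protein (hypothetical record layout)."""
    query: str
    superfamily: str
    start: int
    end: int
    evalue: float


def split_hits(hits):
    """Separate hits reported by default from borderline 'new' hits.

    Assumption: a hit is reported when evalue <= REPORT_E and is a
    candidate for 'rescue' when REPORT_E < evalue <= RELAXED_E.
    """
    reported = [h for h in hits if h.evalue <= REPORT_E]
    candidates = [h for h in hits if REPORT_E < h.evalue <= RELAXED_E]
    return reported, candidates


def queries_with_new_hits(hits):
    """Return the set of queries that gain at least one borderline hit."""
    _, candidates = split_hits(hits)
    return {h.query for h in candidates}
```

Counting the queries returned by `queries_with_new_hits` for a random subset gives the kind of 'sequences with new hits' figure that was then inspected manually.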
'Rescue algorithm'
[...] frequent alternative covers at least 20 sequences from NCBI's NR database or is more common than the initial 'A' DA, report it instead of the 'A' architecture, which means that 'rescued' domains that contribute to this most frequent DA are reported as well.
6. As CDART only reports/considers a single superfamily footprint for two or more consecutive domain hits to models from the same superfamily, irrespective of the repeat number, we added the following: if any of the additional domain hits that has not been 'rescued' at this point belongs to the same CDART superfamily as an adjacent hit that is being reported (i.e., a tandem repeat), that domain hit is 'rescued' and reported as well.
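The two rules that survive in the text above (the frequency test for the alternative DA and the tandem-repeat rule of step 6) can be sketched as below. This is a hedged illustration under stated assumptions, not the published implementation: the `Hit` record and both function names are hypothetical, 'adjacent' is taken to mean the nearest reported hit on either side when hits are ordered by start position, and hits rescued by this rule are not used to rescue further neighbours.

```python
import bisect
from dataclasses import dataclass

MIN_NR_SEQUENCES = 20  # minimum NR coverage quoted in the text


@dataclass(frozen=True)
class Hit:
    """One domain hit on a single query (hypothetical record layout)."""
    superfamily: str
    start: int
    end: int
    evalue: float


def prefer_alternative(alt_count, initial_count):
    """Report the most frequent alternative DA instead of the initial DA
    if it covers at least 20 NR sequences or is more common than the
    initial DA."""
    return alt_count >= MIN_NR_SEQUENCES or alt_count > initial_count


def rescue_tandem_repeats(reported, candidates):
    """Return candidates whose nearest reported neighbour (on either
    side) belongs to the same superfamily.

    Assumption: adjacency is judged against reported hits only.
    """
    ordered = sorted(reported, key=lambda h: h.start)
    starts = [h.start for h in ordered]
    rescued = []
    for cand in candidates:
        # Index of the first reported hit starting at or after the candidate.
        i = bisect.bisect_left(starts, cand.start)
        # Nearest reported hit before (i - 1) and after (i), if present.
        neighbours = ordered[max(i - 1, 0):i + 1]
        if any(n.superfamily == cand.superfamily for n in neighbours):
            rescued.append(cand)
    return rescued
```

Whether the real procedure chains rescues along a run of repeats is not stated in the surviving text, so the single pass here is the more conservative reading.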

Results and conclusions
Improving domain annotation consistency to increase the fraction of sequences in NR with more complete DAs
Incomplete architectures may contribute considerably to the large overall number of DAs in CDART. As shown in Figure 1, a large fraction of DAs each cover only a few sequences from NCBI's NR protein set. Ongoing curation to improve representation of some domain families, and other efforts such as those reported here to improve domain annotation consistency, will reduce the fraction of sequences with incomplete DAs and hence the number of rare or unusual DAs in CDART.
Manual inspection adds a non-trivial amount of valuable annotation using CDD at a higher threshold
We manually screened random subsets of protein sequences having one or more additional domain hits at an RPS-BLAST E-value threshold of 1.0 that were not present at the reporting E-value threshold of 0.01 and determined whether these additional hit(s) were valid and should be 'rescued' (Table 1). For example, in SwissProt random sample 2, 987 of the 4000 unique protein sequences had new hits (24.7% of the total protein sequences). Of these 987 protein sequences, about 19.5% had valuable hits to 'rescue', which extrapolates to about 4.8% of the original random sample of 4000. Information such as the frequency of the alternative DA, the taxonomic span of the alternative DA, the completeness of the additional domain hit and its degree of overlap [...]
An algorithm that considers DA and tandem repeats adds a significant amount of valuable annotation using CDD at a higher threshold
An automated procedure (detailed in the Methods section) was developed: a simple classifier that considers DA and tandem repeats in determining whether to 'rescue' a borderline hit. In total, 11.29% of protein sequences in the representative human proteome (2241 protein sequences out of 19 856) and 5.58% of the SwissProt protein sequences (30 267 protein sequences out of 542 902) had domain(s) 'rescued' by the algorithm. In the manual screening (Table 1), we estimated 19.66% ± 5.18% and 4.97% ± 0.13% (averages and standard deviations) for the human proteome and SwissProt test sets, respectively. However, in the manual screening we considered additional discriminators, such as the taxonomy of the DA in CDART and different thresholds of NR sequences, and we cannot exclude unconscious bias based on the biological knowledge of the curator. These additional discriminators were not considered in the automated procedure.
In addition, the algorithm was run on a later CDD release (CDD v3.12, 46 675 PSSMs) with improved domain models. This may explain the difference in the percentage of rescued domains between our manual screening and the automated procedure. This study suggests that it may not be necessary to include additional and more computationally intensive discriminators, such as taxonomic distribution, in the algorithm.
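The percentages quoted in this section follow directly from the counts given in the text and can be rechecked with a few lines of arithmetic. The helper name `pct` is ours; all counts are taken from the text.

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Manual screening, SwissProt random sample 2.
with_new_hits = pct(987, 4000)           # about 24.7% of the sample
rescued_of_sample = with_new_hits * 0.195  # about 4.8% of the sample

# Automated procedure.
human_rescued = pct(2241, 19856)         # about 11.29%
swissprot_rescued = pct(30267, 542902)   # about 5.58%

for label, value in [
    ("sample 2, sequences with new hits", with_new_hits),
    ("sample 2, valuable hits (extrapolated)", rescued_of_sample),
    ("human proteome, rescued by algorithm", human_rescued),
    ("SwissProt, rescued by algorithm", swissprot_rescued),
]:
    print(f"{label}: {value:.2f}%")
```

The extrapolation multiplies the fraction of sequences with new hits by the fraction of those judged valuable, which is how 19.5% of 987 sequences becomes roughly 4.8% of the 4000-sequence sample.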
The automated algorithm was implemented in the latest public CD-Search version (released 3 October 2014) as a selectable option, 'rescue borderline hits', for live searches (Figure 2). Figure 3B shows the result of a live CD-Search with the new 'rescue borderline hits' option selected. Only one of the two new hits detected at the raised E-value is 'rescued' (B2 = ABC_ATPase); it is indicated by a dotted line and its E-value is highlighted in red. Based on DA, the ABC_ATPase superfamily domain hit (cl21455, hit detected with superfamily member cd00079 HELICc) is 'rescued' by the algorithm, whereas the intraflagellar transport complex B subunit 20 (IFT20) domain hit (cl20817) is not.

Example of a domain 'rescued' by the algorithm and of an incomplete DA
There are two common DAs in CDART, one with and one without the second ABC_ATPase SF domain (Table 2; 2554 vs. 1542 NR sequences, respectively, determined 4 November 2014). The DA missing the second ATPase domain is most likely incomplete, as the SecA ATPase/DEAD motor is composed of two ATPase domains, which function together to bind and hydrolyze ATP. For about 80% of the sequences in NR having the DA lacking the second ATPase domain (as of 16 October 2014), a CD-Search with the new 'rescue borderline hits' option 'rescued' the second ATPase domain. It may be that with improved domain representation and detection, and with improved annotation consistency, these two architectures will resolve to a single DA with two ATPase domains.

Example of tandem repeats lifted by the algorithm
The algorithm also 'rescues' all additional hits detected at E-value 1 and not at E-value 0.01 that belong to the same CDART superfamily as an adjacent hit that is being reported at E-value 0.01, i.e. tandem repeats. An example (Figure 4) is the beta-propeller of protein Krp1, which contains six Kelch repeats. Various Pfam models detect four of [...]
Manual inspection removes a considerable amount of incorrect annotation using CDD at a lower threshold
We were also interested in whether some annotation with borderline hits close to the reporting threshold should be 'suppressed'. To investigate this, we manually screened sample protein sequences for domain(s) present at the default E-value threshold of 0.01 but lost at an E-value threshold of 0.001 and determined if those domain hits should be 'suppressed' ([...]
Example of a domain hit that manual inspection classifies as incorrect

In summary
Manual inspection (i) reveals a non-trivial amount of valuable annotation using CD-Search at a higher E-value threshold and (ii) also reveals a smaller, but non-trivial, amount of incorrect annotation that could be avoided using CD-Search at a lower threshold. The most recent version of CD-Search (released 3 October 2014) provides the option to 'rescue' borderline-scoring domain hits based on well-supported DAs and tandem repeats. Currently, this option is available for live searches. We plan to extend the post-processing of CD-Search results to also allow 'suppression' of some domain hits close to the default E-value threshold based on a lack of well-supported DA, and finally to implement a conservative post-processing strategy for both pre-computed results and live searches.