GenDiS database update with improved approach and features to recognize homologous sequences of protein domain superfamilies

Abstract Since proteins evolve by divergent evolution, proteins with distant homology to each other may or may not bear similar functions. Improved computational approaches are required to recognize distant homologues that are functionally similar. One of the methods of assigning function to sequences is to use profiles derived from sequences of known structure. We describe an update of the Genomic Distribution of protein structural domain Superfamilies (GenDiS) database, namely GenDiS+, which provides a projection of SCOP superfamily members on the sequence space (NR database, NCBI). The sequences are validated using structure-based sequence alignment profiles and domain and full-length sequence alignments. GenDiS+ is a `tour de force’ for detecting homologues within around 160 000 taxonomic identifiers, starting from nearly 11 000 domains of known structure. Features, like full-sequence alignment and phylogeny, domain sequence alignment and phylogeny, list of associated structural and sequence domains with strength of interactions, links to databases like Pfam, UniProt and ModBase and list of sequences with a PDB structure, are provided.


Introduction
The number of protein sequences is estimated to be three orders of magnitude higher than the number of structures currently available in PDB (1). The number of folds being limited (2) and homologous proteins having similar structures and protein sequences can be classified into the existing folds using sequence profiles. Profiles derived from Position-Specific Scoring Matrix (PSSM) (3), Hidden Markov Model (HMM) (4) and Markov random field-based (5) methods are found to be very sensitive in accurately classifying sequences into structural folds or families. Several approaches exist for classifying a sequence into structural folds. These methods rely on different approaches like threading, contact-prediction, homologybased methods and support vector-based methods.
Databases like Superfamily (6) and Gene3D (7) provide information on homologues of known structure from 2487 complete genomes and genomes in UniProtKB (8), respectively. The NR database, NCBI, has sequence information for 564 020 taxonomic identifiers including unclassified organisms. Genomic Distribution of protein structural domain Superfamilies (GenDiS) database provides information about homologous sequences, starting from domains of known structure and belong to the same superfamily. The connections are performed at the sequence level along with the taxonomic distribution. The users can also compare different genomes for a particular domain superfamily and analyse the other co-existing domains. The users can also obtain information about the domain architecture (DA) of the homologues and the associated superfamily of the co-existing domains. GenDiS has been used for deriving profiles from superfamily hits for sequence searches or for an estimate on the number of proteins for a family in a genome (9)(10)(11)(12)(13), and the methodology has been adapted in other studies (14). In this paper, we describe an update of the database with additional features. The new release of GenDiS+ records 18 million homologous domains, around 23 million co-existing SCOP domains, around 38 million Pfam domains from 1965 SCOP superfamilies and over 27% of approximately 65 million scanned sequences (from around 1.6 million organisms). The results have also been presented for homologues, starting from domains represented from 1195 folds and 7 classes in SCOP 1.75.

Materials and methods
The methods for sequence search and validation have been described in our previous paper (15) and in Figure 1. However, we describe them briefly here.

Searching for homologues of structural members
Two approaches were used for identifying homologues, the first being multi-query (MQ) approach with all PASS2.4 (16,17) superfamily members as queries. The second approach was using multiple patterns per query for the single-membered and two-membered PASS2.4 superfamilies and a best-representative sequence-multi pattern (BRS-MP) approach for multi-membered superfamilies. CSI-BLAST (18) was used using MQ approach, while PHI-BLAST (19) was used for MP approach with an E-value and inclusion threshold of 10 −3 for 20 iterations. The sequence search was carried out against NR database, NCBI (May 2015).

Validation and analysis
In the PASS2 database, structural alignments are created for the superfamily members (as organized in SCOP) but with <40% sequence identity with each other. In this manner, the alignment is not skewed by sequences with high sequence identities. In some superfamilies, structurally deviant members were removed, and structural alignments were reconstructed to obtain good-quality alignments (17). HMMs were created from the PASS2.4 structure-based sequence alignments of domain superfamilies (Superfamily/SF-HMM) and individual sequences (Single query/SQ-HMM) using HMMER suite (20). Each of the hits was assessed if they are true positives, using the HMMs and program HMMSCAN from HMMER. HMM matches with an independent E-value (i E-value) of 10 −3 and model coverage of 0.7 were considered. HMM overlaps of up to 25 residues were allowed, beyond which the match with lower i Evalue was considered. Finally, the sequences with at least a single HMM match of the same superfamily as the query were assigned as true positive hits. Discontinuous domains were identified using continuity of HMM coordinates and discontinuity of sequence coordinates (explained in Supplementary File).
The presence of co-existing domains of known structure was recognized through HMM library derived from SCOP database. Henceforth, we refer to such connections, and DA is referred as SCOP-DA. Likewise, the presence of coexisting domains, which are mainly recorded in sequence domain databases, such as PFAM, was recognized using search against a library of PFAM (v 28) HMMs. Henceforth, this result is referred as Pfam-DA. DA of homologues was compared, and homologues were clustered using an in-house tool called alignment-free DA similarity search (ADASS) algorithm (21). The strings of domains between sequences are used to score the extent of dissimilarity amongst sequences employed to create phylogeny. Taxonomy assignments were carried out using NCBI Taxonomy database (22). Domain regions were extracted from sequences and aligned using Clustal Omega (23), and fulllength sequences were aligned using MAFFT (24). Trees were computed for the alignments using Clustal Omega and PartTree (25) program, respectively.

Database setup
The database was constructed using HTML, CSS, Bootstrap and json-like collections derived from MongoDB using The sequence searches were carried out using BRS-MP approach for multi-membered superfamilies, as described in the Materials and Methods section. The hits were validated using a stringent validation using structure-based sequence alignments of superfamily members. From the validated hits, features like DA computation and phylogeny, domain and full-length alignment and phylogeny and conserved motifs were analysed and have been made available in the database.
Robo3T. MongoDB collections were created for the details like taxonomic distribution, SCOP and Pfam DA, SCOP fold and class mapping, taxa and DA mapping and PASS2.4 HMM details for each. Search options were implemented using python scripts using Jinja as the interface between python and HTML. The web interface can be accessed from different browsers (Firefox, Chrome and Microsoft Edge) and operating systems (Windows, Linux and MacOS) and has been aesthetically designed for both desktop and laptop screens.

Search and browse GenDiS+
A quick search can be carried out against the database, using keywords like the six-digit SCOP superfamily code, SCOP superfamily name, Pfam family name, NCBI accession identifiers and SCOP and Pfam DAs. A fuzzy search has been implemented for superfamily and family names. The database can be browsed using three different SCOP hierarchies-class, fold and superfamily. The taxonomy browsing option projects trees to display NCBI taxonomic ranks like kingdom and superphyla, and the tree branches are connected to links displaying the lower taxonomic ranks by hyperlinks. The database can also be searched using NCBI taxa names like species name, family, phylum or kingdom names using the 'Browse by taxonomy' tab.
The users can browse the database through SCOP or taxonomic hierarchy and at the species level. The user can browse through SCOP classes, folds or superfamilies using the 'Browse by SCOP hierarchy' option. On clicking the class option, a page is displayed that lists the folds and superfamilies under each SCOP class ( Figure 2a). All the folds in SCOP (v1.75) are displayed in the webpage on folds, with the respective classes and superfamilies (Figure 2b). Links are provided for the corresponding structural class and folds in SCOP and superfamily information in GenDiS+.
The distribution of homologues in each species can be analysed for different hierarchies like cellular organisms, viruses, unclassified organisms and artificial sequences. Each of these hierarchies is divided into different superphyla, phyla and families. The page has been rendered using Anytree module from python, styled by Bootstrap CSS. By clicking the sub-taxa levels, the user can browse the different organisms under each sub-taxon. The taxid, species name and taxonomic hierarchy are displayed along with the structural superfamilies that occur in the proteome and their accession ids. Links are provided for the taxid, accession ids from NCBI and superfamily to the GenDiS+ superfamily page. Sort buttons are provided for each column to sort the entries numerically or alphabetically. The coverage of different taxonomic hierarchies has been provided in Table 1.

Information and downloads
The main 'Downloads' page can be used to retrieve homologous sequences annotated using our pipeline, PASS2 structure-based sequence alignment HMM (SF-HMM) and single PASS2 member HMM (SQ-HMM). This includes the combined SF-HMM and SQ-HMM libraries, full-length and extracted domain regions for all the superfamilies. The full-length sequences represent the fraction of sequence space with at least a single domain, which is homologous to a superfamily and can be used as a database for sequence searches. The statistics page provides information about sequence coverage of different databases and superfamily member statistics (Tables 1 and 2 and Figure 3).  The coverage of different kingdoms and superphyla in the taxonomy database, NCBI, has been described. Most of the archaeal and bacterial genomes have been covered, whereas lower coverage is seen for Eukaryotes especially, Metazoa and Viridiplantae.
A library of NCBI organisms with taxonomic identifiers, complete lineage and the superfamily domains identified (as taxid library) has also been provided for download. A library of the SCOP-DA and Pfam-DA, identified in the study, has been provided for download as SCOP-DA library and Pfam-DA library, respectively. List of superfamilies (specific to taxa such as archaea, bacteria, metazoa, viridiplantae, fungi and viruses) and superfamilies common to the three kingdoms of life are also available for download.

Superfamily description page
The superfamily description page can be accessed by searching the database using the SCOP code or superfamily name and from the 'Browse by Superfamily' option. The 'Summary' tab displays details of the superfamily such as the SCOP code, description, SCOP class and fold, number of hits, organisms, SCOP-DA and Pfam-DA (Figure 4a (Figure 4b). The full-length sequence phylogeny is provided with a metadata file for annotation of organisms in the tree at the species and family level. The 'Tree file' tab displays the SCOP DA tree as derived from ADASS tool (Figure 4c). The 'Members' tab displays PASS2 superfamily members with links for the sequence and conserved motifs file. The domain alignments and the HMMs of PASS2 superfamilies can be used for sequence searches or for validation of hits obtained from sequence searches carried out for specific superfamilies.

Tools
To facilitate analysis of the sequence data, three different tools have been made available in GenDiS+ and have been described in detail in Supplementary File. These tools enable analysis of particular domain superfamilies of interest.

Applications: case study of Smad protein MH1 superfamily
The superfamily SMAD MH1 (MAD homology 1) domain (SCOP code 56366) has SCOP entries from human and mouse. We found homologues for the superfamily in 297 proteomes from the phyla Cnidaria and Bilateria, thereby sampling superphyla that are not included in PDB such as Protostomia and Platyhelminthes.
The coverage of PDB entries from different databases by our method has been shown in Table 3. The two protein sequences covered in SCOP (v1.75) are from humans, whereas SCOPe (v2.07) has an entry from mouse (3KMP) #ref; the other entry that is 51% identical to the sequence from mouse has not been classified (3QSV) (Baburajendran et al., unpublished results).  TP  FP  TP  FP  TP  FP  TP  FP   Pfam  9  1  3  0  13  0  7  The profile with the best performance has been highlighted in bold. The GenDiS+ profile of full-length alignments with 70% column information performs the best in all the cases. TP, true positives; FP, false positives. All the PDB entries listed in the below databases were retrieved in the sequence searches. An entry from mouse that was not covered by the following databases was also recovered indicating that we have been able to improve coverage at both sequence and structure space.
The most prevalent SCOP-DA is the associated DA with SMAD/FHA domain, followed by the single domain of SMAD MH1. Other less prevalent SCOP-DA contain double-domain form of SMAD MH1 domain, β-β-α zinc fingers and P-loop containing nucleotide triphosphate hydrolases. Similarly, the Pfam-DA consists of the singleand double-domain forms and combinations with domains like SMAD MH2 domain, Snapin/Pallidin, GTPase ArgK and others (Figure 5a and b).
All the organisms contain the associated DA with SMAD/FHA domain and the single-domain form, whereas the double-domain form occurs in Cnidaria and Protostomia. Chordata contains all the other DA, which has the maximum domain architectural diversity.

Applications of GenDiS+ database in sequence searches
Profiles were derived from GenDiS+ domain sequence alignments, GenDiS+ full-sequence alignments with and without trimming of gaps and using information from columns ,which are present in at least 70% of the sequences for the acyl carrier protein-like superfamily. SCOP profiles were derived from PASS2.5 (26), and Pfam family profiles were also used for the comparison. The performance of the above profiles was compared with respect to the identification of annotated hits in the proteomes from different model organisms (UniProt Reference proteome). The GenDiS+ full-length profile (derived from alignment positions with at least 70% sequence conservation) performed as well as Pfam profiles with respect to number of hits and true positives (Table 4).

Discussion
Sequence annotation is the most important step in genomic, metagenomic and transcriptomic studies. The number of structures available in PDB and SwissProt is far less compared to the number of sequences in the sequence database, which poses a hurdle in the functional and structural annotation. We report the update of a database, GenDiS+, to bridge the gap between the sequence space and structural space using sensitive computational searches and profiles of structural entries of members of a superfamily. Our database is unique (in comparison to other databases like PDB, SCOP and PASS2.4) in providing additional information like sequence alignments, conserved motifs for homologues of a superfamily, DA for the identified homologues and taxonomic details. It is also different from databases like UniProt, Pfam and Superfamily in providing details for nearly 41% of organisms in NCBI Taxonomy database, rather than RefSeq proteomes or complete proteomes. Different details are provided for SCOP superfamilies, and files can be analysed and are available for download. Using our approach, ∼58% of sequences from unclassified and uncultured organisms and 30% of sequences from extrachromosomal origin like plasmids and transposons have been annotated. Information of the conserved motifs extracted from the alignment of homologues of superfamily members and associated SCOP superfamily and Pfam family domains (identified from the DA of homologues) has been provided. The full-length sequence alignments, domain sequence alignments and phylogeny have been provided with the taxonomy metadata. GenDiS+ database provides sequence information for validated superfamily members with links to other databases like PDB, SCOP, PASS2.4, UniProt and ModBase, if available.
The sequence alignment of the members is also available, which includes domains of known structures to validate the superfamily assignment. Both DA and sequence alignments, along with taxonomic distribution, can be used to understand the evolution of the superfamily. To the best of our knowledge, no other database provides structural annotations for entire taxonomy database and provides correspondence between SCOP and Pfam database with a high coverage (60%) of Pfam families. However, sequences with a novel fold or superfamily from PDB identifiers not classified in SCOP will be missed out in the tool. There are ∼40% families in Pfam that have not been assigned a domain superfamily of known structure from our analysis. These families are good targets for structural genomics initiative, which will help in bridging the gap between the sequence and structural space.

Supplementary data
Supplementary data are available at Database Online.