CroMaSt: a workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment

Abstract
Motivation: Protein domains can be viewed as building blocks, essential for understanding structure–function relationships in proteins. However, each domain database classifies protein domains using its own methodology. Thus, in many cases, domain models and boundaries differ from one domain database to the other, raising the question of domain definition and enumeration of true domain instances.
Results: We propose an automated iterative workflow to assess protein domain classification by cross-mapping domain structural instances between domain databases and by evaluating structural alignments. CroMaSt (for Cross-Mapper of domain Structural instances) classifies all experimental structural instances of a given domain type into four different categories (‘Core’, ‘True’, ‘Domain-like’ and ‘Failed’). CroMaSt is developed in Common Workflow Language and takes advantage of two well-known domain databases with wide coverage: Pfam and CATH. It uses the Kpax structural alignment tool with expert-adjusted parameters. CroMaSt was tested with the RNA Recognition Motif domain type and identifies 962 ‘True’ and 541 ‘Domain-like’ structural instances for this domain type. This method solves a crucial issue in domain-centric research and can generate essential information that could be used for synthetic biology and machine-learning approaches to protein domain engineering.
Availability and implementation: The workflow and the Results archive for the CroMaSt runs presented in this article are available from WorkflowHub (doi: 10.48546/workflowhub.workflow.390.2).
Supplementary information: Supplementary data are available at Bioinformatics Advances online.


A. Review of existing domain databases
Among sequence-based domain databases, Pfam is certainly one of the most comprehensive resources [Mistry et al., 2021]. Seed alignments for a representative set of sequences are manually curated using structural information, whenever available. Then, an HMM (Hidden Markov Model) profile is built from the seed alignment and used to retrieve the full set of sequences matching this domain from the UniProt reference proteomes, thus producing the Pfam entry full alignment. Sets of Pfam entries that are thought to be evolutionarily related are grouped together into clans. This grouping is based on sequence similarity, structural similarity, functional similarity and/or profile-profile comparisons. Since October 5, 2022, the Pfam website has redirected to InterPro, and it was decommissioned in January 2023. However, the Pfam project continues and the core database will be maintained on the same principles as before [Paysan-Lafosse, 2022].
Structure-based domain databases are best represented by the SCOP (Structural Classification of Proteins) [Andreeva et al., 2020], CATH (Class, Architecture, Topology and Homology) [Sillitoe et al., 2021] and ECOD (Evolutionary Classification Of protein Domains) [Cheng et al., 2014] databases. The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between proteins whose three-dimensional structure is known and deposited in the PDB. The classification of proteins in SCOP has been constructed mainly manually by visual inspection and analysis. In SCOP, entries are protein domains identified in PDB structures and organized into families and superfamilies and finally into structural folds and classes reflecting their secondary structure content. Domain boundaries are provided at both family and superfamily levels, as the evolutionary relationships can sometimes span regions of different size between closely related (family level) and more distantly related (superfamily level) proteins. In brief, the family domain boundaries can define conserved multidomain regions, whereas the superfamily domains span the individual domains. The relationships between the SCOP and Pfam databases are complex because of differences in boundaries and members of the families. However, curated cross-references are indicated at the family level from SCOP to the Pfam database in many cases. The CATH project has developed a semi-automatic procedure to split 3D structures into their constituent domains, defined as semi-independently folding globular units. These domains are then clustered into homologous superfamilies based on evolutionary relationships. The sequences of the CATH structural domains are then used to build HMM profiles in order to identify the domains in UniProt protein sequences for which no 3D structure is available. This effort is shared with the sister resource Gene3D.
The lowest level of CATH is H, for Homologous superfamilies, and CATH provides structural superpositions of all representative domains of a superfamily. However, superfamilies can be subdivided into functionally coherent groups named Functional Families (FunFams). Recently, CATH has created an additional class for non-globular domains. ECOD is a hierarchical classification consisting of five levels: architecture (A), possible homology (X), homology (H), topology (T), and family (F). The entire database is first divided into architectures (A) based on secondary structure element (SSE) composition and overall shape (for example, alpha bundles and beta sandwiches). Then, the possible homology (X) level groups domains where some evidence exists to demonstrate homology but further evidence is needed. At the homology (H) level, ECOD groups domains that are descended from a common ancestor as indicated by significant sequence-structure scores, functional similarity, and opinions in the literature and in SCOP. The topology (T) level groups domains that have similar arrangements of and connections between secondary structure elements. At the bottom-most level, the family (F) level groups domains that have significant sequence similarity, primarily based on Pfam. ECOD employs an automatic software pipeline to classify newly released structures from the PDB. Proteins that cannot be classified confidently and completely by automated methods are manually curated.
The third category of domain databases encompasses integrated databases such as InterPro and CDD (Conserved Domain Database). InterPro [Blum et al., 2021] is an integrated resource of predictive models or 'signatures' representing protein domains, families, regions, repeats and sites from major protein signature databases including CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Thus, InterPro aims to combine the individual strengths of each protein domain database without building domain models itself. Quality control is performed at InterPro when integrating new signatures by checking whether such signatures generate false positive matches. Hierarchical relationships are identified between InterPro entries to represent subfamilies displaying specific functions within larger families, or specific subclasses within certain classes of domains. The InterProScan software regularly calculates InterPro signature matches to UniProtKB. The CDD database [Lu et al., 2020] is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for domains and full-length proteins obtained from both NCBI projects (NCBI Protein Clusters collection, NCBIfam, CDD itself) and external sources (Pfam, SMART, COG, TIGRFAMs). These models are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST (Reversed Position Specific BLAST). The NCBIfam curated domains use 3D structure information to define domain boundaries explicitly and provide insights into sequence/structure/function relationships. CDD shares domain models with InterPro and helps enlarge InterPro with specific subfamilies.

C. List of inconsistent structural instances
CroMaSt flags a total of 324 StIs from Pfam and CATH as obsolete or inconsistent: 8 are inconsistent and the remaining 316 are obsolete.
The format of these structural instance entries differs slightly between Pfam and CATH.
• Pfam inconsistent and obsolete entries: "PDB id,Chain,Fam name,Fam id,UNP id,UNP start,UNP end"
• CATH inconsistent entries: "PDB id,Chain,Domain position,Fam id,PDB start,PDB end"
• CATH obsolete entries: "PDB idChainDomain position,Fam id,PDB start,PDB end"
Below is the list of the 8 inconsistent domain StIs detected in CroMaSt run1 (Main paper). The domain StI with PDB ID '5OSG' (chain h) is associated with UniProt ID 'A0A3Q8IT45' in PDB and with UniProt ID 'E9BNI3' in Pfam. This is because UniProtKB has merged 'E9BNI3' into 'A0A3Q8IT45' in a recent update (release 2023_02). Regarding the domain StI present in PDB ID '2KU7' (chain A), the start and end residues from the PDB entry are mapped in SIFTS to two different UniProt IDs: Q03164 and Q9UNP9. The situation is the same for all domain StIs in PDB ID '3DXB': the start and end residues map to UniProt IDs P0AA27 and Q9UHX1, respectively. Finally, the domain StI in PDB ID '4V19' (chain X) has no annotation for any UniProt entry in SIFTS.
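The three entry layouts listed above can be read with a few lines of Python. The following is only a minimal sketch, not part of CroMaSt: the class and function names are hypothetical, the example identifiers are placeholders, and the splitting of the fused first field of CATH obsolete entries (4-character PDB id, 1-character chain, remainder as domain position) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PfamStI:
    pdb_id: str
    chain: str
    fam_name: str
    fam_id: str
    unp_id: str
    unp_start: int
    unp_end: int


@dataclass
class CathStI:
    pdb_id: str
    chain: str
    domain_position: str
    fam_id: str
    pdb_start: str  # kept as text: PDB numbering may carry insertion codes
    pdb_end: str


def parse_pfam(line: str) -> PfamStI:
    """Parse 'PDB id,Chain,Fam name,Fam id,UNP id,UNP start,UNP end'."""
    fields = [f.strip() for f in line.split(",")]
    pdb_id, chain, fam_name, fam_id, unp_id, start, end = fields
    return PfamStI(pdb_id, chain, fam_name, fam_id, unp_id, int(start), int(end))


def parse_cath(line: str) -> CathStI:
    """Parse either CATH layout (6 fields, or 4 fields with a fused first field)."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) == 6:  # inconsistent-entry layout
        return CathStI(*fields)
    if len(fields) == 4:  # obsolete-entry layout
        fused, fam_id, start, end = fields
        # ASSUMPTION: 4-character PDB id + 1-character chain + domain position
        return CathStI(fused[:4], fused[4], fused[5:], fam_id, start, end)
    raise ValueError(f"unexpected number of fields: {len(fields)}")


# Placeholder lines (hypothetical identifiers, not taken from the Results archive)
pfam_entry = parse_pfam("1abc,A,ExampleFam,PF00000,P00000,12,85")
cath_split = parse_cath("1abc,A,01,0.00.00.000,12,85")
cath_fused = parse_cath("1abcA01,0.00.00.000,12,85")
print(pfam_entry)
print(cath_split == cath_fused)
```

Keeping the PDB start and end positions as strings is a deliberate choice in this sketch, since PDB residue numbers are not guaranteed to be plain integers.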
The complete list of obsolete domain StIs can be found in the Results archive of CroMaSt run1.

D. Structural alignment step at family level
All the cross-mapped structural instances (StIs) are first averaged into domain instances (UniProt-instance-level), and all the domain instances are then averaged together, resulting in the average structure at family level (see Methods section). All these family-level average structures are aligned against the core average structure using Kpax and, based on the alignment score, the families are either added at the beginning of the next iteration or discarded. Table S1 shows an excerpt from the alignment result file.
The three families failing to pass the given threshold (M-score < 0.6) are Ribosomal L23 (PF00276), PPV E2 C (PF00511), and Ribosomal S24e (PF01282). The average structures for these three families are shown in Figure S1. The topology (order of secondary structure elements) of each of these structures is given below. The topologies of the 'core average structure' for RRM and of the 'PPV E2 C' domain type are similar; to confirm this, we visualized the structural alignment of PPV E2 C with the core average domain (Figure S2). RNP regions are highlighted (RNP1: green, RNP2: blue) in the sequence alignment (Figure S2 C), showing the difference between these sequences (structures). The sequence of the RNP regions of 'PPV E2 C' (PF00511 avgStruct core avgStruct.pdb) does not match the RNP sequence of the core average structure (core avgStruct query.pdb). Moreover, the 'PPV E2 C' structure lacks aromatic residues in this region that can form stacking interactions with nucleotides.
A total of 42 StIs were used to compute the average structure at the family level for PPV E2 C (PF00511). All of these StIs originate from CATH and were cross-mapped to the PPV E2 C (PF00511) Pfam family. All the StIs used to compute the average structure at family level can be found in the Results archive of CroMaSt run1.
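The family-level decision described in this section (keep a family for the next iteration if its average structure aligns to the core average structure with M-score >= 0.6, otherwise discard it) can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and the family labels and scores are invented placeholders rather than values from Table S1.

```python
MSCORE_THRESHOLD = 0.6  # threshold stated in the text


def split_families(mscores, threshold=MSCORE_THRESHOLD):
    """Return (kept, failed) family lists according to the M-score threshold."""
    kept, failed = [], []
    for family, score in mscores.items():
        # A family passes when its M-score is at least the threshold
        (kept if score >= threshold else failed).append(family)
    return kept, failed


# Hypothetical scores for illustration only
example_scores = {"FAM_A": 0.82, "FAM_B": 0.60, "FAM_C": 0.47}
kept, failed = split_families(example_scores)
print("kept for next iteration:", kept)
print("failed:", failed)
```

Note that a score exactly equal to the threshold passes in this sketch, matching the ">= 0.6" wording used elsewhere in this document.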

E. Mapping of CroMaSt results to other structure-based domain databases
CroMaSt uses the cross-mapping approach for individual StIs between Pfam and CATH. Although the results from CroMaSt cover both sequence (from Pfam) and structural (from CATH) features, we wanted to compare the results with other structure-based classifications, i.e., ECOD and SCOP. This comparison can be done at different levels (family and instance level), but it is an extensive and time-consuming procedure. Thus, we randomly selected one StI from each Pfam family (with at least 1 StI in the list of 'True' domain StIs) and cross-mapped these StIs to families in ECOD and SCOP. Table S2 shows StIs from Pfam and the cross-mapped families in ECOD and SCOP, respectively. SCOP does not have a family exclusively named 'RRM' or 'RNA Recognition Motif', but the 'Canonical RBD' family of SCOP can be cross-referenced to 'PF00076' (RRM 1) of Pfam. The 'Canonical RBD' family is classified under the superfamily 'RNA-binding domain, RBD' in SCOP. All of the SCOP families from Table S2 are classified under the superfamily 'RNA-binding domain, RBD'.
In ECOD, the family (F) level classification for domains is primarily based on Pfam domains having significant sequence similarity. Thus, the family naming convention is similar to that of Pfam. All the ECOD families from Table S2 are classified under the same topology (T): 'RNA-binding domain, RBD'. In CATH, all these StIs from

Figure S2. Alignment of PPV E2 C (PF00511) average structure (magenta) against RRM core average structure (green). A. and B. Two different views of the structural alignment. C. FASTA alignment resulting from the structural alignment; the region highlighted in blue represents the RNP2 signature sequence and the region in green the RNP1 one.

F. RRM clan in Pfam
The RRM clan CL0221 from Pfam v35.0 contains families that are related to the RRM domain type and are thought to be evolutionarily related. This clan contains 33 families and the total number of domains in the clan is 433471. Table S3 lists all the families from the RRM clan, their number of StIs (apart from obsolete or inconsistent ones) and their exploration by CroMaSt run1 (Main paper) and run2 (Section G, below).

G. CroMaSt run2 with 14 Pfam families from RRM clan
(Table S3 footnote: * Starting Pfam family for the CroMaSt workflow.)
In a second run, we initiated the CroMaSt workflow with 14 Pfam families from the RRM clan (13 Pfam families missing from the first run and PF00076) and superfamily 3.30.70.330 (RRM domain) from CATH. Table S4 shows the results obtained at each step of the workflow. A total of 1445 domain StIs were extracted from the 14 Pfam families, of which 1 was inconsistent, whereas 1527 domain StIs were extracted from CATH, of which 316 were obsolete and 7 inconsistent. Then, the 1444 and 1204 domain StIs from Pfam and CATH, respectively, were residue-mapped. Out of all these residue-mapped domain StIs, 883 are shared between Pfam and CATH. Thus, 883 StIs constitute the 'Core' domain StIs, and are also included in the list of 'True' domain StIs. The core average structure for the RRM domain type was computed using these 883 StIs. These 883 'Core' domain StIs are the same as in the first run, resulting in the same core average structure. From the remaining StIs (561 unique to Pfam and 321 unique to CATH), 21 StIs from Pfam were successfully cross-mapped to 4 different CATH superfamilies and 243 StIs from CATH were successfully cross-mapped to a total of 17 different Pfam families. Thus, 'family-level' average structures were computed for these 4 and 17 newly found CATH and Pfam families using the cross-mapped StIs. After aligning these average structures against the core average structure, 17 of them (3 from CATH and 14 from Pfam) passed the threshold (M-score >= 0.6), allowing these families to be included at the beginning of the next iteration. The remaining 4 families (1 from CATH and 3 from Pfam) and their corresponding StIs did not pass the structural alignment step (M-score < 0.6) and were considered as 'Failed' domain families and StIs. Thus, only 79 StIs from the 243 CATH StIs cross-mapped to Pfam and 14 new Pfam families were kept for the next iteration. Moreover, only 17 StIs from the 21 Pfam StIs cross-mapped to CATH and 3 new CATH superfamilies were kept for the next iteration. The average structures were computed at the 'UniProt-instance-level' for all un-mapped StIs from Pfam (540) and CATH (78).
After structural alignment against the core average structure, only 56 StIs from Pfam failed to pass the threshold. Thus, all remaining StIs (484 from Pfam, and 78 from CATH) are qualified as 'Domain-like' StIs.
In summary, the first iteration resulted in a total of 883 'Core', 562 'Domain-like', and 224 'Failed' domain StIs, as well as 3 CATH and 14 Pfam families along with 17 Pfam StIs and 79 CATH StIs ready for the next iteration.
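The first-iteration totals above can be re-derived from the counts given in the text. The sketch below is only an arithmetic check with hypothetical variable names; in particular, the breakdown of the 'Failed' total (cross-mapped StIs belonging to failed families plus un-mapped StIs below the threshold) is one reading of the text that happens to reproduce the reported number, not a statement of how CroMaSt computes it.

```python
# Counts reported for the first iteration of CroMaSt run2
pfam_extracted, pfam_inconsistent = 1445, 1
cath_extracted, cath_obsolete, cath_inconsistent = 1527, 316, 7
core = 883                                      # StIs shared by Pfam and CATH
pfam_crossmapped, pfam_crossmapped_kept = 21, 17
cath_crossmapped, cath_crossmapped_kept = 243, 79
pfam_unmapped_failed, cath_unmapped_failed = 56, 0

# Residue-mapped StIs after removing obsolete/inconsistent entries
pfam_mapped = pfam_extracted - pfam_inconsistent
cath_mapped = cath_extracted - cath_obsolete - cath_inconsistent

# StIs unique to each database, then those left un-mapped after cross-mapping
pfam_unique = pfam_mapped - core
cath_unique = cath_mapped - core
pfam_unmapped = pfam_unique - pfam_crossmapped
cath_unmapped = cath_unique - cath_crossmapped

# 'Domain-like': un-mapped StIs that pass the alignment threshold
domain_like = (pfam_unmapped - pfam_unmapped_failed) + (cath_unmapped - cath_unmapped_failed)

# 'Failed' (assumed breakdown): cross-mapped StIs in failed families + failing un-mapped StIs
failed = ((pfam_crossmapped - pfam_crossmapped_kept)
          + (cath_crossmapped - cath_crossmapped_kept)
          + pfam_unmapped_failed + cath_unmapped_failed)

print(pfam_mapped, cath_mapped)      # residue-mapped StIs
print(pfam_unique, cath_unique)      # unique to Pfam / CATH
print(pfam_unmapped, cath_unmapped)  # un-mapped after cross-mapping
print(core, domain_like, failed)     # 'Core', 'Domain-like', 'Failed'
```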
The second iteration started with the 14 Pfam families (and 17 StIs from last iteration) and 3 CATH superfamilies (and 79 StIs from last iteration). A total of 100 (+17 from last iteration) domain StIs were extracted from the 14 Pfam families, whereas 17 (+ 79 from last iteration) domain StIs were extracted from 3 CATH superfamilies, with no inconsistent or obsolete entry.
The two sets shared 96 StIs ('True' domain StIs) and the other 21 StIs from Pfam remained un-mapped in CATH. Nearly all of them (20/21) passed the alignment threshold, leading to 20 'Domain-like' StIs and 1 'Failed' domain StI. Since no new family was found by the end of the second iteration, the workflow stopped.
In summary, the CroMaSt workflow, initialized with 14 Pfam families from the RRM clan (including PF00076) and CATH superfamily 3.30.70.330, identified 979 'True' domain StIs (among which 883 are 'Core'), 582 'Domain-like', and 225 'Failed' domain StIs with respect to the RRM domain type. In terms of domain families, the CroMaSt workflow explored a total of 36 families (31 from Pfam and 5 from CATH, including the starting families) and 29 of them (25 from Pfam and 4 from CATH) qualified for the RRM domain type with at least one StI in either the 'True' or the 'Domain-like' category (Table S4).
The 14 Pfam families explored by cross-mapping of CATH StIs in the first iteration of this run are the same as in CroMaSt run1 (Main paper).

H.1. Cadherin domain type
The results obtained at each step of the workflow are given in Table S5.
A total of 785 domain StIs were extracted from the Cadherin Pfam family, whereas 454 domain StIs were extracted from CATH, with no inconsistent or obsolete entry. Then, all these domain StIs from Pfam and CATH were residue-mapped. Out of all these residue-mapped domain StIs, 397 are shared between Pfam and CATH. Thus, 397 StIs constitute the 'Core' domain StIs, and are also included in the list of 'True' domain StIs. The core average structure for the Cadherin domain type was computed using these 397 StIs. From the remaining StIs (388 unique to Pfam and 57 unique to CATH), only 40 StIs from CATH were successfully cross-mapped to a total of 3 different Pfam families. Thus, 'family-level' average structures were computed for these 3 Pfam families using the cross-mapped StIs. After aligning these average structures against the core average structure, all of them passed the threshold (M-score >= 0.6), allowing these families to be included at the beginning of the next iteration. Thus, all domain StIs (40) cross-mapped to Pfam and 3 new Pfam families were kept for the next iteration.
The average structures were computed at the 'UniProt-instance-level' for all un-mapped StIs from Pfam (388) and CATH (17). After structural alignment against the core average structure, all StIs (388 from Pfam and 17 from CATH) qualified as 'Domain-like' StIs. In summary, the first iteration resulted in a total of 397 'Core', 405 'Domain-like', and 0 'Failed' domain StIs, with 3 Pfam families and 40 CATH StIs ready for the next iteration.
The second iteration started with the 3 Pfam families and 40 StIs from CATH. A total of 92 domain StIs were filtered from the 3 Pfam families, with no inconsistent or obsolete entry. The two sets shared 40 StIs ('True' domain StIs) and the other 52 StIs from Pfam remained un-mapped in CATH. All of them (52) passed the alignment threshold, leading to 52 'Domain-like' StIs and no 'Failed' domain StI. Since no new family was found by the end of the second iteration, the workflow stopped.
In summary, the CroMaSt workflow, initialized with Pfam PF00028 and CATH 2.60.40.60 domain families, identified 437 'True' domain StIs (among which 397 are 'Core'), 457 'Domain-like', and 0 'Failed' domain StIs with respect to the Cadherin domain type. In terms of domain families, the CroMaSt workflow explored a total of 5 families (4 from Pfam and 1 from CATH, including the starting families) and all of them (4 from Pfam and 1 from CATH) qualified for the Cadherin domain type (Table S5). The 3 Pfam families detected by cross-mapping of CATH StIs (PF08266, PF08758, and PF17756) are all members of the same clan as the starting family: CL0159 (E-set).
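The split of residue-mapped StIs into 'Core' StIs and StIs unique to one database can be pictured as plain set operations. The sketch below assumes, for illustration only, that each residue-mapped StI can be reduced to a hashable key that is comparable between Pfam and CATH; the function name is hypothetical and the integer keys are synthetic placeholders chosen to reproduce the Cadherin counts of the first iteration (785, 454 and 397).

```python
def partition_stis(pfam_stis, cath_stis):
    """Return (core, unique_to_pfam, unique_to_cath) as sets."""
    core = pfam_stis & cath_stis  # StIs found in both databases
    return core, pfam_stis - core, cath_stis - core


# Synthetic keys: 0..396 are shared, the rest are specific to one database
pfam_stis = set(range(785))                           # 785 Pfam StIs
cath_stis = set(range(397)) | set(range(1000, 1057))  # 454 CATH StIs

core, pfam_only, cath_only = partition_stis(pfam_stis, cath_stis)
print(len(core), len(pfam_only), len(cath_only))
```

In the actual workflow the comparison rests on residue mapping rather than on precomputed keys, so this sketch only mirrors the bookkeeping, not the mapping itself.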

H.2. Zinc Finger domain type
We instantiated the CroMaSt workflow with Pfam family PF00096 (zf-C2H2) and CATH superfamily 3.30.160.60 (Classic Zinc Finger). We used the same parameters for this run of the CroMaSt workflow, except for the starting family identifiers and the minimum domain length. We used a minimum domain length of 10 for the Zinc Finger domain type because it is very small (∼25 aa) compared to the Cadherin (∼93 aa) and RRM (∼90 aa) domain types. The results obtained at each step of the workflow are given in Table S6. A total of 579 domain StIs were extracted from the zf-C2H2 Pfam family, whereas 359 domain StIs were extracted from CATH, of which 8 were inconsistent and 1 obsolete. Then, the 579 and 350 domain StIs from Pfam and CATH, respectively, were residue-mapped. Out of all these residue-mapped domain StIs, 217 are shared between Pfam and CATH. Thus, 217 StIs constitute the 'Core' domain StIs, and are also included in the list of 'True' domain StIs. The core average structure for the Zinc Finger domain type was computed using these 217 StIs. From the remaining StIs (362 unique to Pfam and 133 unique to CATH), only 47 StIs from CATH were successfully cross-mapped to a total of 14 different Pfam families. Thus, 'family-level' average structures were computed for these 14 Pfam families using the cross-mapped StIs. After aligning these average structures against the core average structure, 4 of them (corresponding to 7 cross-mapped StIs) passed the threshold (M-score >= 0.6), allowing these families to be included at the beginning of the next iteration. The remaining 10 families and their corresponding StIs (40) did not pass the structural alignment step (M-score < 0.6) and were considered as 'Failed' domain families and StIs. Thus, only 7 domain StIs from the 47 CATH StIs cross-mapped to Pfam were kept for the next iteration.
The average structures were computed at the 'UniProt-instance-level' for all un-mapped StIs from Pfam (362) and CATH (86). After structural alignment against the core average structure, only 28 StIs from CATH failed to pass the threshold. Thus, all remaining StIs (362 from Pfam and 58 from CATH) qualified as 'Domain-like' StIs.
In summary, the first iteration resulted in a total of 217 'Core', 420 'Domain-like', and 68 'Failed' domain StIs, with 4 Pfam families and 7 CATH StIs ready for the next iteration.
The second iteration started with the 4 Pfam families and 7 StIs from CATH. A total of 45 domain StIs were filtered from the 4 Pfam families, with no inconsistent or obsolete entry. The two sets shared 7 StIs ('True' domain StIs) and the other 38 StIs from Pfam remained un-mapped in CATH. All of them (38) passed the alignment threshold, leading to 38 'Domain-like' StIs and no 'Failed' domain StI. Since no new family was found by the end of the second iteration, the workflow stopped.
In summary, the CroMaSt workflow, initialized with Pfam PF00096 and CATH 3.30.160.60 domain families, identified 224 'True' domain StIs (among which 217 are 'Core'), 458 'Domain-like', and 68 'Failed' domain StIs with respect to the Zinc Finger domain type. In terms of domain families, the CroMaSt workflow explored a total of 16 families (15 from Pfam and 1 from CATH, including the starting families) and 6 of them (5 from Pfam and 1 from CATH) qualified for the Zinc Finger domain type (Table S6).
The 4 Pfam families detected by cross-mapping of CATH StIs (PF08209, PF10426, PF13909, PF18450) are all members of the same clan as the starting family: CL0361 (C2H2-zf). However, some of the 10 'Failed' Pfam families are also from the same clan. This suggests that the alignment threshold used is too high for this domain type and should be lowered in order to retrieve more 'True' domain StIs from different families assigned to it. This is likely due to the rather small size (about 25 amino acids) of this domain type.
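The final totals reported for each run follow from adding the per-iteration counts, with the 'Core' StIs of the first iteration counted as 'True'. The sketch below checks this for the three runs described in Sections G and H, using only numbers stated in the text; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IterationCounts:
    true_stis: int    # 'Core' StIs in iteration 1, shared StIs afterwards
    domain_like: int
    failed: int


def run_totals(iterations):
    """Sum 'True', 'Domain-like' and 'Failed' counts over all iterations."""
    return (
        sum(it.true_stis for it in iterations),
        sum(it.domain_like for it in iterations),
        sum(it.failed for it in iterations),
    )


# Per-iteration counts as stated in the text
runs = {
    "RRM (run2)": [IterationCounts(883, 562, 224), IterationCounts(96, 20, 1)],
    "Cadherin": [IterationCounts(397, 405, 0), IterationCounts(40, 52, 0)],
    "Zinc Finger": [IterationCounts(217, 420, 68), IterationCounts(7, 38, 0)],
}

for name, iterations in runs.items():
    true_total, domain_like_total, failed_total = run_totals(iterations)
    print(f"{name}: True={true_total}, Domain-like={domain_like_total}, Failed={failed_total}")
```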