We identified a cadherin-like domain (CHDL) using computational analysis. The CHDL domain is mostly distributed in Proteobacteria and Cyanobacteria, although it is also found in some eukaryotic proteins. Prediction of three-dimensional protein folding indicated that the CHDL domain has an immunoglobulin β-sandwich fold and belongs to the cadherin superfamily. The CHDL domain does not have the LDRE and DxNDN motifs, which are conserved in the cadherin domain, but has three other motifs, PxAxxD, DxDxD and YT-V/I-S/T-D, which might contribute to forming a calcium-binding site. The identification of this cadherin-like domain indicates that the cadherin superfamily may exhibit wider sequence and structural diversity than previously appreciated. Domain architecture analysis revealed that the CHDL domain is associated with other adhesion domains as well as enzyme domains. Based on computational analysis and previous experimental data, we predict that the CHDL domain has calcium-binding and also carbohydrate-binding activity.
The cadherin domain is mostly distributed in the metazoan lineage. According to the SMART database (as of April 2005), there are 9496 cadherin domains in 1421 metazoan proteins and only 73 cadherin domains in 19 bacterial proteins. The cadherins comprise a family of calcium ion-mediated cell adhesion molecules that form and maintain cell–cell adhesion [2–4]. Cadherins are single-pass transmembrane proteins characterized by the presence of distinctive cadherin repeat sequences (cadherin domains) in their extracellular regions. Each of these repeats, consisting of about 110 amino acids, forms a β-sandwich with Greek-key folding topology. Cadherins typically have several cadherin domains tandemly repeated in their extracellular segments. Cadherin-mediated cell–cell junctions are formed as a result of interactions between extracellular domains of identical cadherins, which are located in the membranes of neighboring cells. Cadherins can be classified into several subfamilies: type I (classical) and type II cadherins, which are ultimately linked to the actin cytoskeleton [5,6]; desmosomal cadherins (desmocollins and desmogleins), which are linked to intermediate filaments; and protocadherins, which are expressed primarily in the nervous system. In addition, there are several 'atypical' cadherins, which contain one or more cadherin repeat sequences but bear no other hallmarks of cadherins.
Using PSI-BLAST, we identified a cadherin-like domain that occurs mostly in bacterial toxins, enzymes and adhesion proteins in Proteobacteria and Cyanobacteria. We termed this new cadherin-like domain CHDL. Sequence alignments and folding prediction indicated that the CHDL domain has an immunoglobulin β-sandwich fold and belongs to the cadherin superfamily. The CHDL domain is associated with several other adhesion domains and enzyme domains. Previous experimental data support the proposal that the CHDL domain may also have carbohydrate-binding activity. Our analysis provides additional structural and functional predictions for CHDL domain-containing proteins, most of which have received little experimental study. The results also facilitate our understanding of the evolution of the cadherin domain.
Materials and methods
All PSI-BLAST searches were carried out against the NCBI non-redundant protein database using an inclusion threshold of E < 0.005. Multiple sequence alignments were performed using the T-Coffee program with default settings, followed by minimal manual adjustments, and the resulting alignments were colored using the Chroma program.
Domain architecture and structure predictions
Domain architectures of CHDL-containing proteins were analyzed by individually comparing the protein sequences against the SMART 4.0 and the Pfam 17.0 databases. Secondary structure predictions were performed using the PSIPRED program and the JPRED2 consensus program. Protein folding predictions were performed using the 3D-PSSM program, the FUGUE program and the 123D folding prediction program.
Results and discussion
Domain definition and sequence analysis
When the hypothetical protein alr0276 (BAB77800) from Nostoc sp. PCC 7120 was compared against the SMART database, a sequence located between the Exo-endo-Phos domain (PF03372) and the hemolysin-type calcium-binding repeat domain (PF00353) was recognized by BLAST search as a cadherin domain. This sequence, however, was not recognized as a cadherin domain in the SMART and Pfam databases. We found that this sequence represents a highly conserved family that is mostly distributed in bacteria. A PSI-BLAST search of the NCBI non-redundant protein database with the sequence (residues 2055–2154) of the hypothetical protein alr0276 (BAB77800) as a query retrieved 133 proteins after three iterations. Of the 133 proteins, 75 were found in the large group Proteobacteria, including Vibrio vulnificus YJ016, Vibrio parahaemolyticus, and Microbulbifer degradans; 19 in the Cyanobacteria group, including Nostoc punctiforme PCC 73102 and Crocosphaera watsonii WH 8501; 9 in Planctomycetes (Pirellula); and 6 in the Actinobacteria. Several eukaryotic proteins also contain the CHDL domain, such as the human extracellular matrix protein (CAD54734) and the hypothetical protein (NP_199726) from Arabidopsis thaliana. The CHDL domain boundaries were further defined by examining tandemly repeated copies in the retrieved proteins. There are three tandemly linked copies of the CHDL domain in the RTX toxin and related Ca2+-binding protein (ZP_00111400) from N. punctiforme. A multiple sequence alignment of this domain from representatives of the 133 proteins was constructed using the T-Coffee program with appropriate manual adjustments (Fig. 1).
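The taxonomic breakdown reported above can be tallied with a few lines of Python. This is only an illustrative sketch: the counts are the four itemized in the text, the names `PHYLUM_COUNTS`, `TOTAL_HITS` and `summarize_distribution` are hypothetical, and the remainder simply reflects retrieved proteins that the text does not itemize by group (for example, the eukaryotic proteins).

```python
# Counts as itemized in the text; all names here are illustrative.
PHYLUM_COUNTS = {
    "Proteobacteria": 75,
    "Cyanobacteria": 19,
    "Planctomycetes": 9,
    "Actinobacteria": 6,
}
TOTAL_HITS = 133  # proteins retrieved after three PSI-BLAST iterations


def summarize_distribution(counts, total):
    """Return (percentage of total per group, number of hits not itemized)."""
    percentages = {
        group: round(100 * n / total, 1) for group, n in counts.items()
    }
    unlisted = total - sum(counts.values())
    return percentages, unlisted


if __name__ == "__main__":
    percentages, unlisted = summarize_distribution(PHYLUM_COUNTS, TOTAL_HITS)
    for group, pct in percentages.items():
        print(f"{group}: {pct}% of retrieved proteins")
    print(f"Not itemized by group in the text: {unlisted}")
```

Under these counts, Proteobacteria and Cyanobacteria together account for roughly 70% of the retrieved proteins, which is consistent with the statement that the domain is mostly distributed in these two groups.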
Domain architectures were constructed by individually comparing the protein sequences against the SMART 4.0 and the Pfam 17.0 databases. The domain architectures of the CHDL-containing proteins revealed many diverse domain combinations (Fig. 2).
Proteins containing the CHDL domain usually have one or more domains with known adhesion function. For example, CHDL is associated with the hemolysin-type calcium-binding repeat domain (PF00353) and the integrin α (β-propeller repeat) domain (SM00191) in an RTX toxin and related Ca-binding protein (ZP_00111400) from N. punctiforme. The CHDL domain is also associated with the fibronectin type 3 domain (FN3) (SM00060) and the dystroglycan-type cadherin-like domain (CADG) (SM00736) in the fibronectin type III domain protein (NP_715831) from Shewanella oneidensis MR1. The hemolysin-type calcium-binding repeat domain (PF00353) is involved in the binding of calcium ions in a parallel β roll structure. The fibronectin type 3 domain consists of approximately 100 amino acids and is tandemly repeated to generate binding sites for DNA, heparin and cell-surface adhesion domains. Integrin α domains mediate cell-to-cell and cell-to-matrix adhesion. The dystroglycan-type cadherin-like domain (CADG) (SM00736) is a cadherin-homologous domain which may bind calcium ions. Several other domains associated with CHDL also have adhesion function, for example, the von Willebrand factor type A domain (VWA) (SM00327) [22,23] in the hypothetical protein Tery02003054 (ZP_00326727) from Trichodesmium erythraeum, the PKD domain (SM00089) in the RTX toxin (NP_759056) from V. vulnificus, and the chitin binding domain 3 (PF03067) in a putative chitin-binding protein (CbpA) (DAA01337) from M. degradans.
The CHDL domain is also associated with some enzyme domains, such as the Exo-endo-Phos domain (PF03372) in a hypothetical protein alr0276 (BAB77800) from Nostoc sp. PCC 7120, the Glyco-hydro-16 domain (PF00722) in Beta-glucanase/Beta-glucan synthetase (ZP_00315837) from M. degradans-24, and the Peptidase-S8 domain (PF00082) in the subtilisin-like serine proteases (ZP_00226513) from Kineococcus radiotolerans. The association of the CHDL domain with a variety of enzymes, in addition to adhesion proteins, illustrates its role in proteins with diverse functions.
During domain architecture construction, one CHDL domain in the exochitinase protein (YP_204981) from Vibrio fischeri ES114 was found to overlap with a PKD domain. We speculated that it should be identified as a CHDL domain, rather than a PKD domain, for several reasons. First, the sequence (residues 694–782) of this exochitinase protein was recognized as a PKD domain with an E value of 1.80 in the SMART database, but was not recognized as the PKD domain in the Pfam database due to a high E value. Second, the sequence (residues 694–782) of this exochitinase protein has a TDADSD motif and this kind of xDxDxD motif may have calcium-binding activity, and we did not find this motif in the sequence of the PKD family alignment in the SMART database. Third, the sequence (residues 694–782) of this exochitinase protein was, in fact, recognized as a CHDL domain with an E value as low as 2 × 10−6 and had sequence identity of 25% with the sequence (residues 2055–2154) of the hypothetical protein alr0276 (BAB77800) from Nostoc sp. PCC 7120.
CHDL domain folding prediction
The sequence (residues 2055–2154) of the hypothetical protein alr0276 (BAB77800) from Nostoc sp. PCC 7120 and the sequence (residues 999–1088) of a putative RTX toxin (NP_798012) from V. parahaemolyticus, identified as CHDL domains, were selected to perform secondary structure predictions using the PSIPRED program and the JPRED2 consensus program. The results revealed that the CHDL domain was an all-β domain as indicated in Fig. 1. Other sequences of the CHDL domain in different CHDL domain containing proteins gave similar results.
Three different programs were used to perform protein folding prediction using the sequence (residues 2055–2154) of the hypothetical protein alr0276 (BAB77800) as a query. From the 3D-PSSM folding prediction program, a mouse E-cadherin protein (PDB entry: 1edh) was identified with an E value of 0.50. From the FUGUE program, a mouse α-dystroglycan protein (PDB entry: 1u2c) was identified with a Z score of 8.72. The 123D folding prediction program identified a mouse N-cadherin protein (PDB entry: 1nci). Similar results were obtained using other sequences of the CHDL domain in different CHDL domain-containing proteins as queries. Thus, three different protein folding prediction programs indicate that the 3D fold of the CHDL domain is similar to the fold of the cadherin domain. The sequence of the mouse E-cadherin protein (PDB entry: 1edh) was carefully aligned to the CHDL domain sequence based on the alignment results obtained by querying the sequence (residues 2055–2154) of the hypothetical protein alr0276 (BAB77800) and the sequence (residues 999–1088) of the putative RTX toxin protein (NP_798012) in the 3D-PSSM folding prediction program. The secondary structure of the cadherin domain (PDB entry: 1edh) is also labeled in Fig. 1. As mentioned, the cadherin domain occurs mostly in the metazoan lineage. Although cell adhesion proteins such as cadherin are thought to have arisen prior to animal evolution, the available information is still limited. The identification of the CHDL domain suggests that the cadherin superfamily may exhibit wider sequence and structural diversity than previously appreciated and that the evolution of the cadherin domain may be more complex than previously reported.
The cadherin family has been studied extensively [2–4]. The cadherin domain has three cadherin-specific signature motifs, DxD, LDRE and DxNDN, which participate in the formation of calcium-binding pockets and are underlined in the 1edh sequence in Fig. 1. By checking the CHDL domain sequence alignment, we found that the CHDL domain has a similar motif only in the region corresponding to the cadherin DxD motif, and has no counterparts of the LDRE and DxNDN motifs. CHDL domains do have several conserved motifs, namely PxAxxD, DxDxD and YT-V/I-S/T-D, and these motifs may contribute to binding calcium or other metal ions. If the CHDL domain can bind calcium ions, it may do so by a mechanism different from that of the cadherin domain. It is quite possible that different conserved motifs function to bind calcium ions. For example, in the MIDAS motif [22,23], the immunoglobulin β-sandwich fold VWA domain uses D-x-S-x-S to form a conserved metal ion binding motif.
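The three CHDL motifs can be written as regular expressions and used to scan a protein sequence. The sketch below is a minimal illustration, not the procedure used in this study: it reads "x" as any single residue and "V/I" or "S/T" as alternatives at one position, the function name `find_motifs` is hypothetical, and the test sequence is an invented toy peptide rather than a real CHDL domain (it embeds the TDADSD fragment reported for the V. fischeri exochitinase).

```python
import re

# Motifs as named in the text, translated to regular expressions.
# "x" is read as any one residue; "V/I" and "S/T" as alternatives.
MOTIF_PATTERNS = {
    "PxAxxD": "P.A..D",
    "DxDxD": "D.D.D",
    "YT-V/I-S/T-D": "YT[VI][ST]D",
}


def find_motifs(sequence):
    """Return {motif name: [0-based start positions]} for a protein sequence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    seq = sequence.upper()
    hits = {}
    for name, pattern in MOTIF_PATTERNS.items():
        regex = re.compile("(?=(" + pattern + "))")
        hits[name] = [match.start() for match in regex.finditer(seq)]
    return hits


if __name__ == "__main__":
    # Invented toy peptide for illustration only, not a real CHDL sequence.
    toy_sequence = "GSPLAGTDADSDGKYTVSDAA"
    for motif, positions in find_motifs(toy_sequence).items():
        print(motif, positions)
```

A pattern match of this kind only flags candidate sites; whether a matched segment actually coordinates calcium would still have to be established from structural or experimental data.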
CHDL domain function prediction
Most CHDL domains are mosaically interposed with other adhesion domains such as VWA, FN3, integrin-α, CA, CADG and PKD in many proteins. According to the SCOP database, the VWA, FN3, integrin-α, CA, CADG and PKD domains all have immunoglobulin β-sandwich folding. The frequent association of CHDL domains with adhesion domains and the similarity of the 3D fold of the CHDL domain to the folds of these adhesion domains make it reasonable to postulate that CHDL might also have adhesion functions.
One of the CHDL domain-containing proteins, RapA1 (AAG18518), has been experimentally studied in detail. It is a secreted protein found in R. leguminosarum bv. trifolii with calcium-binding and possibly carbohydrate-binding activity. RapA1 consists of two homologous repeats, referred to as the Ra domain by the authors. No domain was recognized when the RapA1 protein sequence was queried against the SMART and Pfam databases. During the PSI-BLAST searches, the second repeat of RapA1 was retrieved with an E value of 1 × 10−7. Its homologs RapC (AAK01178) and RapA2 (AAK01177) were also retrieved with E values of 2 × 10−5 and 3 × 10−4, respectively.
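As a quick consistency check, the E values reported for the Rap proteins can be compared with the PSI-BLAST inclusion threshold stated in the methods (E < 0.005). The following sketch is illustrative only; the names `INCLUSION_THRESHOLD`, `RAP_HITS` and `passes_inclusion` are hypothetical, and the dictionary holds just the three values quoted in the text.

```python
INCLUSION_THRESHOLD = 0.005  # PSI-BLAST inclusion threshold from the methods

# E values as quoted in the text for the Rap proteins.
RAP_HITS = {
    "RapA1 (second repeat)": 1e-7,
    "RapC": 2e-5,
    "RapA2": 3e-4,
}


def passes_inclusion(e_value, threshold=INCLUSION_THRESHOLD):
    """True if a hit's E value is strictly below the inclusion threshold."""
    return e_value < threshold


if __name__ == "__main__":
    for name, e_value in RAP_HITS.items():
        status = "included" if passes_inclusion(e_value) else "excluded"
        print(f"{name}: E = {e_value:g} ({status})")
```

All three reported values fall below the threshold by more than an order of magnitude, so under this criterion each would be eligible for inclusion in the profile.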
In several proteins, the CHDL domain is associated with a chitin-binding domain or chitinase domain. For example, the CHDL domain is associated with the chitin-binding domain (PF03067) and the CBII domain (SM00637) in the CbpA protein (DAA01337) from M. degradans. It is associated with the cellulose-binding domain (CBIV) (SM00606) and the Glyco_16 domain (PF00722) in the Beta-glucanase/Beta-glucan synthetase (ZP_00315837) from M. degradans, and with the PKD domain, the chitin-binding domain type 3 (SM00495) and the Glyco_18 domain (SM00636) in an exochitinase protein (YP_204981) from V. fischeri. According to the SMART and Pfam databases, the chitin-binding domain (PF03067), CBII domain (SM00637), cellulose-binding domain (CBIV) (SM00606), and chitin-binding domain type 3 (SM00495) all have carbohydrate-binding ability. The Glyco_16 domain (PF00722) and Glyco_18 domain (SM00636) are O-glycosyl hydrolases and belong to the chitinase class II group. Based on the above domain associations, we postulate that the CHDL domain might have carbohydrate-binding ability.
This study was financed by the National 973 Program to Dr. L. Yu.