The VCBS superfamily forms a third supercluster of β-propellers that includes tachylectin and integrins

Abstract
Motivation: β-Propellers are found in great variety across all kingdoms of life. They assume many cellular roles, primarily as scaffolds for macromolecular interactions and catalysis. Despite their diversity, most β-propeller families clearly originated by amplification from the same ancient peptide, the 'blade'. In cluster analyses, β-propellers of the WD40 superfamily always formed the largest group, to which some important families, such as the α-integrin, Asp-Box and glycoside hydrolase β-propellers, connected weakly. Motivated by the dramatic growth of sequence databases, we revisited these connections, with a special focus on VCBS-like β-propellers, which have not been analysed for their evolutionary relationships so far.
Results: We found that VCBS-like β-propellers form a supercluster with integrin-like β-propellers and tachylectins, clearly delimited from the superclusters formed by WD40 and Asp-Box β-propellers. Connections between the three superclusters are made mainly through PQQ-like β-propellers. Our results present a new, greatly expanded view of the β-propeller classification landscape.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Proteins with a β-propeller domain are found in all kingdoms of life (Fig. 1c). They are involved in diverse biological processes, from adhesion to transcription regulation (Chen et al., 2011; Fülöp and Jones, 1999; Guruprasad and Dhamayanthi, 2004; Pons et al., 2003). In them, the β-propeller acts mostly as a recognition site for different biomolecules, but may also carry catalytic activity. These repetitive domains (Andrade et al., 2001; Söding and Lupas, 2003) adopt a toroid fold, where between 4 and 12 (Fig. 1d) copies of a widespread supersecondary structure, the 4-stranded β-meander, are arranged radially around a central channel (Fig. 1a, b). These repeats, whose strands are labelled A-D (Fig. 1b), are called 'blades' and the toroids they form correspondingly 'propellers'. Blades carry specific sequence motifs which allow the classification of cognate β-propellers into a hierarchy of families and superfamilies (Chaudhuri et al., 2008; Chen et al., 2011; Fülöp and Jones, 1999; Guruprasad and Dhamayanthi, 2004; Pons et al., 2003).
Despite their wide sequence diversity (Fig. 1e, f), most β-propeller families are related to each other and emerged by independent amplification from a set of homologous ancestral blades, in a process that is still visibly ongoing (Afanasieva et al., 2019; Alva et al., 2015; Chaudhuri et al., 2008; Dunin-Horkawicz et al., 2014; Kopec and Lupas, 2013). Classification studies (Chaudhuri et al., 2008; Kopec and Lupas, 2013) suggested that most β-propeller families form a supercluster centred on WD40 β-propellers, a large superfamily characterized by a Trp-Asp motif at the end of strand C (in position 40). Proteins assigned to this supercluster in previous studies included the 7-bladed β-subunits of G-proteins, the 6-bladed low-density lipoprotein (LDL) receptors, the 6-bladed protein kinase PknD and the 5-bladed tachylectin-2 family, which comprises eukaryotic lectins involved in the innate immunity of cnidarians and crustaceans (Beisel et al., 1999; Hayes et al., 2010; Neer et al., 1994). Some peripheral groups connected weakly to this supercluster (Chaudhuri et al., 2008; Kopec and Lupas, 2013), such as the 7-bladed β-propeller domain of α-integrins, characterized by a Ca²⁺-binding DxDxDG motif in the loop connecting strands A and B (loop AB) and an FG-GAP/Cage motif, which is contiguous in space but not in sequence, covering the N-terminal end of strand A and the C-terminal end of strand B (Chouhan et al., 2011; Rigden and Galperin, 2004). This connection was proposed to be weakly mediated by Asp-Box β-propellers, most of whose members are characterized by a SxDxGxTW motif in the loop connecting strands C and D (loop CD) (Quistgaard and Thirup, 2009).
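As an illustration of how the two motifs quoted above can be read as patterns, the following minimal sketch scans a protein sequence for exact matches to DxDxDG and SxDxGxTW, with 'x' taken to mean any residue. The function name `find_motifs` is hypothetical. Exact matching is a simplifying assumption: real family members tolerate substitutions, which is why profile-based methods rather than regular expressions are used for classification.

```python
import re

# Motifs as written in the text; 'x' is treated as "any residue" (an assumption).
MOTIFS = {
    "DxDxDG": r"D.D.DG",
    "SxDxGxTW": r"S.D.G.TW",
}

def find_motifs(sequence):
    """Return {motif name: [0-based start positions]} for a protein sequence."""
    seq = sequence.upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        # A lookahead is used so that overlapping occurrences are also reported.
        lookahead = re.compile(f"(?=({pattern}))")
        hits[name] = [m.start() for m in lookahead.finditer(seq)]
    return hits
```

Applied to a toy string such as `"AADKDGDGAA"` (not a real protein), the DxDxDG pattern matches at position 2 and the Asp-Box pattern is absent.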

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Bioinformatics, 36(24), 2020, 5618-5622. doi: 10.1093/bioinformatics/btaa1085. Advance Access Publication Date: 8 January 2021. Original Paper.

[...] (Meusch et al., 2014) and fungal PVL lectins (Cioci et al., 2006), and others found in a variety of hypothetical archaeal toxins (Makarova et al., 2019). As PVLs carry a conserved Ca²⁺-binding DxDxDG motif in loop AB, their similarity to integrin-like β-propellers has been conjectured (Cioci et al., 2006), but their mode of carbohydrate recognition appears to be more similar to that of tachylectin-2 (Beisel et al., 1999; Cioci et al., 2006). In order to obtain further insight into this group and locate it within the β-propeller landscape, we performed a survey of VCBS-like β-propellers and their relationship to integrin-like, Asp-Box, tachylectin and WD40 β-propellers.

Materials and methods
Thirteen β-propeller representatives of known structure, chosen to represent the families described above (Supplementary Table S1), were used as queries for sequence searches with PSI-BLAST (Altschul, 1997). Searches for most families were carried out with the nr database filtered to a maximum sequence identity of 30% (nr30, as of May 2020) (Zimmermann et al., 2018). Given their sparse taxonomic distribution, tachylectins were searched on the nr database filtered to a maximum sequence identity of 50%. Matches covering more than 80% of the corresponding query were collected after 2 rounds and filtered to a maximum sequence identity of 50% with CD-HIT (Li and Godzik, 2006). The final sequences were assigned an ECOD family by HHsearches against a database of HMM profiles built for the ECOD database filtered to 70% maximum sequence identity (HHpred ECOD70 database as of March 2020) (Zimmermann et al., 2018). Each sequence was assigned the best match at a probability better than 90%. Taxonomic information was collected from the Entrez Taxonomy database. Sequences were clustered with CLANS (Frickey and Lupas, 2004) based on the P-value of their BLASTp pairwise comparison, computed using the BLOSUM62 scoring matrix. Clustering of the entire set was performed until equilibrium at a P-value of 10⁻⁵, and superclusters were identified manually based on the name of the corresponding query sequences and the ECOD domains assigned. To identify subclusters and internal connections, the sequences in the VCBS supercluster, including and excluding the PQQ/RGL11 sequences, were re-clustered at P-values of 10⁻¹⁸ (Fig. 2b) and 10⁻²⁰, respectively (Supplementary Fig. S1a).
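To make the effect of the P-value cutoff concrete, the sketch below groups sequences that are linked by any pairwise comparison at or below a chosen cutoff. This is not how CLANS works: CLANS computes a force-directed layout, and the superclusters here were identified manually. Single-linkage grouping is swapped in purely to show why a stricter cutoff (for example 10⁻¹⁸ instead of 10⁻⁵) breaks a large group into subclusters. The function name and the input format (a dictionary of id pairs to P-values) are assumptions for illustration.

```python
def threshold_components(ids, pairwise_p, cutoff):
    """Group sequence ids linked by any pairwise P-value <= cutoff.

    ids        : iterable of sequence identifiers
    pairwise_p : dict mapping (id_a, id_b) -> P-value (hypothetical input format)
    cutoff     : P-value threshold, e.g. 1e-5
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pairwise_p.items():
        if p <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    # Sorted output so that results are deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

With toy P-values, three sequences linked at 10⁻²⁰ and 10⁻⁸ form one group at a cutoff of 10⁻⁵, but only the first pair stays together at 10⁻¹⁸.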
In order to evaluate the domain environments of the β-propellers in each subcluster, their parent full-length proteins were collected and binned by size, with a step of 100 residues. A representative for each bin was collected and domains annotated iteratively with HHsearch as above. A maximum of four iterations was carried out, where sequence regions not yet mapped to a domain were searched individually. Only the best matches at a probability better than 70% and larger than 40 residues were considered. Signal peptide prediction was carried out with Phobius (Käll et al., 2004).
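The size binning described above can be sketched as follows. The bin boundaries (half-open intervals starting at multiples of the step) and the rule for choosing a representative (the first identifier in sorted order) are assumptions, since the text does not state them; both function names are hypothetical.

```python
def bin_by_length(proteins, step=100):
    """Map bin start -> list of protein ids, with bins [start, start + step).

    proteins : dict mapping protein id -> full-length sequence
    step     : bin width in residues (100 in the text)
    """
    bins = {}
    for pid, seq in proteins.items():
        start = (len(seq) // step) * step
        bins.setdefault(start, []).append(pid)
    return bins

def pick_representatives(bins):
    """Pick one id per bin: the first id in sorted order (an assumed rule)."""
    return {start: sorted(members)[0] for start, members in sorted(bins.items())}
```

Each representative would then go through the iterative domain annotation, which is not reproduced here.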
For HMM comparisons, the full-length sequences of the β-propellers composing the clusters and subclusters depicted in Figure 2 were used. For each group, the sequences were aligned with MUSCLE (Edgar, 2004) and the alignment trimmed with trimAl (Capella-Gutierrez et al., 2009), removing columns where >25% of the positions were a gap (gap score of 0.75) and sequences that only overlapped with less than half of the columns populated by 80% or more of the other sequences. HMM profiles were built with HHmake and aligned with HHalign (Söding, 2005), using default parameters without secondary structure scoring. The alignments were then inspected and segments corresponding to the best conserved individual blades were used to build Figure 3b. Structural alignments were carried out with TM-align (Zhang and Skolnick, 2005).
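The column-trimming criterion above (dropping columns where more than 25% of positions are gaps) can be illustrated with a few lines of plain Python. This is a sketch of the criterion only, not of trimAl itself, and it omits the second step that removes poorly overlapping sequences. The function name and the use of '-' as the gap character are assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.25):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment : list of equal-length aligned sequences, gaps written as '-'
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    # Keep a column if its gap fraction is at most the threshold.
    keep = [
        col for col in range(length)
        if sum(row[col] == "-" for row in alignment) / n <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]
```

A column with exactly 25% gaps is kept, since the criterion removes only columns with more than 25% gaps.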

Results
PSI-BLAST searches with 13 β-propellers of known structure, chosen to represent the families described above (Supplementary Table S1), yielded a total of 5996 sequences from bacteria, archaea and eukaryotes (see Methods). When clustered by pairwise similarity (Fig. 2), these sequences form three superclusters organized around cores of WD40, Asp-Box and VCBS-like β-propellers, respectively. The WD40 and Asp-Box superclusters were expected, based on previous analyses (Chaudhuri et al., 2008; Kopec and Lupas, 2013), but we were struck by the clear grouping of the other β-propeller families into a third supercluster, centred on VCBS and clearly delimited from the other two.
The core of the VCBS supercluster comprises prokaryotic β-propellers from diverse hypothetical protein families (Supplementary Fig. S1), which carry a signal sequence and may contain several β-propeller domains, accompanied by domains associated with biomolecular interactions (mostly immunoglobulin-like domains, but also armadillo repeats and jelly-roll-like lectins, Supplementary Fig. S1). The VCBS core group is connected to a large periphery of VCBS-like families, including PVL, TcB and AUDH, as well as to diverse hypothetical β-propellers, which have hitherto remained unstudied (Fig. 2b and Supplementary Fig. S1). β-Propeller families in this periphery are found in a variety of hypothetical proteins, whose domain composition suggests an involvement in biomolecular interactions and catalysis (Supplementary Fig. S1a). The most peripheral families that still connect directly to the VCBS core are the integrin-like β-propellers and the bacterial RGL11 family (rhamnogalacturonan lyase YesX, ECOD: 001396995). Two other important β-propeller families complete the VCBS supercluster, comprising tachylectins and PQQ β-propellers, respectively. These connect to each other, and also to the VCBS core via RGL11, in the case of PQQ, and via a β-propeller family we have named VCBS actinolectins, in the case of tachylectins.
We chose the name 'VCBS actinolectins' given their exclusive occurrence in actinobacteria and evolutionary connection to tachylectins (Fig. 1b and Supplementary Fig. S1), but no member of this family has as yet been characterized functionally or structurally. These β-propellers are found in proteins that carry a signal sequence and either consist of the single β-propeller domain or of the β-propeller preceded by a TIM barrel (Supplementary Fig. S1a). Their connection to the tachylectin cluster is mediated by a core of bacterial tachylectin-like sequences, which are found in secreted proteins often containing additional domains involved in catalysis. Two groups radiate from this core, the eukaryotic tachylectins-2 and a [...] (Beisel et al., 1999; Hayes et al., 2010; Smock et al., 2016).

Fig. 2. [...] Clustering was carried out with CLANS in 2D until equilibrium at a BLASTp P-value of 10⁻⁵, with connections representing similarities at this P-value (the darker, the more similar). Different regions of the map are annotated with the name of the sequences within the corresponding cluster or, when a cluster encompasses multiple families, by the β-propeller family as in ECOD and Pfam. (b) Cluster map of the 2662 sequences in the VCBS supercluster. Clustering was carried out as in (a) but at a BLASTp P-value of 10⁻¹⁸, in order to expand it and uncover its internal structure. Connections are shown at a BLASTp P-value of 10⁻¹⁰. Dots are coloured based on the family name (f-name) of the best match in HMM searches against ECOD. Multiple colours within the same cluster correspond to sequences that match multiple close β-propeller families. HP stands for 'hypothetical propeller'.

Fig. 3. HMM comparison of β-propeller groups. (a) Sequence homology matrix of β-propeller groups selected from the cluster maps, as measured by the probability of the alignment of full-length HMM profiles with HHalign. (b) Multiple alignment of the HMM consensus sequences, focused on representative single-bladed regions. Sequence motifs common to the VCBS supercluster are highlighted in grey and summarized on top. Their function in members of known structure is depicted: a grey circle with Me⁺ represents 'metal binding' and a grey hexagon 'sugar binding'. The Asp-Box motif is highlighted in light red. Arrows depict the four strands of the blade and are named accordingly. This annotation was carried out based on the known structures of the families shown, but represents only a consensus as, due to structural deviations or special structural features, the specific start and end of these strands may be shifted.

HMM comparisons highlight the sequence motifs behind the connections described here (Fig. 3). The most prominent motif is the aspartate-rich DxDxDG sequence of loop AB (Figs 3b and 4) (Chouhan et al., 2011; Cioci et al., 2006; Rigden and Galperin, 2004). While in PVL and α-integrin this loop binds Ca²⁺ (Fig. 4b), in other members it may also recognize other metal cations (Chouhan et al., 2011; Claesson et al., 2012; Meusch et al., 2014; Rigden and Galperin, 2004). Also conspicuous are two noncontiguous, highly conserved residues of loop CD, G and W (Fig. 3b). Their functional role is uncertain, but in integrin-like β-propellers, the G coordinates a water molecule involved in Ca²⁺ binding (Chouhan et al., 2011), and in tachylectin-2 the W anchors a short α-helix involved in forming the sugar-binding pocket (Fig. 4). A fourth prominent motif is a GW in loop DA' (the loop that connects strand D from one blade to strand A of the next) (Figs 3b and 4a, c), which in tachylectin-2 and PVL is involved in forming the sugar-binding pocket (Supplementary Fig. S2) (Cioci et al., 2006; Kawabata and Tsuda, 2002).
While widely represented in the families of the VCBS supercluster, none of these motifs is universal. Thus, for example, the aspartate-rich motif of loop AB is not found in tachylectin-like and PQQ β-propellers. These are connected to other families in the supercluster by the sequence of loop CD and, in the case of tachylectin-like β-propellers, by the GW motif of loop DA'.

Conclusions
Our results confirm the relationship conjectured between fungal PVL lectins, tachylectin-2 and integrin-like β-propellers (Cioci et al., 2006). We find that all three of these eukaryotic protein families are satellites of larger prokaryotic clusters, from which they are presumably descended. Jointly with these, they are part of a supercluster of β-propeller families, centred on the large group of prokaryotic VCBS β-propellers. This supercluster had not been recognized in previous studies (Chaudhuri et al., 2008; Kopec and Lupas, 2013) because most relevant proteins could not be included, primarily due to the lack of relevant sequences of known structure. We note that, in a study on the prokaryotic ancestry of eukaryotic networks mediating innate immunity and apoptosis (Dunin-Horkawicz et al., 2014), the predicted functional interactomes in bacteria with complex life cycles clearly separated β-propellers of the WD40 supercluster from those that we now recognize to be part of a new, VCBS-like supercluster. Both superclusters show highly repetitive, recently amplified members, highlighting the ongoing genesis of new propellers in response to what we surmise are functional challenges specific to each supercluster.
We believe two factors were essential in our ability to resolve the evolutionary connections between the main β-propeller groups. The first was the inclusion of members of the VCBS superfamily, which revealed their intermediate position between integrin-like and PQQ β-propellers, providing a context for the weak links previously observed between integrin-like and Asp-Box β-propellers. The second was the collection of a substantial number of tachylectin-like sequences. Given the structural approach of previous studies (Chaudhuri et al., 2008; Kopec and Lupas, 2013), these encompassed only the one tachylectin-like sequence found in PDB, which clustered in the WD40 supercluster. In our study, more than 140 tachylectin-like sequences were collected, including sequence intermediates essential for the establishment of evolutionary links. Many of these sequences are of bacterial origin and resulted from metagenomic studies, highlighting the importance of such efforts for a better understanding of protein evolution paths and the structure of the β-propeller sequence space.