CRISPRmap: an automated classification of repeat conservation in prokaryotic adaptive immune systems

Central to Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-Cas systems are repeated RNA sequences that serve as Cas-protein–binding templates. Classification is based on the architectural composition of associated Cas proteins, considering repeat evolution is essential to complete the picture. We compiled the largest data set of CRISPRs to date, performed comprehensive, independent clustering analyses and identified a novel set of 40 conserved sequence families and 33 potential structure motifs for Cas-endoribonucleases with some distinct conservation patterns. Evolutionary relationships are presented as a hierarchical map of sequence and structure similarities for both a quick and detailed insight into the diversity of CRISPR-Cas systems. In a comparison with Cas-subtypes, I-C, I-E, I-F and type II were strongly coupled and the remaining type I and type III subtypes were loosely coupled to repeat and Cas1 evolution, respectively. Subtypes with a strong link to CRISPR evolution were almost exclusive to bacteria; nevertheless, we identified rare examples of potential horizontal transfer of I-C and I-E systems into archaeal organisms. Our easy-to-use web server provides an automated assignment of newly sequenced CRISPRs to our classification system and enables more informed choices on future hypotheses in CRISPR-Cas research: http://rna.informatik.uni-freiburg.de/CRISPRmap.

orientation, we highlight this sequence (or one if many copies exist) in our CRISPRmap cluster tree, and automatically assign it to the corresponding structure motif and/or sequence family and stop here.
2. What is the correct orientation? If the user is not sure about the correct repeat orientation, i.e. the checkbox for repeat orientation has been activated, we first predict the orientation with our model described in the methods section of the main manuscript. The orientation should then be consistent with our data.
3. Is it structured or unstructured? The RNA structure prediction algorithm, RNAfold [3] is used to determine whether the repeat sequence is structured or unstructured. If the minimum free energy structure is the unstructured sequence, i.e. contains no base-pairs, it remains unassigned to a structure motif and we continue with Step 5.

4.
Does it belong to a structure motif ? Albeit a structure being predicted, the repeat does not necessarily belong to a conserved structure motif. We add the repeat sequence to all repeats assigned to one of our structure motifs and re-run RNAclust [4] with a modified UPGMA algorithm (see following section "Constrained Clustering"). In short, the modification allows the generation of the cluster tree by keeping the motifs intact, i.e. non-overlapping. If a repeat falls into or next to one of the existing structure motifs, we assign it to the motif by the following: (1) The repeat is folded by RNAfold [3] with the option -p to calculate a structure dotplot. (2) This dotplot is aligned with the consensus dotplot of the structure motif using LocARNA. (3) The repeat is assigned to be a member of the motif if it is able to fold into the consensus structure of that respective motif with at most one base-pair missing. We ensure that the new consensus structure contains at least four base-pairs and is at the same position as previously. A comparison of the new and old consensus structures and alignments is given on the web server results page.

5.
Does it belong to one of our conserved sequence families? We assign the repeat to a conserved sequence family by comparing it to the previously calculated ClustalW sequence profiles [5], see Methods section "Clustering of repeat sequences into conserved sequence families". Let sim(F, r) be the profile score of a repeat r compared with the profile of the family F , where r ∈ F . For each family, the minimum F min and maximum F max profile similarity was determined by removing each sequence from the family, re-calculating the profile for the remaining sequences, and determining the similarity score of the respective repeat to the profile. A repeat r was then assigned to a sequence family F if (1) sim(F, r) is greater or equal to F min and (2) the distance between sim(F, r) and F max is the minimum for all families. 6. Where is it located in the CRISPRmap cluster tree? With a final run of RNAclust on all repeat sequences, we get the updated CRISPRmap cluster tree and we highlight the input sequence location in this tree. Any additional annotations (outer rings), such as Cas subtype, are not displayed for novel repeat sequences.

S1.3 Constrained Clustering
We consider the general problem to cluster a set of taxa hierarchically based on their distances. Additionally, we constrain the clustering such that certain, e.g. a priori known, clusters are prevented from mixing with each other.
Given is a set of taxa, indexed from 1 to n, together with all pairwise distances between the taxa; furthermore, a set X of disjoint clusters of these taxa, i.e. X is contained in the powerset of {1, . . . , n} and all non-identical clusters c and d in X do not intersect. Commonly, X covers only a subset of all taxa; therefore, we distinguish constrained taxa (that are contained in some element of X ) and the remaining unconstrained taxa.
We aim to construct a cluster tree of the taxa, i.e. a rooted binary tree T with n leaves corresponding to the n taxa. First, this tree should reflect the given distances. Second it has to support the clustering given by X such that clusters in X are grouped together but unconstrained taxa can be interspersed freely. For this purpose, we require that no subtree of T contains leaves from two different clusters in X unless both clusters are completely contained in the subtree. We call this condition X -cluster constraint.
(Formally: for each subtree with leaves L and each pair of non-identical clusters c and d in Our novel constrained clustering algorithm is based on the unweighted pair group method UPGMA. The original algorithm UPGMA starts from n singleton clusters corresponding to the n taxa. Until all clusters are combined, it iteratively merges the two nearest clusters. For the latter, the cluster distances are initially derived from the input distances and distances to new clusters are computed after each merge of clusters. The sequence of merges determines the cluster tree. The novel algorithm modifies UPGMA, such that, in each iteration, it merges the nearest pair of clusters that can be merged without violating the X -cluster constraint. To check this condition efficiently, we keep track for each cluster whether it contains some elements of a cluster in X and whether it includes such a cluster completely. Merging two clusters does violate the constraint if and only if each cluster overlaps some cluster in X but does not cover it completely.

S1.4 Horizontal gene transfer between bacteria and archaea
Although archaeal CRISPRs are generally well-separated from bacterial ones in general, we observed a few instances where an archaeal CRISPR is located within a bacterial-dominated region and vice versa. To investigate whether these mixed regions could arise from potential horizontal transfer, we applied BLAST to search for homologous Cas1 (or Cas2) protein sequences (Cas1 and Cas2 are the most ubiquitous Cas proteins and exist in both bacteria and archaea). We identified 24 archaeal and 8 bacterial repeats that were assigned to sequence families or structure motifs dominated by the opposite domain. For 75% (18 out of 24) of the archaeal repeats, we identified Cas1 or Cas2 homologs in bacteria in the top five BLAST hits (E-value ≤ 2 × 10 −10 ); the same was true for only one of the four bacterial repeats.

S2.1 Number of Cas subtype annotations
We annotated each CRISPR in our dataset according to the closest Cas subtypes as described in the methods of the manuscript. The two major Cas subtype annotation systems were considered [1,6]; the number of CRISPRs we annotated with each subtype is given in Table S1.

S2.2 Summary tables of sequence families and structure motifs
Supplementary Tables S2-S19 summarise the sequence families and structure motifs, sorted according to the superclass they belong to.  Table S1: The number of identified Cas subtype annotations for our REPEATS dataset. There were double as many annotations using the more recent classification from Makarova et al., however, we did not require that all cas genes from the respective subtype to be present; whereas the annotations performed for Haft et al. were more strict, since we used full subtype models (see methods). In general, Dvulg, Ecoli, Hmari, Mtube, Nmeni, and Ypest correspond to I-C, I-E, I-B, III-A, both type II, and I-F, respectively. Structured repeats with very stable and conserved hairpin motifs, mainly found in bacteria, are written in bold. Note that the 9 subtype II-B CRISPRs in archaea are likely to be incorrect as we did not identify an RNase III in these organisms. Automated annotation of subtype II-B was especially difficult as it contains no subtype-specific Cas protein.
only given if this is more or less clear. If there is a complete mix of subtypes, no information is given. The Cas subtypes are summarised according to the cas genes that are found in the majority of chromosomes which contain the CRISPRs of each family or motif. More details of the majority cas genes is given on the web server. Archaeal families and motifs are highlighted in blue. If the CRISPRmap webserver is updated in future, then these tables supply a record for sequence families and structure motifs that are referred to in this work. The secondary structures of the motifs and sequence logos of the families are also provides in the tables.    10 20                                 In particular, these are systems for which the Cas endoribonuclease is characterised and/or the repeat structure has been verified. Published results are consistent with our data. The IDs, a-o, are marked, in order, as red lines on the CRISPRmap tree in the manuscript in Figure 1.  Figure S1: Pairwise similarities for repeats. We plotted the distribution of pairwise percent identities (x-axis) of Needleman-Wunsch [22] alignments for all repeats to determine a cutoff for the Markov clustering. Here we see that 65% is a reasonable cutoff in comparison to the background distribution. Repeats with a similarity below 65% are set to zero. Because of the short repeat length and conserved sequence motifs, it is necessary to choose such a high cutoff.  Figure S2: Verifying repeat families with sequence profiles and re-assigning individual repeats. All repeats were clustered into families using Markov clustering [23,24]. We verified these families using an independent method of sequence profiles, see Methods section "Clustering of repeat sequences into conserved sequence families". After the generation of one profile per family, we caculated the profile scores for each repeat in the REPEATS dataset. We plotted the profile scores (y-axis) for each repeat assigned to one of the families (x-axis) as red-coloured dots in Supplementary Figure S2. Subsequently, we used this range of profile scores to re-assign repeats to one of the existing families as stated in the main text of the manuscript. Profile scores for re-assigned dots are in blue (73 repeats). These profile scores are also used to assign new input repeat sequences from the webserver to one of our existing families. Distance of signature subtypes is in blue and the distance of signature types is in red; the cutoff is indicated with the green line. The plot shows the distribution of the closest signature genes to the CRISPR array. A signature gene is one that is unique to either the subtype or the type, respectively.  Figure S4: CRISPR of repeat conservation including all annotations. CRISPR repeats cluster into 33 structure motifs and 40 sequence families. Here we show the cluster tree with all annotation rings-the "alltogether" option in the webserver-colour coding starts from inside to outside, see the legend. The branches of the tree are labelled according to the origin of the repeat: blue-green for archaea and dark brown for bacteria. Ring 1 (inner-most) 33 structure motifs, ring 2 40 sequence families, ring 3 Haft 2005 subtype annotation, ring 4 Makarova 2011 subtype annotation, ring 5 18 cas1 clusters, ring 6 taxonomic phyla annotation and ring 7 (outer-most) the six superclasses for general orientation.  Figure S5: Comparison of our clustering with previous domain-wide repeat clusters or families on our CRISPRmap tree. The branches of the tree are labelled according to the origin of the repeat: blue-green for archaea and dark brown for bacteria. Ring 1 (inner-most) shows our structure motifs, ring 2 shows our sequence families. After the white ring, we show ten of the twelve clusters from Kunin et al. [2,25] in Ring3; clusters 11 and 12 contain fewer than ten repeats and to be consistent with our cluster minimum size, we have removed them here. Ring 4 contains those sequences of the Rfam [26] database that are also contained in REPEATS (since we have all sequenced genomes to-date) and only families (16 out of 65) with at least ten sequences. We do not mark the family names here, but just want to show the relative locations of sequences in the CRISPRmap tree. Ring 5 (outer-most) shows the six superclasses for general orientation. In summary, we clearly see that our data is significantly more comprehensive than previous work.  Figure S7: CRISPRmap tree-a use-case study. This is the CRISPRmap cluster tree after reclustering 150 repeats from a human metagenomic studies [27] together with our REPEATS data. The new 150 repeats are maked with red lines. Interestingly, many repeats have been assigned to superclass E and cluster together to potentially form new classes of motifs or families. Inner to outer rings Figure S8: Analysis of array, repeat and average spacer sizes. First, we see the very small arrays containing less than 5 repeat instances (red-brown) are mostly located in the more divergent parts of the CRISPRmap tree; most are within the bacterial part of superclass F. Many of these arrays may not be functional CRISPR-Cas systems, but other repetitive elements instead. Second, superclass F contains both some unusually short and unusually long repeats, which also may not represent functional CRISPRs. In addition repeats in superclass F and half of D are longer than those in superclasses A to the first half of D. Third, repeats in superclasses A and F are longer than ones in B-D; this means the Cas subytpes I-C, I-E, and I-F associate with shorter spacers than the others. Spacers in Crenarchaeota are unusually long with most longer than 38 nt. Interestingly, shorter repeats seem to pair with longer spacers. Cutoffs were choses according to the distribution of each array characteristic.  Figure S9: Sequence families separated on a two-dimensional plane. The 40 sequence famlies are mapped onto a two-dimensional plane by BioLayout [28] according to their percent identity scores. We have marked only those families that are clearly visible. The families are divided into two main groups with some that are more separated from the rest. Superclasses B and C contain stable structure motifs of the subtypes I-E and I-F. The difference is that the structures in superclass B are closer to the 3' end of the repeat and that the potential cleavage site is in the double-stranded region of the stem instead of the 3' side of its base. c. Superclass D contains members of the I-C subtype with relatively long hairpin motifs. Note that the potential cleavage site leads to an 11 nt instead of an 8 nt tag in the mature crRNA and we also see the well-conserved 3' end of the repeat (AT T GAAAC); this 3' sequence is found in many CRISPRs, also in archaea. d. Examples of structure motifs found in archaeal repeats in superclasses A and F. These are smaller and less stable than the bacterial motifs. 29

ID Organism
-GGUUCAUCCCCACGUGUGUGGGGAACUC Methanosalsum zhilinae DSM4017 ......... (((((....)))))...... The secondary structure from the respective motif is written above in dot-bracket format: brackets and dots corresponds to base pairs and unpaired nucleotides, respectively. The highlighted brackets and squares show that the secondary RNA structure has been conserved by compensatory base pair mutations. These compensatory base pair mutations give excellent evidence for the conservation and importance of the respective structure motifs.