Pairs of Mutually Compensatory Frameshifting Mutations Contribute to Protein Evolution

Abstract
Insertions and deletions of lengths not divisible by 3 in protein-coding sequences cause frameshifts that usually induce premature stop codons and may carry a high fitness cost. However, this cost can be partially offset by a second compensatory indel restoring the reading frame. The role of such pairs of compensatory frameshifting mutations (pCFMs) in evolution has not been studied systematically. Here, we use whole-genome alignments of protein-coding genes of 100 vertebrate species, and of 122 insect species, studying the prevalence of pCFMs in their divergence. We detect a total of 624 candidate pCFM genes; six of them pass stringent quality filtering, including three human genes: RAB36, ARHGAP6, and NCR3LG1. In some instances, amino acid substitutions closely predating or following pCFMs restored the biochemical similarity of the frameshifted segment to the ancestral amino acid sequence, possibly reducing or negating the fitness cost of the pCFM. Typically, however, the biochemical similarity of the frameshifted sequence to the ancestral one was not higher than the similarity of a random sequence of a protein-coding gene to its frameshifted version, indicating that pCFMs can uncover radically novel regions of protein space. In total, pCFMs represent an appreciable and previously overlooked source of novel variation in amino acid sequences.

unclassified and is not analyzed. It is, however, recorded as a "strange example" in the script log.
2.4. The target hole is removed from the holes list, and a new iteration through paragraphs 2.1-2.4 is initiated.
3. After all insertions and deletions are classified, the species are filtered. Species whose gene (1) has a nucleotide length not divisible by 3, (2) lacks a start and/or an end, or (3) contains an internal stop codon are removed. This step is not performed at the start because species with such unreliable and possibly erroneously sequenced genes can still help to determine insertions and deletions in species with trustworthy genes.
4. For every species remaining in the alignment (hereafter, target species), paragraphs 4.1-4.6 are executed to decide whether the insertions and deletions of that species are classified as a pCFM.
4.1. A list of the insertions and deletions (indels) carried by this species is formed. Indels of length divisible by 3 are dropped from this list (i.e. further steps treat them as if they never existed). This step is performed because such indels are of no interest in the search for pCFMs, and the number of pCFM candidates matters in paragraph 4.3.
4.2. Indels that are long (>20 nucleotides) and shared by the target species and all descendants of its parent node are also dropped from the list. This prevents long unaligned regions in basal species from being treated as indicators of insertions in other species. Suppose, for example, that a couple of basal species have a hole of length 100, which is most probably a defect of exon alignment. The algorithm would define the sequences in all other species as insertions, which would not be appropriate; this step removes such abnormal "insertions". This step is not crucial for a general understanding of the algorithm, but anyone who wants to use our code on GitHub should be aware of this feature.
4.3. If more than 2 indels are left in the list, the species is dropped (i.e. paragraphs 4.4-4.5 are not executed). The reasoning behind this step is that such species are likely to be poorly sequenced or poorly aligned. Handling of species with multiple pCFMs was implemented but not used (the parameter brutal_conditions in the script results in bypassing this step).
4.4. The indels are classified as a pCFM if one of two conditions is met: (1) the indels have the same name (insertion-insertion or deletion-deletion) and the sum of their lengths is divisible by 3, or (2) the indels have different names (insertion-deletion or deletion-insertion) and the difference of their lengths is divisible by 3.
4.5. If a pCFM is detected, previously dropped species (paragraph 3) that carry the same indels are added back into consideration. The reasoning here is that if they carry the same indels as trustworthy species, they are probably also trustworthy. This step is performed to gain more support in our analysis: we consider the cases where multiple species carry the pCFM to be more reliable.
4.6. If a pCFM is detected, it is checked whether the two frameshifting indels of the pair happened simultaneously. For this, two last common ancestors are compared: the last common ancestor of the species carrying the first frameshifting mutation of the pair and the last common ancestor of the species carrying the second. If these common ancestors are the same, the frameshifting mutations are considered to have happened simultaneously; otherwise, they are not.
5. After the iteration through all the species, for each pCFM the following information is added to the output: the species in which the pCFM was found, the names of the mutations it comprises (insertion/deletion), their lengths and positions, and whether they happened simultaneously. The output is given in the form of two TSV tables: one for simultaneous mutations and another for non-simultaneous ones.
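The core checks in paragraphs 3, 4.1, 4.3, 4.4 and 4.6 can be illustrated with a minimal Python sketch. This is not the authors' script: the names Indel, is_reliable_cds, classify_pcfm, last_common_ancestor and happened_simultaneously are hypothetical, the tree is assumed to be given as a child-to-parent dictionary, a "start" is assumed to mean an ATG codon and an "end" one of the three standard stop codons, and the long-indel filter of paragraph 4.2 is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


@dataclass(frozen=True)
class Indel:
    kind: str    # "insertion" or "deletion"
    start: int   # position in the alignment (hypothetical convention)
    length: int  # length in nucleotides


def is_reliable_cds(aligned_seq: str) -> bool:
    """Paragraph 3: length divisible by 3, start and end present, no inner stop."""
    seq = aligned_seq.replace("-", "").upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def classify_pcfm(indels: List[Indel]) -> Optional[Tuple[Indel, Indel]]:
    """Paragraphs 4.1, 4.3 and 4.4: return the compensating pair, or None."""
    # 4.1: indels of length divisible by 3 are ignored.
    candidates = [indel for indel in indels if indel.length % 3 != 0]
    # 4.3: more than two candidates means the species is dropped;
    # fewer than two cannot form a pair.
    if len(candidates) != 2:
        return None
    first, second = sorted(candidates, key=lambda indel: indel.start)
    if first.kind == second.kind:
        # 4.4 (1): same name, the sum of lengths must be divisible by 3.
        restores_frame = (first.length + second.length) % 3 == 0
    else:
        # 4.4 (2): different names, the difference must be divisible by 3.
        restores_frame = (first.length - second.length) % 3 == 0
    return (first, second) if restores_frame else None


def last_common_ancestor(species: List[str], parent: Dict[str, str]) -> str:
    """Deepest node shared by the root paths of all given species."""
    def path_to_root(node: str) -> List[str]:
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    paths = [path_to_root(name) for name in species]
    shared = set(paths[0]).intersection(*(set(path) for path in paths[1:]))
    return next(node for node in paths[0] if node in shared)


def happened_simultaneously(carriers_first: List[str],
                            carriers_second: List[str],
                            parent: Dict[str, str]) -> bool:
    """Paragraph 4.6: compare the last common ancestors of the two carrier sets."""
    return (last_common_ancestor(carriers_first, parent)
            == last_common_ancestor(carriers_second, parent))
```

For example, under these assumptions a 1-nucleotide deletion followed by a 4-nucleotide insertion is classified as a pCFM, because the difference of the lengths is divisible by 3, whereas two 1-nucleotide insertions are not, because their lengths sum to 2.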
Supplementary Figure 1. A schematic representation of the first part of the algorithm of search for indels in the alignment: classification of indels. A fragment of an alignment is shown, with letters a-h denoting different species. Red frames correspond to distinct holes. Green check marks flag species sharing the considered hole, and red crosses, those not sharing it. Blue circles flag the rest of the species (for which the presence of the hole is not unequivocal). The numbers correspond to the paragraphs in the algorithm description.
Supplementary Figure 2. A schematic representation of the second part of the algorithm of search for indels in the alignment: filtering and identification of pCFM. Species are denoted as sp1-sp7. The numbers correspond to the paragraphs in the algorithm description.
Supplementary Table 2. Support from NCBI and UniProt databases obtained for each of the 11 pCFM-carrying genes. The columns "evidence" and "source" indicate the type of evidence (protein or mRNA or predicted gene) and the database this evidence was obtained from. In the last column, the ID for the corresponding database (RefSeq ID or UniProt ID) is presented. For each gene and for each of the variants (with and without the pCFM), the evidence of type "mRNA" or "protein" is listed for all species for which such evidence is available. The evidence of the type "predicted gene" is listed only if no "mRNA" or "protein"-level evidence is available, and just for one of the species.