Natural selection has preserved and enhanced the phase-separation properties of FUS during 160 million years of mammalian evolution

Protein phase separation is essential for the self-assembly of non-membraneous organelles. However, we know little about its ability to change in evolution. Here we studied the evolution of the mammalian RNA binding protein FUS, a protein whose prion-like domain (PLD) is essential for the formation of stress granules through liquid-liquid phase separation. Although the prion-like domain evolves three times as rapidly as the remainder of FUS, it harbors absolutely conserved tyrosine residues that are crucial for phase separation. Ancestral reconstruction shows that the phosphorylation sites within the PLD are subject to stabilizing selection. They toggle among a small number of amino acid states. One exception to this pattern are primates, where the number of such phosphosites has increased through positive selection. In addition, we find frequent glutamine to proline changes that help maintain the unstructured state of FUS that is necessary for phase separation. In summary, natural selection has stabilized the liquid-forming potential of FUS and minimized the propensity of cytotoxic liquid-to-solid phase transitions during 160 million years of mammalian evolution.


Body: 27
The separation of a mixed protein solution into distinct phases is essential for the 28 formation of membrane-less organelles in living cells 1,2 . These biomolecular condensates 29 help mitigate the effects of various stressors, such as energy depletion 3 , increased 30 temperature 3,4 , and drop in cellular pH 4 , which trigger their formation 5,6 . 31 The RNA binding protein Fused in Sarcoma (FUS) 7-10 is one of the mammalian 32 proteins whose phase separation is well-studied. FUS is predominantly a nuclear protein 33 and regulates the mRNA life cycle at different stages. In addition, FUS directly interacts 34 with Poly-ADP-ribose polymerase and mediates DNA damage response in the cell 11 . The 35 N-terminal prion-like domain of FUS is essential for the self-assembly of FUS into liquid-36 like and gel-like states 12,13 . FUS requires this self-assembly for its nuclear functions, such 37 as binding to chromatin and recruitment to the sites of DNA damage 14,15 . 38 The liquid-like state of FUS is exquisitely sensitive to single point mutations. In fact, 39 missense mutations in FUS occur in patients with the neurodegenerative diseases 40 amyotrophic lateral sclerosis (ALS) and frontotemporal lobar degeneration 8 . For instance, 41 a single ALS-associated mutation, G156E, facilitates a liquid-to-solid phase transition of 42 FUS into irreversible aggregates 16 . The importance of FUS in the life of cells, together 43 with the sensitivity of FUS assemblies to point mutations, raises the possibility that natural 44 selection must actively maintain the ability of FUS to form the liquid-droplet state. We thus 45 hypothesized that evolution has preserved the phase separation propensity of FUS, and 46 To understand whether these amino acid switches are caused by neutral evolution 68 or positive selection, we estimated how strongly evolutionary rates vary across the amino 69 acid sites within the PLD, and along the branches of its phylogenetic tree (see Methods 70 for details) 18 . We detected positive selection in 10 branches (p.val < 0.05) and 12 sites 71 (probability > 0.90) (Figure 2A-B; Table S3) and in three types of substitutions: G to S, S 72 to G and Q to P. We observed the highest likelihood of positive selection for serine at the 73 sites 42, 119, 129, and 131 and threonine at the sites 40 and 71 which occurred in the 74 branches leading to primates and greater apes (Supplementary information). 75 The positively selected residues in primates (i.e., sites 42, 71, 78, 129, and 131) 76 are among the sites in human FUS that are phosphorylated at serine and threonine 19,20 . 77 Their phosphorylation not only increases the recruitment of FUS to the sites of DNA 78 damage 21 , but also inhibits its liquid-to-solid phase separation 19 . From a total of 32 79 phosphosites, only nine sites (i.e., sites 3, 7, 11, 26, 57, 77, 87, 96, and 148) were fully 80 conserved, but the rest (24 sites) switched forth and back between only two pairs of amino 81 acids (G-S and A-T) ( Figure 2C). These sites occurred in both evolutionary hotspots, and 82 their evolutionary rates were significantly higher than for the rest of the PLD residues 83 ( Figure 2D; Wilcoxon rank-sum test, p=7.9Î10 -4 ). The PLD sequences of primates and 84 great apes harbor an exceptionally large number of phosphosites. That is, they harbor 29 85 and 31 phosphosites, respectively, which is 3 and 6 sites more than the average number 86 of mammalian PLD phosphosites ( Figure 2E). Therefore, positively selected G to S and 87 A to T substitutions have significantly increased the total number of phosphosites in the 88 PLD sequences of primates. (Wilcoxon rank-sum test, p=4.0Î10 -12 ). 89 Aside from primates, the average number of phosphosites in the mammalian PLD 90 sequences is ~ 26 ± 2 sites. This small variation might indicate that the total number of 91 S/T amino acids in these sites are stabilized in the evolution of the FUS in mammals. To 92 find out whether stabilizing selection has acted on our FUS sequences, we compared the 93 likelihood that genetic drift alone or drift together with selection acted on the total number 94 of phosphosites using an Ornstein-Uhlenbeck process 22 . This process has been used to 95 compare the likelihood of drift alone with that of drift and selection in the evolution of 96 different traits and characters 23 ( Figure 2F; see Methods for details). We found that 97 stabilizing selection better explains the evolution of phosphosites than pure drift 98 (likelihood ratio test, p=0.015, Figure 2G; Table S4). Interestingly, this signature of 99 stabilizing selection increases dramatically when primates are removed from this analysis 100 ( Figure 2H; likelihood ratio test, p=2.28Î10 -7 ; Table S4). Together with our phylogenetic 101 analysis of positive selection (Figures 2A-B), these observations suggest two regimes in 102 the evolution of FUS phosphosites. In mammals except for primates, the number of 103 phosphosites is under stabilizing selection. In primates and, in particular, great apes, 104 positive selection has further increased the number of phosphorylation sites. 105 The disordered domains of proteins, in particular proteins that undergo phase 106 separation, preserve key amino acid features such as charge and sequence composition 107 throughout their evolution 24,25 . We thus examined the physicochemical properties that are 108 either conserved or positively selected in the evolution of the PLD in mammalian FUS 109 (see Methods for details). We found that amino acid substitutions in the PLD have 110 significantly conserved polarity, flexibility, and solvation free energy ( Figure 3A, Table S5; 111 Chi-square goodness-of-fit, p < 10 -7 ). We also found several properties whose changes 112 were more frequent than expected from strict neutrality, and that had diversified in the 113 evolution of the PLD ( Figure 3A). The most significantly diversified property is the average 114 occurrence of amino acids in a tetrapeptide unit in protein structures 26 . This property 115 quantifies the nucleation propensity of amino acids in segments of four residues, and 116 divides the amino acids into two groups. Tetrapeptides with amino acids in the first group 117 (Pro, Gly, His, Tyr, Cys, Asn and, Trp) are more likely to adopt extended structures. 118 Tetrapeptides with amino acids in the second group (all other amino acids), are more 119 likely to form helical and bend conformations. 120 For this property, amino acid substitutions in the region S117 to S147 had the 121 maximum strength of positive selection ( Figure 3B, Table S6). This region is enriched in 122 proline substitutions (i.e., in residues 105, 117, 126, 132, 134, 140, 141, 146, and 147; 123 Figure 3C). Importantly, our analysis of positive selection had shown that Q to P 124 substitutions are positively selected in different branches of the phylogenetic tree (e.g., 125 Q141P in Gorilla (Gorilla gorilla) and S134P in the branch leading to primates; 126

Supplementary information). 127
To study the effect of proline substitutions in this evolutionary hotspot, we made a 128 glutamine-rich and a proline-rich variant from the region S117 to S147 by selecting Q and 129 P in all residues that had experienced Q to P substitutions in different mammalian 130 sequences (residues 126, 132, 140, 141, 146, and 147), respectively (Supplementary 131 information). We then predicted the secondary structure content of the two variants using 132 the PEP-FOLD algorithm 27 and validated the stability of these predictions using molecular 133 dynamics simulations (see Methods for details). The glutamine-rich variant forms three 134 short helices that extend from residue Q118 to S121, from S129 to S135, and from Q139 135 to G144 ( Figure 3D). In contrast, the proline-rich variant is mostly unstructured and retains 136 only partially the middle of the three helices (Q126 to Y129; Figure 3E). Importantly, the 137 Q-rich variant showed higher helical content ( Figure 3F) and, on average, five more side-138 chain hydrogen bonds compared to the P-rich variant ( Figure 3G). We thus conclude that 139 Q to P substitutions help maintain an unstructured state that is necessary for self-140 assembly and phase separation of FUS. 141 Finally, we examined the changes in the propensity of fibril formation in the 142 evolution of the PLD in mammalian FUS. The first hotspot in the PLD, from S30 to S86, 143 corresponds to a region that forms a fibrillar beta-sheet structure at high concentrations 144 ( Figure 4A) 12 . The formation of these cross-beta sheet structures has been proposed 12 , 145 and disputed 28,29 , to drive phase separation of FUS. Strikingly, we observed hydrogen-146 bond breaking substitutions in these regions that abolish side-chain hydrogen bonding 147 and likely destabilize the fibril core ( Figure 4B). For example, T78 and S84, which form 148 inter-residue hydrogen bonds in the structure of the fibril core, are repeatedly substituted 149 to alanine and glycine in different mammals. Other examples include alanine or proline 150 substitutions in the residues S48, Q69, and T71, which hydrogen-bond and join the 151 segment S44 to Y50 with T64 to G80. We further predicted the stability of fibril cores in 152 different mammals and found a substantial variation in free energy of folding ( Figure 4C, 153 Table S7) which is more likely caused by pure drift than stabilizing selection (likelihood 154 ratio test, p=0.019; Table S8). In line with this observation, evolutionary rates of the PLD 155 sequences were not significantly higher along the branches leading to fibrils with higher 156 stabilities (Spearman's rank correlation, R=0.047, p=0.71). Altogether, our analyses 157 reveal that stability of the fibril core is unlikely maintained and selected in the evolution of 158 the PLD in mammalian FUS. 159 In summary, our work shows that properties affecting phase separation are highly 160 evolvable in FUS, an essential protein that is also involved in neurological diseases. The 161 lack of protein structure in the prion-like domain of FUS leads to substantial sequence 162 variation, a feature that is common in intrinsically disordered proteins [30][31][32][33]  shows the region S117 to S147, where selection to diversify this property is maximal. C) 234 Multiple sequence alignment of the segment S117 to S147 for selected mammalian PLD 235 sequences. Color saturation represents the frequency of the amino acid in each column, 236 and ranges from dark blue (> 80%) to white (<40%). D) PEP-FOLD predicted structure 237 of the glutamine-rich, and E) the proline-rich variant of the region S117 to S147. The sites 238 126, 132, 140, 141, 146, and 147 were computationally substituted to glutamine or proline 239 to create the Q-rich or the P-rich variants, respectively. Both structures were generated

Supplementary Tables:
• Table S1: The substitution patterns of the ancestrlal reconstruction in the PLD of FUS • Table S2: The substitution patterns of the ancestrlal reconstruction in the PLD of FUS • Table S3: The likelihoods for the alternative and the null model of branch-site test for positive selection • Table S4: Ornstein-Uhlenbeck processes for phosphorylated sites (without primates) • Table S5: Highly conserved and highly diversified physicochemical properties in the evolution of the PLD in FUS. • Table S6: Z-score values of the property RACS820112 per sliding window of 5 codons. • Table S7: FoldX free energy terms for extant and ancestral PLD sequences of mammalian FUS. • Table S8: Ornstein-Uhlenbeck process for changes of the folding free energy in the evolution of fibril core in FUS.
• Structures of Q-rich and P-rich variants made by PEP-fold algorithm.
• Structure of fibril cores in mammals made by FoldX in silico mutagenesis.

Data compilation
We retrieved 105 coding sequences of mammalian FUS genes from the NCBI 1 and ENSEMBL 2 databases. We subdivided these sequences into three subsets. We used the first of these subsets, which comprised 105 mammalian sequences, to build a multiple sequence alignment for the calculation of sequence entropy. We used the second subset

Estimating evolution rate and detecting positive evolution
We prepared protein sequence alignments with the codon-based CLUSTAL algorithm 4 implemented in MEGA 5

Ancestral Sequence Reconstruction
To reconstruct ancestral sequences, we fitted different substitution models to our data (PLD sequences and the mammalian phylogenetic tree), allowing that evolutionary rates may vary among protein sites. The substitution model JTT 11 with the gamma distribution of evolutionary rates had the highest Bayesian Information Criterion (BIC) score 35 . We thus used this model and inferred ancestral PLD sequences using the Maximum Likelihood method implemented in MEGA7 5 .

Detection of amino acid properties under selection
We used the TreeSAAP method 12 to infer how natural selection may change amino acid properties in the evolution of the PLD. Briefly, this method compares the distribution of changes in amino acid properties along the branches of a phylogenetic tree with an expected distribution, using the codon composition of a set of extant sequences. Changes in amino acid properties are divided into eight categories, from the most conserved (category 1) to the most radical changes (category 8). The method then calculates the goodness of fit (! " -distribution) between the expected and the observed frequencies and tests the hypothesis that these distributions are equal for each amino acid property. For a specific property, the deviation between observed and expected frequencies in each category is calculated using a Z-score. We refer to this Z-score as the deviation from neutrality or the selection strength throughout this paper. A highly significant z-score (z > 3.09, p<0.01) shows that more non-synonymous substitutions change the property of interest compared to neutral evolution.

Detection of Co-evolution
To infer the co-evolutionary history of protein sites within the PLD, we used Bayesian Graphical Models 13 implemented in the HyPhy package 14 . We first used our ancestral reconstruction of FUS to construct a binary matrix representing the presence and absence of substitutions on each branch (rows) of the phylogenetic tree and in each site of the protein (columns). The joint distribution of all substitutions was then inferred using Bayesian networks and Markov Chain Monte Carlo (MCMC) sampling with default parameters in the SpiderMonkey method 13 . We avoided the use of mutual information to infer co-evolution because it leads to a high rate of false positives in the detection of coevolving sites when sequences are substantially similar 19 (~ > 62%), as in our case.

Prediction of folding free energy
We predicted the stability of the structure of the fibril core of the PLD in different mammals.
We generated the 3D structures of the fibril core using its structure in human (PDB ID: 5W3N 15 ) as a template. We then calculated the free energy of folding of the fibril core made from the mammalian PLD sequences using the FoldX algorithm, which uses an empirical force field for the prediction of the free energy change of protein structures upon mutations 16,17 . We first minimized the free energy of this structure using the Repair command in FoldX 17 . We then created in silico mutants of this structure to create different mammalian PLD orthologs using the BuildModel command of FoldX 17 .

Ornstein-Uhlenbeck processes
To estimate the significance of stabilizing selection versus pure drift, we used Ornstein-Uhlenbeck processes that are corrected for phylogenetic dependence of species 18 . These models have been used to test various evolutionary hypotheses in the evolution of different characters and traits 19 , gene expression level 20 , and protein structure 21 . In brief, these models assume that the character of interest, X(t), We used the total number of phosphosites and the folding free energy as the traits of interest in evolution. We then fitted the models for pure drift and stabilizing selection using the BROWN and HANSEN commands in the OUCH package 19 , respectively. The input to these commands was the mammalian phylogenetic tree in Newick format, together with a data vector of the trait of interest, either the total number of phosphosites or the stability of the fibril core. We used the initial values of ( = 1 and , = 1 to initialize the optimization process using the Nelder and Mead simplex algorithm 22 . For modeling stabilizing selection, we assumed that all nodes belong to a single selective regime. We used likelihood ratio tests to compare the likelihoods of drift and selection.

Prediction of peptide structure and molecular dynamics simulations
We generated the initial structures of the glutamine-rich and the proline-rich variants using the PEP-FOLD online web server 23 . For molecular dynamics simulations, we used the GROMACS package 24 (v2019a) and employed periodic boundary conditions at 300 K and 1 atm, with a time step of 2 fs. We chose the Gromos 54a7 25 force field because of its ability to reproduce the kinetics of helix formation 26 . We kept the temperature and pressure constant with the Nose-Hoover thermostat 27,28 (time-constant = 0.1 ps) and the Parinello-Rahman barostat 29 (time constant = 1.5 ps), respectively. For both van der Waals and short-ranged Coulombic interactions, we used a cut-off radius of 1.0 nm and used the particle-particle mesh Ewald method for the long-ranged Coulombic interactions 24 . We minimized the energy of both structures by the steepest descent method, followed by a position-restraint simulation to equilibrate the water molecules. We then performed a grand canonical ensemble simulation (constant number of particles, temperature, and pressure) at 300K for 20 ns and calculated the percentage helicity of different residues and the number of side-chain hydrogen bonds. We performed all statistical analyses in R, using scripts available on GitHub (https://github.com/dasmeh/FUSEVOL).

Co-evolution of residues in the PLD
We also pursued a complementary analysis to study the co-evolution of fibril core residues. If the fibril core has remained intact during mammalian evolution, then we would expect that residues in the fibril core co-evolve with each other, and possibly also with the rest of the PLD33. To find out, we used Bayesian Graphical Models 13 to estimate the probability of co-evolution between pairs of residues in the fibril core (see Methods).