Synonymous variants that disrupt messenger RNA structure are significantly constrained in the human population

Abstract Background The role of synonymous single-nucleotide variants in human health and disease is poorly understood, yet evidence suggests that this class of “silent” genetic variation plays multiple regulatory roles in both transcription and translation. One mechanism by which synonymous codons direct and modulate the translational process is through alteration of the elaborate structure formed by single-stranded mRNA molecules. While tools to computationally predict the effect of non-synonymous variants on protein structure are plentiful, analogous tools to systematically assess how synonymous variants might disrupt mRNA structure are lacking. Results We developed novel software using a parallel processing framework for large-scale generation of secondary RNA structures and folding statistics for the transcriptome of any species. Focusing our analysis on the human transcriptome, we calculated 5 billion RNA-folding statistics for 469 million single-nucleotide variants in 45,800 transcripts. By considering the impact of all possible synonymous variants globally, we discover that synonymous variants predicted to disrupt mRNA structure have significantly lower rates of incidence in the human population. Conclusions These findings support the hypothesis that synonymous variants may play a role in genetic disorders due to their effects on mRNA structure. To evaluate the potential pathogenic impact of synonymous variants, we provide RNA stability, edge distance, and diversity metrics for every nucleotide in the human transcriptome and introduce a “Structural Predictivity Index” (SPI) to quantify structural constraint operating on any synonymous variant. Because no single RNA-folding metric can capture the diversity of mechanisms by which a variant could alter secondary mRNA structure, we generated a SUmmarized RNA Folding (SURF) metric to provide a single measurement to predict the impact of secondary structure altering variants in human genetic studies.

• We thank the reviewer for bringing these tools to our attention. We had originally omitted these tools as the focus of our manuscript is on synonymous variants that impact mRNA structure, and tools like SiLVA focus on splicing. However, we agree that when discussing synonymous variants, it is appropriate to include background on all tools for sSNVs. As such we have added a paragraph to that surveys these tools and also describe a novel one (SynTool) (BACKGROUND, P5) 2. The authors also seem to be unaware of a previous demonstration of the relations between minor allele frequency of single nucleotide variants and base-pairing probabilities at the mutation site and between MAF and RNA solvent accessibility at the mutation site (Yang et al. RNA 23:14-22, 2016).
• We thank the reviewer -we were indeed not aware of this paper. We address Yang's tool RNAsnap in the context of our own work, describing its focus on protein-bound tertiary structures as well as its finding of a correlation between RNA structure and population frequencies. (BACKGROUND, P4) 3. Figure 3: it would be interesting to examine the relation between MAF and dMFE/CFEED/dCD (and other structural properties) as the literature above.
• We agree that a study of the relation of our structural metrics to log(MAF) (as was performed in Yang et al., 2016) would be most interesting. We have added SUPPLEMENTARY FIGURE 7 showing the correlation of log(MAF) against the metrics dMFE, CFEED and dCD (BACKGROUND, P4).
• NOTE: to build these plots, we had to decide on a value of log(0). (This was not a problem for Yang et al., since 1000 Genomes contains only SNVs with positive MAFs). We adopted two separate approaches: setting log(0)=-6 (a reasonable value, since the gnomAD cohort has around 100,000 individuals, meaning the lowest possible log(MAF) is -5), or throwing out all SNVs with MAF=0. If we take the first approach, we essentially recover FIGURE 3. As might be expected, if we take the second approach (resulting in the loss of about 80% of our data) the pattern is lost. This seems to justify our decision to use P(MAF>0) as our primary y-statistic.
4. Figure 4: can you explain why peaked for dMFE at 2?
• This peak reflects the dominant contribution from CpG transitions, which have the highest gnomAD frequencies and an average dMFE of around 2. We added a note mentioning this important detail. (ANALYSIS, P7).
5. There are much more than a few examples of pathogenic SNVs shown in Table 2. In fact, there are a few thousands in HGMD (Stenson,et al. Human Genetics,136(6), 665-677 (2017)). A few datasets are available online to download. Authors should examine more examples for those neutral (from 1000 genomes) and pathogenic SNVs.
• The reviewer is quite correct that there are thousands of known pathogenic SNVs in HGMD. However, our goal was to present sSNVs that caused disease through their disruption of mRNA structure, and the HGMD does not consider RNA structure as an etiology, only including synonymous variants if they occur in splicing or regulatory regions. (Stenson et al., 2017, PMID 28349240, Table 1). The lack of mRNA structurally altering synonymous variants in databases such as ClinVar and HGMD was actually one of the motivating factors for our study, and it is one that the synonymous-variant community must grapple with. To lend credibility to this claim, we note that a 2019 Nature Communications article on synonymous mutations in cancer asserts that "no changes in RNA secondary structures of cancer genes have been proven so far." (Nat Commun 10, 2569 (2019) https://doi.org/10.1038/s41467-019-10489-2).
• For the purposes of evaluating our metrics against known sSNVs that disrupt mRNA structure the best resource we could find was the dbDSM (the Database of Deleterious Synonymous Mutations) which only had 17 sSNVs tagged specifically as disease-causal via their effect on mRNA structure. After checking the original publications of these 17 sSNVs, we found only 6 where we felt that true causality for RNA structure had been demonstrated (several other publications speculated a relationship but did not perform the necessary functional work). Through extensive literature searching we found an additional three sSNVs variants that clearly demonstrate a pathogenic disease etiology through alteration in mRNA structure and have added them to TABLE 2. For the 9 known sSNVs clinically implicated for mRNA structural pathogenicity, all of them are successfully predicted to be pathogenic by our structural metrics. (TABLE 2, P40). • With regard to the addition of neutral SNVs in Table 2 -we understand the reviewer's wish to have neutral sSNVs represented in the table. However, given that the vast majority of our sSNVs are predicted to be structurally neutral by our metrics, this would require adding a large sample of neutral sSNVs to give a good representation. We did however modify our core dataset to include the more recently released gnomAD 3 data, which included the addition of WGS data for over 76,000 healthy subjects and updated all figures accordingly (see DATA DESCRIPTION, P7). We also have uploaded the complete dataset to GigaDB (see AVAILABILITY OF SUPPORTING DATA AND MATERIALS, P25).
6. "However, the potential for pathogenic synonymous variants that impact RNA folding in human genetic disease remains largely unknown". This statement is incorrect as such an effect was demonstrated before. I would like to see the discussion changed.
• The reviewer is correct -it is now scientific dogma that such variants exist. We modified the discussion, changing "remains largely unknown" to "is not universally appreciated." (DISCUSSION, P15) 7. "In the absence of functional tools that would aid in the simultaneous assessment of both nsSNVs and sSNVs". This statement is incorrect as these tools do exist. I would like to see the discussion changed.
• We agree with the reviewer's recommendation. We changed "In the absence of functional tools" to "Given the scarcity of RNA-structure-specific tools". (DISCUSSION, P15) 1. This discussion of the relationship between mRNA secondary structure/thermodynamic stability and mRNA intracellular stability/propensity to degradation is very vague. The authors failed to cite some key publications in the field (e.g. Ding et al., In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature, 2014; Mauger et al, mRNA structure regulates protein expression through changes in functional half-life, PNAS, 2019, see also ref. therein).
• We thank the reviewer for raising this valid criticism and have greatly expanded the BACKGROUND (P3) to further explain that "RNA stability" refers to the RNA's physical structure, which derives primarily from base-pair stacks, rather than to its functional lifetime or its ability to avoid degradation. We emphasize however that stable RNAs do, in fact, have longer lifetimes, express more protein and suffer later degradation, and hypothesize this is the preeminent reason why stability is selected for in the first place. In the DISCUSSION (Molecular mechanisms underlying constraint of sSNVs, P17-18) we elaborate that stable RNA structures are thought to regulate ribosomal speed, possibly with the goal of preventing collisions, which are known to trigger degradation pathways. We discuss Ding et al. 2. The interplay between local and global mRNA secondary structure and mRNA intracellular stability as well as overall protein expression and folding/activity has not been fully understood yet. While it has been suggested that nucleotides that stabilize mRNA secondary structure within the coding regions do enable high protein expression, the effect of such variants in the 5'UTRs is usually quite the opposite, as the increase in 5'UTR stability usually decreases efficiency of translation initiation. The authors should carefully discuss these different scenarios in the context of their observations.
• We thank the reviewer for this relevant point, which we agree should be included. We have added a couple sentences explaining that, in contrast to the CDS, stability in 5' UTRs is associated with decreased expression due to the interference of structure with translation initiation. (BACKGROUND -P3-4) • Interestingly, in most organisms this effect extends into the coding region -the first 30 or so bases have reduced stability (Gu et al., doi:10.1371/journal.pcbi.1000664) However, warm-blooded animals are partially an exception to this rule -it holds only in the most G+C-rich genes, which may be attributable to the CpG islands that generally co-locate with mammal and bird gene promoters. We also describe this extension of this effect into the coding region, and why mammals might be an exception. DISCUSSION (Molecular mechanisms underlying constraint of sSNVs, P17)