Mapping overlapping functional elements embedded within the protein-coding regions of RNA viruses

Identification of the full complement of genes and other functional elements in any virus is crucial to fully understand its molecular biology and guide the development of effective control strategies. RNA viruses have compact multifunctional genomes that frequently contain overlapping genes and non-coding functional elements embedded within protein-coding sequences. Overlapping features often escape detection because it can be difficult to disentangle the multiple roles of the constituent nucleotides via mutational analyses, while high-throughput experimental techniques are often unable to distinguish functional elements from incidental features. However, RNA viruses evolve very rapidly so that, even within a single species, substitutions rapidly accumulate at neutral or near-neutral sites providing great potential for comparative genomics to distinguish the signature of purifying selection. Computationally identified features can then be efficiently targeted for experimental analysis. Here we analyze alignments of protein-coding virus sequences to identify regions where there is a statistically significant reduction in the degree of variability at synonymous sites, a characteristic signature of overlapping functional elements. Having previously tested this technique by experimental verification of discoveries in selected viruses, we now analyze sequence alignments for ∼700 RNA virus species to identify hundreds of such regions, many of which have not been previously described.

Supplementary Data
Table S1 - List of reference sequences and alignment diversity statistics
Synplot2 results for representative RNA viruses (Figures S1 to S16)
Table S2 - Additional ORFs added to virus genome maps for the synplot2 analysis
Dataset S1 - Regions of reduced synonymous site variability in RNA viruses

3. Mean synplot2 p-value that would be obtained for a 25-codon window in which the reduction in synonymous site variability relative to the null model expectation is 30% (i.e. obs/exp = 0.7).
4. Mean synonymous site variability, expressed as obs/exp, necessary to achieve a p-value of 10^-6 in a 25-codon window for the alignment.
Note that these are mean-over-genome statistics and may not correspond precisely to p-value and obs/exp statistics in a specific window. They also depend on sliding window size, so are not directly applicable to the 45-codon and 15-codon windows used in Tables 1 and 2.

Synplot2 results for representative RNA viruses

Notes:
1. See website for plots for other RNA viruses.
2. All plots herein are based on automated alignments (≥75% amino acid identity to a given reference sequence) and use a 25-codon sliding window. Alignments are mapped onto reference sequence coordinates by removing alignment columns that have a gap in the reference sequence.
3. Although plots are labelled with a specific species, some alignments (e.g. for Japanese encephalitis virus) also include sequences from related species that have ≥75% amino acid identity to the chosen reference sequence.
4. In each plot, the brown line (obs/exp) indicates the relative amount of synonymous site variability as represented by the ratio of the observed number of synonymous substitutions to the expected number, in a 25-codon window. The red line shows the corresponding p-value. Note that p-values cannot be compared directly between plots as larger and more diverse alignments provide more statistical power.
5. The dashed grey line represents a p-value of 0.05 / (coding length / window size), an approximate Bonferroni-like correction for multiple testing. That is, for each plot, there is an ∼5% probability that one or more regions evolving neutrally at synonymous sites would by chance register a signal above the dashed grey line.
6. Note that, in regions where the alignment contains gaps in many sequences, it is possible for the brown obs/exp line to register an extreme value which nonetheless has a non-significant p-value because it is based on a much smaller number of sequences compared to other parts of the alignment.
7. It is important to consider both obs/exp and the p-value. For very large and diverse alignments, p-values can be highly significant even for slight decreases in obs/exp (e.g. S14.4 Rabies virus). The p-value represents statistical significance while obs/exp indicates the degree of purifying selection.
8. Where coding ORFs overlap, the reading frame of the longest of multiple overlapping ORFs is used for defining synonymous codons in the overlap region.
9. Small breaks in the red and brown lines indicate non-coding regions and also junctions between overlapping coding ORFs where a partial codon (i.e. 1 or 2 nt) has been omitted from the calculations.
10. Due to the sliding window, and concatenating coding regions for the synplot2 analysis, it is possible to obtain a false conservation signal at one side of a non-coding gap if there is high conservation at the other side.
11. ORFs are offset vertically according to their frame with respect to nucleotide 1 of the reference sequence (frames 0, 1, 2 from bottom to top). Red '*'s in the genome map represent stop codons.
12. These plots use a 25-codon sliding window. This provides a reasonable compromise between detecting larger overlapping features such as overlapping genes and smaller overlapping features such as non-coding RNA elements. However, some features (particularly smaller RNA elements) are more prominent in plots using smaller sliding window sizes.
13. A 75% identity threshold was used to facilitate automated construction of reasonably robust full-genome alignments. In many cases, greater power for detecting overlapping features can be achieved if more divergent alignments (e.g. ≥65% amino acid identity to the reference sequence) are used. For this purpose, single-coding-ORF alignments are often more robust than full-genome alignments.
14. Plots similar but not identical to a small number of these plots have been published previously: 3′ ORF of S1
15. A selection of regions of reduced synonymous site variability that correspond to known or predicted elements are annotated. Abbreviations used: sgRNA - subgenomic RNA; PRF - programmed ribosomal frameshifting; RT - stop codon readthrough; cre - cis-acting RNA or cis-acting replication element; ISL - internal stem-loop; CSE - conserved sequence element; DSCE - distal subgenomic control element; OA - origin of assembly. The term 'sgRNA promoter' is used for elements involved in sgRNA production irrespective of the mechanism (internal initiation by the RdRp on the antigenome, premature termination by the RdRp during negative-strand synthesis, etc.).

Dataset S1. Regions of reduced synonymous site variability in RNA viruses

For generic identification of regions of reduced synonymous site variability, we identified codon positions in alignments where the synplot2 p-value for a 25-codon window centred on that codon position was ≤ 10^-6 and the ratio of the observed number to the expected number (obs/exp) of synonymous substitutions in the 25-codon window was ≤ 0.65. Adjacent codon positions satisfying these conditions were merged into regions, and adjacent regions were merged if the gap between them was ≤ 24 codons. Regions are indicated by their nucleotide coordinates in the given RefSeq, e.g. "NC_014320.1: 3975..4046" represents nucleotides 3975 to 4046 of GenBank accession NC_014320.1.

In general, a larger number of statistically significantly conserved regions are found in RefSeqs for which larger, more diverse alignments could be generated; the additional limit obs/exp ≤ 0.65 ensures that, even for the largest alignments, only regions subject to strong purifying selection are reported. Additional statistically significantly conserved regions are identifiable with different window sizes (e.g. smaller window sizes for compact RNA structures, larger window sizes for extended overlapping genes), and/or with more divergent sequence alignments (the alignments used here were built from full-length sequences with ≥75% amino acid identity to the RefSeq). The threshold p ≤ 10^-6 is very conservative, designed to have an expected probability of ∼5% of obtaining a single false positive over the analysis of all alignments. Note, however, that recombinant sequences can give rise to conserved regions that do not represent overlapping functional elements. Alignments were not systematically screened for recombinants, though three obviously problematic alignments were removed (viz. those for NC_020439.1, NC_016416.1, NC_016081.1; see website).

Caliciviridae -
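The region-calling rule given in the Dataset S1 description, and the per-plot significance line from note 5 of the synplot2 plots, can be illustrated with a minimal Python sketch. This is not the synplot2 code. All function and parameter names are hypothetical, and the inputs are assumed to be per-codon lists of window p-values and obs/exp ratios that have already been computed. Two further assumptions are made: the "gap" between regions is taken to be the number of non-qualifying codons lying between them, and the coordinate helper assumes a single ungapped ORF with a known 1-based start nucleotide.

```python
from typing import List, Sequence, Tuple

# Thresholds quoted in the Dataset S1 description.
P_MAX = 1e-6          # synplot2 p-value cutoff for a 25-codon window
OBS_EXP_MAX = 0.65    # cutoff on observed/expected synonymous substitutions
MAX_GAP_CODONS = 24   # regions separated by at most this many codons are merged


def bonferroni_like_threshold(coding_length: float, window_size: float,
                              alpha: float = 0.05) -> float:
    """Per-plot threshold 0.05 / (coding length / window size).

    Assumption: both lengths are given in the same units (e.g. codons).
    """
    return alpha / (coding_length / window_size)


def call_regions(p_values: Sequence[float], obs_exp: Sequence[float],
                 p_max: float = P_MAX, obs_exp_max: float = OBS_EXP_MAX,
                 max_gap: int = MAX_GAP_CODONS) -> List[Tuple[int, int]]:
    """Return merged regions as (first_codon, last_codon), 0-based inclusive.

    A codon qualifies when its window p-value and obs/exp are both at or
    below the cutoffs. Qualifying codons are merged when the number of
    non-qualifying codons between them is <= max_gap (assumed definition
    of "gap"); adjacent codons have a gap of 0 and therefore always merge.
    """
    if len(p_values) != len(obs_exp):
        raise ValueError("p_values and obs_exp must have the same length")

    regions: List[List[int]] = []
    for i, (p, ratio) in enumerate(zip(p_values, obs_exp)):
        if p > p_max or ratio > obs_exp_max:
            continue  # codon does not satisfy both conditions
        if regions and i - regions[-1][1] - 1 <= max_gap:
            regions[-1][1] = i        # extend the current region
        else:
            regions.append([i, i])    # start a new region
    return [(start, end) for start, end in regions]


def to_nt_coords(region: Tuple[int, int], orf_start: int) -> Tuple[int, int]:
    """Map a 0-based codon region to 1-based nucleotide coordinates.

    Assumption: a single ungapped ORF whose first codon begins at
    nucleotide orf_start of the reference sequence.
    """
    first_codon, last_codon = region
    return orf_start + 3 * first_codon, orf_start + 3 * last_codon + 2
```

A single left-to-right pass is enough here because, under the assumed gap definition, merging adjacent qualifying codons is just the special case of a zero-codon gap. Mapping regions back to reference coordinates across several concatenated ORFs would need additional bookkeeping that this sketch omits.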