OLGenie: Estimating Natural Selection to Predict Functional Overlapping Genes

Abstract Purifying (negative) natural selection is a hallmark of functional biological sequences, and can be detected in protein-coding genes using the ratio of nonsynonymous to synonymous substitutions per site (dN/dS). However, when two genes overlap the same nucleotide sites in different frames, synonymous changes in one gene may be nonsynonymous in the other, perturbing dN/dS. Thus, scalable methods are needed to estimate functional constraint specifically for overlapping genes (OLGs). We propose OLGenie, which implements a modification of the Wei–Zhang method. Assessment with simulations and controls from viral genomes (58 OLGs and 176 non-OLGs) demonstrates low false-positive rates and good discriminatory ability in differentiating true OLGs from non-OLGs. We also apply OLGenie to the unresolved case of HIV-1’s putative antisense protein gene, showing significant purifying selection. OLGenie can be used to study known OLGs and to predict new OLGs in genome annotation. Software and example data are freely available at https://github.com/chasewnelson/OLGenie (last accessed April 10, 2020).

Natural selection in protein-coding genes is commonly inferred by comparing the number of nonsynonymous (amino acid changing; d N ) and synonymous (not amino acid changing; d S ) substitutions per site, with d N /d S <1 indicative of purifying (negative) selection. Thus, d N /d S can be used to predict functional genes (Gojobori et al. 1982;Nekrutenko et al. 2002). However, complications arise if synonymous changes are not neutral, in which case purifying selection may reduce d S (i.e., increase d N /d S ). This is usually negligible, as the effects of most synonymous variants are dwarfed by those of disadvantageous nonsynonymous variants, causing the majority of genes to exhibit d N /d S <1 (Hughes 1999;Holmes 2009). However, this assumption does not hold for overlapping genes (OLGs). A double-stranded nucleic acid may encode up to six open reading frames (ORFs), three in the sense direction and three in the antisense direction, allowing pairs of genes to overlap the same nucleotide positions in a genome ( fig. 1). In such OLGs, changes that are synonymous in one gene may be nonsynonymous in the other, making otherwise "silent" variants subject to selection. As a result, d N / d S methods designed for regular (non-overlapping) genes do not take into account the nonsynonymous effects (in the alternate gene) of some synonymous changes (in the reference gene). As a result, standard (non-OLG) d N /d S methods can fail to detect purifying selection or erroneously predict positive (Darwinian) selection when applied to OLGs (Holmes et al. 2006;Sabath et al. 2008;Sabath and Graur 2010).
OLGs are widespread in viruses (Belshaw et al. 2007;Brandes and Linial 2016;Pavesi et al. 2018), and may not be uncommon in prokaryotes (Meydan et al. 2018;Vanderhaeghen et al. 2018;Weaver et al. 2019) and eukaryotes, including humans (Makałowska et al. 2007;Sanna et al. 2008). The number of OLGs has likely been underestimated, partly because genome annotation software is biased against both short ORFs (Warren et al. 2010) and overlapping ORFs (Vanderhaeghen et al. 2018). Current methods for detecting OLGs, such as Synplot2 (Firth 2014), d N /d S estimators (Sabath et al. 2008;Wei and Zhang 2015), and long-ORF identifiers (Schlub et al. 2018) are subject to one or more of the following limitations: restricted to long OLGs, limited to single or pairs of sequences, unsuitable for low sequence divergence, not specific to protein-coding genes, lacking accessible implementation, or too computationally intensive for genome-scale data (Table 1). For example, those available methods that are suitable for genome-scale analysis are not able to specifically detect protein-coding OLGs. Scalable bioinformatics tools are therefore needed to predict OLG candidates for further analysis, preferably by utilizing the evolutionary information available in multiple sequences and quantifying purifying selection in a way that is comparable with that of non-OLGs. We wrote OLGenie to fill this void. Codons are denoted with solid black boxes, each comprising three ordered nucleotide positions (1, 2, 3) with light gray boundaries. The reference gene frame is shown with a white background, whereas alternate gene frames are shown with a gray background. Frame relationships are indicated using the nomenclature of Wei and Zhang (2015), where "ss" indicates "sense-sense" (same-strand), "sas" indicates "sense-antisense" (opposite-strand), and the numbers indicate which codon position of the alternate gene (second number) overlaps codon position 1 of the reference gene (first number). For all alternate frames except sas13, one reference codon partially overlaps each of two alternate codons. (B) Example of an overlapping gene in the ss13 frame. A minimal overlapping unit of 6 nt is shown, comprising one reference gene codon and its two overlapping codons in the alternate gene. At position 2 of the reference codon (highlighted in yellow), three nucleotide changes are possible: two cause nonsynonymous changes in both genes (NN; nonsynonymous/nonsynonymous) and one causes a nonsynonymous change in the reference gene but a synonymous change in the alternate gene (NS; nonsynonymous/synonymous). No synonymous/nonsynonymous (SN) or synonymous/synonymous (SS) changes are possible at this site. Thus, this site is counted as two-thirds of an NN site and one-third of an NS site. Finally, a pair of sequences having a C/A or C/G difference at this site is counted as having 1 NN difference, whereas a pair of sequences having a C/T difference at this site is counted as having 1 NS difference. (C) Example calculation of d NN , d SN , d NS , and d SS for a pair of sequences with an overlapping gene in ss13. Codons are denoted with brackets above (reference gene) and below (alternate gene) each sequence. The distance d is calculated for each site type (NN, SN, NS, and SS) as the number of differences divided by the number of sites of that type. Because the first and last reference codons only partially overlap alternate codons, they are excluded from analysis and the numbers of sites sum to 15 (¼ 5 codons Â 3 nt; codons 2-6). Numbers of sites are not an exact multiple of 1/3 because nucleotide 6 of sequence 2 (TTT; alternate codon TTG) does not tolerate a change to A, as this would lead to a stop codon in the alternate gene (TAG). Thus, this position is considered an SN site in sequence 1, but one-half of an NN site and one-half of an SN site in sequence 2, for a mean of 0.25 NN and 0.75 SN sites. The table shows the mean numbers of sites for the two sequences (sequence 1 ¼ 4.33 NN, 5 SN, 5.67 NS, and 0 SS; sequence 2 ¼ 5.83 NN, 4.5 SN, 4.67 NS, and 0 SS), used to calculate each d value. For a multiple sequence alignment, the mean number of differences and sites for all pairwise comparisons would be used.

New Approaches
OLGenie is executed at the Unix/Linux command line with two inputs: 1) a multiple sequence alignment (FASTA file) of contiguous codons known or hypothesized to constitute an OLG pair; and 2) the frame relationship of the OLGs. The codon frame beginning at site 1 of the alignment is considered the "reference" gene, which overlaps one "alternate" gene. The choice of which gene to consider the reference versus the alternate is arbitrary; however, in practice, the reference gene ORF is typically longer, whereas the alternate gene ORF usually occurs entirely or partially within the reference gene, and is of unknown or more recently established functionality (Pavesi et al. 2018). The alternate gene can occur in any one of five frames: ss12, ss13, sas11, sas12, or sas13. Here, "ss" indicates "sense-sense" (same-strand), "sas" indicates "sense-antisense" (opposite-strand), and the numbers indicate which codon position of the alternate gene (second number) overlaps codon position 1 of the reference gene (first number) ( fig. 1). We prefer this nomenclature because the meaning of each frame is described in its name; however, at least nine others have been employed, summarized in Table 2 In each case, the effect of mutations in one of the two OLGs is held constant (N or S), ensuring a "fair comparison" in the other gene. For example, if nonsynonymous changes observed in the reference gene are disproportionately synonymous in the alternate gene (d NS > d NN ), the result will be d NN /d NS < 1.0, and purifying selection on the alternate gene can be inferred . In practice, d NN /d NS rather than d SN / d SS is typically used to test for selection in the alternate gene, as SS sites are usually too rare to allow a reliable estimate of d SS .
The original Wei-Zhang method is computationally prohibitive when many nucleotide variants are present in neighboring codons, and the size of the minimal bootstrap unit is data-dependent (Table 1). To circumvent these issues, we introduce three modifications: 1) consider each reference codon to be an independent unit of the alignment amenable to bootstrapping; 2) apply the Nei-Gojobori method to each OLG, as implemented in SNPGenie (Nei and Gojobori 1986;; and 3) consider only single  (1) is not strictly true when two neighboring reference codons share sites with the same alternate codon, introducing biological nonindependence. Nevertheless, no individual site is included in more than one unit of the alignment, and the assumption of independence has proven widely effective (Nei and Kumar 2000), even though nearby codons may never evolve independently.
Modification (3) is identical to the original Wei-Zhang method when a pair of sequences contains only one difference in contiguous codons. However, differences may be misclassified when !2 sites in contiguous codons differ. As a result, OLGenie tends to underestimate the denominator of d N /d S (d NS or d SN ), biasing the ratio upward and yielding a conservative test of purifying selection that nevertheless has increased power over non-OLG d N /d S (supplementary section S1, Supplementary Material online).

Results and Discussion
Assessment with Simulated Data Our simulations also allowed us to identify the most accurate and precise ratios for estimating each frame's d N /d S . For ss12/ss13, sas11, and sas13, the rarest site class is SS (0-2.7% of sites), leading to high stochastic error when estimating d SS . Thus, for alternate genes in these frames, the d NN /d NS ratio is relatively "site-rich" and preferred. Contrarily, for sas12, SS sites are usually more common (18.3%) than NS (7.4%) and SN (7.4%) sites, so that d NN /d NS is preferred only 52.5% of the time (51.2-53.9%; binomial 95% C.I.) (supplementary tables S4 and S5 and figs. S5 and S6, Supplementary Material online). Thus, for alternate genes in sas12, either ratio can potentially be informative, and should be selected on a case-by-case basis, according to the number of sites: d NN /d NS if the minimum of (NN, NS) ) minimum of (SN, SS); d SN /d SS if the inequality is reversed; or both if the minima are approximately equal.

Assessment with Biological Controls
To evaluate OLGenie's performance with real biological data, we next applied the program to 58 known OLG (positive control) and 176 non-OLG (negative control) loci from viral genomes (Pavesi et al. 2018). Strict codon alignments were generated from quality-filtered BlastN hits (Materials and   Case Study: HIV-1's Putative Antisense Protein Gene Lastly, we examined the unresolved case of human immunodeficiency virus-1's (HIV-1) env/asp sas12 overlap (Miller 1988;Torresilla et al. 2015), where the putative antisense protein (asp) gene has evaded detection by sev-eral bioinformatic methods, including non-OLG d N /d S (Cassan et al. 2016;Schlub et al. 2018). We used OLGenie to test for purifying selection in three subregions of env: 1) 5 0 non-OLG; 2) putative asp-encoding; and 3) 3 0 non-OLG. Three data sets were used: 1) M group from Cassan et al. (2016)  The sas12 d N /d S ratio is significantly <1 in all three data sets for the 5 0 non-OLG (d N /d S 0.66; P ¼ 2.04 Â 10 À7 ) and asp (d N /d S 0.58; P ¼ 2.75 Â 10 À5 ) subregions of env. The lowest ratio for each data set always occurs in asp, reaching very high significance in the BLAST data set (d N /d S ¼ 0.29; P ¼ 5.04 Â 10 À25 ). As a benchmark, our ss12/ss13 controls suggest a false-positive rate of 0% for d N /d S 0.4 when employing P 1.04 Â 10 À6 (based on 28 OLGs and 27 non-OLGs). The 3 0 non-OLG region is also significant for the Cassan non-M groups (d N /d S ¼ 0.78, P ¼ 0.00921); how- The HIV-1 env gene was analyzed in sas12 with the site-rich ratio d NN /d NS using 25-codon sliding windows (step size ¼ 1 codon), limiting to codons with !6 defined (nongap) sequences. The hypothesized asp gene is located at codons 655-1,033 (supplementary table S15, Supplementary Material online). The y axis shows significance, calculated as the natural logarithm of the inverse P value, as suggested by Firth (2014), using Z tests of the null hypothesis that d NN ¼ d NS (1,000 bootstrap replicates per window; reference codon unit). The horizontal dashed gray line shows the multiple comparisons P value threshold (0.000924) suggested by Meydan et al. (2019) and described in supplementary section S5, Supplementary Material online, that is, a threshold of 0.05/(CDS length/window size). Results for other frames are shown in supplementary figure S9, Supplementary Material online. Positive selection (red) refers to d N /d S > 1; purifying selection (blue) refers to d N /d S < 1. Sequence features are described in supplementary table S15, Supplementary Material online and shown here as shaded rectangles: yellow for hypothesized sas12 genes, green for the highly structured RNA Rev response element (RRE), and gray otherwise. Nelson et al. . doi:10.1093/molbev/msaa087 MBE ever, the expected false-positive rate is high ($22-28%) and the other two data sets are not significant in this region (d N / d S ! 0.74; P ! 0.107) (supplementary table S14, Supplementary Material online).
To test whether our results are an artifact of other sequence features, including the highly structured RNA Rev response element (RRE; supplementary table S15, Supplementary Material online; Fernandes et al. 2012), we also used OLGenie to perform sliding window analyses. Results show that purifying selection in the sas12 frame of env is most significant in regions of asp not overlapping the RRE (fig. 3C). The strongest evidence is observed in variable region 4, suggesting that accepted nonsynonymous changes in this region are disproportionately synonymous in asp. Significance is also attained in the correct frame for the two known ss12 OLGs, vpu and rev (supplementary figs. S9 and S10, Supplementary Material online). Thus, OLGenie specifically detects protein-coding function in all three data sets. Contrarily, Synplot2 shows the strongest evidence for synonymous constraint in the RRE, likely due to RNA structure rather than protein-coding function, and fails to detect vpu in the BLAST data set (supplementary fig. S11, Supplementary Material online). It should be noted that these OLGenie results concern the sas12 frame, for which the d NN /d NS ratio is not always conservative ( fig. 2), and that our biological controls were limited to the ss12 and ss13 frames. Nevertheless, our results provide evidence that purifying selection acts on the sas12 protein-coding frame of env, particularly in the asp region. This finding is corroborated by recent laboratory evidence demonstrating expression of ASP in multiple infected cell lines, where it localizes to both the host cell membrane and viral envelope upon activation of HIV-1 expression (Affram et al. 2019). This suggests ASP as a potential drug target, for which our sliding window results may be useful for identifying functionally constrained residues, that is, regions with low and highly significant d N /d S ( fig. 3C and supplementary figs. S9 and S10 and supplementary data, Supplementary Material online).

Conclusions
OLGenie provides a simple, accessible, and scalable method for estimating d N /d S in OLGs. It utilizes a well-understood measure of natural selection that is specific to proteincoding genes, making it possible to directly compare functional constraint between OLGs and non-OLGs. Moreover, although its estimates of constraint are conservative, its discriminatory ability exceeds that of other methods (Schlub et al. 2018). Power is greatest at relatively low levels of sequence divergence, and may be increased in the future by incorporating mutational pathways or comparing conservative versus radical nonsynonymous changes. Even so, not all functional genes exhibit detectable selection, so that some OLGs are likely to be missed by any selection-based method. Nevertheless, because candidate OLGs are usually subject to costly downstream laboratory analyses, minimizing the falsepositive rate is paramount. To this end, OLGenie achieves a false-positive rate of 0% for several subsets of our control data, for example, regions with d N /d S < 0.4 and P 1.04 Â 10 À6 . OLGenie can therefore be used to predict OLG candidates with high confidence, allowing researchers to begin studying evolutionary evidence for OLGs at the genomic scale.