Genome-scale detection of positive selection in nine primates predicts human-virus evolutionary conflicts

Abstract Hotspots of rapid genome evolution hold clues about human adaptation. We present a comparative analysis of nine whole-genome sequenced primates to identify high-confidence targets of positive selection. We find strong statistical evidence for positive selection in 331 protein-coding genes (3%), pinpointing 934 adaptively evolving codons (0.014%). Our new procedure is stringent and reveals substantial artefacts (20% of initial predictions) that have inflated previous estimates. The final 331 positively selected genes (PSG) are strongly enriched for innate and adaptive immunity, secreted and cell membrane proteins (e.g. pattern recognition, complement, cytokines, immune receptors, MHC, Siglecs). We also find evidence for positive selection in reproduction and chromosome segregation (e.g. centromere-associated CENPO, CENPT), apolipoproteins, smell/taste receptors and mitochondrial proteins. Focusing on the virus–host interaction, we retrieve most evolutionary conflicts known to influence antiviral activity (e.g. TRIM5, MAVS, SAMHD1, tetherin) and predict 70 novel cases through integration with virus–human interaction data. Protein structure analysis further identifies positive selection in the interaction interfaces between viruses and their cellular receptors (CD4-HIV; CD46-measles, adenoviruses; CD55-picornaviruses). Finally, primate PSG consistently show high sequence variation in human exomes, suggesting ongoing evolution. Our curated dataset of positive selection is a rich source for studying the genetics underlying human (antiviral) phenotypes. Procedures and data are available at https://github.com/robinvanderlee/positive-selection.


INTRODUCTION
Conservation of structure and sequence often indicate biological function. Rapidly evolving sequence features may however also indicate function, as they may reveal molecular adaptations to new selection pressures. But what drives these rapid genetic changes during evolution? Can these changes explain the specific phenotypes of species or individuals, such as a differential susceptibility to viruses?
Immunity genes contain the strongest signatures of rapid evolution due to positive Darwinian selection (1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13). Pathogens continuously invent new ways to evade, counteract and suppress the immune response of their hosts, thereby acting as major drivers of the observed adaptive evolution of immune systems (14,15). In line with this, the structural interfaces of human proteins directly involved in virus interactions show accelerated evolution (16). Numerous proteins involved in the virus-host interaction have been demonstrated to be in genetic conflict with their interacting viral proteins, a phenomenon that has been likened to a virus-host 'arms race' (15). Such studies have generally focused on a single gene or gene family of interest sequenced across a large number of species. Evolutionary analyses can then predict which genes and codons may be involved in virus interactions. For example, mitochondrial antiviral signaling protein (MAVS) is a central signaling hub in the RIG-I-like receptor (RLR) pathway, which recognizes infections of a wide range of viruses from the presence of their RNA in the cytosol. Analysis of the MAVS gene in 21 primates identified several positions that have evolved under recurrent strong positive selection and turned out to be critical for resisting cleavage by Hepatitis C virus (17). Other examples of immunity genes showing evolutionary divergence that directly impacts the ability to restrict viral replication include TRIM5␣ (18), PKR (19) and MxA (20).
Positive selection can be detected through comparative analysis of protein-coding DNA sequences from multiple species (9,21). Markov models of codon evolution combined with maximum likelihood (ML) methods (22) can analyze alignments of orthologous sequences to identify genes, codons and lineages that show an excess of nonsynonymous substitutions (mutations in the DNA that cause changes to the protein) compared to synonymous ('silent') substitutions (d N /d S ratio or , see Text S1 for a detailed explanation). Successful application requires many steps (15,23) (Text S1), including: (i) the identification of orthologous sequences, sampled from species across an appropriate evolutionary distance (distant enough to show variation, but not too divergent to prevent saturation); (ii) accurate alignment and phylogenetic tree reconstruction; (iii) parameterization of the ML model. While studies of positive selection on individual genes have achieved reliable results, estimates of positive selection in whole genomes have been substantially affected by unreliabilities in sequencing, gene models, annotation and misalignment (24)(25)(26)(27)(28)(29)(30). In addition to the analysis of genomes from different species, the increasing availability of individual human genomes and populations provides new opportunities to systematically analyze more recent sequence variation (9,31).
In this study, we performed comparative evolutionary analyses of recent whole-genome sequenced primates as well as of human genetic variation to identify highconfidence targets of positive selection. Our findings provide insights into the biological systems that have undergone molecular adaptation in primate evolution, which goes beyond immunity genes. Interrogation of the positively selected genes with structural and genomic data describing virus-host interactions provides insights into potential determinants of viral infection and predicts new virus-human evolutionary conflicts.

Genes, orthologs and sequences
We obtained protein-coding DNA sequences of all nine simian primates for which high-coverage whole-genome sequences are currently available from Ensembl release 78, December 2014 (32) (Supplementary Table S1, Figure S1, Supplementary Files at https://github.com/robinvanderlee/ positive-selection). We processed orthology definitions from the Ensembl Compara pipeline (33) to obtain 11 170 one-to-one ortholog clusters containing for all nine primates a single coding sequence corresponding to the canonical transcript, which usually encodes the longest translated protein (Supplementary Table S2). Clusters consist of genes for which only one copy is found in each species, and these genes are one-to-one orthologs to the human gene. Sequences are of high quality as indicated by the general lack of undetermined nucleotides ('N'): 98 420 of 100 530 sequences (98%) and 9312 of 11 170 (83%) clusters contain no Ns. Genomes with most Ns are gibbon (in 680 sequences), gorilla (313) and marmoset (291). Ortholog clusters never contain sequences with internal (premature) stop codons and stop codons at the end of sequences were removed.
Of the original 21 983 human genes, 7015 were discarded because for one or multiple of the other primates no or-tholog was annotated (likely a combination of suboptimal genome annotation and lineage-specific gene loss), while another 3798 were discarded because they were part of oneto-many or many-to-many relationships. Human genes part of the one-to-one ortholog clusters are generally a good representation of all protein-coding human genes, as only functions related to olfactory signaling and sensory perception of smell are strongly underrepresented among them. A number of other categories are moderately underrepresented (antigen presentation and MHC; translation and ribosome) or overrepresented (development and differentiation, including neuron, synapse and nervous system development; kinases and phosphatases; cell motility and migration). Note that our analyses of PSG functions (below) correct for these patterns by using the one-to-one orthologs as a background.

Initial alignments
We first obtained multiple alignments of the clusters of orthologous primate protein sequences using MUSCLE (34) and Mafft (35), and from the Compara pipeline (i.e. filtering the larger vertebrate alignment for primate sequences) (33). Inspections revealed that misalignment of nonhomologous codons affects virtually all alignments, as was observed in previous studies (25)(26)(27)(28)(29). This is probably the result of the tendency of alignment algorithms to overalign, i.e. produce alignments that are shorter than the true solution due to collapsed insertions in an attempt to avoid gap penalties (26,36). The PRANK algorithm to some extent prevents alignment of nonhomologous regions by flagging gaps made during different stages of progressive alignment and permitting their reuse without further penalties (36). As PRANK has been shown to provide better initial alignments for large-scale positive selection detection (25)(26)(27)(28)(29)(30), we obtained multiple alignments of the primate ortholog clusters using the PRANK codon mode (prank +F -codon; v.140603). We used the default settings of (i) obtaining a guide tree from MAFFT for the progressive alignment procedure and (ii) selecting the best alignment from five iterations. These settings likely result in the best alignment for a given cluster of sequences (including those showing a gene tree topology that differs from the species tree topology). The PRANK approach markedly improved the initial alignments.

Alignment filtering and masking
Even with improved initial alignments, positive selection studies remain affected by a high rate of false positive predictions. Part of those may be alleviated by additional automated masking of unreliable alignment regions. GUID-ANCE assesses the sensitivity of the alignment to perturbations of the guide tree (37) and has been recommended for positive selection studies (27,28,30). We applied GUIDANCE with the default 100 bootstrap tree iterations (guidance.pl --program GUIDANCE --seqType nuc --msaProgram PRANK --MSA Param "\+F \-codon"; v1.5). TCS assesses alignment stability by independently re-aligning all possible pairs of sequences and scoring positions through comparison with the multiple alignment (38). We ran TCS on translated PRANK codon alignments (t coffee -other pg seq reformat -action +translate; t coffee -evaluate -method proba pair -output score ascii, score html; Version 11.00.61eb9e4).
Low confidence scores of either method led us to remove entire alignments from our analysis or mask individual columns and codons. Alignments were removed in the case of a low score (default cutoffs of <60% for GUIDANCE, <50% for TCS) for (i) the overall alignment or (ii) one or more sequences (i.e. we only retained alignments with sequences for all nine species). Entire columns were masked if GUIDANCE <93% or TCS <4; individual codons were masked if <90% or <4. Masked nucleotides were converted to 'n' characters to distinguish them from undetermined nucleotides in the genome assemblies ('N'). For visualization and quality inspection purposes we translated the masked codon alignments to the corresponding protein alignment. Nucleotides 'n' and 'N' were converted to 'o' and 'X' upon translation, respectively. Detailed visual inspection revealed the value of our masking approach: masked codons tend to comprise unreliable alignment regions, primarily consisting of large inserts, insertion-deletion boundaries (i.e. regions bordering well-aligned blocks), and aligned but nonhomologous codons (Supplementary Files).

Evolutionary analyses: reference phylogenetic tree
Maximum likelihood (ML) d N /d S analysis to infer positive selection of genes and codons was performed with codeml of the PAML software package v4.8a (22) (Text S1). We used a single phylogenetic tree with branch lengths for the ML analysis of all alignments to limit the influence of genespecific phylogenetic variability. To obtain this reference tree, we concatenated all 11,096 masked alignments into one large alignment and ran the codeml M0 model (i.e. fitting a single d N /d S for all sites; NSsites = 0, model = 0, method = 1, fix blength = 0), provided with the wellsupported topology of the primate phylogeny (32,39). We took this approach for two main reasons: (i) to best reflect the overall evolutionary distance between the primate species (which influences codon transition probabilities in the ML calculations, Text S1) and (ii) to estimate branch lengths in units compatible with codon-based evolutionary analyses, i.e. the number of nucleotide substitutions per codon. For comparisons with other primate phylogenetic trees, the branch lengths of our codon-based tree were converted to nucleotide substitutions per site (i.e. nucleotide substitutions per codon divided by three). The codeml M0 model under the F61 or F3×4 codon frequency parameters resulted in virtually identical phylogenetic trees (median branch length difference of a factor 0.99) and d N /d S estimates (0.213 vs. 0.217; Supplementary Figure S1, Supplementary Files). The M0 tree is also highly similar to a ML phylogenetic tree inferred from the same concatenated alignment using nucleotide rather than codon substitution evolutionary models (median branch length difference of a factor 0.98; Supplementary Figure S1; RAxML v7.2.8a (40); -f a -m GTRCAT -N 100).

Evolutionary analyses: inference of positive selection
In the first of two steps for inferring positive selection using codeml, the 11 096 filtered and masked alignments were subjected to ML analysis under evolutionary models that limit d N /d S to range from 0 to 1 ('neutral' model) and under models that allow d N /d S > 1 ('selection' model; Text S1) (21). Genes were inferred to have evolved under positive selection if the likelihood ratio test (LRT) indicates that the selection model provides a significantly better fit to the data than does the neutral model (P LRT < 0.05, after Benjamini-Hochberg correction for testing 11 096 genes). We included apparent Positively Selected Genes (aPSG) if they met the LRT significance criteria under all four tested ML parameter combinations. These combinations consist of two sets of evolutionary models: M1a (neutral) versus M2a (selection); M7 (beta) versus M8 (beta&). And two codon frequency models: F61 (empirical estimates for the frequency of each codon); F3×4 (calculated from the average nucleotide frequencies at the three codon positions). I.e. we used combinations of the following codeml parameters: NSsites = 1 2 or NSsites = 7 8; CodonFreq = 2 or CodonFreq = 3; cleandata = 0, method = 0, fix blength = 2. A total of 2992 (27%) genes showed significant evidence of apparent positive selection at the level of the whole alignment (Supplementary Figure S2A).
Second, for the significant aPSG we retrieved from the site-specific codeml ML analyses (step one, above) the Bayesian posterior probabilities, which indicate the individual codons that may have evolved under positive selection (Text S1) (41). We included apparent Positively Selected Residues (aPSR) if their codons were assigned high posteriors under all four ML parameter combinations (P posterior ( > 1) > 0.99). 416 aPSG contain at least one significant aPSR (1405 in total; Supplementary Figure S2B).

Quality control
We subjected each inferred aPSR and aPSG to visual inspection (Supplementary Table S3). In this way, we identified several indicators for positive selection artefacts that we then used for their automated detection in the complete set. First, we obtained the gene trees for our individual masked alignments using RAxML (40)(-f a -m GTRGAMMAI -N 100). Type-I [orthology] and -II [transcript definitions] artefacts tend to lead to gene trees with (i) a long-branched clade consisting of the set of sequences that are distinct from the others (e.g. paralogs, alternative exons) and (ii) a topology that is not congruent with the well-supported species phylogeny (Supplementary Figure S3). We filtered out likely false positives by selecting gene trees with an extreme longest/average branch length ratio. Second, to assess the distribution of PSR across exons, we mapped Ensembl exon coordinates for human transcripts to the human protein sequences. Type-II [transcript definitions] and -III [termini] artefacts could often be filtered out by a high concentration of aPSR located to a single exon (Supplementary Files).

GC-biased gene conversion (gBGC)
The effects of gBGC seem specifically correlated to regions of high meiotic recombination in males rather than females (42). We calculated genomic overlaps of PSG and non-PSG with male (8.2% of PSG, 7.7% of non-PSG) and female (6.7% of PSG, 8.1% of non-PSG) recombination hotspots in human, which we obtained from the family-based deCODE maps (43) via the UCSC genome browser (44). Sex-averaged recombination hotspots estimated from linkage disequilibrium patterns were obtained from HapMap Release 22 (45) (43% of PSG, 39% of non-PSG). Human genomic regions under the influence of gBGC were predicted by phastBias (46)(9.1% of PSG, 11.4% of non-PSG).

PSG function analyses
Our final curated set of 331 PSG (Supplementary Tables S4 and S8) was analyzed for gene ontology terms, pathways databases and other functions (Supplementary Tables S6  and S7) (47,48), compared to a background of 11 011 genes (the 11 096 genes tested for positive selection excluding the 85 artefacts). Enrichment statistics were calculated for overlaps between the PSG and various gene lists (Supplementary Table S8), by assessing whether a gene list of interest was significantly over-represented among the 331 PSG compared to the 10 680 non-PSG (two-tailed Fisher's exact test on a 2 × 2 contingency table of the overlap). Secreted (keyword KW-0964) and membrane (KW-0472) proteins were obtained from UniProt (49). Innate immunity, PRR and virus-human PPI gene lists are described in (10). We mined the GenomeRNAi database (50) for genes whose perturbation significantly affects viral infection or replication. Data on gene expression in mouse immune cell subsets were obtained from the Immunological Genome Project (using the recommended expression thresholds--usually 120)(51) and mapped to human orthologs. Expanding upon a humanvirus structural interaction network published previously (16), we searched for protein structures involving interactions between viral proteins and PSG. Optimized atomicresolution structures were obtained from PDB REDO (52), visualized with YASARA (www.yasara.org), and analyzed using WHAT IF (53).

Human variation
To study genetic variation in the human population we obtained all variants and allele frequencies reported by the Exome Aggregation Consortium (ExAC, release 0.3.1) (31) from the variant call file (VCF). ExAC combined over 91 000 unaffected and unrelated exomes from a range of primarily disease-focused consortia, into 60 706 high-quality exomes. These exomes originate from diverse populations, including European (Non-Finnish and Finnish), African, South Asian, East Asian and Latino. Our analyses focus on global human variation rather than stratified populations, but it should be noted that African populations have greater levels of genetic diversity and that Middle Eastern and Central Asian samples are underrepresented in ExAC (31). ExAC variants were mapped from genomic position to codons using an in-house pipeline that matches the longest translated GENCODE (54) sequence for each mRNA-validated protein-coding gene to a Swiss-Prot (49) protein sequence. This ExAC dataset was then merged with the total set of genes tested for positive selection in our primate analysis. The final dataset (Supplementary Table S11) covers 7506 genes of which 229 are primate PSG (containing 636 PSR).
We considered only single nucleotide variants (SNVs) and included all SNVs that pass the filters imposed by ExAC. We did not impose any allele frequency filters and used the aggregated allele counts/frequencies across all populations. (i) At the level of unique variants, we first calculated the number of unique missense and synonymous variants reported per codon by aggregating the potentially multiple, distinct variants reported at the same codon. We then summed these per-codon counts to obtain the number of unique variants per gene, which we normalized for gene length by dividing by the number of codons ('Unique variants per codon', Figure 5A). A d N /d S measure of human variation was obtained by correcting the observed unique missense and synonymous variant counts in a gene (observed, obs) for the total possible missense and synonymous variants in that gene based on the codon table (background, bg): d N d S = missense obs /missense bg synonymous obs /synonymous bg . Note that this ExAC-based human d N /d S uses a background of possible variants based exclusively on the codon table, ignoring any biases that influence codon transition probabilities, such as transition/transversion rates, codon frequencies and divergences times, which are all taken into account in our analyses of primate positive selection using codeml. These differences in approach may partly explain the overall higher d N /d S observed in our analyses of human compared to primates. (ii) At the level of allele frequencies (AF), we measured total human variation in a gene by summing the reported AFs (counts for the alternative allele / allele number [i.e. total number of called genotypes]), separately for missense and synonymous variants. These aggregated AFs were normalized for gene length to obtain the average (per-codon) allele frequency of all missense and all synonymous variation within the gene ('Total allele frequency per codon', Figure 5B).
ExAC-based Residual Variation Intolerance Scores (RVIS) were downloaded from http://genic-intolerance. org/data/GenicIntolerance v3 12Mar16.txt. RVIS is a regression-based measurement of the tolerance for a gene to accumulate 'functional variation' (missense, nonsense and splice variants), given its level of synonymous variation (55). We analyzed the 'default' RVIS genome-wide percentiles, which were calculated based on variants with a MAF of at least 0.05% in at least one of the six individual ethnic strata from ExAC (i.e. we used %ExAC 0.05%popn), but all MAF filters gave similar results.

Scripts and tools
In addition to the methods cited above, we made extensive use of the Ensembl API version 78 (32), GNU Parallel (56) and Jalview (57). Our developed procedures and analyses consist of custom Perl and R scripts, which we have made available together with instructions for use at https://github. com/robinvanderlee/positive-selection.

RESULTS
To obtain a confident dataset of positively selected genes relevant to human biology and infectious disease, we investigated protein-coding DNA sequences from nine simian ('higher') primates for which whole-genome sequences are available (Supplementary Table S1, five genomes released in 2012 or later (58)). This set consists of five hominoids ('apes'; human, chimpanzee, gorilla, orangutan and gibbon), three Old World Monkeys (macaque, baboon, vervet) and one New World monkey (marmoset), spanning an estimated 36-50 million years of evolutionary divergence (39).

A reliable procedure for conservative inference of positive selection
Given the high incidence of false positives in large-scale detection of positive selection reported in literature (24)(25)(26)(27)(28)(29)(30), we developed a stringent six-stage procedure that we subjected to rigorous manual curation and quality control at all steps. This section and Figure 1 summarize the procedure. More details are available in the Methods section. Code is available at https://github.com/robinvanderlee/ positive-selection.
To minimize artefacts, we prioritized high precision over sensitivity (i.e. potentially missing interesting sites). (i) To limit the influence of evolutionary processes other than divergence of orthologous codons (e.g. gene duplications), we only assessed clusters of one-to-one orthologous genes (Supplementary Table S2). (ii) To reduce alignment of nonhomologous codons, a common issue leading to inflated estimates of positive selection (25)(26)(27)(28)(29)(30), we computed multiple sequence alignments using PRANK (36). (iii) To achieve maximum alignment quality, we masked low-confidence codons and columns, and filtered out alignments based on the GUIDANCE and TCS algorithms, two distinct concepts for assessing reliability (37,38). These steps resulted in 11 096 alignments, representing about half of the human protein-coding genes ( Figure 1A, Table 1). Detailed inspection revealed that the alignments are of good overall quality, with the major improvements gained by using PRANK over other alignment methods. The masking procedures filter out most of the remaining ambiguities.
Next, we used d N /d S -based codon substitution maximum likelihood (ML) models (22) to infer positive selection acting on genes and codons ( Figure 1B, Materials and Methods, Text S1). This requires an estimation of the overall evolutionary divergence between the primate species. (iv) To best reflect this distance, we used the 11 096 one-to-one ortholog alignments to construct a single codon-based reference tree for use in all ML analyses (Figure 2, Supplementary Figure S1; see also next section). (v) To ensure that we studied only the strongest signatures of positive selection, we analyzed four combinations of evolutionary model parameters and required genes to test significant across all of them (P < 0.05, after Benjamini-Hochberg correction for testing 11 096 alignments). (vi) Finally, we obtained the set of positively selected codons (and their corresponding amino acid residues) using stringent Bayesian calculations (P posterior > 0.99). These steps resulted in 416 apparent Positive Selected Genes (aPSG) inferred to contain at least one apparent Positively Selected Residue (aPSR; 1405 in total, Supplementary Figure S2, Table S3).

Evolutionary models estimate substitution rates in primate protein-coding genes
The overall d N /d S rate across our one-to-one proteincoding orthologs is 0.21 (d N = 0.0477, d S = 0.2235; Table 1), consistent with average strong purifying selection. Our primate phylogenetic tree, calculated for the ML analyses and based on all included alignments, has identical topology to the well-established primate phylogeny (Figure 2, Supplementary Figure S1) (39). The estimated substitution rates per nucleotide follow the expected pattern, depending on the fraction of noncoding sequences used for tree reconstruction: the nucleotide-converted branch lengths of our coding sequence tree are a factor 0.87 shorter (median of all branches) than those of a phylogeny based on genomic regions of 54 primate genes (consisting half of noncoding segments) (39), and a factor 0.46 shorter than a tree based on whole-genome alignment (32) (Supplementary Figure S1). Thus, our phylogenetic tree informs on the rate of proteincoding sequence divergence observed in primates.

Rigorous quality control reveals artefacts in 20% of apparent PSG
To assess the reliability of our procedure and estimate the impact of common errors in large-scale sequence analy-sis, we performed systematic manual inspection of all 1405 aPSR and 416 aPSG. Based on the results, we implemented filters for automatic detection of false predictions of positive selection (referred to as artefacts; Materials and Methods). Although the large majority of predictions are reliable, 85 aPSG (20%) contain artefacts.
The artefacts we encountered fall into five classes (Figure 3A, Supplementary Table S3). (I) Orthology inference or gene clustering errors (occurring in 8 of 85 problem cases, 9%). One gene cluster consists of the TRIM60 sequences of seven primates plus the TRIM75 sequences of baboon and gibbon. TRIM60 and TRIM75 are close paralogs (46% identical amino acids, Figure 3B) encoded within a 43 kb region on chromosome 4 in human. TRIM75P is an annotated pseudogene in human and TRIM60P is a predicted pseudogene in gibbon, despite both appearing like fully functional genes: they lack frame-shift mutations or premature stop codons and remain highly similar to the corresponding non-pseudogenic copy in the other species (97-98% identical amino acids). Clustering of the outparalogs TRIM60 and TRIM75 led to artificially high substitution rates and an abundance of apparent PSR across the full alignment (Supplementary Figure S3). (II) Alternative transcript definitions (52 cases, 61%). In most cases this leads to alignment of divergent exons (Supplementary Figure S3). The CALU gene cluster consists of the coding sequences from two alternative isoforms that include either one of two genomic neighboring and homologous exons ( Figure  3C). Given their strong sequence conservation (64% identical amino acids), these mutually exclusive exons likely originated from a tandem exon duplication event (59). For six primates the CALU gene model predicted a single transcript isoform that contains the first of the tandem exons, which made these sequences inconsistent with the alternative transcript selected for the other primates. (III) Unreliable N-/Ctermini (32 cases, 38%). When they are sufficiently similar, alternative non-homologous translation starts/stops or alternative exon boundaries may still be aligned. Such cases appear as divergent, high d N /d S codons (Supplementary Figure S3). (IV) Alignment ambiguities (four cases, 5%). These are spurious cases of short, hard-to-align sequence regions, usually surrounding residues that are masked due to low alignment quality (Supplementary Table S3). Without independent data (e.g. more sequences, aligned 3D structures) it remains challenging to determine the correct alignment. (V) Other issues (three cases, 4%). These are apparent PSR with high posterior probabilities in all ML model parameter combinations, but upon closer inspection are untrustworthy (Supplementary Table S3). One site for example consisted of distinct codons distributed across the tree (6x AGC, 3x TCA), which were inferred to evolve under high d N /d S despite all encoding serine residues.

GC-biased gene conversion does not systematically affect the PSG
GC-biased gene conversion (gBGC) may be an alternative explanation for the accelerated evolution of some PSG (60). gBGC leads to increased GC content in meiotic recombination hotspots, which may inflate d N /d S estimates (42). Genes affected by gBGC are expected to have substitution patterns more biased toward GC than genes evolving under positive selection, particularly in the selectively neutral fourfold degenerate (FFD) sites.
Nucleic Acids Research, 2017, Vol. 45, No. 18 10641 We analyzed our PSG to determine whether they are affected by gBGC. First, we found that PSG have significantly lower rather than higher GC contents compared to non-PSG in all studied primates, both across the full coding sequence as well as in FFD sites (human: median GC FFD = 0.518 (PSG), 0.578 (non-PSG), P = 5.3 × 10 −6 , Mann-Whitney U test; Supplementary Figure S4). Second, amino acids enriched for PSR positions correlate only marginally with GC content (Pearson's R = 0.14, P = 0.56; Supplementary Figure S5). Third, PSG overlap only marginally with human recombination hotspots (7-8% for both PSG and non-PSG), as well as with genomic regions predicted to be affected by gBGC (46) (9% of PSG versus 11% of non-PSG). Moreover, our site-specific PSG inferences are based on substitution patterns across a primate phylogeny covering at least 36 million years of evolution, which far outdates the rapid turnover rate of recombination hotspots (e.g. even human and chimp hotspot rarely overlap (61)), strongly reducing the influence of gBGC (42). Together these data suggest that although gBGC may affect individual PSG, the adaptive signatures in the large majority of PSG are not caused by gBGC but are likely the result of positive selection.

Strong statistical evidence for positive selection in 3% of primate protein-coding genes
Removal of the 85 detected artefacts resulted in a final, curated set of 331 genes with extensive statistical evidence for positive selection across nine primates (331 PSG, 3% of 11 096 ortholog clusters analyzed; Table 1, Supplementary  Table S4). The PSG are distributed evenly across the human genome (Supplementary Figure S6). They have a median d N /d S rate of 6.85 (minimum 2.10), consistent with the strong positive selection observed across multiple evolutionary models. 106 of the 331 PSG (32%) had been reported in an overview of previous genome-wide studies for positive selection, which include comparisons of different species as well as studies of human variation (9). The 331 PSG contain 934 positively selected residues/codons (PSR). The 934 PSR comprise 0.014% of all tested human codons and 0.5% of all PSG codons, with an average of 2.8 PSR per PSG (  Figure S7), which may indicate protein functionality through charged interactions (62).

Positive selection identifies known and novel virus-host genetic conflicts
The strong signal for immunity among the positive selection data suggests that at least some rapidly evolving sites are in a genetic conflict with one or more pathogens. To further assess the ability of the PSG data to detect such conflicts, we investigated our dataset for known virus-human evolutionary interactions. We evaluated five cases in which positively selected codons, identified through evolutionary analysis, were experimentally shown to be important for restricting viral infection. Our large-scale approach correctly identified four out of five genes (TRIM5, MAVS, BST2 [tetherin], SAMHD1 (17,18,63,64); MxA was not detected (20); Supplementary Table S9). Moreover, despite using sequences from substantially less species (those case studies sequenced one gene in ∼20-30 primates), we retrieved many of the previously reported codons (seven previously reported out of  Table S9). Next, we probed our evolutionary data for novel cases of virus-human genetic conflicts. To prioritize the PSR and PSG, we integrated them with a multitude of orthogonal datasets describing various aspects of antiviral immunity and virus-host interactions (Supplementary Table S8), including virus-human protein interactions (70), functional screens of virus infection (50), gene expression in immune cell subtypes (51) and virus-infected cells (10), and maps of recent human adaptation (71). We found that PSG are enriched for genes showing differential expression upon infection with respiratory viruses ( Table S8). Besides containing well-described viral interactors such as MAVS, SAMHD1 and CD55, those 11 PSG include poorly characterized genes that may represent novel candidate virus-host interactors. For example, FBXO22 encodes an F-box family protein, which may be involved in ubiquitinmediated protein degradation. FBXO22 has not been linked to (viral) infections, other than a role in macrophage NFB activation during Salmonella infection (72). Our evolutionary data together with the virus-host interaction data suggest a role for FBXO22 in primate HPV infections, possi-bly mediated by the positively selected third codon (Pro in human, chimpanzee, baboon, marmoset--Thr, Ser, Ala or Leu in gorilla, orangutan, gibbon, macaque, vervet). Ribosomal protein S29 (RPS29) is another example of a poorly annotated PSG. It is not only a strong target of positive selection in our study but it also localizes to a genomic locus that underwent recent human adaptation (71). RPS29 is expressed in 14 immune cell subsets, interacts with six different Influenza proteins, and affects HCV and SeV replication. Thus, data-driven prioritization of genes under positive selection may reveal novel cases of virus-host interactions shaped by evolutionary conflicts.

PSR are at the structural interface between viruses and their cellular receptors
To gain further insights into the role of positive selection in viral infection, we analyzed the involvement of PSR in structurally characterized virus-host interactions. Among the PSG with strong adaptive signatures, we identified three genes that function as virus receptors and for which structures of human-virus protein complexes have been solved: CD4, CD46 and CD55. CD4 interacts with MHC class II molecules and is the classical surface marker of T helper cells (CD4 + T cells). HIV exploits CD4 as a receptor for entering host T-cells (73). The two residues identified as positively selected by our approach, Asn77 and Ala80, are part of the CD4 V-set Ig-like domain. Both lie close to the interaction interface between CD4 and HIV envelop protein gp120, with CD4 Asn77 making extensive contact and hydrogen bonding to gp120 Ser365 ( Figure 4A).
CD55 [DAF] and CD46 [MCP] are members of the regulator of complement activation (RCA) gene family that control activation of the complement cascade (21% amino acid identity; both composed of four SCR domains). CD55 acts as an entry receptor for some picornaviruses, including several types of coxsackie-, entero-and echovirus (74,75). richness and CD55 in human likely evolved under balancing selection, which maintains allelic diversity in a population (76). We identified nine positively selected positions in CD55 (two in SCR domain 1, two in SCR2, five in SCR3). Arg170 and Gly178 (part of SCR3) are involved in the interaction with echoviruses (EV) 7 and 12, with Arg170 buried deep into the EV7 viral capsid ( Figure 4B). A previous study reported additional interactions between picornaviruses and our CD55 PSR, including Val155 (EV7), Val124 and Ile206 (coxsackievirus B3) (75). CD46 is a receptor for measles virus (MV), human herpesvirus 6 and several adenovirus subspecies (Av) (77,78). Protein structure analysis revealed that the single PSR in CD46, Arg103 (SCR domain 2), makes extensive contacts with both MV hemagglutinin (His448, Asn449, Glu471; Figure 4C) and the Av type 21 fiber knob (Asn304, Ile305). Taken together, examination of the PSG with known structures suggests that positions evolving under strong positive selection may be central to the interactions between virus particles and their human cellular receptors.

Primate PSG show ongoing change within the human population
The constantly changing spectrum of viruses and other pathogens causes a recurrent selective pressure on the human genes that interact with them (13,15). We therefore expected many genes that evolved under positive selection in primates (i.e. our PSG) to still be undergoing adaptive evolution in recent and current human evolution. To investigate this, we analyzed patterns of human variation in the Exome Aggregation Consortium (ExAC) dataset of 60 706 exomes (31).
PSG, compared to genes that lacked between-primate signatures of positive selection (i.e. non-PSG), contain significantly more unique missense variants (P = 5.8 × 10 −13 ), but similar amounts of unique synonymous variants (P = 0.23, Figure 5A). Indeed, a d N /d S -based measure of human variation, corrected for all possible missense and synonymous variants based on the codon table (Materials and Methods), shows that PSG have higher human d N /d S ratios than non-PSG (median: 0.71 versus 0.63, P = 1.5 × 10 −15 ; Figure 5A). Similar patterns emerge when analyzing human variation at the level of allele frequencies rather than at the level of unique reported variants: PSG again show more total missense variation than non-PSG, but similar amounts of total synonymous variation ( Figure 5B). PSG also contain far more 'functional variation' (missense, nonsense and splice variants) according to the Residual Variation Intolerance Score, which for each gene measures the deviation from the expected amount of functional variation given the background of synonymous variation present that gene (55)(median genome-wide RVIS percentile: 71% [PSG] versus 45% [non-PSG], Figure 5C).
Positively selected residues/codons (PSR) within our primate PSG show the same trends as the PSG themselves. Of the 636 PSR in genes represented in our ExAC dataset, 269 (42%) contain reported missense variants, compared to 30% for non-PSR positions in PSG genes and 27% of non-PSR positions in any gene ( Figure 5D). PSR and non-PSR positions however contain similar synonymous variation (107/636 = 17% [PSR] versus 15%/16% [non-PSR], Figure 5D). Thus, while whole-gene d N /d S ratios are small and suggest that the majority of codons are evolving under purifying selection both between different primates and within human, the genes and codons that do evolve under positive selection in primates consistently show strong patterns of ongoing change in human populations as well.

DISCUSSION
In this study, we have presented a robust screen for genes that have undergone positive selection in humans and related primates. This was enabled by the completion of several primate genome sequences bridging key evolutionary timescales between previously available species. Earlier estimates of primate genes undergoing positive selection range from ∼1% to 4-6% to ∼10% depending on the species studied, the detection method and the number of genes tested (2)(3)(4)(6)(7)(8)25,79). Lower estimates were likely underpowered and limited by genome availability (e.g. comparing humanchimp), while higher estimates tend to be affected by the high rate of false positive predictions discussed before (24)(25)(26)(27)(28)(29)(30). We restricted our analysis to one-to-one orthologs and imposed other stringent criteria to minimize false positives. Among the selected genes only olfactory signaling genes are substantially underrepresented and for instance immune functions are neither depleted nor enriched (see Materials and Methods), suggesting that the analysis overall is not systematically biased with respect to most gene classes. Nevertheless, it is concerning that one-to-one orthology approaches ignore half of the protein-coding genes, especially since the included genes still contain an appreciable number of orthology misinferences and non-orthologous exons. Our observed d N /d S rate across one-to-one protein-coding orthologs (0.21) is similar though slightly lower than previous estimates based on human, chimpanzee and macaque [0.23-0.26 (3,6,79)]. Given our conservative approach applied to half of all human protein-coding genes, the estimate that 3% of genes and 0.014% of codons are under positive selection should represent a reliable lower limit of what is detectable using the current whole-genome sequenced primates.
Positive selection between species as a result of adaptation to a single large environmental change is typically thought to be followed by some degree of fixation of the newly acquired beneficial variants through purifying selection within populations or species (9,80)(Text S1). Such 'selective sweeps' are thus expected to reduce the nonsynonymous over synonymous variation present within a single species at codons that show high d N /d S between species. This model is assessed by for instance the McDonald-Kreitman test, which estimates how much of the variation between species is driven to fixation within species (81). In contrast, ongoing environmental changes, such as in the case of the abundance of rapidly mutating viruses and other pathogens, will cause a recurrent selective pressure and induce adaptive changes even within a species (15). Our analyses of human genetic variation indicate that positively selected genes and codons in primates tend to still show elevated rather than reduced levels of missense over synonymous substitutions within humans. Although it is unclear whether this indicates relaxed negative selection or persisting positive selection, these data do argue against fixation and in favor of recurrent, ongoing change in many of the primate PSG in the human population. As most species are currently represented by a single consensus sequence, further studies of genetic variation in different human populations (our human analyses are based on global variation, independent of populations), in other primate species (82) and in archaic hominins (e.g. Neanderthals, Denisovans) may identify targets of positive selection caused by more recent selective pressures (9,71,83).
Despite the strict nature of our computational procedure and the limited evolutionary distance between primate genomes, our manual inspection of all positive selection hits revealed a sizeable number of artefacts. This suggests there still exists a discrepancy between data availability and the reliability of large-scale comparative sequence analysis. The large majority of positive selection artefacts remaining after stringent automated filtering arise from alignment of coding sequences that are not strictly from the orthologous genomic region in different species, i.e. they arise from alignment of non-orthologous codons. The predominant underlying source of these issues are inconsistencies in gene models and genome annotation, which cause differences in coding sequence start / stop locations, exon boundaries, pseudogene predictions and alternative transcripts. Thus, rigorous manual inspection and curation at all stages of automated pipelines remain critical for reliable results and pro-vide insights into the current challenges in comparative genomics studies.
Our positive selection screen successfully identifies genes with a role in immunity and offers insights into other functions and cellular systems that have evolved adaptively. A striking number of positively selected genes (PSG) act as cell surface receptors in adaptive immunity. Others are involved in cytokine signaling and innate immune functions such as the complement system, intracellular pathogen recognition (both receptors and pathway members) or intrinsic antiviral activity. We also found adaptive signatures potentially related to other phenotypes that may be of interest, including morphology (hair protein KRTAP24-1), dietary diversity (smell and taste receptors), cholesterol and lipid transport (apolipoproteins), and energy metabolism (mitochondrial proteins). It is peculiar to observe mitochondrial proteins among the list, as these might be regarded as having conserved housekeeping functions. One explanation is the occurrence of accelerated compensatory evolution (84), in which nuclear-encoded genes adapt to slightly deleterious mutations in the mitochondrial genome that are brought about by its relatively high mutation rate and the lack of recombination. Indeed, of the 13 PSG that are targeted to the mitochondrial matrix and whose direct interactors are known, 12 interact directly with either a mitochondrial rRNA, tRNA or a mitochondrial-encoded protein (Supplementary Table S10). We further examined whether not only the protein, but also the positively selected residue interacted with a mitochondrial-encoded RNA or peptide. This appeared to be the case in two out of the five cases where structure data were available for the protein itself or for a homolog (Supplementary Table S10).
Another exciting group of PSG are those involved in centromere structure, chromosome segregation and meiosis, which may be related to the phenomenon of centromere meiotic drive (85). Meiosis in female animals is asymmetric in that only one of four haploid gametes are retained. Retention of individual chromosomes depends on a preferred binding orientation of the centromeres to the microtubular spindle via the kinetochore. Centromere evolution has been theorized to be under strong Darwinian selection because 'selfish' centromeres with a favored retention in meiosis are more likely to be transmitted (85). This process may have driven the evolution of the highly variable centromere DNA sequence, marked by a specific chromatin structure containing the CENPA histone H3 variant (85,86). We found positive selection signatures in the centromere-associated proteins CENPO, which plays a role in linking CENPA nucleosomes to the kinetochore, and CENPT, which directly connects the centromeric DNA to the kinetochore through its histone-like fold (87). In addition, our PSG contain two genes involved in the meiotic cohesion complex (REC8, SGOL2) and various genes involved in the centrosome and spindle machinery (CEP250, PCNT, CDK5RAP2, MISP). The rapid evolution of the centromere may be a driver contributing to the observed adaptations in these genes, even though a direct physical interaction with the centromere has not been established in all cases.
Previous investigations of positive selection in the virushost interaction have focused on single genes (17)(18)(19)(20) or subsets of genes known to be important for infection (12,88,89). We took a more generic approach, starting from genomic signals of positive selection. As only some of these signals will have been caused by viruses and other pathogens, deconvoluting the contributions of different selective pressures to the complex landscape of genome variation requires additional information. We integrated the evolutionary data with orthogonal data describing the virus-host interaction to prioritize selection events involving viruses, which allowed us to predict novel virus-host genetic conflicts and to indicate positions that are likely important in these interactions (Supplementary Table S8). Functional studies would be required to unambiguously assign a role for the positively selected genes and codons in the virus-host interaction. The small number of highconfidence PSG and PSR for which structural information is available restricted further quantitative investigations into whether virus-host interaction interfaces overall show accelerated evolution (16). A more lenient approach for detecting positive selection, applied to virus-interacting proteins, might be better suited for testing such trends at the level of interaction interfaces, but at the cost of reduced confidence in individual predictions. Nevertheless, for strong candidates where structural data was available, we revealed close contacts between virus particles and positively selected residues in cellular entry receptors for HIV, measles, adenoviruses and picornaviruses (CD4, CD46, CD55), suggesting positions in both viral proteins and their human receptors that are important for infection. In this way, systematic analysis of virus-host evolutionary genetics may ultimately contribute to the identification of novel targets for vaccine or antibody development.

AVAILABILITY
Our curated datasets of positively selected genes and positions, the code underlying our developed procedures, as well as other Supplementary Figures, Tables, Texts, Files and Data are available at https://github.com/robinvanderlee/ positive-selection.