Comparative Genomics Approaches Accurately Predict Deleterious Variants in Plants

Recent advances in genome resequencing have led to increased interest in prediction of the functional consequences of genetic variants. Variants at phylogenetically conserved sites are of particular interest, because they are more likely than variants at phylogenetically variable sites to have deleterious effects on fitness and contribute to phenotypic variation. Numerous comparative genomic approaches have been developed to predict deleterious variants, but the approaches are nearly always assessed based on their ability to identify known disease-causing mutations in humans. Determining the accuracy of deleterious variant predictions in nonhuman species is important to understanding evolution, domestication, and potentially to improving crop quality and yield. To examine our ability to predict deleterious variants in plants we generated a curated database of 2,910 Arabidopsis thaliana mutants with known phenotypes. We evaluated seven approaches and found that while all performed well, their relative ranking differed from prior benchmarks in humans. We conclude that deleterious mutations can be reliably predicted in A. thaliana and likely other plant species, but that the relative performance of various approaches does not necessarily translate from one species to another.

deleterious mutations phenotypes genome training set Dramatically increased numbers of reference genomes and whole genome resequencing data sets have facilitated the discovery of sequence variants and increased interest in the annotation of functional variants in many organisms. Functional annotation can yield insight into the genetic basis of phenotypic variation and is often a critical step in the identification of genes and variants underlying human disease (Ahituv et al. 2007;Cooper and Shendure 2011). In particular, interest in identifying putatively deleterious variants has increased, because these variants may contribute substantially to phenotypic variation (Manolio et al. 2009;Thornton et al. 2013). Because deleterious variants are more likely to disrupt phylogenetically conserved sites, the availability of comparative genomics data has made it possible to develop computational approaches to identifying deleterious variants genome-wide (Ng and Henikoff 2006). Although a number of approaches have been developed to identify deleterious variants within noncoding sequences (e.g., Pollard et al. 2010;Kircher et al. 2014), most have focused on variants that alter the amino acid sequence of proteins (Ng and Henikoff 2006). This focus on amino acid substitutions in protein coding sequences is in part driven by the observation that amino acid-altering single nucleotide polymorphisms (SNPs) are more often associated with phenotypic variation than other classes of variants, but also because they are the most readily identifiable class of variants that are likely to have a biological impact (1000Genomes Project Consortium et al. 2012Fay 2013;Stenson et al. 2014).
While identification of disease-causing and potentially "actionable" genetic variants is fundamental to personalized medicine, identifying deleterious variants is also broadly relevant to understanding the genetic basis of phenotypic variation. In humans, annotation of deleterious variants improves prediction accuracy of complex traits (Dudley et al. 2012). For domesticated organisms, complementation of recessive deleterious variants between haplotypes is thought to be one of the primary mechanisms underlying heterosis (Charlesworth and Willis 2009). This suggests that identification of deleterious alleles may be applied to hybrid breeding strategies (Yang et al. 2017). Elevated proportions of deleterious relative to neutral variants in domesticated species suggest a cost of domestication (Moyers et al. 2018;Lu et al. 2006;Cruz et al. 2008;Rodgers-Melnick et al. 2015;Liu et al. 2017). Studies of the genomic distribution and genetic contribution of deleterious variants can contribute both to understanding the origin and domestication of crop species and to advancing breeding and crop improvement strategies (Morrell et al. 2012).
Accurate prediction of deleterious variants is a key component of assessing their contribution to phenotypic variation. Numerous approaches for predicting deleterious variants have been developed. The performance of an approach is typically assessed using the proportion of known, disease-causing human variants that are accurately classified as deleterious. Benchmarking of various approaches using standardized test sets has shown substantial variability among approaches, and improved performance is often achieved through combining results from multiple tools (Thusberg et al. 2011;González-Pérez and López-Bigas 2011;Olatubosun et al. 2012;Grimm et al. 2015). However, the causes of performance differences across approaches are not well understood. While all approaches rely on sequence conservation at the phylogenetic level to identify deleterious variants, some approaches also incorporate protein structure, physical or biochemical properties of amino acid changes, or other attributes of protein sequence when they are available. The earliest conservation metrics used heuristic measures, sometimes including filtering or weighting to account for phylogenetic distance (Sunyaev et al. 1999;Miller and Kumar 2001;Ng and Henikoff 2003). More recent approaches have incorporated evolutionary models that account for phylogenetic distance based on putatively neutrally evolving nucleotide sites (Chun and Fay 2009;Davydov et al. 2010). Reference bias and the alignments used to calculate conservation metrics are not often emphasized, but are important for making accurate predictions and may account for some of the variability among predictions (Chun and Fay 2009;Hicks et al. 2011;Adzhubei et al. 2013). Additionally, different predictions have been found using human-based or mouse-based queries of the same substitution (Miosge et al. 2015). The accuracy of predictions is particularly dependent on the availability of annotated genomes among related species and the potential to generate sequence alignments.
Despite most approaches being developed for and applied to humans, there has been growing interest in identifying deleterious variants in non-human species in order to understand genomic patterns of variation and their contribution to complex traits, especially in plants. Patterns of deleterious variation have been examined in Arabidopsis thaliana (Cao et al. 2011), rice (Günther and Schmid 2010;Liu et al. 2017), maize (Mezmouk and Ross-Ibarra 2014;Rodgers-Melnick et al. 2015), sunflower (Renaut and Rieseberg 2015), poplar (Zhang et al. 2016), barley, and soybean (Kono et al. 2016). However, the accuracy of predictions in plants has only been examined for a small number of known variants (Günther and Schmid 2010) and only in the past few years have a diverse set of plant genomes and protein homologs become available (Goodstein et al. 2012). Furthermore, plants are known to have a larger number of multi-gene families and a higher frequency of polyploidy than occurs in mammals (Lockton and Gaut 2005). These genome-specific factors influence whether a sequence variant is truly deleterious in a given species (Comai 2005;Charlesworth 2012).
The goal of this study was to evaluate the ability of various approaches to predict deleterious variants in plants. The model system A. thaliana is a particularly attractive plant species for evaluating approaches that predict deleterious variants because decades of basic research in development, physiology, cell biology, and plant-pathogen interactions have identified large numbers of amino acid-altering mutations with phenotypic consequences. We identified seven approaches that can predict deleterious variants outside of humans (Table S1). Among these approaches, SIFT (Ng and Henikoff 2003), PolyPhen2 (Adzhubei et al. 2013) and PROVEAN (Choi et al. 2012) generate their own alignments using hits from non-redundant protein databases, whereas MAPP (Stone and Sidow 2005), GERP++ (Davydov et al. 2010), and two versions of a likelihood ratio test (Chun and Fay 2009) make predictions using pre-specified alignments as input (Table S1). Because new genome sequences are continually becoming available, the BAD_Mutations pipeline was developed to flexibly identify homologs and generate alignments for any protein of interest (Kono et al. 2016). BAD_Mutations uses TBLASTX (Altschul et al. 1990) to identify the best match (homolog) from each specified genome and aligns them with PASTA (Mirarab et al. 2015). For the four approaches that require alignments, we used the BAD_Mutations pipeline applied to 42 plant genomes. BAD_Mutations was also used to implement two approaches based on a likelihood ratio test (Chun and Fay 2009;Kono et al. 2016).
To evaluate predictions of deleterious variants in plants, we generated a curated database of 2,910 A. thaliana mutants with known phenotypic alterations. We evaluated the ability of seven approaches to identify these deleterious variants and found that while performance was better than similar assessments in humans, the relative ranking and the highest performing approach differed from previously reported comparisons using human data. Our results demonstrate that reliable prediction of deleterious variants can be achieved in A. thaliana, and likely other plant species, expanding the potential value of using deleterious variants to understand naturally occurring variation and to improve crop breeding strategies.

Generation of a curated set of Arabidopsis thaliana mutations
We curated a set of amino acid-altering mutations with phenotypic impacts. Both morphological and biochemical phenotypes were represented, and mutations were in both single-copy and duplicated genes. These mutations were obtained from two sources. We generated a manually curated set of 542 amino acid-altering mutations in 155 genes with phenotypic effects that are described in the literature. These mutations were found by searching the Arabidopsis Information Resource (http://www.arabidopsis.org) for genes with either dominant or recessive alleles that differ by nucleotide substitutions. We also identified mutations using a literature search in Google Scholar (http:// scholar.google.com). For each variant, we recorded the amino acid substitution, position, and link to the published paper (Table S2). We excluded nonsense mutations because they frequently completely eliminate gene function. We identified a second set of 2,617 amino acidaltering mutations in 960 genes from the manually curated UniProt/ Swiss-Prot database (http://www.uniprot.org/) (Boutet et al. 2016). The two sets were independently generated and had an overlap of 249 mutants. Using mutants with named alleles as a proxy for those with morphological vs. biochemical phenotypes, 65% of our manually curated set and 33% of the Swiss-Prot set had macroscopic phenotypes. Duplicated genes were defined by those proteins with a significant BLASTP hit (E-value , 0.05) to another A. thaliana protein with . 60% identity. By this criterion 466 of 995 proteins were classified as duplicated.
Single nucleotide polymorphisms (SNPs) without any known phenotype were obtained from a set of 80 sequenced A. thaliana strains (Ensembl, version 81, "Cao_SNPs", (Cao et al. 2011)). At the time of download, these were the only SNP set available for unrestricted use. After filtering out sites with heterozygous or missing genotype calls, there were 10,797 biallelic amino acid-altering SNPs in the 995 proteins. We used a subset of 1,583 common SNPs (.10%) as those least likely to have phenotypic effects. Our rationale is that on average, strongly deleterious alleles are less likely to reach high frequency in a population, owing to the effects of purifying selection (Fay et al. 2001). We also assessed performance by measuring the enrichment of deleterious variants predicted for rare compared to common polymorphisms . A second set of common amino acid-altering SNPs were identified in an independent set of genes. Excluding the original set of 995 genes, we randomly selected 1,000 proteins from 35,386 peptides in the A. thaliana database. We removed 21 that carried no amino acid polymorphism in the 1,001 genomes project (http:// www.1001genomes.org). In the remaining 979 genes, we identified 40,736 biallelic amino acid altering SNPs in the 1,001 genomes project, of which 3,717 were common (.10%).

Performance evaluations of seven approaches
We assessed amino acid substitutions using seven approaches: LRT (Chun and Fay 2009 PolyPhen2 predictions were generated using the standalone software (v2.2.2) with the PolyPhen2 bundled non-redundant database (uniref100-release 2011_12) and the probabilistic variant classifier using the default HumDiv model. Precomputed SIFT 4G predictions were obtained for A. thaliana (TAIR10.23) (http://sift.bii.a-star.edu. sg) and are based on the UniRef90 database (2011). SIFT 4G predictions were not available for 855 substitutions, predominantly because the amino acid change involved more than one nucleotide change within a codon. Provean predictions (v1.1.5) were generated for all mutations using NCBI's non-redundant database (04/02/2016). MAPP and GERP++ predictions were generated using BAD_Mutations alignments and trees (see below). GERP++ generates predictions for single nucleotide positions rather than codons, based on a deficit of observed substitutions compared to that expected given a neutral substitution rate. To assess GERP++ performance we used the GERP++ score at the first, second or third position of the codon if the amino acid substitution could occur by a single change at one of those positions and the average of the GERP++ scores at the first and second positions for all other types of changes. In addition, because GERP++ did not initially perform well on the A. thaliana data using neutral substitution rates estimated from each alignment (default) we used a uniform neutral rate of 10 substitutions per site across all genes.

Implementation of BAD_Mutations pipeline
Predictions using a likelihood ratio test (LRT) were performed with the BAD_Mutations pipeline (Kono et al. 2016). The pipeline is comprised of Python and Bourne Again Shell (BASH) scripts and incorporates several open-source tools, including the alignment tool PASTA (Mirarab et al. 2015) and maximum likelihood methods implemented in HyPhy (Pond et al. 2005). The processing step of BAD_Mutations consists of five major subcommands: (1) setup; (2) fetch; (3) align; (4) predict; and (5) compile ( Figure S1). The setup subcommand generates the configuration files. The fetch subcommand downloads gzipped CDS FASTA files from both Phytozome (https://phytozome.jgi.doe.gov/pz/ portal.html) and Ensembl Plants (http://plants.ensembl.org/index.html), and then creates BLAST databases for identifying homologs. The align subcommand uses BLAST to identify homologs of any query protein and generates a protein alignment and phylogenetic tree using PASTA (Mirarab et al. 2015). The predict subcommand generates predictions for a list of codons of interest by sending a custom batch command to implement a likelihood ratio test using HyPhy. The likelihood ratio test compares the log likelihood of evolution at a single codon under a neutral model (dN = dS) to a model allowing for constraint (dN = vdS), where dN and dS are the synonymous and nonsymous substitution rates and v is a free parameter for selective constraint (Chun and Fay 2009). The compile subcommand is to generate the report and p-values. The user manual, including a brief tutorial, is available at https://github.com/ MorrellLAB/BAD_Mutations/blob/master/Manual/Manual_v1.0.md.
The BAD_Mutations pipeline makes use of sequenced and annotated genomes. We used BLAST searches of the A. thaliana gene sequences against 42 Angiosperm genomes, retaining the top hit from each species with a BLAST E-value threshold of 0.05. The homolog searches were restricted to Angiosperm genomes to avoid extensive saturation of synonymous sites. Protein alignments were generated with PASTA (Mirarab et al. 2015), and a likelihood ratio test (LRT) for constraint on each codon of interest was calculated using HyPhy (Pond et al. 2005). Sequences with 'N's or other ambiguous nucleotides were discarded prior to the likelihood ratio test. The LRT differs compared to its original formulation (Chun and Fay 2009) in that: i) dS was estimated using all codons for each gene separately, ii) query sequences were optionally masked (the entire sequence changed to N = missing) in the likelihood calculation to avoid any reference bias and iii) branches with dS greater than 3 were set to 3 to avoid spuriously high estimates of dS. Additionally, the original LRT used heuristics to eliminate sites with dN . dS, the derived allele present in another species, or sites with fewer than 10 species in the alignment. Rather than eliminating sites, we used logistic regression to provide a single probability of being deleterious based on the LRT test and these additional pieces of information.
Logistic regression was applied using both the masked and unmasked LRT p-values, where the masked p-values were generated from alignments without the A. thaliana reference allele. For the unmasked logistic regression, we used the terms log10(LRT p-value), dN/dS, Rn, and An, where Rn and An are the number of A. thaliana reference and alternative (i.e., mutant) amino acids observed in the alignment, respectively. For the masked model, we replaced An and Rn with the absolute value of Rn -An and the maximum of Rn and An, respectively. For both models p-values , 1e-16 were set to 1e-16 and constraint values . 10 were set to 10. Ten-fold cross-validation was used to assess the fit of the logistic regression. The average area under the ROC (receiver operating characteristic) curve based on cross-validation was 0.9575 (unmasked) and 0.9471 (masked). Because these values were nearly identical to the performance of the model fit to the entire dataset, 0.9581 (unmasked) and 0.9471 (masked), we used the logistic regression coefficients from the full dataset: Sensitivity, specificity, and area under the curve (AUC) were calculated for each approach using the pROC package in R (Robin et al. 2011). We define sensitivity as the proportion of phenotype-altering variants that are predicted to be deleterious, and specificity as the proportion of variants without known phenotypic effects that are predicted to be neutral. Confidence intervals for each were calculated by 2,000 replicates of stratified bootstrapping, where each replicate contains the same number of positives and negatives as in the original sample.
Combined predictions were generated based on the combined scores of six approaches: LRT, LRT-masked, PolyPhen2, Provean, GERP++, and MAPP. SIFT 4G was not included in the combined predictions because it had missing predictions for a large number (855) variants. Sites with missing predictions from one or more of the remaining approaches (n = 215) were removed. Combined predictions were generated using: 1) logistic regression with each approach's score as a predictive variable, 2) support vector machine, 3) random forest, 4) linear discriminant analysis and 5) generalized linear model with lasso penalized maximum likelihood implemented by the glmnet package in R (Friedman et al. 2010). The formula used for the ensemble methods was T$LRT+LRTm+PPH+PROVEAN+SIFT+GERP+MAPP, where T is a vector of 0/1 for true negative and true positive, and the explanatory terms are the raw scores from each of the singular prediction approaches.

Curation of a test set of Arabidopsis thaliana mutants
To evaluate approaches that predict deleterious variants, we generated a database of A. thaliana amino acid substitutions from mutants with described phenotypic alterations and common amino acid polymorphisms unlikely to affect fitness. Out of 2,910 mutants in 995 genes, 81% were from manually curated entries in UniProtKB/Swiss-Prot (n = 2,368), 10% were from our own literature curation (n = 293) and 8.6% were independently identified in both sets (n = 249) (Table S2). Within the same 995 genes, 1,583 common amino acid polymorphisms were identified in 80 accessions (Cao et al. 2011). For our analyses, we assume mutations that cause a deviation from the wildtype phenotype are likely deleterious.

Performance of approaches designed to identify deleterious variants
Using the database of A. thaliana mutations, we assessed seven approaches for their ability to distinguish deleterious and neutral changes. The approaches were selected because they can generate predictions in non-human organisms. Comparison of sensitivity to specificity showed that each approach could reliably distinguish deleterious and neutral substitutions (Figure 1). A likelihood ratio test (LRT) implemented using the BAD_Mutations pipeline showed significantly higher performance than all other approaches as measured by the area under the curve (AUC) of sensitivity vs. specificity (Figure 1, Table S3). A reference masked version of LRT (LRTm), designed to eliminate reference bias (Simons et al. 2014), was the approach with the second highest performance. PROVEAN and PolyPhen2 showed similar performance as measured by AUC, significantly higher than SIFT, GERP++ and MAPP. The relative ranking by AUC was identical when 1,050 mutations with missing predictions for at least one approach were removed (Table S3). We also found very similar measures of performance when we used common SNPs in a set of independent, randomly selected genes rather than common SNPs within the 995 genes with known phenotype altering mutations (Table S3).
A second means of assessing performance is through comparing predictions of rare vs. common variants. Common variants are likely neutral or nearly neutral, whereas deleterious alleles are expected to be kept at low frequency (Ewens 2004). Using SNPs identified in a set of 80 A. thaliana strains, we found each approach identified more deleterious SNPs at low compared to common frequencies (Figure 2). At minor allele frequencies between 2/80 (2.5%) and 8/80 (10%), the LRTm and SIFT predicted a lower proportion of deleterious SNPs compared to the other approaches, indicating that they are less sensitive to detecting alleles under weak selection. At the lowest frequency 1/80 (1.25%), which is expected to include many rare and potentially strongly deleterious variants, LRT called the largest proportion of SNPs deleterious.

Performance across phenotypic and duplicate gene categories
To further characterize differences in performance we compared class of variants, including those identified by genome-wide mutant screens or by directly targeting individual proteins. Mutants identified from screens have gross morphological or easily observable phenotypic effects and are often assigned allele names, whereas directed mutants are not often given allele names and tend to have biochemical phenotypes. To compare these two groups, we split the data into those with allele names (1,910), as a proxy for those with gross phenotypes, and those without allele names (1,000), as a proxy for biochemical phenotypes. As measured by AUC, some of the approaches performed better than others and performance was more similar for the gross phenotypic class compared to the biochemical class (Figure 3a). Both SIFT and PolyPhen2 demonstrated the largest difference in performance for predicting mutations with gross phenotypic alterations vs. biochemical phenotypes. For this type of mutation, the performance of PolyPhen2 was comparable to the LRT.
Gene duplication may reduce prior selective constraints on a protein, enabling variants to occur at previously conserved sites (Kondrashov et al. 2002). Thus, duplicated genes may pose challenges to predicting deleterious alleles, and none of the approaches explicitly distinguish orthologs and paralogs. We identified 466 of the 995 genes as duplicated in A. thaliana based on BLASTP hits with 60% or more identity. We compared the performance of these genes to the remaining single copy genes. Each approach showed equal or better performance for duplicated vs. single copy genes. SIFT had the largest increase in performance (Figure 3b).

Approach dissimilarity and composite predictions
As reported previously (Doniger et al. 2008;Chun and Fay 2009;González-Pérez and López-Bigas 2011;Olatubosun et al. 2012), we found substantial disagreement in predictions among the approaches. At a 95% specificity threshold, 93.6% of mutants were predicted deleterious by one or more approach but only 51.3% were predicted deleterious by at least six of the seven approaches (Table S2). Similarly, only 0.25% of common SNPs were predicted deleterious by all approaches but 16.6% were predicted deleterious by at least one approach (LRT and LRTm were considered separately). Comparing the disagreement between approaches, we found LRT and LRTm to produce very similar predictions, but to be distinct from most of the other approaches (Figure 4). We used five models that combined the predictions of all approaches except for SIFT, which had a higher proportion of missing calls. Only two of these ensemble models, a linear discriminant analysis and a generalized linear model with penalized maximum likelihood, performed significantly higher than LRT based on an AUC (Table S4).

DISCUSSION
In this study, we benchmarked the potential for several widely-used approaches to distinguish putatively deleterious and neutral amino acid substitutions in A. thaliana. Prior evaluations of performance focused on large sets of mutants for single proteins or known human disease Figure 2 The proportion of SNPs called deleterious across frequency classes. The fraction of SNPs called deleterious by each approach (legend) at its 95% specificity threshold across five frequency classes, labeled by the number of minor alleles present (n = 80). The minor allele is defined as the allele that is less frequent in the sample. Sample sizes for the five classes are 5,303 (1), 1,646 (2), 1,250 (3-4), 1,015 (5-8) and 1,583 (.8).

Figure 3
Performance of approaches across different classes of sites. Performance is measured by the area under the curve (AUC) of the approach's sensitivity vs. specificity. Acomparison of mutants with biochemical (n = 1,000) vs. gross phenotypes (n = 1,910). Bcomparison of performance for substitutions in duplicated (n = 2,098) vs. single copy genes (n = 2,395). Confidence intervals were determined by 2,000 bootstrapping iterations.
variants (Ng and Henikoff 2003;Adzhubei et al. 2013). Overall, we find high performance across approaches in their ability to distinguish neutral and deleterious variants, validating their use in plants. The highest performance is achieved by a likelihood ratio test (LRT) implemented using the BAD_Mutations pipeline, in this case using alignments from 42 plant genomes. However, the relative performance depended on the test set and, as discussed below, differs from previous benchmarking studies in humans. Thus, we recommend caution in interpreting slight differences in performance and advocate the use of multiple methods to achieve the highest confidence.
Below, we discuss our results along with characteristics of the approaches and test data that may contribute to differences in predictions and performance when applied to non-human species. One important consideration is the distinction between deleterious variants and those that impact protein function and have phenotypic consequences. While these two groups are overlapping, they are not identical. Because conservation between species is directly related to fitness, we have used the term "deleterious" when referring to the prediction approaches. However, the test sets used to evaluate approaches are composed of variants known to affect protein function or phenotype. Thus, regardless of the nomenclature, any evaluation of approach performance necessarily assumes a large overlap between conserved amino acid positions and those that affect protein function as measured by phenotype. Equally relevant, we use common variants as "neutral" controls even though some common variants are likely to affect protein function due to local adaptation (Hancock et al. 2011) or hitchhiking (Chun and Fay 2011). Despite potential contamination, common variants provide the only large set of negative controls that can be used for training and estimating rates of false positives (Ng and Henikoff 2006). Both common and rare variants may also have compensatory effects on deleterious variants (Poon et al. 2005). These potential interactions between variants further complicates the identification of truly deleterious variants in any species.
Phylogenetic power, alignments, and reference databases Phylogenetic power is critical to all comparative genomic approaches that predict deleterious variants. When homologs are too closely related, not enough time has passed for neutral sites to accumulate amino acid substitutions. When homologs are too distantly related, functional sites may not be conserved due to compensatory changes or divergence in homolog function (Marini et al. 2010;Breen et al. 2012;Jordan et al. 2015). The LRT differs from the other approaches examined in that it uses synonymous sites as an internal control to account for the expected amount of protein divergence under a neutral model. As such, even homologs that are nearly identical in their amino acid sequences are informative, given that they have accumulated changes at synonymous sites. However, distantly related homologs are uninformative if divergence at synonymous sites is saturated, thus the LRT should only be applied to organisms where a sufficient number of related genomes are available. In this study, the majority of total dS values for the gene alignments was between 10 and 50, which provides sufficient divergence to test the likelihoods of constraint and relaxation (Chun and Fay 2009). GERP++ is similar to the LRT in that it uses a neutral substitution rate to make its predictions but differs in that the neutral rate must be specified rather than being estimated from synonymous sites within the alignment. GERP++ also does not make use of the genetic code to distinguish synonymous and nonsynonymous changes. In this regard, GERP++ was not appropriately applied since we used a fixed neutral rate for all genes rather than an alignment specific neutral rate.
Out of the approaches compared, phylogenetic power cannot explain the differences between the LRT, MAPP, and GERP++ because they used the same alignments. However, we did notice substantial differences in performance based on the number of ungapped sequences present in the BAD_Mutations alignment at the position being queried ( Figure S2). Both LRT and LRTm performed better than the other approaches when there were 10 or fewer sequences at the position of interest. We did not see this pattern when we used the number of sequences present at any position in the alignment, which was typically close to 42. We also did not see this pattern when we examined performance based on the number of sequences used by Provean or PolyPhen2, typically over 100 per gene.
All approaches studied here use alignments to make their predictions, making the protein database and choice of homologs to be included in the alignment a critical step. For MAPP, GERP++, and LRT we used alignments generated using the BAD_Mutations pipeline which queries proteins from a set of annotated reference genomes, in this case from 42 Angiosperm species. SIFT and PolyPhen2 use the UniRef database (2011), whereas PROVEAN uses the most recent nonredundant protein database from NCBI. Both PROVEAN and Poly-Phen2 are known to be sensitive to the choice of the reference database and criteria for inclusion of homologs (Choi et al. 2012;Adzhubei et al. 2013). Despite the choice of homologs being an important step in predicting deleterious substitutions, the use of a plant-specific or entire non-redundant database does not appear to contribute to performance differences: the target database used for prediction does not determine the ranking of approaches in terms of their AUC (Figure 1).
Despite faster runtime of the ensemble approaches with respect to the LRT-based approach, there are circumstances where the LRT-based approach would have higher accuracy. The LRT-based approaches have higher performance in cases where there is shallow alignment depth across the phylogeny, for example, in newly formed genes or rare isoforms of a transcript. The LRT-based approach is able to estimate substitution rates and predict the impact of a variant while the heuristic approaches or the ensemble approaches would likely not make a prediction, and return a missing value.

Training and test sets
Performance of an individual approach depends on both the training and test sets used to measure it. Because performance is typically measured using common SNPs and known disease variants in humans, Figure 4 Dissimilarities among approaches. Dissimilarities were computed by the pairwise number of disagreements between each approach applied to mutants and common SNPs (n = 4,493). Dissimilarities are represented by a tree based on hierarchical clustering and values below nodes are bootstrap support based on 2,000 iterations.
there has been some concern over the lack of independence between training and test sets (Dong et al. 2015;Grimm et al. 2015). However, another consideration that has not yet been examined is whether performance in one species translates to other distantly related species, which may not have the same availability of homologs from sequenced genomes spanning a range of phylogenetic relatedness. The performance of individual approaches could depend on the study system in that some approaches may expect homologs at certain phylogenetic distances, low rates of compensatory change, or low rates of gene duplication.
Previous studies of the accuracy of prediction approaches made use of five human test datasets (Dong et al. 2015;Grimm et al. 2015). We find better performance across approaches in our A. thaliana dataset than that reported for humans (Table 1). It is unclear why the approaches uniformly perform better in A. thaliana. One possibility is that the neutral and deleterious variants in A. thaliana are more distinct from one another than in humans. The very large proportion of phenotype changing variants in our A. thaliana test set that are identified as deleterious means that this test data set is less useful for approach comparison due to the small number of cases that are difficult to predict correctly.

Population and gene-specific performance
Because nearly all measures of performance use either common polymorphism or recently fixed amino acid substitutions as a proxy for neutral SNPs, population and gene-specific factors that influence neutral polymorphism are expected to influence measures of performance.
Humans have a small effective population size relative to other mammals (Leffler et al. 2012) and consequently a high ratio of nonsynonymous to synonymous diversity (Fay et al. 2001;Kosiol et al. 2008). Thus, distinguishing neutral and deleterious variants may be more difficult in humans than other species, and approaches trained using human polymorphism may be more conservative with respect to weakly deleterious variants. In comparison, predicting deleterious variants in A. thaliana may be facilitated by the fact that A. thaliana has slightly larger effective population size (Cao et al. 2011).
It should be noted that both demographic history and the process of local adaptation could play important roles in the distribution of deleterious variants. In populations that are colonizing or expanding into novel environments, the selective coefficients against individual variants may change (Slotte et al. 2013), and locally adaptive variants may become appreciably enriched. Both humans and A. thaliana are known to have undergone demographic expansion in their recent evolutionary histories (Hoffmann 2002;Finlayson 2005). While the relative extent of local adaptation in these two species is difficult to quantify, both exhibit an excess of low-frequency amino acid polymorphism characteristic of deleterious variants (Lohmueller et al. 2008;Cao et al. 2011;Henn et al. 2016).
Another potentially important factor in predicting deleterious variants is gene duplication. A. thaliana carries remnants of a whole genome duplication along with numerous tandem duplications (The Arabidopsis Genome Initiative 2000) more than are present in the human genome (Lynch and Conery 2000). Gene duplication can lead to relaxed selection during subfunctionalization or pseudogenization (Ohno 1970), enabling amino acid variants to accumulate in recently duplicated genes. However, we found very similar performance between duplicate and single copy genes, consistent with a similar finding in humans using PolyPhen2 (Adzhubei et al. 2013). Because we only included genes with known mutant phenotypes, the sample of recently duplicated genes is limited.

Conclusions and future directions
Most approaches developed to predict deleterious mutations were trained using human data and in many cases, can only be used for human proteins (Li et al. 2009;Schwarz et al. 2010;Kircher et al. 2014). This study demonstrates that several generalized approaches perform exceptionally well in A. thaliana, implying that they should also work well for other plant species. Because of the similarly high performance, other considerations such as ease of implementation and compute time may be considered when choosing an approach to identify deleterious mutations in plants. Notably, LRT requires longer run times than any of the other approaches, typically 5.2 hr of computing time per gene compared to 14.5 and 9.4 min per gene for PolyPhen2 and Provean, respectively. One way the BAD_Mutations pipeline could be sped up while retaining the flexibility of querying customizable plant genomes is by using heuristic measures of site-specific conservation rather than the LRT. Provocatively, we found similar performance (AUC = 0.9551) for a logistic model that only used the number of reference and alternative alleles in the alignment (Rn and An). However, such heuristic measures may not be robust to a change in the reference species and its distance to other genomes in the database. A second approach would be to use predictions from the combined output of multiple prediction approaches, as this has been shown to be highly effective in humans (e.g., González-Pérez and López-Bigas 2011). Although we did not find an ensemble predictor that greatly improved performance, removing LRT predictions did not reduce the performance of the ensemble predictions.