Standardized genome-wide function prediction enables comparative functional genomics: a new application area for Gene Ontologies in plants

Abstract Background Genome-wide gene function annotations are useful for hypothesis generation and for prioritizing candidate genes potentially responsible for phenotypes of interest. We functionally annotated the genes of 18 crop plant genomes across 14 species using the GOMAP pipeline. Results By comparison to existing GO annotation datasets, GOMAP-generated datasets cover more genes, contain more GO terms, and are similar in quality (based on precision and recall metrics using existing gold standards as the basis for comparison). From there, we sought to determine whether the datasets across multiple species could be used together to carry out comparative functional genomics analyses in plants. To test the idea and as a proof of concept, we created dendrograms of functional relatedness based on terms assigned for all 18 genomes. These dendrograms were compared to well-established species-level evolutionary phylogenies to determine whether trees derived were in agreement with known evolutionary relationships, which they largely are. Where discrepancies were observed, we determined branch support based on jackknifing then removed individual annotation sets by genome to identify the annotation sets causing unexpected relationships. Conclusions GOMAP-derived functional annotations used together across multiple species generally retain sufficient biological signal to recover known phylogenetic relationships based on genome-wide functional similarities, indicating that comparative functional genomics across species based on GO data holds promise for generating novel hypotheses about comparative gene function and traits.


I. BACKGROUND
Phenotypes and traits have long been the primary inspiration for biological investigation. Phenotypes are the result of a complex interplay between functions of genes and environmental cues. In an effort to organize and model gene functions, various systems of classification have been developed including systems like KEGG (the Kyoto Encyclopedia of Genes and Genomes), which is focused on protein function including gene activities superimposed on metabolic pathways [1]. Other such systems include the various Cyc databases, MapMan, and the Gene Ontologies (GO), a vocabulary of gene functions organized as a directed acyclic graph, which makes it innately tractable for computational analysis [2,3,4].
GO-based gene function annotation involves the association of GO terms to individual genes. Functions may be assigned to genes based on different types of * Contributed equally. evidence for the association. For example, functional predictions can be inferred from experiments (EXP), expression patterns (IEP), and more [5]. Computational pipelines are often used to generate functional predictions for newly sequenced genomes, where the genome is first sequenced and assembled, then gene structures (gene models) are predicted, then functions are associated with those gene predictions. Genome-wide gene function prediction datasets are frequently used to analyze gene expression studies, to prioritize candidate genes linked to a phenotype of interest, to design experiments aimed at characterizing functions of genes, and more [6,7,8].
How well a gene function prediction set models reality is influenced by how complete and correct the underlying genome assembly and gene structure annotations are as well as by how well the software used to predict functions performs.
GOMAP (the Gene Ontology Meta Annotator for Plants) is a gene function prediction pipeline for plants that generates high-coverage and reproducible functional annotations [9]. The system employs multiple functional prediction approaches, including sequence similarity, protein domain presence, and mixed-method pipelines developed to compete in the Critical Assessment of Function Annotation (CAFA) Challenge [10], a community challenge that has advanced the performance of gene function prediction pipelines over the course of five organized competitions [11].
We previously annotated gene functions for the maize B73 genome and demonstrated that GOMAP's predicted functions were closer to curated gene-term associations from the literature than those of other community functional annotation datasets, including those produced by Gramene (Ensembl pipeline) and Phytozome (Inter-pro2GO pipeline) [12]. Using the newly containerized GOMAP system [9], we report here the functional annotation of 18 plant genomes across the 14 crop plant species shown in Table I and report comparisons of performance based on comparison to a Gold Standard gene function datasets, where possible.
Given these multiple annotations across various plant species, we next considered whether these datasets could be used together for comparative functional genomics in plants. We describe here a simple and crude method by which we used gene function annotations to generate dendrograms of genome-level similarity in function. This idea is similar to that of Zhu et al., who determined the evolutionary relationships among microorganisms based on whole-genome functional similarity [13]. Here we expand on that approach, analyzing genome-wide GO assignments to generate parsimony and distance-based dendrograms (see Figure 1 for process overview). We compared these with well-established species phylogenies ( Figure 2) to determine whether trees derived from gene function show any agreement with evolutionary histories, taking agreement between generated dendrograms and known evolutionary histories to be evidence that sufficient comparative biological signal exists to begin to use GO functional annotations across multiple plant genomes for comparative functional genomics investigations.

A. Overview
As shown in Figure 1, gene function annotation sets were created and compiled for each genome. For those with existing annotation sets available on Gramene or Phytozome [50,51], the datasets were compared. From there, matrices that included genomes as rows and terms as columns were generated. These were used directly to build parsimony trees or to create distance matrices for neighbor-joining tree construction [52,53,54]. In subsequent analyses, jackknifing was used to remove terms (columns) or to remove genomes (rows) to map the source of signal for treebuilding results [55]. Table II shows quantitative attributes of each of the annotation sets. In summary, GOMAP covers all annotated genomes with at least one annotation per gene, and provides between 3.8 and 12.1 times as many annotations as Gramene or Phytozome.

B. Functional Annotation Sets Produced
Quality evaluation of gene function predictions is not trivial and is approached by different research groups in different ways. Most often datasets are assessed by comparing the set of predicted functions for a given gene to a Gold Standard consisting of annotations that are assumed to be correct. This assumption of correctness can be based on any number of criteria. Here we used as our Gold Standard dataset all annotations present in Gramene63 that had a non-IEA (non-Inferred by Electronic Annotation) evidence code, i.e. we used only annotations that had some manual curation. This enabled us to assess the 10 genomes shown in Table II. It is perhaps noteworthy that the IEA and non-IEA annotation sets from Gramene63 frequently contain overlaps, indicating that some of the predicted annotations were manually confirmed afterwards by a curator and that in such cases, a new annotation was asserted with the new evidence code rather than simply upgrading the evidence code from IEA to some other code, thus preserving the IEA annotations in Gramene63 that are produced by the Ensembl analysis pipeline [56], a requirement for comparing GOMAP-produced IEA datasets to the IEA datasets produced by the Ensembl pipeline.
A general limitation of using Gold Standards for quality evaluation is that they can never be assumed to be complete, and therefore false positives in the prediction cannot be distinguished from false negatives in the Gold Standard. In other words, is gene X, function Y truly a wrong prediction or has it simply not yet been discovered experimentally? This problem is laid out in more detail in [57]. As a consequence, the quality of larger prediction sets will be systematically underestimated compared to smaller ones, and this effect is exacerbated the more incomplete the Gold Standard is.
There are many different metrics that have been used to evaluate the quality of predicted functional annotations. For the maize B73 GOMAP annotation assessment in [12], we had used a modified version of the hierarchical evaluation metrics originally introduced in [58] because they were simple, clear, and part of an earlier attempt at unifying and standardizing GO annotation comparisons [59]. In the meantime, Plyusnin et al. published an approach for evaluating different metrics showing variation among the robustness of different approaches to quality assessment [60]. Based on their recommendations, we use here the SimGIC2 and Term-centric Area Under Precision-Recall Curve (TC-AUCPCR) metrics. We also evaluated with the F max metric, simply because it is widely-used (e.g., by [10]), even though according to Plyusnin et al., it is actually a flawed metric [60]. Results of the quality assessments for the 10 genomes where a Gold Standard was available are shown in Table III and Figure S2. While evaluation values differ between metrics and the scores are not directly comparable, a few consistent patterns emerge: GOMAP annotations are almost always better than Gramene and Phytozome annotations in the Cellular Component and Molecular Function aspect, with the only three exceptions being the Molecular Latest overview at https://dill-picl.org/projects/gomap/gomap-datasets/ * Previously published in [9].
Function aspect for T. aestivum using the TC-AUCPCR and the F max metric and the Cellular Component aspect for M. truncatula A17 using the F max metric. Conversely, GOMAP predictions achieve consistently lower quality scores in the Biological Process aspect with the exception of B. dystachion, O. sativa, and S. bicolor with the TC-AUCPR metric. Generally, annotations that are better in one aspect are also better in the other two aspects but the ranking of annotations does not necessarily hold across metrics. The Phytozome annotation for O. sativa is an outlier in terms of its comparative quality, potentially because it is based on a modified structural annotation that differs substantially from the Gold Standard and the other annotations under comparison.

C. Phylogenetic Tree Analyses
With the comparative quality of gene function predictions in hand, we approached the question of whether the datasets could be used together for comparative functional analysis across all genomes. As a simple first step, we began to work toward understanding the degree to which trees built based on gene functions agree with known, well-documented evolutionary relatedness. We constructed neighbor-joining and parsimony trees of the 18 plant genomes, and visulized them using iTOL [61]. The two tree topologies, rooted at P. lambertiana, were compared to one another and to the topology of the expected tree ( Figure 2). For both the neighbor-joining ( Figure 3a) and parsimony trees (Figure 3b), one common difference is noted: S. bicolor is not at the base of the Z. mays clade as expected, and is clustered with B. distachyon instead. Notable differences between the neighbor-joining and parsimony tree are the following: C. sativa appears at the base of the eudicots instead of G. raimondii in the neighbor-joining tree, while G. raimondii is grouped with C. sativa and A. hypogaea is grouped with G. max in the parsimony tree. Second, O. sativa was expected to be at the base of the BOP clade, but appears at the base of Z. mays in the neighborjoining tree, and at the base of all angiosperms in the parsimony tree. Differences among relationships within the Z. mays clade constaining B73, PH207, W22, and Mo17 were disregarded given the high degree of similarity across annotation sets and the fact that these relationships are not clear given the complex nature of withinspecies relationships.
Due to differences between the function-based dendrograms and the expected tree, jackknifing analysis was carried out by removing terms (columns in underlying datasets) to determine the degree to which the underlying datasets support specific groupings based on functional term assignments. This analysis was carried out for both neighbor-joining and parsimony trees. First, trees were generated by omitting 5% to 95% of the dataset in increments of 5 to determine the threshold at which the tree topologies deviated from those generated using the full dataset. That threshold was reached at 45% for both neighbor-joining and parsimony; therefore, we used trees generated with 40% of the data removed for reporting branch support values for the topology (Figure 3). Comparing the two trees, the parsimony topology was not as solid as that of the neighbor-joining at jackknife values up to 40%. Based on this robustness for neighbor-joining treebuilding in general, we carried out all subsequent analyses using neighbor-joining treebuilding methods.
We considered investigating the effect of using one  GO aspect to generate our neighbor-joining tree. In other words, we generated the neighbor-joining trees using cellular component GO terms, molecular function GO terms, and biological process GO terms separately (Supplementary Figure 3). Of the 14,303 total GO terms, 1,524 are cellular component terms, 3,926 are molecular function terms, and 8,853 are biological process terms. Out of the three single aspect phylogenetic trees, the one built using molecular function terms is the closest to our neighbor-joining tree obtained using all GO terms in our datasets Figure 3a. The only difference is that A. hypogaea and G. max are clustered in the molecular function tree, while they are not in our neighbor-joining tree Figure 3a. In the cellular component tree, G. raimondii and C. sativa are clustered together when they are not in the neighbor-joining tree with all GO aspects Figure 3a. Also, O. sativa is at the base of the monocots just like in the expected tree, but not in the neighborjoining tree Figure 3a. In the biological process tree, O. sativa is at the base of the angiosperms and there is no clear separation between monocots and dicots. In all the three single aspect phylogenetic trees and our all-aspect neighbor-joining tree, A. hypogaea is never placed at the base of the NPAAA clade. Also, B. distachyon and S. bi-color are always clustered together. Overall, the topologies constructed using one GO term aspect at a time are close to that of our neighbor-joining tree, such that not one GO term aspect alone restored the topology of the expected tree.
To map the source of discrepancies to specific gene annotation sets, we generated various neighbor-joining trees excluding one genome each time, an additional tree with both Medicago genomes excluded simultaneously, and another with all Z. mays genomes excluded simultaneously. To exemplify this, see the monocot clade in Figure 2 and the lower (monocot) clade in Figure 3a. When the neighbor-joining tree was generated, two species are misplaced: S. bicolor and O. sativa. As shown in Figure  5a, removal of O. sativa corrects one error (itself) but does not correct the errant grouping of S. bicolor with B. distachyon. In Figure 5b, it is shown that the removal of S. bicolor corrects the errant grouping of itself and B. distachyon, but O. sativa placement remains incorrect. However, as shown in Figure 5c, the removal of B. distachyon generates a tree where all relationships are in agreement with known species-level relationships. (Note well: all individual annotation sets were progressively removed, not just the three shown in the example.)  With this observation in hand, we sought to determine the minimum number of genomes that could be removed to create a tree that matched the expected tree topology. All possible combinations of removing 0-5 genomes to restore the topology were tested, and 10 combinations of minimum amount of genomes to be removed were obtained. The removal of four genomes was required to generate function-based trees consistent with known phylogenetic relationships. Of the 10, we selected the one that had the genomes that were most frequently part of a solution (O. sativa, 8; B. distachyon, 7; C. sativa, 6; A. hypogaea, 5; S. bicolor, 4; G. raimondii, 4; G. max, 4; T. aestivum) to show in this paper (the other combinations can be found in our publicly available dataset). To elaborate, the genomes removed here are O. sativa, B. distachyon, C. sativa, and A. hypogaea ( Figure 4). Jackknifing analysis was also carried out for this dataset with support shown. Branch support is generally higher than that for the full dataset (i.e., branch support is higher in Figure 4 than in Figure 3a), and removing genomes that are causing variations seem to stabilize the tree.

D. Potential Causes of Unexpected Groupings
As a first step toward explaining discrepancies between known evolutionary relationships and those resulting from comparative analysis of genome-wide gene function predictions, we assessed the quality of each genome assembly and structural annotation set using GenomeQC [62]. Tables IV-V and Figures 6-7 represent the resulting assembly quality, structural annotation measures of quality, and proportion of single-copy BUSCOs (Benchmarking Universal Single-Copy genes) [63] that were generated. Although these analyses make evident that the species annotated are comparatively different in both natural genome characteristics and in assembly and annotation quality aspects, it is not the case that the four species responsible for deviations between the functional annotation dendrograms and known phylogenetic relationships (i.e., C. sativa, A. hypogaea, O. sativa and B. distachyon) create these discrepancies due to issues of genome assembly and/or annotation quality. One potential for some explanation is in relation to C. sativa, which is the only genome that has an assembly length larger than the expected (see IV), and a comparatively large proportion of missing BUSCOs in the assembly (see Figure 6). Similarly, for A. hypogaea and O. sativa, there is a large proportion of missing BUSCOs in the annotations (see Figure 7).

III. DISCUSSION
In this study, we used the GOMAP pipeline to produce whole-genome GO annotations for 18 genome assembly and annotation sets from 14 plant species [9]. Assessments of the number of terms predicted as well as the quality of predictions indicate that GOMAP functional prediction datasets cover more genes, contain more predictions per gene, and are of similar quality to prediction datasets produced by other systems, thus supporting the notion that these high-coverage datasets are a useful addition for researchers who are interested in genome-level analyses, including efforts aimed at prioritizing candidate genes for downstream analyses. Given that we can now produce high-quality, whole genome functional annotations for plants in a straightforward way, we intend to produce more of these over time (indeed we recently annotated Vitis vinifera [64], Brassica rapa [65], Musa acuminata [66], Theobroma cacao [67], Coffea canephora [68], Vaccinium corymbosum [69] Solanum lycopersicum [70], and Solanum pennellii [71]).
With 18 genome functional annotations in hand, we sought to determine whether and how researchers could use multispecies GO annotation datasets to perform comparative functional genomics analyses. As a proof of concept, we adapted phylogenetic tree-building methods to use the gene function terms assigned to genes represented by the genomes to build dendrograms of functional relatedness and hypothesized that if the functions were comparable across species, the resulting trees would closely match evolutionary relationships. To our delight and surprise, the neighbor-joining and parsimony trees ( Figure  3) did resemble known phylogenies, but were not exact matches to broadly accepted phylogenetic relationships.
After removing the minimum number of genomes that resulted in restoration of the expected evolutionary relationships, we found that the individual species that may be responsible for the discrepancies observed in Figure 3 Zea mays   were C. sativa, A. hypogaea, O. sativa and B. distachyon. We hypothesize that the following could account for such errant relationships: 1) Quality of sequencing and coverage assembly: genomes of similarly high sequence coverage that have excellent gene calling would be anticipated to create the best source for functional annotation. Genomes of comparatively lower, or different, character would be anticipated to mislead treebuilding and other comparative genomics approaches.
2) Shared selected or natural traits: species that have been selected for, e.g., oilseeds may share genes involved in synthesis of various oils. Other shared traits would be anticipated to cause similarities for species with those shared traits.
3) Lack of good representation of diverse plant biology aspects in the GO graph: most plant-specific GO terms were derived from functional analysis of one model species, Arabidopsis thaliana. This single-source for presence of plant-specific functions limits the graph from containing unique functional aspects of plant biology represented in other species' genomes. In addition, this limitation could lead to the assignment of unknown or errant functions based on a lack of closely related and/or plantspecific terms. 4) Use of a simple method of treebuilding based on the presence or absence of gene function terms: the method we devised and describe here is not sophisticated enough to make full use of information in the GO graphs such that we recover the full detail of the species' evolutionary histories from the simple method.
To consider the first of these, we looked at genome assembly and annotation quality metrics (see Tables IV-V and Figures 6-7). For B. distachyon we could find no compelling evidence that assembly structural or functional annotation quality differed significantly from all others, except in the case of C. sativa, where we noted that the assembly length exceeded the predicted genome size based on C-values for genome sizes reported previously [72]. In this case, the fact that the C. sativa line sequenced is not inbred [73] may be responsible for the inflated assembly size relative to what is expected. This means that in the assembly, there are likely regions where alleles between chromosomes do not align, which would inflate the overall length of the assembly. In addition, the assembly misses a large proportion of BUSCO genes compared to most other genomes included in this analysis. Indeed, the comparatively low-quality assembly for Cannabis genome has been noted by others [74], and our preliminary investigations indicate that the assembly length is in fact longer than expected.
In an attempt to better understand conflicting phylogenetic signals that could be caused by the second potential cause, i.e., shared selected or natural traits, we mapped all GO terms that exist in our binary matrix and traced character history (presence/absence) on the nodes and leaves of our expected evolutionary tree using the software Mesquite version 3.61 [75]. These data summarize the gain or loss of each GO term across the species described in this paper and can be found in our Github repository. We carried out a number of simple experiments to reveal which terms could be causal for errant relationships (e.g., dropping all unique terms from the B. distachyon dataset, reconstructing the term states at  B. distachyon. Tree scale is shown above nodes that should be where B. distachyon should occur, etc., and could not identify any biologically compelling patterns. (Because these analyses were not fruitful, they were not specifically included in our materials and methods, though we do include the input datasets here, at the link provided in section V "Availability of Source Code and Supporting Data" of the paper, for others to consider and peruse independently.) A likely limitation in this analysis is the effect of the third potential cause of discrepancies between our generated phylogenetic trees and the expected topology: a deficiency of terms that describe diverse gene functions within across the diversity of plant species. Because most  GO terms specific to plant biology are likely derived from Arabidopsis, a model dicot species, gene functions unique to other species are expected to be missing from the GO graphs [76]. This source of error will only be corrected over time as gene functions unique to diverse plant species are populated into the GO graph. We consider the most likely explanation for observed discrepancies between the known evolutionary phylogenies and dendrograms created based on GO terms describing gene function to be a result of the fourth explanation: the simplicity of the treebuilding models and methods we used for these analyses. Because the treebuilding and analytics described in this paper were based on the presence/absence of GO terms, novel terms are highly influential to the outcomes of the analysis and the number of times a term is used does not influence the outcome at all. In contrast, plant genomes are notable for having many duplicated genes as a result of whole-genome and segmental duplications over evolutionary history, so these duplications are in fact a feature of and marker for what happened to that genome over time. Therefore, using presence/absence of GO terms where the count of term occurrences are not weighted may be too simple to get at the genuine biological complexities represented in any given plant genome. Our simplistic demonstration of the utility of GO datasets for comparative functional genomics shows that more sophisticated methods are very promising for comparative functional genomics analyses.
It should be noted that comparative analyses using Complete & Single Copy Figure 6: Assembly BUSCO plot generated using GenomeQC. Genomes analyzed are shown across the X-axis, and are ordered to match the occurrence of species shown in Figure 2. Percentage of BUSCO genes across four gene categories are stacked, with each adding up to 100 percent (Y-axis). Complete and single-copy genes are shown in grey, complete and duplicated copies in orange, fragmented copies in blue, and missing are shown copies in green.
gene functions are not completely absent from the literature -though they are absent for large genome comparisons in plants. An example of such existing comparative use of GO is one where a given tree topology was used to look for gains and losses of functions mapped to independently derived trees, which was reported by Schwacke et al. [77]. They report, as an example of their method, an analysis of gene loss in Cuscuta, a parasitic plant based on analysis using the Mapman ontology. In their work, they showed considerable loss of genes, which is a hallmark of the parasitic lifestyle. Our efforts differ in that we used the functions directly to infer tree structures as a demonstration that sufficient biological signal is present in GO-based datasets of genome-wide function prediction to reproduce known biological relationships. The method we used was quick and dirty, and we anticipate that refinements in approach that consider multiple copies of genes, as well as using different types of graph and network representations beyond tree structures, are logical next steps for refining the use of GO terms for comparative functional genomics analyses in plants. With that in mind, we look forward not only to developing systems to support GO-based comparative functional genomics tools, but also to seeing the tools other research groups will develop to approach the use of these datasets to formulate novel comparative functional genomics hypotheses.

A. Acquiring Input Datasets
For each of the 18 genomes listed in Table I, information on how to access input annotation products are listed by DOI. For each, one representative translated peptide sequence per protein coding gene was selected and used as the input for GOMAP, a gene function prediction tool for plants that is actively maintained, updated, and versioned. Details of how GOMAP annotations are derived including the specificity of component datasets and which terms are retained are described   Figure 2. Percentage of BUSCO genes across four gene categories are stacked, with each adding up to 100 percent (Y-axis). Complete and single-copy genes are shown in grey, complete and duplicated copies in orange, fragmented copies in blue, and missing are shown copies in green.
elsewhere [12] [9]. In brief, GOMAP annotations are a combination of the annotations from multiple sources. GOMAP combines the annotations from all the sources and removes the less specific annotations that could be inferred from the more specific annotations, keeping only the most specific terms for each gene that cannot be inferred from other terms (i.e., only leaf terms). Unless the authors of the genome provided a set of representative sequences designated as canonical, we chose the longest translated peptide sequence as the representative for each gene model. In general, non-IUPAC characters and trailing asterisks (*) were removed from the sequences, and headers were simplified to contain only non-special characters. The corresponding script for each dataset can be found at the respective DOI. Based on this input, GOMAP yielded a functional annotation set spanning all protein-coding genes in the genome. Using the Gene Ontology version releases/2020-10-09, this functional annotation set was cleaned up by removing duplicates, annotations with qualifiers (NOT, contributes to, colocalizes with; column 4 in the GAF 2.1 format), and obsolete GO terms. Any terms containing alternative identifiers were merged to their respective main identifier, uncovering a few additional duplicates, which were also removed. Table SII shows the number of annotations removed from each dataset produced.
To compare the quality of GOMAP predictions to currently available functional predictions from Gramene and Phytozome, we downloaded IEA annotations from Gramene (version 63, [50], https://www.gramene.org/) and Phytozome (version 12, [51], https://phytozome.jgi.doe.gov/) for each species with functional annotations of the same genome version. These datasets were cleaned as above. Similarly cleaned non-IEA annotations from Gramene63 served as the Gold Standard wherever they were available. More detailed information on how these datasets were accessed can be found at https: //github.com/Dill-PICL/GOMAP-Paper-2019.1/ blob/master/data/go_annotation_sets/README.md.

B. Quantitative and Qualitative Evaluation
The number of annotations in each clean dataset was determined and related to the number of protein coding genes (based on transcripts in the input FASTA file). This was done for separately for each GO aspect as well as in total.
We used version b6309cb (also included in our code as a submodule) to calculate SimGIC2, TC-AUCPCR, and F max quality scores. To provide the information content required for the SimGIC2 metric, the Arabidopsis GOA from https://www.ebi.ac.uk/GOA/ arabidopsis_release was used in version 2021-02-16.

C. Cladogram Construction
For clustering, we first collected all GO terms annotated to any gene in each genome into a list and removed the duplicates, yielding a one-dimensional set of GO terms for each genome (T ). Next, we added all parental terms for each term in this set (connected via is a in the ontology), their respective parental terms and higher, recursively continuing up to the very root of the ontology. Then we once again removed the duplicates, yielding a set S containing the original terms from set T as well as all terms proximal to them in the Gene Ontology directed acyclical graph. These sets with added ancestors served as a starting point of our tree-building analyses: pairwise distances between the genomes were calculated using the Jaccard distance as a metric of the dissimilarity between any two sets a and b.
A neighbor-joining tree was constructed based on the generated pairwise distance matrix using PHYLIP [52]. Additionally, term sets S of all genomes were combined into a binary matrix (with rows corresponding to genomes and columns corresponding to GO terms, values of 0 or 1 indicating whether a term is present or absent in the given set). PHYLIP pars was used to construct a parsimony tree from this binary matrix.
P. lambertiana, a gymnosperm, was included in the dataset as an outgroup to the angiosperms to separate between the monocot and eudicot clades. iTOL [78] was used to visualize the trees using their Newick format, and root them at P. lambertiana. Moreover, a cladogram representing the known phylogeny of the included taxa was created by hand based on known evolutionary relationships [79,80,81,82,83]. This was used to compare the generated phylogenetic relationship based on functional similarity with the evolutionary relationships of the plant genomes.
Jackknifing analysis was carried out for both parsimony and neighbor-joining trees to assess the support for each clade based on the proportion of jackknife trees showing the same clade. To this end, 40% of the terms in T were randomly removed, ancestors of the remaining terms were added and trees constructed as above. The majority rule consensus tree of 100 individual trees was calculated with the jackknife values represented on each branch. The tree was then visualized using iTOL using its Newick format, and rooted again at P. lambertiana.

D. Genome Quality Evaluation
Genome size was estimated from the C-values obtained from the Plant DNA C-values data resource from the Kew Database (https://cvalues.science.kew.org). The mean C-value for a given species was used for calculating genome size estimates in base pairs (bp) using the method of [84]. In brief, Genome size (bp) = C-value (pg) * 0.978x10 9 bp pg The estimated genome size (listed in Table IV) was used as an input for GenomeQC (https://genomeqc.maizegdb.org/) [62] to calculate quality metrics. For genomes that were too large to submit through the GenomeQC webtool or had missing exon information, modified scripts of those found in GitHub of GenomeQC (https://github.com/HuffordLab/GenomeQC, commit e6140ee) were applied to calculate the assembly and structural annotation metrics in Table IV and V. BUSCO version 5.0.0 [85] was used to calculate the assembly and annotation BUSCO scores, shown in Figures 6 and 7. The input for assembly BUSCO scores were chromosome sequences, whereas inputs were transcript/mRNA/CDS sequences for the annotation BUSCO scores. For the lineage parameter, the lineage datasets used were as follows: Eudicots for C. sativa and G. raimondii, Fabales for A. hypogaea, M. truncatula A17 and R108, P. vulgaris, G. max, and V. unguiculata, and Poales for B. distachyon, O. sativa, T. aestivum, H. vulgare, S. bicolor and Z. mays B73, Mo17, W22 and PH207.

V. AVAILABILITY OF SOURCE CODE AND SUPPORTING DATA
All data and source code generated are freely available at https://github.com/Dill-PICL/ GOMAP-Paper-2019.1 under the terms of the MIT license. All software requirements and dependencies are packaged into a Singularity container so no other setup is required to reproduce our results. We will provide a DOI through Zenodo for the final version of the manuscript after reviews and corrections are incorporated.
An up-to-date list of all available annotation sets can be found at https://dill-picl.org/projects/gomap/ gomap-datasets/.

A. Competing Interests
The author(s) declare that they have no competing interests.

B. Funding
This work has been supported by the Iowa State University Plant Sciences Institute Faculty Scholars Program to CJLD, the Predictive Plant Phenomics NSF Research Traineeship (#DGE-1545453) to CJLD (CFY and KOC are a trainees), and IOW04714 Hatch funding to Iowa State University.

C. Author's Contributions
LF, DP, CFY, KOC, HAD, PJ, DCS, HV, and KW generated annotations for plants as described in this paper. DP and CJLD co-conceived the idea for phylogenetic analysis. DP wrote the code for the analyses in this paper. LF worked with DP to create dendrograms and compare those to phylogenetic trees. LF carried out assembly and annotation metric comparisons. LF, DP, and CJLD wrote the manuscript. All authors read, offered suggestions to improve, and approved the final copy of the manuscript.

VII. ACKNOWLEDGEMENTS
Thanks to Steven Cannon for help to understand phylogenetic relationships among eudicots and helpful discussions and to Toby Kellogg and Jedrzej Szymanski for discussions and ideas on how to consider the data. Thanks to Nancy Manchanda for reviewing documentation and for checking genome versions used as input for GenomeQC. Thanks to Darwin Campbell who guided the deposition of datasets with CyVerse for release and DOI assignment. Thanks to the reviewers for their suggestions which have improved this manuscript.

VIII. AUTHORS' INFORMATION
KW created the GOMAP system during his time as a graduate student at Iowa State University. LF, DP, CFY, KOC, HV, and PJ are currently graduate students. HAD and DCS are undergraduate students. Each graduate and undergraduate student annotated at least one genome over the course of a research rotation lasting no more than one semester. CJLD coordinated research activities and manuscript preparation. Table II: Quantitative metrics of the cleaned functional annotation sets. CC, MF, BP, and A refer to the aspects of the Gene Ontology: Cellular Component, Molecular Function, Biological Process, and Any/All. GOMAP covers all genomes with at least one annotation per gene and provides substantially more annotations than Gramene63 or Phytozome, especially in the BP aspect. The total number of annotations per dataset is visualized in Figure S1. Download this table (CSV) a How many genes in the genome have at least one GO term from the CC, MF, BP aspect annotated to them? A = How many at least one from any aspect?
(A = CC ∪ MF ∪ BP) b How many annotations in the CC, MF, and BP aspect does this dataset contain? A = How many in total? A = CC + MF + BP c Take a typical gene that is present in the annotation set. How many annotations does it have in each aspect? A = How many in total? Please note that A = CC + MF + BP  Figure S2.   Supplementary Figure S2: Quality scores of the predicted annotation visualized as a grayscale heatmap. Each row represents a single species from a single data source. Input gene model dataset origin is indicated -blue for GOMAP, yellow for Gramene63, and orange for Phytozome12. Across the top are the three metric types used to assess the annotation quality. Across the bottom are subgraph indicators: C is Cellular Component, F is Molecular Function, and P is Biological Process. Darker cells indicate a higher (better) score whereas lighter cells indicate a lower score. Note that scales are different for each metric type (meaning that comparisons across the three metric types are not meaningful). Rows are clustered by pairwise correlation/similarity across all metrics. The dendrogram at left retraces clustering.  Figure S3: Neighbor-joining trees built on annotation subsets per GO aspect. Phylograms are colored and rooted as described in Figure 2. For each tree, only the annotations from the respective aspect of the Gene Ontology were used.  Click here to access/download; Figure;Figure1      Supplemental Figure S2 Click here to access/download; Figure;FigureS2