OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches

Abstract Motivation Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive. Results Here, we first show that in multiple animal and plant datasets, 18–62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. Availabilityand implementation OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. Supplementary information Supplementary data are available at Bioinformatics online.

( 6 ∈ ) = 1 − (1 − ( 6 )) |9| The probability of observing 6 in a HOG of size one, i.e. with one k-mer, ( ) is approximated as the mean frequency of query k-mers inside the k-mer table (the average fraction of HOGs containing each query k-mer # [remember that each k-mer can only be stored once per root-HOG]).
In the non-parametric approach, (2) is simply (1) obtained from a random permutation of the query sequence. In an attempt to conserve some local composition bias, the permutation is performed by shuffling windows of size six in addition to shuffling individual amino acids within each such window.
Note that this approach additionally corrects for HOG composition biases.
Finally, to make the OMAmer-score comparable across queries, (3) is divided by | |, from which was subtracted the number of query k-mers shared with more ancestral HOGs.

Datasets and software parameters
OMAmer was compared with two closest sequence methods lying at different extremes of the speedaccuracy tradeoff: DIAMOND (v0.9.14) and Smith-Waterman, respectively. Due to the computational cost of performing Smith-Waterman alignments, we used pre-computed alignments from OMA (January 2020) (Altenhoff et al., 2018). DIAMOND databases were built with default parameters, and searches for the most similar sequence were performed with effectively no significance requirement (E-value set to 1e6). The OMAmer k-mer table was built with a k-mer size of 6.
OMAmer directly yields family and subfamily predictions. For Smith-Waterman and DIAMOND, each query was assigned to the family and most specific subfamily of its closest reference protein. To obtain multiple precision-recall values, predictions were computed for multiple score thresholds: E-values of 1e-322 to 1e6 for DIAMOND, alignment scores of 1 to 5,000 for Smith-Waterman and OMAmer-scores of 0 to 0.99.
To make family-level assignments comparable and well differentiated from subfamily-assignments, we selected HOGs from OMA (January 2020) defined at the Metazoa and Viridiplantae taxonomic levels as root-HOGs (families), and their sub-HOGs as subfamilies. To avoid low-confidence families, we further filtered out root-HOGs with less than six proteins. We picked Metazoa because it is one of the largest clades in OMA and Viridiplantae due to the high number of duplications and thus subfamilies in this clade. Note, due to the addition of Branchiostoma lanceolatum in the January 2020 OMA release, we removed it from the reference database used (before the k-mer index precomputation) to keep the same evolutionary distance existing between Branchiostoma floridae and reference proteomes of the previous OMA release (June 2019).
Then, we selected six species as experiment targets picked because they stand as outgroups of large clades in OMA and thus display some variability in divergence ages to reference species. Platypus, spotted gar and amphioxus were selected in Metazoa, while Gray rockcress, wine grape and Amborella trichopoda were chosen in Viridiplantae (Supp.

. Number of family-level TP queries overlapping between methods. TP sets were defined at F1max
for DIAMOND and OMAmer and at the minimum score (1) for Smith-Waterman alignments. These queries were used to assess subfamily assignment.

Supp. Fig. 2. Comparison of family assignments between OMAmer and DIAMOND across negative datasets.
(Left) Each curve displays the range of trade-offs between precision and recall when varying the threshold on the OMAmer-score or on the DIAMOND E-value. The curves labeled Bacteria refer to analyses using bacteriaspecific sequences as negatives whereas those labeled Random refer to using random sequences as negatives.
Crosses indicate the location of F1max values. (Right) Fraction of FPs coming from the misassignment of positive sequences.

Supp. Fig. 3. Comparison of subfamily assignments with OMAmer and by closest sequence (Smith-Waterman and DIAMOND).
Each curve displays the range of trade-offs between precision and recall when varying the threshold either on the OMAmer-score, on the DIAMOND E-value or on the Smith-Waterman alignment score.
These results were computed using the more stringent validation procedure. F1max values are annotated with crosses. Supp. Figure 5. Partitioning of subfamily assignments at F1max into closest sequence configuration.

Supp. Figure 6. Run time (left) and maximum memory usage (right) during database build for DIAMOND and
OMAmer. Whilst OMAmer is slower and requires more memory due to the increased pre-processing to enable fast lookup time, the increase in time and memory is linear with the number of reference genomes in the resulting database. 2600 (Putnam et al., 2008) Gray rockcress