ncOrtho: efficient and reliable identification of miRNA orthologs

Abstract MicroRNAs (miRNAs) are post-transcriptional regulators that finetune gene expression via translational repression or degradation of their target mRNAs. Despite their functional relevance, frameworks for the scalable and accurate detection of miRNA orthologs are missing. Consequently, there is still no comprehensive picture of how miRNAs and their associated regulatory networks have evolved. Here we present ncOrtho, a synteny informed pipeline for the targeted search of miRNA orthologs in unannotated genome sequences. ncOrtho matches miRNA annotations from multi-tissue transcriptomes in precision, while scaling to the analysis of hundreds of custom-selected species. The presence-absence pattern of orthologs to 266 human miRNA families across 402 vertebrate species reveals four bursts of miRNA acquisition, of which the most recent event occurred in the last common ancestor of higher primates. miRNA families are rarely modified or lost, but notable exceptions for both events exist. miRNA co-ortholog numbers faithfully indicate lineage-specific whole genome duplications, and miRNAs are powerful markers for phylogenomic analyses. Their exceptionally low genetic diversity makes them suitable to resolve clades where the phylogenetic signal is blurred by incomplete lineage sorting of ancestral alleles. In summary, ncOrtho allows to routinely consider miRNAs in evolutionary analyses that were thus far reserved to protein-coding genes.


INTRODUCTION
MicroRN As (miRN As) are single-stranded, short noncoding RNAs of ∼22 nucleotides (nt) in length that have essential roles in fine-tuning the expression network of eukaryotic cells (1)(2)(3). Canonical miRNAs are transcribed by RN A Pol ymerase II resulting in primary miRNAs (pri-miRNA) that form at least one distincti v e hairpin structure. Pri-miRNA are cleaved by the Micr opr ocessor complex, which contains Drosha and DiGeorge syndrome critical region 8, resulting in microRNA precursors (pre-miRNA) ( 4 ). Pr e-miRNAs ar e approx. 55-70 nt long sequences, which are exported into the cytoplasm by Exportin 5 where the terminal loop of the pre-miRNAs is removed to create miRNA duplexes ( 4 , 5 ). This duplex is subsequently resolved, and one of the two strands is loaded into an Argonaut protein to form the mature RNA-induced silencing complex (RISC) ( 6 , 7 ). Target mRNAs are silenced post-transcriptionally through binding of the RISC, which is induced by a pairing of the 7 nt long seed region in the mature miRNA and a 6-8 nt long target site in the mRNA ( 8 ).
miRNA-dependent gene regulation in animals dates at least back to the last common ancestor of the Eumetazoa that li v ed more than 800 million years ago ( 9 , 10 ). The acquisition of novel miRNA families in the course of evolution has been correlated with body-plan innovations and an increase of overall morphological complexity (11)(12)(13)(14). Novel miRNAs can emerge via the repurposing of already transcribed sequences, such as parts of introns or proteincoding exons ( 15 , 16 ). Once these miRNAs are integrated into a regulatory network, purifying selection contributes to their preservation ( 11 ). Still, findings that suggest the loss of evolutionary old and hence well integrated miRNA families were presented ( 17 , 18 ), but these were contrasted by the claim that the loss of established miRNAs is exceedingly rare ( 19 , 20 ). It was argued that false-positi v e miRNA annotations in the miRBase repository ( 21 ), which served as the basis of the analyses, paired with a limited sensitivity of the miRN A homolo g search created a spurious signal of miRNA loss ( 22 ). Meanwhile the manually curated Mir-GeneDB has put miRNA-based r esear ch on a more solid foundation ( 19 ). In this database, each miRNA entry is supported by corresponding transcript data and meets stringent annotation criteria. Moreover, MirGeneDB embeds miRNA annotations into an evolutionary context, and version 2.1 of the database provides information about the representation of miRNA genes across 75 species ( 23 ). While this increases the specificity of miRNA annotations, Mir-GeneDB faces two main challenges: first, miRNAs that ar e expr essed at low le v els or under specific conditions are prone to be missed. Second, the taxonomic resolution and ther efor e the evolutionary information content in the data will remain low because the requirement of multi-tissue transcriptomic datasets to support miRNA calls renders the integration of novel taxa into MirGeneDB cost-and laborintensi v e. For many protected or rare species, it is e v en virtually impossible to gather the necessary data, particularly if their export falls under international regulations for species transfer in the context of CITES ( https://cites.org/eng/prog/ Permit system ).
Scanning genome sequences for the presence of orthologs to miRNAs in MirGeneDB offers a powerful alternati v e. It allows to propagate miRNA annotation across species independent from the availability of transcriptome data. Genome sequences abound in public databases and can be obtained e v en with non-invasi v e sampling (e.g. 24 , 25 ). Thus, lar ge and ev olutionary di v erse taxon collections can be anal yzed, w hich is crucial for improving the signalto-noise ratio in evolutionary analyses ( 26 ). For proteincoding genes, the identification of orthologs and the generation of comprehensi v e phylogenetic profiles, i.e. the presence / absence pattern of orthologs across large taxon sets, is well established (e.g. 27 ). In the case of miRNAs, the tools available remain sparse. Individual solutions for genome-wide scans for miRN A homolo gs have been proposed, but published approaches either lack publicly available software implementations (e.g. 17 , 28 ), or are limited to pre-defined taxon sets (e.g. MapMi; 29 ). To our knowledge, the onl y currentl y available method to iden-tify miRN A ortholo gs in custom assemblies r equir es the alignment of whole genome sequences ( 30 ). While this approach is straightforward, in principle, the computational overhead is immense. For example, an alignment of 242 placental mammalian genomes took two months of computation time on the Amazon Cloud utilizing 260 instances, each equipped with 32 virtual CPUs ( 31 ). Moreover, aligning whole genomes of species covering the full diversity of vertebrates is difficult ( 32 ). Ther efor e, only a small fraction of the genome sequences that are currently being generated (e.g. 32-34 ) will be considered in whole genome alignments.
To close this methodological gap, we have developed ncOrtho, the first software that facilitates a targeted search for miRN A ortholo gs in individual genome sequences. ncOrtho scales linearly in time with both the number of miRNAs and the number of taxa included in the ortholog sear ch, and ther efor e enables miRN A ortholo g searches in large and customizable genome collections. ncOrtho identifies orthologs with high sensitivity and specificity irrespecti v e of the genome annotation status. The resulting phylogenetic profiles provide the basis for studying miRNA evolution at an unprecedented scale and represent the first step towards projecting regulatory networks of miRNAs from model-to non-model organisms.

Data
Genome assemblies of 161 mammalian species, 241 nonmammalian vertebrates, and 16 invertebrate species were downloaded from NCBI Refseq Genomes release 207 ( 33 ;  Supplementary Table S1). Locations and pre-miRNA sequences for all 556 human miRNAs were downloaded from MirGeneDB v2.0. The MirGeneDB nomenclature was deri v ed from miRBase ( 19 , 34 ), and mapping between the nomenclatures is available on the MirGeneDB w e bpage. Whole genome alignments were downloaded from the 100 vertebrate alignment track in the UCSC Genome Browser ( 35 ).

Co v ariance models
Pairwise orthology assignments between protein-coding genes were identified with OMA standalone v2.4.1 ( 36 ) and served as anchors to identify shared syntenic regions between the genomes of humans and each core species. Positional miRN A ortholo gs were identified in the shared syntenic regions as specified in the main text and were used to generate and train a Covariance Model (CM) for each human miRNA. Sequences in the individual training sets were aligned with R-Coffee ( 37 ) and were annotated with secondary structure information calculated by RNAalifold from the Vienna RNA package 2.0 ( 38 ). The multiple sequence alignment together with the secondary structure information were then used as input for the Infernal package to train and calibrate the corresponding CM in default settings ( 39 ). CM-based sear ches wer e performed with the cmsearch algorithm provided in the Infernal v1.1 package, and search results were filtered for sequences that achie v e at least 50% of the maximally achie vab le bit score (i.e. the score of the reference miRNA gi v en the CM).

PAGE 3 OF 14
Nucleic Acids Research, 2023, Vol. 51, No. 13 e71 Evaluation of orthology assignment performance MirGeneDB provided the gold-standard to evaluate orthology assignments by ncOrtho ( 9 ). For individual species, MirGeneDB used genome assemblies other than those deposited in NCBI RefSeq Genome. In these cases, we mapped the miRNAs stored in MirGeneDB to the NCBI RefSeq assembly of the corresponding species using BLASTn (cutoff: ≥90% identity and ≥80% query coverage) ( 40 ). During the software benchmark, we compared the location of each predicted miRN A ortholo g to the genomic location of MirGeneDB entries from the same miRNA family. In case of an overlap, the candidate was accepted as a true positi v e (TP). Results that cov er a genomic region with no matching entry in MirGeneDB were tentati v ely assigned as false positi v es (FP). An assignment was considered as false negati v e (FN) if the corresponding MirGeneDB entry was not identified as an ortholog. All cases in which no ortholog was detected and MirGeneDB has no entry for that family and species were counted as true negati v es (TN). We then calculated the sensitivity as TP / (TP + FN), the specificity as TN / (TN + FP), the accuracy as (TP + TN) / (TP + FP + FN + TN), and the F 1-score as (2 × TP) / (2 * TP + FP + FN).

Alternativ e appr oaches of miRNA ortholog identification
Whole Genome alignments: We downloaded the alignment blocks of the 100 Vertebrate alignment from the UCSC Bro wser FTP-server ( http://hgdo wnload.soe.ucsc.edu/ goldenPath/hg38/multiz100way/maf), and extracted the longest alignment block covering each human pre-miRNA locus annotated in MirGeneDB ( 9 ). We considered an ortholog candidate to be found in a species if its sequence covered at least 70% of the human miRNA locus excluding gaps. BLASTn search: We used the pre-miRNA sequences from MirGeneDB as input for a local BLASTn search ( 40 ) using the 'blastn' task with default parameter settings. The best hit was used as the miRN A ortholo g candida te. MapMi: Ma ture miRNA sequences provided by MirGeneDB were used as input for the MapMi implementation on the EMBL-EBI w e bserver using the 59 'Ensembl Species' set ( 41 , 42 ).

Population variation analysis
Human single nucleotide polymorphism (SNP) data was retrie v ed from dbSNP build 153 ( 43 ). We considered only data from gnomAD v2.1 Genomes, which contains variants from 15708 human genomes ( 44 ), to ensure comparability between the SNP counts for the following se v en genomic r egions: (i) matur e, (ii) star and (iii) loop r egion of a miRNA, the 30 nt (iv) upstream-and (v) downstream flanking regions that are annotated by MirGeneDB, (vi) all CDS and (vii) all lncRNA genes of the GRCh38 assembly. SNP counts for each r egion wer e normalized by their respecti v e length and then multiplied by 1000 resulting in the SNP density per 1kb.

Phylogenetic analyses
The presence / absence patterns of human miRNAs were visualized using PhyloProfile ( 45 ) once on the le v el of indi vid-ual genes and once on the miRN A famil y le v el. The e volutionary emergence of a miRN A famil y was tentati v ely dated to the last common ancestor of the two most distantly related species in which an ortholog from the corresponding family was detected. miRNA orthologs were aligned with MUSCLE v3.8.155 ( 46 ). The individual alignments were conca tena ted and alignment columns with more than 50% gaps wer e r emoved. ModelFinder ( 47 ) identified the TIM3 + F + R5 evolutionary model as the best fitting model for the data, and the maximum likelihood tree was calculated using IQ-TREE v1.6.8 with 1000 ultr a-fast bootstr ap replicates ( 48 , 49 ). The full pipeline used for reconstructing pre-miRNA sequence trees is implemented into the ncOrtho package. Phylogenetic trees were visualized and annotated with iTOL ( 50 ) or the ETE 3 toolkit ( 51 ). Competing phylogenetic hypothesis testing was performed with the Approximately Unbiased (AU) test as implemented in IQ-TREE ( 49 ).

RESULTS
miRNAs are not yet routinely considered in large-scale evolutionary analyses because scalable approaches for the reliable identification of miRN A ortholo gs across large and phylo geneticall y di v erse taxon collections were missing. We ther efor e de v eloped ncOrtho to identify orthologs of miRNA genes in genome assemblies irrespective of their annota tion sta tus. K ey to this appr oach is the use of Covariance Models (CMs), which are probabilistic models that integrate the consensus nucleotide sequence with the secondary structure of a functional RNA and facilitate a sensiti v e and discriminati v e homolog identification ( 52 , 53 ). The phylogenetic profiles generated by ncOrtho provide detailed insights into the taxonomic distribution, the minimal evolutionary age, and the duplication / deletion history of the anal yzed miRN As.

Algorithm
ncOrtho accepts an individual pre-miRNA (the 'reference miRN A') to gether with its genomic position in the r efer ence species as input. Additionally, two lists of taxa together with the corresponding genome sequences are required. The first list defines a set of 'core species' that are considered in the training phase of the covariance model (Figure 1 A). Core species should be closely related enough for microsynteny to be conserved to an extent that a comprehensi v e identification of positional miRN A ortholo gs is possible. Additionally, the last common ancestor of the core species should not be older than the miRNA gene of interest. To facilitate core species selection, we supply a function with which microsynteny conservation can be estimated as part of the ncOrtho package (Supplementary Figure S1). For the core species, the annotation of all protein-coding genes as well as the pairwise orthology assignments to the proteins encoded in the r efer ence species ar e needed. The second list specifies the target species whose genomes should be scanned for the presence of miRN A ortholo gs. ncOrtho then follows a twostage procedure to accomplish the ortholog search (see Figure 1 ). Initially, a set of high confidence orthologs is compiled from the core species which are then used for CM con- Covariance model training. ncOrtho uses a set of high confidence orthologs to the r efer ence miRNA for CM construction. This training data is compiled by exploiting collinearity of homologous genomic segments (shared synteny) to identify positional miRN A ortholo gs in the core species. We distinguish two cases: If the miRNA is positioned between two protein-coding genes, we use the flanking genes as syntenic anchors and consider a region in the core species' genome as shared syntenic if no more than k protein-coding genes separate the orthologs of the anchor genes (parameter -mgi; Default: k = 3). To account for gene losses or gene annotation artefacts, we provide the option to consider up to n flanking genes as alternati v e syntenic anchors (parameter -max anchor dist; Default: n = 1) (Figure 1 B). If the miRNA is located within a protein-coding gene, the shared syntenic region in the core species is the genomic locus harboring the ortholog of the protein-coding gene. If no ortholog could be detected, the algorithm proceeds as if the miRNA-gene would be in an intergenic region. Upon the identification of a shared syntenic region, its sequence is extracted and a reciprocal best BLASTn hit approach using the r efer ence pr e-miRNA sequence as query identifies a miRN A ortholo g, if pr esent ( 54 ). This procedur e is r epeated for all core species, which results in a set of positional orthologs that are used to train a covariance model of the r efer ence miRNA.
Tar g eted miRNA or tholog search using covariance models . The genome sequence of each target species is scanned for the presence of a miRNA using the corresponding CM (Figure 1 C). Candidates represented by CM search hits meeting the score threshold (see methods) are accepted as an ortholog if a BLASTn search against the r efer ence genome using the candidate as query re v eals the r efer ence miRNA as the best hit. If two or more candidate orthologs for the same r efer ence miRNA ar e confirmed by the re v erse search, all are kept as co-orthologs. The runtime complexity of CM sear ches r enders scans of entir e genomes across lar ger tax on sets time consuming (Supplementary Figure S2). Therefore, ncOrtho provides the option to reduce the search space by first searching for subsequences in the target genome that share a local sequence similarity to the reference miRNA ('Quick' mode; Figure 1 C). Hits are then extended by 1000 nucleotides (nt) up-and downstream, and the resulting candidate regions serve as input for a subsequent refined search using the r efer ence miRNA-specific CM.

Benchmarking ncOrtho orthology assignments
Performance. To evaluate ncOrtho, we used the manually curated miRNA database MirGeneDB as a gold standard ( 19 ). We identified orthologs to 556 miRNA genes r epr esenting 266 miRNA families. Homo sapiens served as the r efer ence species, and Macaca mulatta, Gorilla gorilla , Pongo abelii and Nomascus leucogenys were used as core species. 242 miRNA genes were located within, and 314 between protein-coding genes, and there was no difference in the core ortholog set sizes between the two groups (Supplementary Figure S3). For the benchmark, we concentrated on 20 vertebrate species that were represented both in MirGeneDB 2.0 and in the RefSeq partition of the NCBI genome database. We further included 16 invertebrate animals to sound out the sensitivity limits of ncOrtho in miRNA families that are conserved across the Bilateria ( 9 ). Table 1 shows that sensitivity and specificity of ncOrtho in Quick mode are very high across all analyzed vertebrates. Using this mode, ncOrtho identifies orthologs with a median runtime of 0.7 s per miRNA and species (Supplementary Figure S4). In the invertebrates, orthologs are identified with near perfect specificity but the sensitivity drops substantiall y. Repeating the ortholo g search in 'Exhausti v e' mode (see Figure 1 C) had almost no effect on the sensitivity, but the run time increased by two orders of magnitude (Table 1 , Supplementary Figure S2). Thus, the search heuristic used in the Quick mode of ncOrtho does not account for the sensitivity drop. On closer inspection, we noticed that the missing of invertebrate miRN A ortholo gs coincides with changes in the loop and star region that are specific to the invertebrate orthologs (Supplementary Figure S5).
ncOr tho or thologs without MirGeneDB confirmation. ncOrtho identifies miRN A ortholo gs with an overall high accuracy, and the mature human miRNAs align with < 3 mismatches to 99% of all identified orthologs (Supplementary Figure S6A). Additionally, there was no difference in the predicted secondary structure between miRNA orthologs identified by ncOrtho and those deposited in MirGeneDB (Supplementary Figure S6B). This indicates that ncOrtho fulfills the criteria for miRN A ortholo g annotation defined by the miRN A comm unity ( 34 ). Still, on first sight a specificity of 0.90 in the vertebrate set of target species suggests that the false positi v e rate can be improved. Interestingly, the fraction of orthologs predicted solely by ncOrtho is the highest in rhesus macaque (Figure 2 A). Humans and rhesus have a genome-wide average pairwise sequence identity of 93% ( 55 ), which should render the identification of miRN A ortholo gs straightforward. We ther efor e used this species as an example to show that our findings can be at least partly explained by missing data in MirGeneDB.
A parsimonious interpretation of the phylogenetic profiles of human miRN A ortholo gs as provided by Mir-GeneDB v2.0 suggested two independent losses of miRNA families in rhesus and 55 gains on the human lineage (Figure 2 B). Complementing the profiles with orthologs exclusi v ely predicted by ncOrtho reduced the number of humanspecific families that are confined to humans to 0, and many of the ncOrtho-only miRNA orthologs share the seed sequence with the human miRNA (Figure 2 B). Moreover, the updated version 2.1 of MirGeneDB now includes 16 miRNA families from 8 species that wer e pr edicted by us but were not represented in MirGeneDB v2.0, for example Mir-6715 or Mir-1912 (Supplementary Figure S7). Thus, many ncOrtho-e xclusi v e ortholo gs, w hich were initially considered as false positi v e assignments in our benchmark, r epr esent genuine orthologs that are not r epr esented in MirGeneDB v2.0. The specificity of ncOrtho is ther efor e substantially higher than 0.91. Howe v er, ther e ar e also cases where ncOrtho extends the phylogenetic profile with entries that are not covered by MirGeneDB. For example, Mir-1287 is annotated as human-specific e v en in MirGeneDB v2.1, but ncOrtho identified orthologs with conserved seed sequences in species as distant as the armadillo ( D. novemcinctus; Figure 2 C ). Consulting the 100 vertebrate whole genome alignment ( 35 ) re v ealed that the orthologs identified by ncOrtho reside in a genomic region that is conserved across the vertebrates.
Comparison of ncOrtho with other tools. To the best of our knowledge, only three tools have been developed that allow the identification of miRN A homolo gs in genome assemblies. Of these, we did not consider miRNAminer ( 56 ) because it does not support batch requests for multiple miRNAs and / or species. A second tool, miROrtho ( 28 ), is no longer pub licly availab le. MapMi ( 29 ) detects putati v e miRN A homolo gs in a fixed set of 59 species, of which 18 are also represented in MirGeneDB 2.0. We further compared the performance of ncOrtho to the identification of miRN A ortholo gs using the 100 Vertebrate whole genome alignment (see Supplementary Figure S8 for further details) and using a na ïve BLASTn search. A ppl ying the same benchmark criteria as described above (see Table  1 ), re v ealed that ncOrtho outperforms the alternati v e methods by far (Table 2 ). Only BLASTn was superior in terms of sensitivity, but this comes at the cost of a specificity of only 0.37.

Phylogenetic profile of human miRNA orthologs across 402 vertebrates
Inv estigating the e volutionary trajectory of miRNAs has so far been hindered by a limited taxon sampling in publicly available data repositories of miRN A ortholo gs. In many cases, systematic groups up to the or der le v el are represented by only one or at most a few r epr esenta tives. W hen analyzing missing miRNAs, this makes it hard to differentiate between noise, e.g. due to incomplete data, and a true miRNA loss. To obtain a better resolved picture of miRNA e volution, we e xtended the taxon sampling for the miRNA ortholog search to represent 402 vertebrate genomes. The resulting phylogenetic profiles summarized on the miRNA family le v el are shown in Figure 3 and the gene-le v el resolution is shown in Supplementary Figure S10. The genomic locations and sequences of all ncOrtho orthology assignments are provided in Supplementary Table S2.  The presence of a miRN A ortholo g in a species is indicated by a dot. Dot color: blue -seed sequence is identical to the human miRNA; orangeseed sequence differs. The green inlay indicates the presence of co-orthologs, and the inlay diameter is proportional to the co-ortholog numbers summed over all family members. Cell color indicates the fraction of (co-)orthologs that are supported by MirGeneDB: white -all; yellow -some; red -none. A 20-species alignment of the genomic region harboring the mature Mir-1287 (highlighted in red) is shown in ( C ). Only nucleotides tha t dif fer from the human r efer ence ar e specified. The seed r egion of the miRNA is marked with red lines, and a dot next to a sequence indica tes tha t an ortholog to Mir-1287 was detected in this species where the color coding follows (B). Note, the ortholog assignment uses the full pre-miRNA sequence and not only the section shown in the alignment.  Howe v er, some of these families are also sporadically present in r epr esentati v es of the earlier branching lineages, which makes their evolutionary age harder to assess. We next overlaid the phylogenetic profiles of the miRNA families with information about the conservation of the human seed sequence (see Figure 3 ).
In the evolutionarily older miRNAs, the seed sequences are overall highly conserved. This picture changes substantially for the miRNA families that emerged in the last common ancestor of the higher primates. Here, seed sequence changes are commonly observed, and in many cases a single substitution suffices to explain the differences between the individual primate lineages (Supplementary Figure S11).
Loss of miRNA f amilies . Our results show that the loss of miRNA families is rare. The entire data matrix in Figure  3 has only 12.1% missing data, i.e. a miRN A famil y was not detected in a species that di v ersified after the evolutionary emergence of the family (Supplementary Table S3; see Supplementary Figure S10 for a gene le v el resolution). In most cases, these are sporadic absences of miRNA families in individual species. Without further manual curation, this is best interpreted as noise. Howe v er, a fe w notab le e xceptions exist where the absence of a miRNA family is consistently observed in several species within a systematic group, which results in the characteristic 'windows' in the phylogenetic profiles (see Figure 3 ). We can dif ferentia te two main scenarios. The joint loss of se v eral miRNA families in the rodents is one prominent example for a concerted loss of se v eral miRNA families in the last common ancestor of a monophyletic group of species. The situation is different for the missing miRNA families in birds. Integrating the presence / absence pattern of miRNA families with the evolutionary relationships of the birds indicates multiple and independent losses of the same miRN A famil y on individual evolutionary lineages (Supplementary Figure S12).
Whole genome duplications extend the miRNA repertoire. Lineage-specific duplications of a gene results in two copies that are both co-orthologous to the corresponding gene in a species that branched of prior to the duplication e v ent ( 57 ). The number of co-orthologs is ther efor e corr elated with the number of successi v e lineage-specific duplications. ncOrtho detected miRNAs r epr esented by two or more coorthologs in almost all taxa (Figure 4 ; see Supplementary Figure S10 for the full taxon set). Howe v er, the fraction of miRNAs with co-orthologs varies, in parts, substantially between species. Within the tetrapods and coelacanth (Sarcopterygii), co-orthologs are overall rare with one exception: in the African clawed frog ( Xenopus laevis ) about three quarters of the r epr esented human miRNA genes have two co-orthologs. The ray-finned fish (Actinopterygii) are substantially more di v erse. Co-orthologs are rare in the early branching lineages, except for the Acipenseriformes represented by the sterlet ( Acipenser ruthenus ) and the paddle fish ( Pol y odon spathula ). Throughout the teleosts, the fraction of human miRNAs r epr esented by two or more co-orthologs are up to 15 times higher compared to the tetrapods. On most lineages, human miRNA genes are represented by two co-orthologs, but three or four co-orthologs are common ( > 10% for either class) in both the Salmoniformes and the Cypriniformes. Reconciling the lineagespecific increase of co-ortholog numbers with existing evidence for polyploidization or whole genome duplications (WGD) results in a perfect match (see Figure 4 ). Two primordial diploid frog species likely hybridized giving rise to the allotetraploid X. laevis ( 58 ). Polyploidizations were reported for the Acipenseriformes ( 59 , 60 ), a teleost-specific whole genome duplication was proposed (3R-Hypothesis; 61 -62 ), and one additional round of WGD (4R) occurred in the Salmoniformes and in the Cypriniformes, respecti v ely (63)(64)(65).

Phylogenomics with miRNA genes
The targeted search for miRN A ortholo gs across vertebrate di v ersity has reconstructed the evolutionary trajectory of human miRNAs at an unprecedented scale and resolution. Individual studies based on limited data have explored the use of miRNAs for the re v erse a pproach, w here patterns of sequence change between miRN A ortholo gs should inform about the evolutionary histories of the species they re-side in (e.g. 22 , 61 , 66 ). We next investigated the potential of miRNAs for resolving evolutionary relationships in greater detail. Phylogenetic markers should meet two criteria: they should be rarely lost to warrant taxon-gene matrices with little missing data. Moreover, their within-species diversity should be low over evolutionary time scales to reduce the risk that incomplete lineage sorting (ILS) of ancestral alleles blurs the phylogenetic signal generated by speciation e v ents ( 67 ). Since miRNAs fulfill the first r equir ement (see Figure 3 ) we next investigated their diversity.
Genetic diversity of miRNA genes. We determined the frequency of SNPs in human miRN A genes separatel y for the mature, star and loop regions and compared this to the SNP frequency in flanking regions of the miRNA, in proteincoding sequences, and in long non-coding RNAs. This revealed a significantly lower frequency in the miRNA regions compared to the other regions in the human genome ( ttest, P -value < 10 −8 ; Figure 5 A), with the mature miRNA having the lowest di v ersity among all. Moreover, the represented variants have a lower minor allele frequency (Figure 5 B). Both observations are in line with constraints on the evolutionary change of miRNAs that are imposed by miRNA secondary structure formation and by their function. The sequence of the mature miRN A m ust remain conserved since it mediates the pairing to the target mRNA ( 8 ). This constraint extends to the star sequence as it must form a stem-loop during pre-miRNA hairpin formation although individual mismatches are admissible ( 34 ). A reduced di v ersity compared to other genes is ther efor e a feature that most likely applies to miRNAs in general, irrespecti v e of the species they reside in.
Ver tebr ate tr ee of life r econstructed with pr e-miRNA sequences. We have shown that miRNAs ar e rar ely lost, and that their genetic di v ersity is lowest among all investigated genomic regions. The orthology-assignments of ncOrtho for 556 human miRNAs across 402 vertebrate species comprises ther efor e an unpr ecedented data basis for miRNAbased phylogenomics. Figure 6 shows the maximum likelihood tree that is based on a supermatrix compiled from this da ta. Petr om yz on marinus was not considered in this analysis because < 20% of human miRNAs were represented by an ortholog in this species. Additionally, the tarsier ( Carlito syrichta) was excluded because this species could not be stably placed in the tree (see Supplementary Figure S14). The r esulting tr ee topology r eproduces the accepted branching patterns of all major vertebrate groups (Figure 6 A). This underlines that the phylogenetic signal in concatenated miRNA genes is sufficient to resolve deep splits in the vertebrate phylogeny accurately and unambiguously. The mammalian subtree is shown in Figures 6 B. Deep splits in this tree are well resolved, and monophyletic Xenarthra and Afrotheria are placed as the earlies branching lineage within the Eutheria (ML BS = 100). While this resembles the Atlantogenata hypothesis (see also 68 ), one of the two competing hypotheses (Xenarthr a br anched off first) cannot be rejected with this data (p-AU = 0.06; 69 ). Within the Eutheria, the tree is well resolved on the order le v el, and here in particular for clades in which ILS is known or at least suspected to interfere with the species tree re-

DISCUSSION
The accurate detection of orthologs across lar ge tax on collections is one cornerstone of comparati v e genomics. It f orms the f oundation f or tracing genes , their functions , and the corresponding functional networks across species and through time ( 76 ). ncOrtho allows to extend such analyses routinely to miRNA genes by using covariance models (CMs) in the targeted search for orthologs. Se v eral methods attempted to extend such analyses to miRNA genes ( 28 , 29 , 56 ). Common to all approaches is the use of an unidirectional sequence similarity-based search for identifying miRN A homolo g candidates, followed by a candidate curation using secondary structure-and sequence conservation filters. These tools are either no longer available or suffer from a high false-positi v e and false-negati v e rate (see Table 2 ). ncOrtho is the first tool to identify miRNA orthologs using a reciprocal sequence similarity search. The secondary structure constraints and primary sequence consensus of a miRN A famil y are captured in the CMs that are used in the ortholog search ( 52 ). The performance of these models essentially depends on the quality of the training data, and ncOrtho addresses this issue by exploiting the conservation of gene order to identify positional orthologs ( 77 ) that are then used for model training. While this warrants a highly reliable training data, it also implies that, in PAGE 11 OF 14 Nucleic Acids Research, 2023, Vol. 51, No. 13 e71 principle, whole genome alignments may be directly used for the identification of miRN A ortholo gs ( 30 ). Howe v er, this hinges on the condition that the genomic position of a miRNA genes does not change. Additionally, the phylogenetic distances between species under study must not be too large, because otherwise sequences become too di v erged for aligning anything but protein coding exons ( 78 ). Eventually, the generation of whole genome alignments is computationall y highl y demanding, w hich hinders the inclusion of novel species and ther efor e leaves the data basis of such a pproaches considerabl y rigid. More specificall y, for large collections of assemblies that are phylo geneticall y as di v erse as the vertebrates, whole-genome alignments are not feasible ( 32 ). ncOrtho, in turn is a highly fle xib le frame wor k for the accura te identifica tion of miRN A ortholo gs across extensi v e custom taxon sets, which include species as di v erse as humans and sharks who last shared a common ancestor about 600 million years ago ( 79 ).
Orthology assignment is an evolutionary reconstruction problem, and as such a ground truth does not exist. Ther efor e, the benchmarking of orthology assignments for protein-coding genes relies on a standardized framework that allows to assess and compare the performance of individual ortholog search tools ( 80 ). Since there is no such frame wor k for miRNA genes yet, we have used the miRNA families that ar e r epr esented in the manually curated Mir-GeneDB as a bona fide gold standard. Due to its stringent annota tion criteria, this da tabase was excellent for assessing the sensitivity of ncOrtho. The attempt to benchmark the specificity of the orthology assignments indicated, howe v er, that MirGeneDB is likely not fully comprehensi v e and lacks an unknown but probably non-negligible number of miRN A ortholo gs. ( 23 ). Obviously, this does not rule out that ncOrtho may also identify miRN A ortholo gs that are no longer functional or makes individual spurious orthology assignments. Such cases may be characterized by multiple independent variations as for example seen in Mir-1287 (Figur e 2 C). Her e, a thorough integration of miRNA orthology assignments based on ncOrtho paired with a subsequent curation via targeted search for these orthologs in RNAseq data would allow to trace miRNAs across species with the highest confidence. This combined approach allows to direct the miRNA search in transcriptomes to predicted, yet missing miRNAs. This would increase the probability to also detect those miRNAs that are only lowly expressed or that are expressed only under certain conditions. ncOrtho performs a targeted ortholog search resulting in orthology assignments for pairs of species. This has several advantages. The ortholog search can use any reference sequence to start the ortholog search. This makes it independent from pre-compiled catalogs of miRNA genes and allows the tracing of both novel mirRNA genes but also the analysis of miRNAs specific to taxa that are not represented in the public miRNA databases. In the same context, ncOrtho uses a reciprocal hit criterion for the ortholog identification, instead of relying on pre-computed bit-score thresholds ( 81 ). This is particularl y important, w hen the training data for computing the bit score thresholds does not cover the full diversity of the miRNAs. The linear scaling in time and CPU usage facilitates an ortholog search across taxon set sizes that are too r esour ce-demanding for more complex search algorithms ( 82 ). Lastly, ncOrtho can identify orthologs in target species independent of any apriori gene annotation. This aspect is particularly important because the annotation of miRNA genes thus far depends on the availability of deep transcriptomic sequencing of small transcripts ( 21 ). Consequently, there are substantially varying le v els of miRNA annotation quality between species as a result of a dataset-availability bias ( 18 ).
While the annotation status of the target genomes does not impact the performance of ncOrtho, the assembly quality does. The number of false negati v e orthology assignments of genome-based searches is bound to increase when using low-coverage genome assemblies ( 83 ). In line with this hypothesis, se v eral e volutionaril y old miRN A families show a patchy presence-absence pattern of bird orthologs w hich could onl y be explained by the same miRNA being lost multiple times independently (see Figure 3 and Supplementary Figure S12). Among all analyzed vertebrates, this is a unique observation, which hints towards issues with the assembly qualities for these species. However, with the increasing number of chromosomally complete reference assemblies (84)(85)(86), the issue of genome completeness can be expected to play only a subordinate role in the future.
The phylogenetic profiles of 556 human miRNAs r epr esenting 266 miRNA families across 402 vertebrate species r epr esent the highest resolving analysis of miRNA evolution to date. The resulting presence / absence patterns are consistent with previous studies that describe a 'burst-wise' acquisition of novel miRNA families in the Eutheria and Amniota ( 13 , 87 , 88 ). Here, we could pinpoint another surge of miRNA innovation to the higher primates (Simiiformes), which has been previously ascribed to either the primates or the old-world monkeys ( 17 , 20 ). This dif ferentia tion is important for reconstructing the genetic basis of primate diversification. But it is also essential for applications that, for example, determine the organismal origin of small RNAseq samples ( 89 ). Next to lineage-specific gains of miRNA families, ncOrtho allows to trace likely changes in the regulatory network of miRNAs due to either miRNA loss or seed change. In the context of protein-coding genes, concerted lineage-specific loss is often interpreted as an indicator for a functional integration of the affected genes ( 26 , 90-91 ). Seed-pairing is the most important determinant for the targeting specificity of miRNAs ( 8 ). Notably, we find evidence for pronounced lineage-specific changes in the seed sequence, pr efer entiall y for evolutionaril y young miRN As that emerged in the last common ancestor of the higher primates. It is tempting to speculate that these seed changes contribute to the re-wiring of still fle xib le regulatory networks by changing the spectrum of target genes. We can now identify such e v ents with a resolution that is unprecedented for miRN As, w hich lays the foundation for investigating their relevance for re-wiring the regulatory network of miRNAs in the future.
Next to the functional implications in the context of gene regulation, changes in miRNAs allow to trace also evolutionary e v ents on a genome-wide le v el. miRNAs hav e recently been used to investigate whole genome duplication (WGD) e v ents in tr anscriptomic data from compar ati v ely small taxon sets ( 92 , 93 ). Here, we have shown that co-orthologs detected by ncOrtho faithfully trace whole genome duplications across di v erse collections of genome assemblies. This makes it possible to ra pidl y scan the plethora of vertebrate genomes that will emerge from the ongoing sequencing initiati v es for indications of whole genome duplications ( 84 , 85 ). miRNA sequences have been previously proposed as good phylogenetic markers ( 19 , 20 ), and miRNAs have been used in individual and small-scale phylogenomic studies to shed light on the evolution of individual vertebrate lineages (e.g. 68 , 94 ). Here, we could show that miRNAs fulfill two main r equir ements for phylogenetic markers: They ar e very rarely lost, and they display among all investigated regions in the human genome the lowest genetic di v ersity. The latter finding complements pr evious r esults tha t a ttested miR-NAs a low di v ersity in humans, mice and pigs, but did not put these numbers in relation to other regions in the human genome (95)(96)(97)(98). With the help of ncOrtho, the compilation of miRN A-based phylo genomic datasets is now straightforward, which allows to establish miRNAs as an alternative marker to protein-coding genes. We could show that the phylogenetic signal in miRNAs suffices to resolve even deep splits in the vertebrate tree unambiguously and with high confidence. For two recently di v erged clades whose resolution was hindered by incomplete lineage sorting, we found unequivocal support for one topology. This lends support to the view that miRNAs are indeed suitable for overcoming the effect of ILS. Despite these encouraging r esults, tr ees computed from miRNA alignments also need to be treated with a grain of salt. In line with a pr evious r eport ( 68 ), our miRNA-based phylogeny of the mammals agrees with the Atlantogenatha hypothesis. Howe v er, a topology test reveals that the alternati v e Xenarthra hypothesis does not explain the data significantly worse. This can indicate either a limited phylogenetic signal in the data, or alternati v ely that hybridization blurred the phylogenetic signal at the onset of eutherian di v ersification (99)(100)(101). In the light of our results, incomplete lineage sorting becomes a less likely explanation.
In summary, miRNAs are essential regulators of gene expression and their tracing across a comprehensi v e taxon collection is bound to shed light on the emergence and evolution of the underlying regulatory networks. At the same time, miRN As are highl y informati v e with respect to the evolution of the species they reside in. With the help of ncOrtho the potential of miRNAs for addressing both questions can now be tapped on a comprehensi v e scale independent of pre-compiled databases, and thus throughout the eukaryotic tree of life.

DA T A A V AILABILITY
All data and software used in this study is open source. The ncOrtho algorithm is available on GitHub ( https:// github.com/BIONF/ncortho ) or FigShare ( https://doi.org/ 10.6084/m9.figshare.21679568.v1 ). The human miRNA orthologs are available in the Supplementary Data where we also supply a list of all genome assemblies which were used for the ortholog search.