Protein-to-genome alignment with miniprot

Abstract Motivation Protein-to-genome alignment is critical to annotating genes in non-model organisms. While there are a few tools for this purpose, all of them were developed over 10 years ago and did not incorporate the latest advances in alignment algorithms. They are inefficient and could not keep up with the rapid production of new genomes and quickly growing protein databases. Results Here, we describe miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as k-mer sketch and vectorized dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data. Availability and implementation https://github.com/lh3/miniport.


INTRODUCTION
Sequencing technologies have been rapidly evolving in recent years.The advent of long-read sequencing, especially accurate longread sequencing (Wenger et al., 2019), have enabled high-quality genome assembly at scale (Nurk et al., 2020;Cheng et al., 2021Cheng et al., , 2022)).After we sequence and assemble the genome of a new species, the immediate next step is to annotate genes.
There are three ways to annotate gene structures: ab initio gene prediction, aligning RNA-seq data from the same species and mapping known genes with cross-species alignment.While ab initio gene prediction works well for bacterial genomes, it is error-prone for Eukaryotic genomes that may contain large introns.In a recent benchmark (Scalzitti et al., 2020), all the evaluated gene finders miss ∼50% nucleotides in annotated exons and predict ∼50% extra sequences not in exons.If we have RNA-seq data, we can map short or long RNA-seq reads (Dobin et al., 2013;Li, 2018) and reconstruct transcripts from the alignment (Kovaka et al., 2019).This will give much more accurate gene structures than ab initio gene prediction.Unfortunately, RNA sequencing adds extra cost and may miss genes lowly expressed in the tissues being sequenced.We still rely on cross-species alignment to derive a complete gene set and to transfer known functional annotations to the new genome.
For very closely related genomes, we can reconstruct gene structures from whole-genome alignment (Fiddes et al., 2018) or from the alignment of gene regions (Shumate and Salzberg, 2020).These methods would not work well for genomes at longer evolutionary distances because intron sequences are less conserved and this will affect the quality of the alignment.Aligning the more conserved coding regions (Li et al., 2007;Gotoh, 2008) may alleviate the issue.However, for distantly related species, even coding nucleotide sequences are not conserved well.Just as we almost exclusively use protein sequences to reconstruct the phylogeny of distant homologs, Ensembl (Aken et al., 2016) and other gene annotation pipelines (Haas et al., 2008;Holt and Yandell, 2011;Brůna et al., 2021) also heavily rely on protein-to-genome alignment especially when the annotation of closely related species is not available.
There are several protein-to-genome aligners that pinpoint exact splice sites: GeneWise (Birney and Durbin, 1997;Birney et al., 2004), Exonerate (Slater and Birney, 2005), GeneSeqer (Usuka and Brendel, 2000), GenomeThreader (Gremme et al., 2005), genBlastG (She et al., 2011), ProSplign (Kapustin et al., 2008) and Spaln2 (Gotoh, 2008;Iwata and Gotoh, 2012).They all take a protein and a nucleotide sequence as input and output spliced alignment between them.GeMoMa (Keilwagen et al., 2019) additionally requires gene structures in the source genome as input.It aligns exons without splicing and then connects exon alignments.As gene structures are conserved across species, this strategy simplifies alignment and potentially reduces spurious hits but it would not easily work for proteins from a variety of species.
Among the tools above, only Spaln2, GenomeThreader and GeMoMa are practical for whole-genome alignment.They can align several hundred proteins per CPU hour and may take a couple of days to align a few hundred thousand proteins often needed to annotate a genome without closely homology.Protein-to-genome alignment is time consuming.
It is challenging to develop a fast and accurate protein-to-genome alignment algorithm.The core of such alignment is a dynamic programming (DP) that jointly considers affine gap penalties, introns and frameshift.It is perhaps the most complex DP for pairwise alignment.In addition, as we will show later, a successful aligner functions like a gene finder and has to properly model splice signals, which is not a trivial task, either.On top of these, we need to fit these complex methods to an efficient implementation with modern computing techniques.This is partly why we have over a hundred short-read mappers (Alser et al., 2021) but only three protein-to-genome mappers capable of whole-genome alignment.
In this article, we will describe miniprot, a new protein-togenome aligner developed from scratch.We will demonstrate its performance and accuracy on real data along with the few existing algorithms.

METHODS
Miniprot broadly follows the seed-chain-extend strategy used by minimap2 (Li, 2018).It indexes the genome k-mers in all six open reading frames (ORFs) on both strands.During alignment, miniprot extracts k-mers on a query protein, finds seed anchors and then performs chaining.It closes unaligned regions between anchors and extends from terminal anchors with dynamic programming (DP).

Notations of strings
For a string T , let |T | be its length and T [i], i = 1, . . ., |T |, be the i-th symbol in T .T [i, j], 1 ≤ i ≤ j ≤ |T |, is the substring starting at i and ending at j inclusively.In this article, T denotes the genome sequence over the nucleotide alphabet and P denotes the protein sequence over the amino acid alphabet.

Reduced alphabet
There are twenty amino acids.We need at least five bits to encode each amino acid.To encode protein sequences more compactly, we reduce the Li amino acid alphabet using the SE-B(14) scheme by Edgar (2004), except that we merge N and D.More exactly, we map amino acid groups to integers as follows: A→0, ST→1, RK→2, H→3, ND→4, EQ→5, C→6, P→7, G→8, IV→10, LM→11, FY→12, W→13, * →14 and X→15, where * denotes the stop codon and X denotes an amino acid.
Under this encoding, if two amino acid groups only differ at the lowest bit (e.g. group 'A' and 'ST'), the two groups tend to be similar.We may flip the lowest bit of an integer to generate more seeds and thus to increase the seeding sensitivity.We did not use this strategy as miniprot seems reasonably sensitive on real data.

Indexing the genome
Miniprot only indexes a subset of k-mers in the genome.Suppose φ(a) maps an amino acid a to a 4-bit integer with the scheme described above.The integer encoding of a k-long protein sequence P can be recursively defined as φ(P ) = φ(P [1, k − 1]) × 16 + φ(P [k]).φ(P ) has 4k bits.Let B = ψ(φ(P )) where ψ(•) is an invertible integer hash function (Li, 2016) over [0, 2 4k ).Then B is also an integer with 4k bits.By default, miniprot only indexes B if the lowest bit of B is 0. We thus sample 50% of k-mers in average with a high-quality hash function ψ(•).
Internally, miniprot treats each genome sequence and its reverse complement as two independent sequences.It enumerates all ORFs of 30 amino acids or longer and samples 6-mers from translated ORFs with the strategy above.For each selected k-mer R at position x, miniprot stores (ψ(φ(R)), ⌊x/256⌋) in a hash table with the key being ψ(φ(R)) and the value being an array of positions.We do not retain the base resolution at the indexing step such that we can use 32-bit integers to store positions for a genome up to 2 39 (= 2 32 • 256/2) base pairs in size.Without binning, miniprot would have to use 64-bit integers to store positions in a human genome, which would double the index size.

Chaining
The miniprot chaining algorithm is similar to the minimap2 algorithm.However, because the miniprot index does not keep the exact genome positions, the gap size calculation needs to be modified.For completeness, we will describe the full chaining equation here.
Let 2-tuple (x, y) denote a seed match, also known as an anchor, between binned position x on the genome and residue position y on the protein.Suppose (x i , y i ) and (x j , y j ) are two anchors with x i ≤ x j and y i < y j .The minimum possible gap size between the two anchors, in the unit of base pair, can be calculated by with ∆x = x j − x i and ∆y = y j − y i .When g(i, j) = 0, we do not know if there is a gap due to binning.Meanwhile, g(i, j) > 0 indicates a definitive insertion to the genome and g(i, j) < 0 indicates a definitive deletion.Given a list of anchors sorted by genomic position x, let f (j) be the maximal chaining score up to the j-th anchor in the list.f (j) can be calculated with where k is the k-mer length (6 amino acids by default), g(i, j) is calculated by Eq. ( 1) and α(i, j) = min{y j − y i , k} is the number of matching residues between the anchors.The gap penalty function γ(•) is Here G is the maximum intron size (200 kb by default) and β is the weight of the logarithm gap penalty (0.75 by default).The logarithm term allows miniprot to join exons over introns.
After the initial round of chaining for each protein, miniprot selects the top 30 chains and performs another round of chaining in local regions around these top chains.In the second round, miniprot indexes all 5-mers on both the protein and the genome subsequences without binning.This finds better chains and retains the base resolution of each anchor.Miniprot uses g ′ (i, j) = 3∆y − ∆x to compute gap lengths and applies the same gap penalty Eq. (3) during chaining.

Residue alignment with dynamic programming
Miniprot uses DP to close gaps between anchors in chains and to extend from terminal anchors.The DP aims to find gaps, frameshift and splicing at the same time as is demonstrated as follows ("Geno" for the genome sequence, "Tran" for the translated protein sequence in the alignment and "Prot" for the query protein sequence): In this example, symbol "$" denotes frameshift substitutions and "+" denotes frameshift insertions.We will explain their differences later.In this section, we will first review the AE86 DP formulation for affine gap cost (Altschul and Erickson, 1986), and then derive the DP equation for protein-to-genome alignment.

DP with affine gap cost
Under the affine gap cost, a gap of length g costs q+e•g.A direct formulation of the DP looks like where 'M ' represents the matching state, 'I' the insertion state, 'D' the deletion state and s(i, j) gives the score between the residue at position i on the target sequence and the residue at position j on the query.If we define Eq. ( 5) is the AE86 formulation.It invokes fewer comparisons.When there are more states, AE86 may save more comparisons and simplify the DP equation.

DP for protein-to-DNA alignment
In a similar manner, we can derive the DP for protein-to-DNA alignment, allowing frameshifts but not splicing: It is similar to Eq. ( 5) except for codon phase transitions with a penalty of f .We have two types of frameshift.The first type is created by inserting one or two bases to the DNA sequence (symbol '+' in the example above) and the second type by deleting one or two bases in a codon ('$' in the example).These are modeled by the four H•,• terms on the last line of Eq. ( 6).This equation is broadly similar to Zhang et al. (1997).

DP for protein-to-genome alignment
When aligning proteins to genomes, we need to keep phases through introns.We add three additional states, A, B and C, for phase-0, phase-1 and phase-2 introns, respectively.Our final formulation is where r is cost of an intron, and d(•) and a(•) model splice signals.The great majority of introns start with GT and end with AG across all species.
For a simple model, we may define: This still allows non-GT-AG splicing but penalizes such introns by cost p.We will describe a more sophisticated model in the next section.
It is worth noting that when the DP transitions from state H to B at position i, the phase-1 intron B starts at i + 1; when the DP transitions from B to H at j, the intron ends at j − 2. The DP ignores the split codon bridging the two exons around the phase-1 intron.Phase-2 intron state C is treated similarly.Not scoring split codons is a weakness of our equation.
Though not explicitly derived from a Hidden Markov Model (HMM), Eq. ( 7) is similar to the Viterbi decoding of the 6-state HMM employed by GeneWise (Birney et al., 2004) and Exonerate (Slater and Birney, 2005).To that end, our formulation should have comparable accuracy to the two older aligners if they are parameterized the same way.
We implemented Eq.( 7) with striped DP (Farrar, 2007).We used 16bit integers to keep scores and achieved 8-way parallelization for x86 64 CPUs with SSE2 or ARM64 CPUs with the NEON instruction set.Our implementation is over 50 times times faster than GeneWise and Exonerate in their exact mode.

Splice models
We observed that under distant homology, the splice model may have a large influence on the junction accuracy, confirming Iwata and Gotoh (2012).
The most common splice pattern in all species is GT-AG with GT at the donor site (5'-end of an intron) and AG at the acceptor site (3'-end of an intron).We occasionally see GC-AG and AT-AC at ∼1% frequency in total (Sheth et al., 2006).Among the GT-AG class, we more often observe GTR-YAG from yeasts to mammals (Irimia and Roy, 2008), where R denotes A or G and Y denotes C or T.
The default miniprot splice model considers the signals above.Using human data from Sibley et al. (2016), we estimated that 99.81% of acceptor sites are AG and only 0.10% are AC.In the BLOSUM scaling (Henikoff and Henikoff, 1992), an AC acceptor site would be penalized by 2 log 2 99.81/0.10≈ 20.We can adapt this approach for three bases at either the donor or the acceptor sites.In our final model, In mammals and even Drosophila, the last exon base adjacent to a donor site is more often a G and we often see a poly-pyrimidine (i.e.C or T) sequence close to an acceptor site.Our human splice model considers these signals.It is also applicable to species with the sequence features above, including Drosophila.
Exonerate uses a position-specific weight matrix over ∼10 positions to model splice sites.Spaln2 additionally considers branching sites and provides pre-trained models for a variety of species.Miniprot adopts a relatively simple model with fewer parameters.This makes the model more general but may affect the accuracy of alignment.We are considering a second pass with a splice model trained from the first pass.This strategy is often used in mainstream gene finders (Brůna et al., 2021).

Avoiding pseudogenes
If a spliced gene has an unspliced pseudogene, the unspliced pseudogene may get a better DP score because the alignment to the pseudogene does not pay intron penalties.To reduce the effect of pseudogenes, miniprot recalculates a DP score between the query protein and the translated coding region without introns.In addition, miniprot further penalizes single-exon alignment by intron open score r in Eq.( 7) in case a pseudogene is aligned better by chance.

Evaluation datasets
To evaluate the accuracy of miniprot, we collected the proteincoding gene annotations of various species: human (Homo sapiens) from Gencode v41, mouse (Mus musculus) from Gencode M30, zebrafish (Danio rerio) and fruit fly (Drosophila melanogaster) from Ensembl v107 and mosquito (Anopheles gambiae) from Ensembl metazoan v54.We selected the longest protein for each gene to reduce redundant sequences.We mapped zebrafish and mouse proteins to the primary assembly of the human reference genome GRCh38 and mapped mosquito proteins to the Drosophila BDGP6 genome.

Evaluated tools
To evaluate what aligners can map proteins to a whole genome, we randomly sampled 1% of zebrafish proteins and mapped with various aligners.Only miniprot-0.7,Spaln2-2.4.13c (Iwata and Gotoh, 2012), GeMoMa-1.9(Keilwagen et al., 2019) GenomeThreader-1.7.3 (Gremme et al., 2005) could finish the alignment in an hour.GenomeThreader found less than 30% of coding regions in Spaln2 or miniprot alignment.It is not sensitive enough for the human-fish divergence and thus not evaluated on the full dataset.We also evaluated MetaEuk-r6 (Levy Karin et al., 2020).Although MetaEuk does not find exact splice sites, it may be still useful for locating coding regions (Manni et al., 2021).
When running Spaln2, we applied option "-Q7 -T# -yS -LS -yB -yZ -yX2" where "#" specifies the species-specific splice model.Option "-LS" enables local alignment and yields sligtly better alignment overall.Option "-yB -yZ -yX2" apparently has no effect for human-zebrafish alignment but it greatly improves the junction accuracy of the fly-mosquito alignment.We let Spaln2 choose the maximum intron and gene size automatically.Miniprot finds introns up to 200 kbp in length by default.We changed this value to 50 kbp for fly-mosquito alignment.
We ran GeMoMa with MMseqs2 (Steinegger and Söding, 2017) as the underlying engine.We evaluated the best unfiltered alignment of each protein as GeMoMa discarded most alignments in the final Protein-to-genome alignments are compared to the annotated genes in "Genome species"."# multi-exon" gives the number of proteins mapped with multiple exons.A splice junction (junc.) is confirmed if it is annotated in "Genome species" with exact boundaries; is non-overlapping (non-ovlp.)if the intron in the junction is not overlapping with annotated introns."% confirmed junc." is the percentage of predicted junctions that are confirmed.Base sensitivity (base SN) is the fraction of annotated coding regions on the longest transcripts that are covered by alignments.Base specificity (base SP) is the fraction of genomic bases in alignments that are covered by annotated coding regions.
output.We tried to specify the maximum intron length to 200 kb but GeMoMa took more than 320 GB memory and was killed on our cluster.We thus used 50 kb for all alignment.GeMoMa crashed for the human-mouse dataset at the splice alignment step.
In principle, we could localize a protein with a whole-genome mapper above and then run GeneWise, GeneSeqer and Exonerate in local regions.However, this would not evaluate mapping accuracy.In addition, Iwata and Gotoh (2012) have already shown Spaln2 outperformed these older tools.We thus ignored them in evaluation.

Evaluating protein-to-genome alignment
We aligned zebrafish proteins to GRCh38 using all tools (Table 1).With human-specific splice models, miniprot is slightly more accurate than Spaln2 on most metrics.Nonetheless, for proteins mapped by both miniprot and Spaln2, Spaln2 could find more correct junctions.Looking at proteins Spaln2 aligned better, we observed Spaln2 is more sensitive to small introns and small exons, while miniprot tends to merge them to adjacent alignments.We speculate this may be caused by two factors.First, Spaln2 uses a more sophisticated splice model and may be putting more weight on splice signals than residue alignment.It may create an intron even if the alignment is weak.Second, the Spaln2 developers observed that heuristics may be doing better than strict DP around short introns or exons.In one case, Spaln2 correctly created an exon with one amino acid.Miniprot under the current setting would not produce such an alignment.On the other hand, while Spaln2 found more correct junctions for proteins mapped by both miniprot and Spaln2, it also produced more false junctions related to small exons and introns.It is not clear to us what is the best balance point.
GeMoMa is more sensitive than both miniprot and Spaln2, finding more junctions and more annotated coding regions.It however has lower junction accuracy.We could tune miniprot for increased sensitivity but we decided to keep the current behavior as the additional alignments are less accurate.MetaEuk was not designed to find exact splice junctions, as is expected.
For the human-mouse alignment, miniprot is again slightly better than Spaln2.GeMoMa crashed.On the more challenging flymosquito dataset, Spaln2 has higher junction sensitivity and higher base specificity than miniprot.GeMoMa continues to have the highest sensitivity but lower junction accuracy and base specificitiy.
Miniprot is over 30 times faster than Spaln2 and GeMoMa.The performance gap between miniprot and Spaln2 increases with divergence.This is potentially because Spaln2 has to invoke DP through introns more often when it does not see overlapping high-scoring segment pairs (HSPs) and cannot initiate "sandwich DP" (Wu and Watanabe, 2005) to skip introns.With a much faster DP implementation, miniprot can afford to align through all introns regardless of sequence divergence.It thus has more stable performance.Always aligning through introns might be a contributing factor to the higher specificity of miniprot even though Spaln2 has a more careful algorithm.

DISCUSSIONS
Miniprot is a fast protein-to-genome aligner comparable to existing tools in accuracy.Its primary use case is to assist gene annotation.At present, the Ensembl pipeline (Aken et al., 2016) still relies on GeneWise (Birney et al., 2004) and Exonerate (Slater and Birney, 2005).MAKER2 (Cantarel et al., 2008;Holt and Yandell, 2011) calls Exonerate.BRAKER2 (Brůna et al., 2021) integrates Spaln2 (Iwata and Got 2012) and depends on ProtHint (Brůna et al., 2020) which also optionally invokes Spaln2.As older protein-to-genome aligners are relatively inefficient, researchers often use faster approximate methods to localize proteins and then apply these aligners.Now with miniprot, we can perform approximate mapping and exact splice alignment in one go and thus simplify existing pipelines.In addition, when there are closely related species, miniprot could find 90% coding regions in minutes (see "base SN" on the human-mouse dataset in Table 1).It could also be useful for evaluating de novo assemblies (Manni et al., 2021).
Miniprot would not replace full-pledge gene annotation pipelines such as BRAKER2 (Brůna et al., 2021).Miniprot aligns each protein independently.When multiple proteins are mapped to the same locus, miniprot is unable to merge identical gene models or resolve conflicts between alignments.In addition, although miniprot has a realistic splice model, it is not as sophisticated as the BRAKER2 model and is not trained on the target genome.More importantly, BRAKER2 has an ab initio gene prediction component and may find genes with weak homology to the input proteins.We are considering to improve our splice model and to develop a separate tool to reconcile overlapping gene models in simple cases.This may provide a convenient annotation pipeline when closely related species are available.
We are evaluating the possibility to support HMMER profiles (Eddy, 2011) as queries.As a HMMER profile summarizes a gene family from multiple species, it may reduce the number of queries and improve the sensitivity of miniprot for distant homologs.There are two algorithmic challenges: seeding and alignment.For seeding, we could generate seeds from the most probable protein or sample multiple seeds directly from the profile; for alignment, we could introduce position-specific substitution cost and gap cost.Nonetheless, the exact solution to these challenges and how much HMMER profiles may improve the alignment remain unknown.
The Vertebrate Genome Project (Rhie et al., 2021), the Darwin Tree of Life project, the Earth Biogenome Project (Lewin et al., 2018) and many other sequencing efforts are going to sequence hundreds of thousands of species to the reference quality in coming years.The annotation of these genomes is as important as the assembly.While we have seen rapid evolution of sequencing technologies and assembly algorithms in recent years, we still heavily rely on core annotation tools developed more than a decade ago.Miniprot is one effort to replace the protein-togenome alignment step with modern techniques.We look forward to renewed development of other core annotation tools from the community.

Table 1 .
Evaluating protein-to-genome alignment