Improved annotation of protein-coding genes boundaries in metazoan mitochondrial genomes

Abstract With the rapid increase of sequenced metazoan mitochondrial genomes, a detailed manual annotation is becoming more and more infeasible. While it is easy to identify the approximate location of protein-coding genes within mitogenomes, the peculiar processing of mitochondrial transcripts, however, makes the determination of precise gene boundaries a surprisingly difficult problem. We have analyzed the properties of annotated start and stop codon positions in detail, and use the inferred patterns to devise a new method for predicting gene boundaries in de novo annotations. Our method benefits from empirically observed prevalances of start/stop codons and gene lengths, and considers the dependence of these features on variations of genetic codes. Albeit not being perfect, our new approach yields a drastic improvement in the accuracy of gene boundaries and upgrades the mitochondrial genome annotation server MITOS to an even more sophisticated tool for fully automatic annotation of metazoan mitochondrial genomes.


Computation of δ and relative start and end positions
In the following, the computation of the factor δ is explained in more detail. Let us first consider the computation of the relative start position r s (p), which gives the position on the query model that corresponds to position p on the (target) sequence. r s (p) = q s + qe−qs se−ss (p − s s ) is computed with respect to a hit (computed with HMMER) with a query model on a sequence. This hit has four coordinates, i.e. the start and end positions in the sequence (s s and s e ) and the start and end position in the query model (q s and q e ), see Figure S1.
Consider first the fraction qe−qs se−ss . It gives the relation of the length of the query and the length of the sequence that are covered by the hit. This value can be interpreted as how much the hit is stretched or compressed in the sequence with respect to the query (compressed for values > 1 and stretched for values < 1). Next, p − s s gives the offset between p and the start of the hit in the sequence. The multiplication of qe−qs se−ss and p − s s translates this offset to query positions, i.e., to the offset between q s and the sought query position. Hence, adding q s gives the sought value.
While r s (p) gives the position on the model that corresponds to p in the sequence, i.e. the offset to the start of the model, r e (p) gives the offset to the end of the model. Thus, for the computation of r e (p) l q − q s is added instead of q s .
Supplementary Figure S1: Illustration of the positions in the query and sequence used for the computation of r s (p) For the computation of δ (δ s (p) = 1− rs(p) lq and δ e (p) = 1− re(p) lq ), the relative start and end positions, respectively, are normalized with the length of the query model l q which gives a value between 0 and 1 and measures the distance to the model start and end. One minus this normalised relative positions gives the final δ value.
The rationale behind δ is to penalize positions in the sequence that are covered by the hit that are further away from the query model ends, i.e. further within the query model. δ does not distinguish between position that are not covered by the hit because in these cases δ is set to 1. This is also motivated by the observation by (1) that the hits of the model are mostly shorter than the actual sequence, i.e., positions supported by the hit should belong to the PCG.

Case study -Bank vole
The mitogenome of the bank vole Myodes glareolus (Chordata: Mammalia) (NCBI accessions NC 024538 and KF918859) (2) has been investigated using RNA-Seq data to analyze polyadenylation of the 3' ends of PCGs (3). This allows for a precise annotation of the stop codons. See also Supplemental Text 1.3. Thus, we selected this example to analyze the performance of the novel annotation method for gene boundaries in detail.
Only for three of the 12 PCGs the gene boundaries predicted by the method currently implemented in MITOS are in agreement with RefSeq. Furthermore, for three genes (nad5, nad4, and cob) the annotation of start and stop codon positions by MITOS differs considerably (by more than 18 bp) to the reference annotation. In contrast, the predictions based on our new method are equal to the reference annotation in almost all cases. For three genes stop codon positions differ by only one or two base pairs, because RefSeq includes a full codon in the annotation even if the stop codon is incomplete. The prediction of start codon positions of nad1 and nad5 differ by 3 and 9 bp, respectively. Here, the annotation in MitoAnnotator is in agreement with the RefSeq annotation, which thereby avoid overlaps with tRNAs in both genes. But note that overlaps of PCGs with tRNAs are a common feature in mitogenomes according to RefSeq (see Supplementary Figure S11).
To assist the user in the critical evaluation of the proposed annotation, the prediction values of the novel method can be plotted and visually inspected (see Supplementary Figures S4 and S5 for plots for the nad5 gene of M. glareolus). For example, the start codon position of the nad5 RefSeq annotation (base pair 11 720) differs from the prediction by our new method (base pair 11 711) (see Supplementary Figure S4). However, in both cases, the respective codons are equal (ATA). Both codons are located upstream of the initial prediction by HMMER which is located at position 11 840, which can be seen by δ values of 1, i.e. δ does not help to distinguish these two possible start sites candidates. Furthermore, the plot shows that there are many possible start and stop sites from which the method needs to choose. The predicted stop codon position corresponds to the RefSeq annotation (base pair 13 532) which corresponds to the full stop codon TAA, which has a very high empirical probability (φ).
The final selection of the start and stop codon position is strongly influenced by the values of λ (see left panel of Supplementary Figure S5). For instance, all combinations that include the putative start position 12 656, which has the largest value for φ, have λ = 0 (which is therefore not shown in Supplementary Figure S5). Despite the small difference of 9 bp, the resulting p-values can differentiate between alternatives (λ = 0.59 for positions 11711 − 13531 vs. λ = 0.25 for positions 11720 − 13531). The combination of the values for φ, δ, and λ leads to a preference for the starting position 11 711 over the RefSeq annotation at position 11 720 (see right panel in Supplementary Figure S5). Thus, the prediction by our new method suggests that nad5 may overlap with the adjacent trnL1 gene by 9 bp. Note, that the combinations with the highest values for λ (11 699 with 13 521 (λ = 0.91) and 13 532 (λ = 0.85), respectively) are not chosen by our method because the corresponding values of φ for the alternative start position are lower (0.03) than for the combination that is finally selected (0.18).

Protein coding gene boundary prediction from RNASeq data
Gene annotations in reference data bases such as RefSeq or annotations derived from automated or systematic reannotation efforts such as MitoAnn or MitoZoa, respectively, are only an approximation of biological reality. Hence, the explanatory power of a comparison of gene predictions with such data bases might be considered limited. Since the true gene boundaries are unknown for nearly all of the available mitochondrial genomes, these data sources are nevertheless valuable due to the large number of included data. Furthermore, several arguments can be made that the limitations of comparisons with published annotations are not too severe: 1) Annotations in RefSeq have been determined with a multitude of methods and have been ideally scrutinized by different experts on the field. Hence, the chance of systematic errors should be low and non-systematic errors should be compensated by the large number of data sets. 2) Automated analyses in MitoAnn implement state of the art knowledge on gene boundary properties (i.e. present codons).
3) Annotations in MitoZoa are systematically improved reannotations by experts on the field. 4) In the combined analyses with all data sources potential disadvantages of each data source are likely compensated, e.g. inconsistencies in RefSeq annotations due to the usage of a variety of annotation methods in RefSeq annotations and potentialy biased annotations in MitoAnn due to the usage of a consistent annotation method.
In principle, RNASeq data sets offer the possibility to determine real gene boundaries, see below. But a systematic and comprehensive comparison is still difficult: 1) The coverage does not drop sharply at gene boundaries. Instead, a gradual increase or decrease is observed. This is likely because the RNASeq data contains unprocessed and/or only partially processed transcripts. A naïve analysis of the data, e.g. by determining the minimal in the coverage curve, therefore cannot achieve the positional accuracy necessary to improve the annotation of gene boundaries. A method for the accurate inference of gene boundary from coverage data would require a comprehensive understanding of the stochasics of transcript processing. At present, such a quantitative model is not available. It is not at all clear, furthermore, to what extent there are differences between individual species and/or genes, and to what extent the processing is affected by rearrangements of the mitogenome.
2) RNASeq data is only available for a very small fraction of the species in the mitochondrial RefSeq data set.
Hence we focused here on the detection of polyadenylation in RNASeq data which allows to determine 3' gene boundries. In contrast to processing sites, polyA sites can be determined with high accuracy from the available RNAseq data. Our analysis below makes the explicit assumption that the polyadenylation sites mark the transcript end or complete stop codons.  Table S1: Used RNASeq data sets and their properties: organism name, ENA identifier of the RNASeq data, accession of the genome used for mapping, RefSeq accession of the mitochondial genome, total number of reads, paired end (y) or single end data (n), number of reads that mapped to the mitochondrial genome, and number of PolyA sites identified on the mitochondrial genome (above 5% of the coverage) We downloaded RNASeq data sets from the European nucleotide archive (ENA, https://www.ebi.ac.uk/ena) of organisms not in our training set (see Table S1). To obtain a set of phylogenetically diverse organisms, we used data from the insect Sitophilus oryzae, the mollusc Octopus bimaculoides, the mammal Castor canadensis, and the nematode Ancylostoma ceylanicum. We also used RNASeq data of two model organisms that were part of our training set: Drosophila Melanogaster, and used in-house mouse data from (4). Genomic assemblies were downloaded from NCBI, and mitochondrial sequences were added to the genomes where necessary. The FASTQ files were trimmed using Trim Galore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) and cutadapt 2.3 (5) and mapped using segemehl (version 0.3.4 (6)) onto genomes downloaded from NCBI https: //www.ncbi.nlm.nih.gov/genome/. Coverage on the mitochondrial genomes was computed using bedtools ( (7)). PolyA sites were identified using an in house python script that identifies reads with non-genomic poly Ts at the 5' end or poly As at the 3'end. To reduce random artefacts a polyA site was only accepted if it was detected in at least 5% of the reads covering the site. Poly A sites are assigned to the genomic position of the first A (3'end) or last T, respectively. Figures were compiled using the Gviz R-package ( (8)). Distances of polyA sites to gene annotations were computed using bedtools closest with the "-d" option separately for forward and reverse strands.
The average coverage of the mitochondrial genome is between ≈ 10x (Mus musculus) and 100x (Drosophila melanogaster ). Between eight and 17 polyA sites were detected for each of the mitochondrial genomes. The data shows that the 3' gene boundaries annotated in RefSeq and the 3' gene boundaries predicted by the automatic method presented here have comparable, small distances to polyA sites. Occasional large distances are in most cases explained by missing annotations in RefSeq or MITOS. Besides showing that the automatic annotations of the 3' ends have a good precision this also indicates that a comparison with RefSeq annotations is indeed useful for evaluating the validity of gene boundary predictions.