MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization

Abstract Mitochondrial genome (mitogenome) plays important roles in evolutionary and ecological studies. It becomes routine to utilize multiple genes on mitogenome or the entire mitogenomes to investigate phylogeny and biodiversity of focal groups with the onset of High Throughput Sequencing (HTS) technologies. We developed a mitogenome toolkit MitoZ, consisting of independent modules of de novo assembly, findMitoScaf (find Mitochondrial Scaffolds), annotation and visualization, that can generate mitogenome assembly together with annotation and visualization results from HTS raw reads. We evaluated its performance using a total of 50 samples of which mitogenomes are publicly available. The results showed that MitoZ can recover more full-length mitogenomes with higher accuracy compared to the other available mitogenome assemblers. Overall, MitoZ provides a one-click solution to construct the annotated mitogenome from HTS raw data and will facilitate large scale ecological and evolutionary studies. MitoZ is free open source software distributed under GPLv3 license and available at https://github.com/linzhi2013/MitoZ.


INTRODUCTION
With the onset of High Throughput Sequencing (HTS) technologies, we have entered an era in which massive nucleic acid sequencing is becoming routine in phylogenetic and biodiversity monitoring studies (1,2). For example, metabarcoding studies, by taking advantage of complex DNA extracts (e.g. environmental DNA (eDNA)), identify multiple taxa simultaneously from diverse types of samples--stomach contents (3), feces (4,5), sediments (6), soil or water (6)(7)(8). In most cases, these studies deal with degraded DNA, thus are in urgent demand for short barcoding fragments for taxonomic identification (9,10). Genes on mitochondrial genomes are preferred due to high copy number per cell, making them more likely be picked up than single-copy nuclear genes. Rapid access to mitochondrial genomes of a myriad of taxa will, firstly, provide critical taxonomic connections between the most abundant and well-constructed DNA barcode COI and those eDNA widely-adopted short markers 12S rRNA, 16S rRNA, CYTB etc. (reviewed by 11); secondly, facilitate the fast-emerging approach--mito-genomics (12)(13)(14), which circumvents PCR and requires a taxonomically well covered reference dataset used both for species identification and in gene capture array design (2,15). In addition to its importance in biodiversity monitoring, mitochondrial genome also records maternal inheritance information and is extensively utilized to infer phylogenetic relationship between diverse lineages (1,16).
Apart from the mitogenomes achieved using long-range PCR followed by primer walking strategy and sanger dideoxy sequencing (17), quite some mitogenomes were obtained using a reference-based method via HTS platform (e.g. [18][19][20][21]. Traditional genome assembly software, for instance, SOAPdenovo2 (22), ALLPATHS-LG (23), Platanus (24), can hardly assemble complete mitogenomes since they are programmed to abandon sequences with extremely-high depth. The two frequently-used mitogenome assembly software, MITObim (25) and NOVOPlasty (26), require closelyrelated mitochondrial fragments as seeds to anchor short reads and build initial datasets. However, it is often difficult to set an appropriate criterion to define closely-related species--e.g. should an appropriate criterion be congeneric or coordinal in the Linnaean system. The similarities between species also vary a lot between different groups (27). There are also some genera within which none species has a complete mitogenome albeit the plunging cost of sequencing (28). In addition, both software can only generate mitogenome assembly as their final outputs. Thus, separate software, like DOGMA (29), MOSAS (30), MITOS (31) are required for the following genome annotation (32). Besides, all of the three aforementioned annotation software only provide web page version and can hardly deal with assembly with multiple scaffolds.
Here we presented a mitochondrial genome toolkit, Mi-toZ, providing a one-click solution from HTS raw reads to genome assembly together with annotation and visualization outputs. MitoZ is programmed in Python3 (33) with the assembly module of a modified version of SOAPdenovo-Trans (34), the annotation module of a Perl based script for protein coding genes (PCGs), MiTFi (35) for tRNA and infernal-1.1.1 (36) for rRNA ( Figure 1). We tested the accuracy and efficiency of MitoZ using a batch of mammals and arthropods which have both mitogenomes obtained by sanger sequencing in NCBI RefSeq database (37) and shotgun Paired-End reads in NCBI Sequence Read Archive (SRA) database (38). The results showed that Mi-toZ can recover 97.33% of PCGs and rRNA genes of the test samples, of which 94.66% genes are in full length and the recovered genes are of high similarity (≥ 97%) to their sanger sequenced mitogenomes.

Samples for test
A total of 30 arthropod and 26 mammal species (Supplementary Table S1) were selected to evaluate the performance of MitoZ. Species were picked up by considerations: (i) have mitogenome in NCBI RefSeq database (37) and were obtained using traditional sanger sequencing method (refer as sanger mitogenomes afterward); (ii) have HTS data with a volume size ≥ 3 Gb and Paired-End (PE) read length ≥ 91 bp in NCBI Sequence Read Archive (SRA) database (38); (iii) for mammal, data generated from tissue samples were preferred, but seven blood samples were included as well for comparison.
We estimated the ratio of mitochondrial derived reads (MDR) for each sample by aligning raw reads to their corresponding sanger mitogenomes using BWA (version 0.7.12-r1039) (39). It showed that most samples had a MDR ratio in a range from 0.12% to 0.51% with the mammal blood samples possessing a significant lower MDR ratio from 0.01% to 0.05% (Supplementary Table S2). In addition, we also noticed that six samples (including three mammal nonblood samples, two mammal blood samples and one arthropod sample) possessed MDR ratio of zero. For those samples, the MDR could be removed on purpose before data deposit. Thus, we removed them from the following performance evaluation, leading to a total of 50 samples in the final dataset, consisting of 29 arthropods, 16 mammal nonblood samples and five mammal blood samples. See Supplementary Tables S3 and S4 for details of the procedures of species selection and dataset download.

MitoZ
MitoZ consists of multiple modules, including raw data pretreatment, de novo assembly, candidate mitochondrial sequences searching, mitogenome annotation and visualization ( Figure 1). Each module can work independently in case users want a sub part of the entire workflow.
Raw data pretreatment. A Perl script is designed for filtering the raw data generated by HTS platforms, such as Hiseq 4000. It accepts Pair-End (PE) or Single-End (SE) reads and filters out reads with many 'N's, low quality reads, or reads of PCR duplicates (defined as a pair of identical reads). By default, reads will be removed if: (i) of > 40% low quality (Q ≤ 17) bases; (ii) of > 10 Ns; (iii) are PCR duplications.
De novo assembly. Algorithms adopted in SOAPdenovo-Trans (34) fit well for mitogenome assemblies from nuclear DNA extracts, where mitogenome sequences possess considerable higher copy number comparing to nuclear genome sequences--being alike to the differential gene expression patterns it is designed for. The de Bruijn graph (DBG) based assembler includes two main function modules--contig assembly and scaffold construction, of which we adopted all the error removal strategies adopted in the contig assembly step, for example, the removal of low-frequency kmers, edges and arcs, and the filtration of graph elements through a percentage threshold (5% by default) that calculated from their adjacent graph elements, and we modified the codes in the scaffold construction process, especially the graph traversal step, where we ask for higher linkage support to avoid connections between mitochondrial reads and nuclear mitochondrial DNA segments (NUMTs) (40) and remove inferior paths in the complex graphs (which were output as alternative splice transcripts) to avoid redundancy.
MitoZ has two assembly modes--Quick mode and Multi-Kmer mode. It uses the Quick mode by default, where only one kmer size (K = 71) is used for assembly. Users can also use the Multi-Kmer mode to search for the missing PCGs (if any) failed in the Quick mode.
Mitogenome sequences identification. Since it is designed to assemble mitogenomes without the aid of closely-related references, MitoZ outputs assemblies containing both mitogenome sequences and nuclear genome sequences. Thus, we firstly filter out candidate mitogenome sequences using a profile Hidden Markov Model (profile HMM) (  both Chordate and arthropod (2413 and 4007 species, respectively, see Supplementary Table S5) in the current version. We built the profile HMMs for each taxonomic clade from their gene alignments and did sequence conservation pattern modeling based on the residue and indel distributions for each position. The powerful 'Forward/Backward' HMM algorithm that computes not just one best-scoring alignment, but a sum of support over all possible alignments provides an important advance in terms of sensitivity of sequence searches for remote homology, thus can promote the detection rate for mitochondrial candidates (43). Then, we conduct PCG annotation for candidate mitogenome sequences. The PCG annotation is detailed in section 2.4. After that, we use the following three steps to remove potential false positive mitochondrial scaffolds, such as NUMTs and contaminations: 1) Each candidate mitogenome sequence is assigned to a Linnaean taxonomic name using a Python package ETE3 (44) according to their most closely-related PCG homologs in the PCG annotation step. Then, filter out sequences falling outside of a user predefined taxonomic rank, which can be set as order, family or genus. 2) We calculated a confident score S j for each scaffold to determine the final mitogenome sequences. S j is calculated using a formula as follows: where A represents the assemble reliability of each assembly. It is a weight factor representing the reliability of de bruijin route selection--the higher the better. Its value consists mainly of two factors--Kmer depth of contigs and read supportive number when connect contigs to scaffolds. For sequences generated using other assembly software, A will be surrogated by average depth informa-tion calculated according to the number of reads that can be mapped to the targeted sequences. n indicates the number of PCGs in sequence j and C i indicates the completeness (in percentage) of i th PCG, which is calculated by dividing the length for each gene by the length of its shortest reference counterpart in the annotation database. 3) Sequences are ranked by their confident scores. Then, Mi-toZ tries to find all 13 PCGs from the sequence with the highest confident score. In case mitogenome is not assembled as an integrate single sequence, MitoZ finds the rest PCGs from sequences by rank and skips sequences containing conflict genes, e.g. a complete COX1 gene is located in a former sequence, then the latter sequences with lower S j score containing COX1 gene (complete or not) will be skipped. However, if former COX1 is incomplete, the latter sequences containing also an incomplete COX1 genes will be kept for PCG searching. The searching stops when 13 PCGs are all located, or sequences are run out. In addition, Sequences, regardless of gene conflicts, containing ≥ 5 PCGs (complete or not) will be retained to confirm identities, e.g. parasites.
Genome annotation. We further annotate the putative control region in the case that the remaining interval region has a length ≥ 600 bp (48) and all its PCGs, tRNA and rRNA are fully recovered.
Visualization. Mitogenome features can be illustrated with an independent module, which employed Circos (49) to depict gene elements features, such as PCGs, rRNA genes, tRNA genes, GC content, and sequencing depth distribution. The color of each element can be set as personal preference.

Performance evaluation
MitoZ assembly. We firstly applied MitoZ quick mode to all the 50 test samples, and tried to filter out contamination sequences by mitochondrial PCG annotation (set -requiring taxa to be taxonomic rank of order for each species). For those did not get 13 PCGs, we applied another run using multi-Kmer mode.
Assembly quality. We examined the similarities between those mitochondrial genes (PCGs + rRNA) MitoZ recovered and that of sanger mitogenomes using megablast in BLAST+ (2.6.0) (50). The genes with similarities < 97% to their sanger counterparts were regarded as false positive, but those false positives who can find matches (similarity ≥ 97%) to their corresponding genes of the same species on NCBI (detailed in Supplementary Tables S7-S9) were regarded as correct assemblies in the following statistics. We used MAFFT (version 7.309) (51,52) and Unipro UGENE (version 1.26.1) (53) to conduct global alignment between each pair and check mismatches, respectively.
Factors influencing mitogenome assembly. A+T content plays an important role in HTS experiments since the known bias in the library preparation step leading to genome regions that possess extreme A+T content tend to have low sequencing depth and are difficult to be assembled (54). Plus, the heterozygosity rate works against the genome assembly quality using HTS platforms (55). The 'heteroplasmy' of our samples may come from pooled individuals aiming to produce enough DNA extracts for HTS library construction. We investigated the influences of sequence characteristics on the assembly qualities, including MDR ratio, depth, A+T content and mitogenome heteroplasmy. We aligned HTS reads to their corresponding mitogenomes using BWA (39) and calculated regional depth using SAMtools (56). We further calculated the heterozygosity value for each sample based on Site Frequency Spectrum (SFS) obtained using ANGSD (57).
Comparison between MitoZ and NOVOPlasty. We also ran NOVOPlasty (version 2.7.2) to assemble mitogenomes of the 50 test samples with default parameters and the corresponding sanger mitogenome of each species was used as the reference seed. We then annotated the NOVOPlasty results with the annotate module in MitoZ and examined the similarities between those mitochondrial genes (PCGs + rRNA) obtained by NOVOPlasty and that of sanger mitogenomes (detailed in Supplementary Tables S10 and S11).
We categorized our assembly results into four types: (i) Type A, all the genes (13 PCGs and 2 rRNA) were recovered and represented by one single sequence; (ii) Type B, all the genes were assembled, however represented by ≥ 2 sequences; (iii) Type C, not all but more than half (≥ 8) of the genes were recovered; (iv) Type D, the number of recovered genes was less than eight. In total, we got 37 (74.00%) mitogenomes of type A, four (8.00%) mitogenomes of type B, nine (18.00%) mitogenomes of type C, and none mitogenome of type D. See Figure 2(I) and (II).

Gene similarities
For the 735 PCGs and rRNA genes recovered, 724 (98.50%) genes matched their sanger counterparts well (similarity ≥ 97%, Figure 3). We further checked these single nucleotide variances (SNVs) between genes assembled by MitoZ and their sanger counterparts. Although the SNVs can arise from individual variances or mitochondrial heteroplasmy, it is also worth to note that those non-perfect-match genes always possess high sequencing depth in HTS assemblies except for the ones located around 'Ns' regions (Supplementary Figure S1(i)) and those SNVs are in most cases located in homopolymers or A+T-rich regions ( Supplementary Figure S1(ii)), where are regions that typical sanger sequencing errors happen. The blue dots present three species whose genes possessed similarities < 97% to their sanger mitogenomes but can find high similarity genes of the same species in NCBI NT database. Such incongruences could derive from intraspecies variances. The red points (five in total) present genes possessed similarities < 97% to their sanger mitogenomes and could not find better hits in NCBI. The rest genes and samples were presented by green dots. The five false positive genes (the red dots in Figure 3) with low similarities to their sanger counterparts could be contributed to insufficient sequencing depth in HTS sequencing and sequencing errors in Sanger mitogenomes, see Supplementary Table S9 for details. Figure 4 shows an example of mitogenome visualization by MitoZ.

Comparison between MitoZ and NOVOPlasty
NOVOPlasty successfully recovered 570 (76.00%) PCGs and rRNA genes of full length, partially assembled 30 (4.00%) genes and failed to assemble 150 (20.00%) genes ( Figure 5(I)). A total of 133 (17.73%) NOVOPlasty-failed genes were successfully assembled by MitoZ. The gene similarities between NOVOPlasty and sanger mitogenomes were in concert with that of MitoZ, indicating these genes of low similarities were probably attributed to intra-species genetic variation ( Figure 5(II)).

Factors influencing mitogenome assembly
The portion of mitochondrial DNA varies in different samples. Our results showed that the assemblies of type A and B mitogenomes tended to possess higher MDR ratios than the assemblies of type C and D (Supplementary Figure S2

DISCUSSION
The development of HTS technique and low sequencing cost greatly facilitates mitogenome sequencing which used to require a tedious process of long range PCR followed by primer walking (58). Nowadays, we are able to obtain mitogenomes from shotgun reads with higher efficiency and lower cost. At the same time, however, the large volume of data challenges our ability to efficiently analyze the exponentially growing dataset. MitoZ, a versatile mitogenome toolkit, aims to achieve mitogenome assemblies together with annotation results from whole genome shotgun reads. It is by far the easiest method to deliver human-readable outcomes for mitogenome studies--require no special pretreatment in either DNA extraction or nucleotide sequencing, and it combines the key bioinformatics steps--clean data filtering, de novo assembly, annotation and visualization, thus provides users an 'one-step' solution from raw data to publishable outcomes and will accelerate the accumulation of mitogenomes. In addition, MitoZ conducts assembly without the aid of reference sequences from closelyrelated species, which can be a crucial feature when the mitochondrial genes from closely-related species are unavailable.
The boundaries of PCGs, especially the stop codons, of many mitochondrial PCGs are not precisely determined in Genbank. Aside from tBlastn and Genewise, MitoZ developed an in-house Python script to further determine the start and stop codons around the boundary of each gene and in most cases was able to precisely locate the start and stop codons (Supplementary Tables S7 and S8). In addition, the current annotation module, especially PCGs annotation, works well mainly for arthropods and mammals and needs further improvement to support more domains of life.
In summary, MitoZ shows the ability to assemble and annotate mitogenomes efficiently and accurately. With the rapid accumulation of mitogenomes and robust reference Genes that recovered by both MitoZ and NOVOPlasty (n = 592) were included in the analysis, and paired samples t-test was used. databases of specific environments or groups, it will facilitate the developments of several important fields in the foreseeable future, such as phylogenetic inference, quarantine inspection, aquatic and agriculture ecosystems scrutiny.