MitoZ: A toolkit for mitochondrial genome assembly, annotation and visualization

Mitochondrial genome (mitogenome) plays important roles in evolutionary and ecological studies. It becomes routine to utilize multiple genes on mitogenome or the entire mitogenomes to investigate phylogeny and biodiversity of focal groups with the onset of High Throughput Sequencing technologies. We developed a mitogenome toolkit MitoZ, consisting of independent modules of de novo assembly, findMitoScaf, annotation and visualization, that can generate mitogenome assembly together with annotation and visualization results from HTS raw reads. We evaluated its performance using a total of 50 samples of which mitogenomes are publicly available. The results showed that MitoZ can recover more full-length mitogenomes with higher accuracy compared to the other available mitogenome assemblers. Overall, MitoZ provides a one-click solution to construct the annotated mitogenome from HTS raw data and will facilitate large scale ecological and evolutionary studies. MitoZ is free open source software distributed under GPLv3 license and available at https://github.com/linzhi2013/MitoZ.


23
With the onset of High Throughput Sequencing (HTS) technologies, we have entered an era in 24 which massive nucleic acid sequencing is becoming routine in phylogenetic and biodiversity 25 monitoring studies (1,2). For example, metabarcoding studies, by taking advantage of complex 26 DNA extracts (e.g. environmental DNA (eDNA)), identify multiple taxa simultaneously from 27 diverse types of samplesstomach contents (3), feces (4,5), sediments (6), soil or water (6-8). 28 In most cases, these studies deal with degraded DNA, thus are in urgent demand for short 29 barcoding fragments for taxonomic identification (9,10). Genes on mitochondrial genomes are 30 preferred due to high copy number per cell, making them more likely be picked up than single-31 copy nuclear genes. Rapid access to mitochondrial genomes of a myriad of taxa will, firstly, 32 provide critical taxonomic connections between the most abundant and well-constructed DNA 33 barcode COI and those eDNA widely-adopted short markers 12S rRNA, 16S rRNA,CYTB et 34 al. (reviewed by 11); secondly, facilitate the fast-emerging approachmito-genomics (12)(13)(14), 35 which circumvents PCR and requires a taxonomically well covered reference dataset used both 36 for species identification and in gene capture array design (2,15). In addition to its importance 37 in biodiversity monitoring, mitochondrial genome also records maternal inheritance 38 information and is extensively utilized to infer phylogenetic relationship between diverse 39 MitoZ has two assembly modes -Quick mode and Multi-Kmer mode. It uses the Quick mode 108 by default, where only one kmer size (K=71) is used for assembly. Users can also use the Multi-109 Kmer mode to search for the missing PCGs (if any) failed in the Quick mode. 110

Mitogenome sequences identification 111
The assemblies contain both mitogenome sequences and nuclear genome sequences. Thus, we 112 filter out candidate mitogenome sequences using a profile Hidden Markov Model (profile 113 HMM) (41,42) based method which works in an efficient and criteria-relaxed way at first. 114 HMMER (version 3.1b2) (http://hmmer.org/) (43) is utilized to construct HMM models for 115 both mammal and arthropod clades (2413 and 4007 species, respectively, see Table S5) in the 116 current version. Then, we conduct PCG annotation for candidate mitogenome sequences. The 117 PCG annotation is detailed in section 2.2.4. After that, we use the following three steps to 118 remove potential false positive mitochondrial scaffolds, such as NUMTs and contaminations: 119 1) Each candidate mitogenome sequence was assigned to a Linnaean taxonomic name 120 using a Python package ETE3 (44) according to their most closely-related PCG 121 homologs in the PCG annotation step. Then, filter out sequences falling outside of 122 a user predefined taxonomic rank, which can be set as order, family or genus. 123 2) We calculated a confident score for each scaffold to determine the final 124 mitogenome sequences. is calculated using a formula as follows: 125 where represents the assemble reliability of each assembly. It is a weight factor 127 representing the reliability of de bruijin route selection -the higher the better. Its value is mainly consisted of two factors -Kmer depth of contigs and read supportive 129 number when connect contigs to scaffolds. For sequences generated using other 130 assembly software, will be surrogated by average depth information calculated 131 according to the number of reads that can be mapped to the targeted sequences. 132 indicates the number of PCGs in sequence and indicates the completeness (in 133 percentage) of th PCG, which is calculated by dividing the length for each gene 134 by the length of its shortest reference counterpart in the annotation database. 135 3) Sequences are ranked by their confident scores. Then, MitoZ tries to find all 13 136 PCGs from the sequence with the highest confident score. In case mitogenome is 137 not assembled as an integrate single sequence, MitoZ finds the rest PCGs from 138 sequences by rank and skips sequences containing conflict genes, e.g. a complete 139 COX1 gene is located in a former sequence, then the latter sequences with lower 140 score containing COX1 gene (complete or not) will be skipped. However, if former 141 COX1 is incomplete, the latter sequences containing also an incomplete COX1 142 genes will be kept for PCG searching. The searching stops when 13 PCGs are all 143 located, or sequences are run out. In addition, Sequences, regardless of gene 144 conflicts, containing ≥ 5 PCGs (complete or not) will be retained to confirm 145 identities, e.g. parasites. 146

Genome annotation 147
protein coding genes 148 An in-house Perl script is designed for PCG annotation. Basically, the script finds candidate 149 PCG sequences by aligning nucleotide sequences to a local protein sequences database using

Visualization 167
Mitochondrion genome features can be illustrated with an independent module, which 168 employed Circos (47) to depict gene elements features, such as PCGs, rRNA genes, tRNA 169 genes, GC content, and sequencing depth distribution. The color of each element can be set as 170 personal preference. 171

Performance Evaluation 172
MitoZ assembly contamination sequences by mitochondrial PCG annotation (set --requiring_taxa to be 175 taxonomic rank of order for each species). For those didn't get 13 PCGs, we applied another 176 run using multi-Kmer mode. 177

Assembly quality 178
We examined the similarities between those mitochondrial genes (PCGs + rRNA) MitoZ 179 recovered and that of Sanger mitogenomes using megablast in BLAST+ (2.6.0) (48). The genes 180 with similarities < 97% to their sanger counterparts were regarded as false positive, but those 181 false positives who can find matches (similarity≥ 97%) to their corresponding genes of the 182 same species on NCBI (detailed in Table S7 and Table S8 and Table S9) were regarded as 183 correct assemblies in the following statistics. We used MAFFT (version 7.309) (49,50) and 184 Unipro UGENE (version 1.26.1) (51) to conduct global alignment between each pair and check 185 mismatches, respectively. 186 Factors influencing mitogenome assembly 187 A+T content plays an important role in HTS experiments since the known bias in the library 188 preparation step leading to genome regions that possess extreme A+T content tend to have low 189 sequencing depth and are difficult to be assembled (52). Plus, the heterozygosity rate works 190 against the genome assembly quality using HTS platforms (53). The "heteroplasmy" of our 191 samples may come from pooled individuals aiming to produce enough DNA extracts for HTS 192 library construction. We investigated the influences of sequence characteristics on the 193 assembly qualities, including MDR ratio, depth, A+T content and mitogenome heteroplasmy. 194 We aligned HTS reads on their corresponding mitogenomes using BWA (39) and calculated regional depth using SAMtools (54). We calculated the heterozygosity value for each sample 196 based on Site Frequency Spectrum (SFS) obtained using ANGSD (55). 197 Comparison between MitoZ and NOVOPlasty 198 We also ran NOVOPlasty (version 2.7.2) to assemble mitogenomes of the 50 test samples with 199 default parameters and the corresponding sanger mitogenome of each species was used as the 200 reference seed. We then annotated the NOVOPlasty results with the annotate module in MitoZ 201 and examined the similarities between those mitochondrial genes (PCGs + rRNA) obtained by 202 NOVOPlasty and that of Sanger mitogenomes (detailed in Table S10 and Table S11). 203

Mitogenome completeness 205
Of the 750 genes ((13 PCGs + 2 rRNAs) * 50 species), 691 (92.13%) genes were full-length 206 recovered, 39 genes (5.20%) were partially recovered and 20 (2.67%) genes were not 207 assembled by MitoZ (Fig. 2 (iii)). The Multi-Kmer mode contributed a total of 46 genes that 208 were either failed (33) or partially (13) assembled in the Quick mode. 209 We categorized our assembly results into four types: 1) Type A, all the genes (13 PCGs and 2 210 rRNA) were recovered and represented by one single sequence; 2) Type B, all the genes were 211 assembled, however represented by ≥ 2 sequences; 3) Type C, not all but more than half (≥ 8) 212 of the genes were recovered; 4) Type D, the number of recovered genes was less than eight. In 213 total, we got 37 (74.00%) mitogenomes of type A, four (8.00%) mitogenomes of type B, nine 214 (18.00%) mitogenomes of type C, and none mitogenome of type D. See Fig. 2

(i) and (ii).
For the 735 PCGs and rRNA genes recovered, 724 (98.50%) genes matched their Sanger 217 counterparts well (similarity≥97%, Fig. 3). We further checked these single nucleotide 218 variances (SNV) between genes assembled by MitoZ and their Sanger counterparts. Although 219 the SNVs can arise from individual variances or mitochondrial heteroplasmy, it is also worth 220 to note that those non-perfect-match genes always possess high sequencing depth in HTS 221 assemblies except for the ones located around "Ns" regions ( Fig. S1 (i)) and those SNVs are 222 in most cases located in homopolymers or A+T-rich regions (Fig. S1 (ii)), where are regions 223 that typical Sanger sequencing errors happen. 224 The 5 false positive genes (the red dots in Fig. 3) with low similarities to their Sanger 225 counterparts could be contributed to insufficient sequencing depth in HTS sequencing and 226 sequencing errors in Sanger mitogenomes, see Table S9 for details. Fig. 4 shows an example 227 of mitogenome visualization by MitoZ. 228

Factors influencing mitogenome assembly 236
The portion of mitochondrial DNA varies in different samples. Our results showed that the 237 assemblies of type A and B mitogenomes tended to possess higher MDR ratios than the 238 assemblies of type C and D (Fig. S2 (i) and (ii)). However, no significant correlation was detected between mitogenome assembly qualities and factors of both A+T content and 240 heterozygosity ratio (Fig. S2 (iii) and (iv)). 241

242
The development of HTS technique and low sequencing cost greatly facilitates mitogenome 243 sequencing which used to require a tedious process of long range PCR followed by primer 244 walking (56). Nowadays, we are able to obtain mitogenomes from shotgun reads with higher 245 efficiency and lower cost. At the same time, however, the large volume of data challenges our 246 ability to efficiently analyze the exponentially growing dataset. MitoZ, a versatile mitogenome 247 toolkit, aims to achieve mitogenome assemblies together with annotation results from whole 248 genome shotgun reads. It is by far the easiest method to deliver human-readable outcomes for 249 mitogenome studiesrequire no special pretreatment in either DNA extraction or nucleotide 250 sequencing, and it combines the key bioinformatics stepsclean data filtering, de novo 251 assembly, annotation and visualization, thus provides users an 'one-step' solution from raw 252 data to publishable outcomes and will accelerate the accumulation of mitogenomes. In addition, 253 MitoZ conducts assembly without the aid of reference sequences from closely-related species, 254 which can be a crucial feature when the mitochondrial genes from closely-related species are 255 unavailable. 256 The boundaries of PCGs, especially the stop codons, of many mitochondrial PCGs are not 257 precisely determined in Genbank. Aside from tBlastn and Genewise, MitoZ developed an in-258 house Python script to further determine the start and stop codons around the boundary of each 259 gene and in most cases was able to precisely locate the start and stop codons (Table S7 and  260   Table S8). In addition, the current annotation module, especially PCGs annotation, works well 261 mainly for arthropods and mammals and needs further improvement to support more domains 262 of life. 263 In summary, MitoZ shows the ability to assemble and annotate mitogenomes efficiently and 264 accurately. With the rapid accumulation of mitogenomes and robust reference databases of 265 specific environments or groups, it will facilitate the developments of several important fields 266 in the foreseeable future, such as phylogenetic inference, quarantine inspection, aquatic and 267 agriculture ecosystems scrutiny.  Table S1. Species list used for MitoZ performance testing 326 Table S2. A+T content of sanger mitogenome and sanger mitochondrial genes, 327 mitochondrial reads ratio and heterozygosity of SRA data 328