An expanded mammal mitogenome dataset from Southeast Asia

Abstract Southeast (SE) Asia is one of the most biodiverse regions in the world, and it holds approximately 20% of all mammal species. Despite this, the majority of SE Asia's genetic diversity is still poorly characterized. The growing interest in using environmental DNA to assess and monitor SE Asian species, in particular threatened mammals, has created an urgent need to expand the available reference database of mitochondrial barcode and complete mitogenome sequences. We have partially addressed this need by generating 72 new mitogenome sequences reconstructed from DNA isolated from a range of historical and modern tissue samples. Approximately 55 gigabases of raw sequence were generated. From these data, we assembled 72 complete mitogenome sequences, with an average depth of coverage of ×102.9 for modern samples and ×55.2 for historical samples. This dataset represents 52 species, of which 30 had no previous mitogenome data available. The mitogenomes were geotagged to their sampling location, where known, to display a detailed geographical distribution of the species. Our new database of 52 taxa will strongly enhance the utility of environmental DNA approaches for monitoring mammals in SE Asia, as it greatly increases the likelihood that metabarcoding sequencing reads can be assigned to reference sequences. This increases the confidence in species detections and thus allows more robust surveys and monitoring programmes of SE Asia's threatened mammal biodiversity. The extensive collections of historical samples from SE Asia in western and SE Asian museums should serve as additional valuable material to further enrich this reference database.

Data Description
Context Southeast (SE) Asia is one of the most biodiverse regions in the world, hosting ∼20% of mammal species, but it is experiencing rapid deforestation for agriculture and development. To assess the ecological consequences of land use change, there is growing interest in using environmental DNA to monitor mammal populations, particularly threatened taxa that often underpin conservation policies [1][2][3][4]. Yet current efforts are hampered by the lack of a reference database of mitochondrial barcodes and complete mitogenome sequences. Currently there are 922 mammalian mitogenomes available in GenBank. Unfortunately, most are not tagged by location or origin. Manual screening of each mitogenome identified 174 terrestrial mammal species that are typical of SE Asia. In this work, 30 species without previous mitogenome data are added, a ∼17% expansion (30 of 174) of the current SE Asia mammal mitogenome database.

DNA extraction
Genomic DNA was extracted from different sample types of 72 small mammals, comprising 52 species, listed in Table 1 and Table 2. DNA from modern tissue and blood samples was isolated using the Qiagen DNeasy extraction kit (Qiagen, Hilden, Germany, [QIAGEN, RRID:SCR_008539]) or Invitek DNA extraction kit (Invitek GmbH, Berlin, Germany), as per standard protocols following the manufacturer's guidelines. Historical samples obtained from the Zoological Museum, Natural History Museum of Denmark, and University of Copenhagen (ZM, KU) were treated differently according to type of tissue (Additional file 1a), while at the German Primate Center, DNA extraction from museum specimens followed Liedigk et al. (2015) [5] using the Gen-IAL First All Tissue Kit (Gen-IAL, Troisdorf, Germany). Complete details of sample information are provided in Additional file 2.

Mitogenome sequencing, assembly, and annotation
Mitogenomes were generated using several approaches. In Copenhagen, author F.M.S. constructed Illumina shotgun libraries with insert sizes ranging between 50 and 400 bp. To construct libraries, DNA was sheared to the target size range using a Bioruptor® XL (Diagenode, USA [Diagenode, RRID:SCR_014807]) and converted into an Illumina-compatible sequencing library using the NEBNext E6070 Kit (New England Biolabs, UK). The libraries were polymerase chain reaction (PCR) amplified with index primers and purified using Qiaquick columns (Qiagen, Hilden, Germany) according to the manufacturer's instructions (Additional file 1b). Multiple libraries were combined into 3 pools, normalized to 10 nM, and sequenced across 3 lanes of Illumina HiSeq 2500 using SR100 bp chemistry. In Berlin and Goettingen, mitogenomes were generated by authors P.R.P. and C.R. from overlapping PCR products obtained by long-range PCR (Additional file 1c), followed by library construction and MiSeq sequencing, or Sanger sequencing as described in Patel, [5,7,8], respectively. Author R.M.'s mitogenomes were generated using methods outlined in Fortes and Paijmans (2015) [9]. Further details about laboratory methods are described in Additional file 1.
Raw reads for F.M.S. samples were assembled independently by authors F.M.S. and F.P. using 2 different approaches, then compared for consistency. Author F.M.S. trimmed the reads for sequencing adapters, low-quality stretches, and leading/trailing Ns using AdapterRemoval 1.2 (AdapterRemoval, RRID:SCR_011834) [10]. The mitochondrial genome was reconstructed with MITObim v. 1.8 [11] using the reference mitogenome of the closest species available in GenBank as the seed reference (Additional file 2). To obtain the mapping statistics of the samples, we ran PALEOMIX v. 1.2.6 [12] with default parameters, under which reads shorter than 25 bp after trimming were discarded. The trimmed reads were aligned against the newly assembled mitogenome generated by MITObim using Burrows-Wheeler Aligner [13]. Alignments showing low-quality scores and PCR duplicates were then removed using the MarkDuplicates program from Picard tools, and reads were locally realigned around small insertions and deletions (indels) to improve overall genome quality using the IndelRealigner tool from the Genome Analysis Toolkit (GATK, RRID:SCR_001876) [14]. In contrast, author F.P. supplied the trimmed reads to mitoMaker [15], which performs a de novo and reference-based assembly using SOAPdenovoTrans v. 1.03 (SOAPdenovo-Trans, RRID:SCR_013268) [16] and MITObim v. 1.7 [11]. Post-assembly, the F.M.S. and F.P. mitogenomes were manually compared for consistency by F.M.S. to generate the final consensus sequences. These assemblies were automatically annotated using tRNAscan-SE v. 1.4 (tRNAscan-SE, RRID:SCR_010835) [17] and Basic Local Alignment Search Tool v. 2.2.29 (NCBI BLAST, RRID:SCR_004870) [18] using the mitochondrial genomes found in the National Center for Biotechnology Information Reference Sequence Database (RefSeq, RRID:SCR_003496) [19] as references.
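The read filtering and depth statistics described above can be illustrated with a minimal Python sketch. It is not the AdapterRemoval or PALEOMIX implementation: the function names (`trim_terminal_ns`, `filter_reads`, `mean_depth`) are hypothetical, only the 25 bp length cutoff is taken from the text, and mean depth is approximated here as total mapped bases divided by mitogenome length, which is a simplifying assumption.

```python
from typing import Iterable, List

# Cutoff taken from the text: reads shorter than 25 bp after trimming are discarded.
MIN_READ_LENGTH = 25


def trim_terminal_ns(read: str) -> str:
    """Remove leading and trailing N bases (hypothetical helper, illustration only)."""
    return read.strip("Nn")


def filter_reads(reads: Iterable[str], min_length: int = MIN_READ_LENGTH) -> List[str]:
    """Trim terminal Ns, then keep only reads of at least min_length bases."""
    trimmed = (trim_terminal_ns(read) for read in reads)
    return [read for read in trimmed if len(read) >= min_length]


def mean_depth(mapped_read_lengths: Iterable[int], genome_length: int) -> float:
    """Approximate mean depth of coverage as mapped bases / genome length.

    Assumption: every base of every mapped read aligns to the mitogenome.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(mapped_read_lengths) / genome_length


if __name__ == "__main__":
    reads = ["NN" + "A" * 30 + "N", "ACGT" * 5, "N" * 40]
    print(filter_reads(reads))              # only the 30-base read survives
    print(mean_depth([100] * 1650, 16500))  # 1650 reads of 100 bp on a 16.5 kb genome
```

Under these assumptions, 1,650 mapped reads of 100 bp on a 16,500 bp mitogenome correspond to a mean depth of ×10.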
For the mitogenomes constructed by author R.M., Illumina sequence reads were de-multiplexed according to the respective indexes with the Illumina software bcl2fastq v. 2.17 (Illumina, San Diego, CA, USA), and adapters were clipped from the sequence reads with the software cutadapt v. 1.3 [20]. Quality trimming was done through a sliding window approach (10 bp; Q20), and all reads shorter than 20 bp were removed from the     [14] variant calling output files were further filtered to have a minimum read coverage ≥×3, and variants were only called when the corresponding base was represented by ≥50%; otherwise this position was "N"-masked. Numbers of raw reads generated for each sample and mapping statistics for all 72 mitogenome assemblies are shown in Additional file 2. Sanger sequenced mitogenomes were checked with 4Peaks 1.8 (4Peaks, RRID:SCR_000015) [23], assembled with SeaView 4.5.4 [24], and annotated with DOGMA [25]. All mitogenomes were inspected by eye in Tablet [26] to identify possible errors caused by insertions and deletions. The final mitochondrial genomes have been uploaded to GenBank (accession numbers are provided in Tables 1 and 2). The details of all new mitogenomes assembled in this work are given in Tables 1 and 2. Mitogenomes with known localities (60 samples) were geotagged and mapped to display their geographical distribution (Fig. 1).
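The coverage and base-frequency filter described above can be sketched in Python as follows. This is a simplified illustration and not the GATK-based workflow used by the authors: it assumes each position is available as a string of observed bases (a hypothetical pileup column), the names `call_base` and `call_consensus` are invented for the example, and masking exact ties between two bases is an added assumption that the text does not specify.

```python
from collections import Counter
from typing import Iterable

MIN_COVERAGE = 3    # from the text: minimum read coverage of x3
MIN_FRACTION = 0.5  # from the text: base must be represented by >= 50%


def call_base(observed_bases: str) -> str:
    """Return the consensus base for one position, or 'N' if it must be masked."""
    counts = Counter(b for b in observed_bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < MIN_COVERAGE:
        return "N"  # too few reads cover this position
    ranked = counts.most_common(2)
    top_base, top_count = ranked[0]
    # Assumption: an exact tie between the two most frequent bases is masked.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "N"
    if top_count / depth >= MIN_FRACTION:
        return top_base
    return "N"  # no base reaches the 50% threshold


def call_consensus(pileup_columns: Iterable[str]) -> str:
    """Apply call_base to each pileup column and join the result."""
    return "".join(call_base(column) for column in pileup_columns)


if __name__ == "__main__":
    columns = ["AAAA", "AC", "AAC", "ACGT", "AACC"]
    print(call_consensus(columns))  # ANANN
```

In this toy input, the second position is masked for low coverage, the fourth because no base reaches 50%, and the fifth because of the tie rule assumed here.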

Re-use Potential
We anticipate that the now-expanded mitogenome reference dataset for SE Asian mammals will benefit a number of research areas. First, it should enhance the power of environmental DNA and other metabarcoding/barcoding approaches for identifying SE Asian mammals by allowing more detections to be resolved to the species level. This in turn has practical applications for those monitoring SE Asia's threatened mammal biodiversity, combating trade in mammal species, and so on. Second, the data will also be relevant to phylogenetic and population studies based on mtDNA data, which will be of use as we investigate the evolutionary history of this biodiversity hotspot.