Sequencing, de novo assembling, and annotating the genome of the endangered Chinese crocodile lizard Shinisaurus crocodilurus

Abstract The Chinese crocodile lizard, Shinisaurus crocodilurus, is the only living representative of the monotypic family Shinisauridae under the order Squamata. It is an obligate semi-aquatic, viviparous, diurnal species restricted to specific portions of mountainous locations in southwestern China and northeastern Vietnam. However, in the past several decades, this species has undergone a rapid decrease in population size due to illegal poaching and habitat disruption, making this unique reptile species endangered and listed in the Convention on International Trade in Endangered Species of Wild Fauna and Flora Appendix II since 1990. A proposal to uplist it to Appendix I was passed at the Convention on International Trade in Endangered Species of Wild Fauna and Flora Seventeenth meeting of the Conference of the Parties in 2016. To promote the conservation of this species, we sequenced the genome of a male Chinese crocodile lizard using a whole-genome shotgun strategy on the Illumina HiSeq 2000 platform. In total, we generated ∼291 Gb of raw sequencing data (×149 depth) from 13 libraries with insert sizes ranging from 250 bp to 40 kb. After filtering for polymerase chain reaction–duplicated and low-quality reads, ∼137 Gb of clean data (×70 depth) were obtained for genome assembly. We yielded a draft genome assembly with a total length of 2.24 Gb and an N50 scaffold size of 1.47 Mb. The assembled genome was predicted to contain 20 150 protein-coding genes and up to 1114 Mb (49.6%) of repetitive elements. The genomic resource of the Chinese crocodile lizard will contribute to deciphering the biology of this organism and provides an essential tool for conservation efforts. It also provides a valuable resource for future study of squamate evolution.


Data Description Background
The Chinese crocodile lizard, Shinisaurus crocodilurus (NCBI taxonomy ID 52224) (Fig. 1), was first collected in 1928. In 1930, to accommodate the monotypic genus and species, Ahl established Shinisauridae as a new family under the order Squamata [1]. The species usually is found along slow-flowing rocky streams in montane evergreen forests [2] and is distributed in the east part of the Guangxi Zhuang Autonomous Region, the west and north parts of Guangdong Province in southern China, and in mountainous areas of northern Vietnam [3]. It is a semiaquatic diurnal predator and a strong swimmer, preying on fish, tadpoles, and aquatic insects. It is ovoviviparous and breeds in July and August in the wild [1,4,5]. A variety of anthropogenic hazards have caused severe population declines within the last decades. Illegal poaching for the international pet trade, traditional medicine, and food represents the main driver fueling the ongoing population decline [2]. A quantitative survey on the species was carried out in 1978, which estimated the total known population at 6000 individuals, and in 2008 it estimated a total population of 950 animals [5,6]. The species was listed in the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) Appendix II in 1990 [7] and in the IUCN Red List of Threatened Species in 2014 [6], and a proposal to uplist it to Appendix I was passed at the CITES Seventeenth meeting of the Conference of the Parties in 2016 [8].

Sample collection and sequencing
The genomic DNA of the Chinese crocodile lizard was extracted from the blood collected from the tail vein of a single adult male lizard in Ocean Park Hong Kong, which is a zoological theme park in Hong Kong. The venipuncture procedure was identical to that used for routine clinical blood draws in lizards in compliance with the Animal Welfare and Use Guidelines of Ocean Park. This lizard was alive in the collection of Ocean Park Hong Kong at the time of manuscript submission (Animal ID: MIG12-30 061 867; the CITES license to possess number is APO/PL 266/12). Three standard DNA libraries with short-insert sizes (250, 500, and 800 bp) and 10 mate-paired libraries with longinsert sizes (2 kb × 2, 5 kb × 2, 10 kb × 2, 20 kb × 2, and 40 kb × 2) were constructed with the standard protocol provided by Illumina (San Diego, CA, USA). Paired-end sequencing was performed for all the 13 libraries on the HiSeq 2000 platform according to the manufacturer's instructions (Illumina, San Diego, CA, USA). The sequenced read length was 150 bp for the shortinsert libraries and 49 bp for the long-insert libraries. In total, about 290.85 Gb (×149) of raw reads were eventually produced (Table 1).

Read filtering and genome size estimation
We obtained 136.73 Gb of clean data from the raw data by removing duplicated reads arising from polymerase chain reaction amplification during library construction, adapter-contaminated reads with ≥10 bp aligned to adapter sequence, low-quality reads that contain >5% "Ns" for the short-insert (250, 500, and 800 bp) data or >20% for the long-insert (2, 5, 10, 2, and 40 kb)   data, and low-quality reads that contain ≥40 low-quality (Illumina phred quality score ≤ 7) bases for the short-insert data or ≥30 bases for the long-insert data using SOAPfilter, an application included in the SOAPdenovo package (SOAPdenovo2, RRID:SCR 014986) ( Table 1) [9]. We obtained about 136.73 Gb of clean reads from about 290.85 Gb of raw reads; the total number of clean reads is 1 494 896 603, and the total number of raw reads is 3631.968,900. Then we used the clean data from the 3 shortinsert (250, 500, and 800 bp) libraries to estimate the genome size of Chinese crocodile lizard with a 17-mer analysis [10]. A kmer is related to an artificial sequence division of K nucleotides iteratively from sequencing reads [11]. We selected a fragment length of 17; the fragment is called 17-mer. When a certain coverage was reached, k-mer frequencies were plotted against the sequence depth gradient following a Poisson distribution [12], then the genome length could be estimated from the number and depth of kmer by the following formula: genome size = (Kmer number)/(Peak depth). According to that prediction, the Chinese crocodile lizard genome is estimated to be 1.95 Gb in size ( Table 2; Fig. 2).

Genome assembly and completeness estimation
We employed the SOAPdenovo package (version 2.04.4) for genome assembly (SOAPdenovo2, RRID:SCR 014986) [9]. Briefly, the sequences derived from the short-insert libraries were first decomposed into k-mers to construct the de Bruijn graph, which was simplified to allow connection of the k-mers into a contiguous sequence (contigs). We tested a series of kmer lengths ranging from 31 to 81 bp, and the 69-mer was finally selected to generate a contig assembly with the longest N50 value. We then aligned the paired-end reads from both the small-and largeinsert libraries to the contigs, calculated the support for relationships between contigs, and constructed scaffolds using distance information from read pairs. We required at least 3 and 5 read pairs to form a reliable connection between 2 contigs for short-insert and large-insert data, respectively. Finally, Kgf (version 1.16) [9] and GapCloser (GapCloser, RRID:SCR 015026;   [9] were employed to close intra-scaffold gaps using paired-end reads from the small-insert libraries. The end result was a genome assembly with a total length of 2.24 Gb, scaffold and contig N50s of 1470 kb and 11.7 kb, respectively, and unclosed gap regions representing 7.98% of the assembly, which is comparable to the previously published reptile genome assemblies (Table 3). We then employed Benchmarking Universal Single-Copy Orthologs (BUSCO; version 3.0.0) to evaluate the completeness of the assembly using 2586 vertebrata expected genes (BUSCO, RRID:SCR 015008) [13]. This analysis showed that 2391 (92.5%) and 125 (4.8%) of the 2586 expected vertebrata genes were identified as complete and fragmented, respectively, while 70 (2.7%) genes were considered missing in the assembly. We ran the same version of BUSCO to the other 14 retile genomes, respectively; the completeness of the Chinese crocodile lizard assembly was also comparable to other published reptile genome assemblies (Table 4).

Repeat annotation
Repetitive elements in the Chinese crocodile lizard genome were identified by homology searches against known repeat databases and de novo predictions. Briefly, we identified known transposable elements (TEs) by using RepeatMasker (version 3.3.0; RepeatMasker, RRID:SCR 012954) [14] to search against the Repbase TE library (RepBase16.10) [14] and RepeatProteinMask within the RepeatMasker package to search against the TE protein database. We then employed RepeatModeler (version 1.0.5; RepeatModeler, RRID:SCR 015027) [15] and LTR FINDER (version 1.0.5) [16] for de novo prediction of TEs. RepeatModeler was first used to construct a de novo crocodile lizard repeat library, which was subsequently used by RepeatMasker to annotate repeats in the crocodile lizard genome. LTR FINDER was used to search the whole genome for a characteristic structure of the full-length long terminal repeat retrotransposons (LTRs), mainly based on their ∼18 bp terminal sequences being complementary to the 3' tail of some tRNAs [16]. We provided LTR FINDER with all eukaryotic tRNAs to search for LTRs. Finally, we searched the genome for tandem repeats using Tandem Repeats Finder (TRF; version 4.04) [17]. The results from different methods were presented in Table 5. Overall, a total of 1114 Mb of non-redundant repetitive sequences were identified, accounting for 49.6% of the Chinese  (Table 5), and by using RepeatMasker, we found that long interspersed elements are the most predominant elements in de novo predictions, which accounted for 10% of the genome. This lizard genome, in terms of repeat content, is similar to the known genomes of the Burmese python [18], king cobra [18,19], Australian dragon lizard [20], and green anole lizard (Table 6) [21].

Gene prediction
We combined homology-based and de novo methods to build consensus gene models of the reference genome. In the homology-based method, protein sequences of Anolis carolinensis, Gallus gallus, and Homo sapiens derived from the Ensembl database (release-67; Ensembl, RRID:SCR 002344) were first mapped to the Chinese crocodile lizard genome using TBLASTN (version 2.2.23; TBLASTN, RRID:SCR 011822) [22] with an E-value cutoff of 1e-5, and the BLAST hits were linked into candidate gene loci with GenBlastA (version 1.0.4) [23]. Then the genomic sequences of the candidate loci together with 2 kb flanking sequences were extracted, and gene structures were determined by aligning the homologous proteins to these extracted genomic sequences using GeneWise (version 2.2; Ge-neWise, RRID:SCR 015054) [24]. In the de novo method, we randomly chose 1000 homology-based gene models with intact open reading frames and an aligning rate of 100% (i.e., query protein length/predicted protein length) to train Augustus (version 2.5.5) (Augustus: Gene Prediction, RRID:SCR 008417) [25] in order to obtain gene parameters appropriate to the Chinese crocodile lizard genome. Then we performed de novo prediction on the repeat-masked genome using Augustus with the obtained gene parameters. Finally, gene models from these 2 methods were

Conclusion
Here we report the first annotated Chinese crocodile lizard genome sequence assembly. This research yielded a draft genome assembly with a total length of 2.24 Gb and an N50 scaffold size of 1.47 Mb. The assembled genome was predicted to contain 20 150 protein-coding genes and up to 1114 Mb (49.6%) of repetitive elements. The draft genome and annotation will provide valuable data for the captive breeding and aid research into the phylogeny and biological features such as ovoviviparity, venom glands, etc. [37,38].

Funding
This work was funded by the China National Genebank, and the Project U1301252 was supported by National Natural Science Foundation of China. The sample was supplied by the Genome 10K (G10K) consortium. We would like to thank the faculty and staff in BGI-Shenzhen, who contributed to the sequencing of the Chinese crocodile lizard genome.