Development of Diagnostic SNP Markers To Monitor Hybridization Between Sika Deer (Cervus nippon) and Wapiti (Cervus elaphus)

Sika deer (Cervus Nippon) and wapiti (Cervus elaphus) are closely related species and their hybridization can result in significant allele-shift of their gene pool. Additive genetic effects and putative heterotic effects of their hybridization on growth performance could confer considerable economic advantage in deer farming. Here, we used double-digest restriction site-associated DNA sequencing technology (ddRAD-seq) and detected ∼320,000 genome-wide SNPs from 30 captive individuals: 7 sika deer, 6 wapiti and 17 F1 hybrids (reciprocal cross). By screening observed heterozygosity of each SNP across four taxonomic groups, we report for the first time a resource of 2,015 putative diagnostic SNP markers (species-specific SNPs for sika deer and wapiti), which can be used to design tools for assessing or monitoring the degree of hybridization between sika deer and wapiti. These ddRAD-seq data and SNP datasets are also valuable resources for genome-wide studies, including trait discovery for breeders of domestic deer.

probes for determining the degree of hybridization has been applied to a wide variety of taxa (Nussberger et al. 2013;Lamer et al. 2015;Marques et al. 2017). To date, genome-wide SNP discovery in sika deer (Ba et al. 2017), white-tailed deer (Seabury et al. 2011), and red deer (Brauning et al. 2015) has been reported. Very recently, the high-density genetic map of red deer (Johnston et al. 2017) (based on 38,000 SNPs) and draft genome of red deer (Bana et al. 2018) and reindeer (R. tarandus) (Li et al. 2017) have also been published. We believe that these deer genetic resources will provide a rich source for further development of genome-wide SNPs within Cervidae.
Here, we employed double-digest restriction-site associated DNA sequencing (ddRAD-seq) technology (Peterson et al. 2012) to obtain a dataset of species-specific SNPs that are fixed for a different allele in sika deer and wapiti respectively, meaning that the polymorphism at the SNP was only found between, but not within these two species. These SNPs could be used as candidate diagnostic markers to assess the degree of hybridization between sika deer and wapiti.

Animals
A total of 30 unrelated animal individuals were selected, including 7 sika deer (Cervus nippon hortulorum) (SK), 6 wapiti (Cervus elaphus xanthopygus) (WT) and 17 F1 hybrid offspring (produced from reciprocal cross using artificial insemination: sika deer ♀ · wapiti ♂ (9 SW) and wapiti ♀ · sika deer ♂ (8 WS)) ( Table 1). The SK and WT were captured from their native region respectively, and transferred to Changbai Mountain Wildlife Experimental Station (Zuojia, Jilin, China). The deer in the station have been served as experimental animals for multiple experiments over the years. Individual animals including sika deer and wapiti were selected based on their pedigree to make sure they were pure and unrelated with each other. The fact that the SK and WT are not parents of the F1 hybrids in this study should be emphasized here.

DNA extraction
Whole blood samples were extracted from the jugular vein using EDTA vacuum tubes and were stored at -20°until DNA extraction. Genomic DNA was purified from the collected blood samples using the Blood DNA kit (Qiagen). Each DNA sample was subjected to 0.7% gel electrophoresis for detecting the presence of high-molecular weight DNA and then stored at -80°until ddRAD-Seq library construction.
Pilot study to identify restriction enzymes We selected two restriction enzymes for the study based on cutting efficiency from the four enzymes having 6 base recognition sites (i.e., PstI, NsiI, EcoRI and SacI) and two with 4 base recognition sites (i.e., MseI and MluCI). Six digest reactions were carried out based on the protocols by manufacturers: MseI + Buffer 3.1, EcoRI + Cutsmart Buffer, PstI + Buffer 3.1, NsiI + Buffer 3.1, SacI + Cutsmart Buffer and MluCI + Cutsmart Buffer. Both PstI and MseI were selected to construct our ddRAD-Seq libraries because both could be activated in the same buffer and achieve nearly complete digestion (Figure 2 A).

ddRAD-SEQ library preparation and sequencing
The double digest reactions were carried out in a volume of 25 ml containing 0.8 mg of genomic DNA, 5U of PstI and MseI, and 1· Buffer 3.1 (NEB). The reaction mixture was incubated at 37°for 2 hr and 65°for 30 min. Amplification and sequencing adapters with a unique barcode (6 bp or 8 bp) were ligated onto the digested DNA. Each sample was then amplified via PCR in a 50 ml reaction volume, containing 70-100 ng of adaptor-ligated DNA fragments and amplified with 22 cycles following the manufacturer's protocol. Samples were electrophoresed on a 2% agarose gel, and DNA in the 300-450 bp size range (with indices and adaptors) was excised and purified using a Gel Extraction Kit (Qiagen). Each sample library was pooled in equal amounts and quantified using an Agilent 2100 bio analyzer (Agilent Technologies) and then paired-end 101 bp sequencing was performed using the Illumina HiSeq4000 platform at BGI (Shenzhen, China).

Raw data processing
Low quality reads were filtered or corrected by using the process_radtags program (-c -q -r) in STACKs v2.0 (Catchen et al. 2011;Catchen et al. 2013), based on the three following rules: 1) remove reads with an uncalled base; 2) discard reads with low quality average scores within the sliding window (15% of the read length) that drop below a raw phred score of 10; 3) rescue ddRAD digestion sites within a certain allowance. Quality of clean reads was checked using FastQC (Andrews 2010).

ddRAD-SEQ data analysis and SNP inference
The SNP inference process used in this study was an expanded version of a recently published STACKs protocol (Rochette and Catchen 2017), and is illustrated in Figure 1. The filtered paired-end reads from the SK, WT, WS and SW groups were mapped to sika deer reference genome in end-to-end mode using Bowtie2 (Langmead and   The BAM files for the all individuals were used as the input of gstacks program in STACKs v2.0 with the marukihigh model (Maruki and Lynch 2017), a type of maximum-likelihood method that is appropriate for high-coverage sequencing data (average 15x per ddRAD locus in this study), and minimum required quality above 20 to call SNP alleles. The populations program (-p 4 -r 0.85-min_maf 0.1-lnl_lim -10.0-merge_sites-renz mseI) in STACKs was subsequently used to produce a VCF file that contains SNP information found in at least 85% of individuals across the four groups and with minor allele frequency $0.1. It is notable that we did not specify a maximum H obs required to filter out a nucleotide site at a locus by using the-max_obs_ het parameter in the populations program.

Development of high-quality diagnostic SNP markers
We selected species-specific SNPs as diagnostic SNP markers to meet two filtration criteria: 1) biallelic SNPs are fixed in each individual of the SK and WT groups respectively and 2) all corresponding homologous sites in each individual of the SW and WS groups are heterozygous (observed heterozygosity of all sites are 1). In order to design genotyping assays in the future, sufficient 40 bp flanking sequences with no repetitive sequences and no SNPs within them were also obtained by retrieving the sika deer reference genome.

Availability of supporting data
The VCF file including SNP information is available from Figshare The sika deer ddRAD-seq data can be retrieved from NCBI SRA under BioProject accession number PRJNA383774 (SRA accession number SRR5481378, SRR5481376, SRR5481397, SRR5481381, SRR5481392, SRR5481409 and SRR5481408). The ddRAD-seq data of

RESULTS AND DISCUSSION
ddRAD sequencing and SNP discovery After performing quality control, we obtained a total of 1.11 billion clean paired-end reads (202 Gb), ranging from 2.75 to 10.34 Gb for each individual with an average of 6.76 Gb (Table 1). The average percentage of unique mapping with mismatch #3 is 63.03. We obtained 2,121,131 candidate ddRAD loci with length of 229 6 71 bp (average depth of 15x) shared by at least 85% of individuals across the four groups. From these ddRAD loci, we inferred 324,623 biallelic SNPs that had minor allele frequency $0.1, corresponding to 6.7 heterozygous SNPs per Kb. Finally, we selected a subset of 2,015 species-specific SNPs with 40 bp flanking sequences that are fixed for a different allele in sika deer and wapiti respectively (see Materials and Methods).
To confirm whether ddRAD-SEQ library construction and sequencing data were unbiased and verifiable In order to examine if the distribution of the observed length of ddRAD loci (2,121,131) produced from the gstacks program perfectly matched with those in the reference genome, we performed a simulation and produced a total of 3,049,956 restriction fragments from the reference genome. Consistent length distribution between the observed and the expected ddRAD loci indicates restriction digestion and sequencing are complete for our ddRAD-seq libraries ( Figure 2B). That is, 69.5% of restriction sites on the genome are covered by the unique mapping reads (63.03%) sequenced from the ddRAD-seq libraries. Note that the short restriction fragments may be the repeat loci, such as 122 bp fragments, the number of which is significantly higher than those of the observed ddRAD loci containing SNPs. Highly enriched 320 bp and 440 bp fragments sheared by both PstI and MseI could correspond to a distinct electrophoresis band in the region between 750 bp and 1000 bp in the PstI lane (not digested by MseI) (Figure 2A red asterisk). Overall, these screening results confirmed that our ddRAD-Seq library construction and sequencing were unbiased and verifiable.

Confirmation of taxonomic status of groups
The diagnostic power of SNP markers could be strongly reduced if any of the samples was unrepresentative. STRUCTURE analysis was performed to confirm that each sequenced individual was assigned to its respective population based on its SNP genotypes by using Structure v2.3.4 (Pritchard et al. 2000), with three replicate runs and 1 million steps after a burn-in period of 300,000 and K = 2. In order to reduce the linkage probability, a subset of 50,000 SNPs (324,623 SNPs in total) drawn from one random SNP per independent ddRAD locus was analyzed to estimate cluster membership and admixture proportions across the 30 individuals. The admixture proportions were plotted using Excel2010. The results showed the F1 hybrids acquired 50% of the genome derived from either of the SK and WT groups respectively ( Figure 3A). Based on the estimation of credible intervals (CIs), the admixture proportion of all SK and WT individuals fell within their 95% CIs (0.999 to 1.0), indicating these individuals are pure, and introgression among them did not occur. However, the 95% CIs of almost all F1 hybrids (except for WS5) fell in the region outside the 50% ( Figure 3B), which further supports the fact that the SK and WT are not parents of the F1 hybrids in this study. Although the 95% CIs fell in the region outside the 50% for the F1s hybrids, it would not affect development of diagnostic SNP markers to monitor hybridization between sika deer and wapiti.
Theoretically, all biallelic SNPs should be shared among these groups and the private alleles (the alleles that only appear in one of the four groups in the study) should not be detected if sampling number were large enough. We observed only a small subset of 2,086 (0.64%) SNPs that belonged to private alleles calculated by the populations program in STACKs, including 1,386, 352, 308 and 40 SNPs in the SK, WT, SW and WS groups respectively. More convincingly, the private alleles have lower allele frequencies (average MAF = 0.19). These results confirmed that our samples are suitable for the development of diagnostic SNP markers.
Observed heterozygosity (H obs ), expected heterozygosity (H exp ) and inbreeding coefficients (F IS ) for each SNP within each separate group were calculated by the populations program in STACKs. It is widely known that an observed excess of heterozygotes can be caused by hybridization. Substantial heterozygous SNPs with H obs . 0.6 were observed in the WS and SW, but not in SK and WT ( Figure 4A). In addition, the average F IS of each SNP was less than zero in WS and SW, further supporting that the F1 hybrids included an excess of SNP heterozygotes as expected ( Figure 4B).

Conclusions
The ddRAD-seq data and SNP datasets for the sika deer and wapiti are valuable resources to researchers who carry out genome-wide studies, including trait discovery for breeders of farmed deer. The developed diagnostic SNP markers will be useful for the development of hybridization monitoring tools. However, it is noticeable that these diagnostic SNP markers were only inferred from a small population, and therefore any diagnostic genotyping assay built from this dataset should be first tested with adequate sample sizes.