Accurate Genomic Predictions for Chronic Wasting Disease in U.S. White-Tailed Deer

The geographic expansion of chronic wasting disease (CWD) in U.S. white-tailed deer (Odocoileus virginianus) has been largely unabated by best management practices, diagnostic surveillance, and depopulation of positive herds. Using a custom Affymetrix Axiom single nucleotide polymorphism (SNP) array, we demonstrate that both differential susceptibility to CWD, and natural variation in disease progression, are moderately to highly heritable (h2=0.337±0.079─0.637±0.070) among farmed U.S. white-tailed deer, and that loci other than PRNP are involved. Genome-wide association analyses using 123,987 quality filtered SNPs for a geographically diverse cohort of 807 farmed U.S. white-tailed deer (n = 284 CWD positive; n = 523 CWD non-detect) confirmed the prion gene (PRNP; G96S) as a large-effect risk locus (P-value < 6.3E-11), as evidenced by the estimated proportion of phenotypic variance explained (PVE ≥ 0.05), but also demonstrated that more phenotypic variance was collectively explained by loci other than PRNP. Genomic best linear unbiased prediction (GBLUP; n = 123,987 SNPs) with k-fold cross validation (k = 3; k = 5) and random sampling (n = 50 iterations) for the same cohort of 807 farmed U.S. white-tailed deer produced mean genomic prediction accuracies ≥ 0.81; thereby providing the necessary foundation for exploring a genomically-estimated CWD eradication program.

analyses (GWAA's) and produce marker-based heritability estimates. Finally, we conclude our study by utilizing the genome-wide SNP data to deploy genomic prediction equations with k-fold cross validation to assess the potential for developing a genomically-estimated CWD eradication program.
Animal resources, CWD diagnostics, and DNA isolation Frozen whole blood samples (n = 448) and rectal biopsies (n = 37) from farmed U.S. white-tailed deer (WTD; both sexes) were available within an existing USDA APHIS repository that was created via federal CWD surveillance activities; including depopulations of CWD positive herds (USDA APHIS, Fort Collins, CO). All herds included both CWD positive (n = 256) and CWD non-detect (n = 229) WTD (Thomsen et al. 2012), with geographic representation that included WTD farms located in the U.S. Midwest, Northeast, and South. All diagnostic classifications were based upon immunohistochemistry (i.e., IHC of lymph node, obex), as implemented and performed at USDA National Veterinary Services Laboratory (NVSL) in Ames Iowa (Thomsen et al. 2012). Genomic DNA was isolated from frozen whole blood using the Applied Biosystems MagMAX DNA Multi-Sample Ultra Kit with the KingFisher 96 Purification System (ThermoFisher), as recommended by the manufacturer, at the Texas Veterinary Medical Diagnostic Laboratory (TVMDL; College Station, TX). Genomic DNA was isolated from WTD rectal biopsies using the LGC sbeadex tissue purification kit (LGC) with automation at GeneSeek Neogen (Lincoln, NE). Hair samples (n . 700) from farmed WTD (both sexes) were also available within an existing Texas Parks and Wildlife Department (TPWD; Austin, TX) repository created via surveillance and depopulation efforts after the initial detection of CWD in Texas. At the time of study, sample repositories for these herds included CWD positive (n $ 100) and CWD nondetect (n $ 600) WTD, with a geographic representation that included WTD farms located in the U.S. South (Texas). All diagnostic classifications for TPWD samples were based upon IHC (i.e., lymph node, obex, and one tonsil) initially performed at TVMDL, with further confirmation at NVSL (Thomsen et al. 2012). Genomic DNA was isolated from WTD hair follicles using the LGC sbeadex tissue purification kit (LGC) with automation at GeneSeek Neogen (Lincoln, NE). All genomic DNAs were quantified and evaluated for purity (260/280 ratio) via Nanodrop (ThermoFisher).

Reduced representation libraries and sequencing
Pooled DNA samples were previously shown to be effective for variant prediction; thus enabling downstream genotyping in WTD (Seabury et al. 2011). Therefore, pooled DNA samples were created for CWD positive (WT1) and CWD non-detect (WT0) WTD acquired from the USDA APHIS repository (frozen blood). Briefly, WT1 (n = 190) and WT0 (n = 184) genomic DNAs with concentrations $ 15 ng / ml were used to construct sequencing pools representing depopulated WTD from the U.S. Midwest, Northeast, and South by targeting 50 ng of genomic DNA per WTD in each respective pool (WT1, WT0). Genomic DNAs with concentrations , 15 ng / ml were retained for downstream Affymetrix Axiom and PRNP genotyping. Aliquots from each genomic DNA pool (WT0, WT1) were digested with EcoRI, HindIII, and PstI (NEB) in 1X CutSmart buffer for 4 hr at 37°. Enzymes were heat inactivated at 80°for 20 min and held at 10°until ligation. Ligase buffer, ligase (NEB) and barcoded enzyme-specific adapters compatible with DNA possessing EcoRI, HindIII or PstI overhangs were added to the digested samples, and incubated at 16°f or 8 hr. Following heat inactivation at 80°for 20 min, 1/10 th volume of 3M NaAc (pH 5.2) and two volumes of 100% ethanol were added to each sample, and then held at -20°for 1 hr before spinning at high speed for 10 min in a bench-top microfuge. Pellets were washed twice in 1 ml 70% ethanol and resuspended in 130 ml 1X TE. Samples were then sheared to an average size of 350 bp on the Covaris E220 sonicator, and AMPure XP bead purified as per the manufacturers protocol (Beckman Coulter). Sheared DNA fragments were size selected using a Pippin Prep 2% dye-free agarose gel with internal size markers (Sage Bioscience); aiming for 300-800 bp inserts. Recovered samples were cleaned with 1X AMPure XP beads and endrepaired first with Bst DNA Polymerase (NEB), then with a DNA End Repair Kit (NEB), and A-Tailed using Klenow Fragment (39/59 exo-) (NEB) in the presence of 50 nM ATP. An Illumina P7-adapter (Adapter B) was ligated to the A-tailed ends as described above. Following two rounds of AMPure XP bead purification, 150 ng of each pool was then subjected to "pre-selection PCR" (PreCR) in which a biotinylated forward primer (P5-Select) and unique indexed reverse primers (TDX) were used to amplify and tag desired DNA fragments. Reactions (200ml total) contained 200 nM dNTPs, biotinylated forward and two P7-index primers per pool, 4 units Q5 Hi-Fidelity Taq (NEB), and were split into 2 X 100 ml volumes for thermocycling. Following an initial denaturation at 98°for 30 sec, samples were subjected to 15 cycles of 98°for 10 sec, 72°for 30 sec then a final elongation at 72°for 5 min and held at 4°. PCR products were purified using Qiagen PCR purification columns, then 1X AMPure XP beads, and quantified via DeNovix. Removal of undesirable fragments (P5 to P5 and P7 to P7 ligated products) was achieved with Dynabeads M-270 Streptavidin coupled magnetic beads (ThermoFisher). Briefly, 50 ml of beads per sample were captured and washed twice with 1X Bead Washing Buffer (1X BWB, 10 mM Tris-HCl with pH 7.5, 1 mM EDTA, 2 M NaCl). Beads were resuspended in 100ml 2X BWB and mixed with 2000 ng of PreCR product in 100 ml EB. After 20 min at room temperature, beads were captured and washed three times in 200 ml 1X BWB, twice in 200 ml water, and once in 100 ml 1X SSC. Beads were then resuspended in 50 ml 1X SSC, heated at 98°for 5 min, and placed on a magnet, with the supernatant removed thereafter. This elution was repeated and the final supernatants were purified with Qiagen PCR columns, as recommended by the manufacturer. The eluted ssDNA was DeNovix quantified, and diluted to 1 ng/ml with EB. A final PCR was performed on 10 ng of input DNA using FiLi-F1 and FiLi-R1 primers in a 50 ml reaction, with only 8 cycles. Final PCR products representing WT0-EcoRI, WT0-HindIII, WT0-PstI, WT1-EcoRI, WT1-HindIII, and WT1-PstI were purified with 1X AMPure XP beads, quantified and assessed for quality on a Fragment Analyzer (Advanced Analytics), and sequenced (2 · 125 bp, paired end) on the Illumina HiSeq 2500 at the Texas AgriLife Genomics and Bioinformatics Core Facility at Texas A&M University. Raw reads generated for each library were as follows: WT1 EcoRI (134,299,714); WT1 PstI (175,412,740); WT1 HindIII (152,371,052); WT0 EcoRI (145,989,752); WT0 PstI (120,598,450); WT0 HindIII (137,148,058). Primers used were as follows: For ligation to restriction enzyme cut DNA, adapters were made by mixing equimolar amounts of top (T) and bottom (B) oligos in 1X oligo hybridization buffer (50 mM NaCl, 1 mM EDTA, 10 mM Tris-HCl, pH 8.0), heating them to 98°for 1 min, and allowing them to cool to room temperature at a rate of 0. Sequence analysis and affymetrix axiom array design All Illumina sequences were trimmed for quality and adapters using CLC Genomics Workbench 10.1.1 (Qiagen), as previously described (Halley et al. 2014;Halley et al. 2015;Sollars et al. 2017). All trimmed reads were mapped to the WTD genome assembly (GCF_002102435.1 Ovir.te_1.0; https://www.ncbi.nlm.nih.gov/assembly/GCF_002102435.1/) using the CLC Genomics Workbench 10.1.1 reference mapping algorithm (Seabury et al. 2011;Halley et al. 2014;Halley et al. 2015;Sollars et al. 2017). Variant prediction was performed using a probabilistic approach implemented within CLC Genomics Workbench 10.1.1 (Halley et al. 2014;Halley et al. 2015;Oldeschulte et al. 2017;Sollars et al. 2017). This algorithm estimates error probabilities from quality scores, and uses these probabilities to determine the most likely allele combination per nucleotide position, thus facilitating a user-specified minimum probability threshold (P $ 0.95) for variant prediction, and variant quality scores (Halley et al. 2014;Halley et al. 2015;Oldeschulte et al. 2017;Sollars et al. 2017). Additional variant prediction parameters and filters were similar to those recently described (Seabury et al. 2011;Oldeschulte et al. 2017), and the probabilistic approach produced evidence for 6,268,706 variants, which included 5,561,550 putative SNPs (P $ 0.95; Minor Allele Frequency $ 0.01). Variant prediction results were exported from CLC as a single variant call formatted file (VCF), which was used for SNP array design. Briefly, the VCF file was filtered according to the Affymetrix Axiom myDesign guidelines for SNP submission (http:// www.affymetrix.com/support/technical/byproduct.affx?product= axiom_custom_agrigenomics) using a custom python script as follows: Retain only biallelic SNPs with minimum depth = 10, maximum depth = 150, minimum minor allele frequency (MAF) $ 0.15, minimum SNP quality score = 30, identify probe overlaps for exclusion, and prioritize variants that maximize array density (i.e., A/T and C/G take up two spaces on the array). The python script as well as more detailed documentation are available in Additional File 30 (DRYAD). The targeted number of SNPs for submission to Affymetrix was . 200,000; to facilitate internal Affymetrix scoring (i.e., by pconvert, best_pconvert; recommended, neutral, or not recommended) that would enable the design of a final Affymetrix Axiom 200K SNP array. Collectively, 200,000 SNPs were favorably scored (n = 179,508 recommended; n = 20, 492 neutral) and used for array fabrication.
PRNP and affymetrix axiom array genotyping PRNP genotyping for missense variants at codons 37, 95, 96, 116, and 226 was performed at GeneSeek Neogen (Lincoln, NE) as part of an existing commercial genotyping by sequencing (GBS) service. Briefly, the functional PRNP gene was PCR amplified using primers designed to be exclusionary to a processed pseudogene as previously described (O'Rourke et al. 2004), and the resulting amplicons were purified via AMPure XP beads as recommended by the manufacturer (Beckman Coulter); thus allowing for the creation of barcoded Illumina Nextera XT DNA libraries and amplicon sequencing on an Illumina MiSeq. PRNP genotypes were called from the aligned read pileups at GeneSeek Neogen, and delivered in text format. Affymetrix Axiom 200K genotyping was also performed at GeneSeek Neogen using the established Affymetrix best practices workflow; with genotypes delivered in text format. Affymetrix quality control thresholds were DQC $ 0.82, QC call rate $ 97%, passing samples in the project $ 95%, and average call rate for passing samples $ 97%. Collectively, 860 WTD samples with the desired metadata (i.e., sex, age, U.S. general region) passed all Affymetrix QC filters; each with 125,968 SNP array genotypes, and paired PRNP genotypes, thus yielding a combined set of 125,973 SNP genotypes for analysis. Fifty-three WTD did not have CWD diagnostic data at the time of study. SNPs which did not convert on the Affymetrix Axiom 200K WTD array were primarily due to call rates below the best practices threshold (n = 37,197), and failures to meet thresholds in two or more best practices criteria (n = 36,045). All SNP conversion types are comprehensively summarized in DRYAD (Additional File 32).

GWAA and genomic prediction with cross validation
Prior to all analyses, a comparative marker map was created by aligning the WTD PRNP sequence and all Affymetrix Axiom 200K probe sequences with ARS-UCD1.2 (GCA_002263795.2) via blastn, thus providing comparative evidence for the origin of the array SNPs (i.e., autosomal vs. non-autosomal); which was necessary because the draft de novo WTD genome assembly (GCF_ 002102435.1 Ovir.te_1.0) is unanchored (i.e., by maps or in situ hybridization). The comparative marker map was joined to the combined set of all genotypes (PRNP + Affymetrix Axiom array), and quality control analyses were performed in SVS v8.8.2 or v8.8.3 (Golden Helix). Initially, pairwise IBS distances were computed to identify twins and duplicate samples. Eight samples present in both repositories (USDA APHIS, TPWD) were purposely duplicated for use as process controls, and correctly identified by IBS/IBD estimates (Additional File 31 in DRYAD). Eight additional samples were also identified as either duplicates or potential twins. In all cases, only the sample with the highest call rate was retained for further analyses. Additional quality control analyses and filtering were as follows: sample call rate (, 97% excluded), and thereafter, SNP filtering by call rate (. 15% missing excluded), MAF (, 0.01 excluded), polymorphism (monomorphic SNPs excluded), and Hardy-Weinberg Equilibrium (excludes SNPs with HWE P-value , 1e-25), which yielded 123,987 SNPs for all analyses. PRNP SNPs which failed to endure quality control filtering included codons 95 (MAF , 0.01) and 116 (monomorphic), whereas codons 37, 96, and 226 remained. All GWAA's and genomic predictions with k-fold cross validation were performed on 807 WTD with recorded metadata that included sex, age, U.S. general region of origin (i.e., Midwest, Northeast, South), and CWD diagnostics. CWD phenotypes used in all analyses were: CWD Scores (0 = non-detect, 1 = lymph node positive, 2 = lymph node and obex positive); CWD Binary (0 = non-detect, 1 = lymph node positive and/or obex positive). However, at the time of study, one WTD (sample ID: MQ6Q) only possessed diagnostic data for a CWD positive tonsil biopsy, and thus was assigned a CWD Score and a CWD Binary phenotype of "1". All WTD GWAA's were performed using a mixed linear model with variance component estimates, as described and implemented in EMMAX, and executed in SVS v8.8.2 or v8.8.3 (Golden Helix), where all genotypes are also recoded as 0, 1, or 2, based on the incidence of the minor allele (Kang et al. 2010;Segura et al. 2012;Seabury et al. 2017;Smith et al. 2019). Briefly, the general mixed model can be specified as: y ¼ Xb þ Zu þ e, where y represents a n · 1 vector of observed CWD phenotypes, X is a n · f matrix of fixed effects, b is a f · 1 vector representing the coefficients of the fixed effects, u represents the unknown random effect, and Z is a n · t matrix relating the random effect to the CWD phenotypes of interest (Kang et al. 2010;Segura et al. 2012;Seabury et al. 2017;Smith et al. 2019). Herein, it is necessary to assume that VarðuÞ ¼ s 2 g K and VarðeÞ ¼ s 2 e I, whereby VarðyÞ ¼ s 2 g ZKZ ' þ s 2 e I, but in this study Z represents the identity matrix I, and K represents a relationship matrix of all WTD samples.
To solve the mixed model equation using a generalized least squares approach, we must first estimate the relevant variance components (i.e., s 2 g and s 2 e ) as previously described (Kang et al. 2010;Segura et al. 2012;Seabury et al. 2017;Smith et al. 2019). Variance components were estimated using the REML-based (restricted maximum likelihood) EMMA approach (Kang et al. 2010), with stratification accounted for and controlled using a genomic relationship matrix ðG) (VanRaden 2008), as computed from the WTD genotypes. Genomic relationship matrix (GRM) heritability estimates (½h 2 ¼ s 2 g = ðs 2 g þ s 2 e Þ) were produced as previously described (Kang et al. 2010;Segura et al. 2012;Seabury et al. 2017;Smith et al. 2019). Moreover, because previous WTD studies indicate that the probability of CWD infection increases with age (Grear et al. 2006), and that CWD may disparately affect male and female WTD in different U.S regions, including differences in clinical disease progression and mortality (Grear et al. 2006;Edmunds et al. 2016), the possibility for different CWD strains must be considered (Bistaffa et al. 2019). Therefore, the following fixed-effect covariates were specified for comparison of GWAAs: sex, age, U.S. region of origin, and the total number as well as types (0 = none-detected; 1 = lymph node only; 2 = lymph node and obex) of CWD positive diagnostic tissues, with one exception (i.e., sample ID: MQ6Q), as noted above.
For all genomic prediction analyses involving k-fold cross validation, we used GBLUP as previously described (Taylor 2014) and implemented in SVS v8.8.2 or v8.8.3 (Golden Helix), where the variance components were again estimated using the REML-based EMMA technique (Kang et al. 2010) with a genomic relationship matrix ðG) (VanRaden 2008; Taylor 2014). For WTD GBLUP analyses, consider the general mixed model equation: y ¼ X f B f þ u þ e, across n WTD samples where fixed effects specified as B f include the intercept and any additional covariates (i.e., U.S. region, sex, age); but also assume VarðeÞ ¼ s 2 e I, as above, and that the random effects u are additive genetic merits (i.e., genomically estimated breeding values or GEBVs) for these WTD samples, which are produced from m markers as u ¼ Ma, where M is a n · m matrix, and a is a vector where a k is the allele substitution effect (ASE) for marker k: In this study, we used overall normalization for matrix M, as implemented in SVS v8.8.2 or v8.8.3 (Golden Helix), and explored solutions with and without gender corrections (i.e., full dosage compensation, no dosage compensation) (Taylor 2014), to produce GEBVs for all WTD samples as well as estimates of ASE for all SNPs. Moreover, considering that all training set samples precede the validation set, we define Z ¼ ½Ij0, where the width and height of I is given as n t , the width of the zero matrix is given as n v , and the height of the zero matrix is n t . Thus we can partition u, X f , and y according to their origin (i.e., training vs. validation set) as u ¼ u t u v , , and compute a genomic relationship matrix using all samples for use with the EMMA technique (Kang et al. 2010); to implement a mixed model for the training set as follows: where VarðuÞ ¼ s 2 G G and VarðZuÞ ¼ s 2 G ZGZ ' . Finally, we predict the validation set phenotypes as: y v ¼ X fvBf þû v , from the intercept and any validation covariates X fv as well as the predicted values ofû v . Additional formulae and supporting documentation are available at https://doc.goldenhelix.com/SVS/latest/svsmanual/mixedModelMethods/overview.html#gblupproblemstmt. Notably, prior to this study, GEBVs were not estimated or utilized in WTD, and thus they cannot be expected to be intuitive or easily understood by U.S. WTD farmers or relevant regulatory agencies. However, the predicted CWD binary phenotypes are both intuitive and easily understood as estimates of enhanced or reduced susceptibility to CWD. Because GBLUP predicts CWD binary phenotypes across a range of values (i.e., 0 to 1), SVS v8.8.2 and v8.8.3 (Golden Helix) considers predicted values of 0.5 and higher as "1", and , 0.5 as "0", thus facilitating the calculation of important summary statistics which require binary classifications. Justification for rounding is evident within the histograms representing the frequency distributions of the predicted CWD binary phenotypes (Fig. S1), the relevant GEBVs (Fig. S2), and the relationship between the predicted CWD binary phenotypes and the relevant GEBVs (Fig.  S3); with an obvious break that occurs at 0.50 (Fig. S1-S3). Binary summary statistics for GBLUP-based genomic predictions with k-fold (k = 3; k = 5) cross validations (n = 50 iterations) were computed in SVS v8.8.2 or v8.8.3 (Golden Helix) as follows: Area Under the Curve as AUC ¼ U1 n1n2 , where n 1 is the sample size of observations with CWD binary phenotypes of 0, n 2 is the sample size of observations with CWD binary phenotypes of 1, and U 1 ¼ R 1 2 n1ðn1þ1Þ 2 , where R 1 is the sum of the ranks for the predicted binary CWD phenotypes with actual phenotypes of 0 (from CWD diagnostics); Mathews Correlation Co- , where TP is the number of true CWD positives (from CWD diagnostics), TN is the number of true CWD non-detects (from CWD diagnostics), FP is the number of false positives, and FN is the number of false non-detects among the predicted CWD binary phenotypes; Genomic Prediction Accuracy as ACC ¼ y i 2ŷ i .

Randomizing and blinding
All GBLUP-based k-fold (k = 3; k = 5) cross validations (i.e., CWD binary; CWD-scores) were performed using automated random sampling to define the validation set (i.e., to predict on) and the training set, for the specified values of k.
To investigate the potential for developing a genomically-estimated CWD eradication program for farmed WTD, we used genomic best linear unbiased prediction (GBLUP) in conjunction with k-fold cross validation and random sampling (Taylor 2014). Briefly, WTD data (i.e., genotypes, CWD diagnostic phenotypes, and other metadata) were randomly partitioned into k -subsamples (k = 3, k = 5), and one of these subsamples was then selected (i.e., within a discrete fold) to serve as the validation set for genomic prediction; thus the GBLUP model was fit using the remaining (i.e., k -1) subsamples within that fold (i.e., the training data); until all subsamples were used for both training and prediction. All cross validations with random sampling were run for 50 iterations, with each iteration consisting of either three or five folds (k = 3, k = 5), and summary statistics were produced (See Table 1; Methods; Additional Files 6-29 in DRYAD). Binary GBLUP models fit with no fixed-effect covariates (k = 3, k = 5) produced high mean genomic prediction accuracies ($ 0.8167) and specificities ($ 0.9101), with small standard deviations, but lower mean sensitivities (# 0.6496), indicating that false negatives pose the greatest challenge for reducing the prevalence of CWD via genomic prediction (Table 1; See Methods). Similar results were also obtained when binary GBLUP models were fit using sex, age, and U. S. region of origin as fixed-effect covariates (k = 3, k = 5; Table 1). However, in addition to false negatives, some false positive genomic predictions were also observed; most likely due to underlying genomic susceptibility coupled with either very early stages of disease (i.e., CWD non-detect diagnostically) and/or differences in exposure (Tsairidou et al. 2014). All results were robust to the inclusion or exclusion of non-autosomal loci (i.e., X, MT; 123,987 SNPs vs. 120,808 SNPs, respectively), and to full dosage compensation vs. no dosage compensation when putative X-linked SNPs were included (Table 1; Additional Files 6-29 in DRYAD). GBLUP models fit with the CWD-scores (0, 1, 2), thus reflecting the natural progression of disease, produced lower mean genomic prediction accuracies (i.e., # 0.6007; See Methods; DRYAD), regardless of the inclusion or exclusion of fixed-effect covariates (sex, age, U.S. region), nonautosomal loci, or the implementation of full dosage compensation vs. no dosage compensation (k = 3, k = 5).

CONCLUSIONS
Herein, we demonstrate that differential susceptibility to CWD and variation in natural disease progression are both heritable, polygenic traits in farmed U.S. WTD, and that genome-wide SNP data can be used to produce accurate genomic predictions for risk ($ 0.8167); thereby providing the first novel strategy for reducing the prevalence of CWD. Moreover, given the genomic architecture of these traits, we also demonstrate that PRNP genotyping alone cannot be expected to facilitate an eradication program, or to rapidly reduce the overall prevalence of CWD in farmed U.S. WTD.   as well as support from Texas Parks and Wildlife Department (grant 475613) and a charitable gift from the American Chronic Wasting Disease Foundation.