Adapting Genotyping-by-Sequencing and Variant Calling for Heterogeneous Stock Rats

The heterogeneous stock (HS) is an outbred rat population derived from eight inbred rat strains. HS rats are ideally suited for genome wide association studies; however, only a few genotyping microarrays have ever been designed for rats and none of them are currently in production. To address the need for an efficient and cost effective method of genotyping HS rats, we have adapted genotype-by-sequencing (GBS) to obtain genotype information at large numbers of single nucleotide polymorphisms (SNPs). In this paper, we have outlined the laboratory and computational steps we took to optimize double digest genotype-by-sequencing (ddGBS) for use in rats. We evaluated multiple existing computational tools and explain the workflow we have used to call and impute over 3.7 million SNPs. We have also compared various rat genetic maps, which are necessary for imputation, including a recently developed map specific to the HS. Using our approach, we obtained concordance rates of 99% with data obtained using data from a genotyping array. The principles and computational pipeline that we describe could easily be adapted for use in other species for which reliable reference genome sets are available.

genotyping-bysequencing heterogeneous stock rat imputation Advances in next-generation sequencing technology over the past decade have enabled the discovery of high-density, genome-wide single nucleotide polymorphisms (SNPs) in model systems. Comprehensive assays of the standing genetic variation in these organisms has allowed for the identification of quantitative trait loci (QTL) and the application of numerous population genetic and phylogenetic methods. However, due to the high degree of linkage disequilibrium (LD) in QTL mapping populations, sequencing whole genomes is not necessary. Many populations are the result of numerous generations of interbreeding inbred strains, allowing for recombination to produce an admixed population with known founder haplotypes. Due to the relatively slow rate of accumulation of recombination events, these populations contain large chunks of the genome derived from the same founder haplotype. Nearby SNPs are therefore often in strong LD with physically adjacent loci, effectively 'tagging' nearby variation and thereby reducing the number of sites that need to be directly genotyped. Several reduced-representation sequencing approaches that take advantage of LD structure have been previously described (Miller et al. 2007;van Orsouw et al. 2007;Van Tassell et al. 2008;Baird et al. 2008;Huang et al. 2009;Andolfatto et al. 2011;Elshire et al. 2011;Davey et al. 2011;Poland and Rife 2012;Peterson et al. 2012;Sun et al. 2013;Scheben et al. 2017). Using these methods, thousands of SNPs can be identified in large numbers of samples for a fraction of the price of wholegenome sequencing (Chen et al. 2013;He et al. 2014). The advantages of these methods are especially attractive when applied to less commonly utilized species or strains for which genotyping microarrays are not available. (Fu and Yang 2017), chicken (Pértille et al. 2016;Wang et al. 2017), mouse (Parker et al. 2016), fox (Johnson et al. 2015), and cattle (De Donato et al. 2013), among others. The greatly varying genomic composition among organisms necessitates a diverse and customized set of approaches for obtaining high-quality genotypes. As such, both the GBS protocol and computational pipeline require modifications when applied to a new species. Recent work from our group showed that GBS can be effectively applied to outbred mice (Parker et al. 2016;Zhou et al. 2018;Gonzales et al. 2018) and rats (Fitzpatrick et al. 2013). However, those publications used protocols that had not been optimized, leaving significant room for improvement in genotype quality and marker density. Additionally, although several tools and workflows for the analysis of GBS data have been described, including Stacks (Catchen et al. 2013), IGST-GBS (Sonah et al. 2013), TASSEL-GBS (Glaubitz et al. 2014), Fast-GBS (Torkamaneh et al. 2017), and GB-eaSy (Wickland et al. 2017), the majority were developed and optimized for use in plant species. Given the lack of well-developed genomic resources in these species, they do not leverage the wealth of genomic data available for model organisms such as rats. Here we describe the customized computational and laboratory protocols for applying GBS to HS rats.
The HS is an outbred rat population created in 1984 using eight inbred strains and has been maintained since then with the goal of minimizing inbreeding and maximizing the genetic diversity of the colony (Johannesson et al. 2008;Woods and Mott 2017). After more than 80 generations of accumulated recombination events, their genome has become a fine-scale mosaic of the inbred founders' haplotypes. The breeding scheme and the number of accumulated generations has made the HS colony attractive for genetic studies. Additionally, extensive deep sequencing data exists for many inbred rat strains, including the eight HS progenitor strains (Rat Genome Sequencing and Mapping Consortium et al. 2013;Hermsen et al. 2015;Ramdas et al. 2019), allowing for accurate imputation to millions of additional SNPs following direct genotyping of only a subset.
Detailed here are the steps we have taken to optimize a rat GBS protocol and computational pipeline. Drawing on existing protocols (Elshire et al. 2011;Peterson et al. 2012;Parker et al. 2016) as templates, we redesigned our previous GBS approach (Parker et al. 2016;Gonzales et al. 2018) and have developed a novel, reference-based, high-throughput workflow to accurately and costeffectively call and impute variants from low-coverage double digest GBS (ddGBS) data in HS rats. This publication is intended as a resource for others who might wish to perform GBS in rats and should provide a roadmap for adapting GBS for use in new species. We demonstrate that with a suitable reference panel, applying reduced representation approaches and imputation in model systems can provide high-confidence genotypes on millions of genome-wide markers.

MATERIALS AND METHODS
Tissue samples and DNA extraction Samples for this study originated from three sources: an in-house advanced intercross line (AIL) derived from LG/J and SM/J mice , Sprague Dawley (SD) rats from Charles River Laboratories and Harlan Sprague Dawley, Inc. (Gileta et al. 2018), and an HS rat colony (Woods and Mott 2017;Chitre et al. 2018). Early stages of ddGBS optimization utilized AIL genomic DNA extracted from spleen by a standard salting-out protocol. Later optimization steps were performed using genomic DNA from SD rats extracted from tail tissue using the PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific, Waltham, MA). HS rat DNA was extracted from spleen tissue using the Agencourt DNAdvance Kit (Beckman Coulter Life Sciences, Indianapolis, IN). All genomic DNA quality and purity was assessed by NanoDrop 8000 (Thermo Fisher Scientific, Waltham, MA). Interestingly, we observed that rat genomic DNA derived from either spleen or tail tissue appears to degrade faster than mouse genomic DNA following extraction by either of the above protocols; therefore, we recommend storing rat genomic DNA at -20°and using it within weeks of extraction whenever possible.
In silico digest of rat genome We used in silico digests to aid in the selection of restriction enzymes, with the goal of maximizing the proportion of the genome captured at sufficient depth to make confident genotype calls (Kent et al. 2002). We used the restrict function in EMBOSS (version 6.6.0) (Rice et al. 2000) in conjunction with the REBASE database published by New England BioLabs (NEB; version 808) (Roberts and Macelis 1999) to perform in silico digest of the current release of the Norway brown rat reference genome, designated rn6. For the primary restriction enzyme, we chose PstI, which had been successfully used in numerous projects (Fitzpatrick et al. 2013;Parker et al. 2016;Gonzales et al. 2018). We performed the digest with PstI alone and then with PstI paired with each of 7 secondary enzymes: AluI, BfaI, DpnI, HaeIII, MluCI, MspI, and NlaIII. We only considered fragments with one PstI cut site and one cut site from the secondary enzyme because the adapter and primer sets are designed to only allow these fragments to be amplified.

Restriction enzyme selection
Initial criteria for selecting a secondary restriction enzyme were a 4bp recognition sequence, no ambiguity in the recognition sequence (i.e., N's), compatibility with the NEB CutSmart Buffer, and an incubation temperature of 37°. The list of enzymes meeting these criteria at the time included AluI, BfaI, DpnI, HaeIII, MluCI, MspI, and NlaIII. Using the in silico digest data, we looked to maximize the portion of the genome contained within a fragment size range of 125-275bp (250-400bp with annealed adapters and primers) ( Figure 2; Table 1). We excluded enzymes that produced blunt ends, both because it would be more difficult to anneal adapters to blunt ended fragments and because our adapters would then also anneal to blunt ends produced by DNA shearing. We also excluded methylation-sensitive enzymes, as we did not want to limit the breadth of our sequencing efforts, accepting the possibility of read pileup in repetitive regions. Based on these criteria, as well as maximizing the percent of the genome captured, NlaIII, BfaI, and MluCI were selected for further testing. The final choice of enzyme (NlaIII) was determined empirically and is detailed in the Results.

ddGBS library preparation and sequencing
The full ddGBS protocol is available in File S1. In brief, approximately 1mg of DNA was used per sample. Sample DNA, PstI barcoded adapters, and NlaIII Y-adapter were combined in a 96-well plate and allowed to evaporate at 37°overnight. The PstI adapter barcode is "inline" such that each sequencing read from a given sample contains both the PstI overhang sequence (4bps) and a unique adapter sequence (4-8bps) prior to the beginning of the insert sequence. Sample DNA and adapters were re-eluted on day two with a PstI/NlaIII digestion mix and incubated at 37°for two hours to allow for complete digestion. Ligation reagents were then added and incubated at 16°for one hour to anneal the adapters to the DNA fragments, followed by a 30-minute incubation at 80°to inactivate the enzymes. Sample libraries were purified using a plate from a MinElute 96 UF PCR Purification Kit (QIAGEN Inc., Hilden, Germany), vacuum manifold, and ddH 2 O. Once re-eluted, libraries were quantified in duplicate with Quant-IT PicoGreen (Thermo Fisher Scientific, Waltham, MA) and pooled to the desired level of multiplexing (i.e., 12, 24, or 48 samples per library). Each pooled library was then concentrated by splitting the pooled volume across 2-3 wells of the MinElute vacuum plate and resuspending the library at desired volume for use in the Pippin Prep. The concentrated pool was quantified to ensure the gel cassette was not overloaded with DNA (.5mg). The pool was then loaded into the Pippin Prep for size selection (300-450bps) using a 2% agarose gel cassette on a Pippin Prep (Sage Science, Beverly, MA). Size-selected libraries were then PCR amplified for 12 cycles to add the Illumina sequencing primers and increase the quantity of DNA, concentrated again, and checked for quality on an Agilent 2100 Bioanalyzer with a DNA 1000 Series II chip (Agilent Technologies, Santa Clara, CA). Bioanalyzer results were used to assure sufficient library concentration and to identify excessive primer dimer peaks.
As a pilot, an initial 96 HS samples were sequenced, 12 samples per library, at Beckman Coulter Genomics (now GENEWIZ) on an Illumina HiSeq 2500 with v4 chemistry and 125bp single-end reads. Subsequently, we began using a set of 48 unique barcoded adapters (File S2) to multiplex 48 HS samples per ddGBS library. Each library thereafter was run on a single flow cell lane on an Illumina HiSeq 4000 with 100bp single-end reads at the IGM Genomics Center (University of California San Diego, La Jolla, CA). We also obtained ddGBS data on the HiSeq 4000 for a select set of 96 samples that had been previously genotyped on a custom Affymetrix Axiom MiRat 625k microarray (Part#: 550572), providing us with a "gold standard" array-based dataset with which to compare to our ddGBS data.

Evaluation of ddGBS pipeline performance
The steps required to call and impute genotypes from raw ddGBS sequencing data are presented in Figure 1. During optimization of the pipeline, performance was assessed by two primary metrics: (1) the number of variants called and (2) genotype concordance rates for calls made in 96 HS rats that had both ddGBS genotypes and array genotypes from a custom Affymetrix Axiom MiRat 625k microarray. There were two checkpoints in the GBS pipeline where genotype quality (measured by concordance rate) was assessed. The first was after "internal" imputation with Beagle Browning 2009, 2016), whereby we leverage information from samples that had sufficient read depth to make a confident genotype call at a given locus in order to impute the genotype of other samples that had lower read depths at that locus. The second checkpoint was after "external" imputation, meaning imputation to our reference panel with IMPUTE2 to obtain genotype calls at loci we did not directly capture n■ Table 1 Restriction enzyme options for double digest. The percent genome in region columns indicate the percentage of the genome that falls within the provided fragment size ranges and can therefore be captured by GBS a Calculated using rn6 genome length of 2,870,182,909bps. b Restriction enzyme is methylation sensitive. Figure 1 ddGBS sequencing data analysis workflow. Each step of the workflow is described in the text. by our GBS method (Howie et al. 2009(Howie et al. , 2012. A third, additional metric we checked was the transition to transversion ratio (T S T V ), which is expected to be 2 for intergenic regions. The steps as outlined in the following sections reflect the final version of the pipeline. Variant calling and imputation steps utilized all available samples run on the HiSeq 4000 (3,000+ rats), though genotype concordance rates could only be calculated for the set of 96 HS samples for which we had array genotype calls.

Demultiplexing
The PstI adapter barcodes were used to demultiplex FASTQ files into individual sample files. Three demultiplexing software packages were tested: FASTX Barcode Splitter v0.0.13 [RRID: SCR_005534] (Hannon Lab 2010), GBSX v1.3 (Herten et al. 2015), and an in-house Python script (Parker et al. 2016). Reads that could not be matched with any barcode (maximum of 1 mismatch allowed), or that lacked the appropriate enzyme cut site, were discarded. Samples with less than two million reads after demultiplexing were discarded as these appeared to be outliers ( Figure S4) and were observed to have high rates of missingness in their genotype calls. Data concerning demultiplexing are shown in Table S1 and are from a single HS rat sequenced in a 12-sample library on one lane after demultiplexing and adapter/quality trimming.

Read alignment and indel realignment
The Rattus norvegicus genome assembly rn6 was used as the reference genome for read alignment with the Burrows-Wheeler Aligner v0.7.5a (BWA) [RRID: SCR_010910] (Li and Durbin 2009) using the mem algorithm. We then used GATK IndelRealginer v3.5 [RRID: SCR001876] (McKenna et al. 2010) to improve alignment quality by locally realigning reads around a reference set of known indels in 42 whole-genome sequenced inbred rat strains, including the eight HS progenitor strains (Hermsen et al. 2015).
Internal imputation Beagle v4.1 Browning 2009, 2016) was used to improve the genotyping within the samples without the use of an external reference panel. Missing and low quality genotypes were imputed by borrowing information from other individuals in the dataset with high quality information at these same variant sites. Before settling on the combination of ANGSD-SAMtools and Beagle for genotype calling and internal imputation, we also experimented with GATK's HaplotypeCaller (McKenna et al. 2010) with various parameter settings, but with unsatisfactory results (Figure 3).
Quality control for genotypes before imputation using an external reference panel To verify the quality of the "internally" imputed genotypes prior to imputing SNPs from the 42 inbred strain reference panel (Hermsen et al. 2015), we checked concordance rates for the 96 HS animals with array genotypes, examined the T S T V ratio, and assessed whether the sex as recorded in the pedigree records agreed with the sex empirically determined by the proportion of reads on the X-chromosome out of the total number of reads ( Figure S1). We also identified Mendelian errors using the--mendel option in plink and known pedigree information for 1,136 trios from 214 families within the HS sample.
Using the fraction of the trios that were informative for a given SNP and the equation 1-(1-2p(1-p)) 3 , where p represents the minor allele frequency of the allele, we formed curves for the distributions of the expected number of Mendelian errors for both SNPs and samples and chose the inflection points as thresholds for the number of Mendelian errors allowed.
Data preparation for phasing with external reference panel First, in our study sample of 96 samples, we only retained variants previously identified in the 8 HS founder strains because we expected the polymorphisms in our samples to be limited to the variation present in the founders (Hermsen et al. 2015;Ramdas et al. 2019). Further, to improve imputation efficiency, we employed a prephasing step with IMPUTE2 (prephase_g) (Howie et al. 2012) prior to imputation. Pre-phasing only needs to be performed once, allowing us to reuse the estimated haplotypes from our dataset for imputation with multiple different reference panels. A flowchart outlining the pre-phasing protocol is presented in Figure S2.

Genetic maps
Genetic maps are required for phasing and imputation with IM-PUTE2. When we began this project, no strain-specific recombination map was available for HS rats. Thus, we considered a sparse genetic map for SHRSPxBN (Steen et al. 1999). We also tested two types of linearly interpolated genetic maps, with recombination rates set at either 1cM/Mb or the chromosome specific averages for rats, as reported by Jensen-Seaman et al. (Jensen-Seaman 2004). Lastly, late in the course of this project, we experimented with an HS-specific genetic map developed by Littrell et al. (Littrell et al. 2018).

Imputation to reference panel
We used a combination of existing sequencing and array genotyping data from the HS rat founder and other inbred laboratory rat strains (Hermsen et al. 2015) as reference panel for imputation. Genotype data underwent QC and were phased by Beagle into single chromosome haplotype files. Haplotype files were then created using the workflow detailed in Figure S2. Imputation by IMPUTE2 was performed in 5Mb windows using the aforementioned reference panels and genetic maps.

Data availability
The ddGBS protocol and adapter sequences used to generate the data presented in this paper are available at https://doi.org/10.6084/

ddgbs optimization
Previous projects utilizing GBS in mice and rats (Fitzpatrick et al. 2013;Parker et al. 2016;Gonzales et al. 2018) often encountered an issue where certain regions of the genome experienced high pileups of reads per sample (.100x), while other regions were covered by just 1-2 reads. This read distribution imbalance can be caused in part by PCR amplification bias, where a subset of fragments are preferentially amplified until they dominate the final library (Kanagawa 2003;Aird et al. 2011). These previous protocols utilized 18 cycles of amplification. We tested reducing this to 8, 10, 12, or 14 cycles and found that below 12 cycles, there was insufficient PCR product to accurately quantify and pool for sequencing. The reduction in the number of PCR cycles was expected to reduce PCR bias, though this was not explicitly tested. Another concern regarding previous sequencing results was an excess of long fragments (.700bps as determined by in silico digest). We observed that longer sequencing fragments often do not provide sufficient reads to make confident genotype calls (, 5 reads per sample), putatively due to inefficient bridge amplification and clustering on Illumina flow cells. Sequencing these long fragments is therefore wasteful. We tested three methods of combating this issue, including increasing the digestion time or enzyme concentration, performing size selection on the libraries, and using a two-enzyme restriction digest.
We considered the possibility that the restriction enzyme digests might not be running to completion. To address this possibility, we increased the duration of the digestion from 2 hr to 3 or 4 hr. We also tried increasing the number of units of PstI enzyme added, to ensure complete digest. Neither of these modifications impacted the final fragment length distribution of the library, indicating that the digest was reaching completion after 2 hr using the original concentration of PstI (File S3wells 1-6).
Our previous GBS protocol did not have an explicit library fragment size selection step. The final library was purified using a MinElute PCR Purification Kit (QIAGEN Inc., Hilden, Germany), which isolates PCR products 70bp-4kb in length, leaving a wide range of fragment sizes in the final library, under the assumption that only shorter fragments would bridge amplify on the flow cell. This method was imprecise and had low reproducibility, negatively impacting our ability obtain reads at consistent sites across libraries. Rather than attempt size selection by gel extraction, we chose to utilize a Pippin Prep, which automates the elution of DNA libraries of desired fragment size ranges. By using this automated size selection, we reduced the proportion of the genome targeted for sequencing, Additionally, since restriction enzymes make predominantly consistent cuts across samples (barring the presence of polymorphisms in RE recognition sites), it is ensured that highly similar sets of genomic fragments will be sequenced across sample libraries. Since the clustering process involves a bridge amplification step that preferentially amplifies library fragments with shorter insert sizes (Illumina, Inc. 2014), we kept the size selection window narrow (250-400bps) to avoid introducing a bias in which fragments were sequenced. A comparison of the fragment size distributions for the protocols before and after introduction of the Pippin Prep is shown in File S4.
To increase the proportion of the genome captured within the fragment size window, we pursued a double digest of the genome using a secondary enzyme with a more frequently occurring recognition sequence. When used alone, in silico digest of the rn6 reference genome by PstI (Figure 2; Table 1) showed that only 0.5% of the genome would have fallen within a 150bp fragment size window selected on the Pippin Prep. Previously, we performed GBS in CFW mice using the single-enzyme approach and observed that large regions of the genome that were not covered by sequencing reads (Parker et al. 2016). Therefore, we sought to increase the fraction of the genome that was accessible to GBS, so that there would be sufficient SNPs to tag the majority of the variation in the rat genome. Additionally, we were concerned about potential biases in coverage, heterozygosity, and the minor allele frequency (MAF) spectrum that may be introduced by a less complete capture of the genome. Flanagan and Jones have performed an empirical study comparing single-to double-digest RAD-seq and found that double-digest RAD-seq had lower rates of allelic dropout, decreased variance in between-sample per SNP coverage, less allele frequency inflation due to PCR bias, and reduced batch effects (Flanagan and Jones 2018).
The number of fragments with one of each of the cut sites was summed for all observed lengths and the results summarized in Figure 2 and Table 1. BfaI, MluCI, and NlaIII were chosen for further testing due to their compatibility with PstI digestion reagents and temperatures, sticky ends, and the proportion of the genome falling in the size selection window in the in silico analysis. We ruled out BfaI because it only had a 2bp overhang after cleavage, which we empirically showed leads to a high concentration of adapter dimer in the sequencing libraries (S5 File). NlaIII was chosen over MluCI because it contained the greatest portion of the genome in the desired size selection window.
In our previous GBS protocol, all fragments were cut on both ends by PstI. By using a substantially lower concentration of the barcoded PstI adapter than the common PstI adapter, we ensured the barcoded adapter would be the limiting reagent and the majority of fragments with an annealed barcoded adapter would have a common adapter on the other end. This is crucial, as having one of each of the adapters is required for proper amplification of the fragments on the flow cell. However, when using both PstI and NlaIII, the library is predominantly composed of fragments cut on both sides by NlaIII (File S6), which will amplify during PCR with a common adapter, but not on the flow cell. Therefore, we employed a Y-adapter  to control the direction of the first round of PCR and prevent twosided NlaIII fragments from dominating the final sequencing library (File S2).
We tested numerous quantities of PstI and NlaIII adapters in an attempt minimize the amount used and avoid adapter dimers in the final libraries. For the barcoded PstI adapters, we tested 120pmol, 60pmol, 20pmol, 4.0pmol, 2.67pmol, 1.60pmol, 0.53pmol, and 0.20pmol; for the NlaIII Y-adapter, 30pmol, 10pmol, 5.0pmol, 4.0pmol, and 1.0pmol (Files S7 & S8). We found that 0.20pmol of PstI adapter and 4pmol of NlaIII Y-adapter yielded sufficient library and minimized the presence of adapter dimers.
We sequenced a trial flow cell with 8 pooled ddGBS libraries of 12 SD rat samples each (96 total) on a HiSeq 2500 (Illumina, San Diego, CA) with 125bp reads and v3 chemistry, obtaining an average of 15.3 million reads per sample. Given the NlaIII in silico digest results suggested we were capturing 3.4% of the genome and that we were using 125bp reads, this corresponded to approximately 20x coverage of captured sites. We subsequently increased the number of samples to 48 per library for the HS rats because we hypothesized 5x would be sufficient coverage per sample when utilizing imputation to a reference panel. We also discovered that a portion of the reads contained segments of the NlaIII adapter sequence, indicating there were fragments with insert sizes smaller than 125bps in the final library. To avoid this, we increased the fragment size range to 300-450bps (Table 1), which corresponds to a 175-325bp insert size once the adapters and primers are accounted for. We noted however that the library size distribution obtained from the Pippin Prep was uniformly shifted toward higher fragment lengths ( Figure S3). This is a result of the high concentrations of our libraries after pooling and loading the gel cartridge near the upper limit of the recommend number of micrograms of DNA, which can cause slower migration of the DNA across the gel matrix. Figure 2 In silico digest fragment distributions for PstI and potential secondary restriction enzymes. Each panel represents an independent digest of rn6 with the listed enzyme(s). Regions highlighted in blue are fragments that would be selected by the Pippin Prep (125-275bps) after annealing adapters and primers (+125bps). These regions are quantified in Table 1 by multiplying the length of the fragments by the number of fragments to estimate the portion of the genome captured.
The final ddGBS protocol can be found in File S1 and the necessary primer and adapter sequences in File S2. This protocol was used for the sequencing of all HS rats included in subsequent computational optimization steps.

Demultiplexing
The number of base pairs of sequencing data retained after demultiplexing was fairly consistent across demultiplexing software (Table S1). We ultimately decided to use FASTX Barcode Splitter because it yielded the greatest number of reads after quality/adapter trimming and had faster run times. An average of 330 million 100bp reads were obtained per HS library, resulting in 7 million reads per sample. Figure S4 shows the distribution of reads counts for all samples after demultiplexing.
Adapter and quality trimming Read quality was substantially improved after trimming the barcode and adapter sequences and low-quality base pairs at the ends of reads ( Figure S5). Overall read counts were only marginally reduced by quality trimming (Table S1). We observed that the number of called variant sites and the genotyping rate were both greater when using reads initially processed by cutadapt (Martin 2011) than reads processed by the FASTX_Toolkit (Table S2). Importantly, a large portion of the additional identified variants were known variant sites from the 42 inbred strains reference set ( Figure S6), indicating the elevated call rate was at least in part due to capturing more true variant sites. We viewed this as sufficient support for proceeding with cutadapt for adapter removal and quality trimming.

Mapping quality
The number of called variants and genotype call rates were identical at read mapping quality (mapQ) thresholds of either 20 or 30 (Table S3) within ANGSD. As the ANGSD mapQ threshold was raised to 45, there was a small reduction in the number of called variants, and then much greater losses at thresholds of 60 or 90. Fortunately, discordance rates between ddGBS and array genotypes were stable at both low and high mapQ thresholds, despite the putatively higher quality of the alignments ( Figure S7). This permitted us to select a lower mapQ threshold (mapQ = 20), maximizing the number of variants called without sacrificing genotyping accuracy. Figure 3 shows that across all thresholds of genotype discordance (comparing ddGBS with the array genotyping data), the combination of the ANGSD-SAMtools with BEAGLE produced more SNPs than GATK's HaplotypeCaller (McKenna et al. 2010;DePristo et al. 2011). This observation held when variants were limited to biallelic sites and SNPs with an MAF . 0.05 ( Figure S8). We speculate that the poorer performance of HaplotypeCaller may be due in part to the sparsity and non-uniform distribution of GBS genotype data across the genome and the high level of genotype call missingness across samples prior to imputation.

Variant calling
ANGSD supports four different models for estimating genotype likelihoods: SAMtools, GATK, SOAPsnp and SYK. We compared the methods to determine which produced the most SNPs with the lowest discordance rates. The SOAPsnp model demonstrated an advantage in genotype accuracy and number of variants called post-imputation with Beagle ( Figure S9). However, SOAPsnp requires considerably more time (1.7x for 96 samples) and memory and scales poorly with sample size. With greater than 2,000 samples, we were unable to allocate sufficient memory for the SOAPsnp model to successfully run, even after dividing the chromosomes into several, smaller chunks. The marginal benefits of SOAPsnp in accuracy and number of variants were far outweighed by its limitations when applied to a large sample set. The GATK model showed a greater number of variants for more lenient genotype discordance rate threshold. This is in contrast with what was observed in Figures 3 and Figure S8 because ANGSD utilizes the direct genotype likelihood method from the first implementation of GATK's Unified Genotyper, whereas we had previously tested GATK's HaplotypeCaller. Interestingly though, as the stringency for discordance rate increased, the number of variants converged across the SAMtools, GATK, and SOAPsnp models. We proceeded with the SAMtools model for genotype likelihood estimation due to its previous support in the GBS literature (Torkamaneh et al. 2017), accepting a nominal decrease in highly concordant variants ( Figure S9) for a large reduction in run time and memory usage.
Imputation to reference panel Imputation is used in two complimentary ways in our protocol. As described earlier, after ddGBS, not all samples will have sufficient sequencing coverage at captured polymorphic loci to make a confident genotype call. Therefore, we first use imputation from other well-covered samples to "fill in the blanks" and assign genotypes to SNP loci in the subset of individuals that lacked confident calls at these sites. After these missing genotypes have been imputed in all samples, we then use the genotype information we have for the SNPs captured by ddGBS along with the reference panel data on the original 8 HS founders (Hermsen et al. 2015;Ramdas et al. 2019) to impute genotype calls at sites that were inaccessible to ddGBS sequencing. Thus, our second application of imputation is similar to the human genetics application in which imputation using 1000 Genomes (1000Genomes Project Consortium et al. 2010) increases the number of SNPs beyond those included on a given microarray platform. IMPUTE2 was selected over Beagle for this application because it has been shown to perform better with smaller reference panels from populations with substantial LD (Frischknecht et al. 2014;Friedenberg and Meurs 2016) Before starting this imputation step, we observed an inflated transition/transversion ratio (Table S4) in our ANGSD-SAMtools/ Beagle SNPs. This issue was ameliorated when the SNP set was filtered for only "known" variants that were previously identified in either the 42 inbred strains (Hermsen et al. 2015) or the 8 deepsequenced HS founders (Ramdas et al. 2019). For imputation, we therefore only provided IMPUTE2 with previously identified variant sites from our ANGSD-SAMtools/Beagle output. Prior to running IMPUTE2, we also filtered the variants for biallelic sites with a genotype call in at least two individuals. Using pedigree data for the HS rats, we further removed samples showing an order of magnitude higher level of Mendelian error than the sample mean. We further removed SNPs that had an error rate surpassing a threshold of 0.005 ( Figure S10; inflection point). There were 4 samples and 4,179 SNPs removed from subsequent analyses. Lastly, we removed any samples where the X chromosome read ratio (reads mapped to the X chromosome divided by total reads) was incompatible with their reported sex. We used hard threshold of 3% of total reads (empirically determined), where individuals with more than 3% X-mapped reads were determined to be female and below 3%, male ( Figure S1).
There were three major genomic reference datasets available for the HS rats. The first reference set was obtained from Baud et al. (Rat Genome Sequencing and Mapping Consortium et al. 2013) and contained sequence data and genotype calls for the 8 founders of the HS. The second came from Hermsen et al. (Hermsen et al. 2015) which contains sequence and genotype data on 42 distinct laboratory rats strains and substrains, 8 of which were the founders of the HS from Baud et al., but analyzed alongside a new set of strains. The third reference set came from Ramdas et al. (Ramdas et al. 2019), who independently performed whole-genome sequencing and made genotype calls on the 8 HS founder strains. It was unclear which set of genotypes would provide the best reference for imputation from our ddGBS data, so we tested five different possible subsets of available data (Table 2)  HaplotypeCaller and Beagle at various thresholds of genotype discordance with array data. Calls were made using the 96 HS rats with array data. The x-axis represents the genotype discordance rate thresholds and the y-axis is the number of variants that surpass that threshold for each genotype calling method.
n■ Table 2 Imputation accuracy based on different variant reference panels for IMPUTE2. The table includes five different possible reference panels for imputation. The 42 inbred strains, 34 non-founder inbred strains, and 8 HS founders from the 42 inbred strains all were derived from Hermsen et al. 2015(Hermsen et al. 2015. The UMich 8 HS founders were obtained from Ramdas et al. 2019(Ramdas et al. 2019 In contrast, using the 42 rat inbred strains displayed a balance of high accuracy and low missingness, leading us to choose this as our reference set. To better understand the role of the 8 founder strains, which were part of the 42 strain reference panel, we created a reference panel that included only the 34 non-HS founder strains. As expected, discordance rates were much higher when only considering non-founders. However, the genotype missingness was lower for the 34 than the 8 founders alone, suggesting a combination of the two was the optimal set. IMPUTE2 requires a genetic map. As described in the methods section, we considered four different genetic maps, two that were empirically derived and two that were linear extrapolations based on the physical map ( Figure S11). All genetic maps performed similarly (Table S5). Surprisingly, the linear genetic maps performed just as well as the HS-specific map (Littrell et al. 2018). Thus, for simplicity, we chose to use the chromosome-specific values initially published by Jensen-Seaman (Jensen-Seaman 2004).
To obtain our final set of 3.7 million variants, a final round of variant filtering was performed after imputation to the 42 strain reference panel. We removed SNPs with MAF , 0.005, a postimputation genotyping rate , 90%, and SNPs that violated HWE with P , 1x10 210 .

DISCUSSION
The use of microarrays and WGS for genotyping large samples in model organisms remains cost-prohibitive. There is therefore an urgent and wide-spread need for high-performance and economical methods of obtaining genome-wide genotype data. While reduced-representation approaches have been utilized in numerous species of plants and animals, including rodents (Peterson et al. 2012;Fitzpatrick et al. 2013;Parker et al. 2016;Zhou et al. 2018;Gonzales et al. 2018), there has yet to be a published protocol optimized specifically for rats. Prior to sequencing thousands of HS samples with GBS for our mapping efforts, we wanted to ensure we were capturing the greatest possible number of highquality variants at the lowest possible cost. The protocol we present here is the culmination of careful testing and optimization of each step of the GBS protocol for rats. We have now applied the approach to 4,973 HS rats, as well as 4,608 Sprague Dawley rats (Gileta et al. 2018).
Our previous GBS protocol (Parker et al. 2016), which was designed for use with CFW mice, was unsuitable for our current genotyping efforts in HS rats, due to the much higher levels of genetic diversity in the HS population. There are multiple reasons we chose to develop our own computational pipeline for GBS rather than using existing workflows. Foremost, the prominent GBS analysis pipelines were developed and optimized for use in crop species (Sonah et al. 2013;Catchen et al. 2013;Glaubitz et al. 2014;Torkamaneh et al. 2017;Wickland et al. 2017), some of which are polyploid and have differing levels of variation and LD than outbred rodent populations. Additionally, there were elements of each pipeline that did not meet our needs or lacked customizability. For instance, TASSEL-GBS v2 (Glaubitz et al. 2014) trims all reads to 92 base pairs; however, other projects underway in our lab utilized up to 125bp reads, leading to a 20% reduction in data. TASSEL-GBS also ignores read base quality scores, which are informative in probabilistic frameworks for estimating uncertainty in alignments and variant calls (Li et al. 2008;DePristo et al. 2011;Nielsen et al. 2011), and uses a naïve binomial likelihood ratio method for calling SNPs. Stacks has previously shown poor performance in demultiplexing (Herten et al. 2015;Torkamaneh et al. 2017) and does not make use of the reference genome for priors when calling SNPs (Catchen et al. 2013). Fast-GBS relies on Platypus  for variant calling (WGS500 Consortium et al. 2014;Torkamaneh et al. 2017), which employs a Bayesian method of constructing candidate haplotypes that works poorly with low-pass sequencing data and does not scale well to large sample sizes (Li et al. 2018). Lastly, none of these pipelines included an imputation step, which is crucial for filling in missing genotypes in GBS data and can provide millions of additional SNPs given an appropriate composite reference panel (Howie et al. 2011;Huang and Tseng 2014).
Though we have not explicitly tested each alternate GBS pipeline for the purposes of this publication, this has been recently done by Wickland et al. (Wickland et al. 2017). Their pipeline GB-eaSy, which ours most closely resembles, was found to be superior by a number of metrics to Stacks, TASSEL-GBS, IGST, and Fast-GBS. Similar to GB-eaSy, our pipeline utilizes a doubledigest GBS protocol, aligns reads to the reference genome with bwa mem, and uses the SAMtools genotype likelihood model for calling SNPs (Li 2011). The combination of bwa mem and SAMtools algorithm was independently shown to have the best performance for calling SNPs from Illumina data (Hwang et al. 2015), further supporting our choice of these programs for read alignment and variant calling. Additionally, using the ANGSD wrapper provided us with the ability to convert the posterior genotype probabilities into genotype dosages for mapping studies (Korneliussen et al. 2014).
A minor difference between GB-eaSy and our pipeline is the use of cutadapt (Martin 2011) rather than GBSX (Herten et al. 2015) for demultiplexing, though both performed equally well (Table S1). The primary improvement is our extension of the pipeline with the implementation of effective internal and reference-based imputation steps using the 42 inbred rat genomes (Hermsen et al. 2015) and 8 deep-sequenced HS founders from UMich (Ramdas et al. 2019). There are two stages of imputation in our pipeline. The first one is accomplished by Beagle and aims to fill in missing genotypes at called variants using information from other samples. This raises the genotype call rate to 100%, but it may also introduce errors due to insufficient information, emphasizing the need for careful filtering steps. The second stage of imputation made use of IMPUTE2 and an external reference panel of variants called from WGS data on the 8 inbred HS founders, as well as 34 additional inbred rat strains. We decided to include the 34 additional strains because of the elevated genotyping rate we observed upon their inclusion in the IMPUTE2 reference panel. We attribute this to the presence of haplotypes that exist in both the 8 the HS founder strains and a subset of the 34 additional strains in this panel. The benefits of using a composite reference panel have been previously noted (Zhang et al. 2013;Huang and Tseng 2014); there is increased accuracy and decreased missingness in the imputed genotype data.
In summary, we have adapted a GBS protocol and genotyping and imputation pipeline to obtain dense genotypes on genome-wide markers in highly-multiplexed HS rats. After quality filtering on the level of SNP and sample, over 3.7 million SNPs were called with a concordance rate of 99%. The ddGBS protocol and bioinformatic methods used to produce this data are publicly available, easy to handle, and cost-effective. The presented workflow could be feasibly followed with marginal modifications for application in other species. The steps taken toward optimizing the wet lab protocols are easily applied to novel organisms, as is the computational pipeline so long as there are reliable reference genome sets available for use in alignment and imputation.