A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics

Abstract The increasing availability of genomic resequencing data sets and high-quality reference genomes across the tree of life present exciting opportunities for comparative population genomic studies. However, substantial challenges prevent the simple reuse of data across different studies and species, arising from variability in variant calling pipelines, data quality, and the need for computationally intensive reanalysis. Here, we present snpArcher, a flexible and highly efficient workflow designed for the analysis of genomic resequencing data in nonmodel organisms. snpArcher provides a standardized variant calling pipeline and includes modules for variant quality control, data visualization, variant filtering, and other downstream analyses. Implemented in Snakemake, snpArcher is user-friendly, reproducible, and designed to be compatible with high-performance computing clusters and cloud environments. To demonstrate the flexibility of this pipeline, we applied snpArcher to 26 public resequencing data sets from nonmammalian vertebrates. These variant data sets are hosted publicly to enable future comparative population genomic analyses. With its extensibility and the availability of public data sets, snpArcher will contribute to a broader understanding of genetic variation across species by facilitating the rapid use and reuse of large genomic data sets.


Introduction
In the past decade, rapidly declining sequencing costs have led to a dramatic expansion in the availability of genomic resequencing data sets in diverse organisms, fueling a wide range of novel insights, including the prevalence of adaptive introgression between species (Huerta-Sánchez et al. 2014;Lamichhaney et al. 2015;Jones et al. 2018), the molecular basis of repeated local adaptation (Jones et al. 2012;Hill et al. 2019;Wooldridge et al. 2022), and the complex demographic histories of humans (Nielsen et al. 2017;Fan et al. 2023) and animals of conservation relevance (Robinson et al. 2018).In parallel, rapidly expanding efforts to generate high-quality reference genomes across the Tree of Life (Rhie et al. 2021;Lewin et al. 2022) are poised to empower population genetic inference across a wide diversity of organisms.The massive accumulation of existing genomic data sets facilitated by these advances can enable broad comparisons between diverse populations and uncover generalized principles that may explain processes that generate diversity across life.These questions include the determinants of molecular variation among species (Romiguier et al. 2014;Corbett-Detig et al. 2015;Buffalo 2021) and indirect estimates of the rates of loss of genetic variation among populations (Exposito-Alonso et al. 2022).
However, despite the rapid increase in accessibility of public sequencing data from diverse organisms, comparative population genetics and reuse of public data remain challenging for several reasons.In the absence of standardized variant calling pipelines for nonhuman species (Regier et al. 2018), computational batch effects introduced by differences in reference choice, alignment, and variant calling algorithms complicate efforts to jointly analyze existing variant calls across populations and species (Lek et al. 2016;Jia et al. 2020;Breton et al. 2021).Considerations must also be given to data quality prior to data processing, particularly in cases of low coverage (Lou et al. 2021), and workflows must be flexible to accommodate these considerations.Because these computational and algorithmic choices can impact downstream analysis (Kulkarni and Frommolt 2017), comparative projects often must reanalyze raw data to produce comparable data sets, which can be computationally expensive.
Extensible, reproducible bioinformatic pipelines can help address these challenges, to facilitate both primary analysis of complex tasks such as variant calling and also allow for consistent reanalysis (Wratten et al. 2021).While reproducible workflows have had a major impact on human population genetics (Chen et al. 2022), the need for significant expertise to adapt pipelines optimized for human genetics to diverse species is a major technical hurdle for many researchers.Additionally, resequencing data sets are increasingly rapidly in scale (Ellegren 2014), driving a need for workflows optimized for computational efficiency and flexibility to be used across a variety of compute resources, including cloud resources that eliminate the need for costly on-site infrastructure (Mangul et al. 2019).
Due to the popularity and need for efficient and reproducible workflows, several solutions have already been proposed for variant calling pipelines (Czech and Exposito-Alonso 2022; Cullen and Friedenberg 2023).Here, we present snpArcher, a reproducible workflow for data set acquisition, variant calling, quality control (QC), and downstream analysis that is optimized for nonmodel organisms and comparisons across data sets (available at https://github.com/harvardinformatics/snpArcher).snpArcher implements a combination of several notable features not included in other existing solutions that address the challenges presented by the expanding scale of comparative population genomic studies.First, our workflow is optimized for nonmodel species, which often lack gene annotations, known variant sites, and other genomic information typically required for human-optimized pipelines.Second, we take advantage of the huge compute power available through cloud resources and large high-performance computing (HPC) clusters by highly parallelizing the workflow's variant calling step and thus greatly reducing analysis time.Finally, we designed snpArcher to be modular and easily extended.By providing module contribution guidelines and example analysis modules, we hope that users will be able to develop and contribute their own modules.This will enrich the snpArcher ecosystem and cater to a diverse range of genomic analyses.
To enable rapid analysis of a growing set of variant calls created in a functionally equivalent way, we apply this workflow to reanalyze public sequencing data from 26 focal species of nonmammalian vertebrates and make the resulting variant calls available for public use.Furthermore, we provide examples of analysis and visualization modules, and we use these to exemplify and enumerate a suite of criteria for future module contributions to this project.This new and immediately available toolset will enable highly reproducible comparative population genomic analyses for a broad range of taxa.

Overview of snpArcher
We developed snpArcher, a comprehensive workflow for the analysis of polymorphism data sampled from nonmodel organism populations (Fig. 1).This workflow accepts short-read sequence data and a reference genome as input and ultimately produces a filtered, high-quality variant call format (VCF) genotype file for downstream analysis.It also accepts as input accession numbers for reads and reference genome, which are then automatically downloaded from public repositories.We largely follow the Genome Analysis Toolkit (GATK) best practices (Van der Auwera et al. 2013) to map reads to a reference genome, call individual-level variants, generate population-level consensus genotypes, apply filters, and generate QC metrics.This workflow is implemented as a Snakemake (Mölder et al. 2021) workflow, which enables scalable, reproducible, and efficient analysis of large-scale genomic data sets.Snakemake manages all aspects of running the workflow, such as the installation of software dependencies, creation of output directories, and execution of workflow steps, so that the user input required is minimal.To use snpArcher, users need only edit a configuration file to customize workflow settings and define their samples in a table.With these files in place, running snpArcher is as simple as running one command.

Example Data Sets Processed Using snpArcher
To thoroughly evaluate snpArcher and to provide a database of comparative population genomic data sets, we ran the workflow on 26 public resequencing data sets (supplementary table S1, Supplementary Material online).We identified 13 bird, 12 fish, and 1 reptile data sets that fit our criteria of whole-genome, multisample, moderate sequencing effort (see Materials and Methods) and have a reference genome available.Data sets vary by number of individuals from 6 to 306, all with a mean depth of coverage of at least 5.We recovered between 3.34 million and 83.83 million total single nucleotide polymorphisms (SNPs) on genomes ranging from 348 Mb to 1.6 Gb (supplementary table S1, Supplementary Material online).Nucleotide diversity (Watterson's θ) varies by an order of magnitude across these species, from 0.00126 in the cichlid Amphilophus citrinellus to 0.01568 in the zebra finch Taeniopygia guttata.

Impact of Sequencing Depth
To evaluate the performance of snpArcher, we selected 10 individuals from a high-quality resequencing data set of zebra finch T. guttata (Singhal et al. 2015) and reanalyzed them using a range of approaches.First, we investigated the impacts of low sequencing depth by subsampling the initially high-depth data set (16.7× to 50.2× coverage) to uniform reduced coverage data sets (4×, 10×, and 20×).We ran each data set using the "low-coverage" and Mirchandani et al. • https://doi.org/10.1093/molbev/msad270MBE "high-coverage" configurations of the pipeline; the "lowcoverage" configuration alters certain GATK parameters to improve SNP calling in low-coverage data sets.After filtering for SNPs that passed all filters, we genotyped about 40, 55, and 50 million SNPs in the 4×, 10×, and 20× data sets, respectively, with about 1 million more SNPs recovered from the low-coverage pipeline at 4× coverage compared with the high-coverage version.There were negligible differences for the 2 pipeline versions at 10× and 20× (Fig. 2a).Assuming the 20× high-coverage pipeline produced the truth set of SNPs, the 4× data set was missing 35.6% (high-coverage pipeline) or 33.8% (lowcoverage pipeline), and the 10× data set was missing 10.3% (high-coverage pipeline) or 11.3% (low-coverage pipeline) of SNPs.CPU time to run the low-coverage version of the pipeline was substantially higher compared with the high-coverage version and increased with sequencing depth (Fig. 2b).The percentage of heterozygous sites per individual was substantially reduced at low coverage, especially when using the high-coverage parameters, and slightly reduced at 10× coverage (Fig. 2c).Individual fixation indices measuring excess homozygosity (F-statistics) were correspondingly higher at lower sequencing depths, especially with the high-coverage parameters (Fig. 2d), indicating more heterozygous dropout.While heterozygous dropout is a substantial problem at low coverage (Nevado et al. 2014;Benjelloun et al. 2019), parameter tuning can partially mitigate its impact on genotype calls, at the cost of longer compute times.

Parallelization
We assessed the effectiveness of our parallelization method for variant calling with snpArcher on the 10× zebra finch data set by comparing our scatter-by-Ns approach to the traditional scatter-by-chromosome approach.Given that GATK HaplotypeCaller has limitations in efficiently utilizing multiple CPU cores, optimal parallelization requires a scatter-gather technique, processing each chromosome independently (Heldenbrand et al. 2019).However, as runtime scales with genomic interval size (Fig. 3a), using this approach will still result in potentially long execution times, especially for organisms with very large chromosomes.To address this, we employ a strategy of partitioning chromosomes at Ns (assembly gaps), creating smaller genomic intervals that can be processed in parallel.This approach shortens the run time per individual (Fig. 3b), as more intervals can be concurrently processed.Although the effectiveness of this approach is dependent on available compute resources, the wide availability of HPC clusters and affordable cloud compute resources renders this constraint generally acceptable.

QC and Data Visualization
An important component of any pipeline is QC and data visualization outputs.We have implemented a module in snpArcher, run by default, that produces an interactive QC dashboard, which can be used to evaluate individuallevel sequencing quality (Fig. 4).This dashboard generates 10 figures that allow visualization of basic summary statistics relating to population structure, batch effects, sequencing depth, genetic relatedness, geography, and admixture.For speed, most of these summaries are based on a random sample of 100,000 SNPs from across the genome.Four panels at the top of the dashboard provide high-level summaries of the full variant data set (i.e.without random downsampling to 100,000 SNPs).
The use case for these simple visualizations is to quickly evaluate potential biases relating to individual-level sequencing variation.For example, in the principal component analysis (PCA) shown in the upper left panel of Fig. 4, it is possible to identify outliers that may represent cryptic genetic variation, batch effects, or otherwise problematic (or interesting) samples.By default, we identify 3 clusters based on PC1 and PC2 with k-means clustering (modifiable in the config file), and the remainder of the

Postprocessing
By default, snpArcher produces a raw VCF file with only basic filters annotated.However, after viewing the individual-level QC visualizations as part of the QC module, users may wish to remove certain individuals from the analysis and apply additional filters on called variants.Additional postprocessing steps are implemented in a module, which runs if the user adds a column to the sample sheet header "SampleType."The postprocessing module will exclude from the filtered VCF any sample with "exclude" as the SampleType, retaining all other individuals.Following this sample filtering, this module implements additional user-configurable filters.By default, the postprocessing workflow removes sites that fall into regions of low mappability, regions with excess coverage, and regions with insufficient coverage (defined by the configuration file) and then removes sites with a minor allele frequency of <0.1 or missingness of >75%.These thresholds can be configured by the user.Finally, 2 clean variant files are produced for SNPs and indels separately.

MK Tests
To demonstrate the potential to extend snpArcher to incorporate downstream analysis, we developed a module to evaluate positive selection among a sample of individuals from a population (the ingroup) as well as one or more diverged samples (the outgroup) by computing MacDonald-Kreitman (MK) tests for each gene (McDonald and Kreitman 1991).This module is triggered when samples are annotated as "ingroup" and "outgroup" using the SampleType column in the sample sheet.Samples that do not have either designation will be excluded from the MK tests.
To facilitate the development of this module, we wrote a standalone Python program, degenotate (https://github.com/harvardinformatics/degenotate), that can retrieve coding sequences from an annotated genome, compute degeneracy across the genome, and calculate MK tables; degenotate can be installed via conda and run independently but is also incorporated into snpArcher's MK module.Briefly, degenotate assesses whether SNPs in the

MBE
postprocessed VCF encode for polymorphic sites within the ingroup or fixed differences between the ingroup and the outgroup.It further classifies whether each SNP, whether polymorphic or fixed, is synonymous or nonsynonymous.Note that certain assumptions, detailed in the Materials and Methods, must be made about how to handle certain rare edge cases when doing this.
Based on these outputs, the MK module (or standalone degenotate) creates tables that are organized by gene and can be analyzed using the standard MK test statistic, using various extensions (Rand and Kann 1996;Stoletzki and Eyre-Walker 2011), or in aggregate to investigate genomewide signatures of natural selection (Messer and Petrov 2013).This module will enable rapid application of population-genomic tests of selection (Fig. 5) and, in combination with the database of processed population data sets, provides a framework for comparing rates of adaptation to a range of species.Intriguingly, 3 collagen genes with potential roles in tooth and spine development across vertebrates (Jonsson et al. 1998;Bosse et al. 2017) are among the strongest outliers (Fig. 5a) and may be involved in the unique pufferfish morphology (Thiery et al. 2017).

UCSC Genome Browser Track Data Hub Generation
To facilitate downstream data exploration and as an example of the module development components of this work, we developed a module to generate UCSC Genome Browser track files to explore population variation data (see Materials and Methods).Briefly, this module computes and generates genome browser tracks for traditional population genomic summary statistics such as windowed estimates of Tajima's D, SNP density, pairwise nucleotide diversity (π), minor allele frequency, and SNP depth.The Genome Browser tracks allow for rapid analysis of common population genomic statistics along with other available genomic feature tracks in an easy-to-access and shareable format (Fig. 6).

Discussion
The production of high-quality and accurate genomic variation data sets for nonmodel species can be a challenging task, especially with the ever-increasing volume of genomic data that are being produced.The massive scale of population-scale whole-genome sequencing data sets presents significant hurdles in data management, processing, and analysis.In this manuscript, we introduce snpArcher, a powerful and user-friendly Snakemake workflow that addresses these challenges and enables the production of reliable and reproducible variation data sets.Crucially, our pipeline is parallelized, efficient, and scales well even up to modern population-scale data sets.snpArcher also provides an ideal tool for reanalyzing population-level data sets that are available on public databases and provides a consistent framework for comparative analyses across different data sets.By offering a reproducible and

Extensibility of snpArcher
A key goal in the design of the snpArcher pipeline is to allow extensibility for subsequent analyses after the primary variant calling.We implement this using Snakemake modules, which allow additional rules to easily extend the main pipeline.To be added to snpArcher, a module only needs a way to indicate that it should be run, such as a flag in the config file or a column in the sample sheet, and for output files from snpArcher to be linked to input files of the workflow.As long as these constraints are met, any user-defined Snakemake workflow can be imported as a module.Furthermore, to enable seamless integration of user-developed modules, we define a set of guidelines for users to follow when developing and contributing modules (https://snparcher.readthedocs.io/en/latest/modules.html#module-contribution-guidelines).Finally, we present several modular extensions of snpArcher here, but we hope also that user-developed modules will grow the set of tools linked to snpArcher in order to facilitate diverse analysis.

Challenges and Prospects for Reuse of Public Data
Publically available data sets provide opportunities for comparative genomics and also present limitations inherent to data reuse.Metadata associated with genomic data is often fragmented or missing, meaning crucial information for QC of reusing data is not always available (Gonçalves and Musen 2019;Toczydlowski et al. 2021).A key function of the snpArcher pipeline is to produce metrics to evaluate potential biases in the data set for common population genomic issues.For example, pedigree information is typically not available for wild populations and likely to be missing from public data sets, but close relatives may bias many common population genomic analyses (Hendricks et al. 2018).Our QC module reports relatedness information, allowing rapid identification of related individuals.In the data sets we analyzed, 14% of all data sets considered included identical individuals by genotype and 47% of data sets included at least 1 firstdegree relative in it.At the population scale, undetected population structure can bias population and quantitative genomic analysis, and the PCA and admixture reports in the QC module will give a first-pass assessment of known or unknown structuring.Sequencing data on public databases can contain contamination, either from other individuals or other species.These can be identified using measures of inbreeding (i.e.low inbreeding values may A Variant Calling Workflow for Population Genomics • https://doi.org/10.1093/molbev/msad270MBE suggest excess heterozygosity and cross-contamination) that are reported in the QC module.Outliers in sequencing depth, missingness, and mapping rate are all quickly identifiable using the interactive QC plots.Finally, data quality at short-scale genomic intervals can be visualized using the genome browser outputs, for example to evaluate sequencing depth and genetic diversity around regions of interest.
Together, we expect these improvements to variant calling for the average user will enable users to focus on analyzing their data, rather than processing it.This is made easier by the extensibility of snpArcher, where it is relatively simple to add on new analyses to the workflow.The standardization inherent to the workflow will further improve our ability to reuse large and unwieldy publically available data sets.We hope the ease of use and flexibility of the snpArcher workflow will enable a more rapid and reproducible workflow for researchers generating large population genomic data sets.

Configuration
Core workflow options in snpArcher are controlled by a YAML configuration file.This file controls options such as module selection, output prefix for final files, and temporary storage location.For complete instructions on setting up snpArcher, please refer to our documentation: https://snparcher.readthedocs.io/en/latest/setup.html.
In order to determine what outputs to create, snpArcher requires users to create a sample sheet file.This comma-separated file contains the required sample metadata about the user's samples in order to run the workflow.At a minimum, the snpArcher pipeline requires that each sample has a unique sample name, a reference genome accession or a path to a fasta file, and a Sequence Read Archive (SRA) accession or path to 2 paired-end fastq files (Table 1).We include with snpArcher a simple script, written in Python, to facilitate the generation of sample sheets from local data sets, and we include examples of how to create snpArcher sample sheets from SRA run tables in R.

Computer Resources and Cloud Configuration
Variant calling for large population-level sequencing data sets is computationally intensive and requires substantial computational resources to run.While it is possible to run snpArcher on a laptop for small data sets, such as the test data set included in the workflow or single samples, we have optimized it to run on HPC clusters and cloud compute platforms.We have tested snpArcher extensively on SLURM-based high-performance clusters and on the Google Cloud Life Sciences platform, and following Snakemake best practices, we provide configurable profiles that can be enabled depending on which computational resources you will use.The SLURM profile and associated bash script provide the basic configuration for running on a SLURM cluster, but the profile will need to be adjusted according to the configuration of the user's specific system.
To run snpArcher on the Google Cloud Platform (GCP), the user must have a Google account linked to a billing account where charges for computational resources can be made.This Google configuration is set up outside of snpArcher, on the command line, and on the Google Cloud Console.Once this is set up in the local environment, snpArcher can be directed to run on Google Cloud instances using the GCP profile provided with the workflow.The user can define how many instances to create and also define the size of required resources in the resources.yamlfile included in the workflow.The GCP profile also is configured to exploit preemptible instances, which are short-term compute instances that are offered at considerable cost savings, but can only run for 24 h and be bought out by other GCP users.The current defaults have been optimized for data sets of genome size of ∼2 Gb, 150 individuals, and 10× sequencing depth with an estimated cost of $1/sample when a Sentieon license is available.Larger or smaller projects may need to tweak these resources to optimize cost/saving benefits best and prevent the preemption of long-running data sets.

Data Acquisition and Preprocessing
The first step of the workflow is the acquisition and preprocessing of raw sequence data and reference genomes.For each sample, 2 paired-end fastq files are required.The default behavior is to retrieve sequencing data from National Center for Biotechnology Information (NCBI) based on an SRA run accession (Leinonen et al. 2011) using prefetch.For various reasons, prefetch may fail.If this happens, ffq (Gálvez-Merchán et al. 2022) is used to generate a FTP link for the accession that is downloaded.Alternatively, users can supply paths to fastq files in the sample sheet, in which case snpArcher will operate on those locally stored files.Next, sequencing adapters are trimmed from the raw fastq files with fastp (Chen et al. 2018), and sequences with greater than 40% of bases with a phred score below Q15 were removed.Reference genomes are retrieved using the NCBI data sets tool (Sayers et al. 2021) if an NCBI accession is specified; otherwise, a path to the reference fasta must be included in the sample sheet.Once available, the reference fasta is processed using bwa index (Li and Durbin 2009), samtools faidx, and samtools dict (Li et al. 2009) to produce the indexes necessary for downstream processes.

Read Mapping
After the raw data are retrieved and preprocessed, the workflow aligns sequencing reads to the reference genome  (Li 2013) to produce per sample BAM files.
For each sample, read groups are appended based on the sample sheet specification.We mark PCR duplicates using sambamba markdup (Tarasov et al. 2015) to exclude these technical artifacts from downstream analysis.Alignment statistics are calculated per sample using samtools flagstat.

Mappability and Coverage
Additionally, mappability statistics are computed on the reference genome using genmap (Pockrandt et al. 2020).Per-site coverage statistics are optionally computed and aggregated using d4tools (Hou et al. 2021), mosDepth (Pedersen and Quinlan 2018), and bedtools (Quinlan and Hall 2010).Mappability statistics for the reference genome, combined with per-site coverage statistics, can be used to generate a bed file delineating callable regions of the genome based on user-configurable thresholds.

Variant Calling
We use GATK (McKenna et al. 2010) for individual variant calling and joint genotyping.First, we employ GATK HaplotypeCaller to call SNPs and indels in each sample.
If the user has selected the low-coverage configuration, we set the --min-pruning and -min-dangling-branch-length options equal to 1 (Hui et al. 2020); otherwise, defaults are used.Next, individual variant calls are aggregated into an efficient data structure via GATK GenomicsDBImport.This step is necessary to enable large cohort joint genotyping.Then, we use GATK GenotypeGVCFs to perform joint genotyping and produce a multisample VCF, retaining only high confidence variants.This approach is broadly adapted by the field as the standard for variant calling, as evidenced by nearly 20,000 citations of the flagship GATK paper to date.Finally, we apply filter annotations to the VCF according to the GATK best practices (Van der Auwera et al. 2013) using GATK VariantFiltration.

Parallelization
Processing even moderately sized data sets can be exceptionally slow with GATK.One solution is to parallelize each GATK step by splitting the reference genome into processing intervals for both the individual and joint genotyping steps.Optimally, this interval creation step divides the genome into shorter subchromosomal (or subscaffold) pieces so that each interval can finish in a shorter amount of time.In order to optimize runtime, we use a two-step interval creation process.We generate an initial set of calling intervals using the ScatterIntervalsByNs tool to divide the reference genome at large blocks of Ns.This is important because SNP calling with GATK Haplotype Caller is based on local reassembly, which can be adversely affected if, for example, reads that map across an interval boundary are discarded.However, for many reference genomes, this can result in thousands of intervals, which leads to inefficient workflows as the time to assess which jobs need to run becomes prohibitive.To create a balanced set of interval lists, we use the GATK SplitIntervals tool using the option <-mode BALANCING_WITHOUT_INTERVAL_ SUBDIVISION>, which creates a set of interval lists (up to a maximum user-specified value) that all have approximately equal numbers of bases.For the joint genotyping step, each site is treated independently, so we can gain efficiency by creating additional intervals without the concern of splitting adjacent regions of the genome.Thus, for the second set of intervals, we use the option <-mode INTERVAL_SUBDIVISION> to produce a scalable number of intervals that can divide adjacent regions.These intervals are then used to parallelize GATK GenomicsDBImport for efficient multisample calling.

Postprocessing
In order to enable users to efficiently filter individuals from their VCF file after initially running snpArcher, we include the postprocessing module.Users can trigger this module by marking individuals for removal using the "SampleType" column in their sample sheet.The postprocessing module applies customizable filters, which by default remove sites in regions of low mappability and excessive or insufficient coverage (as defined in the configuration file) using bedtools and sites with a minor allele frequency of <0.1 or missingness of >75% using bcftools (after recalculating these metrics following sample removal).We also produce separate variant files for SNPs and small indels called by GATK.

Trackhubs
To display population genomic statistics calculated from the VCF generated by snpArcher, we include an optional module to generate a UCSC Genome Browser track data hub (Raney et al. 2014).At time of publication, this module calculates Tajima's D (Tajima 1989), SNP density, nucleotide diversity (π), and allele frequency.These statistics are calculated using VCFtools v0.1.15and converted to bigBed format using bedToBigBed (Kent et al. 2010).
Annotating Codon Degeneracy and Inferring Synonymous and Nonsynonymous Variants snpArcher also includes an optional module that annotates the degeneracy of all coding regions in the reference genome and implements the classic MK test for detecting selection acting in coding regions within a population (McDonald and Kreitman 1991).Briefly, this test compares the number of SNPs present within the population that either change (nonsynonymous) or do not change (synonymous) the amino acid encoded at that position.This is compared with similar counts of fixed differences in a diverged outgroup sample to see if and how the ratio of nonsynonymous to synonymous changes differs between them.While annotating degeneracy and computing tables for the MK test are common tasks in population genetics, we are not aware of any tools that automate these analyses at a genome-wide scale.To facilitate the integration of this functionality into snpArcher, we developed a standalone tool called degenotate (https:// github.com/harvardinformatics/degenotate),which calculates MK tables, performs degeneracy annotation, and allows users to extract coding sequences from a genome by their degeneracy.
To implement the MK test across diverse organisms, we make some assumptions about how to classify polymorphic and divergent sites.We consider a polymorphic site to be any location where at least 1 individual within the ingroup possesses a nonreference allele and divergent sites to be only those where none of the outgroup alleles exist in the ingroup.Using these definitions, it is possible for a site to both be polymorphic and fixed if the outgroup alleles are different from the alleles segregating within the population.For quantifying variants, we also make some simplifying assumptions.First, if a codon has more than 1 variant segregating within a population (either because multiple positions at the codon have segregating sites or because 1 position has a multiallelic SNP), we treat each segregating variant as independent.For the outgroup, if there are multiple fixed differences in a single codon in the outgroup, we compute all possible mutational pathways between the ingroup codon and the outgroup codon and take the average number of nonsynonymous and synonymous changes across these paths, weighted equally.This means we can have fractional numbers of synonymous and nonsynonymous divergence.We also implement calculations of the neutrality index (Rand and Kann 1996) and direction of selection (Stoletzki and Eyre-Walker 2011) based on the MK test results.
We used SRA to search for possible data sets for inclusion, limiting our search space to species with (i) a reference genome and (ii) at least 1 BioProject that contains a minimum of 10 BioSamples sequenced to at least 5× average coverage.The resulting list was then manually curated to identify publications associated with each BioProject, excluding from further consideration data sets for which a publication could not be identified.We then manually assessed the resulting plausible samples to identify a subset for further analysis.R notebooks are provided on GitHub that contain the code for initial and final assessments (https://github.com/sjswuitchik/compPopGen_ms).

Benchmarking
To investigate the impact of low sequencing depth on variant calling, we, first, subsampled the original high-depth data set zebra finch data set to 4×, 10×, and 20× coverage.We ran snpArcher on these subsampled data sets and filtered the resulting VCF files by removing sites not passing standard filters and calculated heterozygosity statistics using VCFtools v0.1.15 (Danecek et al. 2011).Second, we assessed the effectiveness of our variant calling parallelization (scatter-by-Ns) approach to the conventional (scatter-by-chromosome) approach using the 10× data set.We performed these benchmarking runs on Google Cloud compute instances, selecting the instance types for each rule to balance cost and runtime (supplementary table S2, Supplementary Material online).

Fig. 1 .
Fig. 1. snpArcher overview.snpArcher is an automated pipeline implemented in Snakemake (Mölder et al. 2021).It takes short-read wholegenome sequencing data (fastq) and a reference genome as input and produces a multisample variant callset (VCF).With the modules presented here, snpArcher produces basic QC statistics and visualizations.

Fig. 2 .
Fig. 2. Benchmarks for the 10 individual zebra finch coverage and pipeline testing.For each coverage data set (4×, 10×, 20×), we ran the lowcoverage (LC) and high-coverage (HC) version of the pipeline, and calculated a) the overall number of SNPs following standard SNP filtering, b) the hours of CPU time to run HaplotypeCaller for each individual, c) the percentage of heterozygous sites for each individual, and d) the F-statistic calculated for each individual.

Fig. 3 .MBEFig. 4 .Fig. 5 .
Fig. 3. Run time metrics from the HaplotypeCaller step of snpArcher.a) Wall clock time required to run HaplotypeCaller on different genomic interval sizes.b) Wall clock time elapsed per individual to complete the HaplotypeCaller step, using 2 parallelization approaches: scatter-by-chrom and scatter-by-Ns.Wall clock time per individual represents the maximum time taken of all intervals processed for that individual.

Table 1
An example of the minimum required sample metadata to run snpArcher workflow.This software package is proprietary and produces identical results as GATK but has been much more efficiently parallelized, resulting in substantially reduced compute needs.The Sentieon workflow uses Sentieon's drop-in replacement tools for mapping, PCR duplicate removal, metrics, and variant calling.The use of this workflow is a user-specified option in snpArcher and requires a software license from Sentieon that can be specified in the config file.