MutantHuntWGS: A Pipeline for Identifying Saccharomyces cerevisiae Mutations

MutantHuntWGS is a user-friendly pipeline for analyzing Saccharomyces cerevisiae whole-genome sequencing data. It uses available open-source programs to: (1) perform sequence alignments for paired and single-end reads, (2) call variants, and (3) predict variant effect and severity. MutantHuntWGS outputs a shortlist of variants while also enabling access to all intermediate files. To demonstrate its utility, we use MutantHuntWGS to assess multiple published datasets; in all cases, it detects the same causal variants reported in the literature. To encourage broad adoption and promote reproducibility, we distribute a containerized version of the MutantHuntWGS pipeline that allows users to install and analyze data with only two commands. The MutantHuntWGS software and documentation can be downloaded free of charge from https://github.com/mae92/MutantHuntWGS.

Saccharomyces cerevisiae is a powerful model system for understanding the complex processes that direct cellular function and underpin many human diseases (Birkeland et al. 2010; Botstein and Fink 2011; Kachroo et al. 2015; Hamza et al. 2015, 2020; Wangler et al. 2017; Strynatka et al. 2018). Mutant hunts (i.e., genetic screens and selections) in yeast have played a vital role in the discovery of many gene functions and interactions (Winston and Koshland 2016). A classical mutant hunt produces a phenotypically distinct colony derived from an individual yeast cell with at most a small number of causative mutations. However, identifying these mutations using traditional genetic methods (Lundblad 2001) can be difficult and time-consuming.
Whole-genome sequencing (WGS) is a powerful tool for rapidly identifying mutations that underlie mutant phenotypes (Smith and Quinlan 2008; Irvine et al. 2009; Birkeland et al. 2010). As sequencing technologies improve, the method is becoming more popular and cost-effective (Shendure and Ji 2008; Mardis 2013).
Analysis methods that identify sequence variants from WGS data can be complicated and often require bioinformatics expertise, limiting the number of investigators who can pursue these experiments. There is a need for an easy-to-use, data-transparent tool that allows users with limited bioinformatics training to identify sequence variants relative to a reference genome. To address this need, we created MutantHuntWGS, a bioinformatics pipeline that processes data from WGS experiments conducted in S. cerevisiae. MutantHuntWGS first identifies sequence variants in both control and experimental (i.e., mutant) samples, relative to a reference genome. Next, it removes variants that are found in both the control and experimental samples and applies a variant-quality-score cutoff. Finally, the remaining variants are annotated with information such as the affected gene and the predicted impact on gene expression and function. The program also allows the user to inspect all relevant intermediate and output files.
To enable quick and easy installation and to ensure reproducibility, we incorporated MutantHuntWGS into a Docker container (https://hub.docker.com/repository/docker/mellison/mutant_hunt_wgs). With a single command, users can download and install the software. A second command runs the analysis, performing all steps described above. MutantHuntWGS allows researchers to leverage WGS for the efficient identification of causal mutations, regardless of bioinformatics experience.
Figure 1. Flow chart of the MutantHuntWGS pipeline. Input data are colored in blue, the various bioinformatics tools in the pipeline are colored in green, and output data are colored in purple. Arrows identify the path of the workflow at each step of the pipeline.

Pipeline overview
The MutantHuntWGS pipeline integrates a series of open-source bioinformatics tools and Unix commands. It accepts raw sequencing reads (compressed FASTQ format, .fastq.gz) and a text file containing ploidy information as input, and produces a list of sequence variants as output. The user must provide input data from at least two strains: a control strain and one or more experimental strains. The pipeline uses (1) Bowtie2 to align the reads in each input sample to the reference genome (Langmead and Salzberg 2012), (2) SAMtools to process the data and calculate genotype likelihoods (Li et al. 2009), (3) BCFtools to call variants (Li et al. 2009), (4) VCFtools (Danecek et al. 2011) and custom shell commands to compare variants found in experimental and control strains, and (5) SnpEff (Cingolani et al. 2012) and SIFT (Vaser et al. 2016) to assess where variants are found in relation to annotated genes and their potential impact on the expression and function of the affected gene products (Figure 1). A detailed description of the commands used in the pipeline and all code are available on the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS; see the README.md and Supplemental_Methods.docx files).
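The comparison in step (4) can be illustrated with a minimal Python sketch. This is not the pipeline's code, which relies on VCFtools and shell commands; it only shows the underlying idea of keeping variants that are absent from the control and that meet a quality cutoff. The names `Variant`, `parse_vcf`, and `unique_to_experimental` are hypothetical, the VCF records are invented toy data, and treating the cutoff as inclusive (QUAL >= 100) is an assumption.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def parse_vcf(text: str) -> list[Variant]:
    """Read the first six tab-separated VCF columns, skipping header lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        if qual == ".":  # missing quality: cannot be compared to a cutoff
            continue
        variants.append(Variant(chrom, int(pos), ref, alt, float(qual)))
    return variants


def unique_to_experimental(experimental, control, min_qual=100.0):
    """Keep experimental variants absent from the control with QUAL >= min_qual."""
    control_keys = {(v.chrom, v.pos, v.ref, v.alt) for v in control}
    return [
        v for v in experimental
        if (v.chrom, v.pos, v.ref, v.alt) not in control_keys
        and v.qual >= min_qual
    ]


def rows(*records):
    """Build toy VCF text from tuples of column values."""
    return "\n".join("\t".join(r) for r in records)


# Toy data (invented for illustration only).
HEADER = ("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")
CONTROL_VCF = rows(
    HEADER,
    ("chrI", "100", ".", "A", "G", "220", ".", "."),
)
EXPERIMENTAL_VCF = rows(
    HEADER,
    ("chrI", "100", ".", "A", "G", "225", ".", "."),   # shared with control
    ("chrIV", "250", ".", "C", "T", "40", ".", "."),   # below the cutoff
    ("chrVII", "500", ".", "G", "A", "180", ".", "."), # candidate
)

candidates = unique_to_experimental(
    parse_vcf(EXPERIMENTAL_VCF), parse_vcf(CONTROL_VCF)
)
for v in candidates:
    print(v.chrom, v.pos, v.ref, v.alt, v.qual)
```

Matching on chromosome, position, reference allele, and alternate allele is one simple definition of "the same variant"; a real comparison may need to handle multi-allelic records and differing indel representations.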

Analysis of previously published data
To demonstrate utility, we used MutantHuntWGS to analyze published datasets from paired-end sequencing experiments with DNA prepared from bulk segregants or lab-evolved strains (Birkeland et al. 2010; Goldgof et al. 2016; Ottilie et al. 2017). These data were downloaded from the Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra; project accessions: SRP003355, SRP074482, SRP074623) and decompressed using the SRA Toolkit (https://github.com/ncbi/sra-tools/wiki). MutantHuntWGS was run from within the Docker container, and each published mutant (experimental) file was compared to its respective published control.

Data availability
All code and supplementary information on the methods used herein are available on the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS).

Utility of the MutantHuntWGS pipeline
MutantHuntWGS processes WGS data through a standard alignment/variant-calling pipeline and compares each experimental strain to a control strain (Figure 1, see Methods). The pipeline's constituent tools are often used for WGS analysis (Reavey et al. 2015). However, MutantHuntWGS ensures ease of use by assembling these tools in a Docker container and requiring only one command to run them all in sequence. This approach combines the best aspects of previously published pipelines (discussed below) while allowing inexperienced users to install the software and reproducibly apply popular methods.
MutantHuntWGS also ensures that the output data files are well organized and easy to locate. Output files include aligned reads (BAM format), alignment statistics (TXT format), pre- and post-filtering variants (VCF format), SnpEff output (HTML, VCF, and TXT formats), and SIFT output (VCF and XLS formats). The user thus has all the information needed to identify and visually inspect sequence variants, and to generate figures and tables for publication.

MutantHuntWGS combines versatility and simplicity
Our goal in creating MutantHuntWGS was to simplify the installation and usage of robust bioinformatics tools while maintaining flexibility by allowing users to specify certain critical options.
Examples of this, discussed below, include (1) enabling use with additional organisms, (2) allowing users to specify ploidy, (3) filtering by a user-specified variant-quality score, and (4) exposing all intermediate and final output files to facilitate additional filtering and quality control.
MutantHuntWGS is designed for use with S. cerevisiae by default but can be adapted to analyze WGS data from any organism. At present, only the necessary reference files for S. cerevisiae are included in the MutantHuntWGS download. Investigators who wish to analyze data from an organism other than S. cerevisiae need to provide, at minimum, new Bowtie2 indices, a genome FASTA file, and a ploidy file. Bowtie2 indices and genome FASTA files for many model organisms are available at https://support.illumina.com/sequencing/sequencing_software/igenome.html. A FASTA index file (genome.fasta.fai) that can be easily converted into a ploidy file is also available at this link. However, performing the SnpEff and SIFT analyses for another organism requires slight alterations to the SnpEff and SIFT commands in the pipeline script and a copy of the SIFT library for the organism of interest. We chose not to include reference files and SIFT libraries for other organisms within the Docker container due to the large size of these files. If users encounter difficulties when analyzing non-S. cerevisiae WGS data, we encourage them to seek assistance by opening an issue on the MutantHuntWGS Git repository.
Experiments in yeast are often performed in a haploid background, but can also be performed in diploid or occasionally aneuploid backgrounds. The MutantHuntWGS download includes two ploidy files, one for diploids and one for haploids. The user can specify either ploidy file when running the pipeline, and MutantHuntWGS automatically provides this file to BCFtools during the variant-calling step. This may be particularly advantageous for analysis of yeast strains with aneuploid chromosomes. Instructions on the GitHub README page explain how to modify the ploidy file to account for aneuploidy in the analysis.
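As an illustration of how a FASTA index might be turned into a ploidy file, the following Python sketch writes one line per chromosome in the five-column layout that BCFtools documents for its ploidy-file option (CHROM, FROM, TO, SEX, PLOIDY). It is a sketch under stated assumptions rather than the pipeline's own procedure: the function name `fai_to_ploidy`, the `overrides` argument, and the placeholder sex label "M" are hypothetical, the offsets in the toy index are illustrative, and whether the ploidy files shipped with MutantHuntWGS use exactly this layout should be checked against the README.

```python
def fai_to_ploidy(fai_text, default_ploidy=1, overrides=None, sex="M"):
    """Convert FASTA index text (.fai) into BCFtools-style ploidy lines.

    Each .fai row starts with the sequence name and its length; the
    remaining columns (byte offset, line lengths) are ignored here.
    `overrides` maps chromosome names to a ploidy that differs from the
    default, e.g. a disomic chromosome in an otherwise haploid strain.
    """
    overrides = overrides or {}
    lines = []
    for row in fai_text.splitlines():
        if not row.strip():
            continue
        name, length = row.split("\t")[:2]
        ploidy = overrides.get(name, default_ploidy)
        # Columns: CHROM FROM TO SEX PLOIDY (1-based, inclusive coordinates)
        lines.append(f"{name} 1 {int(length)} {sex} {ploidy}")
    return "\n".join(lines) + "\n"


# Toy index with two chromosomes (offset columns are illustrative).
FAI = "chrI\t230218\t6\t60\t61\nchrII\t813184\t234067\t60\t61\n"

# Haploid strain with a hypothetical extra copy of chrII.
ploidy_text = fai_to_ploidy(FAI, default_ploidy=1, overrides={"chrII": 2})
print(ploidy_text, end="")
```

Keeping the default ploidy and the per-chromosome overrides separate means the same index can produce a haploid file, a diploid file, or an aneuploid variant of either by changing only the arguments.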
Users may also set variant-quality-score cutoffs (described in detail on GitHub: https://github.com/mae92/MutantHuntWGS/blob/master/README.md) to tune the stringency of the analysis. They can also turn off the alignment step to save time when changing the stringency. With this option, the pipeline re-subsets the existing variant calls using the higher or lower cutoff and skips the more time-consuming upstream steps. Although MutantHuntWGS does not allow users to specify additional cutoffs that filter the output by SnpEff/SIFT effect predictions and scores, users can apply such filters separately to the MutantHuntWGS output files after the fact, thus allowing for increased stringency.

Assessing MutantHuntWGS performance using a bulk segregant analysis dataset
To assess MutantHuntWGS performance, we applied it to bulk segregant analysis data (Birkeland et al. 2010) with ploidy set to haploid. MutantHuntWGS identified 188 variants not present in the control strain that passed the variant-quality-score cutoff of 100. Thus, only 1.95% of all variants detected in the experimental strain passed the filtering steps (Table 1). Among these was the same PHO81 (VAC6) mutation found in the Birkeland et al. (2010) study.
We were surprised by how many sequence variants (relative to the reference genome) remained after filtering. Given our variant-quality-score cutoff of 100, it is unlikely that these variants were called in error; instead, they likely reflect high sequence heterogeneity in the genetic backgrounds of the experimental and control strains. To further reduce the length of the variant list, we experimented with additional cutoffs, including (1) more stringent variant-quality-score, (2) SIFT score, and (3) SnpEff impact score cutoffs. A SIFT-score cutoff of <0.05 (deleterious) reduced the number of variants in the SIFT output from 152 to 6 (Table 1). An increased variant-quality-score stringency (>130) reduced the number of variants to 21. A SnpEff impact-score cutoff of > Moderate reduced the number of variants to 55. Finally, a variant-quality-score cutoff of >130 and a SnpEff score of > Moderate, used together, reduced the number of variants to only 6. All post-hoc tests retained the causal variant. These tests demonstrate how users might similarly narrow their lists of potential candidates. However, we caution readers that filtering by these metrics has the potential to increase the false negative rate in their analysis.
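The kind of post-hoc narrowing described above can be sketched in Python as follows. This is an illustrative sketch, not part of MutantHuntWGS: it assumes the relevant values have already been extracted from the SnpEff and SIFT output files into simple records, and the field names (`qual`, `sift_score`, `impact`), the function `narrow_candidates`, and the four toy variants are all hypothetical. The impact labels HIGH, MODERATE, LOW, and MODIFIER are the categories SnpEff assigns; the caller chooses which of them to keep.

```python
def narrow_candidates(variants, min_qual=None, max_sift=None, allowed_impacts=None):
    """Apply optional post-hoc cutoffs to already-annotated variants.

    min_qual        keep variants with quality strictly above this value
    max_sift        keep variants with a SIFT score strictly below this value
    allowed_impacts keep variants whose SnpEff impact is in this set
    """
    kept = []
    for v in variants:
        if min_qual is not None and not v["qual"] > min_qual:
            continue
        if max_sift is not None:
            score = v.get("sift_score")
            # Variants without a SIFT score are dropped when this filter is
            # active, which is one way such cutoffs can cause false negatives.
            if score is None or not score < max_sift:
                continue
        if allowed_impacts is not None and v.get("impact") not in allowed_impacts:
            continue
        kept.append(v)
    return kept


# Toy records (invented for illustration only).
VARIANTS = [
    {"id": "v1", "qual": 221.0, "sift_score": 0.01, "impact": "MODERATE"},
    {"id": "v2", "qual": 120.0, "sift_score": 0.02, "impact": "HIGH"},
    {"id": "v3", "qual": 180.0, "sift_score": 0.40, "impact": "LOW"},
    {"id": "v4", "qual": 150.0, "sift_score": None, "impact": "MODIFIER"},
]

by_quality = narrow_candidates(VARIANTS, min_qual=130)
by_sift = narrow_candidates(VARIANTS, max_sift=0.05)
combined = narrow_candidates(
    VARIANTS, min_qual=130, allowed_impacts={"HIGH", "MODERATE"}
)
print([v["id"] for v in by_quality])
print([v["id"] for v in by_sift])
print([v["id"] for v in combined])
```

Because each cutoff is optional, the filters can be tried one at a time or in combination, which makes it easier to see which cutoff removes a given candidate before committing to a stricter setting.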

Assessing MutantHuntWGS performance using lab evolution datasets
To test MutantHuntWGS performance on strains that did not undergo bulk-segregant analysis, we analyzed nine datasets from lab evolution experiments (Goldgof et al. 2016; Ottilie et al. 2017), again setting ploidy to haploid and using a variant-quality-score cutoff of 100. In each of these studies, yeast cells were allowed to evolve resistance to a drug and WGS was used to identify mutations (Goldgof et al. 2016; Ottilie et al. 2017).

Existing WGS analysis pipelines
Other platforms exist that perform similar analyses. Each possesses a subset of the features enabled by MutantHuntWGS and has notable strengths. MutantHuntWGS is unique in its ability to combine the best attributes of these published tools while including additional functionality and providing output data in standard formats, such as BAM and VCF.
One user-friendly program, Mudi (Iida et al. 2014), uses BWA (Jo and Koh 2015), SAMtools (Li et al. 2009), and ANNOVAR (Wang et al. 2010) for sequence alignment, identification, and annotation of sequence variants, respectively. Like MutantHuntWGS, Mudi performs numerous filtering steps before returning a list of putative causal variants. MutantHuntWGS predicts variant effects and maps variants to annotated S. cerevisiae genes using SnpEff and SIFT instead of ANNOVAR, and also offers access to all intermediate data files.
Another program, VAMP, consists of a series of Perl scripts that build and query an SQL database made from user-provided short-read sequencing data. VAMP identifies sequence variants, including large insertions and deletions. It also has built-in functionality that allows for manual inspection of the data (Birkeland et al. 2010). One advantage of MutantHuntWGS over VAMP is that it adheres to common data formats.
A recent article describing WGS in yeast samples includes a bioinformatics pipeline, referred to as wgs-pipeline. It is built in a Snakemake framework (Köster and Rahmann 2012) that runs in a Conda environment (https://docs.conda.io/en/latest/), similar to the container-based analysis environment we used for MutantHuntWGS. This pipeline uses Bowtie2 (Langmead and Salzberg 2012), SAMtools (Li et al. 2009), Picard (Toolkit 2016), and GATK (McKenna et al. 2010) to align, process, and compare datasets. Compared to wgs-pipeline, MutantHuntWGS, which runs both SnpEff and SIFT on the candidate variants, provides a more comprehensive analysis of the predicted effects of the variants.
The Galaxy platform (Giardine et al. 2005; Blankenberg et al. 2010) provides a user-friendly, online interface for building bioinformatics pipelines. Galaxy also offers access to intermediate files. However, analysis with this platform requires the user to select the tools and parameters to incorporate, so some knowledge of the tools themselves is essential. Implementation is straightforward after those decisions are made, and the user need not have any understanding of Unix/Linux. The advantage of MutantHuntWGS over the Galaxy platform and pipelines such as CloudMap (Minevich et al. 2012) is that the user does not need to make decisions about the data analysis workflow.
In summary, the MutantHuntWGS pipeline is among the most user-friendly of these programs. It combines the most useful features of the existing WGS analysis programs while also enabling the user to account for ploidy. Containerization streamlines the installation of MutantHuntWGS and enhances its reproducibility. Thus, MutantHuntWGS offers ease of use, functionality, and data transparency, setting it apart from other WGS pipelines.

Conclusions
Processing data generated from next-generation sequencing platforms requires significant expertise and is therefore inaccessible to many investigators. We have developed a differential variant-calling pipeline capable of identifying causal variants from WGS data. We demonstrate the utility of MutantHuntWGS by analyzing previously published datasets. In all cases, our pipeline successfully identified the causal variant. We offer this highly reproducible and easy-to-implement bioinformatics pipeline to the Saccharomyces cerevisiae research community (available at https://github.com/mae92/MutantHuntWGS).