BSA4Yeast: Web-based quantitative trait locus linkage analysis and bulk segregant analysis of yeast sequencing data

Abstract
Background: Quantitative trait locus (QTL) mapping using bulk segregants is an effective approach for identifying genetic variants associated with phenotypes of interest in model organisms. By exploiting next-generation sequencing technology, the QTL mapping accuracy can be improved significantly, providing a valuable means to annotate new genetic variants. However, setting up a comprehensive analysis framework for this purpose is a time-consuming and error-prone task, posing many challenges for scientists with limited experience in this domain.
Results: Here, we present BSA4Yeast, a comprehensive web application for QTL mapping via bulk segregant analysis of yeast sequencing data. The software provides automated and efficiency-optimized data processing, up-to-date functional annotations, and an interactive web interface to explore identified QTLs.
Conclusions: BSA4Yeast enables researchers to efficiently identify plausible candidate genes in QTL regions, so that their genetic variations can be validated experimentally as causative for a phenotype of interest. BSA4Yeast is freely available at https://bsa4yeast.lcsb.uni.lu.


Background
Deciphering the genetic basis of diseases or complex traits is a major task in biomedical and basic biological research and is a key first step towards a better understanding of the molecular mechanisms behind disorders with genetic components. As a forward genetic approach, linkage analysis of Quantitative Trait Loci (QTLs) using bulk segregant analysis (BSA) in model organisms, such as yeast, is an efficient method for identifying novel genetic variants responsible for heritable phenotypic variability [1,2]. By exploiting the capacity of next-generation sequencing (NGS) technologies to assess large numbers of genetic markers efficiently and integrating NGS analysis with linkage mapping, the precision of QTL mapping can be improved significantly as compared to traditional approaches. In order to perform a linkage analysis using BSA in practice, relevant software packages, such as the bsaseq Python package [3], and web-based software, such as EXPLoRA-web [4], have been made available in recent years. However, these tools require researchers to first determine genetic markers of interest from sequencing data. Moreover, they only provide limited annotations for the discovered QTLs (i.e. only the QTL coordinates) and do not support the interactive exploration and visualization of detailed QTL annotations in a web-browser. Since the analysis of NGS data involves several different command-line software tools and is a time-consuming and laborious task, a more efficient, automated analysis framework that supports annotation-based result interpretation would greatly facilitate NGS-based bulk segregant analysis (NGS-BSA).
For this purpose, we have developed BSA4Yeast, a comprehensive web-based analysis software for QTL mapping via bulk segregant analysis of yeast sequencing data (Fig. 1). BSA4Yeast provides the following main new benefits and features: Compiled on: October 18, 2018. Draft manuscript prepared by the author.
To the best of our knowledge, BSA4Yeast is the first comprehensive web-based software that integrates automated NGS data analysis with QTL mapping via bulk segregant analysis.

Functionality and Workflow
The BSA4Yeast framework for QTL mapping via bulk segregant analysis of yeast sequencing data is built on custom scripts and open-source bioinformatics software (Fig. 2). The software workflow covers three major functionalities: 1) pre-processing and aligning short reads (Illumina format) against an up-to-date version of the yeast genome; 2) identifying relevant genetic markers between two parental lines; and 3) performing QTL analyses and comprehensively annotating the results using public data in an automated fashion (Fig. 2). For different types of experiments, the design of the experiment (DOE) can be adjusted appropriately using flexible parameter settings. The web-application supports one- or two-bulk designs and designs with multiple biological replicates in each bulk, as well as three file formats as input (.fastq, .bam aligned against SacCer3, and .map format; both paired-end (PE) and single-end (SE) DNA sequencing data are accepted). For .map input files, BSA4Yeast can be used to identify QTLs for any species of interest, and for S. cerevisiae dedicated annotations are generated additionally.
After the pre-processing and alignment computations in the first step of the workflow, genetic markers are identified automatically in the second phase. Optionally, the user can adjust the trade-off between the stringency and coverage of the marker identification by specifying a custom DNA sequencing depth of coverage. For the QTL analyses in the third and final step, the user can adjust the type and width of the smoothing kernel and has the option to download intermediate results, such as allele frequency files, bam files or map files, for further independent analyses. The QTL peaks, QTL regions and corresponding empirically estimated p-values are determined using the G' statistic [3]. To facilitate the interpretation of results, BSA4Yeast computes various dedicated statistics, such as the allele counts on each chromosome and a summary of the type of mutations in each parental line (e.g. stop gain, stop loss, frameshift or non-synonymous mutations), as well as SNAP scores to evaluate deleteriousness [5]. Additionally, comprehensive annotations for the QTLs and the genome of the parental lines are provided, and all results can be downloaded from the website. The software does not require any registration, but users can optionally create an anonymous account to store results (8 GB) for a longer time to conduct further analyses with different parameters. Overall, the software is designed to enable scientists with limited background knowledge in bioinformatics to run all analysis steps with minimal manual effort, only needing to provide adequately formatted input DNA sequencing files (bam or map files) through a web-browser, and avoiding time-consuming installation and configuration steps on the local computer.
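To make the third workflow step more concrete, the following minimal sketch computes a per-marker G statistic from allele counts in two bulks and smooths it along the chromosome to obtain G'. It assumes the formulation of ref. [3] with a tricube kernel; the function names, argument layout and the window definition are illustrative assumptions and are not taken from the BSA4Yeast or bsa-seq source code. The sketch is written for Python 3.

```python
import math


def g_statistic(a_low, b_low, a_high, b_high):
    """G statistic for a 2x2 table: parental allele (A/B) x bulk (low/high)."""
    observed = [a_low, b_low, a_high, b_high]
    total = float(sum(observed))
    if total == 0:
        return 0.0  # no reads at this marker, hence no evidence
    row_low, row_high = a_low + b_low, a_high + b_high
    col_a, col_b = a_low + a_high, b_low + b_high
    # Expected counts under the null hypothesis of no allele/bulk association.
    expected = [
        row_low * col_a / total,
        row_low * col_b / total,
        row_high * col_a / total,
        row_high * col_b / total,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def g_prime(positions, g_values, width):
    """Tricube-weighted average of G in a window of `width` bp around each marker."""
    half = width / 2.0
    smoothed = []
    for x in positions:
        weighted_sum = weight_total = 0.0
        for p, g in zip(positions, g_values):
            d = abs(p - x) / half  # distance scaled to the half-window
            if d < 1.0:
                w = (1.0 - d ** 3) ** 3  # tricube kernel weight
                weighted_sum += w * g
                weight_total += w
        # The focal marker always has weight 1, so weight_total is never zero.
        smoothed.append(weighted_sum / weight_total)
    return smoothed
```

A wider window yields a smoother G' profile at the cost of spatial resolution, which is the trade-off exposed by the kernel width setting described above.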

Implementation
The BSA4Yeast web-application has been developed in Python 2.7 using the Flask micro-framework (Fig. 3) [6]. Flask is an extensible web micro-framework, written in Python and therefore fully compatible with the bsa-seq package used for QTL calculations [3] (implemented in Python 2.7). All analyses run as Flask asynchronous background tasks using Celery, a task queue/job queue system based on distributed messaging [7], and Redis, an open source (BSD license) message broker between the web-application and the Celery worker (Fig. 3) [8,6]. Since analyses of fastq or bam files may take hours, the user can optionally be notified about the job termination via an email message. Moreover, to avoid blocking the main application process, analyses run as background tasks of the Celery worker (--concurrency=4). Result files are stored on the server's hard drive (1 TB) and periodically cleaned (a cron job runs every week and removes only files older than a minimum waiting time of one week). All metadata generated when computing analysis results, such as file name, file creation time and file type, is recorded in an SQLite database. Jinja2 is used as a template language [9] and Gunicorn is employed as Web Server Gateway Interface between the web-server and the web-application [10]. When a job is complete or users decide to delete files for their own analyses via the web-interface, the meta-information and result files are always updated simultaneously.
Figure 3. The software framework behind BSA4Yeast. The web-application uses Flask and Nginx on a virtual machine, as well as Gunicorn as a Web Server Gateway Interface. Celery is employed as an asynchronous task queue/job queue system, and Redis as a message broker. The metadata for the output files is recorded in an SQLite database.
To allow users to work either anonymously or, optionally, via a registered account that saves results for future analyses, the application was extended with an authorization function. The overall web-application is deployed with an Nginx server [11] on a dedicated virtual machine (operating system: CentOS 7.2, specifications: 16 GB RAM / 8 cores).
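The periodic clean-up of result files and the requirement that files and metadata stay in sync can be illustrated with a short sketch using only the Python standard library. The table name `result_files`, its columns and all function names are hypothetical and chosen for illustration; the actual BSA4Yeast schema and cron script are not described in the text. The sketch is written for Python 3.

```python
import os
import sqlite3
import time

ONE_WEEK = 7 * 24 * 3600  # minimum retention time in seconds


def init_db(conn):
    """Create the (hypothetical) metadata table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS result_files "
        "(path TEXT PRIMARY KEY, file_type TEXT, created REAL)"
    )


def register_file(conn, path, file_type, created=None):
    """Record the metadata of a newly written result file."""
    created = time.time() if created is None else created
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO result_files VALUES (?, ?, ?)",
            (path, file_type, created),
        )


def purge_expired(conn, now=None, max_age=ONE_WEEK):
    """Delete result files older than max_age together with their metadata rows."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT path FROM result_files WHERE created < ?", (now - max_age,)
    ).fetchall()
    removed = []
    for (path,) in rows:
        if os.path.exists(path):
            os.remove(path)
        with conn:
            conn.execute("DELETE FROM result_files WHERE path = ?", (path,))
        removed.append(path)
    return removed
```

A weekly cron entry could call `purge_expired` once per run; because each file and its database row are removed in the same loop iteration, the database does not keep entries for files that no longer exist.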

Tabular and visual inspection of QTL results
In order to provide an interactive and intuitive exploration of genomic data in the web-browser, the BSA4Yeast graphical interface uses the libraries Bootstrap, jQuery, DataFrame.js and Highcharts.js [12,13,14,15]. The visualization and exploration of large genomic datasets is challenging in both tabular and graphical formats. Therefore, to display large tables the dedicated library DataFrame.js is used, providing an immutable data structure supporting fast SQL queries. Moreover, we use server-side pre-processing to display a requested page, reducing the client-side computational burden. Apart from the fast retrieval of genomic information in tabular format, the annotation table is interlinked with an external yeast gene database (SGD: https://www.yeastgenome.org), allowing the user to explore known functions for genes of interest. Visual representations of the BSA4Yeast analysis results using QTL plots and allele frequency plots can be explored dynamically using jQuery and Highcharts.js (see example in Fig. 4). Finally, pie charts for different types of mutations in each parental line can be displayed to compare their genomic diversity (see Fig. 5).
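The idea of server-side pre-processing for large annotation tables can be sketched as a simple paging helper: the server slices the full table and sends only the requested page to the browser. The function name, the default page size and the returned dictionary layout are assumptions made for illustration and do not reflect the actual BSA4Yeast endpoints.

```python
def paginate(rows, page, page_size=25):
    """Return one page of `rows` plus the counters a table widget needs."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    total = len(rows)
    pages = max(1, -(-total // page_size))  # ceiling division, at least one page
    start = (page - 1) * page_size
    return {
        "page": page,
        "pages": pages,
        "total": total,
        "rows": rows[start:start + page_size],  # empty list if page is out of range
    }
```

With this approach the client only ever holds one page of annotation rows in memory, regardless of how many variants the analysis produced.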

Example application
In a first proof-of-concept study, the BSA4Yeast analysis framework was applied successfully to investigate cellular aging in baker's yeast (S. cerevisiae), detecting two significant QTLs associated with chronological life span regulation [2]. Specifically, a DNA sequencing dataset consisting of paired-end (PE) sequenced parental strains and single-end (SE) sequenced segregant bulks was investigated with the software. The web-application was applied to three types of input data from yeast BSA-based QTL studies, representing different experimental designs, and two different types of sequencing methods (PE and SE). Summary statistics for the three input files used for the example analyses are shown in Table 1. With this data, the BSA4Yeast software can in each setting automatically recompute the previously published QTL results [2]. Representative runtimes for the example input files are ∼1 min, ∼1.5 hours and ∼3.5 hours for .map, .bam and .fastq files, respectively. Example parameter settings for fastq analysis are shown in Table 2 (further example settings for other file types are provided on the BSA4Yeast web-site). The resulting annotation table, QTL plot and allele frequency plot for the example analysis are shown in Fig. 4. Fig. 5 additionally displays the QTL region annotation, the G' statistics for each chromosome and the summarized mutation types for the two parental lines. Since the parental DNA data is not available when using map files as input, only the QTL coordinates can be obtained for this input type, whereas the full annotations are generated for bam-file analyses. All of the example datasets are available on the BSA4Yeast server for downloading and testing (https://bsa4yeast.lcsb.uni.lu/).

Discussion and possible future extensions
Bulk Segregant Analysis (BSA) based QTL mapping using next-generation sequencing technologies is a valuable new approach to identify genes associated with a phenotype of interest. However, the complexity of the software tools, parameter settings and the underlying algorithms used to process the data may prevent a wider application of the computational methods developed for this purpose. Moreover, setting up a comprehensive and efficient analysis pipeline is a laborious, error-prone and time-consuming task, which requires prior experience in bioinformatics. To address these problems, and to speed up and greatly facilitate bulk segregant analyses for yeast DNA sequencing data, BSA4Yeast was developed as a dedicated web-application for BSA-QTL analysis. Instead of the conventional approach for QTL mapping, which investigates the allele frequency distribution across the chromosomes, BSA4Yeast uses a variant of the G-statistic [3], which provides multiple advantages over classical allele frequency analyses. Firstly, the G-statistic is expected to decrease more rapidly around the causal site, providing narrow QTL candidate intervals; and secondly, the G-statistic takes into account the strength of the evidence, which is estimated using the sample size. However, certain characteristics of the G-statistic can also complicate analyses, e.g. the variance in read depth strongly influences the variance of the G-statistic over small spatial scales. The G'-statistic, a smoothed version of the G-statistic previously developed by Magwene et al. [3], is designed to address this limitation and provides a robust framework to analyze BSA sequencing data.
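For reference, the two statistics can be written out as follows. This is a sketch of the definitions in [3]; the symbols ($n_i$, $\hat{n}_i$, $D_j$, $W$) and the tricube form of the kernel are stated here as assumptions about that formulation, not as the exact expressions implemented in BSA4Yeast. Here $n_1, n_2$ denote the read counts of the two parental alleles in one bulk, $n_3, n_4$ the corresponding counts in the other bulk, $\hat{n}_i$ the counts expected if allele and bulk were independent, $x_j$ the position of marker $j$, and $W$ the width of the smoothing window centred on position $x$ (the sum in $G'$ runs over markers with $D_j < 1$).

```latex
G = 2 \sum_{i=1}^{4} n_i \ln\!\left(\frac{n_i}{\hat{n}_i}\right),
\qquad
\hat{n}_1 = \frac{(n_1 + n_2)(n_1 + n_3)}{n_1 + n_2 + n_3 + n_4}
\quad \text{(and analogously for } \hat{n}_2, \hat{n}_3, \hat{n}_4\text{)}

G'(x) = \sum_{j} k_j \, G_j,
\qquad
k_j = \frac{(1 - D_j^3)^3}{\sum_{l} (1 - D_l^3)^3},
\qquad
D_j = \frac{|x_j - x|}{W/2}
```

At fixed allele proportions, $G$ grows linearly with the total read count, so differences in depth between neighbouring markers translate directly into fluctuations of $G$; averaging over a window with weights $k_j$ dampens these fluctuations.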
The G'-statistic is computed in an automated fashion within BSA4Yeast and has been employed successfully for several biological applications, e.g. to identify genes involved in yeast biofilm formation or chronological aging [2,16]. Apart from implementing the G'-statistic for robust QTL analysis, BSA4Yeast aims at addressing some of the main hurdles in BSA-based QTL analyses discussed above, both by automating and improving the efficiency of the workflow, and by facilitating the design of experiment and configuration prior to the analysis, as well as the post-analysis data interpretation, in particular for users with limited prior domain knowledge in bioinformatics. Moreover, web-based workflow implementations not only have advantages over classical software package installations in terms of simplicity of usage, but also offer platform independence (BSA4Yeast runs on any operating system that supports modern web-browsers) and fully reproducible analyses, independent of software updates on the client's computer. Finally, the optional password-protected access to a user account enables users to access data and results from different locations and to share them with trusted collaborators.
Since the BSA4Yeast workflow is implemented in a modular fashion, it can be extended and adjusted, e.g. to cover further annotations and reference genomes for other model organisms used to perform linkage QTL studies (e.g. fruit flies or mice). Moreover, the software can be interlinked with other public internet databases and repositories, which contain further information on identified genes with a phenotype association of interest. The BSA4Yeast source code has been made available on GitLab (https://git-r3lab.uni.lu/zhi.zhang/bsa4yeast) to allow other users to explore, modify or further extend the software.