Identification of genomic regions associated with a phenotype of interest is a fundamental step toward solving questions in biology and improving industrial research. Bulk segregant analysis (BSA) combined with high-throughput sequencing is a technique to efficiently identify these genomic regions associated with a trait of interest. However, distinguishing true from spuriously linked genomic regions and accurately delineating the genomic positions of these truly linked regions requires the use of complex statistical models currently implemented in software tools that are generally difficult to operate for non-expert users. To facilitate the exploration and analysis of data generated by bulked segregant analysis, we present EXPLoRA-web, a web service wrapped around our previously published algorithm EXPLoRA, which exploits linkage disequilibrium to increase the power and accuracy of quantitative trait loci identification in BSA analysis. EXPLoRA-web provides a user friendly interface that enables easy data upload and parallel processing of different parameter configurations. Results are provided graphically and as BED file and/or text file and the input is expected in widely used formats, enabling straightforward BSA data analysis. The web server is available at http://bioinformatics.intec.ugent.be/explora-web/.
Bulk segregant analysis (BSA) coupled with high-throughput sequencing is a widely used method to associate genomic regions to complex traits (1,2). Identifying these regions, commonly known as quantitative trait loci (QTL), is a fundamental step toward understanding the genomic component of phenotypic variation (3–6). BSA relies on crossing two parents which differ in a trait of interest, having one parent with a desired selectable trait. Segregants displaying the desired phenotype are pooled, the pooled DNA is extracted and subjected to pool sequencing. The sequencing data is mapped to a reference strain (usually the known parent not exhibiting the desired phenotype) and the allele frequency is calculated for all polymorphic marker sites. Genomic regions for which the allele frequency of the marker sites significantly deviates from random segregation are prioritized as QTL for the trait (7,8). Current availability of whole genome sequencing (WGS) provides the capacity to infer population allele frequencies for nearly every site that is polymorphic between the parental lines. Hence, the quality and length of the identified QTL is only affected by the number of crossovers during the development of the population which on its turn is determined by the recombination rate (9,10). BSA has been particularly useful in identifying trait-linked alleles in eukaryotes with small genomes such as yeast, mainly because the smaller number of base pairs per centimorgan reduces the number of candidate genes within each predicted QTL (11–13). However, this method has also been successfully applied in rice (14,15) and other plants (16,17).
To efficiently differentiate true from spuriously linking QTLs, several advanced BSA data analysis protocols have been proposed (1,4,18–20). Previous work has shown that one of the approaches that largely contributes to the power of BSA data analysis is to exploit the properties of linkage disequilibrium (LD) (10,21,22). The restricted number of recombination events cause proximal marker sites to be co-inherited, resulting in LD between neighboring marker sites: in a BSA set up, a causative mutation will therefore be embedded in a larger region of marker sites for which the variant counts all display a similar deviation from the allelic distribution one would expect if loci were not segregating with the trait. LD thus produces deviations of variant counts toward the alleles of the parent with the desired phenotype, not only at the genetic marker site(s) causative to the phenotype of interest, but also in genetic marker sites closely located to these causative marker sites. Although different algorithms for BSA exploiting LD have recently been implemented (10,21,22) in open source scripts and software tools, these tools remain difficult to operate for users.
Therefore we present EXPLoRA-web, an intuitive web server that facilitates users performing data analysis of BSA experiments. EXPLoRA-web is wrapped around our previously published BSA data analysis method that was shown to maximize power and accuracy in detecting QTLs by exploiting the properties of LD. The details of the Hidden Markov Model (HMM) and benchmarking with other tools are discussed in (10). The web service is available at http://bioinformatics.intec.ugent.be/explora-web/.
EXPLoRA web is accessible using any internet browser (Google Chrome, Microsoft Edge and Firefox, among others). The web server's help pages http://bioinformatics.intec.ugent.be/explora-web/help provide detailed guidelines on how to perform the analysis, tune the parameters and interpret the results. The EXPLoRA web server is freely accessible and does not require login, although an optional account can be created to have easy access to the results of previously analyzed experiments.
EXPLoRA, which is the algorithm implemented in this web service, navigates the variable sites of the genome using an HMM that calculates the probability of allele frequencies at each marker site to be emitted by two possible states: a phenotype-linked state and a neutral state. While markers linked to the phenotype are expected to show predominantly the allele of the parent with the desired phenotype, neutral markers are expected to show the alleles of the two parents at a ratio that reflects random segregation.
The effect of LD is modeled by the transition probabilities between two neighboring marker sites. The transition probability models the chance of a change of state between two neighbor sites. Its distribution is described by a negative exponential function of the recombination rate and the physical distance between neighboring marker sites following Haldane's recombination model (23). The model captures the fact that neighboring marker sites are likely to be in LD and hence the probability of a state change between them is small.
Emission probabilities of marker site states are modeled by two β binomial distributions, one for the emission probabilities in phenotype linked states and another for emission probabilities in neutral states. The β-distribution for the neutral states is automatically estimated from the read count data. The distribution describing the expected frequencies of the phenotype-linked variants is defined by the user selecting the α and β parameters. The ratio between α and β defines the degree to which the relative variant frequency at a marker site needs to differ from the one expected based on random segregation in order to be called ‘linked to the phenotype’.
For full description of the model and algorithmic details we refer to (10).
The data necessary to run EXPLoRA consists of the information on the experimental setup, the allele counts from the pooled sequencing and the selection of the method's parameters.
Information related to the experimental setup consists of the number of segregrants that were pooled prior to sequencing and an approximation of the average recombination rate of the organism under study. The latter is required to take into account the effect of LD between neighboring markers. Although we realize recombination rates are not constant across a genome, the average recombination works as a good approximation for our model: as we have shown in our previous work, small deviations from the recombination rate do not interfere with the performance of the algorithm in accurately detecting truly linked regions (10). Additional fields to name and describe the experiment are also provided (Figure 1A).
The allele count at the marker sites derived from the pooled DNA sequencing should be uploaded as a variant call format (VCF) file or as a simple tab delimited file in which rows correspond to the different marker sites (Figure 1B). For the simple text tab delimited file the columns correspond respectively to the chromosome at which the marker sites are located (marker chromosome), the genomic position of the marker sites (marker site position) and two columns describing the read counts containing respectively the total read count at a marker site and the alternative allele read count (which usually corresponds to the allele of the parent with the desired phenotype). VCF files containing these allele counts are produced from raw reads by analysis pipelines for high-throughput sequencing data such as NGSEP (24) and GATK (25), so the outputs of these pipelines can be directly uploaded to EXPLoRA-web. Alleles refer to polymorphisms relative to a reference genome and are thus defined when generating the VCF or counts file prior to using EXPLoRA.
Besides the data and experimental information, EXPLoRA also requires specifying the parameters that control the model. The main parameter to be selected is the α/β ratio that determines the shape of the β distribution that models the emission probability for the phenotype-linked states (Figure 1C). Changing the α/β ratio affects the probability with which an observed relative variant allele frequency is interpreted by the model as representative for a region linked to the trait of interest. Increasing the α/β ratio makes the identification of phenotype-linked regions more stringent, meaning that a higher deviation of the relative allele frequency from the one expected under random segregation is needed before the region is considered linked. The higher the α/β ratio, the less phenotype-linked markers are called and the smaller the size of the called regions: identifying only the most pronouncedly linked regions results in more precisely pinpointing the linked region. However, this comes at the expense of potentially missing some truly linked markers/regions (lower sensitivity). By default, the web server proposes three different α/β ratio's corresponding to different trade-offs between sensitivity and specificity. A region that is identified with a stringent threshold and that thus corresponds to the most reliable signal in the data will by definition also be identified by the less stringent α/β ratio. Providing the results obtained with different thresholds thus allows the user to assess the reliability of the predictions based on the stringency of the threshold at which the linked region was first detected. The effect of changes in the α/β ratio can be observed visually in the graphical interface.
The emission probability of markers to be in the neutral state (i.e. not linked to the trait of interest) is directly estimated from the count data uploaded by the user, hereby assuming that most of the markers are not linked to the phenotype and their count data thus display the allele frequency distribution expected under random segregation. The web server also displays the distribution inferred from the real counts, allowing the user to select a set of running parameters in agreement with the data to be analyzed.
The EXPLoRA web service determines per α/β ratio the posterior probability of each marker site to be linked to the phenotype of interest. Because of LD, neighboring markers on the chromosome will together display either high or low posterior probabilities. Consecutive sets of neighboring markers displaying high posterior probabilities are thus used to identify the QTL. The server displays the marker-specific posterior probability scores graphically (Figure 1D) along the chromosome for each α/β combination together with the observed allele frequencies at the variant sites and accordingly derives the genomic regions covered by the identified phenotype-linked markers. Notice that when parameter settings are chosen too stringent the resulting posterior probabilities of finding a site being linked to the phenotype will become zero for all marker sites and no results can be found. Results can be downloaded in two formats: (i) as a comma separated file, listing the posterior distributions per marker and parameter setting, and (ii) as a BED file containing the genomic information to identify the regions found to be linked to the phenotype. The BED file can be imported into other tools such as the Integrative Genomic Viewer (26).
The web service was implemented in Java and connects to a MySQL database for information storage. The software was built using the model view controller pattern implemented via Tapestry 5.2. The access to the database was achieved using Hibernate for object-relational mapping. The graphical interface was made using bootstrap, jQuery and D3js. The server runs on a 16-core, 64bit CentOS 6.2 system with 128GB of memory.
We implemented EXPLoRA web, a novel web service for the identification of QTL from whole genome resequencing data in BSA experiments. The web server relies on a highly accurate model developed previously (10) that maximizes the power to identify promising candidate genomic regions for further study of causal relationships with complex traits. EXPLoRA was compared and shown to have superior accuracy to other recently published command line executable methods for QTL analysis on BSA experiments (18,21). The use of standard formats such as the VCF allows users to upload directly the outputs of software pipelines to obtain allele counts and genotype calls from WGS data such as NGSEP (24) or GATK (25) as inputs for QTL analysis. With EXPLoRA we anticipate on the increasing use of BSA in combination with pooled sequencing both for fundamental and applied purposes and on the concomitant need for user friendly applications to facilitate its complex data analysis activities.
Ghent University Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’; Fonds Wetenschappelijk Onderzoek-Vlaanderen (FWO) [G.0329.09, 3G042813, G.0A53.15N]; Agentschap voor Innovatie door Wetenschap en Technologie (IWT) [NEMOA]; Katholieke Universiteit Leuven [PF/10/010] (NATAR); International Center for Tropical Agriculture (CIAT). Funding for open access charge: Ghent University Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’; Fonds Wetenschappelijk Onderzoek-Vlaanderen (FWO) [G.0329.09, 3G042813, G.0A53.15N].
Conflict of interest statement. None declared.