Reference-free detection of isolated SNPs

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.

• Bacterial Escherichia coli K-12, MG1655 strain, ≈ 4.6 million base pairs. From this reference sequence, we generated 30 Escherichia coli individuals by simulating SNPs based on a site frequency spectrum pattern, i.e. most SNPs are in one sample, half as many in two samples, a third in three samples, and so on. More precisely we introduced X i = max( S i2 i ,1) SNPs occurring in i of the 30 genomes (for i in [1,29]), with S being the total number of sites that were mutated, i.e. 69,600 sites in our case. SNPs were distributed uniformly along the genomes.
We then used our own read simulator, Mutareads, to simulate an Illumina sequencing for the two human * To whom correspondence should be adressed : ruricaru@labri.fr, claire.lemaitre@inria.fr and pierre.peterlongo@inria.fr diploids, and for the 30 bacterial individuals. The sequencing simulation was carried out by sampling equal-length reads (100 bp) from each sequence, with a uniform probability distribution, on a 40x coverage basis (2x20x for the diploid individuals). Substitution errors were uniformly distributed along each read with a fixed probability (1%). DISCOSNP, as well as the other reference-free SNP calling tools, were run on the following datasets: one diploid individual (HG00096), the two diploid individuals (HG00096 and HG00100), two haploids (among the 30 Escherichia coli individuals), three haploids and so on, up to 30.

Precision and recall computation
For the tests on simulated data, both in the main paper and in this file, we provide recall and precision measures. For this purpose, we produce a multi-fasta file for each simulated dataset, ref snps.f a. These files are formatted as the DISCOSNP output and contain all the isolated SNPs among the complete set of generated SNPs for the given dataset. A SNP is considered as isolated if among the whole subset of considered simulated genomes, no other SNPs are simulated in the k −1 positions before and after the SNPs locus. They will be subsequently used as exhaustive and exact reference lists to compute precision and recall for each dataset.
More specifically, a ref snps.f a file contains pairs of sequences, where each pair represents an isolated SNP (one sequence corresponds to one path of the SNP, and the second corresponds to the other path). Every such sequence (or path) has a 2k −1 length, where the first k −1 and the last k −1 characters are identical between the two paths of a SNP, while the two characters placed on the k th position correspond to the mutation. Predicted SNPs are then mapped to these reference SNP sequences using GASSST (1). This enables to validate the predicted SNPs, i.e. the 2k −1 sequences corresponding c 2014 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. to the SNPs detected by DISCOSNP (or by any other tool) are mapped on the reference SNP sequences; a predicted SNP is validated if both its paths map perfectly and entirely (with 100% identity) the two paths of one of the SNPs present in this ref snps.f a file.
Precision and recall are then computed as follows. The results of the mapping enable to compute the true positives (T P ), i.e. predicted SNPs whose both paths are perfectly mapped on two paths of the ref snps.f a file, and the false positives (F P ), i.e. the other predicted SNPs. We call false negatives (F N ) the non mapped SNPs from the reference list.
Finally, the precision is computed as the number of T P divided by the total number of SNPs found by DISCOSNP, while the recall is given as the number of T P divided by the total number of SNPs in the reference list.
In the case when more than two genomes are compared and some branchings are allowed in the bubbles, some nonisolated SNPs can be found by DISCOSNP. This happens when there exists at least two genomes (or individuals) among the compared ones with distinct genotypes for the considered SNPs but the same genotype for the other close SNPs. Such a SNP can be detected by DISCOSNP, but as we focus on isolated SNPs only, it would be considered as a false positive by our evaluation process. To compute more relevant precision values, only predictions that do not correspond to any simulated SNP (isolated or not) were considered as false positives. This was used in the main paper for the two human (chromosome 1) diploids experiment when using DISCOSNP or BUBBLEPARSE with parameters allowing for some branchings inside the bubbles (parameters -b 1 and depth=1 respectively).  Table 1. Results obtained by various tools with various parameters on human chromosome 1 dataset composed of two individuals (described in Section "Data simulation"). Lines starting with were already indicated in the manuscript. * denotes that precision was computed by considering as false positives only bubbles that do not correspond to any simulated SNP (isolated or not). Note that c = 3 for BUBBLEPARSE and c = 4 for DISCOSNP are equivalent as they filter out data seen three times or less.

HUMAN FULL RESULTS
In the paper only best results for each tested tool are presented. The Table 1 presents the full obtained results. The hybrid strategy (SOAP+bowtie2+GATK) provides SNPs with low coverage, which mainly correspond to false positives. As DISCOSNP and other tested tools filter out low covered kmers (seen less than 4 times in each dataset), we decided to apply the same filtering for the hybrid strategy, i.e. filtering out SNPs covered less than four times. The command lines that were used for the different tools are indicated in Table 2.

Preparing the data
The 24 read sets were downloaded from the NCBI Sequence Read Archive (with the accession number SRA054922). Read pairs were separated using the fastq-dump command from sratoolkit, with the -split-3 option. The SRA read file names are detailed in Table 3. Read sets were prepared applying the protocol from the Kvitek study (2). The obtained coverages range from 348x to 1536x (see Table 4) depending on the experiment. We extracted the set of isolated SNPs among the validated ones in (2). We generated the set of associated bubbles in fasta format using the S288c reference sequences from the Saccharomyces Genome Database (http://www. yeastgenome.org/). These bubbles were used as a reference for estimating DISCOSNP recall. Running DISCOSNP DISCOSNP was run independently on the three populations E1, E2 and E3, with defaults parameters, except for the c value that was fixed to 11. The c = 11 value was chosen with respect to minimal k-mer coverage observed in the data (see Table 4). A first experiment was performed with -b 1 option and another experiment with -b 2. For each experiment, the 16 (8x2) read sets were used collectively. For instance the E1 experiment was performed using the following command line: ./run_discoSnp.
The first dataset contains two E. coli individuals simulated as explained in Section "Data simulation and evaluation protocols" (100bp long Illumina reads with uniformly distributed errors, for a 40x coverage).
The second dataset was simulated from the human chromosome 1 (hg19 assembly), by generating a mutated sequence in which only uniformly distributed homozygous SNPs were simulated at 0.1% SNP rate. Sequencing of both the reference and the mutated sequence was then simulated for a 50x coverage. Note that the simulations made on the human sequence are distinct from those presented in the paper, thus explaining why results are slightly different from those presented in the paper with same parameters.

Influence of the read simulation method
Knowing that SNP detection methods can be misled by sequencing errors, as they potentially generate false positives, we checked the robustness of DISCOSNP with respect to the read simulator. Besides Mutareads, five simulation tools were tested: Art (4), GemSim (5), Metasim (6), SimSeq (7), and WGSim from the Samtools package (8), each of them implementing a specific error profile. The six simulators were applied on the two E. coli individuals, as described in Section "Simulation protocol". Results are summarized in Table 5. These results show that, except for Art, DISCOSNP produces similar results regardless the simulation tool. Poor results in Art case are most certainly due to its high error rate, i.e. it simulates more than six errors per read on average, which is not consistent with what is happening in real data. Moreover, GemSim was, in average, 100 times slower than the other read simulators, thus making it unusable on large datasets such as human data.
For the human experiment, due the previously exposed reasons, we excluded Art and GemSim simulators. Once again,  Table 6 show that DISCOSNP produces similar results regardless the simulation method. As we showed that DISCOSNP results are independent of the read simulator that is used, we chose Mutareads to perform the simulation experiments presented in the paper and in this file, as it is the fastest among the tested simulators.

Influence of the sequencing depth
In order to estimate the effect of the read coverage on DISCOSNP results, we performed simulations with Mutareads on the human chromosome 1 sequence by using a range of coverage values. The results presented in Table 7 show that the read coverage influences the result quality. As expected, the higher the read coverage, the better the precision/recall results. However, these results suggest that even for coverage values as low as 20x, DISCOSNP calls 79.5% of SNPs, while maintaining a good precision (89.9%).

Influence of the SNP density
In order to estimate the influence of the SNP density on DISCOSNP results, we generated datasets with densities varying from 0.06% (based on (9)), to 0.1%, and to to an over-estimation of 1% of SNPs.
In the results presented in Table 8, precision varies from 85.8% to 98.3%. The higher the SNP density, the higher the precision. This reveals that the number of false positives is stable regardless the SNP simulation method (19088±2%), while the number of true positives grows linearly with the number of isolated SNPs.

Results when varying DISCOSNP parameters
Influence of the k value Any algorithmic method based on de Bruijn graphs is highly dependent on the k-mer size. In order to analyze the influence of the k-value on DISCOSNP results, we performed an experiment on human chromosome 1. The results that are shown in Figure 1 were produced with DISCOSNP for k-values going from 15 to 45. These results confirm that the k-value influences the quality of the results. For small k-values, due to the increasing number of branching k-mers, a larger number of branching bubbles are discarded, which leads to a low recall. Precision is also affected as there is an increasing number of bubbles that are generated by inexact repeats. Indeed, an inexact repeat of length 2k −1 generates a bubble, and the frequency of such repeats in the genomes, increases when k decreases. On the other hand, the recall decreases for large k values ( 37), as there are fewer reads that overlap on at least k characters.
With current NGS reads, a good trade-off value for k is ≈ 31. As presented in Figure 1, the best precision/recall is reached with k = 37. However, values that are larger or equal to 32 are usable at the cost of either larger RAM consumption or longer data structure creation times. This study also shows that, even if the k value influences DISCOSNP results, any choice of k between 25 and 39 provides high quality results (with precision varying from 88.1% to 92.2% and recall varying from 82.3% to 86.7%).
Influence of the minimal coverage value "discoSnp˙NAR3˙add˙file1" -2014/9/23 -12:09 -page 5 -#5 DISCOSNP offers the possibility to filter out k-mers whose number of occurrences in all read set is below a userdefined threshold, thus enabling to discard k-mers that are most probably due to sequencing errors. The results that are presented in Figure 2 are obtained with DISCOSNP (k = 31) on the 50x dataset simulated from human chromosome 1.
These results show that for large minimal coverage values (≥ 6), recall decreases due to low covered SNPs, while precision slightly decreases as the proportion of SNPs due to inexact repeats increases. With no filter (c = 1), sequencing errors are not filtered out, and a high ratio of the reported SNPs are due to sequencing errors. Moreover, recall is slightly poorer as the graph is more complex, and more bubbles are discarded by DISCOSNP (as they are branching).

RUNNING TIMES FOR THE MULTIPLE INDIVIDUALS STUDY
Here, we present the comparative running times of DISCO-SNP, CORTEX and the hybrid approach when analyzing an increasing number of read sets. The results, presented in Figure 3, were obtained in the study concerning the 30 haploid datasets simulated from E. coli. They show that running times grow linearly with respect to the number of individuals regardless the method, and that DISCOSNP runs faster than the two other methods.