Abstract

Motivation: Next-generation sequencing techniques have facilitated a large-scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals.

Results: Here, we present a novel internal segment size based approach, which organizes all reads, including concordant ones, into a read alignment graph, where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance, in particular, for deletions or insertions (indels) of length 20–100 nt. This size range has previously been identified as a remaining major challenge in structural variation discovery, in particular for insert size based approaches. In this size range, we even outperform split-read aligners. We also achieve competitive results on biological data, where our method is the only one to make a substantial number of correct predictions, which, additionally, are disjoint from those made by split-read aligners.

Availability: CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com.

Contact:  [email protected] or [email protected]

Supplementary information:  Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The International HapMap Consortium (2005) and The 1000 Genomes Project Consortium (2010) have, through globally concerted efforts, provided the first systematic view on the gamut and prevalence of human genetic variation, including larger genomic rearrangements. A staggering 8% of the general human population carry copy number variants (CNVs) affecting regions larger than 500 kb (Itsara et al., 2009). The technology enabling this advance was next-generation sequencing and the reductions in cost and increases in sequencing speed it brought along (Bentley et al., 2008; Eid et al., 2009). The analysis of structural variation, however, has not kept up with the advances in sequencing insofar as genotyping of human structural variation has not yet become a routine procedure (Alkan et al., 2011). Indeed, it is likely that existing datasets contain structural variations that cannot be discovered by current methods. These limitations are likewise an obstacle to personalized genomics.

Here, we target deletions or insertions (indels) between 20 and 50 000 bp. In particular, the discovery of indels smaller than 500 bp is still challenging (Alkan et al., 2011; Mills et al., 2011), even in non-repetitive areas of the genome. That the majority of structural variants resides in repetitive areas complicates the problem further due to the resulting read-mapping ambiguities.

Categorization of our and prior work. A (paired-end) read is a fragment of DNA of which both ends have been sequenced. We refer to the sequenced ends of the read as (read) ends and to the unsequenced part of the fragment between the two ends as internal segment or insert. An alignment A of a paired-end read is a pair of alignments of both ends. We say that a read has been multiply mapped if it aligns at several locations in the reference genome and uniquely mapped in case of only one alignment. Existing approaches for structural variant discovery can be classified into three broad classes: first, those based on read alignment coverage, that is, the number of read ends mapping to a location (Abyzov et al., 2011; Alkan et al., 2009; Campbell et al., 2008; Chiang et al., 2009; Sudmant et al., 2010; Yoon et al., 2009); second, those analyzing the paired-end read internal segment size (Chen et al., 2009; Hormozdiari et al., 2009; Korbel et al., 2009; Lee et al., 2009; Quinlan et al., 2010; Sindi et al., 2009); and third, split-read alignments (Mills et al., 2006; Ye et al., 2009). Refer to Medvedev et al. (2009) as well as to Alkan et al. (2011) for reviews. A major difference is that the first two classes align short reads with standard read mappers, such as BWA (Li and Durbin, 2009), mrFAST and mrsFAST (Alkan et al., 2009; Hach et al., 2010) and Bowtie (Langmead et al., 2009), whereas split-read aligners compute custom alignments that span breakpoints of putative insertions and deletions. Split-read aligners usually have advantages over insert size based approaches on smaller indels while performing worse in predicting larger indels.

It is common to many library protocols that the internal segment size follows a normal distribution with machine- and protocol-specific mean $\mu$ and standard deviation $\sigma$. As a side remark, we point out that our approach does not depend on this assumption and also offers the user arbitrary internal segment size distributions (which may result, for example, from preparing libraries without a size selection step). One commonly defines concordant and discordant alignments: an alignment with interval length $I(A)$ (see Fig. 1) is concordant iff $|I(A) - \mu| \le K\sigma$ and discordant otherwise. The constant $K$ can vary among the different approaches. A concordant read is defined to concordantly align with the reference genome, that is, it gives rise to at least one concordant alignment.
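To make the concordance criterion concrete, here is a minimal Python sketch that classifies a single alignment, assuming a normal insert size distribution; the function name and the example values (K = 3, mean 112, standard deviation 15) are illustrative choices of ours, not parameters prescribed by CLEVER.

```python
def is_concordant(interval_length: float, mu: float, sigma: float, k: float = 3.0) -> bool:
    """Concordant iff the interval length deviates from the expected insert
    size mean by at most k standard deviations."""
    return abs(interval_length - mu) <= k * sigma

# Example values (chosen for illustration only): mu = 112, sigma = 15
print(is_concordant(120, mu=112, sigma=15))   # True: within 3 sigma
print(is_concordant(250, mu=112, sigma=15))   # False: likely spans a deletion
```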

Fig. 1. Left panel: two read alignments. Assuming $I(A) > \mu > I(B)$, where $\mu$ is the mean of the true insert size distribution, alignment A is likely to indicate a deletion, while alignment B may indicate an insertion. Right panel: read alignment graph for seven closely located read alignments; not all of them form a single clique. Assuming that all alignments have equal weight, some of the max-cliques are more likely to indicate a deletion than others through a hypothesis test as in Equations (3) and (2); not all cliques are marked in the figure. See Figure 2 for the definition of edges.

With only one exception (Lee et al., 2009, MoDIL), all prior approaches discard concordant reads. In this article, we present the clique-enumerating variant finder (CLEVER), a novel insert size based approach that takes all reads, including concordant ones, into consideration. Although a single discordant read is significantly likely to testify to the existence of a structural variant, a single concordant read conveys only a weak variant signal, if any. Ensembles of consistent concordant alignments, however, can provide significant evidence of usually smaller variants. The major motivation of this study is to systematically take advantage of such groups of alignments so as not to miss any significant variant signal among concordant reads.

We employ a statistical framework, which addresses deviations in insert size, alignment quality, multiply mapped reads and coverage fluctuations in a principled manner. As a result, our approach outperforms all prior insert size based approaches on both simulated and biological data and also compares favorably with two state-of-the-art split-read aligners. Beyond its favorable overall results, our tool is the only one to predict a substantial number of correct indels missed by all other tools (e.g. more than 20% of true deletions of 20–49 bp in the simulated data). Overall, CLEVER's correct calls beneficially complement those of the split-read aligner considered (Ye et al., 2009, PINDEL).

Moreover, we need ~8 h on a single CPU for a 30× coverage whole-genome dataset with ~1 billion reads, which compares favorably with the estimated 7000 CPU hours needed by MoDIL, the only method that also takes all reads into consideration.

1.1 Approach and related work

1.1.1 Graph-based framework

Our approach is based on organizing all read alignments into a read alignment graph, whose nodes are the alignments and whose edges reflect that the reads behind two overlapping alignments are, in rigorous statistical terms, likely to stem from the same allele. Accordingly, maximal cliques (max-cliques) reflect maximal consistent groups of alignments that are likely to stem from the same location in a donor allele. Because we do not discard alignments, the number of nodes in our read alignment graph is large; we solve instances with more than $10^9$ nodes. We determine all max-cliques in this graph by means of a specifically engineered, fast algorithmic procedure.

The idea to group alignments into location-specific, consistent ensembles, such as max-cliques here, is not new. In fact, it has been employed in the vast majority of previous insert size based approaches. We briefly discuss related concepts of the three most closely related approaches by Hormozdiari et al. (2009, VariationHunter [VH]), Sindi et al. (2009, GASV) and Quinlan et al. (2010, HYDRA). Although not framing it in rigorous statistical terms, HYDRA is precisely based on the same concept of max-clique as our approach. After constructing the read alignment graph from discordant reads alone, they employ a heuristic algorithm to find max-cliques. Because no theoretical guarantee is given, it remains unclear whether HYDRA enumerates them all. The definition of a ‘valid cluster’ in VH (Hormozdiari et al., 2009) relaxes our definition of a clique in a subtle, but decisive aspect. As a consequence, each of our max-cliques forms a valid cluster, but the opposite is not necessarily true. The reduction in assumptions, however, allows VH to compute valid clusters as max-cliques in interval graphs in a nested fashion, which yields a polynomial run-time algorithm. Sindi et al. (2009, GASV) use a geometrically motivated definition that allows application of an efficient plane-sweep style algorithm. A closer look reveals that each geometric arrangement of alignments inferred by GASV constitutes a max-clique in our sense, but not necessarily vice versa, even if a max-clique is formed by only discordant read alignments. We recall that GASV, HYDRA and VH do not consider concordant read data and hence consider read alignment graphs of much reduced sizes.

Finding max-cliques is NP-hard in general graphs. Based on the observation that the read alignment graph we consider still largely resembles an interval graph, we provide a specifically engineered routine that computes and tests all max-cliques in reasonable time (about 1 h on a current eight-core machine for a whole human genome sequenced to 30× coverage), even though we do not discard any reads.

1.1.2 Significance evaluation

Commonly concordant and discordant reads: Testing whether $|I(A) - \mu| \le K\sigma$, to determine whether a single alignment is concordant, is equivalent to performing a Z-test at significance level $2(1 - \Phi(K))$, where $\Phi$ is the standard normal distribution function. However, when determining whether m consistent alignments (such as a clique of size m) with mean interval length $\bar I$ are commonly concordant, a Z-test for a sample of size m is required, which translates to
$$|\bar I - \mu| \;\le\; K\,\frac{\sigma}{\sqrt{m}}. \qquad (1)$$
Due to the factor $1/\sqrt{m}$, already smaller deviations $|\bar I - \mu|$ turn out to render the alignments commonly discordant. In our approach, we rigorously expand on this idea. Roughly speaking, each max-clique undergoes an Inequality (1)-like hypothesis test.
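As a small illustration of Inequality (1) (our own sketch, with invented parameter values), the following Python snippet shows how the tolerated deviation shrinks by $1/\sqrt{m}$ as the number of consistent alignments grows.

```python
import math

def commonly_concordant(mean_interval: float, m: int, mu: float, sigma: float,
                        k: float = 3.0) -> bool:
    """Z-test for a sample of size m: the tolerated deviation shrinks by 1/sqrt(m)."""
    return abs(mean_interval - mu) <= k * sigma / math.sqrt(m)

# With mu = 112 and sigma = 15 (illustrative values), a mean deviation of 20 bp
# is tolerated for a single alignment ...
print(commonly_concordant(132, m=1, mu=112, sigma=15))   # True
# ... but the same mean deviation over nine consistent alignments is significant.
print(commonly_concordant(132, m=9, mu=112, sigma=15))   # False
```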

Multiply mapped reads: Although we approach the idea of not 'overusing' multiply mapped reads in an essentially different fashion, our routine serves a purpose analogous to that of the set-cover routines of VH and HYDRA. The difference is that we statistically control read-mapping ambiguity but do not aim at resolving it.

Following Li et al. (2008), we compute each alignment's probability of being correctly placed. In case of a max-clique consisting of alignments $A_1, \dots, A_m$ (all from different reads) with probabilities $p_1, \dots, p_m$ of being correct, let $E_S$, for $S \subseteq \{1, \dots, m\}$, be the event that precisely the alignments $A_i$ with $i \in S$ are correct. We compute $P(E_S) = \prod_{i \in S} p_i \prod_{i \notin S} (1 - p_i)$. Let $H_0$ be the null hypothesis that the allele in question (we recall that max-cliques just represent groups of alignments likely to be from the same allele) coincides with the reference genome. In correspondence to Inequality (1), we compute
$$p_S \;=\; 2\left(1 - \Phi\!\left(\sqrt{|S|}\,\frac{|\bar I_S - \mu|}{\sigma}\right)\right) \qquad (2)$$
with $\bar I_S$ the mean interval length of the alignments $A_i$, $i \in S$, which is the probability of observing $\bar I_S$ when assuming the null hypothesis, given $E_S$. We further compute
$$p(\mathcal{C}) \;=\; \sum_{S \subseteq \{1, \dots, m\}} P(E_S)\, p_S \qquad (3)$$
as the probability that max-clique $\mathcal{C}$ does not support an indel variant. We further correct $p(\mathcal{C})$ with a local Bonferroni factor to adjust for coverage-mediated fluctuations in the number of implicitly performed tests. If the corrected $p(\mathcal{C})$ is significantly small, it is likely that (at least) one allele in the donor is affected by an indel at that location. See Section 2 for details. In a last step, we apply the Benjamini–Hochberg procedure to correct for multiple hypothesis testing overall. Note that, among the prior approaches, only MoDIL (Lee et al., 2009) attempts to correct for multiple hypothesis testing (also using Benjamini–Hochberg), although many others either explicitly (e.g. Chen et al., 2009) or implicitly (e.g. Hormozdiari et al., 2009; Korbel et al., 2009; Quinlan et al., 2010) perform multiple hypothesis tests.

Among the statistically motivated approaches, Lee et al. (2009), after clustering, use Kolmogorov–Smirnov tests in combination with bimodality assumptions, whereas Chen et al. (2009) both measure deviations from Poisson-distribution-based assumptions (BreakDancerMax) and use Kolmogorov–Smirnov tests (BreakDancerMin) to discover copy number changes.

2 METHODS

2.1 Notations, definitions and background

2.1.1 Reads and read alignments

Let $\mathcal{R}$ be a set of paired-end reads stemming from a donor (genome), which have been aligned against a reference (genome). We write A for a paired-end alignment, that is, a pair of alignments of the two ends of a read (Fig. 1), and $\mathcal{A}(R)$ for the set of correctly oriented alignments that belong to read R. We neglect incorrectly oriented alignments and write $\mathcal{A} = \bigcup_{R \in \mathcal{R}} \mathcal{A}(R)$ for the set of all alignments we consider. We assume that $\mathcal{A}(R) \neq \emptyset$ for every $R \in \mathcal{R}$; that is, each read we consider gives rise to at least one well-oriented alignment. We do not discard any reads.

We write $l(A)$ for the rightmost position of the left end and $r(A)$ for the leftmost position of the right end. We write $[l(A), r(A)]$ and call this the interval of alignment A (in slight abuse of notation: intervals here contain only integers) and $I(A)$ for the (alignment) interval length. When referring to alignment intervals, we sometimes call $l(A)$ and $r(A)$ the left and right endpoints. See Figure 1 for illustrations.
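For concreteness, a minimal Python representation of a paired-end alignment and its interval could look as follows; the class and field names are our own illustration, not CLEVER's data structures, and the exact endpoint convention for the interval length is a simplifying assumption.

```python
from dataclasses import dataclass

@dataclass
class PairedAlignment:
    read_id: str
    left_end: int    # l(A): rightmost aligned position of the left read end
    right_end: int   # r(A): leftmost aligned position of the right read end

    @property
    def interval_length(self) -> int:
        """I(A): length of the internal segment implied by the alignment."""
        return self.right_end - self.left_end

a = PairedAlignment("read_1", left_end=1000, right_end=1112)
print(a.interval_length)  # 112
```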

2.1.2 Internal segment size statistics

We write I(R) for the internal segment (or insert) size of paired-end read R, i.e. the distance between the 3′ ends, the inner ends of the sequenced reads. Note that the distance between the 5′ outer ends is an equally common definition of insert size in the literature. In the datasets treated here, I(R) can be assumed to be normally distributed with a given mean $\mu$ and standard deviation $\sigma$ (Hormozdiari et al., 2009; Lee et al., 2009; Li and Durbin, 2009; Li et al., 2008), i.e. $I(R) \sim N(\mu, \sigma^2)$. Estimating the mean $\mu$ and standard deviation $\sigma$ from the alignments A of reads R poses the challenge that statistics on the alignment insert size (the interval length I(A)) do not immediately reflect statistics on I(R), because alignment insert size statistics already reflect the structural variants in the dataset. As a result, statistics on I(A) are fat-tailed and multimodal, even if library protocols determine statistics on I(R) to be normal. Here, we rely on robust estimation routines, as implemented by BWA (Li and Durbin, 2009). Note that, in general, our approach can deal with arbitrary internal segment size statistics.
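The following Python sketch shows one simple robust estimation strategy (median and scaled median absolute deviation) for $\mu$ and $\sigma$; it only conveys the idea of down-weighting the fat tails caused by variants and is not BWA's actual routine.

```python
import statistics

def robust_insert_size_stats(interval_lengths):
    """Estimate insert size mean and standard deviation robustly, i.e. largely
    unaffected by the fat tails that structural variants induce."""
    med = statistics.median(interval_lengths)
    # median absolute deviation, scaled to be consistent with a normal sigma
    mad = statistics.median(abs(x - med) for x in interval_lengths)
    return med, 1.4826 * mad

# Mostly concordant alignments plus a few alignments spanning deletions
lengths = [110, 108, 115, 112, 109, 111, 113, 400, 390, 107]
mu_hat, sigma_hat = robust_insert_size_stats(lengths)
print(mu_hat, sigma_hat)
```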

2.1.3 Alignment scores and probabilities

As described by Li et al. (2008), we determine $s(A) = \prod_j 10^{-q_j/10}$, where j runs over all mismatches in both read ends and $q_j$ is the Phred score for position j, i.e. $10^{-q_j/10}$ is the probability that the nucleotide at position j reflects a sequencing error. Hence, $s(A)$ is the probability that the substitutions in alignment A are due to sequencing errors. The greater $s(A)$, the more likely it is that A is correct, so $s(A)$ serves as a statistical quality assessment of A. Note that neglecting single-nucleotide polymorphism (SNP) rates and indels reflects common practice (Li and Durbin, 2009; Li et al., 2008), which is justified as, in Illumina reads, substitution error rates are higher than SNP rates, indel sequencing error rates and deletion/insertion polymorphism (DIP) rates by orders of magnitude (Bravo and Irizarry, 2010; Albers et al., 2011).

Following Li et al. (2008) and Li and Durbin (2009), we integrate the empirical interval length distribution $\Pr(I(A))$ into an overall score $s(A)\,\Pr(I(A))$ and obtain the probability that A is the correct alignment for its read by application of Bayes' formula:
$$p(A) \;=\; \frac{s(A)\,\Pr(I(A))}{\sum_{A' \in \mathcal{A}(R)} s(A')\,\Pr(I(A'))}. \qquad (4)$$
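A brute-force Python sketch of this scoring, under our reading of Equation (4) (Phred-based mismatch probabilities combined with the insert size density and normalized over all alignments of the same read); the function and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def alignment_probability(mismatch_phreds, interval_length, alternatives, mu, sigma):
    """Posterior probability that this alignment is the correct one for its read.

    alternatives: (mismatch_phreds, interval_length) for every alignment of the
    same read, including the one being scored.
    """
    density = NormalDist(mu, sigma).pdf

    def score(phreds, length):
        # probability that all substitutions are sequencing errors,
        # weighted by the insert size density of the observed interval length
        p_subst = math.prod(10 ** (-q / 10) for q in phreds)
        return p_subst * density(length)

    total = sum(score(p, l) for p, l in alternatives)
    return score(mismatch_phreds, interval_length) / total

# Two candidate alignments of one read (invented Phred scores and lengths)
alts = [([30, 25], 115), ([20, 20, 20], 300)]
print(alignment_probability([30, 25], 115, alts, mu=112, sigma=15))
```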

2.1.4 The read alignment graph

We arrange all scored read alignments in the form of an undirected, weighted graph $G = (V, E)$ whose node set V consists of the alignments in $\mathcal{A}$. Because we identify nodes with read alignments, we use these terms interchangeably. We draw an edge between two alignments $A, B \in V$ if we cannot reject the hypothesis that, in case they are both correct, their reads can stem from the same allele; see the subsequent paragraph for details. The weight function $w: V \to [0, 1]$ is defined by $w(A) = p(A)$, the probability from Equation (4). We further label each node A with the read it stems from, that is, with the unique read R such that $A \in \mathcal{A}(R)$.

As usual, we write $\deg(A)$ for the degree of node A. A clique $\mathcal{C} \subseteq V$ is defined as a subset of mutually connected nodes, i.e., $\{A, B\} \in E$ for all distinct $A, B \in \mathcal{C}$. A max-clique $\mathcal{C}$ is a clique such that for every node $B \notin \mathcal{C}$ there is an $A \in \mathcal{C}$ with $\{A, B\} \notin E$. Note that, by our definition of edges, a clique is a group of alignments that can be jointly assumed to be associated with the same allele or, in other words, to jointly support the same local phenomenon in the donor genome. Max-cliques are obviously particularly interesting: although all alignments in the clique are likely to support the same local phenomenon, joining any other overlapping alignment may lead to conflicts.

2.1.5 Edge computation

See Figure 2 for illustrations of the following. Let A and B be two alignments and let X be a $N(0, 2\sigma^2)$-distributed random variable where, as above, $\mu$ and $\sigma^2$ are the mean and variance of the insert size distribution. Using the quantities $D(A, B)$, $O(A, B)$, $\bar I(A, B)$ and $U(A, B)$ defined in the list below, we draw an edge between alignments A and B in the read alignment graph iff the reads of A and B are different, $O(A, B) > 0$,
$$D(A, B) \;\le\; K\,\sqrt{2}\,\sigma \qquad (5)$$
and
$$U(A, B) \;\le\; \mu + K\,\frac{\sigma}{\sqrt{2}}. \qquad (6)$$
Inequality (5) is a two-sided two-sample Z-test (based on X) to measure statistically compatible insert sizes. Inequality (6) reflects a one-sided one-sample Z-test for statistically consistent overlap (Wasserman, 2004). If two alignments A, B with $O(A, B) > 0$ pass these tests, we have no reason to reject the hypothesis that the alignments are from the same allele, so we draw an edge (see the sketch following the list below).
  • $D(A, B) := |I(A) - I(B)|$ is the absolute difference of interval lengths.

  • $O(A, B)$ is the length of the common interval of A and B, where, in case the intervals of A and B overlap, we refer to all positions between $\max(l(A), l(B))$ and $\min(r(A), r(B))$ as their common interval.

  • $\bar I(A, B) := (I(A) + I(B))/2$ is the mean interval length.

  • $U(A, B) := \bar I(A, B) - O(A, B)$ is the difference of mean interval length and overlap. To motivate this quantity, note that, in case A and B overlap [hence, the length of their common interval $O(A, B)$ is positive] and are from the same allele, a deletion at that location can only take place in their common interval. If U(A, B) is large, then $\bar I(A, B)$ significantly deviates from $\mu$ and the common interval is not large enough to explain this by a large-enough deletion. Hence, it is unlikely that A and B are from the same allele.
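A minimal Python sketch of this edge test, following our reconstruction of Inequalities (5) and (6) above; the constant K and the exact thresholds are stated assumptions, not necessarily CLEVER's precise parameters.

```python
import math

def draw_edge(read_a, read_b, i_a, i_b, l_a, r_a, l_b, r_b, mu, sigma, k=3.0):
    """Decide whether alignments A and B are connected in the read alignment graph.

    i_a, i_b: interval lengths I(A), I(B); (l, r): interval endpoints.
    """
    if read_a == read_b:
        return False                                   # only different reads
    overlap = min(r_a, r_b) - max(l_a, l_b)            # O(A, B)
    if overlap <= 0:
        return False                                   # intervals must overlap
    d = abs(i_a - i_b)                                 # D(A, B)
    u = (i_a + i_b) / 2.0 - overlap                    # U(A, B)
    compatible_sizes = d <= k * math.sqrt(2) * sigma           # Inequality (5)
    consistent_overlap = u <= mu + k * sigma / math.sqrt(2)    # Inequality (6)
    return compatible_sizes and consistent_overlap

# Two alignments from different reads, both slightly long (invented coordinates)
print(draw_edge("r1", "r2", 180, 175, 1000, 1180, 1020, 1195, mu=112, sigma=15))
```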

Fig. 2. Four scenarios of two overlapping (paired-end) alignments A and B. In the read alignment graph, two alignments are connected by an edge if they are compatible, i.e. they support the same phenomenon. (1) Alignment A has an insert length of about the expected insert length $\mu$, suggesting that no variation is present, but alignment B has an insert length much larger than $\mu$, suggesting a deletion. Hence, A and B are not compatible. (2) Both alignments have similar insert lengths larger than $\mu$, both suggesting a deletion of size about $\bar I(A, B) - \mu$, but the overlap O(A, B) is too small to harbor a deletion of this size. Thus, they are incompatible. (3) Neither alignment suggests any variation; they are therefore compatible. (4) Similar to case (2), but now the overlap is large enough to contain the putative deletion.

2.2 CLEVER: algorithmic workflow

  1. Enumerating max-cliques: We compute all max-cliques in the read alignment graph.

  2. We assign two P-values, $p_{\mathrm{del}}(\mathcal{C})$ and $p_{\mathrm{ins}}(\mathcal{C})$, to each max-clique $\mathcal{C}$, which are the probabilities that the alignments participating in $\mathcal{C}$ do not commonly support a deletion or an insertion, respectively. So the lower $p_{\mathrm{del}}(\mathcal{C})$ or $p_{\mathrm{ins}}(\mathcal{C})$, the more likely it is that $\mathcal{C}$ supports a deletion or insertion, respectively.

  3. For the thus-computed P-values, we control the false discovery rate at 10% by applying the standard Benjamini–Hochberg procedure separately for insertions and deletions (see the sketch following this list). All cliques remaining after this step are deemed significant and processed further.

  4. Determining parameters: We parameterize a deletion D by its left breakpoint $B_D$ and its length $D_L$, which denotes that the $D_L$ reference nucleotides starting at position $B_D$ are missing in the donor. We parameterize an insertion I by its breakpoint $B_I$ and its length $I_L$, such that before position $B_I$ in the reference a sequence of length $I_L$ has been inserted in the donor. Depending on whether $\mathcal{C}$ represents a deletion or an insertion, we determine, with $\bar I(\mathcal{C})$ denoting the weighted mean interval length of the alignments in $\mathcal{C}$,
$$D_L = \bar I(\mathcal{C}) - \mu \qquad\text{respectively}\qquad I_L = \mu - \bar I(\mathcal{C}) \qquad (7)$$
as the length $D_L$ of the deletion, respectively $I_L$ of the insertion. We determine breakpoints $B_D$ or $B_I$ such that the predicted deletion or insertion sits right in the middle of the intersection of all internal segments of the alignments in $\mathcal{C}$.
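Step 3 uses the standard Benjamini–Hochberg procedure; the following generic Python sketch (not CLEVER's code, p-values invented) shows the rule applied to one list of candidate cliques.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return the indices of the p-values deemed significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    last_significant = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / m:
            last_significant = rank
    return sorted(order[:last_significant])

# Example p-values of five candidate cliques (invented); in the workflow this is
# applied separately to deletion and to insertion candidates.
print(benjamini_hochberg([0.001, 0.30, 0.012, 0.04, 0.75]))   # [0, 2, 3]
```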

2.2.1 Enumerating max-cliques

We identify nodes of the read alignment graph with the intervals of the corresponding alignments. We first sort the 2m endpoints of these intervals, $x_1 \le x_2 \le \dots \le x_{2m}$, in ascending order of their positions. We then scan this list from left to right, maintaining a set of active cliques that could potentially be extended by a subsequent interval; initially, this set is empty. If the current element $x_i$ of the list is a left endpoint, we extend the set of active cliques according to the following rules. For the sake of simplicity, let us assume that a unique interval starts at $x_i$, corresponding to a vertex A in the read alignment graph G. Let N(A) be the open neighborhood of A. If $\mathcal{C} \cap N(A) = \emptyset$ for all active cliques $\mathcal{C}$, we add the singleton clique $\{A\}$ to the set of active cliques. Otherwise, for each active clique $\mathcal{C}$,

  • if $\mathcal{C} \subseteq N(A)$, then set $\mathcal{C} := \mathcal{C} \cup \{A\}$, otherwise

  • if $\mathcal{C} \cap N(A) \neq \emptyset$, add $(\mathcal{C} \cap N(A)) \cup \{A\}$ to the set of active cliques.

Finally, duplicates and cliques that are subsets of other cliques are removed.

If the current element $x_i$ of the list is a right endpoint, we output all cliques that contain at least one interval ending at $x_i$. These cliques go out of scope and are thus maximal. We remove intervals ending at $x_i$ from the active cliques; cliques that become empty are removed from the set of active cliques.
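The following self-contained Python sketch implements the sweep described above on a toy example; it uses plain Python sets and brute-force duplicate/subset removal, whereas CLEVER's implementation is engineered for speed.

```python
def enumerate_max_cliques(intervals, neighbors):
    """Sweep over interval endpoints and report the maximal cliques of the graph.

    intervals: dict vertex -> (start, end) of its alignment interval.
    neighbors: dict vertex -> set of adjacent vertices (the read alignment graph).
    """
    events = sorted((pos, which, v)
                    for v, (start, end) in intervals.items()
                    for pos, which in ((start, 0), (end, 1)))
    active, candidates = [], []
    for pos, is_end, v in events:
        if not is_end:                                    # left endpoint of v
            nv = neighbors[v]
            grown = []
            for clique in active:
                if clique <= nv:
                    grown.append(clique | {v})            # extend whole clique
                else:
                    grown.append(clique)                  # keep old clique ...
                    if clique & nv:                       # ... and branch off
                        grown.append((clique & nv) | {v})
            if not any(clique & nv for clique in active):
                grown.append({v})                         # start a new clique
            # remove duplicates and cliques that are subsets of other cliques
            active = [c for i, c in enumerate(grown)
                      if not any(c < d for d in grown) and c not in grown[:i]]
        else:                                             # right endpoint of v
            candidates += [set(c) for c in active if v in c]
            active = [c - {v} for c in active if c - {v}]
    # keep each candidate only if it is not contained in another candidate
    maximal = []
    for c in candidates:
        if not any(c < d for d in candidates) and c not in maximal:
            maximal.append(c)
    return maximal

# Toy example: the intervals of A, B, C pairwise overlap, but only A-B and B-C
# are edges of the read alignment graph; the maximal cliques are {A,B} and {B,C}.
intervals = {"A": (1, 10), "B": (2, 12), "C": (3, 14)}
neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(enumerate_max_cliques(intervals, neighbors))
```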

2.2.2 Run-time analysis

Let k be an upper bound on the local alignment coverage, c be the maximum number of active cliques and s be the size of the output. The detailed run-time analysis in Section A of the Supplementary Material gives a total running time bounded in terms of k, c and s. Despite these rather moderate worst-case guarantees, our algorithm is very fast in practice; see the Supplementary Material, Section A, for an analysis of the corresponding reasons.

2.2.3 P-values for cliques

We proceed as outlined in Section 1.1.2. Let $\mathcal{C}$ be a max-clique in the read alignment graph and let $w(\mathcal{C})$ be the weight of the clique. Let $\bar I(\mathcal{C})$ be the weighted mean of the alignment interval lengths of the clique. Let $\Phi$ be the standard normal distribution function. Let $n(\mathcal{C})$ be the number of alignments at the genomic location of the clique; for example, in Figure 1, $n(\mathcal{C})$ is just the number of alignments that overlap with one another at this position of the reference. We compute the two P-values $p_{\mathrm{del}}(\mathcal{C})$ (Equation (8)) and $p_{\mathrm{ins}}(\mathcal{C})$ (Equation (9)) just as in Equations (3) and (2), with the difference that we distinguish between cliques giving rise to deletions and cliques giving rise to insertions. $2^{n(\mathcal{C})}$ is the number of subsets of alignments one can test at this location, that is, the virtual number of tests which we perform, so multiplying by $2^{n(\mathcal{C})}$ is a Bonferroni-like correction. This correction accounts for coverage fluctuations.

If $p_{\mathrm{del}}(\mathcal{C})$ is significantly small, then $\bar I(\mathcal{C})$ is significantly large; hence, the alignments in $\mathcal{C}$ are deemed to commonly support a deletion. Analogously, if $p_{\mathrm{ins}}(\mathcal{C})$ is significantly small, then $\mathcal{C}$ is supposed to support an insertion. Refer to Supplementary Material, Section B, for details on how the exponential sums in Equations (8) and (9) can be computed efficiently.
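To convey the flavor of these tests, here is a brute-force Python sketch that mixes one-sided Z-tests over all subsets of correct alignments and applies a $2^n$ Bonferroni-like factor; the precise form of Equations (8) and (9) (weighting and normalization) is given in the Supplementary Material and may differ from this simplified sketch.

```python
import math
from itertools import combinations
from statistics import NormalDist

def clique_pvalues(interval_lengths, correct_probs, n_local, mu, sigma):
    """Brute-force deletion/insertion P-values for one clique (exponential in size).

    interval_lengths, correct_probs: I(A) and correctness probability per alignment;
    n_local: number of alignments at the clique's genomic location.
    """
    m = len(interval_lengths)
    phi = NormalDist().cdf
    p_del = p_ins = 0.0
    for size in range(m + 1):
        for subset in combinations(range(m), size):
            p_es = math.prod(correct_probs[i] if i in subset else 1 - correct_probs[i]
                             for i in range(m))
            if size == 0:
                p_del += p_es                    # no correct alignment: no evidence
                p_ins += p_es
                continue
            mean_i = sum(interval_lengths[i] for i in subset) / size
            z = math.sqrt(size) * (mean_i - mu) / sigma
            p_del += p_es * (1 - phi(z))         # one-sided: inserts too long
            p_ins += p_es * phi(z)               # one-sided: inserts too short
    bonferroni = 2 ** n_local                    # virtual number of local tests
    return min(1.0, bonferroni * p_del), min(1.0, bonferroni * p_ins)

# Five alignments roughly 60 bp longer than mu (invented numbers)
lengths = [170, 168, 175, 172, 169]
probs = [0.99, 0.95, 0.9, 0.99, 0.97]
print(clique_pvalues(lengths, probs, n_local=8, mu=112, sigma=15))
```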

3 RESULTS AND DISCUSSION

3.1 Simulation: Craig Venter reads

We downloaded the comprehensive set of annotations of both homozygous and heterozygous structural variants (also including inversions and all other balanced rearrangements) for Craig Venter's genome, as documented by Levy et al. (2007), and introduced them into the reference genome, thereby generating two different alleles. Where nested effects led to ambiguous interpretations, we opted for an order that respects the overall predicted change in copy number. We used UCSC's SimSeq (https://github.com/jstjohn/SimSeq) as a read simulator to simulate Illumina paired-end reads with read end length 100 and with insert size mean and standard deviation chosen to reflect many biological datasets (see below); we recall that the insert size is the distance between the inner ends of the sequenced reads. See Section J in the Supplementary Material for performance rates for an alternative insert size distribution, which highlight the limitations of insert size based approaches. Coverage of 15× for each of the two alleles yields 30× sequence coverage overall.

3.2 Biological data: NA18507

We were further provided by Illumina with reads from the genome of an individual from the Yoruba in Ibadan, Nigeria. Reads were sequenced on a GAIIx and are now publicly available (ftp://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf/). Read ends are of length 101; read coverage is 30×, and insert size mean and standard deviation are as given in the following paragraph. For benchmarking purposes, we used annotations from Mills et al. (2011) merged with NA18507 'DIP' annotations from the HGSV Project database (http://hgsv.washington.edu/general/download/SNPs_DIPs), lifted to hg18.

3.3 Reference genome and alignments

As a reference genome, we used version hg18, as downloaded from the UCSC Genome Browser. All reads considered were aligned using BWA (Li and Durbin, 2009) with the option to allow 25 alignments per read end, which amounts to a maximum of $25^2 = 625$ alignment pairs per paired-end read. BWA estimated the mean insert size and standard deviation for both the simulated and the NA18507 reads. Note that we are aware that realignment of discordant reads with a more precise (but time-consuming!) alignment tool, such as Novoalign (http://www.novocraft.com/main/index.php), as suggested by Quinlan et al. (2010), can subsequently resolve much misaligned sequence and hence lead to improved results for all tools considered.

3.4 Experiments

For benchmarking, we considered five different state-of-the-art insert size based approaches, four of which are applicable to a whole-genome study: GASV (Sindi et al., 2009), VH (Hormozdiari et al., 2009, v3.0), Breakdancer (Chen et al., 2009) and HYDRA (Quinlan et al., 2010). We ran MoDIL (Lee et al., 2009) only on Chromosome 1 of the simulated data, which, on our machines, required several hundred CPU hours; in contrast, we process Chromosome 1 in less than 1 h. We also consider the split-read aligners PINDEL (Ye et al., 2009) and SV-seq2 (Zhang et al., 2012). Details on program versions and on how we ran each method are given in Supplementary Material, Section C. In case of deletions, we define a hit as a pair of a true deletion and a predicted deletion that overlap and whose lengths do not differ by more than 100 bp, which is roughly the mean internal segment size. We say that a true insertion $(B, L)$ and a predicted insertion $(B', L')$, where B stands for breakpoint and L for length, hit each other if the intervals $[B, B + L]$ and $[B', B' + L']$ overlap. This 'overlap criterion' precisely parallels the one for deletions: if one views deletions in the reference as insertions in the donor, then the deletions in the reference (relative to reference coordinates) hit if and only if the insertions in the donor hit (relative to donor coordinates). Again, we also require that the lengths do not differ by more than 100 bp. We also offer results on alternative hit criteria, which, instead of overlap, depend on fixed thresholds on breakpoint distance and differences of indel length, in Supplementary Material, Section F. As usual, Recall = TP/(TP + FN), where TP (true positives) is the number of true deletions being hit and FN (false negatives) is the number of true deletions not being hit. For Precision = TP/(TP + FP), TP is the number of predicted indels being hit and FP is the number of predicted indels not being hit. We relate recall and precision to one another and also display the F-measure, F = 2 · Recall · Precision/(Recall + Precision), as a common overall statistic for performance evaluation. We refer to Exc. (exclusive) as the percentage of true annotations that were exclusively (and correctly) predicted by the method in question. Because the annotations for the biological dataset are obviously still far from complete, a false positive may in fact be due to a missing annotation; we therefore call the ratio TP/(TP + FP) relative precision (RPr.) there. For recall on the biological data, note that a good amount of the existing annotations may be of limited reliability. Therefore, the F-measure is meaningless for these data and we refrain from displaying it. Last but not least, we present the average deviation of breakpoint placement and the differences in length for all tools in the Supplementary Material, Section G. In Supplementary Material, Section H, we present CLEVER's results on simulated data when including true alignments in the BAM files, or even when using only true alignments, so as to analyze its behavior when external sources of error are removed.
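A small Python sketch of the evaluation logic as described above (overlap plus a maximum length difference of 100 bp); the function and variable names are ours and the toy calls are invented.

```python
def deletions_hit(true_del, pred_del, max_len_diff=100):
    """true_del, pred_del: (breakpoint, length) pairs in reference coordinates."""
    (b1, l1), (b2, l2) = true_del, pred_del
    overlap = min(b1 + l1, b2 + l2) - max(b1, b2) > 0
    return overlap and abs(l1 - l2) <= max_len_diff

def precision_recall_f(true_calls, predicted_calls, hit=deletions_hit):
    """Recall over true calls, precision over predictions, and the F-measure."""
    tp_true = sum(any(hit(t, p) for p in predicted_calls) for t in true_calls)
    tp_pred = sum(any(hit(t, p) for t in true_calls) for p in predicted_calls)
    recall = tp_true / len(true_calls) if true_calls else 0.0
    precision = tp_pred / len(predicted_calls) if predicted_calls else 0.0
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f

# Toy example: two true deletions, two predictions, one of which hits.
print(precision_recall_f([(1000, 40), (5000, 300)], [(1010, 35), (9000, 60)]))
```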

3.5 Results

See Table 1 for performance figures. See also Section E in the Supplementary Material for a further subdivision of the 100–50 000 bp range. Boldface numbers designate the best approach, and italic numbers the best insert size based approach (if not the best approach overall). Comparing the absolute numbers of true indels in the biological data with those in the simulated data immediately suggests that the vast majority of annotations is still missing. Therefore, all results on the biological data, in particular those on precision, can only reflect certain trends. For the simulated data, all values reflect the ground truth. As expected, performance rates greatly depend on the size of the indels. For the prediction of indels shorter than 20 bp, split-read based approaches and/or read alignment tools themselves are the option of choice.

Table 1.

Benchmarking results for simulated (Venter) and biological data (NA18507)

Dataset groups: Venter insertions | Venter deletions | NA18507 insertions | NA18507 deletions
Columns per group: Prec., Rec., Exc., F (Venter insertions and deletions); RPr., Rec., Exc. (NA18507 insertions and deletions)
Length range 20–49: Venter (8786 true ins., 8502 true del.); NA18507 (2295 true ins., 2192 true del.)
    CLEVER62.553.020.457.460.466.815.963.47.724.18.48.944.76.6
    BreakDancer5.10.175.57.50.013.60.30.08.25.80.0
    GASVNANANANA5.425.81.88.9NANANA1.020.12.0
    HYDRA0.00.00.00.10.00.00.00.00.00.0
    VH32.48.40.213.466.38.00.314.30.83.80.44.64.60.3
    PINDELa66.144.918.753.549.555.812.152.513.140.025.39.364.926.3
    SV-seq2aNANANANA96.01.20.02.3NANANA15.21.60.2
Length range 50–99: Venter (2024 true ins., 1822 true del.); NA18507 (303 true ins., 294 true del.)
    CLEVER60.486.67.371.272.780.76.876.51.670.36.95.579.612.2
    BreakDancer86.556.50.268.387.348.10.362.06.415.50.09.844.20.7
    GASVNANANANA46.135.01.539.8NANANA2.334.71.0
    HYDRA0.00.00.05.20.00.00.00.02.40.0
    VH55.876.61.464.566.565.81.566.11.462.72.34.357.11.4
    PINDELa77.520.50.332.572.537.50.449.410.829.71.38.343.90.3
    SV-seq2aNANANANA83.619.80.232.0NANANA9.928.60.3
Length range 100–50 000: Venter (3101 true ins., 2996 true del.); NA18507 (165 true ins., 414 true del.)
    CLEVER66.223.82.035.187.669.94.177.70.531.51.84.870.32.7
    BreakDancer61.017.63.027.465.857.70.061.50.923.01.85.262.10.5
    GASVNANANANA0.949.21.01.7NANANA0.157.72.4
    HYDRA0.00.00.072.856.80.463.80.00.00.02.065.50.5
    VH60.425.53.535.858.865.11.561.81.844.910.93.070.01.4
    PINDELa1.90.084.739.50.153.90.60.05.951.90.2
    SV-seq2aNANANANA81.637.50.351.3NANANA3.934.50.0

Performance rates as recall, precision, exclusive predictions (Exc.; true predictions uniquely made by that tool) and F-measure are grouped by different indel size ranges. A dash stands for 'no prediction' and NA for 'not applicable'. Insertions significantly exceeding the internal segment size cannot be detected by insert size based approaches; PINDEL does not detect such insertions either. a: split-read approach.


20–49 bp: CLEVER outperforms all other approaches on the simulated data and is the best insert size based approach also on the biological data; PINDEL achieves the best rates on the biological data. Moreover, CLEVER makes a substantial number of exclusive calls in all categories. The tables in the Supplementary Material, subsection F.2, point out that 80–90% of CLEVER's indel calls come significantly close to a real indel. Further analyses (Supplementary Material, Section H) demonstrate that 30% of CLEVER's false positives are due to misalignments and mapping ambiguities (see External sources of errors below); obviously, many of those extremely close but not truly hitting calls are due to such external errors. Breakdancer makes few, highly precise calls at the expense of reduced accuracy in terms of indel breakpoint placement and length (see Supplementary Material, Section G).

50–99 bp: Here, CLEVER achieves substantially better recall and more exclusive calls than PINDEL on the biological data. On the simulated data, CLEVER again achieves best overall performance. In contrast to 20–49 bp, however, Breakdancer and VH already make significant contributions. Although VH achieves good overall performance, Breakdancer mostly excels in precision. As before, when allowing a certain offset of breakpoints (Supplementary Material, subsection F.2) or when integrating correct alignments (Supplementary Material, Section H), CLEVER’s precision substantially rises from 60–72% to 72–96% across the categories.

100–50 000 bp: CLEVER is again best, while other tools (Breakdancer, HYDRA, VH) also make decisive contributions. This documents that the current challenges for indel discovery lie rather in the size range of 20–100 bp. Note that none of the tools makes predictions for insertions longer than 250 bp; see Section E in the Supplementary Material.

MoDIL: We compared MoDIL with all other tools on Chromosome 1 alone because of the excessive run-time requirements of MoDIL (CLEVER is faster by a factor of ~1000); see Supplementary Material, Section I. Overall, MoDIL incurs certain losses in performance with respect to CLEVER across all categories, but outperforms the other insert size based approaches except on larger indels (≥100 bp). It is noteworthy that MoDIL makes a substantial number of exclusive calls for insertions of 50–99 bp.

Accuracy of breakpoint and length predictions: See Supplementary Material, Section G, for related numbers. The split-read based approaches outperform the insert size based approaches. Among the latter, CLEVER and GASV are most precise for 20–49 and 100–50 000 bp; for 50–99 bp calls, Breakdancer achieves favorable values.

External sources of errors: See Supplementary Material, Section H, for related results and a detailed discussion of the degree to which misalignments and multiply mapped reads/alignments hamper computational SV discovery.

Conclusion: We have presented a novel internal segment size based approach for discovering indel variation from paired-end read data. In contrast to all previous whole-genome-applicable approaches, our tool takes all concordant read data into account. We outperform all prior insert size based approaches on indels of sizes 20–99 bp and also achieve favorable values for long indels. We outperform the split-read based approaches considered on medium-sized (50–99 bp) and larger (≥100 bp) indels. In addition, our approach detects a substantial number of variants missed by all other approaches, in particular in the smallest size range considered (20–49 bp). In conclusion, CLEVER makes substantial contributions to SV discovery, in particular in the size range of 20–99 bp.

Our approach builds on two key elements: first, an algorithm that quickly enumerates maximal, statistically contradiction-free ensembles of alignments as max-cliques in read alignment graphs and, second, a sound statistical procedure that reliably calls max-cliques that indicate variants. Our approach is generic with respect to the choice of variants; max-cliques in read alignment graphs can also reflect other variants such as inversions or translocations. For future work, we plan to predict inversions and to incorporate split-read information into a unifying approach.

Conflict of Interest: none declared.

REFERENCES

Abyzov,A. et al. (2011) CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res., 21, 974–984.

Albers,C.A. et al. (2011) Dindel: accurate indel calls from short-read data. Genome Res., 21, 961–973.

Alkan,C. et al. (2009) Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet., 41, 1061–1067.

Alkan,C. et al. (2011) Genome structural variation discovery and genotyping. Nat. Rev. Genet., 12, 363–376.

Bentley,D.R. et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.

Bravo,H.C. and Irizarry,R.A. (2010) Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics, 66, 665–674.

Campbell,P.J. et al. (2008) Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat. Genet., 40, 722–729.

Chen,K. et al. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods, 6, 677–681.

Chiang,D.Y. et al. (2009) High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods, 6, 99–103.

Eid,J. et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science, 323, 133–138.

Hach,F. et al. (2010) mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat. Methods, 7, 576–577.

Hormozdiari,F. et al. (2009) Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res., 19, 1270–1278.

Itsara,A. et al. (2009) Population analysis of large copy number variants and hotspots of human genetic disease. Am. J. Hum. Genet., 84, 148–161.

Korbel,J.O. et al. (2009) PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol., 10, R23.

Langmead,B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25.

Lee,S. et al. (2009) MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat. Methods, 6, 473–474.

Levy,S. et al. (2007) The diploid genome sequence of an individual human. PLoS Biol., 5, e254.

Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.

Li,H. et al. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18, 1851–1858.

Medvedev,P. et al. (2009) Computational methods for discovering structural variation with next-generation sequencing. Nat. Methods, 6 (11 Suppl.), S13–S20.

Mills,R. et al. (2011) Natural genetic variation caused by small insertions and deletions in the human genome. Genome Res., 21, 830–839.

Mills,R.E. et al. (2006) An initial map of insertion and deletion (indel) variation in the human genome. Genome Res., 16, 1182–1190.

Quinlan,A.R. et al. (2010) Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res., 20, 623–635.

Sindi,S. et al. (2009) A geometric approach for classification and comparison of structural variants. Bioinformatics, 25, i222–i230.

Sudmant,P.H. et al. (2010) Diversity of human copy number variation and multicopy genes. Science, 330, 641–646.

The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.

The International HapMap Consortium (2005) A haplotype map of the human genome. Nature, 437, 1299–1320.

Wasserman,L. (2004) All of Statistics. Springer, New York.

Ye,K. et al. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25, 2865–2871.

Yoon,S. et al. (2009) Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res., 19, 1586–1592.

Zhang,J. et al. (2012) An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data. BMC Bioinformatics, 13, S6.

Author notes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Part of the work was done while the author was at Centrum Wiskunde & Informatica.

Associate Editor: Michael Brudno

Supplementary data