Abstract
Motivation: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size and construction/updating of the interval database.
Results: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested, this query algorithm is 5–500-fold faster than other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are ∼100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications.
Availability: NCList data structure is part of Pygr, a bioinformatics graph database library, available at http://sourceforge.net/projects/pygr
Contact: leec@chem.ucla.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
The continuing exponential growth of biological sequence databases is both a remarkable resource and a difficult challenge for bioinformatics. The value of such databases is limited by our ability to query them in powerful and efficient ways. One pattern that is almost universal in sequence database query is searching for the alignment relationships of a specific region of one or more sequences. Treating each sequence as a 1D coordinate system, this is equivalent to (1) saving interval maps that represent pairwise relations between sequence intervals; and (2) interval overlap query, searching for the set of interval mappings that overlap a given query interval (Birney et al., 2006; Giardine et al., 2003; Kent et al., 2002; see also http://www.gmod.org).
For example, the bioinformatics data model of the NCBI (Ostell et al., 2001) represents annotations and biological features as 1D intervals. Thus, finding biological annotations for a given sequence region requires searching the database for annotated intervals that overlap the query sequence interval. Typically, an annotation database will store various types of information in the form of intervals. For illustration consider the following example. For a region of a chromosome, the database may contain information about all the genes that it contains. Then we may add for each gene information about its structure (i.e. exons, introns, binding motifs, etc.), various polymorphism information like SNPs, and transcripts produced by alternative splicing. All the biological entities above are intervals.
Similarly, a comparative genomics query requires searching an alignment database for intervals in other genomes that align to intervals overlapping the original query region. All of these classes of queries are important not only in their own right (e.g. comparative genomics studies), but also are used by biological researchers to produce better experimental designs in the areas of expression analysis, proteomic analysis and others. Thus, interval overlap query has proved to be a fundamental and necessary tool for querying sequence annotation data, sequence alignment databases and integrating diverse biological information using genomes as the ‘glue’. In our experience, interval overlap query is often the rate-limiting step for all of these queries. Thus, a large improvement in interval overlap query speed and scalability can be an enabling technology for accelerating the rate of discovery from genomics data-mining.
Several approaches have been applied successfully to this problem, since it was first identified (Edelsbrunner, 1980). However, only a few alternative methods are in common use in bioinformatics, although much progress has been made in the related field of spatio-temporal indexing (see Enderle et al., 2004; Kriegel et al., 2000; Preparata and Shamos, 1993; for reviews). Conventional multi-column BTree indexing (such as in a relational database like MySQL) can be used for smaller databases, but provides limited scalability (see below). One solution to this scalability problem is binning, subdividing the database into bins of a given size (possibly hierarchically), which has been used successfully in the UCSC Genome Browser (Kent et al., 2002). R-tree (Guttman, 1984) indexing has also been used, by Gmod (http://www.gmod.org).
Most methods from the spatio-temporal database realm, excluding the original R-tree data structure, have not been attempted yet in the bioinformatics community, mostly because of their complexity. An interested reader may explore Segment R-tree (Kolovson and Stonebraker, 1991), Interval Trees (for example see Cormen et al., 2001), MV3R-Tree (Tao and Papadias, 2001), Overlapping B+-trees (Tzouramanis et al., 1999), or other variations on the general R-tree. A recent book (Manolopoulos et al., 2005) gives detailed review of spatio-temporal indexing. However, no comprehensive comparisons or implementations of them for the bioinformatics applications are available. In this study, we compare our results with previously published genomics interval query methods (Birney et al., 2006; Giardine et al., 2003; Kent et al., 2002; and http://www.gmod.org). Table 1 lists indexing methods and implementations (Postgres, MySQL) that we have tested in this study. Based on Gmod's previous work, we tested spatial indexes using R-tree in Postgres. Other databases (e.g. Oracle, IBM DB2) also implement R-tree spatial indexes, but we did not test them here.
Methods and implementations used in this study
| Method | Index type | Implementation tested |
| R-tree | RTree (start,stop) | Postgres v8.0.3 |
| Multi-column | BTree (start,stop) | MySQL v4.1.10 |
| Binning | BTree (bin,start,stop) | MySQL v4.1.10 |
| Method | Index type | Implementation tested |
| R-tree | RTree (start,stop) | Postgres v8.0.3 |
| Multi-column | BTree (start,stop) | MySQL v4.1.10 |
| Binning | BTree (bin,start,stop) | MySQL v4.1.10 |
Interval overlap query remains an active problem for research, for several reasons. Technical goals for optimal query methods include: (1) Database scalability: for database size (number of intervals) N, the time and memory complexity for overlap query should ideally be O(log N) for both the query time and required RAM, and O(N) disk space for storage of the database. (2) Query scalability: for a query that yields n interval overlap hits, the time and memory complexity should ideally be O(n) for the query time and O(n) or just a constant for the required RAM. (3) Construction scalability: the time complexity for constructing the indexed database should ideally be identical to the computational complexity for constructing a standard 1D index, O(Nlog N), and the memory complexity should be O(N) or better. (4) Updating scalability: the time and memory complexity for adding or altering M entries should ideally be O(Mlog M) or better, enabling rapid dynamic updating of the database. (5) Practicality: the query method should be usable on a typical bioinformatics platform, e.g. 512 MB–1 GB system memory, and able to perform a typical genome-wide analysis (i.e. thousands of individual queries) in seconds to hours.
To understand these technical challenges better, it is instructive to consider why standard multi-column BTree indexing (e.g. in MySQL) fails to provide scalable interval overlap query. Intervals are represented as (start, stop) pairs; start and stop can be indexed separately or together. For the purpose of this example, we will assume the intervals are sorted in ascending start order. For a given query (q.start,q.stop), finding a first overlapping interval (e.g. finding the largest x.start such that x.start < q.stop and checking for intersection) takes just O(log N). However, finding all overlapping intervals poses a problem, since they are not guaranteed to be contiguous on any of the orderings (start), (stop) or (start, stop); for example, another overlapping interval might be present at the very beginning of the start index (see Fig. 1). In practice, this means that scanning for additional hits cannot halt at the first non-overlapping interval, but must continue all the way to the beginning of the BTree index, on average scanning half of the database. This yields a query time complexity of O(log N + N), which is no better than if no index were used.
Storage and querying for interval overlap. In (a), we demonstrate that using conventional sorting of interval database by start, stop coordinates the interval overlap query cannot halt at first non-qualifying interval (see text), but has to continue until the end/start of the database, scanning on average half of the stored intervals. In NCList (b), however, the result set intervals (in black) are located back to back in each of the sublists (each sublist is shown in a separate box) and sublists potentially containing more results are linked to each other by containment links. Therefore, the scanning problem is eliminated in this case, resulting in lower query complexity.
Storage and querying for interval overlap. In (a), we demonstrate that using conventional sorting of interval database by start, stop coordinates the interval overlap query cannot halt at first non-qualifying interval (see text), but has to continue until the end/start of the database, scanning on average half of the stored intervals. In NCList (b), however, the result set intervals (in black) are located back to back in each of the sublists (each sublist is shown in a separate box) and sublists potentially containing more results are linked to each other by containment links. Therefore, the scanning problem is eliminated in this case, resulting in lower query complexity.
In this article, we present a general storage and query method and data structure, which we call Nested Containment List (NCList), that facilitates fast overlap queries of the 1D intervals. We will present all the essential algorithms in the Methods section of the article and present synthetic and real data comparisons with other interval database efforts (Table 1) that employ some sort of spatial R-tree (http://www.gmod.org) or binning indexing (Birney et al., 2006; Giardine et al., 2003; Kent et al., 2002) in the Results section. These results show that NCList offers large improvements in performance over existing algorithms, both in terms of query speed and construction speed, and on a standard PC can perform thousands of genome alignment queries per second even on very large multi-genome alignment databases.
2 METHODS
2.1 Motivations for the Nested Containment List
Interval query can be slow because the overlapping intervals for any given query may not be contiguous in standard indexing. Therefore, the database query cannot stop at the first non-overlapping interval, but must scan the rest of the database. We can easily solve this problem by realizing that it is caused solely by the intervals that are contained within other intervals, i.e. x.start < y.start < y.stop < x.stop. (see Fig. 2). If a sorted list of intervals has both start and stop coordinates in ascending order, then the overlapping intervals for any query are guaranteed to be contiguous in the list. However, consider a list sorted on start coordinate. If any interval x contains another interval y, then their stop coordinates cannot be in ascending order (x.start < y.start but x.stop > y.stop, Fig. 2). Removing contained intervals y from the list guarantees that overlapping intervals for any query will be contiguous in the list. Formally,
Any contained interval breaks sortedness. If intervals (boxes) are sorted on their start coordinate, then any interval y (filled box) that breaks sortedness of the stop coordinate is properly contained in another interval x.
Any contained interval breaks sortedness. If intervals (boxes) are sorted on their start coordinate, then any interval y (filled box) that breaks sortedness of the stop coordinate is properly contained in another interval x.
Proposition 1.
If a list of intervals L is ordered on start and its intervals have no containment relation, then it is also sorted on stop.
Suppose x is before y in L and x.stop > y.stop, then x.start ≤ y.start < y.stop < x.stop , and x ⊃ y (Fig. 2), contrary to hypothesis. This motivates the main idea behind NCList: to partition the list of intervals, such that no interval is contained in another interval within each partition. The practical value of this observation is that if we subdivide the database by moving all intervals that are contained in a given interval into a separate sublist, the time complexity of overlap query is reduced to O(log N + n), since now we can guarantee that no more results are present once we encounter the first non-qualifying interval. Even though the fraction of nested intervals may be low, depending on the specific application (e.g. 2.3% for the human chromosome 1 of the UCSC data tested in Section 3.3), this guarantee is necessary to ensure correct query results.
In Figure 1, we illustrate the difference between a sorted list database and a NCList data structure. Figure 1a shows a list of intervals that are stored in a linear database sorted on the start coordinate of the interval. When such a structure is queried with an overlapping interval, the result set intervals (shown in black) can be spread all over the list, in fact, to find all of them we need to scan on average the half of the list. In NCList (Fig. 1b), however, the subset of intervals that overlap the query are guaranteed to be contiguous in any given sublist. Thus, we only need to search the sublists of these overlapping intervals recursively.
2.2 NCList data structure
NCList representation keeps a separate ‘sublist headers array’ H that records for each sublist Sx its length (number of elements) and the index in L of its first element. By convention H[0] is the record for the top level sublist S0. We augment each interval with a sublist attribute that gives the index of the corresponding record in H or -1 if the sublist of that interval is empty. Thus our representation of the nested list consists of two arrays (L, H) defined above. A schematic representation can be found in Figure 3. For further details on NCList construction see Section 2.7.
Schematic representation of NCList data structure. Here in the interval array L, element L[i] has non-empty sublist, which records is indexed in sublist header array H by L[i].SUBLIST. This sublist starts at START and has length LENGTH. L[i] itself belongs to the top level list S0, which has sublist header H[0].
Schematic representation of NCList data structure. Here in the interval array L, element L[i] has non-empty sublist, which records is indexed in sublist header array H by L[i].SUBLIST. This sublist starts at START and has length LENGTH. L[i] itself belongs to the top level list S0, which has sublist header H[0].
2.3 Overlap query algorithm
NCList overlap query is reduced to a simple recursive algorithm (see box Overlap(L, H, sublist, start, stop)) of (1) binary search within a given sublist; (2) linear scan returning contiguous overlapping intervals in that sublist; (3) for each overlapping interval that has a sublist, recurse to perform the same search on its sublist. The main advantage of NCList as a query system is that all of the sublists are sorted on both start and stop positions (Fig. 1b). Linear scan of the database in this case is bounded by the size of the output set and can be stopped when the first non-overlapping interval is encountered.
If N is the size of the database, α is the average number of intervals contained in any top-level interval, and n is the size of the output set for a given query, the time complexity of this algorithm scales as O(log (N/ α) + n), which is sub-linear in the size of the database and linear in the size of the output set.
| Algorithm 1 Overlap(L, H, sublist, start, stop) |
| R ← ∅ |
| subend ← H [sublist].start+H [sublist].length |
| i ← BinarySearchEnd(L, H [sublist], start) |
| whilei < subend and L[i].start < stopdo |
| R← R ∪ {L[i]} |
| R← R ∪ Overlap(L, H, L[i].sublist, start, stop) |
| i← i+1 |
| end while |
| Return R |
| Algorithm 1 Overlap(L, H, sublist, start, stop) |
| R ← ∅ |
| subend ← H [sublist].start+H [sublist].length |
| i ← BinarySearchEnd(L, H [sublist], start) |
| whilei < subend and L[i].start < stopdo |
| R← R ∪ {L[i]} |
| R← R ∪ Overlap(L, H, L[i].sublist, start, stop) |
| i← i+1 |
| end while |
| Return R |
2.4 Disk-based storage and query
NCList can easily be adapted to storage and query in disk-based files. The indexed database L is stored as a single disk file. In addition, a ‘block index’ B is created by grouping L in blocks of b intervals, where b is a user-specified parameter. Each block in L is represented by a single interval in B representing the cover of the intervals in that block. Sublist S0 and all sublists Sx of size greater than or equal to b are indexed in B. The block index for S0 is kept in RAM, but the block index for any other sublist may be loaded or swapped out according to need.
To perform a query, binary search is applied to the block index for S0 to find the first overlapping block, which is loaded into RAM using the POSIX system call fseeko(). The standard query algorithm is then applied to this block (with scanning into subsequent blocks if needed); recursion to other sublists triggers loading the block indices of those sublists, which are then processed recursively by the same procedure.
Since each interval mapping requires 24 bytes in L, but only 8 bytes in B, even if block indices for all sublists were loaded simultaneously, this would use a factor of 3b times less memory than the in-memory version of NCList. Using a value of b = 350, which corresponds to a block size of about 8 kb (a typical block size for disk I/O), yields more than a 1000-fold reduction in memory use.
2.5 Implementation
We have implemented NCList interval query in several different versions, including the disk-based storage and query method described above (used to generate the results presented in this article), and an in-memory version (nested list stored and queried in RAM; this version is not discussed further in this manuscript). Our implementation was written in C and can be used either as a C library or as a Python extension module. All of the performance tests in this article used the Python extension interface, whose much higher level of overhead may make it more directly comparable with the other databases tested (e.g. MySQL). However, it should be noted that for those willing to work directly with the NCList C library (intervaldb), performance is substantially higher than the results presented in this article. For example, for random queries on human chromosome 1 in the UCSC eight genome alignment, the average query time was 163 µs using the Python interface, but only 74 µs using the C library interface.
Our implementation of the NCList data structure in C and its Python interface can be downloaded from http://sourceforge.net/projects/pygr.
2.6 Performance comparisons with other methods
In order to evaluate performance of the NCList method, we measured the query execution times of several different methods (see Table 1): (1) NCList; (2) R-tree indexing; (3) MySQL multi-column indexing; (4) MySQL binning with 1-kb bin size, similar to the indexing used in the genome browser (Kent et al., 2002). Here, 1-k binning was performed in MySQL by assigning each interval a bin number for each 1-kb region of the genome it overlaps with. We tested all four methods on the same datasets, including both simulated test data (see below), and real-world genome alignment data. All tests (including the example application 3.4) were run on identical hardware (XServe G5 2 GHz and 2.5 GB of RAM). For the R-tree test, we used the Gmod (http://www.gmod.org) implementation of R-tree indexing on Postgres DBMS version 8.0.3. For the MySQL and MySQL binning tests, we used MySQL server version 4.1.10.
For the first set of tests, we generated simulated interval databases by randomly placing equal numbers of intervals of length 1, 10, 100, 1000, 10 000 base pairs uniformly over the whole genome. These interval sizes span the range of typical feature sizes (e.g. SNP, motif, exon, gene) in genomic analysis. We examined the effect of several parameters on query performance: N, the total database size (number of intervals); w, the width of the query interval; L, the total sequence length in the database; and n, the size of the query result (i.e. number of intervals overlapping the query interval). On average, n = Nw/L. For the initial test, we used a genome length L = 100.1 Mb. Other factors can be identified for analysis of query performance, but most of them are correlated with one or more of the above quantities (database size, result set size).
To test these methods on real genome alignment data, we used the UC Santa Cruz 17-mammalian genome sequence alignment of Golden Path assemblies (Blanchette et al., 2004). We constructed a database containing all intervals on the largest chromosome (Human chromosome 1) that are aligned to an interval in another genome.
2.7 Construction of NCList
This section can be skipped by most readers, as it is not necessary for understanding the basics of how to use and query NCList interval databases, but illustrates some useful technical properties of the NCList.
We have tested two different construction algorithms: a simple algorithm that involves temporarily copying sublists during construction; and an in-place construction algorithm that does not require copying sublists. This algorithm is more complex, but this algorithm illustrates some useful properties of the NCList data structure, which we give next along with the basic motivation for the algorithm. For a detailed description of the algorithm including the pseudo-code see Supplementary Data for this article.
We begin with a few formal definitions.
Interval containment: Throughout this article, we will work with half-open intervals defined on discrete finite 1D lattice I as follows x = [x.start,x.stop] = {i ∈ I:x.start ≤ i < x.stop}. For two intervals x , y, proper containment x ⊃ y is defined in the usual point set theoretic way, i.e. x ⊃ y ⇔ x.start ≤ y.start < y.stop < x.stop.
Ordering: For a list L of distinct intervals, we define an ordering function OL:L↔ [0,|L|), such that x.start < y.start or (x.start = y.start and x.stop > y.stop) ⇒ OL(x) < OL(y) for x,y ∈ L. OL orders intervals on their start and end, unless an interval is properly contained, in which case it follows the containing interval. All ties (for duplicate intervals) are resolved arbitrarily, such that OL imposes total ordering on L, i.e. ∀ x,y ∈ L , either OL(x) < OL(y) or OL(y) < OL(x).
Sublist: The sublist of an interval x is defined as
as the set of intervals contained in x and not contained in any other interval ‘after’ x in the ordering OL. This part of the definition is essential for guaranteeing that depth-first traversal of the NCList returns intervals in precisely the same order as in OL (see Theorem 4 below for proof). We define the top-level list S0 to be the set of intervals that are not contained in any other interval, then SL = {S0} ∪ {Sx}x ∈ L forms a partition of L. Thus the sublists form a tree: we will sometimes refer to x as the ‘parent’ of y for y ∈ Sx, and to y as a ‘child’ of x. L is stored such that all the elements of each sublist Sx are contiguous. OL is used initially to construct L, but in the final database the OL ordering is only preserved within a given sublist.The first step of the construction algorithm is to sort the array L using ordering OL. Then Lemma 2 enables us to count the number of sublists, that are present in list L.
The lemma observes that if there are any intervals that are contained in an interval x, the first such interval immediately follows x in ordering OL:
For a given intervalx ∈ L, ifSx ≠ ∅ , then min y ∈ Sx(OL(y)) = OL(x) + 1.
The proof by contradiction follows from the definition of OL. Consider z, such that OL(z) = OL(x) + 1, and suppose z ∉ Sx. Since Sx ≠ ∅, take y = argminy ∈ Sx(OL(y)). Since OL is integer valued, OL(x) < OL(z) < OL(y). Now the following inequalities must be true x.start ≤ z.start ≤ y.start < y.stop < x.stop ≤ z.stop, but this implies that z ⊃ y, therefore y ∉ Sx, which is a contradiction, therefore z ∈ Sx and z = y = argminy ∈ Sx(OL(y)).
Lemma 2 implies that we can find the number of sublists |H| by simply counting how many times L[i] ⊃ L[i + 1].
Lemma 2 is strengthened to produce another property, Lemma 3, that all intervals contained in interval x immediately follow x before any other intervals belonging to the same sublist as x.
Suppose T ∈ SL is one of the partitions (not necessarily contiguous in OL) and x, y ∈ T, then if OL(x) < OL(y), then ∀ z ∈ Sx, OL(z) < OL(y).
Consider to the contrary an interval z ∈ Sx , such that OL(y) < OL(z), then since y ⊃ z is excluded by our definition of Sx, we have x.start ≤ y.start ≤ z.start < y.stop ≤ z.stop < x.stop, which implies that y ∈ Sx, which contradicts y ∈ T, since T ∩ Sx = ∅ by construction of partition SL.
Together Lemmas 2 and 3 imply the following ‘parenthesis theorem’, which allows us to construct NCList from its traversal, OL. The depth-first traversal of the NCList proceeds by returning the first element x of the top-level list S0 and then returning elements of Sx recursively before the next element in S0.
Depth-first traversal of the NCList returns intervals ordered by OL.
Utilizing this property, we can enumerate all the contained sublists present and label all intervals with their sublist id. After doing so, it takes only an additional sorting to rearrange the intervals of L according to their sublist number. Then in a single sweep over the list we can establish containment links between H and the sublist of each interval. For details see the Supplementary Material.
3 RESULTS
3.1 Scalability of NCList query
We first tested the scalability of NCList as a function of increasing database size and of increasing query result size (number of intervals found by the query; see Methods section). Figure 4a plots the absolute query time in milliseconds as a function of result set size (n), with database size fixed at N = 1.6 million intervals, revealing a simple linear relationship. Fitting a linear model to the data produces a fit with R-squared statistic equal to 0.9285 which suggest a very good fit, with the t-test significance for both the intercept and slope of less than 2 × 10−16. This suggests that the running time of NCList scales linearly as a function of result set size.
Effects of result set size and database size on NCList performance. Result set size (a) Query time as a function of result set size. (b) Average query time as a function of database size. The line represents the least squares fit to the data.
Effects of result set size and database size on NCList performance. Result set size (a) Query time as a function of result set size. (b) Average query time as a function of database size. The line represents the least squares fit to the data.
Next we varied the database size (N) while holding constant the result set size (n) in the range between 800 and 1200 in order to observe the effect of database size. The plot of the running time as a function of database size (log. scale) on Figure 4b suggests a log linear trend in the data. The P-values for t-test for the significance of the coefficients of the log-linear model are <2 × 10−16. This suggests that the running time scales as the logarithm of the database size. Moreover, the slope of this log-linear trend is low; a 100-fold increase in the database size resulted in only about a 30% increase in the NCList query time.
3.2 Comparison of overlap query performance
We next analyzed the relative performance of NCList versus other interval query indexing methods, including multi-column indexing, binning and R-tree indexing (see Methods section). For comprehensive testing, we generated a wide variety of simulated interval databases over a large range of database sizes, and tested with a large range of query interval widths. In all cases, NCList was over 5–500-fold faster than existing methods tested in this study. Figure 5 shows the relative speed ratios of the competing methods, measured by dividing the query time of each method by the query time for NCList on the identical dataset and query intervals. Figure 5a, c and e plots the speed ratio as a function of query width, with series (shown as different symbols) representing databases of varying sizes.
Speed ratio for NCList relative to other indexing methods. The speed ratio is computed by dividing the running time of (a), (b) Multi-indexing; (c), (d) 1 k binning; (e), (f) R-tree by the running time of NCList on identical databases and query intervals. Results presented in (a), (c), (e) as a function of query width for different database sizes (open circle: 50000, dot: 100000, filled circle: 200000, open square: 500000, open diamond: 1000000, open triangle: 2000000, open inverse triangle: 5000000); and in (b), (d), (f) as a function of database size for different query widths (open circle: 100, dot: 500, filled circle: 1000, open square: 5000, open diamond: 10000, open triangle: 50000, open inverse triangle: 100000).
Speed ratio for NCList relative to other indexing methods. The speed ratio is computed by dividing the running time of (a), (b) Multi-indexing; (c), (d) 1 k binning; (e), (f) R-tree by the running time of NCList on identical databases and query intervals. Results presented in (a), (c), (e) as a function of query width for different database sizes (open circle: 50000, dot: 100000, filled circle: 200000, open square: 500000, open diamond: 1000000, open triangle: 2000000, open inverse triangle: 5000000); and in (b), (d), (f) as a function of database size for different query widths (open circle: 100, dot: 500, filled circle: 1000, open square: 5000, open diamond: 10000, open triangle: 50000, open inverse triangle: 100000).
These results suggest several conclusions. MySQL multi-column indexing yielded the slowest performance, over 500-fold slower than NCList. In fact, its running time was so slow that we could not complete the test for large database sizes. Binning improved MySQL performance greatly for small query widths (e.g. it was less than 10-fold slower than NCList, for w = 100 nt). However, binning performance degraded severely with increasing query interval width, and was 500-fold slower than NCList for w = 100 kb. R-tree indexing was substantially better, and was 10–50-fold slower than NCList. For MySQL multi-indexing and R-tree approaches, varying the query width yielded approximately constant speed ratios, on average of 500 and 20, respectively (Fig. 5a and e). The benefit of NCList compared to the binning Figure 5c approach increased with a larger query width. In general, the test data show increasing speed advantages for NCList (relative to other methods) for increasing database size and query width (as indicated by the positive slopes of most of the trends in Fig. 5).
Comparing different database sizes, for binning (Fig. 5d) the speed ratio was approximately constant through all database sizes, although it has a significant variance, with benefit of switching to NCList ranging from 5 to 500-fold depending on the width of the typical query. The speed ratio for NCList versus R-tree (Fig. 5f) indexing increased approximately logarithmically with increasing database size, with typical benefit from 10 to > 50-fold.
3.3 UCSC multi-genome alignment database tests
For evaluation of utility of NCList in a real genomic sequence analysis application, we used the UC Santa Cruz eight mammalian genome sequence alignment of golden path assemblies (Blanchette et al., 2004). We constructed an interval database for all alignment intervals for Human chromosome 1 (see Methods section). The resulting database consisted of 3 793 079 interval records. The NCList database built on these data contained 3 706 377 top-level intervals, the rest were spread across 44 862 lower level sublists. The average sublist depth was 1; α = 1.02 due to low density of sublists for this data. Figure 6 shows the average running time of each of the methods over varying query widths. These data show an overwhelming speedup of at least 100-fold.
Human chromosome 1 alignment database query performance. Average query time for each method as a function of query width.
Human chromosome 1 alignment database query performance. Average query time for each method as a function of query width.
The running time for multi-column indexing did not vary with the query width, which is consistent with the fact that it has to do an O(N) scan of the database for each query, regardless of the size of the output set. Binning and R-tree approaches show similar running time and dynamics over this dataset. Their running time increased dramatically with increasing query size. For NCList, the increase in running time was more gradual than that of the previous two methods. Whereas it showed a 10-fold speed ratio at small query size (w = 100 nt), the speed ratio increased to 100-fold at larger query widths (w = 100 kb).
3.4 Example: NCList enables cross-genome analysis of alternative splicing
To demonstrate the value of NCList for enabling complex comparative genomics studies, we have applied it to the analysis of genome-wide comparisons of alternative splicing in 17 different animal genomes. The initial goals of this study included mapping gene structures from five source genomes (human, mouse, dog, cow and zebrafish) onto an alignment of 17 complete genomes (Blanchette et al., 2004), to identify orthologous exons across all these genomes for both alternatively spliced exons and constitutive exons in the source genomes. This analysis had to both identify the orthologous exons (if any), measure their sequence conservation in the target genomes, and also measure the conservation of the source exon’s splice sites. Roughly, the pipeline consisted of three major steps: (1) mapping orthologous genes, (2) mapping all homologous exons and (3) matching orthologous genes with identified homologous exons. The first two of these steps required querying the 17-way multiple genome alignment for a sub-alignment containing (1) a source gene (10–100 kb in length) or (2) a source exon (100–1000 nt). In addition, for each exon we queried for two of its flanking exons and the splice sites around it (five genome alignment queries per exon). In our setup, we stored aligned intervals of each genome in reference to a coordinate system spanning the whole length of the alignment. In order to get all aligned sequences, we needed to perform two queries: one to find the mapping to the coordinate system and another to extract the aligned intervals in the other genomes. Thus for each exon, we needed to perform a total of 10 queries, and 2 queries per gene. Tests with the MySQL database (multi-column BTree index) showed that it took more than a month to complete the analysis of a single genome (human exons only) using MySQL as the query engine and the same hardware (see Methods section for details). Thus the complete analysis of five source genomes using MySQL did not appear practical.
We therefore applied NCList analysis to this problem. This involved querying a NCList interval database of size 34 Gb and sequence databases of 36 Gb, considerably larger than in test examples in the chromosome 1 tests above. Based on the actual number of genes and exons we needed to process (see Table 2), the total number of queries was at least 2.7 million queries, with widths of up to 10–100 kb, including 200 000 gene interval queries. We were able to complete the entire analysis on a single CPU in 58 h using NCList as a querying engine. In practice, we normally use 3 or 4 CPUs for such a computation, enabling this large computation to be completed within a day. It should be emphasized that due to the high speed of the NCList query, it became a minor component of the calculation time, with other components such as sequence retrieval consuming the majority of the computation time. Basic profiling analysis showed that at least 90% of the processing time was spent in pre- and post-processing steps (reading input files, extracting sequences, analyzing sequence conservation, etc.) of the analysis. Thus the total amount of time spent doing NCList queries was only a small fraction of the total processing time. In contrast, using MySQL as the interval overlap query engine, it became the rate-limiting step of the analysis. This example illustrates how NCList can enable comparative genomics studies that might otherwise be impractical using conventional database query as represented by the methods tested in Table 1.
Summary number of queries per genome for alternative splicing analysis
| Organism | Number of genes | Number of exons | Total time (h) |
| Human | 16 759 | 75 568 | 10.27 |
| Mouse | 13 530 | 73 590 | 16.95 |
| Cow | 9057 | 40 656 | 17.98 |
| Dog | 5331 | 19 880 | 8.78 |
| Zebrafish | 10 181 | 48 710 | 14.33 |
| Total | 54 858 | 258 404 | 58.32 |
| Organism | Number of genes | Number of exons | Total time (h) |
| Human | 16 759 | 75 568 | 10.27 |
| Mouse | 13 530 | 73 590 | 16.95 |
| Cow | 9057 | 40 656 | 17.98 |
| Dog | 5331 | 19 880 | 8.78 |
| Zebrafish | 10 181 | 48 710 | 14.33 |
| Total | 54 858 | 258 404 | 58.32 |
3.5 Database construction time
We also evaluated the speed of database construction for the competing methods. Figure 7 shows average time taken to build the database of given size in seconds. NCList showed the best performance over the whole range of realistic database sizes attempted. NCList database construction was ∼100-fold faster than the two methods with reasonable query scalability (binning and R-tree indexing). All of the methods show the same approximately O(N log N) time complexity, which indicates that database construction problem is as hard as sorting.
Database construction time. Total construction time taken by each method as a function of increasing database size.
Database construction time. Total construction time taken by each method as a function of increasing database size.
4 DISCUSSION
We have designed and implemented the NCList data structure for secondary memory storage and query of interval databases with emphasis on biological sequence analysis applications. An interval database presents a flexible way of representing sequence feature annotation information on an abstract level, a necessary component of any sequence analysis project. Efficient overlap query is essential for making large-scale comparative genomics queries easy and practical. Our results suggest that NCList provides superior performance under realistic conditions versus several other methods that are commonly used in the bioinformatics community.
Since the update problem is not yet solved efficiently, NCList is most suited for static data applications such as genome browsers, or data mining of static genome alignment or annotation databases. This encapsulates a wide variety of possible applications. NCList query efficiency is superior to other commonly used methods in a wide range of settings, providing significant gains in speed. These advantages are reduced when the result set becomes a substantial fraction of the total database. When the results become too large, performance is limited simply by disk speed and the search algorithm cannot help much.
NCList provides a scalable way of storing large amounts of data in the secondary memory (disk), keeping only a small portion of the index in memory. In our tests, we were able to index Human chromosome 1 aligned intervals in a mere 32 KB of RAM without a significant performance loss. This is achieved in part because of proximity of the results in secondary memory of the NCList, which allows for efficient hardware level read caching, which does not require special software handling, over what is inherent in the NCList data structure.
Our implementation of NCList as a stand-alone application can be useful for computational biologists, but would be most useful if implemented in common DBMS such as MySQL or Postgres which will more easily integrate into most of the common computational pipelines. This problem is more of an engineering one and can probably be tackled by software developers in the community. In addition, one may also consider extending NCList or/and other interval query methods to perform join operations, which might have biological relevance in some applications. Currently, if the interval overlap query time is the most critical factor, then our implementation of the NCList should provide a convenient means for performing them at the cost of having to install it as a stand-alone library. Our implementation includes code for secondary memory NCList maintenance with constant primary memory requirements, which can be adjusted for a particular platform to efficiently utilize the architecture’s read–write caching.
ACKNOWLEDGEMENTS
We would like to thank Ruey-Lung Hsiao and Stott Parker for valuable discussion of interval database indexing methodologies. A.V.A. was supported by UCLA-IGERT bioinformatics traineeship (NSF DGE-998764) and by the NIH under Ruth L. Kirschstein National Research Service Award (GM008185) from the National Institute of General Medical Sciences. C.J.L. was supported by the National Institutes of Health Grant (U54-RR021813) and Department of Energy Grant (DE-FC02-02ER63421). Funding to pay the Open Access publication charges was provided by Department of Energy grant DE-FC02-02ER63421.
Conflict of Interest: none declared.



![Schematic representation of NCList data structure. Here in the interval array L, element L[i] has non-empty sublist, which records is indexed in sublist header array H by L[i].SUBLIST. This sublist starts at START and has length LENGTH. L[i] itself belongs to the top level list S0, which has sublist header H[0].](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/23/11/10.1093/bioinformatics/btl647/2/m_btl647f3.jpeg?Expires=1500730288&Signature=Um3RgRQbf-3aNU8ZglWm~~5CABw7fHrHVZB5C~rL4gQLx2K87STa5Ab96AyHzG~YQr7KOx5r65xwED6sD8DO2C3Quq3CazXBt5j9r0hcA9Thi5WCDDXx3oUyKxoIgkP9kg3y7OS2t6JRUyCWL1OLYUyjBD2FARC~VrosOe9Hb2Z~UrFvNX50eoENAnq8TYvwy8piZI9r~PftDnhSID8cpXuLS4GUczBj~ZS-uvFjd2jkNoRBjUTFHbYOdO42kr9cecKpTROtuwBCLiZ-GQ1-SINoLGIyAneMZEsrQkQOVKShP8NXYwLRSeWi1oCqOKdB6XCc0Eg6NgH2YYLmK-~icw__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q)





Comments