Recent releases of genome three-dimensional (3D) structures have the potential to transform our understanding of genomes. Nonetheless, the storage technology and visualization tools need to evolve to offer to the scientific community fast and convenient access to these data. We introduce simultaneously a database system to store and query 3D genomic data (
The release of the first draft of the human genome (1,2) announced the beginning of a data era in genomics. Gathering information does no longer appear as a major bottleneck in molecular biology studies. However, by contrast, storing and mining the massive amounts of data generated by genomics studies becomes increasingly difficult.
The development of efficient computing infrastructures to store and retrieve genomic information is thus an essential part of the discovery pipeline in genomic research. The UCSC genome browser (3,4) and the Ensembl genome browser (5,6) were among the first systems specifically developed to address this issue and opened the access of sequencing data to the whole scientific community.
Since then, the complexity and the nature of the data themselves has changed. In particular, the primary structure of genomes does no longer appear to contain all the information required to decipher our genetic code. Indeed, recent studies suggest that the three-dimensional (3D) structure of genomes is essential to understand some regulatory mechanisms (7).
3D structures and Hi-C data sets of Human (8,9) or Yeast genomes (10) are now available. Viewers have been developed to visualize complete genome structures (11) and Hi-C data annotations have been integrated in classical genome browsers (12). However, to date, there is no scalable solution to query simultaneously primary and tertiary genome structures. Moreover, unlike classical human genome browsers, these viewers are currently only available as Operating System (OS)-specific standalone applications that are not embedded into Web browsers.
In this paper, we aim to address these challenges and develop a complete solution for storing and analyzing 3D genomic data. More specifically, we develop a new database system for storing and querying 3D genomic data, and a lightweight 3D genome browser for real-time visualization and exploration of 3D genome structures. We emphasize that the development of efficient database architectures is key for the success of a novel generation of 3D genome browsers.
There has been substantial related work on storing and querying 3D data in the context of astronomy, remote-sensing, neurology or more broadly for the storage of spatio-temporal data. Existing approaches can be broadly categorized along three axes:
Index-based versus cluster-based. Most existing work builds a 3D index over the raw data, but does not attempt to physically reorganize the raw data (index-based) so that co-queried objects are stored near each other (cluster-based).
Adaptive versus non-adaptive. Adaptive systems adjust storage structures based on the query size and/or the data. They generally require less tuning and can typically handle a broader range of data sets than non-adaptive systems.
On-line versus off-line. On-line systems change their storage representation as data arrives, whereas off-line systems assume that the indexed data does not change frequently and must recompute their storage layout from scratch as data arrives.
The ‘classic’ database structure for indexing objects along multiple dimensions is the R-Tree (13). Unlike
There have been many optimizations to R-Trees for spatio-temporal data, including TB-Trees (14) and SEB-Trees (15). TB-Trees are optimized R-Tree indices with special support for temporal predicates. They also do not deal well with very long 3D trajectories that tend to have very large bounding rectangles, and can include a high number of I/Os per lookup. SEB-Trees segment space and time, but are not specifically designed for indexing trajectories. Research on TB-trees and SEB-trees does not explicitly discuss how to cluster data, and both are non-adaptive (i.e. they do not reorganize previously added pages as new data arrives).
To address the concern with very large 3D meshes or trajectories, several systems have proposed segmenting the trajectories to reduce the sizes of bounding boxes and group portions of trajectories that are near each other in space together on disk. Rasetic et al. (16) propose splitting trajectories into a number of sub-trajectories, and then indexing those segments in an R-Tree. They propose a formal model for the number of I/Os needed to evaluate a query, and use a dynamic programming algorithm to minimize the I/O for an average query size.
SETI (17) also advocates a segmentation-based approach like
PIST (18) focuses on indexing individual points rather than 3D meshes. PIST is similar in spirit to
A number of other systems, such as STRIPES (19), use a dual transformed space to index meshes or trajectories. While such indices are very compelling when indexing the future positions of moving objects, they are known to be suboptimal for answering historical or ad hoc queries (20).
Spatial clustering has been extensively studied (21–23). These approaches focus on generic methods to extract cluster information from large collections of ad hoc data points. Our clustering problem is more specific, since we deal with series of points ordered along 3D genes, and more importantly on the (potentially large) metadata associated to the 3D models.
In this paper, we introduce a complete efficient and scalable database system to query genomes in space. The system includes two components: (i) a database
Our system aims to foster the discovery of spatial relationships between genomic elements and accelerate the large-scale analysis of space-dependent regulatory mechanisms. Here, we map data from the 1000 genomes project (24) and experimental Chip-Seq data (4) onto most recent 3D models of the Human genome (8,9), and use
MATERIALS AND METHODS
3D genome database
We have implemented a fully functional database based on YARN (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), currently one of the most promising Big Data processing framework available.
a 3D client, which allows the user to navigate through the 3D genomic space. It dynamically retrieves high-resolution 3D data and genomic metadata from the rest of the system as the user moves through the 3D space and makes queries about certain 3D regions.
a sparse, adaptive 3D index, which dictates how genomic metadata associated to contiguous regions in the 3D space are co-located in the distributed filesystem. The 3D index translates the 3D query posed by the user into a series of data chunks that have to be retrieved from the distributed filesystem.
immutable data chunks that compactly store genomic data and metadata in the distributed filesystem.
Figure 1 gives an overview of our database. We implemented our own indices and ancillary data structures to optimize all operations, and bypass the Hadoop NameNode whenever possible to reduce the end-to-end latency of the queries. We do not rely on higher-level Hadoop data structures such as those offered by SpatialHadoop (25) or Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), since these higher-level structures negatively impact the performance of on-line queries. Along similar lines, we do not directly use large-scale batch-processing features a la MapReduce, since they would introduce unreasonably high latencies in our context, but could take advantage of such functionalities for off-line operations such as batch updates or complex analytics.
A detailed description of each component of our database, as well as an explanation on our query insertion and query execution techniques, is available as supplementary material. The full codebase of our current implementation is available online at https://github.com/mavlyutovrus/3d_genome_browser.
Data and web services
Currently, three complete models of human 3D genome structures are stored in our database. We retrieve these data from (9,11) and describe them in Table 1. It is worth noting that (9) provides individual structures for each chromosome, but no global relative arrangement of all chromosomes. For this reason, we provide independently each chromosome structure.
Origin and description of the 3D models stored in 3DBG
|B-cell GM06990||human||real||individual chromosomes||13||(9)|
|B-cell leukemia||human||real||individual chromosomes||13||(9)|
The structures are interpolated with a finite number of points. This number of points depends of the resolution of the model and therefore varies from one model to another.
The volume encompassing the genome structure is segmented in 3D cubic cells. Spatial queries use the coordinates of a cube (starting and ending positions on the x, y and z axis) as an input and return the coordinates of the interpolation points modeling the DNA chains contained within this cell. In particular, it allows us to identify the ranges of DNA subsequences within this volume.
The syntax of a query is http://1kgenome.exascale.info/<mode>?xstart=<x1>& xend=<x2>&ystart=<y1>¥d=<y2>&zstart=<z1>&zend=<z2>, where
We use GRCh38 assembly of the human genome from the UCSC genome browser as our reference human sequence (3,4). Nucleotide sequences can be accessed from their chromosomic location. The syntax of a query is http://1kgenome.exascale.info/range?start=<start>&end=<end>&chrid=<chr>, where
Single nucleotide polymorphism
We store the Single Nucleotide Polymorphism (SNP) data from the 1000 Genomes Project (24). Web users can retrieve SNPs data within a specific range of a chromosome with the following query http://1kgenome.exascale.info/js_snp?chr=<chr>& start=<start>&end=<end>, where
A query returns an array of arrays showing information for each individual SNP found within this interval. This information is represented as a 4-tuple including the SNP position, the SNP ID and the two alleles.
Experimental ChIP-Sequencing data
We recorded experimental ChIP-Sequencing (Chip-Seq) data from the ENCODE project (26) stored in the UCSC genome browser (3,4). These data help us to identify transcription factors binding sites (TFBS).
ChIP-Seq data can be retrieved with a query to http://1kgenome.exascale.info/chipseq?chr=<chr>& start=<start>&end=<end>&celline=<cellid>, where the variables
The output of such query is an array of 7-tuples that contain basic information on the Chip-Seq data. A 7-tuple stores the chromosome number, starting and ending index of the Chip-Seq peak, the transcription factor name, a normalized value (ranging from 1 to 1000) indicating the magnitude of the binding, the cell lines with similar TFBS and a list of SNPs occurring in this binding site.
Determining single nucleotide 3D coordinates
A key feature of a system for querying genomes in space is its capacity to directly access the 3D coordinates of any nucleotide. However, 3D genome structures are often modeled with (sparse) discrete sets of points corresponding to enzyme cut sites. In that case, it is useful to directly access the closest cut site (in each strand direction) of a model.
This information is accessible with a web query to http://1kgenome.exascale.info/chr_pos?chrid=<chr>&bp=<index>&m=<mode>, where
3D genome Web visualization interface
Before starting to explore 3D genome structures, the users must select a model. The front page of
Once a model is selected, users access a search engine that enables them to directly request specific genomic locations (i.e. chromosome number and position), target genes or arbitrary spatial coordinates. Queries re-direct the users to a 3D structure viewer pointing at the desired location. From there, they can explore and navigate the genome structure in real-time. The web client downloads all genomic and structural data in the neighborhood of the query location. More data are dynamically loaded when the user travels in the 3D space. This allows a smooth exploration of the 3D genome structure on any computing device. A screenshot of the 3D genome browser is presented in Figure 2.
The viewer implements multiple features allowing its users to access and visualize Human genome data stored in the database. At the core of
Alternatively, the users can switch to a linear mode. In that case, the neighborhood of the query position is defined as a sequence interval. It is equivalent to the viewing frame used in classical (i.e. one dimensional) genome browsers. This mode also allows the users to retrieve all SNPs present in this one-dimensional neighborhood.
The third mode enables the users to highlight transcription factor binding sites in the viewer. TFBSs are represented as colored regions of the DNA chain. The color indicates the intensity of the Chip-Seq experiment (green for low and red for high). The user can access detailed information about the Chip-Seq data by clicking on the TFBS region, or access UCSC genome browser records through a hyperlink.
Finally, we also implemented a distance calculation tool that enables the users to automatically determine the physical distance between two points in space. We intentionally did not use physical units, but instead rely on the model coordinates. Indeed, the determination of physical distances requires to interpret experimental data and make approximations which are often subject of discussions. By contrast, we believe that arbitrary units allow the users to estimate relative differences and leave them the freedom to interpret the experimental data used to obtain the 3D model.
Visualization of custom genotyping data
An important feature of our viewer is to enable users to map their private genotyping data onto reference 3D architectures, and allow them to visualize the data within our browser. This functionality is intended to provide users with tools to identify geometrical dependencies in custom genotyping data sets. The query interface allows users to upload a local file containing genotyping data. In order to prevent any formatting issues, we implemented a program to validate and convert most standard genotyping data file.
Once uploaded, the users can browse and query the 3D genome as described above. In addition to the reference data stored in our database, the users can now access simultaneously the reference SNPs collected from (24) together with those stored in the local file. To prevent any privacy issues, user data are stored locally and not transmitted to our server. A similar solution has been adopted by the UCSC genome browser (29).
To evaluate the performance of our system, we used sequence read alignments of chromosome 11 available from the 1000 Genomes project (24). This data consists of short (around 100 bases) DNA sequence reads, mapped onto the Human reference genome. We used ∼1.5 billions of records, which constitute 250GB of raw data.
All data have been stored in a cluster (Hadoop version 2.2.0) of 10 machines. Worker nodes were commodity machines with Quad-Core Intel i7-2600 CPUs @ 3.40GHz, 8GB of DDR3-1600 RAM, 500GB Serial ATA HDD, running Ubuntu 12.04.2 LTS. The index node was similar, but with 16GB RAM. The replication factor was set to 3.
The main metric we take into account is response time (latency). As a matter of fact, execution time depends on the amount of records to be returned. In our context, we considered simple, uniform and fixed-size cube queries returning from 100 to 1000 records.
Benchmarking against the PostGIS database
The performance of storage systems can be characterized by their speed to access the data (i.e. by the average time needed to execute a query) and the influence of the size of the output on the time required for returning a response. In this section, we evaluate the performance of
We uploaded in the database a data set of ∼1.5 GB (gigabytes) that contains the 3D coordinates of reference points of a simulated model of the human genome (11). All these positions were indexed in the database using the spatial index (the description of PostGIS's spatial index can be found at http://revenant.ca/www/postgis/workshop/indexing.html). Then, we measured the speed of reaching the data through the Java application using the PostGIS JDBC driver, and the influence of the size of the output (results of query) on the processing time.
In our experiments, we queried for all different reference points available in the model from (11), and called the database to get all points that were stored in the cube centered around the current reference point, with a constant edge size (100, 200, 300 and 400 base units). Our results are shown in Figures 3 and 4.
Figure 3 shows the speed of accessing the data. Here, PosGIS has on average a query execution time well above 300 ms, and thus well above the time to gracefully retrieve and visualize data dynamically for on-line 3D browsing. By contrast, when we run the same experiment with
Next, for each edge size (size of the neighborhood delimited by the cube query), we plot in Figure 4 the relation between the processing time (i.e. latency), the size of the neighborhood that we wish to explore and the number of records returned. Experiments were repeated five times to obtain the variances. Here again, we observe that PostGIS yields unsatisfactory latencies, which rapidly grow as we retrieve more data and increase the size of the query. In particular, PostGIS latencies for large queries (edge size of 400 units) exceed the threshold required for real-time visualization (∼200 ms), while
Benchmarking against the Hbase database
To compare the performance of our back-end solution with Hbase (http://hbase.apache.org/), we installed an Hbase cluster (version 0.96.1) on our experimental infrastructure. We also split all data according to our index for HBase, but used a standard HBase database rather than our own chunk storage. The caching for the Hbase cluster was switched off to ensure valid results. Figure 5 shows the results. As can be observed, the execution times of our system are much lower than those of HBase.
Using 3DGB to explore the 3D neighborhood of a gene
We illustrate the usefulness of
We started our investigation by exploring a 3D neighborhood centered on the promoter of the RB1 gene in the 3D structure of chromosome 13 in normal B-cells (GM06990) (9). We retrieved the list of SNPs found in the promoter region of RB1, and in other DNA strands that are not in the immediate sequence neighborhood of RB1 promoter.
Distribution of SNPs in the 3D neighborhood of RB1
In addition to the promoter region, we found three other strands in the spatial vicinity of RB1: S1 (44762907, 45379923), S2 (57184305, 57531250) and S3 (58747059, 59087738). These strands are located in a radius R = 0.2 of RB1 transcription start site, which corresponds approximately to 88 Å.
A total of 1199 SNPs were identified in this 3D neighborhood, for which we retrieve their associated phenotype from (6). A complete list of these SNPs with associated phenotypes is available in the supplementary data. As expected we identified many SNPs related to various types of cancer. However, another interesting finding has been to detect the occurrence of one SNP (rs10492604) related to sleep disorders in the strand S3. Importantly, we found only two SNPs related to sleep disorders in the whole chromosome. Moreover, with a distance of 230 Å from RB1 transcription start site (R = 0.53), this other SNP (rs10492507) is also in the vicinity of RB1 gene.
Previous studies have identified that children with hereditary retinoblastoma have also an increased risk of developing trilateral retinoblastoma (30). Trilateral retinoblastoma is the combination of retinoblastoma (usually bilateral) and pineoblastoma (a tumor in the brain's pineal gland). The pineal gland secretes multiple hormones (including melatonin) that are implicated in the regulation of sleep patterns in seasonal and circadian rhythms (31).
Although our finding does not imply any causation, it suggests possible interesting genetic relationships between retinoblastoma and sleep disorders. The scripts used in this experiment are available at http://3dgb.cs.mcgill.ca/scripting.html.
Although this paper focuses on the technical description of the database system and the evaluation of its performances, we designed
Finally, even though we did not specifically tailor
Supplementary Data are available at NAR Online.
Genome Canada and Genome Québec (Bioinformatics and Computational Biology competition, in part); Canadian Institutes of Health Research [CIHR BOP-130836 to J.W. and M.B.]; Natural Sciences and Engineering Research Council of Canada Discovery [NSERC RGPIN 386596-10 to J.W.]; Swiss National Science Foundation [200021_143649 to PCM]. Funding for open access charge: Genome Canada/CIHR funding.
Conflict of interest statement. None declared.