LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor

Summary: Genomic datasets are often interpreted in the context of large-scale reference databases. One approach is to identify significantly overlapping gene sets, which works well for gene-centric data. However, many types of high-throughput data are based on genomic regions. Locus Overlap Analysis (LOLA) provides easy and automatable enrichment analysis for genomic region sets, thus facilitating the interpretation of functional genomics and epigenomics data. Availability and Implementation: R package available in Bioconductor and on the following website: http://lola.computational-epigenetics.org. Contact: nsheffield@cemm.oeaw.ac.at or cbock@cemm.oeaw.ac.at

Many types of biological data can be interpreted by comparing them to reference databases and searching for interesting patterns of enrichment and depletion. A particularly successful approach focuses on identifying significant overlap between gene sets. To this end, a gene set of interest is compared with a large compendium of existing gene sets with biological annotations, and the observed patterns of overlap are used for interpreting the new gene set. This type of analysis is exemplified by the popular GSEA tool (Subramanian et al., 2005), and it relies on existing gene set annotation databases such as Gene Ontology, KEGG Pathways and MSigDB.
Although gene set analysis has been pivotal for making connections between diverse types of genomic data, this method suffers from one major limitation: it requires gene-centric data. This is becoming increasingly limiting as our understanding of gene regulation advances. Genes are no longer viewed as monolithic building blocks but as multifaceted elements with alternative splicing and alternative promoters, as well as various types of non-coding, antisense and regulatory transcripts. Furthermore, it has become evident that gene expression and chromatin organization are controlled by 100 000s of enhancers and other functional elements, which are often difficult to map to gene symbols. The increasing emphasis on genomic region sets has been propelled by next generation sequencing-a technology that produces data most naturally analyzed in the context of genomic regions, for example as peaks and segmentations. Driven by projects such as ENCODE (Encyclopedia of DNA Elements) and IHEC (International Human Epigenome Consortium), the research community has established large catalogs of regulatory elements and other genomic features across many cell types.
Here, we present an R/Bioconductor package called LOLA (Locus Overlap Analysis) for enrichment analysis based on genomic regions. LOLA builds upon analytical concepts that we developed and applied in previous work (Bock et al., 2012;Farlik et al., 2015;Tomazou et al., 2015), and our software makes genomic region set analysis fast and easy for any species with an annotated reference genome. LOLA complements existing tools for gene set analysis (Khatri et al., 2012), tools that convert gene sets into genomic loci such as GREAT (McLean et al., 2010) and the ChIP-Seq Significance Tool (Auerbach et al., 2013), and other related tools including GenometriCorr (Favorov et al., 2012), Genomic HyperBrowser (Sandve et al., 2013), EpiGRAPH (Bock et al., 2009) ColoWeb (Kim et al., 2015) and ReMap (Griffon et al., 2015). Key features of LOLA are its integration with R and Bioconductor; a command-line interface supporting automated data processing; compatibility with high-throughput pipelines as well as interactive scripting in R; fast runtime even for very large region lists and reference databases; a comprehensive core database of regulatory elements; and convenient support for users to create custom reference databases.
Each LOLA analysis is based on three components (Fig. 1A): (i) The query set-one or more lists of genomic regions to be tested for enrichment; (ii) a region universe-the background set of regions that could potentially have been included in the query set; and (iii) a reference database of genomic region sets that are to be tested for overlap with the query set. LOLA includes a core reference database assembled from public data, including, for example, the CODEX database (Sanchez-Castillo et al., 2014) and cross-tissue annotation of DNase hypersensitivity (Sheffield et al., 2013). Alternatively or in addition, users can create problem-specific custom regions sets. To build a custom reference database, it is sufficient to collect text files with genomic coordinates (BED files) into a folder and to annotate them with descriptive names.

A simple example
Here we analyze a set of the top-100 strongest EWS-FLI1 binding peaks from a previous study (Tomazou et al., 2015) and assess their overlap with public data. The query set and the LOLA core database are available from the LOLA website. LOLA identifies all genomic regions from a query set that overlap with each region set in the reference database. This analysis is performed against a user-specified region universe, which is defined as the set of regions that could, in principle, have been included in the query set (e.g. subject to coverage constraints of the assay that was used to identify the query regions). By default, a single shared base pair is sufficient for regions to count as overlapping, but a stricter criterion can be chosen by the user. Next, considering each region as independent, LOLA uses Fisher's exact test with false discovery rate correction to assess the significance of overlap in each pairwise comparison (Fig. 1B). The resulting rank score for each region set is then computed by assigning it the worst (max) rank among three measures: P-value, log odds ratio and number of overlapping regions. This ranking system emphasizes overlaps that do well on all three measures, and it tends to prioritize biologically relevant associations (Assenov et al., 2014). Results are returned as a data.table object (Fig. 1C), providing a powerful interface to sort, explore, visualize and further process the results. In our example, the top hits accurately identify Ewing sarcoma specific regulatory elements.
LOLA implements several helper functions to explore and export the results. All functions are described on the LOLA website with vignettes illustrating the basic and advanced features. In particular, a tutorial on manipulating the universe region set helps with configuring the most biologically relevant comparisons. Furthermore, the buildRestrictedUniverse() function automatically builds a universe based on query sets and can be used to test two region sets for differential enrichment against a reference database.
LOLA facilitates large-scale comparisons by using optimized code for storing region sets and running vector calculations with the   (Dowle et al., 2015) and GenomicRanges packages (Lawrence et al., 2013). It also uses database caching and multiple CPUs to speed up the analysis. These optimizations make LOLA analyses fast and memory-efficient, completing within a few minutes on a standard desktop computer.
Gene sets are sometimes regarded as a universal language connecting genes, diseases and drugs. We anticipate that sets of genomic regions can similarly connect diverse types of genome, epigenome and transcriptome data to identify relevant associations in large datasets, thereby leveraging the broad investment into large-scale functional genomics and epigenomics for biological discovery. Such analyses can now be done easily and efficiently using LOLA.