The Genomic HyperBrowser: an analysis web server for genome-scale data

The immense increase in availability of genome-scale datasets, such as those provided by the ENCODE and Roadmap Epigenomics projects, presents unprecedented opportunities for individual researchers to pose novel falsifiable biological questions. With this opportunity, however, researchers are faced with the challenge of how to best analyze and interpret their genome-scale datasets. A powerful way of representing genome-scale data is as feature-specific coordinates relative to reference genome assemblies, i.e. as genomic tracks. The Genomic HyperBrowser (http://hyperbrowser.uio.no) is an open-ended web server for the analysis of genomic track data. Through the provision of several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser enables a range of genomic investigations related to, e.g., gene regulation, disease association or epigenetic modifications of the genome.


INTRODUCTION
The immense increase in the production of genome-scale datasets, e.g., through the ENCODE (1) and Roadmap Epigenomics (2) projects, poses an unmet challenge in terms of available methodology and tools for analytic investigations. These datasets provide unprecedented opportunities for individual researchers to elucidate particular biological mechanisms. However, analysis of these datasets and their relations to each other typically requires development of a range of ad hoc scripts for generating, manipulating and analyzing genomic data.
For a range of organisms, well-established and internationally accepted reference genome assemblies now exist. Using coordinates on such assemblies, data related to particular locations on the genome can be represented in a precise and unambiguous manner. This avoids many previous difficulties in the field, such as confusion due to incompatible gene terminology. A genome-wide collection of coordinates for a particular genomic feature is often referred to as a genome annotation track, or just genomic track. Such genomic tracks can, e.g., refer to the location of genes, binding of transcription factors, methylation of DNA or modification of histones. Genomic tracks not only allow unified visualization and browsing, such as through the UCSC Genome Browser (3), but also provide a powerful and unified basis for statistical analysis. The base pair positions of reference genomes serve as coordinates on a line, allowing entities such as genes or epigenetic modifications to be viewed as elements positioned on such a line. A statistical question posed on the relation between two genome-scale datasets may then be formulated as a simple question relating such elements. An example is to ask whether points on a reference line, as defined by one dataset, fall unexpectedly often within segments on the same line, as defined by another dataset.
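The point-in-segment formulation can be illustrated with a minimal Python sketch. It is not the HyperBrowser implementation: the function name, the half-open interval convention and the toy coordinates are assumptions made for illustration only.

```python
from bisect import bisect_right


def count_points_in_segments(points, segments):
    """Count the points that fall inside any segment.

    points: iterable of integer positions on one chromosome.
    segments: list of (start, end) half-open intervals, assumed to be
    sorted by start position and non-overlapping.
    """
    starts = [start for start, _ in segments]
    inside = 0
    for p in points:
        # Index of the last segment that starts at or before p.
        i = bisect_right(starts, p) - 1
        if i >= 0 and p < segments[i][1]:
            inside += 1
    return inside


# Toy data (assumed): four point elements and three segments.
points = [5, 120, 250, 900]
segments = [(100, 200), (240, 260), (800, 850)]
print(count_points_in_segments(points, segments))  # prints 2
```

The observed count only becomes a statistical statement once it is compared with what a null model would produce, which is the role of the hypothesis tests described later.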
The Genomic HyperBrowser web server provides a broad suite of functionality for rigorous statistical analysis of genomic data. At the core of the system is a set of statistical analyses, available through a single tool: 'Analyze genomic tracks'. Descriptive statistics, test statistics and null models are described in terms of well-defined elements along a linear representation of the genome, in the form of genomic tracks. This tool and its underlying methodology have been described in a previous publication (4), and the tool has since been expanded with tens of new descriptive analyses and hypothesis tests. The statistical analysis is augmented by a collection of data preparation tools that support the processing of genomic data into forms that subsequently allow sophisticated questions to be posed in a simple and intuitive manner. All 42 tools at the server are based on the generic treatment of genomic data as elements along a linear representation of the genome, allowing questions related to different biological application domains to be treated in the same manner. The tools share an underlying analysis code base, which is open-source and tightly integrated with the Galaxy framework (5) for handling of web access, users and data. Through the integration with Galaxy, the standard Galaxy tools are also available and can be used together with the HyperBrowser-specific functionality. The HyperBrowser website is free and open to all, and there is no login requirement.
The Genomic HyperBrowser is designed to be as open-ended as possible: instead of being developed around a few canonical usage scenarios, it provides a core set of abstractions and components that can be used and combined in a myriad of ways to answer precisely formulated biological questions. Figure 1 gives a schematic overview of how various tools at the HyperBrowser server can be used as part of a full analysis scenario.

ANALYSIS OF GENOMIC TRACKS
A large collection of analytical functionality is available through the tool 'Analyze genomic tracks' under the 'HyperBrowser analysis' menu. This enables a range of genomic investigations that query characteristics of individual tracks or relations between pairs of tracks along the genome (4). After selecting one or a pair of tracks, the analysis of interest can be selected among a set of analyses deemed meaningful based on the type of track(s) selected. For instance, selecting two tracks of segments (intervals) along the genome (e.g. two tracks of ChIP-seq peak regions, without any values associated with the peaks) will allow questions related to co-localization (overlap). On the other hand, selecting two tracks of values per base pair along the genome (e.g. two tracks of bp-level ChIP-seq signal values for every position of the genome) will allow questions related to correlation of values. The HyperBrowser system distinguishes between 15 types of tracks at the generic level (6), where the most widespread types are tracks of points and segments.
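How the track type determines the meaningful question can be sketched in Python: base pair overlap for two segment tracks, and Pearson correlation for two tracks of per-position values. Both functions and the toy inputs are illustrative assumptions, not code from the HyperBrowser.

```python
import math


def bp_overlap(segs_a, segs_b):
    """Base pairs covered by both tracks.

    Each track is a list of (start, end) half-open segments, assumed
    sorted by start position and non-overlapping within the track.
    """
    i = j = total = 0
    while i < len(segs_a) and j < len(segs_b):
        start = max(segs_a[i][0], segs_b[j][0])
        end = min(segs_a[i][1], segs_b[j][1])
        if start < end:
            total += end - start
        # Advance whichever segment ends first.
        if segs_a[i][1] <= segs_b[j][1]:
            i += 1
        else:
            j += 1
    return total


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant value tracks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Two segment tracks: the natural question is co-localization (overlap).
print(bp_overlap([(0, 10), (20, 30)], [(5, 25)]))  # prints 10

# Two value tracks: the natural question is correlation of values.
print(pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.0]))
```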
Analyses are divided into descriptive statistics (such as counts, base pair coverage and averages) and hypothesis tests (such as whether two tracks are overlapping more than expected by chance). A total of 56 descriptive statistics and 20 hypothesis tests are available, depending on the type of tracks (listed in Table 1). Each hypothesis test may be seen as a generic genomic question that can be parameterized in several ways. The statistical testing procedure used to resolve the question not only varies between questions, but also between parameterizations. One parameterization is the selection of an appropriate null model. Statistical hypothesis testing requires a notion of randomness for the null hypothesis, and careful attention has been given to making such randomness assumptions transparent to the user. For most tests, the randomness assumptions can also be selected from a list of possibly meaningful alternatives (Figure 2A). For instance, for hypothesis tests involving a gene track, one can choose a simple null model where genes are randomized independently and uniformly along the genome. Alternatively, one can select a null model where the empirically observed clustering tendency of genes (distribution of inter-gene distances) is preserved. A further alternative is to sample gene positions according to a separately specified intensity track, which can for instance be used to control for influence by external confounders. Depending on the assumptions deemed appropriate by the user for the hypothesis test (through, e.g., the selection of a null model), the system will determine whether to use an asymptotic computation or a Monte Carlo (MC) based evaluation of P-values. This is handled by the system, while remaining transparent to the user. For MC-based evaluation of P-values, a sequential sampling scheme, MCFDR, is used to automatically determine the appropriate number of samples for statistical testing (9).
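A Monte Carlo evaluation under the simplest null model mentioned above (elements randomized independently and uniformly) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a fixed number of samples rather than the sequential MCFDR scheme, treats the genome as a single line of length `genome_length`, and all function names are hypothetical.

```python
import random
from bisect import bisect_right


def count_inside(points, segments):
    """Number of points inside sorted, non-overlapping half-open segments."""
    starts = [start for start, _ in segments]
    inside = 0
    for p in points:
        i = bisect_right(starts, p) - 1
        if i >= 0 and p < segments[i][1]:
            inside += 1
    return inside


def mc_p_value(points, segments, genome_length, n_samples=999, seed=0):
    """One-sided MC P-value for 'points fall inside segments more than expected'.

    Null model (assumed): each point is placed independently and
    uniformly on [0, genome_length); the segments are kept fixed.
    """
    rng = random.Random(seed)
    observed = count_inside(points, segments)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        randomized = [rng.randrange(genome_length) for _ in points]
        if count_inside(randomized, segments) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_samples + 1)


# Toy data (assumed): segments cover 10% of the line, all points lie inside.
segments = [(0, 100)]
points = [10, 20, 30, 40, 50]
print(mc_p_value(points, segments, genome_length=1000))
```

Swapping in a different null model, such as one preserving inter-element distances or one sampling from an intensity track, would only change how `randomized` is drawn. The test statistic and the P-value computation stay the same.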
The output of the 'Analyze genomic tracks' tool (Figure 2B) presents the main conclusion from the analysis, along with some interpretations and restrictions on its applicability. This main conclusion is complemented by a range of detailed results in the form of tables and figures, provided at both the global level and for local regions along the genome. The tool emphasizes reproducibility by providing rich analysis output, describing the methodologies that have been used, and reporting all parameter settings and data sources. Screencasts, tutorials and demo buttons for five genome analysis examples are provided with the tool.
A set of tools focusing on visual analysis of track data is available under the menu 'Visual analysis of tracks'. Under the menu 'Specialized analysis of tracks', we provide a tool containing a recently developed hypothesis test querying whether the elements of a track are spatially co-localized with respect to the three-dimensional structure of the genome, as defined using results from recent Hi-C experiments (10). A tool for unsupervised analysis of track similarities (clustering) is also available under the same heading (manuscript submitted). Tool details are given in Table 2.

PROCESSING DATA INTO FORMS SUITABLE FOR ANALYSIS
In many situations, a complex formulation of a biological question may be simplified if the original data are first transformed into a form that more directly reflects the question of interest. An example of this is a question of how often DNA binding locations of a given TF (as a first genomic track) fall inside or in the close vicinity of genes (as a second track). Although clearly manageable, the concept of proximity in this setting requires some thought and further specification. If one transforms the gene track by expanding the gene intervals to include, say, one kbp flanks, one can afterwards ask the simpler question of how often the TF binding locations fall inside these expanded gene intervals. This latter version is easy to envision and does not involve any ambiguity. This example shows the redefinition of a problem originally formulated to involve vicinity to fit with an analysis based on the simpler concept of containment. Thus, by combining a set of basic, generic analyses with a collection of track transformation functionality, a core set of well-understood analyses can be applied to a much broader range of biologically motivated questions. Several tools for customizing data into forms that may simplify subsequent analyses are available under the menu 'Customize tracks', and are summarized in Table 3.
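The reformulation from vicinity to containment can be sketched in Python. The flank size, the clipping at chromosome ends, the merging of overlapping expanded intervals and all coordinates are assumptions chosen for illustration. The code does not reproduce the behaviour of the 'Expand BED segments' tool.

```python
def expand_segments(segments, flank, chrom_length):
    """Expand each (start, end) segment by `flank` bp on both sides.

    Expanded segments are clipped to [0, chrom_length) and merged where
    they overlap or touch, so that the result is sorted and non-overlapping.
    """
    expanded = sorted(
        (max(0, start - flank), min(chrom_length, end + flank))
        for start, end in segments
    )
    merged = []
    for start, end in expanded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_contained(points, segments):
    """Number of points lying inside at least one half-open segment."""
    return sum(any(s <= p < e for s, e in segments) for p in points)


# Toy gene track and TF binding sites on a 10 kbp chromosome (assumed).
genes = [(500, 1500), (3000, 4000), (9500, 9900)]
tf_sites = [200, 5200, 9000]

gene_regions = expand_segments(genes, flank=1000, chrom_length=10000)
print(gene_regions)                             # [(0, 5000), (8500, 10000)]
print(count_contained(tf_sites, genes))         # prints 0: no site is inside a gene
print(count_contained(tf_sites, gene_regions))  # prints 2: two sites are within 1 kbp
```

After the transformation, the proximity question is answered by the same containment count that would be used for any points-versus-segments analysis.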
In some analysis scenarios, a feature of interest is not explicitly available in the form of a genomic track, but can be derived from properties of other genomic tracks. The HyperBrowser menu 'Generate tracks' includes several tools for generation of datasets in such situations. Tracks can be generated based on DNA sequence properties along the genome, or based on density of, or distance to, certain genomic features along the genome. An overview of these tools is given in Table 3.
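Deriving a track from DNA sequence properties can be illustrated with k-mer occurrences, followed by a density-style summary in fixed-size bins. This is a sketch under simplifying assumptions: only the forward strand is searched, matching is case-sensitive, and the function names and toy sequence are hypothetical.

```python
def kmer_positions(sequence, kmer):
    """0-based start positions of all (possibly overlapping) occurrences."""
    positions = []
    i = sequence.find(kmer)
    while i != -1:
        positions.append(i)
        # Restart one position later so overlapping matches are found.
        i = sequence.find(kmer, i + 1)
    return positions


def count_per_bin(positions, seq_length, bin_size):
    """Number of positions falling in each consecutive bin of `bin_size` bp."""
    n_bins = -(-seq_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for p in positions:
        counts[p // bin_size] += 1
    return counts


sequence = "ACGACGTACG"  # toy sequence (assumed)
occurrences = kmer_positions(sequence, "ACG")
print(occurrences)                                    # [0, 3, 7]
print(count_per_bin(occurrences, len(sequence), 5))   # [2, 1]
```

The list of occurrence positions corresponds to a track of points, while the per-bin counts correspond to a derived density that can be compared against other tracks.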
In other analysis scenarios, genomic coordinates are available for the data of interest, but not in a format that can be readily used in the tool of interest. Genomic datasets come in a variety of forms, including raw lists of coordinates not adhering to any specified format. The data are usually in tabular format, typically as raw text files or as spreadsheet documents. The HyperBrowser recognizes most commonly used tabular formats, in addition to a recent unified format, GTrack, supporting all 15 basic types of tracks handled by the system. A format conversion tool is available under the menu 'Format and convert tracks', alongside a tool for structuring raw tabular data into a GTrack file (Table 3). A set of tools for validating and editing GTrack files are also available, as introduced in (6).

Figure 1. Schematic overview of tool categories available at the Genomic HyperBrowser server. The figure indicates at which points of a typical analysis scenario the various tools may be of use, from the initial collection and preparation of data, through customization of data to match the analysis, to the statistical evaluation of a biological hypothesis. For boxes representing several tools, the precise list of tools can be found under the corresponding header in the table that is referred to (for instance, the two tools represented by the 'Format and convert' box can be found under the heading 'Format and convert tracks' of Table 3).
[Figure 1 box labels: 'HB track repository (Table 3): Extract track tool'; 'Generate tracks (Table 3): 6 tools'; 'Customize tracks'; 'Format and convert (Table 3): 2 tools'; 'Descriptive statistics (Table 1): Analyze genomic tracks'; 'Statistics on tracks and relations'; '3D analysis (Table 2): Analyze spatial co-localization'; 'Hypotheses on 3D co-localization supported by data'.]

Table 1 (fragment).
Which categories of track1-points fall more inside which categories of track2-segments?
Co-localized in 3D? (Track1 type: LGP; Track2 type: P) Are the points of track2 closer in 3D (as defined by track1) than expected by chance?
Each analysis is defined for either one or two tracks, with the corresponding track type denoted in the columns 'Track1 type' and 'Track2 type'. The track type abbreviations are as defined in (6).

Statistical analysis
Analyze genomic tracks
The main analysis interface of the Genomic HyperBrowser (4). Executes analyses on a single genomic track or on the relation between two tracks. Allows specification of additional input parameters for the analyses, specifically including the specification of alternative hypotheses and null models for the hypothesis tests. Contains 56 descriptive statistics and 20 hypothesis tests.
Analyze cell-specificity of active chromatin in disease regions, as described in section 'Full analysis scenario'.

Visual analysis of tracks
Visualize track elements relative to anchor regions
Allows visualization of the distribution of track elements along chromosomes, or along custom-specified bins. The specified regions are displayed vertically, in order to simplify visual comparison.
Visualize the detailed positioning of histone modifications relative to the TSS of a selected set of gene regions.
Create high-resolution map of track distribution along genome
Visualizing track elements along a line, such as in the UCSC Genome Browser or the relative positioning visualization tool, can necessarily only offer a global overview at a very limited resolution. This tool instead uses a fractal layout of the genome line (similar to a Hilbert curve (11)) to map genome locations to individual pixels in a matrix instead of along a line, effectively increasing the resolution quadratically. Although the interpretation requires a certain effort, this form of visualization can potentially be very informative.
Visualize the genome-wide distribution of a densely populated track, such as repeating elements or a DNase accessibility experiment.
Create high-resolution map of multiple track distributions along genome
Similar to the one-track version above, but uses up to three separate color channels (red, green, blue) to visualize the presence of up to three different tracks in corresponding parts of the genome by combining their color channel values at individual pixels.
Visualize the comparative distribution of DNase accessibility in three different cell types to see patterns of similar and distinct accessibility.

Visualize relation between two tracks across genomic regions
Used to reveal complex relations between tracks along the genome.
For each defined analysis region (bin), a score is calculated for both tracks, using the specified summarizing function. The resulting (x,y) scores are then visualized as a single point in a scatter plot.
Plot exon density versus average melting temperature in 10 Mbp bins along the genome.

Aggregation plot of track elements relative to anchor regions
Used to reveal trends of how track elements are distributed relative to a set of anchor regions (bins). All anchor regions are divided into the same number of sub-bins, and a summary statistic is calculated for each sub-bin and averaged across all anchor regions. The tool returns a plot of the average values with 95% confidence intervals.
Positions of histone modifications around TSS.

Specialized analysis of tracks
Analyze co-localization of input genomic regions
Analyze a selected track of genome locations for spatial co-localization with respect to the three-dimensional structure of the genome, as defined using results from recent Hi-C experiments.
The Hi-C data have been corrected for bias using a method presented in a recent paper (10), and further normalized by subtracting the expected signal given the sequential distance between elements.
Analyze whether somatic mutations in cancer are co-localized in 3D in a relevant cell type.

Perform clustering of genomic tracks
Used to investigate relations between multiple tracks in an unsupervised manner (manuscript submitted). This tool allows an essentially unlimited number of tracks to be selected, and further allows the distance measure used for the clustering to be precisely specified through selection among a varied set of notions of track similarity.
Analyze similarities between histone modifications in different cell types.
Analyze k-mer occurrences
Used to analyze a global track of occurrence locations for a specified k-mer from a particular reference genome. All relevant analyses in the 'Analyze genomic tracks' tool can be used.
Analyze correlation of a specific k-mer with other tracks, e.g. genes, in order to find functional significance.

Inspect k-mer frequency variation
Used to calculate and visualize the frequency distribution of a particular k-mer along a genome reference. Splits the selected analysis regions (e.g. chromosomes) into a suitable number of subregions (bins). For each bin, the number of occurrences of the selected k-mer is counted and plotted.
Inspect the frequency variation of a particular k-mer along the genome.
Further descriptions are given at the web pages of the tools themselves, along with demo buttons and links to reproducible examples of how each tool can be used. The 'Analyze genomic tracks' tool has previously been described (4).

Extract the RefSeq gene track, in order to expand the gene segments with the 'Expand BED segments' tool.

Customize tracks
Expand BED segments
Allows extracting start-, mid- or endpoints of genomic intervals, as well as expanding either the original intervals or the extracted start-/end-/mid-points. This is useful in a variety of situations where an analysis of interest involves either proximity to or positioning relative to the original track elements, or where a size unification of track elements is desired (based on, e.g., taking midpoints and then expanding a certain distance). Also, if the expanded region crosses any chromosome borders, this is handled correctly.
An example of an analysis involving both proximity and relative positioning is the analysis of histone modification frequencies in bins of particular distances relative to the upstream end points of genes (transcription start sites).
Combine two BED files into single case-control track
Allows combining elements from two separate datasets into a single track where the elements are denoted as case (target) or control, depending on their source. This allows analyses of how other tracks preferentially interact with case elements as opposed to control elements.
An example is to combine chromatin states from two different cell types as case and control elements, in order to ask whether regions associated with MS susceptibility overlap more with case than control segments. See section 'Full analysis scenario'.

Generates so-called 'intensity tracks', which are used in controlling for confounder tracks in particular analyses. The user selects a target track as well as a set of control tracks, i.e. a set of tracks whose influence on the target track one aims to control for. The generated intensity track defines, for each base pair, the probability that an element of the target track lands at that position during randomization. The intensity track can afterwards be selected as part of the null model specification when doing hypothesis testing through the 'Analyze genomic tracks' tool.
Can, e.g., be used to control for the influence of gene proximity when analyzing the relation between TF binding locations and active regions in a given cell type.

Generate k-mer occurrence track
Generates a global track of occurrence locations for a specified k-mer on a particular reference genome.
Generate a track of all occurrences of the 8-mer 'ACGTTGCA' in the human hg19 genome assembly.
Generate track of genes associated with literature terms (using Coremine)
Generates a track of gene segments along the human genome, where the genes are associated with one or more specified literature terms. The associations are provided by the CoreMine medical database, which is regularly updated with term-gene associations mined from published literature.
Find a set of genes associated with melanoma. Each gene will have an attached P-value, denoting the strength of the association.

Format and convert tracks
Convert between GTrack/BED/WIG/bedGraph/GFF/FASTA files
The most commonly used formats for genomic location data are (arguably) the formats BED, bedGraph and WIG defined by the UCSC Genome Browser, as well as the format GFF in various versions. The tool allows converting between these formats, to the degree they are able to represent the same information. The tool also allows converting data to and from GTrack, a recent, unified format that is capable of representing data of any track type, and thus data stemming from any of the other file formats (6).
Convert a GTrack file to the BED format in order to use BED-specific Galaxy tools.
Create GTrack file from unstructured tabular data
The tool allows structuring unformatted tabular data into a GTrack file by specifying the necessary meta-data through simple selection boxes, inferring further properties of the data where possible.
Import virus integration sites of the Human Papilloma Virus (HPV) from an Excel spreadsheet into a GTrack file for further analysis by the 'Analyze genomic tracks' tool.
Further descriptions are given at the web pages of the tools themselves, along with demo buttons and links to reproducible examples of how each tool can be used. The GTrack-related tools have previously been described (6).

SUPPLEMENTING GUI SELECTION WITH COMMAND-BASED BATCH EXECUTION
A web interface based primarily on point-and-click selection has several advantages compared to a command-line-based approach to data analysis. A main advantage is that it does not require the recollection of suitable commands and parameters to achieve a given analysis objective.
A typical disadvantage is that it may be cumbersome to perform a multitude of similar analyses. This is in contrast to the command-based approach, where slight modifications to an analysis can often be done very quickly, and where looping may allow multiple analyses to be performed without a huge manual effort. We believe this is rapidly becoming an important issue for genome analysis, as e.g. the ENCODE and Roadmap Epigenomics projects generate chromatin and transcription factor binding tracks for hundreds of different cell types.
To meet this challenge, we have combined advantages of both worlds, the point-and-click-based and the command-based, through what we refer to as 'batch execution functionality'. For the initial specification of an analysis, we mainly rely on a GUI-based approach, using selection boxes as described in the section 'Analysis of genomic tracks'. After an analysis has been specified through the GUI, one can click on 'Inspect parameters of the analysis' to obtain a 'corresponding batch command line'. This purely textual representation of the analysis can now be modified and/or duplicated according to customized needs, and executed in the 'Execute batch commands' tool under the menu 'Text-based analysis interface'. Two options that increase the flexibility are the use of a slash (/) to denote that an analysis is to be performed with multiple alternative tracks or parameter values, and the use of a star character (*) to denote that a given analysis is to be performed on all sub-tracks at a given level of the HyperBrowser track collection hierarchy. These extensions of the format greatly simplify the process of running a given analysis on a set of related tracks, e.g., for different chromatin marks or cell lines.

FULL ANALYSIS SCENARIO
The full reach of the Genomic HyperBrowser system becomes apparent when considering the combination of various tools for processing and analyzing data. By employing an appropriate combination of data preparation and analysis functionality, a range of sophisticated and precisely specified hypotheses can be investigated.
An example of such an analysis is the investigation of whether regions associated with a given disease overlap preferentially with marks of active chromatin in a certain cell type compared to another reference cell type. A sequence of steps for analyzing multiple sclerosis (MS) associated regions in B-cells versus hepatocytes is given in a Galaxy Page at http://bit.ly/hb_example. This page shows the sequence of tools that has been used, along with the exact input parameters and resulting outputs for each of the tools. Any step can be easily reproduced exactly or with modifications to the input parameters.
The analysis starts with a set of SNP coordinates in a form reflecting a typical starting point with data in a raw text or a spreadsheet document. The SNP data are uploaded and formatted, and two genomic tracks of active chromatin state regions (12) in B-cells and hepatocytes are extracted from the HyperBrowser track repository. In their original track representations, the question of interest would be whether the track of active regions in B-cells shows a stronger presence in the vicinity of SNP positions than the hepatocyte track, after appropriate normalization based on overall differences between the tracks of active regions. Both the concept of vicinity and the need for normalization complicate the precise formulation of an appropriate question. By expanding the SNPs to include flanks, and by combining the two tracks of active regions into a single case-control track, the final question becomes whether the MS SNP proximity regions overlap preferentially with segments of the combined active chromatin state track marked as case versus control. As can be seen from the result output of the final step of the analysis, this is indeed the case (13).
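The structure of this reformulated question can be sketched in Python. The coordinates, flank size and function names are invented, and the sketch only tallies overlaps by label. It does not reproduce the analysis on the Galaxy Page, and it performs no significance test.

```python
def snp_regions(snp_positions, flank):
    """Turn SNP positions into half-open proximity regions with flanks."""
    return [(max(0, p - flank), p + flank + 1) for p in snp_positions]


def combine_case_control(case_segments, control_segments):
    """Merge two segment tracks into one track of (start, end, label)."""
    labelled = [(s, e, "case") for s, e in case_segments]
    labelled += [(s, e, "control") for s, e in control_segments]
    return sorted(labelled)


def overlap_counts(regions, labelled_segments):
    """Count case and control segments overlapping at least one region."""
    counts = {"case": 0, "control": 0}
    for start, end, label in labelled_segments:
        if any(r_start < end and start < r_end for r_start, r_end in regions):
            counts[label] += 1
    return counts


# Toy data (assumed): two SNPs, active regions in two cell types.
regions = snp_regions([1000, 5000], flank=100)
active_case = [(950, 1050), (3000, 3100), (5050, 5200)]  # e.g. B-cells
active_control = [(2000, 2100), (4000, 4100)]            # e.g. hepatocytes

track = combine_case_control(active_case, active_control)
print(overlap_counts(regions, track))  # {'case': 2, 'control': 0}
```

Because both cell types sit in one track, overall differences in the number of active regions show up as the baseline case-to-control ratio, against which the ratio among overlapping segments can be compared by a hypothesis test.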
The Genomic HyperBrowser is complementarily integrated with other systems for working with genomic track data, both conceptually and implementation-wise. A powerful way to work with genomic data may be to, e.g., first get some general impressions and ideas about the data through direct visualization and browsing in the UCSC genome browser (3), followed by genome-scale exploration using EpiExplorer (14). Relevant hypotheses may then be evaluated by robust statistical analysis within the Genomic HyperBrowser. Throughout such an analysis scenario, one may also use a variety of Galaxy tools that work well together with all the mentioned systems.

CONCLUSIONS
The Genomic HyperBrowser is a comprehensive system for statistical analysis of genomic tracks. A range of genomic investigations can be addressed through a combination of data processing and analysis tools. Novel features and analyses are continually added to the system. Furthermore, if a user faces a track analysis challenge that cannot be resolved through the present version of the system, we take it upon ourselves to react promptly to expand the system.