HAL: a hierarchical format for storing and analyzing multiple genome alignments

Motivation: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. Results: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). Availability: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. Contact: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online.


Figure S1: Drosophila Progressive Alignment Size
Scanning through the entire alignment and counting all substitutions is a simple way to benchmark file access without any regards to indexing. The HAL API contains tools to do perform this scan on both MAF (mafMutations) and HAL (halSummarizeMutations). The time required to perform this scan on the MAF and HAL versions of the drosophila alignment on a single Intel Xeon x7560 2.27Ghz core using under 4G RAM is reported in Figure S2. Despite being compressed, the HAL alignment can still be scanned more quickly than the text--based MAF. The advantage becomes more noticeable when performing a partial query, as shown in the HAL (sim vs. sec, d=2) column of Figure S2. This is the time taken to count all substitutions between the first million bases on chromosome 2L of drosophila simulans (droSim1) and aligned sites within drosophila sechellia (droSec1), which are separated by two branches. In the MAF file, which is indexed on melanogaster, this query can only be achieved by scanning the entire file. In contrast, HAL tools can be used to perform this query much more efficiently.  The above query was repeated using drosophila yakuba, drosophila pseudoobscura, and drosophila virilis as targets, with the respective results displayed in the last three columns of Figure S2. These genomes cover a range of distances on the tree, from 5 branches for yakuba to 14 (the diameter of the tree) for virilis. The running time of these queries is, as expected, proportional to the number of branches that must be analyzed.

S2.
Projection to 10000 Vertebrate Genome Alignment The range of branch lengths on the drosophila tree (from 0.006 -0.26 substitutions per site), as well as the diversity of assembly qualities, makes it a good candidate to extrapolate, albeit crudely, trends for the eventual progressive alignment and reconstruction of the vertebrate phylogeny targeted by the Genome 10k project (www.genome10k.org). Assuming the average vertebrate genome is 20x larger than a fly genome (in the following section, we verify that HAL scales in this dimension using Multiz alignments), we use a simple linear scale to estimate alignment the Genome 10K HAL alignment (10000 genomes and 9999 ancestors) size to be 40TB, over 400TB smaller than MAF. Further gains in storage footprint could theoretically be obtained by using a wider (degree > 2) tree topology, requiring fewer ancestral genomes.
We use a similar strategy to infer the running times associated with counting substitutions in the alignment as shown in Figure S3 arbitrary clade of twenty species (rooted subtree with 39 nodes) in HAL is estimated as 20x the time to scan all of the flies. We can make this assumption because of the modular structure of the HAL graph. The same analysis on a non--modular format such as MAF would be proportional to the total scan time, thousands of hours longer.
We claim that the decomposition that HAL provides is therefore necessary for the analysis of large genome alignments in reasonable timeframes, as it provides a principled framework for parallelization.

Figure S3: Genome10k Progressive Alignment Query Time Estimates
In general, the analysis of a set of species in a HAL format will be dependent on the size of their spanning tree. In this sense, a clade is a best--case scenario since the number of internal branches is minimal. Still as shown in Figure S2, we can expect gains when analyzing distant sets of species when compared to the entire tree.

Time (hours)
To benchmark how HAL scales to full vertebrate genomes, we began by extracting a three--way alignment between mouse, rat, and kangaroo rat from the 60--way MultiZ alignment to mouse available on the UCSC Genome Browser: rsync -avz --progress \ rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/multiz60way/ ./ for i in *.maf.gz; do gzip -c -d $i >> mouse60.maf; done mafFilter 1 mouse60.maf --includeSeq mm10,dipOrd1,rn5 > mouse3.maf maf2hal mouse3.maf mouse3.hal Since the MultiZ alignment is reference--based, we use a HAL star tree with the reference (mouse) as the ancestral node to represent it. While the semantics between a reference species and ancestor a very different, in terms of the HAL format they can be represented equivalently. The alignment sizes and scan times are shown in Figure S4 and Figure S5, respectively. The relative performance of the HAL alignments in this case is lower than the drosophila alignments, with the HAL file being significantly larger than the gzipped MAF and taking longer to scan. This is due to low coverage of the Multiz reference alignments: only 67% and 19% of the rat and kangaroo rat genomes are aligned to mouse, respectively, and stored in the MAF file.
In contrast, the HAL alignment stores a position for every base in each genome.
Finally, even in this case where HAL is storing more information, its indexing allows partial queries to be performed much more efficiently. For example, printing all (~11M) substitutions between rat chromosome 20 and mouse (last column of Figure S5) is still over an order of magnitude faster in HAL: echo chr20 0 57791882 > rn5.chr20.bed halBranchMutations mouse3.hal --refTargets rn5.chr20.bed \ --snpFile rn5.chr20_snp.bed