Smoother: on-the-fly processing of interactome data using prefix sums

Abstract Nucleic acid interactome data, such as chromosome conformation capture data and RNA–DNA interactome data, are currently analyzed via pipelines that must be rerun for each new parameter set. A more dynamic approach is desirable since the optimal parameter set is commonly unknown ahead of time and rerunning pipelines is a time-consuming process. We have developed an approach fast enough to process interactome data on-the-fly using a sparse prefix sum index. With this index, we created Smoother, a flexible, multifeatured visualization and analysis tool that allows interactive filtering, e.g. by mapping quality, almost instant comparisons between different normalization approaches, e.g. iterative correction, and ploidy correction. Further, Smoother can overlay other sequencing data or genomic annotations, compare different samples, and perform virtual 4C analysis. Smoother permits a novel way to interact with and explore interactome data, fostering comprehensive, high-quality data analysis. Smoother is available at https://github.com/Siegel-Lab/BioSmoother under the MIT license.

This pattern also holds for the $n$-dimensional case. We now give a generalized formal definition. Let $P$ be a set of $n$-dimensional interactions. The prefix sum matrix $S$ contains, for each position, the number of interactions that are located at or before that position in each dimension. Formally, we define:

$$S_x = \left|\{\, p \in P : p_i \le x_i \ \text{for all}\ 1 \le i \le n \,\}\right|$$

where $S_x$ is the prefix sum in $S$ at position $x$ and $p$ is the position of an interaction.

The prefix sum at the upper corner of a hyperrectangle counts the points inside the hyperrectangle together with all points before it, and the prefix sum at each other corner contains a fraction of the points before the hyperrectangle. However, these fractions of points also overlap in various ways. To cancel out these overlaps, we combine the prefix sums of all corners: prefix sums from corners with an even distance to the upper corner are added, while those with an odd distance are subtracted. Here, the distance between two corners denotes the minimal number of edges that separates them. Formally, we count the number of points between $a$ and $b$ (that is, points $p$ with $a_i < p_i \le b_i$ in every dimension) by:

$$\mathrm{count}(a, b) = \sum_{c \in \{0,1\}^n} (-1)^{\,n - \sum_{i=1}^{n} c_i} \, S_{x^c}, \qquad x^c_i = \begin{cases} b_i & \text{if } c_i = 1 \\ a_i & \text{if } c_i = 0 \end{cases}$$

Supplementary Note 2 - Alternative data structures to prefix sums

Prefix sums are not the only data structure that offers fast region count queries. Other options are R-trees or range trees. The main reason to use prefix sums is that they offer the best query times out of the three options. Additionally, prefix sums lend themselves nicely to our task, while the other listed data structures have particular caveats.

R-trees recursively group nearby points into bounding boxes. Queries are performed by traversing down the tree into bounding boxes that match the search while ignoring the boxes that do not. For region count queries, one could annotate each bounding box with a counter that stores the total number of points inside; hence, a count query that fully encloses a bounding box would not need to traverse into that box.
Range trees, such as k-d trees, are binary search trees that partition data points using a different dimension of these points at each layer of the tree. Similarly to R-trees, nodes can be annotated with counters that store the total number of data points below them, removing the need to traverse past nodes that are fully enclosed by the query range.
However, with both R-trees and k-d trees, the surface of the queried region will very likely not match any of the stored bounding boxes or nodes, even on the lower layers of the tree. Hence, for all bounding boxes or nodes that overlap the surface of the queried region, it will be necessary to descend to the lowest layers of the tree and count the stored points individually. When computing a heatmap that covers the entirety of the genome, these surface-overlapping bounding boxes are expected to significantly slow down these approaches.
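The counter annotation and its limitation can be illustrated with a small k-d tree. The sketch below is only an illustration under our own assumptions, not part of Smoother: the names `Node`, `build` and `count_in_box` are hypothetical, each node stores the bounding box and the number of points of its subtree, and query bounds are inclusive. A subtree is answered from its counter only when its bounding box is fully enclosed by the query; otherwise the query has to descend further.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, ...]


@dataclass
class Node:
    point: Point              # point stored at this node
    count: int                # number of points in this subtree (the counter)
    lo: Point                 # lower corner of the subtree's bounding box
    hi: Point                 # upper corner of the subtree's bounding box
    left: Optional["Node"]
    right: Optional["Node"]


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a k-d tree, cycling through the dimensions layer by layer."""
    if not points:
        return None
    dims = len(points[0])
    axis = depth % dims
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    lo = tuple(min(p[i] for p in pts) for i in range(dims))
    hi = tuple(max(p[i] for p in pts) for i in range(dims))
    return Node(pts[mid], len(pts), lo, hi,
                build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))


def count_in_box(node: Optional[Node], a: Point, b: Point) -> int:
    """Count points p with a[i] <= p[i] <= b[i] in every dimension."""
    if node is None:
        return 0
    dims = range(len(a))
    # Subtree lies completely outside the query: nothing to count.
    if any(node.hi[i] < a[i] or node.lo[i] > b[i] for i in dims):
        return 0
    # Subtree is fully enclosed by the query: use the stored counter.
    if all(a[i] <= node.lo[i] and node.hi[i] <= b[i] for i in dims):
        return node.count
    # Subtree overlaps the query surface: descend and check individually.
    here = all(a[i] <= node.point[i] <= b[i] for i in dims)
    return int(here) + count_in_box(node.left, a, b) + count_in_box(node.right, a, b)


if __name__ == "__main__":
    tree = build([(1, 1), (2, 3), (4, 4), (2, 3), (7, 0)])
    print(count_in_box(tree, (1, 1), (4, 4)))
```

In this sketch the number of visited nodes depends on how many subtrees straddle the query boundary, which is the cost the paragraph above refers to.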
Even in an ideal case, where the bounding box or node surfaces are aligned to the queried area, a lookup in these tree-based data structures requires descending down the tree, making prefix sums the superior data structure for our purpose.

Supplementary Note 3 - Storing hyperrectangles using prefix sums

Below, we give an example for 2-dimensional multimapping interactions. Multimappers are shown as orange and green rectangles. Each rectangle is the smallest rectangle that surrounds all mapping loci of the multimapping interaction. The four outer panels show the counting operations performed for the four corners of a bin (dashed rectangle). For each corner of the bin, we count the number of rectangles that have the equivalent corner to the bottom left of the bin's corner. Here, a black cross marks the corner of the bin, while the equivalent corners of all rectangles are indicated with colored crosses.

Supplementary Figure 2. A diagrammatic representation of querying data rectangles in a prefix sum index.
This procedure counts only the rectangles that are fully enclosed by the bin, provided that no rectangle fully encloses the bin. To filter out such enclosing rectangles, we exclude all rectangles that are wider or higher than the bin. To do this, we introduce two new dimensions. Below, we give a one-dimensional example, where intervals that are fully enclosed by a bin are counted. For this, we introduce a second dimension, where intervals are stored at a position according to their width. While querying a bin, we then adjust the top face of the bin to exclude intervals that are too large.

Supplementary Figure 3. A filter dimension is used to exclude data intervals that are larger than the query interval.
Since the bottom edge of the query rectangle is always at zero for this filter dimension, the prefix sums of these edges' points must be zero and need not be queried.
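The one-dimensional example can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not Smoother's implementation: it uses dense NumPy arrays rather than a sparse index, coordinates are inclusive integers, the width of an interval is taken as end minus start, and the names `build_index`, `count_enclosed` and `count_without_filter` are hypothetical. One prefix sum array is kept for interval starts and one for interval ends, each with the width as second (filter) dimension.

```python
import numpy as np


def build_index(intervals, size):
    """Return prefix sums over (start, width) and over (end, width)."""
    starts = np.zeros((size, size), dtype=np.int64)
    ends = np.zeros((size, size), dtype=np.int64)
    for s, e in intervals:
        width = e - s
        starts[s, width] += 1
        ends[e, width] += 1
    # Cumulative sums along both axes turn the counts into 2D prefix sums.
    return starts.cumsum(axis=0).cumsum(axis=1), ends.cumsum(axis=0).cumsum(axis=1)


def count_enclosed(start_sums, end_sums, a, b):
    """Count intervals [s, e] with a <= s and e <= b."""
    max_width = b - a  # top face of the query in the filter dimension
    ended = end_sums[b, max_width]  # end <= b and width <= max_width
    # start < a and width <= max_width; the bottom face (width < 0) is empty.
    started_before = start_sums[a - 1, max_width] if a > 0 else 0
    return int(ended - started_before)


def count_without_filter(start_sums, end_sums, a, b):
    """Same lookups, but with the filter dimension left fully open."""
    all_widths = end_sums.shape[1] - 1
    started_before = start_sums[a - 1, all_widths] if a > 0 else 0
    return int(end_sums[b, all_widths] - started_before)


if __name__ == "__main__":
    # (1, 8) encloses the bin [2, 6]; (2, 3) and (4, 6) are enclosed by it.
    start_sums, end_sums = build_index([(2, 3), (1, 8), (4, 6), (0, 1)], size=10)
    print(count_enclosed(start_sums, end_sums, 2, 6))        # 2
    print(count_without_filter(start_sums, end_sums, 2, 6))  # 1, off by the enclosing interval
```

Any interval that starts before the bin and is no wider than the bin must also end before the bin's end, so the subtraction removes exactly the intervals that are not enclosed. Only two lookups are needed because the lower face of the filter dimension contributes zero.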
For counting enclosed data rectangles, we pick the same corner that is used for the query rectangle. However, it is also possible to count rectangles that overlap the queried bin by picking the opposite corner and skipping the filtering by rectangle width and height.

This pattern also holds for the $n$-dimensional case. We now give a generalized formal definition. Let $R$ be a set of $n$-dimensional hyperrectangles in $m$-dimensional space, defined by their lower and upper corners $(s_1, \dots, s_m)$ and $(e_1, \dots, e_m)$, where $n \le m$ and $m$ is the dimensionality of the dataspace (meaning data hyperrectangles are allowed to be placed in higher-dimensional space, being flat in the dimensions they do not share with the dataspace). For non-hyperrectangle dimensions, the upper and lower corner coordinates must be equal, $s_i = e_i$.

Further, we add $n$ more filter dimensions to our data space (one for each of the data hyperrectangle dimensions). These dimensions will be used to filter out data hyperrectangles larger than the query hyperrectangle (see Supplementary Figure 3). For these filter dimensions, data hyperrectangles are flat and placed at a position matching their width in the corresponding regular dimension, $e_i - s_i$.

We compute the prefix sums individually for all corners of the hyperrectangles. We denote these $2^n$ corners (and so prefix sum sets) by an $n$-tuple whose entries select the upper or lower corner in each hyperrectangle dimension.

With prefix sums, the time required to count the number of interactions in any given interval, rectangle, cuboid, or $n$-hyperrectangle is independent of the hyperrectangle size or the size of the dataset. It always requires $2^n$ lookups. E.g., for intervals, 2 lookups are required (start and end position of the interval), while our 2-dimensional contact data requires 4 lookups per bin.
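The fixed number of corner lookups for point data can be illustrated with a short Python sketch. It is written under our own assumptions and is not Smoother's implementation: it builds a dense NumPy prefix sum array instead of a sparse index, the function names `build_prefix_sums`, `lookup` and `count_points` are hypothetical, and a query counts the points that lie strictly after the lower corner and at or before the upper corner in every dimension.

```python
import itertools

import numpy as np


def build_prefix_sums(points, shape):
    """Dense n-dimensional prefix sums over integer point positions."""
    counts = np.zeros(shape, dtype=np.int64)
    for p in points:
        counts[tuple(p)] += 1
    prefix = counts
    for axis in range(counts.ndim):
        prefix = np.cumsum(prefix, axis=axis)
    return prefix


def lookup(prefix, position):
    """Prefix sum at a position; positions before the origin hold no points."""
    if any(x < 0 for x in position):
        return 0
    return int(prefix[tuple(position)])


def count_points(prefix, lower, upper):
    """Count points p with lower[i] < p[i] <= upper[i] using 2**n lookups."""
    n = prefix.ndim
    total = 0
    for choice in itertools.product((0, 1), repeat=n):
        corner = [upper[i] if choice[i] else lower[i] for i in range(n)]
        # Number of edges between this corner and the upper corner.
        distance = n - sum(choice)
        sign = 1 if distance % 2 == 0 else -1
        total += sign * lookup(prefix, corner)
    return total


if __name__ == "__main__":
    prefix = build_prefix_sums([(1, 1), (2, 3), (4, 4), (2, 3)], shape=(5, 5))
    print(count_points(prefix, (0, 0), (2, 3)))    # 3
    print(count_points(prefix, (-1, -1), (4, 4)))  # 4, the whole dataset
```

The loop always visits the same $2^n$ corners regardless of how large the queried region is or how many points were indexed; passing -1 as a lower coordinate includes position 0 in that dimension.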
We query the number of enclosed data hyperrectangles in a query hyperrectangle defined by its lower and upper corner as follows: