Training-free measures based on algorithmic probability identify high nucleosome occupancy in DNA sequences

Abstract We introduce and study a set of training-free methods of an information-theoretic and algorithmic complexity nature, and apply them to DNA sequences to assess how well they identify nucleosomal binding sites. We test the measures on well-studied genomic sequences of different sizes drawn from different sources. The measures reveal the known in vivo versus in vitro predictive discrepancies and show potential to pinpoint high and low nucleosome occupancy. We explore different possible signals within and beyond the nucleosome length and find that the complexity indices are informative of nucleosome occupancy. We found that, while the gold standard Kaplan model is clearly driven by GC content (by design) and by k-mer training, entropy and complexity-based scores are also informative for high occupancy and can complement the Kaplan model.


Shannon Entropy
Central to information theory is the concept of Shannon's Entropy, which quantifies the average number of bits needed to store or communicate a message. Entropy determines that one cannot store (and therefore communicate) a message with n different symbols in fewer than $\log_2(n)$ bits. In this sense, Entropy determines a lower limit below which no message can be further compressed, not even in principle. Another application (or interpretation) of Shannon's information theory is as a measure for quantifying the uncertainty involved in predicting the value of a random variable.
Shannon defined the Entropy H of a discrete random variable X with possible values $x_1, \ldots, x_n$ and probability distribution P(X) as:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i),$$

where if $P(x_i) = 0$ for some i, the value of the corresponding summand $0 \log_2(0)$ is taken to be 0.
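The definition can be illustrated with a minimal Python sketch that estimates H from the empirical symbol frequencies of a sequence. The function name `shannon_entropy` and the example sequences are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence: str) -> float:
    """Entropy in bits per symbol of the empirical symbol distribution."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    # Symbols that never occur are absent from `counts`, which matches the
    # convention that the summand 0 * log2(0) is taken to be 0.
    # p * log2(1 / p) is the same summand as -p * log2(p).
    return sum((c / total) * log2(total / c) for c in counts.values())


print(shannon_entropy("ACGTACGTACGT"))  # four equiprobable symbols: 2.0
print(shannon_entropy("AAAAAAAAAAAA"))  # a single symbol: 0.0
```

Note that the estimate depends only on symbol frequencies, so any reordering of the same nucleotides yields the same value.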

Entropy Rate
The function R gives what is variously called rate or block Entropy, and is Shannon Entropy over blocks or subsequences of X of length b. That is,

$$H_R(X) = \min_{b = 1, \ldots, |X|} H(X_b),$$

where $X_b$ denotes X read as a sequence of blocks of length b. If the sequence is not statistically random, then $H_R(X)$ will reach a low value for some b, and if random, then it will be maximally entropic for all block lengths b. $H_R(X)$ is computationally intractable as a function of sequence size, and typically upper bounds are realistically calculated for a fixed value of b (e.g. a window length). Notice that, as discussed in the main text, having maximal Entropy does not by any means imply algorithmic randomness (cf. 1.3).
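The dependence on block length can be seen in a small Python sketch. It assumes non-overlapping blocks and drops any trailing remainder shorter than b; both choices, as well as the name `block_entropy`, are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def block_entropy(sequence: str, b: int) -> float:
    """Shannon entropy (bits per block) over non-overlapping blocks of length b."""
    # Assumption: a trailing remainder shorter than b is discarded.
    blocks = [sequence[i:i + b] for i in range(0, len(sequence) - b + 1, b)]
    if not blocks:
        return 0.0
    counts = Counter(blocks)
    total = len(blocks)
    return sum((c / total) * log2(total / c) for c in counts.values())


seq = "ACGT" * 6
for b in (1, 2, 4):
    print(b, block_entropy(seq, b))
# b = 1 gives 2.0, b = 2 gives 1.0, b = 4 gives 0.0
```

The periodic sequence looks maximally entropic at the single-nucleotide level, but its Entropy drops to zero once the block length matches the period.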

Lossless compression algorithms
Two widely used lossless compression algorithms were employed. On the one hand, Bzip2 is a lossless compression method that uses several layers of compression techniques stacked one on top of the other, including Run-length encoding (RLE), Burrows-Wheeler transform (BWT), Move to Front (MTF) transform, and Huffman coding, among other sequential transformations. Bzip2 compresses more effectively than LZW, LZ77 and Deflate, but is considerably slower.
On the other hand, Compress is a lossless compression algorithm based on the LZW compression algorithm. Lempel-Ziv-Welch (LZW) is a lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch, and is considered universal for an infinite sliding window (in practice the sliding window is bounded by memory or choice). It is considered universal in the sense of Shannon Entropy, meaning that it approximates the Entropy rate of the source (an input in the form of a file/sequence). It is the algorithm of the widely used Unix file compression utility 'Compress', and is currently in the international public domain.
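As a rough illustration of how compressed length is used as a complexity score, the following Python sketch compares a periodic and a pseudo-random DNA-like string. It uses the standard-library `bz2` module for Bzip2. Because the LZW-based Compress utility has no standard-library counterpart, `zlib` (Deflate) is swapped in as a stand-in for a dictionary-based compressor; this substitution, the function name `compressed_sizes`, and the test sequences are assumptions of the sketch.

```python
import bz2
import random
import zlib


def compressed_sizes(sequence: str) -> dict:
    """Sizes in bytes of the raw and compressed encodings of a sequence."""
    data = sequence.encode("ascii")
    return {
        "raw": len(data),
        "bzip2": len(bz2.compress(data)),
        # Deflate stands in for LZW here; it is not the Compress algorithm.
        "deflate": len(zlib.compress(data, 9)),
    }


rng = random.Random(0)  # fixed seed so the example is reproducible
repetitive = "ACGT" * 2500
shuffled = "".join(rng.choice("ACGT") for _ in range(10_000))

print(compressed_sizes(repetitive))
print(compressed_sizes(shuffled))
```

The periodic string compresses to a small fraction of the size of the pseudo-random one, which still shrinks relative to its raw size because each nucleotide carries at most 2 bits but is stored in an 8-bit character.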

Measures of Algorithmic Complexity
A binary sequence s is said to be random if its Kolmogorov-Chaitin complexity [7, 10, 4] C(s) is at least its length. C is a measure of the computational resources needed to specify the object. Formally,

$$C(s) = \min\{|p| : T(p) = s\},$$

where p is a program that outputs s running on a universal Turing machine T. C, as a function taking s to the length of the shortest computer program that produces s, is semi-computable, and upper bound estimations are possible. The measure is today the accepted mathematical definition of randomness, among other reasons because it has been proven to be mathematically robust by virtue of the fact that several independent definitions converge to it.
The invariance theorem guarantees that complexity values will only diverge by a constant (e.g. the length of a compiler, a translation program between $T_1$ and $T_2$) and will converge at the limit. Formally, $|C(s)_{T_1} - C(s)_{T_2}| < c$.

Lossless Compression as Approximation to C
Lossless compression is traditionally the method of choice when a measure of algorithmic content related to Kolmogorov-Chaitin complexity C is needed. The Kolmogorov-Chaitin complexity of a sequence s is defined as the length of the shortest computer program p that outputs s running on a reference universal Turing machine T. While lossless compression can in principle approximate algorithmic complexity, since the compressed length of s is an upper bound on C(s) up to an additive constant, actual implementations of lossless compression (e.g. Compress) are heavily based upon Entropy rate estimations [13,14] that mostly deal with statistical repetitions or k-mers of up to a window length size L, such that k ≤ L.

Algorithmic Probability as Approximation to C
Another approach consists in making estimations by way of a related measure, Algorithmic Probability [6,9]. The Algorithmic Probability of a sequence s is the probability that s is produced by a random computer program p when running on a reference Turing machine T . Both algorithmic complexity and Algorithmic Probability rely on T , but invariance theorems for both guarantee that the choice of T is asymptotically negligible.
One way to minimise the impact of the choice of T is to average across a large set of different Turing machines, all of the same size. The chief advantage of algorithmic indices is that causal signals in a sequence may escape entropic measures if they do not produce statistical regularities. Indeed, increasing the length k in k-nucleotide models of structural properties of DNA has not returned more than a marginal advantage.
The Algorithmic Probability [10] (also known as Levin's semi-measure [8]) of a sequence s is a measure that describes the expected probability of a random program p running on a universal prefix-free Turing machine T producing s. Formally,

$$m(s) = \sum_{p : T(p) = s} 2^{-|p|}.$$

The Coding theorem beautifully connects C(s) and m(s):

$$C(s) = -\log_2 m(s) + c,$$

where c is a constant independent of s.
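As a worked illustration of the relation between m(s) and C(s), assume the usual form of the Coding theorem with an additive constant c that does not depend on s, and consider two hypothetical sequences $s_1$ and $s_2$ (the probabilities below are illustrative values, not measurements):

$$
\begin{aligned}
m(s_1) = 2^{-10} &\;\Rightarrow\; C(s_1) = -\log_2 2^{-10} + c = 10 + c \text{ bits},\\
m(s_2) = 2^{-20} &\;\Rightarrow\; C(s_2) = -\log_2 2^{-20} + c = 20 + c \text{ bits}.
\end{aligned}
$$

The sequence that is $2^{10}$ times less likely to be produced by a random program is thus assigned 10 more bits of complexity, whatever the value of c.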

Bennett's Logical Depth
Another measure of great interest is logical depth [2]. The logical depth (LD) of a sequence s is the shortest time logged by the shortest programs $p_i$ that produce s when running on a universal reference Turing machine. In other words, just as algorithmic complexity is associated with lossless compression, LD can be associated with the shortest time that a Turing machine takes to decompress the sequence s from its shortest computer description. A multiplicative invariance theorem for LD has also been proven [2]. Estimations of Algorithmic Probability and logical depth of DNA sequences were performed as described in [6,9].
Unlike algorithmic (Kolmogorov-Chaitin) complexity C, logical depth is a measure related to 'structure' rather than randomness. LD can be identified with biological complexity [3,5] and is therefore of great interest when comparing different genomic regions.

Measures Based on Algorithmic Probability and on Logical Depth
The Coding theorem method (or simply CTM) is a method [6,9] rooted in the relation between C(s) and m(s) specified by Algorithmic Probability [10,8], that is, between the frequency of production of a sequence from a random program and its Kolmogorov-Chaitin complexity as described by Algorithmic Probability. Essentially, it uses the fact that the more frequent a sequence the lower its Kolmogorov-Chaitin complexity, and sequences of lower frequency have higher Kolmogorov-Chaitin complexity. Unlike algorithms for lossless compression, the Algorithmic Probability approach not only produces estimations of C for sequences with statistical regularities, but it is deeply rooted in a computational model of Algorithmic Probability, and therefore, unlike lossless compression, has the potential to identify regularities that are not statistical (e.g. a sequence such as 1234...), that is, sequences with high Entropy or no statistical regularities but low algorithmic complexity [13,12]. Let (n, m) be the space of all n-state m-symbol Turing machines, n, m > 1, and s a sequence; then:

$$D(n, m)(s) = \frac{|\{T \in (n, m) : T \text{ produces } s\}|}{|\{T \in (n, m) : T \text{ halts}\}|}, \qquad CTM(s) = -\log_2 D(n, m)(s).$$

That is, the more frequently a sequence is produced the lower its Kolmogorov-Chaitin complexity, and vice versa. CTM is an upper bound estimation of Kolmogorov-Chaitin complexity.
From CTM, a measure of Logical Depth can also be estimated as the computing time that the shortest Turing machine (i.e. the first in the quasilexicographic order) takes to produce its output s before halting. CTM thus produces both an empirical distribution of sequences up to a certain size, and an LD estimation based on the same computational model.
Because CTM is computationally very expensive (equivalent to the Busy Beaver problem [11]), only short sequences (currently only up to length k = 12) have associated estimations of their algorithmic complexity. To approximate the complexity of genomic DNA sequences up to length k = 12, we calculated D(5, 4)(s), from which CTM(s) was approximated. To calculate the Algorithmic Probability of a DNA sequence (e.g. the sliding window of length 147 nt), we produced an empirical Algorithmic Probability distribution from (5, 4) for comparison by running a sample of 325 433 427 739 Turing machines with up to 5 states and 4 symbols (the number of nucleotides in a DNA sequence) with empty input (as required by Algorithmic Probability). The resulting distribution came from 325 378 582 327 non-unique sequences (after removal of those sequences only produced by 5 or fewer machines/programs).
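The step from an empirical output distribution to CTM values can be sketched in a few lines of Python. The counts below are invented toy numbers standing in for the number of halting machines that produce each sequence; they are not taken from the (5, 4) distribution, and the function names are likewise assumptions of this sketch. It takes CTM(s) to be the negative base-2 logarithm of the output frequency of s.

```python
from math import log2

# Hypothetical counts: number of halting machines producing each sequence.
# These are toy values chosen as powers of two, not data from the (5, 4) run.
toy_counts = {"A": 512, "AA": 256, "AC": 128, "ACGT": 128}


def output_distribution(counts: dict) -> dict:
    """Frequency of each output among all halting machines (the D distribution)."""
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def ctm(counts: dict) -> dict:
    """CTM estimate in bits: -log2 of each sequence's output frequency."""
    return {s: -log2(p) for s, p in output_distribution(counts).items()}


for seq, bits in ctm(toy_counts).items():
    print(seq, bits)
# A 1.0, AA 2.0, AC 3.0, ACGT 3.0
```

Sequences produced by more machines receive lower complexity estimates, which is the ordering the method relies on.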

Relation of BDM to Shannon Entropy and GC Content
The Block Decomposition Method (BDM) is a divide-and-conquer method that can be applied to longer sequences on which local approximations of C(s) using CTM can be combined, thereby extending the range of application of CTM. Formally,

$$BDM(s) = \sum_{(r, n) \in s_k} \left( \log_2(n) + CTM(r) \right),$$

where the set of subsequences $s_k$ is composed of the pairs (r, n), where r is an element of the decomposition of sequence s of size k, and n the multiplicity of each subsequence of length k. BDM(s) is a computable approximation from below to the algorithmic information complexity of s, C(s). BDM approximations to C improve with smaller departures (i.e. longer k-mers) from the Coding Theorem method. When k decreases in size, however, we have shown [14] that BDM approximates the Shannon Entropy of s for the chosen k-mer distribution. In this sense, BDM is a hybrid complexity measure that in the 'worst case' behaves like Shannon Entropy, and in the best approximates C. We have also shown that BDM is robust when, instead of partitioning a sequence, overlapping subsequences are used, but this latter method tends to over-fit the value of the resultant complexity of the original sequence that was broken into k-mers.

Table 2: Distance in number of nucleotides to local minimum (local maximum for LD and greatest local min/max for GC content) around a window of length 73 nts. In all cases, the same sequence was used and was assembled by flanking true nucleosomal regions with pseudo-randomly generated sequences with the same GC content as the mean of the GC content of the nucleosomal regions. Even in cases when GC content is not informative (by design) because neither the local min nor max values were found closer to the centres than 20 nts on average (and median of 22), max values of BDM were better able to pinpoint nucleosome centres in a large number of cases and with an accuracy of less than 10 nts on average (and a median of less than 7 nts). Entropy was found to be off by around 11.5 nts on average (median of 10 nts), lossless compression by more than 21 nts (median of 36), and LD (max values) by less than 7 nts on average (median 4.5 nts). Unlike Kaplan's model, BDM and LD are informative while being training-free, followed closely by entropy.
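The block decomposition described in this section can be sketched in Python as follows. The sketch assumes a non-overlapping partition into k-mers, drops any trailing remainder shorter than k, and uses a made-up CTM lookup table for 2-mers (`TOY_CTM`); real values would come from the empirical (5, 4) distribution. Each distinct block contributes its CTM value plus the base-2 logarithm of its multiplicity.

```python
from collections import Counter
from math import log2

# Hypothetical CTM values in bits for all 16 dinucleotides; not real estimates.
TOY_CTM = {a + b: (3.0 if a == b else 4.0) for a in "ACGT" for b in "ACGT"}


def bdm(sequence: str, k: int, ctm_table: dict) -> float:
    """Sum of CTM(r) + log2(n) over distinct k-mers r with multiplicity n."""
    # Assumption: non-overlapping blocks; a short trailing remainder is dropped.
    blocks = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]
    counts = Counter(blocks)
    return sum(ctm_table[r] + log2(n) for r, n in counts.items())


print(bdm("AAAAAAAA", 2, TOY_CTM))  # one block "AA" seen 4 times: 3.0 + 2.0 = 5.0
print(bdm("ACGTTGCA", 2, TOY_CTM))  # four distinct blocks seen once: 16.0
```

Repeated blocks add only a logarithmic term, so a highly repetitive sequence scores far lower than one made of many distinct blocks of the same length.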