OLOGRAM-MODL: mining enriched n-wise combinations of genomic features with Monte Carlo and dictionary learning

Abstract Most epigenetic marks, such as transcriptional regulators or histone marks, are biological objects known to work together in n-wise complexes. A suitable way to infer such functional associations is to study the overlaps of the corresponding genomic regions. However, the statistical significance of n-wise overlaps of genomic features is seldom tackled, which prevents rigorous study of n-wise interactions. We introduce OLOGRAM-MODL, which considers overlaps between n ≥ 2 sets of genomic regions and computes their statistical mutual enrichment by Monte Carlo fitting of a Negative Binomial distribution, resulting in better-resolved P-values. An optional machine learning method is proposed to find complexes of interest, using a new itemset mining algorithm based on dictionary learning, which is resistant to the noise inherent in biological assays. The overall approach is implemented through an easy-to-use command-line interface for workflow integration, and a visual tree-based representation of the results suited for interpretability. The viability of the method is experimentally studied using both artificial and biological data. This approach is accessible through the command line interface of the pygtftk toolkit, available on Bioconda and from https://github.com/dputhier/pygtftk.

Comparison of the functionalities of OLOGRAM-MODL with existing tools. "Multiplicity only" means the individual members of a combination are not detailed. "Model-dependent" means the model makes questionable statistical assumptions. For MODL's filtering criterion, "Customizable" means the loss can be customized, of which we give an example in this study through a supervised loss. ChromHMM has a slightly different principle, in that it is not designed to give enrichments but to partition the genome by combination.

C is [Query + other third + third + third bis + ...]. Even with multiple overlaps, the S statistic still follows Negative Binomial distributions. Longer combinations (such as C here) will be more rarely observed. This supports fitting a Negative Binomial model, as otherwise determining the significance of larger fold changes would require many shuffles (a minimal code sketch of such a fit is given after these captions). Case C shows an example with a more difficult fitting, as there are few expected overlaps. This leads to a lower-precision estimate for the underlying Negative Binomial, but such a low average does not mean much biologically regardless.

A and B are the artificial coarse datasets (called coarse as they contain a large proportion of the query), while "Regular negative control" refers to the negative control in the regular artificial dataset (Supplementary Note S-IV). 1000 shuffles were used in both. The statistic considered is always S(γ), the number of base pairs. S_obs denotes the true value in the data, while S_exp is the value observed in the shuffles. Our p-values differ somewhat, since the expected mean and variance differ due to different shuffling models, but only slightly. In a more typical use case however, with fewer and larger regions (regular negative control), the p-values match. Importantly, when calculating a p-value using a Negative Binomial and MULTOVL's statistics, the result matches MULTOVL's p-values for easy cases (p > 1/n_s, where n_s is the number of shuffles), which strongly supports our Negative Binomial model. MULTOVL only gives scores by multiplicity (the number of sets overlapping at each position), unlike OLOGRAM which gives the IDs of the sets, which is why we limit ourselves to 2 sets in order to be able to extract this information.

Supplementary Table 3. Comparison of itemset miners. Comparison of the itemsets found by MODL, by exact itemset miners (apriori, FP-Growth and LCM) and by CL-MAX (an approximate itemset miner), on a large artificial matrix. The itemsets used when generating the matrix are indicated in the "True" column. A uniform noise of 12% was also applied. FP-Growth and apriori have the same results, as expected. LCM will also return the supports for all the 2^6 = 64 combinations, since there is both additive and subtractive noise: an A+B+C observation can become A+B (negative noise) or A+B+C+D (positive noise), and no itemset is exactly closed compared to another. As they are exhaustive algorithms, their results are the same. For MODL, we queried 3 atoms, as this corresponds to the number of complexes used when generating the data. The first three atoms returned by MODL correspond to the itemsets defined when generating the data, with the remaining candidates corresponding to noise patterns or compromises. When ordering by support in apriori however, the rank of the true itemsets is much lower. MODL still gives the correct results with uniform noise from 0 to 12% (12% is shown here).
Furthermore, as expected, the results of an approximate itemset miner such as CL-MAX (based on clustering) are closer to those of MODL: we give CL-MAX's results depending on the discretization threshold (note that d = 0.9 was the default parameter) and on how many clusters are requested (the number between brackets), and add an asterisk (i.e. "[6*]") for the itemsets returned instead after using their second step (the MAFIA algorithm). However, clustering-based algorithms such as CL-MAX are more vulnerable to a poor choice of hyperparameters, resulting in learning compromises or erroneous itemsets. Results given by CL-MAX also vary heavily depending on the random seed. MODL is less vulnerable to this: since the selection in step 2 is based on rebuilding quality, adding irrelevant candidates does not change the first selections.

Table 2). MODL is run on the matrix of intersections described in Figure 2 (one line per intersection, not per base pair). Our selection is made differently, though. We give the ranking by order of selection in MODL for the combinations. When increasing the number of requested combinations, we get some similarities. Differences can be chalked up to different selection criteria, as we discuss in the main text. Unsupervised MODL is trying to rebuild the original data matrix, not to find the combinations that best explain the query. For example, while MODL does not select the {4,5} combination, it does select the {2,4,5} one, which represents 91% of the true base pairs of the {4,5} combination. Our analysis took about a minute, less than the 20 min advertised for GINOM.
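As background for the Negative Binomial fitting discussed in the captions above, the following is a minimal sketch (not the exact pygtftk code) of how a Negative Binomial can be fitted to the shuffled statistic S by the method of moments, and how a one-sided p-value for the observed value S_obs is then derived; variable names are illustrative.

```python
import numpy as np
from scipy.stats import nbinom

def neg_binom_pvalue(s_shuffles, s_obs):
    """Fit a Negative Binomial to the shuffled values of S by the method of
    moments and return P(S >= s_obs) under the fitted model (enrichment)."""
    m, v = np.mean(s_shuffles), np.var(s_shuffles)
    if v <= m:
        raise ValueError("Empirical variance must exceed the mean for a NB fit")
    p = m / v                    # success probability
    n = m * m / (v - m)          # dispersion (number of successes)
    return nbinom.sf(s_obs - 1, n, p)   # survival function: P(S >= s_obs)

# With, e.g., 1000 shuffled values of S(γ), such a fitted tail can resolve
# enrichments beyond the empirical limit of p > 1/n_s.
```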

To perform a relevant comparison, we also ran a supervised version of MODL using a custom error function. Here, the error to be minimized is equal to −F, where F is the F1-score of a Gaussian Naive Bayes classifier trying to predict the presence of the query region. At each step, features are added to the data given to the classifier by selecting the new feature that best reduces the error. The presence or absence of each combination is one-hot encoded as a variable (1 means presence). We used subsampling to get a 30:1 ratio between classes "0" and "1" (0 means the query is absent, 1 that it is present), and a weight of 30 for class "1". Note that the subsampling can introduce variability in the results. Although some assumptions of Naive Bayes are not satisfied (namely, the features are not independent), this is not a problem as long as the correct class remains more likely than the incorrect one (1). Here, {4} is selected first even though it is not the most abundant, and GINOM agrees. The rest of the selection has a stronger emphasis on set 5 and on combinations containing it, as well as on combinations containing set 2. To further explain the differences, keep in mind that Naive Bayes classifiers are comparatively not very efficient (even the final classifier with many features still returns many false positives), that the most discriminant features may not be the most abundant (in contrast to unsupervised MODL), and that the most enriched features (as per OLOGRAM and GINOM) may not turn out to be the most useful predictors (as shown by supervised MODL). Indeed, {2,4} being selected before {4,5} makes sense, since a higher proportion of occurrences of {2,4} overlap with the query. Results could be improved with further parameter tuning, but the principle is the same. In this analysis, GINOM's selection seems to be equivalent to selecting the lowest OLOGRAM p-values, but mysteriously stopping at order 2. As such, it does not highlight full complexes like {2,4,5}. A code sketch of this greedy supervised selection is given after the benchmark notes below.

Using a single thread on a 2.6 GHz CPU. Apriori "pure Python" is a Python implementation; the other apriori and FP-growth are the mlxtend implementations; DL is the scikit-learn Python implementation. The left figures give the absolute time, and the right figures give the times relative to the smallest for this algorithm (to show asymptotic scaling and erase the overhead multiplier due to differences of implementation). (A) A scaling factor of F means DL was instructed to learn F atoms, and apriori and FP-growth had a minimum support of 1/F. The matrix had 5,000 rows and 13 sets. α for the DL was 0.5. 50% noise was used. (B) Scaling with the number of sets (columns) and a matrix of 5,000 rows. Minimum support for apriori was 0.01, α for DL was 0. DL was trying to learn 120 atoms. Standard noise is 50%, low noise is 1%. (C) Scaling with the number of lines (in thousands). Noise was 50%, minimum support for apriori and FP-growth was 0.01. The matrix had 15 columns. DL had an α of 0 and learned 120 atoms. DL scales linearly with the number of sets, while apriori scales exponentially and FP-growth scales quadratically (2). Apriori and FP-growth will scale more than linearly with the minimum support, until they reach a plateau where the minimum support is so low that all 2^k combinations are considered. On the other hand, DL scales linearly with the number of queried atoms to be learned.
Some points on the curves are missing, because the implementations used have an impractical RAM or time cost for these parameters; we stopped before reaching those points. The most peculiar behavior of DL is that, depending on the entropy of the data, it can scale differently. If early convergence cannot be reached, the time cost may be higher for lower-sized data (seen in panel B). But within each sub-slope (early convergence or not) the scaling is linear. DL cost increases faster once early convergence is no longer possible (seen in panel A). When using less noisy data (so the data has less entropy, meaning less information; seen in panel B), DL converges faster and takes less time. Some scalings are not displayed here. K-means clustering (used in algorithms such as CL-MAX, with Lloyd's algorithm) scales mostly linearly with data size, number of sets and number of queried clusters, but it can be superpolynomial in the worst case (3): this is equivalent to or worse than DL's scaling. GINOM's scaling involves stepwise selection: it is conceptually similar to MODL's step 2 (evaluate all possible additions and carry the best into the next step, see ref. 4), but MODL's advantage is that it does not consider all 2^k candidates, thanks to the selection via matrix factorization in step 1.
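A minimal sketch of the greedy supervised selection described above, not the exact pygtftk implementation: the combinations are assumed to be already one-hot encoded in a matrix X, y marks the presence of the query, the 30:1 subsampling step is omitted for brevity, the class weight of 30 is applied through sample weights, and the F1-score is evaluated on the training data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

def greedy_supervised_selection(X, y, n_features, class_weight=30.0):
    """Greedy forward selection: at each step, add the one-hot combination
    column that most improves the F1-score (i.e. most reduces the error -F)
    of a Gaussian Naive Bayes classifier predicting the query (y = 1)."""
    sample_weight = np.where(y == 1, class_weight, 1.0)
    selected = []
    for _ in range(n_features):
        best_col, best_f1 = None, -1.0
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            clf = GaussianNB().fit(X[:, cols], y, sample_weight=sample_weight)
            f1 = f1_score(y, clf.predict(X[:, cols]))
            if f1 > best_f1:
                best_col, best_f1 = j, f1
        selected.append(best_col)
    return selected
```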

Supplementary Figure 5. Tree of combinations for the FOXA1 analysis in MCF7, restricted to combinations selected by MODL. MODL selects the atoms that best help rebuild the matrix of original overlaps. Rarer but more enriched factors, such as EP300, are found together with their usual correlators rather than alone. MAX and MYC, which are more frequent, appear more often here. We can see that MODL selects relatively closed itemsets.

Supplementary Figure 6. Enrichment of sc-ATAC-seq cell combinations by entropy. We randomly selected 75 cells from three different superclusters, based on the classification in the Signac vignette: a "CD14" supercluster, a "CD4 and CD8" supercluster, and a "pre-lymphocyte B" supercluster. The shuffles were not restricted, and MODL is not used here either. We then consider the enrichment of each possible combination γ depending on its heterogeneity (Shannon entropy H(γ)). For example, consider the combination A+B+C where all three are sc-ATAC-seq sites from CD4 cells: it is homogeneous and has an entropy of 0. However, if D and E are sites for, respectively, a CD14 and a pre-B cell (different superclusters), the combination A+D+E has a high entropy (a short code sketch of this entropy computation is given after these captions). We expect heterogeneous combinations (with higher entropy) to be less enriched, since the open genomic sites of cells from different superclusters should be less similar than those of cells from the same supercluster. As usual, we compare one order at a time: (A) 2-wise combinations with 2-wise combinations, and (B) 4-wise with 4-wise. Homogeneous combinations indeed have higher enrichment (about 1.5 to 2x higher). However, this difference is small, as there are very high individual variations between the cells even within the same supercluster. This is confirmed by simply looking at the true S(γ). For higher orders, we instead focus on comparing H(γ) = 0 against H(γ) > 0, since in the latter case different supercluster compositions may have the same entropy, which would complicate the interpretation, whereas only pure combinations have zero entropy. This shows that similar cells have similar patterns of chromatin openness, although individual variations are also an important contributing factor.

Recall that α and β are the shape parameters, and that a and c are the scaling parameters giving respectively the minimum and maximum of the possible range of values. The parameters are then estimated from the samples themselves using method-of-moments fitting. It can be seen that at 200 data points (corresponding to the usual number of shuffles in OLOGRAM), the estimates of the parameters are very imprecise. This is considerably worse than simply fitting a Negative Binomial: in subfigure (E), we compare the theoretical variance against the empirical variance. The "theoretical variance" is the expected variance of the samples, calculated using the estimates of the model parameters obtained through the method-of-moments fitting. The empirical variance is simply the variance of the generated samples. We can see that the Beta fitting results in a poorer estimation of the true variance. By extension, we deduce that, for a set of samples that imperfectly resembles a Negative Binomial distribution, the Beta-fitted tails (and by extension the p-values) would be estimated with even larger errors than when merely using a Negative Binomial fitted from the empirical variance.
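To illustrate the heterogeneity measure used in Supplementary Figure 6 above, here is a small sketch of the Shannon entropy H(γ) of a combination's supercluster composition. Log base 2 is used here; the base only rescales the values and is an assumption of this sketch.

```python
from collections import Counter
from math import log2

def combination_entropy(superclusters):
    """Shannon entropy H(γ) of the supercluster labels of the cells whose
    sc-ATAC-seq sites form the combination γ; 0 means a pure combination."""
    counts = Counter(superclusters)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

combination_entropy(["CD4/CD8", "CD4/CD8", "CD4/CD8"])  # homogeneous: 0.0
combination_entropy(["CD4/CD8", "CD14", "pre-B"])       # heterogeneous: ~1.58
```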

SUPPLEMENTARY NOTES
S-I Pre-processing of matrices before MODL
Definition 1. The abundance of a row r in a given matrix X is the number of rows of X which are exactly equal to r. It is denoted a_X(r).
Definition 2. Let s(X) be the smallest abundance in X among the abundances of all the unique rows of X. The smothered version of the matrix X is the matrix ψ(X) such that, for each unique row x of X, a_ψ(X)(x) = √(a_X(x) / ν), where ν is the larger of s(X) and the abundance threshold τ. Row order is unimportant.
After smothering, X is reduced to one elementary repetition of itself to save computing time. The default abundance threshold is τ = 1e−4; combinations rarer than this are ignored. The abundance threshold τ helps reduce the time cost of MODL, analogous to the minimum support used in other itemset miners to limit the exponential complexity of the problem.
The use of the square root of the abundances gives more weight to the rarest combinations, instead of simply focusing on an even better reconstruction of frequent combinations. It is the smothered version of X that is passed to the MODL algorithm, not X itself.
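A small sketch of the smothering described in this note, assuming X is a binary NumPy matrix with one row per intersection. The interpretation of τ as a fraction of the number of rows and the rounding of the repeat counts are illustrative assumptions, not the exact pygtftk behavior.

```python
import numpy as np

def smother(X, tau=1e-4):
    """Return ψ(X): each unique row x of X repeated about sqrt(a_X(x) / ν)
    times, where ν = max(smallest abundance, abundance threshold)."""
    rows, counts = np.unique(X, axis=0, return_counts=True)
    threshold = tau * X.shape[0]        # assumption: τ is relative to the matrix size
    keep = counts >= threshold          # combinations rarer than τ are ignored
    rows, counts = rows[keep], counts[keep]
    nu = max(counts.min(), threshold)
    reps = np.maximum(1, np.rint(np.sqrt(counts / nu))).astype(int)
    return np.repeat(rows, reps, axis=0)
```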

S-II Additional information on MODL's library creation step
The reconstructions are repeated on a 3-fold cross-sampling (rotating 2/3 of the data) to increase result variety, as random initialization can result in different reconstructions. Coordinate descent with LASSO is used for the fitting, with 200 iterations; negative atoms are disallowed to permit later interpretation. The more widespread Least-Angle Regression (LARS) fitting algorithm is known to select wrong features when the features are correlated (5), and as such it is only used as a fallback if LASSO fails to converge.
A higher number of atoms in the learned dictionary would result in more precise reconstructions, but lessens the need to learn itemsets instead of individual components or simply unique rows (i.e. words). Conversely, fewer atoms will result in learning compromise atoms. Although this is the point of the approach and grants resistance to noise (both false positives and false negatives), too much compromise can result in learning only potential correlation groups, such as using a single atom (111) to reconstruct the words (100) and (011). As a trade-off between variety and the effects just mentioned, the number of atoms in the learned dictionary is by default 2 × q.
As values that are too low can hinder convergence, the sparsity constraint α used in those reconstructions begins at 1/k, where k is the number of sets (i.e. the number of columns of X). Using increasing steps avoids lingering at high α, where results can be redundant. This can however result in the algorithm spending more time at high α, which will produce longer words and compromise words (as discussed above), which will then have a higher total usage (see below). As such, the sparsity constraints used in this step can be overridden as a parameter in the API, if one wishes to produce more, shorter words by spending more steps at low α.
After each factorization, each atom v is binarized into a discretized vector Discr(v) and saved along with its total usage (the multiplicative coefficient associated with it in the reconstructions; see section 2.2.1 for the precise definition). For each value v_i in the atom v (v is a vector), the corresponding value in the discretized vector is Discr(v)_i = 1 if v_i > d, and 0 otherwise. Here, d is called the discretization threshold, with a default value of 0. The use of custom discretization thresholds, as well as the manual α selection discussed above, would be key to stopping MODL from learning too many "compromise words" and to learning words that are closer to classical association rules, which can then be sorted in step 2.
Finally, once all reconstructions are done, a library Λ of candidate atoms is obtained. To save time, only the top 3q atoms, ordered by their summed usage across all reconstructions, are kept before passing Λ to the next step. This also discards leftover atoms with low usage. An optional filtering by atom length is possible.
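The library-creation step above can be sketched with scikit-learn's DictionaryLearning as follows. The α schedule, the omission of the 3-fold cross-sampling, and the way usage is summed are simplifications of this note, not the exact pygtftk implementation.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def build_candidate_library(X, q, d=0.0, n_alphas=5):
    """MODL step 1 sketch: repeated dictionary learning at increasing sparsity
    α, binarization of the atoms at threshold d, and ranking of the candidate
    atoms by their summed usage across all reconstructions."""
    k = X.shape[1]
    library = {}                                   # Discr(v) -> summed usage
    for alpha in np.geomspace(1.0 / k, 1.0, n_alphas):   # increasing steps from 1/k
        dl = DictionaryLearning(
            n_components=2 * q,                    # 2q atoms per reconstruction
            alpha=alpha,
            fit_algorithm="cd",                    # coordinate descent (LASSO)
            transform_algorithm="lasso_cd",
            positive_dict=True,                    # negative atoms are disallowed
            positive_code=True,
            max_iter=200,
        )
        code = dl.fit_transform(X)                 # usage coefficients per atom
        for atom, usage in zip(dl.components_, code.sum(axis=0)):
            key = tuple((atom > d).astype(int))    # Discr(v): binarize at threshold d
            library[key] = library.get(key, 0.0) + usage
    # keep only the top 3q candidates by summed usage
    top = sorted(library.items(), key=lambda kv: -kv[1])[: 3 * q]
    return [np.array(atom) for atom, _ in top]
```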

S-III Additional information on MODL's atom selection step
The sparsity controller α tends to emphasize the longest atoms: this impact is reduced by projecting each λ ∈ S onto the surface of the unit ball (∥λ∥₂ = 1). To ensure no two atoms have the exact same dot product with a given word, a small jitter of √i / 10⁴ is added to each value of the i-th atom. S is sorted in lexicographic order. As the set of atoms usually has some degree of degeneracy (similar atoms), the sparse coder used is LASSO-LARS: Coordinate Descent does not handle degeneracy well, whereas LARS tends to drop correlated regressors, which is a strength here. In any case, the process does not compare the usage of each atom, only the quality of the reconstruction.

The sparsity-controlling parameter α, applied both to the coder's LASSO and to the d_1 error, is nonzero in order to encourage adding (11) to the dictionary even if (01) and (10) are already present, as that would bring an improvement by using only one atom. Manually computing a Manhattan error in the evaluation, instead of a Euclidean one, penalizes rebuilding both (01) and (10) as (½ ½). A relatively high α = 1/√k (capped at 0.5) is used by default, but can be changed. This helps convergence and emphasizes using as few words as possible to get closed itemsets, instead of focusing on improving the rebuilding of frequent combinations.
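A sketch of the greedy selection just described, using scikit-learn's sparse_encode with the LASSO-LARS coder and a Manhattan reconstruction error. The exact loss (in particular how α enters the d_1 term) and the jitter constant follow this note only approximately and are assumptions of the sketch.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def greedy_atom_selection(X, candidates, q, alpha):
    """MODL step 2 sketch: greedily add, at each step, the candidate atom
    whose addition to the dictionary best lowers the Manhattan rebuilding
    error of X (plus an α penalty on the amount of code used)."""
    # project each candidate onto the unit L2 ball and add a small jitter
    cands = [c / np.linalg.norm(c) + np.sqrt(i) / 1e4
             for i, c in enumerate(candidates, start=1)]

    def loss(atom_indices):
        dictionary = np.array([cands[j] for j in atom_indices])
        code = sparse_encode(X, dictionary, algorithm="lasso_lars", alpha=alpha)
        rebuilt = code @ dictionary
        return np.abs(X - rebuilt).sum() + alpha * np.abs(code).sum()

    selected, remaining = [], list(range(len(cands)))
    for _ in range(q):
        errors = [loss(selected + [j]) for j in remaining]
        best = remaining[int(np.argmin(errors))]   # ties: the first atom is selected
        selected.append(best)
        remaining.remove(best)
    return [candidates[j] for j in selected]
```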
In case of a tie, the first atom is selected. This is a greedy algorithm, in that it makes the locally optimal choice at each iteration. The resulting solution is near-optimal if the optimization problem is the maximization of a submodular function (diminishing returns when adding new elements). A set function f is submodular if, for all sets S ⊆ S′ and any element λ ∉ S′: f(S ∪ {λ}) − f(S) ≥ f(S′ ∪ {λ}) − f(S′). If f is also monotone, the greedy algorithm used to maximize f will produce a solution within a factor of 1 − 1/e of the optimum, which is the best possible approximation (6). Proposition 1. The problem of finding S* = arg max_S f(S) admits a good submodular approximation.
Proof sketch. Consider the gain made by adding λ to a set S, and let S′ be a set of atoms such that S ⊆ S′. Assuming the sparse encoder optimizes correctly, the encoding cannot be made worse by adding more atoms to choose from in the dictionary: if useless, they will simply be ignored. So f(S) ≤ f(S′), and f is monotone.
Furthermore, adding redundant atoms does not improve the reconstruction further, resulting in diminishing returns: for each row (i.e. word) x ∈ X to be rebuilt, the solution found is a linear combination of the atoms in the dictionary. Adding a new atom λ to this dictionary will only improve the reconstruction depending on the coefficient given to the new atom, which in turn depends on how much error the previous dictionary made on this word x that the new atom can correct. Indeed, the improvement that a candidate λ can bring is bounded by the difference between the projections of x on the subspaces defined by V_t and by V_t ∪ {λ} (7). While this approximate submodularity of f is valid only up to a margin dependent on the incoherence of the matrix of candidates, in practice even coherent matrices result in good approximations (7,8). This effect is reinforced by α, which penalizes the use of too many atoms to rebuild x.
Perspectives
In order to improve the scaling with the number of queried atoms, it would be possible not to always test all candidates in this step, but only the children combinations (see definition in section 2.1.1) of the already selected combinations/atoms, assuming that if an atom A is better than an atom B, then its best parent is better than the best parent of B, and proceeding until an optimum is reached.
If the problem is not submodular, there is a potential alternative (but costly) method. After adding the best word, the least useful word could be removed at each iteration. The algorithm would then continue until the maximum number of steps is reached, or until the word being added and the word being removed are the same ("convergence"). This would be analogous to gradient descent.

S-IV Data
Artificial and comparison datasets and tools
These datasets are used to demonstrate and quantify the behavior of our algorithms.
1. Noisy intersection matrices: an artificial overlap matrix whose unique rows are representations of A+B, A+B+C+D, or E+F. This equates to the row vectors (110000), (111100), and (000011), respectively. A NOT is then applied to each element (turning a 0 into a 1, and a 1 into a 0) with a probability p_N to represent noise. We use a matrix with a large number of rows to perform a large-scale experiment and ensure the results are not due to random chance. The itemsets selected represent biologically interesting cases, with a diversity of complexes, two of which overlap. A generation sketch is given after this list.
2. Artificial BED files of regions: a query set of 1,000 artificial genomic regions of length 200,000 bp was generated in the hg38 genome and compared against the following sets: (A) a random third of the regions of the query set, (B) a copy of this A set, (C) a different third of the query set that does not overlap with the selection in A, and (D) a negative control made of other random peaks. This is termed the regular artificial data in the following text. We also prepare a different, coarse artificial data set with a query set of 10,000 regions of 250 bp each, in an artificial genome of one chromosome of 1 Mbp. The A set takes the first 21% of the query, and B takes the last 26% of the query (so we would expect A and B to overlap less than by chance). It is called coarse because the subsets contain a large proportion of the query.
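A sketch of how the noisy intersection matrix of item 1 can be generated; the row count, noise level and function name are illustrative, the true rows and the element-wise NOT follow the description above.

```python
import numpy as np

def make_noisy_matrix(n_rows=20000, p_noise=0.12, seed=0):
    """Artificial overlap matrix whose true rows encode A+B, A+B+C+D and E+F,
    i.e. (110000), (111100) and (000011), with each element flipped (NOT)
    with probability p_noise."""
    rng = np.random.default_rng(seed)
    true_rows = np.array([[1, 1, 0, 0, 0, 0],
                          [1, 1, 1, 1, 0, 0],
                          [0, 0, 0, 0, 1, 1]])
    X = true_rows[rng.integers(0, len(true_rows), size=n_rows)]
    flips = rng.random(X.shape) < p_noise     # uniform additive/subtractive noise
    return np.where(flips, 1 - X, X)
```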