Predicting chromosomal compartments directly from the nucleotide sequence with DNA-DDA

Abstract Three-dimensional (3D) genome architecture is characterized by multi-scale patterns and plays an essential role in gene regulation. Chromatin conformation capturing experiments have revealed many properties underlying 3D genome architecture, such as the compartmentalization of chromatin based on transcriptional states. However, they are complex, costly and time consuming, and therefore only a limited number of cell types have been examined using these techniques. Increasing effort is being directed towards deriving computational methods that can predict chromatin conformation and associated structures. Here we present DNA-delay differential analysis (DDA), a purely sequence-based method based on chaos theory to predict genome-wide A and B compartments. We show that DNA-DDA models derived from a 20 Mb sequence are sufficient to predict genome wide compartmentalization at the scale of 100 kb in four different cell types. Although this is a proof-of-concept study, our method shows promise in elucidating the mechanisms responsible for genome folding as well as modeling the impact of genetic variation on 3D genome architecture and the processes regulated thereby.


INTRODUCTION
Three-dimensional (3D) genome architecture allows linearly distal genomic loci to interact with one another, thereby impacting genome function. Chromosome conformation capturing techniques, in particular high-throughput chromosome conformation capture (Hi-C) [1,2], have enabled us to systematically catalog genomic interactions and features of 3D genome architecture in various cell types.
Hi-C data are typically summarized in a contact map, a matrix that estimates the probability of interaction between any two loci in the genome. Such maps are characterized by a plaid pattern ref lecting enrichment or depletion of Hi-C interactions. This was observed already by early Hi-C studies, which proposed to segregate the loci into two sets of compartments, and arbitrarily termed them "A" and "B" [1]. Loci in A compartments preferentially interact with other loci in A compartments, while loci in B compartments tend to interact only with other loci in B compartments. Additionally, loci in A compartments are associated with transcriptionally active euchromatin, are gene-rich, and are centrally located in the nucleus [3]. In contrast, loci in B compartments are in transcriptionally inactive heterochromatin, and tend to be gene-poor and occupy the nuclear periphery [3]. A and B compartments have been shown to be associated with distinct histone acetylation and methylation patterns that ref lect their transcriptional activity, and more refined subcompartmentalizations have been suggested on the basis of the observed chromatin states [2][3][4][5]. Compartmentalization has been found to be evolutionary conserved across species [6][7][8]. Nevertheless, it can differ substantially between cell types [3][4][5] and sequence variation between individuals has also been shown to underlie changes in 3D genome architecture, in many cases with pathological consequences [9][10][11].
Hi-C assays have revolutionized our understanding of 3D genome architecture, but they are expensive, time-consuming and expertise-demanding. Therefore, Hi-C data are only available for a limited number of human cells, and substantial effort has been put into deriving predictive computational models. Here we present DNA-DDA, a computational method that is based on the principles of chaos, ergodic and embedding theory to predict A/B compartments from the DNA sequence alone.
Chaos is widespread in biological systems [12][13][14]. It manifests itself as the seemingly random behavior of a deterministic process which is hypersensitive to f luctuations in initial conditions [15]. A deterministic dynamical system can be described by its current state (i.e. the system variables' current values) and a system of differential equations (i.e. rules) that govern the evolution of the system (i.e. the sequence of states it passes through) [16,17]. The set of all possible states, i.e. the solutions to the system of differential equations for every possible set of initial conditions, is known as the system's state space [18,19]. A trajectory in the state space is a sequence of states resulting from a particular set of initial conditions. For most chaotic systems, there exists an "attractor" [20,21], i.e. a point or a set of points towards which trajectories from almost any set of initial conditions will approach, and that represents the long-term behavior-the dynamics-of the underlying system. Two initially infinitesimally close trajectories of such a system diverge exponentially and yet, bounded by their attractors, they will be similar in a topological sense; i.e. they can be deformed into each other continuously by stretching and folding [22]. Due to this counterintuitive property of "deterministic chaos," the global structure of the state space can be investigated temporally, e.g. by studying the rate at which neighboring trajectories with similar initial conditions diverge as the system evolves [23], or spatially, e.g. by determining a trajectory's fractal dimension [24]. A bridge between these two perspectives is ergodic theory [25][26][27], which studies the statistical properties of a dynamical system. Trajectories of an ergodic system will eventually cover the entire state space so that under certain conditions, the time average of a function along a trajectory is related to the space average for almost all initial conditions [18].
The analysis of DNA sequences in the context of nonlinear dynamics (ergodic theory [28,29] and chaos theory [14]), information theory [30,31] and time series analysis (signal processing, spectral analysis [32][33][34]) has a long standing history and their concepts and ideas are interconnected [35]. These methods aim to gain insight into the macroscopic behavior of genomic control systems without having access to their innumerable variables and governing equations (states and rules). Variables underlying 3D genome architecture include, for example, the DNA sequence, histone modifications [36], DNA methylation [37] or the interaction of the DNA polymer with the surrounding environment [38]. Inspired by many of the aforementioned concepts and ideas, we adapted a nonlinear time series classification framework, delay differential analysis (DDA) [39], for the prediction of A/B compartments from the DNA sequence.
The fundament of DDA is given by Takens embedding theorem, which states that under certain conditions, the measurement of a single variable of a high dimensional dynamical system providing good or global observability of the system, is sufficient to reconstruct the system's state space [40][41][42]. DDA relates delay and derivative embeddings of a measured variable in a (nonlinear) functional form, and uses the fitting coefficients as classifying features. A particular f lavor of DDA, dynamical ergodicity-DDA (DE-DDA) [43] is used for assessing dynamical similarity. We hypothesize that the DNA sequence and the interaction frequencies between genomic loci obtained from a Hi-C assay, are variables which are highly observable of 3D genome architecture, and that sequences in close proximity in 3D space will share certain dynamical properties. In accordance with the ergodic hypothesis [25], we compare the ensemble and time averages of dynamical information inherent in the numerical representation of the DNA sequences as described by DE-DDA, to infer their proximity in 3D space and, in turn, to predict A/B compartments.
A Hi-C map can be understood as a 2D projection of the ndimensional state space of the system, i.e. a recurrence plot [44]. A recurrence plot is a method from nonlinear data analysis obtained by recording the instances t, when a trajectory visits the immediate proximity of a state it has visited in the past. Analogously, a Hi-C map visualizes how often the genomic locus at position t is involved in interactions with a subsequent locus in the DNA sequence (i.e. how often it is revisited). The patterns which arise in Hi-C maps are a hallmark for an underlying chaotic process, and DNA-DDA sifts out its dynamical signatures by mapping the DNA sequence onto an embedding space. DDA has been applied most extensively in epilepsy research [45][46][47] and our study confirms its potential to be extended to the field of genomics.

Pre-processing of Hi-C data sequencing data
Raw FASTQ files from Hi-C experiments involving four cell lines were downloaded from the Gene Expression Omnibus database (Table 1; [2,48,51]) and mapped to the human reference genome (GRCh38/hg38) using bowtie 2 (v.2.4.1; [53]) with optionsreorder and -very-sensitive-local. The deepest sequenced data set which we considered was the "primary" GM12878 Hi-C data set comprising 3.6 billion sequence reads followed by the the K526 Hi-C data set with a library of 1.4 billion sequence reads. The older hESC and IMR90 data sets comprised 0.3 and 0.4 billion reads respectively.

Hi-C contact maps
The contact map of each autosome was generated from the mapped reads using the HiCExplorer pipeline (v.3.7.2; [54][55][56]). "hicBuildMatrix" was called with parameters "-binSize 5000 -minMappingQuality 10 -restrictionSequence RS -danglingSequence DS," where DS and RS are the restriction and dangling sequences listed in Table 1. The resulting Hi-C matrix was balanced with the algorithm introduced by Knight and Ruiz [57] using "hicCorrectMatrix correct"; the "-filterThreshold" parameter was chosen based on the histogram produced by "hicCorrectMatrix diagnostic_plot" (Supplementary Table S1). From this matrix, a contact map at the resolution of 100 kb was derived using "hicMergeMatrixBins" with parameter "-numBins 20." For comparability, the contact map obtained in this manner was scaled to the 0 to 1 range with "hicNormalize -normalize norm_range". An entry in the resulting matrix H = (h i,j ) represents the contact frequency between genomic bins i and j. Bins enclosing contromere locations (obtained from the UCSC table browser [50]) as well as low coverage bins (below 10% of overall contact probability) were excluded from analyses (Supplementary Tables  S5 -S8 ).

Delay differential analysis in general
Let x(t) be a dynamical sequence of length L where t represents increments in time or space. A nonlinear DDA model has the following general functional form ( [43] and citations therein): whereẋ =ẋ(t) is the derivative of the original time (or space) series x = x(t) and x(t − τ ) is the value of the series shifted by τ steps ()x τn = x(t − τ n ). The model parameters are: K, the number of monomials; N, the number of delay embeddings x τn contained in each monomial; m n,k ∈ N 0 , the order of nonlinearity of the nth delay embedding in the kth monomial. Finally a 1 , a 2 , a 3 are the fitting coefficients, and ρ is the least-square error of the model: A full list of all possible three term DDA models up to cubic nonlinearity and two delay pairs (K ∈ {1, 2, 3}, m ∈ {1, 2, 3}, n ∈ {1, 2}) can be found in Supplementary Table S4. For a dynamical sequence x(t), Equation 1 can be written as the over-determined system of equationṡ where each element in˙ x is the center derivative at each time/ ST and CT DDA features have recently been combined by Lainscsek et. al [43] in a way which allows testing for dynamical similarity between two dynamical input sequences x i (t) and x j (t).
Here the mean of the ST error is representative of the temporal average, and the CT error is representative of the ensemble average. In accordance with the ergodic hypothesis [25], if x i (t) and x j (t) are dynamically similar, the mean of the ST errors ρ s i , ρ s j = ρs i +ρs j 2 and the CT error ρ c i,j will also be similar and therefore, their quotient close to one. DE-DDA E i,j is defined as The smaller is E i,j , the more similar are the dynamics of the two signals under investigation, i and j.

Numeric representation of DNA sequences
For DNA-DDA, t corresponds to the genomic position of a nucleotide and L is the resolution or number of nucleotides in each bin. We modeled the DNA sequence as a 1D random walk (DNA walk). Specifically, for every genomic bin, the walker starts at x(t = 1) = 0 and progresses along the DNA sequence taking a step upwards (x(t + 1) = x(t) + 1) if the nucleotide at position t + 1 is C or G and down (x(t + 1) = x(t) − 1) if the nucleotide at position t + 1 is A or T [58]. The DNA walk of five exemplary nonempty bins covering the genomic region chr1:800001-1300001 is depicted in Supplementary Figure S1. A time delay in a DNA-DDA model is a shift in genomic coordinates. A delay embedding of the DNA sequence relates the value of the DNA walk at the genomic coordinate t to its "previous" value at x(t − τ ). We are trying to gain access to (1) the interaction frequency of two genomic loci by considering only (2) the DNA nucleotide sequence within these loci. We achieve this by exploiting the concept of embedding theory, which suggests that certain variables of nonlinear systems are coupled and entail information about one another.

DE-DDA classifying feature set computation
The sequence of the GRCh38/hg38 assembly of the human genome was partitioned into 100 kb non-overlapping bins, and the sequence of each bin was represented as a 1D DNA walk. For a pair of bins i, j, we computed the ST-and CT-errors ρ s i , ρ s j and ρ c i,j using a C implementation of DDA provided by the author [43]. This executable takes as input a time series and parameters listed in Supplementary Table S2 and outputs ST-and CT-classifying feature set (a 1 ,a 2 ,a 3 ,ρ). We combined the errors ρ s i , ρ s j and ρ c i,j into E i,j (Equation 5), which we propose as an estimation of the contact probability between the two pairs of genomic bins. We repeated this process for all bin pairs to obtain the DNA-DDA contact map D = (d i,j ) of a given chromosome.

Matrix post-processing
We hypothesize that bins with higher contact probability will have similar certain dynamical properties. Thus, we inverted D so that the highest values are mapped to the lowest and vice versa. The logarithm of each non-zero value in D was taken. The matrices were saved in the file format of HOMER [59] and then converted to h5 with HiCExplorer's "hicConvertFormat" function. The resulting matrices were normalized to the 0 to 1 range with "hic-Normalize -normalize norm_range". Bins that were excluded in the Hi-C contact maps H were also excluded in the DNA-DDA contact maps D.

Compartment calling
To call compartments, we first derived the Pearson correlation matrix C H = (c H i,j ) from the normalized Hi-C matrix H as described by Lieberman, and then applied principal component analysis (PCA) to C H using MATLAB 's "pca" function. For each principal component (PC), values larger than three scaled median absolute deviations (MAD) from the median were considered extreme outliers and replaced with nearest value that was not an outlier using MATLAB 's "filloutliers(PC,'nearest')" function. The PCs were then normalized to zero mean and unit variance. Lastly, values greater than 0 were assigned to the A compartment and scaled to [0, 0.5]; values below 0 were assigned to the B compartment and scaled to [0.5, 1]. We used ChIP-seq data for H3K4me1 or H4K20me1, two epigenetic modifications associated with open chromatin [60], to determine which PC defines compartments (Table 1). More specifically, among the first four PCs, we selected the one with the largest absolute Pearson correlation  Table S9). ChIP-seq profiles were generated from the corresponding bed files; an empty vector of the same length as the PC was generated and +1 was added to each bin for each called peak falling into the bin's respective genomic region. If the Pearson correlation coefficient was negative, the PC was multiplied by −1. In two exceptional instances (chr1 for Hi-C contact maps of GM12878 and IMR90), a different PC with approximately the same Pearson correlation coefficient was deemed more likely to be associated with the compartments upon inspection, and used instead. From here on, we refer to the PC obtained from C H as PC Hi-C .
Computationally derived DNA-DDA matrices D were processed in the same manner with one exception, their corresponding PC's were smoothed with a sliding window of 5 using MATLAB 's "movmean()" function before computing the Pearson correlation coefficients with the ChIP-seq profiles. We further refer to the resulting DNA-DDA Pearson correlation matrices as C D = (c D i,j ) and to the corresponding PCs defining A/B compartments as PC DNA-DDA .

Saddle plot analysis
We performed saddle plot analysis on the contact maps predicted by DNA-DDA and Hi-C respectively in the K562 and IMR90 cell lines.

Correlation with CG content
We compared DNA-DDA's performance to predict compartments with a baseline of AT/CG percentage over 100 kb windows in all four cell types.

Structure selection and testing
A key difference to traditional machine learning-based approaches is that DDA models are not updated iteratively (i.e. they do not learn). Instead, an exhaustive sweep is performed over a list of putative models to search for those best suited to discriminate the dynamics of interest; this step is called structure selection.
The functional form of a DDA model is dictated by the overall system and obtained data type. The data type most extensively studied using DDA is EEG, for which a particular functional form has been established [45-47, 61, 62]. Typically, most terms in Equation 1 are set to zero to reduce the chances of overfitting (e.g. K ∈ {1, 2, 3}, m ∈ {1, 2, 3} n ∈ {1, 2}). This work is a proof of concept to explore the feasibility of applying DDA to genomics data. Thus, we decided to use a simple, symmetric model with only quadratic degree of nonlinearity (DDA model number 2 in Supplementary  Table S4) Symmetric models require half the computational effort of non-symmetric ones to compute the classifying feature set {a 1 , a 2 , a 3 , ρ} for a range of delays between r 1 and r 2 ; τ 1 , τ 2 ∈ (r 1 , r 2 ) (r i ∈ [1 : 50] in this study).
Next, we searched for the delay pair τ 1 , τ 2 that best captured A/B compartments in each of the four cell types. Specifically, this was done using a 20 Mb region on chr22 (chr22:16200000-36200001, Supplementary Figure S2). The chromosome was chosen arbitrarily and the region thereon corresponded to the one with the lowest compartment agreement between the four cell types considered in this study. The performance of each delay pair was measured as the Pearson correlation coefficient between PC Hi-C and PC DNA-DDA in this region.
The four obtained DNA-DDA models were tested on all 100 kb genomic loci of all human autosomes with various performance measures (Pearson correlation coefficient r PC between PC Hi-C and PC DNA-DDA , area under the receiver operating characteristic (ROC) curve AUC, accuracy ACC and F1-score F1 for classifying A/B compartments). We would like to emphasize that once we determined the DNA-DDA model for each cell type (Supplementary Table S3), all subsequent analyses and results were performed on never before seen data (with the small exception of chr22:16200000-36200001).

Comparison to other methods
We compared DNA-DDA to three other methods that can predict A/B compartments from sequence-based features or the sequence directly ( Table 2). The "Sequence-based Annotator of chromosomal compartments by Stacked Artificial Neural Networks" (SACSANN [8]) predicts A/B compartments based on features derived from GC content, transposable elements (TE), and putative transcription factor binding sites (TFBS). It identifies the 100 most important species/cell-type specific features with a random forest predictor and then trains two stacked artificial neural networks (ANN) to classify 100 kb-long genomic bins into either the A or the B compartment.
The "A/B Compartment Network" (ABCNet [63]) is a deep convolutional artificial neural network (CNN) that takes a one-hot encoding of the sequence within 100 kb-long bins and extracts features by passing them through two-layer convolutional kernels, an average pooling layer, and a fully connected dense layer to output a single value representing the predicted PC value of each bin, classifying it either as an A or B compartment.
Finally, "Orca" [64] is a multiscale prediction model composed of two CNNs, a hierarchial multi-resolution sequence encoder and a cascading series of sequence decoders. The encoder takes up to 256 Mb one-hot-encoded sequence as input, and generates a series of decreasing resolution sequence representations, Both SACSANN and ABCNet model architecture were applied to numerous data sets ( Table 2) in terms of feature selection, training/testing procedures. Resulting models were evaluated using a chromosome-wise leave-one-out cross validation. We compared the performance of DNA-DDA on the hESC data set (GSE35156) to that reported by SACSANN's and ABCNet's authors on the same data set. In addition, we compared the overall performance of the methods, i.e. the average AUC or ACC across various data sets and models. In the case of SACSANN, we considered all examined data sets ("Summary (ROC AUC score) SACSANN" in supplemental_file_S3.xls at https://github.com/ BlanchetteLab/SACSANN/tree/master/supplemental_files/, last accessed in July 2022). For ABCNet, we restricted the comparison to the ACC reported for human data sets (i.e., average μ across all "secondary" data sets in Table III in Kirchof [63]. Orca was trained on two of the highest resolution micro-C (an improvement of Hi-C) data sets available for the H1 human embryonic stem cell line (H1-ESC) and Human foreskin fibroblast cell line (HFF) [65]. We compared to Orca models which predict interactions at the closest resolution that DNA-DDA currently operates on (128 kb and 100 kb respectively). To increase comparability, we considered an Orca model trained on the same cell line (H1-ESC 4DNES21D8SP8); it takes 32 Mb as input and predicts 128 kb interactions within this region. Therefore, we split one of the Orca hold-out chromosomes, chr9, into three 32 Mb regions (Supplementary Table S11), derived a Pearson correlation matrix from the predicted Orca contact maps, performed a PCA in MATLAB and computed the Pearson correlation coefficient r PC of the resulting PCs to those identified by the HiC-data for hESC GSE35156 at 128 kb.

RESULTS
DNA-DDA is a statistical approach derived from DDA, a method based on nonlinear dynamics and traditionally used for time series data, which predicts A/B compartments from the reference sequence ( Figure 1). DDA models relate the numerical derivatives of the input data to their time-delayed versions in a nonlinear functional form, of which the fitting coefficients and model error {a 1 , a 2 , a 3 , ρ} are used to access information about the underlying dynamical system. With simplicity and efficiency in mind, we chose the functional form given in Equation 6. In principle the analysis can be summarized in three steps: (1) segment the reference sequence into 100 kb long bins and represent the sequence in each bin as a 1D DNA walk, (2) determine the best suited model parameters (delay pair τ 1 , τ 2 ) in each cell type by supervised structure selection with Hi-C derived compartment labels and (3) apply the model to the rest of the genome to predict A/B compartments.

Structure selection
To find the delay pairs τ 1 , τ 2 in Equation 6 that best capture A/B compartments in each cell type, we tested model performances for all possible combinations of values for τ 1 , τ 2 on a 20 Mb-long region of chr22 (Methods). Brief ly, we partitioned this region into 200 100 kb-long bins, estimated the contact probability between each pair of bins i and j as E i,j (Equation 5), built a DNA-DDA contact map D, and applied PCA to the respective DNA-DDA Pearson correlation matrix C D to call A/B compartments. For each cell type, we then chose the delay pair that resulted in the highest absolute Pearson correlation coefficient r PC between PC DNA-DDA and PC Hi-C (Supplementary Figure S4). This resulted in four different delay pairs, one for each cell type (Supplementary  Table S3). Pearson correlation coefficients between PC DNA-DDA and PC Hi-C ranged between 0.21 (hESC) and 0.49 (GM12878) and AUCs ranged between 0.61 (hESC) and 0.77 (GM12878).

Testing
The DNA-DDA Pearson correlation matrices C D exhibited strikingly similar global patterns to the experimentally obtained Hi-C Pearson correlation matrices C H (Figure 2 and Supplementary  Figures S9 to S11). In general, supervised methods are limited by the uncertainty of the labels derived from the experimental data. It is known that identifying A and B compartments by PCA of the Pearson correlation matrix derived from the contact map is suboptimal and misses many features, especially at higher resolutions. This is particularly apparent in the IMR90 data set (Supplementary Figure S7) or particular chromosomes (e.g., chr22). Nevertheless, DNA-DDA achieves exceptional performances on never before seen genome, with AUC = 0.81, ACC = 0.74, F1 = 0.72, r PC = 0.54 over all chromosomes and cell lines (Table 3)

Saddle plot analysis
Saddle plot analysis revealed that DNA-DDA derived matrices had overall stronger saddle strengths strengths S than those derived from the expected/observed Hi-C matrix H oe (Supplementary Figure S15

Correlation with CG content
CG content is known to correlate with the compartment signal. Nonetheless, DNA-DDA predicted compartments better for 100 kb windows than the corresponding percentage of AT/CG content in each window. The mean Pearson correlation coefficients across all chromosomes was between 0.45 and 0.60 as opposed to the means between 0.45 and 0.56 observed for the GC-content-based prediction. Specifically, when examining individual chromosomes, DNA-DDA performed significantly better for GM12878, K562 and hESC (p < 0.05; Wilcoxon signed-rank test) and exhibited no difference for IMR90, the cell line for which DNA-DDA performed the worst overall. These results suggest that DNA-DDA and GC content are complementary methods.

Comparison to other methods
We compared DNA-DDA to SACSANN [8], ABCNet [63] and Orca [64]. To our knowledge, these are the only methods that are directly comparable to DNA-DDA, as they rely only on the DNA sequence or sequence-based features to predict A/B compartments. DNA-DDA competes well with other methods predictive of A/B compartments when assessed on the same Hi-C data set (hESC GSE35156) as well as overall (Table 4). Indeed, all four methods showed similar scores measured by AUC, ACC or r PC . Note the vast difference in the size of the "training" (pendant to  chr  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21   structure selection in DNA-DDA) and test data sets between the methods in Table 2. Furthermore it should be noted that although the Orca model we compared to was trained on the same cell line, a far more deeply sequenced HiC library was used in the training/validation procedure (0.3 billion vs 5.6 billion (4DNES21D8SP8) total reads).

DISCUSSION AND CONCLUSION
DNA-DDA is a nonlinear dynamics method for predicting A/B compartments based on the DNA sequence alone. We derived DNA-DDA models for four human cell types using only a 20 Mb-long region on chr22 and corresponding compartment labels defined by Hi-C data. DNA-DDA achieved mean AUCs across all held-out chromosomes between 0.75 and 0.84. The achieved performances on never-before-seen testing data demonstrate the potential of DDA, a method which has been shown to accurately classify time series data, in the field of genomics. Many computational methods for predicting genome architecture have been developed in recent years [66], but only a few are able to predict A/B compartments from the sequence alone. ABC-Net [63] and SACSANN [8] are both convolutional neural network (CNN) based models that can predict A/B compartments. While ABCNet relies only on the sequence, SACSANN [8] uses counts of sequence-derived features (e.g. TEs, TFBSs). Both achieve AUC scores close to 80%, but while ABCNet does not require any previous knowledge about the genome of interest, SACSANN derives its input features from annotation data sets that are not available for every species. The currently most comprehensive sequence-based approach for modeling 3D genome architecture is Orca [64]. Orca is also the first sequence based model able to predict truly longrange interactions (> 1 Mb). Orca models have been trained on two of the highest resolution microC data sets to date, and predict interactions simultaneously at different resolutions. The models' ability to capture sequence dependencies of 3D genome architecture has been experimentally validated. DNA-DDA competes well with these state-of-the-art methods which use a larger portion of the genome for training and/or input features other than solely the sequence (Table 4).
This study is a proof of concept meant to illustrate the potential in applying DDA-based approaches in the field of structural genomics. We want to emphasize that the DNA-DDA models presented here highly unlikely represent an optimum. Many aspects of our analysis could be evaluated and modified, including the use of a different numerical DNA sequence representation, other functional forms for the DDA models, and alternative targets to assess model performance.
We encoded the DNA sequence as a 1D DNA walk [58,67], which is a common representation for analysis of DNA sequences in time series frameworks and has been the basis for numerous studies that applied spectral analysis and signal processing methods such as discrete or Ramanujan fourier transform and, wavelet or fractal analysis for revealing high-level periodicities and patterns with biological significance [33,34,68,69].
In the 1D DNA walk using the hydrogen bond energy (SW) rule [58,70], the walker starts at zero and continues along the linear chain of nucleotides taking a step up for strongly bonded pairs (C or G) and down for weakly bonded pairs (A or T). Thus, DNA-DDA models capture dynamical properties based mainly on GC-content. Of course, more complex representations of the DNA sequence have been proposed as well, some of which take all four nucleotides into account (overview and comparison in [68,71,72]). We initially considered the alternative and equally simple 1D mapping integer representation (T = 0, C = 1, A = 2, G = 3). However this method implies biologically irrelevant properties on the bases such that purines are weighted more than pyrimidines ((A, G) > (C, T)). In future work, we plan to resort to DNA representations that include information of all nucleotides and do not have such a bias such as the 2D DNA walk [73]. Naturally, the ergodicity measure has to be substantially modified to achieve this.
The functional form of the DNA-DDA model was chosen based only on simplicity and computational efficiency. Previous work has shown that the overall functional form of a DDA model tends to be specific to the data type used to measure the system of interest (e.g. EEG, ECG, DNA sequence), while the delay pairs are sensitive to the question we ask about the system [62]. A largescale exhaustive sweep of model-delay-pair combinations such as described in Lainscsek et al. [45] could be implemented to optimize model-delay pair combinations.
To predict A/B compartments, DNA-DDA first constructs a (DNA-DDA) contact matrix. This matrix is then post-processed (e.g. filtering, logarithmic transformation, etc.) before being subjected to PCA, as proposed by Lieberman et al [1]. Although compartments are routinely identified using PCA, the binary classification into A and B compartments is most likely an oversimplification of 3D genome architecture [5]. In fact, Rao et al [2] showed that genomes segregate into at least six subcompartments, each exhibiting a distinctive pattern of genomic and epigenetic features. Furthermore, we chose the PC to be most representative of compartments based on its correlation with a histone mark associated with open chromatin in the respective cell type and genomic region of interest, as suggested by [4]. However, the chosen PC is clearly dependent on the region being considered and often, two or more PCs will often exhibit very similar correlation coefficients to the ChIP-seq signal. Naturally, this is not a limitation exclusive to DNA-DDA, but rather, of all methods calling compartments based on PCA. A large improvement in the structure selection step would be to select delay-pairs based on an alternative mutli-variate compartment classification method or the similarity of the DNA-DDA and Hi-C contact matrices directly. Since all relevant 3D structures could be extracted from interaction matrices (at various resolutions), we strongly believe that choosing the delay-pairs with a better suited target label will substantially boost cell-type specificity.
It is important to remember that a Hi-C contact map represents the average 3D genome architecture of a cell population -a multitude of systems (different cell types) at various initial conditions. DNA-DDA models were trained using a small portion of the reference genome under the assumption that the sequence-based mechanisms contributing to chromatin folding are location-independent and cell-type invariant. Recent studies comparing bulk RNA-Seq and single-cell RNA-Seq transcriptomic profiles have demonstrated the existence of distinct expression clusters corresponding to cell types sharing similar functions [74]. Thus, despite the hypersensitivity of chaotic systems to initial conditions, the state spaces of two cells of the same type can be assumed to be more similar than those of two different types, in the topological sense. Since in the present study the target labels were compartment calls derived from bulk Hi-C data, DNA-DDA is also expected to predict the compartments that ref lect the most common 3D chromatin structures in a hypothetical cell pool.
DDA models are sparse and comprise a small number of features (typically 1-4), making them robust to overfitting and well generalizable to new data [39]. Large models such as deep learning networks run the risk of capturing irrelevant patterns or "noise" in the data. In contrast, small and simple models typically fail to capture dominant signatures in the data. Since DDA does not look for the most predictive, but rather the most discriminative model, most terms in Equation 1 are set to zero. This is the power of DDA, it does not aim to model but rather capture dominant dynamical signatures in the data and 3-term DDA models have been proven sufficient for classifying complex biological data sets (eg. [45,62,75]). Still, one caveat of DNA-DDA concerns feature interpretability. Although models with fewer parameters are often more interpretable than the immense feature spaces that are typical of deep learning, this is not the case with DDA due to its foundation and motivation in embedding theory [40]. DDA has be related to spectral analysis: The estimated coefficients and delays of linear DDA models relate to the frequencies of a signal, and the estimated coefficients of nonlinear DDA models are connected to higher-order statistical moments [39]. In general, the delays of nonlinear DDA models are characterized by complex phase and frequency couplings which are not attributable to one particular system property. Nevertheless, due to the extremely low computational load of DNA-DDA, sequence-based mechanisms and the impact of genetic variants, such mutations in known or putative binding sites for regulatory proteins or genomic rearrangements, could easily be tested.
The power of methods like DNA-DDA and Orca lies in their intermediate step of predicting the contact matrix before calling A/B compartments, since all relevant 3D structures could be extracted from them in the same manner as Hi-C maps. Although Orca currently predicts contact maps for higher resolutions, DNA-DDA shows promising results whilst tackling the problem from a different angle and requiring far less "training data." We chose a relatively low resolution (100 kb) for this proof-of-concept study to (1) efficiently test the many possible parameters and aspects of our approach; and (2) allow for a better comparison with SACSANN and ABCNet, which both operate at 100 kb, and together with Orca are the only other purely sequence based methods to predict compartments of which we are aware. We acknowledge that our current DNA-DDA contact maps capture global large-scale patterns rather than subtle cell-type specific changes. Nevertheless, we believe that the use of a different numerical DNA sequence representation and/or other DDA model forms, and alternative targets to assess model performance will greatly improve cell-type specificity and result in more diverse DNA-DDA contact maps (Figure 4). Moreover, our results show that DDA models can be derived from a small amount of supervised data, which would enable the prediction of genome-wide interactions in other cell types from a very limited amount of chromosome conformation capturing data (e.g. generated with 5C). With some optimization, we are convinced that the method might be able to do this as well as deep learning algorithms, but using just a miniscule fraction of the volume of data required by such approaches.
The DNA sequence plays a fundamental role in the formation and maintenance of 3D genome architecture, which in turn is a central orchestrator of the gene regulatory network. It is indisputable that the sequence contains key underlying features contributing to genome folding, however, to what extent it alone can be used to gain access to all 3D interactions and relevant structures remains an open question. Our findings strongly support the hypothesis that the DNA sequence represents a highly observable variable of chromatin architecture and that chromosomal compartments can be predicted solely from the DNA sequence. This opens up a variety of possibilities such as discovering novel sequence signatures imperative to structural genome function and how disruption of 3D genome architecture relates to human disease.

Key Points
• Substantial information about 3D genome architecture can be uncovered from from solely the DNA sequence. • DNA-DDA models derived from a fraction of the typical amount of required target labels, already show high predictive power in classifying A/B compartments. • Delay differential analysis, a technique with foundations in chaos theory, is suited for analyzing genomic sequence data in the context of structural genomics.

SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals. org/.