Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

Abstract Motivation Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability and implementation Code is available at: https://github.com/kundajelab/dfim. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Genome-wide biochemical profiling experiments have revealed millions of putative regulatory elements in diverse cell states. These massive datasets have spurred the development of deep neural network (DNN) models to predict cell-type specific or context-specific molecular phenotypes such as TF binding, chromatin accessibility and gene expression from DNA sequence (Alipanahi et al., 2015;Kelley et al., 2016;Zhou and Troyanskaya, 2015). Beyond high prediction accuracy, the primary appeal of DNNs is that they are capable of learning predictive sequence features and modeling non-linear feature interactions directly from raw DNA sequence without any prior assumptions. Hence, interpreting these purported black box models could reveal novel insights into the combinatorial regulatory code.
Advances in feature attribution methods for DNNs have enabled the identification of predictive cis-regulatory patterns in DNA sequences used as input to the models. Feature attribution methods estimate the contribution (or importance) of features, such as individual nucleotides or contiguous subsequences (e.g. motifs), in an input DNA sequence to a model's output prediction. A perturbation-based, forward-propagation approach known as in-silico mutagenesis (ISM) quantifies the importance of a nucleotide in an input DNA sequence as the maximal change in the output prediction from the DNN model when the observed nucleotide (e.g. a G) at that position is mutated to any of the alternative bases (e.g. A, C or T). ISM has been used to score the potential molecular impact of genetic variants in regulatory DNA sequences (Alipanahi et al., 2015;Kelley et al., 2016; V C The Author(s) 2018. Published by Oxford University Press.
i629 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Bioinformatics, 34, 2018, i629-i637 doi: 10.1093/bioinformatics/bty575 ECCB 2018Zhou and Troyanskaya, 2015. However, ISM is computationally inefficient since each perturbation at every position in an input sequence requires a separate forward propagation to the output through the network. ISM also fails to highlight important features masked by saturation due to buffering interactions with other features (e.g. multiple motif instances in a sequence that buffer each other) (Shrikumar et al., 2017). SHAP is a perturbation-based feature attribution method that borrows from game theory (Lundberg and Lee, 2017). Max-Ent is a feature attribution method that uses a Markov chain Monte Carlo algorithm to find the maximum-entropy distribution of inputs that produced a similar hidden representation to the chosen input (Finnegan and Song 2017). While SHAP and Max-Ent show improved sensitivity and specificity relative to ISM, they do not scale efficiently to comprehensively characterize feature importance across millions of regulatory sequences. An alternative family of computationally efficient backpropagation approaches decompose the output prediction corresponding to an input sequence by recursively propagating contribution scores through the layers of the DNN from the output to the input. One single backpropagation pass provides the contribution of all nucleotides in an input DNA sequence to the output prediction. The gradient of the output with respect to each nucleotide in the input DNA sequence-known as a saliency map (Simonyan et al., 2014)is one such estimate of importance and has been used to identify predictive nucleotides in regulatory DNA sequences. Other related approaches such as DeepLIFT (Shrikumar et al., 2017) and integrated gradients (Sundararajan et al., 2017) differ in the definition of the importance score that is backpropagated and provide improved sensitivity in the presence of saturation effects. DeepLIFT (Shrikumar et al., 2017) has also been shown to be an efficient approximation of SHAP scores (Lundberg et al., 2018). Current feature attribution methods only provide the importance of individual features. They do not highlight predictive, higher-order feature interactions that are learned by the DNN model. Perturbation-based approaches such as ISM cannot scale to comprehensively score all pairwise and higher-order interactions between nucleotides or subsequence features. Recently, an efficient algorithm was proposed to calculate SHAP-based pairwise feature interaction scores (Lundberg et al., 2018) specifically from tree-based ensemble models. However, computing SHAP interactions from neural network models between all pairs of features in regulatory DNA sequences is computationally inefficient and cannot scale to reveal comprehensive interaction maps across millions of regulatory sequences.
Here, we present an efficient approach called Deep Feature Interaction Maps (DFIM) to estimate pairwise interactions between features (nucleotides or subsequences) in an input DNA sequence mapped to an associated regulatory phenotype by a neural network. We define a novel Feature Interaction Score (FIS) between any pair of features (source feature and target feature) in an input DNA sequence as the change in the importance score of the target feature when the source feature is perturbed, while keeping all the other features in the sequence intact. By leveraging efficient backpropagation-based feature attribution methods, we can efficiently compute FIS between all pairs of nucleotides or predictive motifs across large sets of input DNA sequence. Aggregate summary statistics of the pairwise Feature Interaction Score across multiple sequences provide insights into common, shared patterns of feature interactions.
We benchmark DFIM in controlled simulations that explicitly encode motif interactions. We use DFIM to reveal synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. We apply DFIM to reveal epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci.

Materials and methods
We assume that we have trained a deep neural network to accurately map one-hot encoded DNA sequences X of length L to a categorical (binary or multiclass classification) or continuous (regression) output O. Let Y refer to the scalar predicted output O from the neural network for regression tasks. For classification tasks, let Y refer to the scalar input to the final output sigmoid (i.e. logit) of the neural network.

Nucleotide-resolution feature interaction score (FIS)
We are given a one-hot encoded input DNA sequence X 0 2 f0; 1g f4ÂLg i.e. a matrix of size 4; L ½ such that X 0 b; p ½ ¼ 1 for the observed nucleotide b 2 fA; C; G; Tg at position 1 p L (Fig. 1).
First, we compute C X0 a matrix of size 4; L ½ that contains the importance (or contribution) of every nucleotide (rows) at each position in the sequence ( Fig. 1 Step 1). While our approach extends to any other efficient feature attribution method, for the analyses in this paper, we show results using both DeepLIFT (Shrikumar et al., 2017) and gradient saliency maps as importance scores (Simonyan et al., 2014). In gradient-based saliency maps, for a specific input sequence X 0 , the output Y 0 can be approximated by a first-order where w 0 is the partial derivative (gradient) of Y with respect to the input sequence variable X evaluated at X 0 i.e. w 0 ¼ @Y @X j X0 . It is worth noting that the entire gradient matrix w 0 can be computed efficiently in one backpropagation pass. We then perform a point-wise multiplication of the gradient matrix w 0 with the one-hot encoded observed input sequence X 0 to obtain the importance scores for the observed nucleotides b at each position p i.e. C X0 ¼ w 0 b; p ½ X 0 b; p ½ . Only the observed nucleotides at each position can have non-zero values. DeepLIFT contribution scores quantify the sensitivity of the output to finite changes in the input (Shrikumar et al., 2017). This is in contrast to gradients, which measure the sensitivity of the output to infinitesimal changes in the input. Specifically, the DeepLIFT algorithm backpropagates a score (analogous to gradients) which is based on comparing the activations of all the neurons in the network for the actual input sequence X 0 to those obtained when using neutral

AACGTCTG AA C G T A TG
One hot encoding   'reference' sequences (Shrikumar et al., 2017). We use dinucleotideshuffled versions of X 0 as reference sequences unless otherwise specified.
Our goal is to query the neural network to estimate the interaction between the observed nucleotide at one position in the sequence (source feature) and the observed nucleotide at some other position (target feature) in the sequence. Let a; s ð Þ represent the source feature i.e. the observed source nucleotide a 2 fA; C; G; Tg at a source position s such that X 0 a; s ½ ¼ 1. Let b; s ð Þrepresent the target feature i.e. observed target nucleotide b 2 fA; C; G; Tg at some target position t such that X 0 b; t ½ ¼ 1. Intuitively, we define the Feature Interaction Score FIS X0 b; t ð Þj a; c; s ð Þ ð Þof the target feature on the source feature as the change in the importance score of the target feature b; s ð Þ when the source feature a; s ð Þ is mutated to a different nucleotide c; s ð Þ. To compute FIS, we create a new mutated sequence X 0 0 from X 0 where we switch the observed nucleotide a at source position s to a different mutant nucleotide c 2 fA; C; G; Tg, while keeping the nucleotides at all other positions as they were in X 0 ( Fig. 1 Step 2). We then compute the importance matrix C X0 0 for X 0 0 as we did for X 0 ( Fig. 1 Step 3).
The FIS of the target feature with the source feature is defined as Since, only two backpropagation passes are required to compute C X0 ; t ½ and C X0 0 ; t ½ for all 1 t L, we can efficiently compute the FIS of all target features FIS X0 Ãj a; c; s ð Þ ð Þin a sequence with respect to a specific source feature mutation ( Fig. 1 Step 4). Note that the FIS is a directional interaction score of the target with the source. In some cases, we may only be interested in the magnitude of the score rather than its sign. In such cases, we use the absolute value of the FIS.
We define the maximal Feature Interaction Score (maxFIS) of the target feature with the source feature as the maximal FIS marginalized over all possible values of the mutant nucleotide c at the source feature a; s Fig. 1 Step 5).
A nucleotide-resolution Deep Feature Interaction Map (DFIM) summarizes the maxFIS scores for all pairs of source and target features in an input DNA sequence ( Fig. 1 Step 6).

Aggregate statistics of nucleotide-resolution FIS over multiple input sequences
In order to analyze the prevalence of the FIS between a source position s and target position t across a collection of input sequences X i , we first identify the subset of sequences S ¼ fX i g that have identical source nucleotides at the source position and identical target nucleotides at the target position i.e 8X i ; X j 2 S; X i a; s

Motif-resolution feature interaction score
We are often interested in the FIS of a specific target motif fðb p ; t p Þ; :::; ðb q ; t q Þg i.e. a specific subsequence of nucleotides fb p ; :::; b q g at a specific subset of contiguous target positions ft p ; :::; t q g with a source nucleotide-resolution feature a; s ð Þ (i.e. specific source nucleotide at specific source position) such as a regulatory single nucleotide variant (SNV). In such a case, we compute the FIS of a target motif with a source nucleotide feature as the difference of the sum of importance scores across all target nucleotides fðb p ; t p Þ; :::; ðb q ; t q Þg in the target motif in the original sequence X 0 and the mutated sequence X 0 0 (obtained by mutating a; s ð Þ in X 0 to c; s ð Þ).

FIS X0
fb p ; :::; b q g; ft p ; :::; t q g j a; c; s ð Þ ¼ X b;t ð Þ2fðb p ;tpÞ;:::;ðb q ;tqÞg To compute the FIS of a target motif fðb p ; t p Þ; :::; ðb q ; t q Þg with a source motif f a k ; t k ð Þ; :::; a l ; t l ð Þg(See Fig. 3 as an example), we use a different source mutation method. One option would be use the maximal FIS of the target motif over all possible single nucleotide mutations of each position in the source motif. However, this procedure is computationally infeasible for long motifs. We instead, generate one mutant sequence, where we mutate the one-hot encoding (where rows 1-4 correspond to A, C, G, T) of all positions fs k ; :::; s l g in the source motif to the expected background GC nucleotide frequency f GC i.e. the mutant sequence X 0 0 has The FIS of the target motif with the source motif is once again the difference of the sum of importance scores across all target nucleotides fðb p ; t p Þ; :::; ðb q ; t q Þg in the target motif feature between the original sequence X 0 and the mutated sequence X 0 0 . FIS X0 fb p ; :::; b q g; ft p ; :::; t q g j fa k ; :::; a l g; f GC ; fs k ; :::; s l g ð Þ ¼ X b;t ð Þ2fðb p ;tpÞ;:::;ðb q ;tqÞg

Statistical significance of FIS
Given a continuous distribution of FIS, across a collection of input sequences, we define statistically significant interactions based on an empirical null distribution of scores from dinucleotide shuffled versions of the input sequences. For each dinucleotide shuffled input sequence, we compute FIS for all nucleotide pairs. We fit a Gaussian distribution to this null empirical distribution of FIS scores. FIS values passing a P-value of 0.05 with respect to this null distribution are considered statistically significant. We use the Benjamini-Hochberg procedure for multiple hypothesis correction. Supplementary Figure S1 demonstrates how the null model can be used to identify responding motifs in the context of a longer sequence.

Comparison of DFIM to SHAP interaction scores and pairwise ISM interaction scores
For an input sequence with F features (nucleotides/motifs), SHAP interaction scores scale at least quadratically to compute all pairwise interactions giving a complexity ranging from O F 2 À Á to O 2 F À Á (Lundberg et al., 2018). A pairwise ISM-based interaction score, defined as the difference between the ISM score obtained by jointly mutating two features and the sum of the ISM scores of individual features, also has a complexity of O F 2 À Á . For DFIM, we require one backpropagation pass to obtain importance scores for the original sequence. Then for each of the F source features, we need one more backpropagation pass to obtain FIS of that source with all target features. Thus, DFIM exhibits a complexity of O(F) scaling linearly in the number of features. Our proposed FIS is essentially an DFIM i631 efficient approximation of SHAP interaction scores. Further, in contrast to SHAP interaction scores and pairwise ISM interaction scores which are necessarily symmetric over the source and target, FIS is directional and can produce asymmetric interaction scores.

Benchmarking FIS on ground-truth motif interactions embedded in simulated regulatory DNA sequences
To benchmark FIS, we simulated 60 K random DNA sequences (0.46 G/C frequency) of length 200 bp. We divided these into 3 sets of 20 K sequences. We randomly embedded 1 or 2 instances of the ELF1 motif [using the highest affinity sequence from Position Weight Matrix (Kheradpour and Kellis, 2014)] in the sequences in Set 1, 1 or 2 instances of the SIX5 motif in Set 2 and 1 or 2 instances of both ELF1 and SIX5 motifs in Set 3. We further independently embedded 0 or 1 instances of the AP1 and TAL1 motifs in a random subset of sequences across all 3 sets (Kheradpour and Kellis, 2014) (Supplementary Methods). We then set up a binary classification task where all sequences in Set 3 (ELF1 and SIX5) were labeled as positive and all other sequences from Sets 1 and 2 were labeled as negatives ( Fig. 2A). We trained a Convolutional Neural Network (CNN) with one convolutional and one dense layer (Supplementary Methods). We achieved 100% classification accuracy on held out validation set of sequences indicating the model had learned the necessary interaction between ELF1 and SIX5. We computed motifresolution FIS for all pairs of embedded motif instances (SIX5, ELF1, AP1 and TAL1) for all sequences in the positive class (i.e. Set 3). We used DeepLIFT with a fixed GC reference for computing importance scores since the underlying sequences were generated using a fixed GC background. Only pairs of SIX5 and ELF1 motifs (positive control) showed strong FIS (Fig. 2B, green distribution), compared to all other pairs of motifs (negative controls) demonstrating that can effectively discriminate ground truth interactions learned by a neural network. We further assessed the significance of these interactions using a empirical null distribution from dinucleotide shuffled sequences and found that the vast majority of true ELF1-SIX5 interactions have significant (P < 0.05) P-values, even after multiple hypothesis correction. None of the other motif pairs show statistically significant interactions (Supplementary Fig. S2A and B). The results are replicated using gradient saliency maps as importance scores ( Supplementary Fig. S2C and D).

Uncovering epistatic motif interactions of co-binding TFs from CNN models of in vivo TF binding
We analyzed CNN models of in vivo TF binding to investigate epistatic interactions between motifs of co-binding TFs. We trained a multi-task CNN model to classify 1 kbp sequences centered at GATA1, GATA2 and TAL1 ENCODE ChIP-seq peaks ( Next, we identified all matches to the known motifs of GATA1 and TAL1 in all ChIP-seq peak sequences (Supplementary Methods). We then computed motif-resolution FIS (using DeepLIFT with shuffled reference as importance scores) for all pairs of GATA1, TAL1 motif instances across all sequences using GATA1 as the source motif. We observed several instances with strong FIS between proximal GATA1 and TAL1 motifs which corroborates their experimentally validated co-binding interactions (Kassouf et al., 2010) (Fig. 3A). To understand the relationship between the distance between motif instances and their interaction scores, we binned GATA1 and TAL1 motif pairs into 4 distance bins-within 20 bp (n ¼ 13 004), 20-50 bp (n ¼ 18 898), 50-100 bp (n ¼ 28 684) and 100-200 bp (n ¼ 211 154). We compared the distribution of FIS for the motif pairs across the bins. As expected, TAL1 and GATA1 motifs in close proximity (<20 bp) showed statistically significant higher interaction scores than all three other bins (P < 1eÀ16, Mann Whitney test for all 3 comparisons) (Fig. 3A). However, interestingly, we observed some strong long-range interactions between motifs as far as 70 bp apart (Fig. 3B), an observation corroborated by a recent analysis of SNP effects on TAL1 ChIP-seq signal in erythroid cells that found that GATA1 motif mutations impact TAL1 binding at distances as great as 75 bp (Behera et al., 2018). The interactions were also symmetric, such that mutating TAL1 demonstrated a similar distribution of FIS on GATA1 ( Supplementary Fig. S4).

Discovering interactions between regulatory variants and their target TF motifs from CNN models of in vivo chromatin accessibility
DNNs mapping regulatory DNA sequences to TF binding and chromatin accessibility have been previously used to score the predicted in-silico allelic effects of putative regulatory genetic variants based on ISM (Alipanahi et al., 2015;Kelley et al., 2016;Zhou and Troyanskaya, 2015). Here, we instead use FIS to investigate an orthogonal question-What proximal sequence features are affected   (Tehranchi et al., 2016). They provide coordinates, effect sizes, reference/alternative alleles and the allele with stronger binding for statistically significant binding QTLs (bQTLs) and non-significant background control SNVs in ChIP-seq peaks for JUND, NFKB, SPI1, STAT1 and POU2F1. This dataset provides an excellent resource to investigate the feature interactions of bQTLs. Further, we wondered if we could discover bQTL feature interactions for different TFs from a single DNN model trained to predict chromatin accessibility (instead of TF binding) from sequence. Hence, we trained a multi-task (18 tasks) CNN model to map 1kbp length DNA sequences to binary chromatin accessibility profiles across 16 primary hematopoietic cell types (with ATAC-seq data) (Corces et al., 2016) and 2 ENCODE cell-lines (with DNaseseq data) including the GM12878 lymphoblastoid cell-line (LCL) (ENCODE Project Consortium, 2012). The model achieved high performance on the test set (average auPRC ¼ 0.69, auROC ¼ 0.91). We used the LCL task to investigate bQTL feature interactions using DeepLIFT with shuffled reference as importance score. We restricted our analysis to the statistically significant (allelic binding P < 5eÀ05 as recommended by Tehranchi et al., 2016) bQTLs that overlapped the DNase-seq peaks in GM12878.

Feature Binary class label
To understand proximal interactions, for each bQTL, we used FIS to estimate the effect of mutating the reference allele to all alternate alleles at the source QTL on every target nucleotide 615 bp around the QTL. First, we observed strong positive (Fig. 4A) and negative (Fig. 4B) interactions of bQTLs with nucleotides of overlapping target TF motifs. The direction of the allelic effect (stronger or weaker ChIP-seq signal) of the reference and alternate bQTL alleles on TF binding also matched the predicted direction of change (stronger or weaker motif score). E.g. A significant JUND bQTL at chr22: 42925130 falls in a high affinity JUND binding motif (Fig. 5A). The reference A allele has higher binding than the alternative G allele with P-value 1.71eÀ140 in the Tehranchi et al. (2016) study. FIS predicts that the G allele (weaker allelic binding) but not the A allele (stronger allelic binding) will destroy the importance of the entire JUND motif.
Next, we also found several TF-bQTLs in the flanking nucleotides of weak affinity motif matches of the target TF having significant interaction effects with the entire motif. E.g. a significant SPI1 bQTL at chr1: 94169843 has reference allele T (with stronger binding) and alternate allele C. The bQTL is in the flanking nucleotides of a low affinity SPI1 site where only the core GGAA matches the canonical motif. FIS predicts that the C allele (weaker binding) destroys the importance scores of the core GGAA element (Fig. 5A). Tehranchi et al. (2016) and several other studies have reported that a large fraction (70-90%) of QTLs do not overlap high affinity instances of canonical TF motifs. We hypothesize that several QTLs may be affecting flanking nucleotides of weak affinity TF motif instances. Finally, while most bQTLs with statistically significant interactions exhibit the maximal absolute interaction with other nucleotides within 10 bp of the bQTL, we also observe Top row shows the importance scores for the original sequence. When the source GATA motif is mutated, the feature importance of the target TAL1 motif drops (second row) indicating a strong interaction between the two motifs (high FIS) (third row). (B) Distribution of FIS scores for GATA1-TAL1 motif pairs stratified by relative distance between motifs. GATA1 motifs exhibit strong interactions with TAL1 motifs that are within 20 bp distance strong and significant longer-range interactions at distances ranging from 20 to 200 bp (Fig. 6A). E.g. an SPI1 bQTL has a significant interaction with a proximal SP1 motif but also a strong interaction with a RUNX1 motif 20 bp away (Fig. 6B). SPI1 QTLs were also found to affect motifs 100 s of base pairs away ( Supplementary Fig. S5).
As a negative control, for each TF, we also evaluated the FIS of a matched number of conservative control SNVs from the Tehranchi et al. (2016) study that overlap the TF's ChIP-seq peaks and LCL DNase-peaks with least significant allelic effects on binding (allelic binding p % 1). For each bQTL and control SNV, we recorded its maximal absolute FIS (maxAbsFIS) over all target nucleotides 615 bp around the SNV. For all the TFs, we found that the bQTLs exhibit significantly (Mann Whitney test) stronger maxAbsFIS than control SNVs (Fig. 7), indicating that FIS may be an alternative approach to ISM to identify putative regulatory variants. This result was replicated using gradient based important scores (Supplementary Fig. S3).  (Le et al., 2018). They used the BET-seq assay to measure high-resolution in vitro binding affinity landscapes of the yeast TFs Cbf1 and Pho4 to a high complexity library of >1 million DNA sequences with a fixed central core E-box sequence (CACGTG) and 5 variable flanking nucleotides on either side. They trained a feed forward neural network to predict relative binding affinity (DDG) for each of the TFs from the 10 bp flanking sequences (using a flattened one-hot encoding) in the library (Le et al., 2018) (Fig. 8A). The model architecture consisted of 3 dense layers of sizes 500, 500 and 250 with ReLU activation followed by batch normalization and dropout (P ¼ 0.25) with a final dense classification layer having a linear activation. They used a distillation approach to interpret the NN model by fitting a linear model with all mononucleotide features across all positions and all dinucleotide features across all pairs of positions to the output predictions of the NN. They found that dinucleotide features were critical for the linear model to have a good fit (r 2 > 0:95) especially for Cbf1. They then estimated the contributions of all pairwise interaction terms by comparing the mononucleotideþdinucleotide linear model to a mononucleotide-only linear model. Cbf1 was found to exhibit significant interactions between several pairs of flanking nucleotides (Le et al., 2018). We instead used DFIM to directly query the Cbf1 neural network model and estimate pairwise nucleotide-resolution FIS between all pairs of nucleotides at all positions for all sequences in the library (Fig. 8B). We computed aggregate statistics (mean) of the absolute nucleotide-resolution FIS for all pairs of nucleotide features across the 5000 sequences with strongest binding affinity (lowest measured DD G). We obtain four (40 Â 40) aggregate DFIMs where each map corresponds to one of the 4 bases fA; C; G; Tg as the observed source nucleotide. The rows in each 40 Â 40 map correspond to 4 mutant bases Â 10 source positions, while the columns correspond to 4 target bases Â 10 target positions. To ease interpretation, we compute a marginalized 40 Â 40 aggregate DFIM that records the maximal average score over all mutant bases for each source base, source position, target base and target position (Fig. 8C), marginalized over the 3 potential mutations for a given source base. We observe that the marginalized aggregate DFIM for the high binding affinity sequences exhibit several strong interactions between flanking nucleotides (Fig. 8C). The map corroborates several of the strongest interactions identified by Le and Shimko using the distillation approach such as the strong interaction between a T at the À1 position and an A at the þ1 position (Le et al. 2018). Our maps also identify novel interactions such as a strong interaction between T at À1 and T at þ2. In contrast, the aggregate DFIMs across 5000 sequences with weakest binding affinity (highest measured DDG) exhibit uniformly weak interaction scores.

Discussion
We present an efficient method called Deep Feature Interaction Maps (DFIM) to identify epistatic interactions between all pairs of nucleotides or motif features in any DNA sequence input to a deep learning model for regulatory genomics. Our method accurately  recovers ground truth interacting motifs in simulated regulatory DNA sequences. When applied to deep learning models of in vivo TF binding, we recover known proximal interactions between motifs of interacting co-factors while also discovering long-range interactions between motifs as far as 75 bp apart. We interpret deep learning models trained on in vitro TF binding to discover extensive interactions between pairs of nucleotides in sequences flanking core TF binding motifs. Finally, we interpret deep learning models of in vivo chromatin accessibility to generate nucleotide-resolution interaction maps for non-coding regulatory sequences surrounding SNVs (bQTLs) that affect binding of transcription factors. Our maps link binding QTLs to nearby sequence features including high and low affinity matches to the canonical binding site of the TF whose binding is disrupted. We also find bQTLs interacting with motifs of multiple co-binding TFs. These epistatic interactions seem to capture both cooperation and competition. While our primary focus in this manuscript is on interpreting feature interactions in DNA sequence inputs, DFIM can easily be generalized to other data modalities. Partial dependence plots are commonly used to understand the sensitivity of a prediction to a one or more features (Friedman, 2001). DFIM serves as complementary approach to understand the predictive higher-order, non-linear interactions between features. DFIM is most efficient to estimate all pairwise interactions between pre-determined features such as known binding sites or SNVs or a sparse set of de-novo discovered predictive features with significant importance scores. However, DFIM also scales well to estimate interactions between all nucleotides in large sets of sequences because it leverages efficient backpropagation-based feature attribution methods. While DFIM is generally compatible with any efficient feature attribution method, we have not evaluated our approach on all such methods. However, we have found overall strong replication of DFIM results and associated conclusions by using two separate importance scores, namely DeepLIFT and gradient saliency maps. This suggests that DFIM could generalize to other importance scoring approaches.
There are some caveats to interpreting feature interactions derived from DFIM. Feature importance scores from any feature attribution method are only meaningful for examples that are predicted correctly. Since feature interaction scores from DFIM are based on feature importance scores, the validity of DFIM is also restricted to examples that are correctly predicted by high performance models. Further, vulnerabilities of the feature attribution method used in DFIM transfer over to the interaction scores. Hence, we recommend using multiple feature attribution methods to obtain robust estimates of interactions. Changes in model architecture can also change the interactions encoded by the model and thus the interactions learned with DFIM. Despite these mentioned caveats, the case studies we present here showcase the utility of DFIM to provide a nuanced view into the combinatorial code of regulatory DNA sequences through the lens of predictive neural network models.