Uncovering tissue-specific binding features from differential deep learning

Abstract Transcription factors (TFs) can bind DNA in a cooperative manner, enabling a mutual increase in occupancy. Through this type of interaction, alternative binding sites can be preferentially bound in different tissues to regulate tissue-specific expression programmes. Recently, deep learning models have become state-of-the-art in various pattern analysis tasks, including applications in the field of genomics. We therefore investigate the application of convolutional neural network (CNN) models to the discovery of sequence features determining cooperative and differential TF binding across tissues. We analyse ChIP-seq data for MEIS TFs, which are broadly expressed across the mouse branchial arches, and for HOXA2, which is expressed in the second and more posterior branchial arches. By developing models predictive of MEIS differential binding in all three tissues, we are able to accurately predict HOXA2 co-binding sites. We evaluate transfer-like and multitask approaches to regularising the high-dimensional classification task with a larger regression dataset, allowing for the creation of deeper and more accurate models. We test the performance of perturbation and gradient-based attribution methods in identifying the HOXA2 sites from differential MEIS data. Our results show that deep regularised models significantly outperform shallow CNNs, as well as k-mer methods, in the discovery of tissue-specific sites bound in vivo.

For training of transfer-like models, a 1-convolutional-layer CNN is first trained on the regression task, and its convolutional filter parameters are subsequently reused in the classification CNNs. Similarly, a serial model is constructed by training a multilayer perceptron on top of the regression model. In both cases, the transferred parameters are frozen during classification training to prevent overfitting. The parallel models created here used the deep-model settings; however, the concept of parallel training could be applied to any architecture. The input to the model is shared across all tasks, while the outputs are separate. Training is performed in batches by alternating the task-specific outputs. Different batch sizes were used for the regression and classification tasks, together with a scaling factor for the classification learning rate (LRF). LRF divides the regression learning rate (0.001), and its range should be adjusted to approximate the proportion of regression to classification data. The sum of the classification task losses is used as the stopping criterion; the regression loss is not used for early stopping.
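The alternating-batch schedule can be sketched as follows. This is a minimal illustration, not the paper's code: `interleave_task_batches` and `classification_lr` are hypothetical helper names, and the schedule simply pairs one regression batch with one classification batch until the smaller task stream is exhausted.

```python
import numpy as np

def interleave_task_batches(n_reg, n_clf, reg_bs, clf_bs, seed=0):
    """Build an alternating schedule of (task, index_batch) pairs for
    parallel training. Regression and classification batches are drawn
    with their own batch sizes and interleaved; zip() stops when the
    smaller task's batches run out, giving one interleaved epoch."""
    rng = np.random.default_rng(seed)
    reg_batches = np.array_split(rng.permutation(n_reg), max(1, n_reg // reg_bs))
    clf_batches = np.array_split(rng.permutation(n_clf), max(1, n_clf // clf_bs))
    schedule = []
    for r, c in zip(reg_batches, clf_batches):
        schedule.append(("regression", r))
        schedule.append(("classification", c))
    return schedule

def classification_lr(reg_lr=0.001, lrf=10.0):
    """LRF divides the regression learning rate to obtain the
    classification learning rate, as described above."""
    return reg_lr / lrf
```

In an actual training loop, each schedule entry would trigger one optimiser step on the corresponding task head, with the shared trunk updated by both.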
Genomic regions are resized to the desired model input width (constrained between 200nt and 2000nt and rounded to 100nt; randomly sampled for the 1-layer CNN, while for deep models the receptive field is calculated and multiplied by the expand-RF hyper-parameter), one-hot encoded, and shuffled. Classification tasks are optimised using categorical cross-entropy loss after applying a softmax activation in the final layer. Regression is optimised using mean-squared-error loss on log2 RPKM counts. To report test performance, a fifth of the data is held out; the remainder is used for 3-fold cross-validation. During cross-validation, hyper-parameters are sampled uniformly at random from the specified intervals. Models are trained on each of the 3 folds until the validation loss ceases to improve for a specified number of epochs (early stopping). Up-sampling is used to handle imbalanced data (sampling the same number of examples at random from each class). The training loss is recorded at the epoch of lowest validation loss, and both training and validation losses are averaged over the 3 folds. After a given number of hyper-parameter sets has been sampled, the set with the lowest validation loss is selected. The model is then trained on the entire cross-validation data until the corresponding mean training loss is reached, and its test performance is reported on the held-out set.
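The model-selection loop described above can be summarised in a short sketch. Here `train_and_eval` stands in for the real per-fold training routine with early stopping (assumed to return the training loss at the best epoch and the validation loss); the function name and dictionary layout are illustrative only.

```python
import numpy as np

def random_search_cv(train_and_eval, param_space, n_iter, k=3, seed=0):
    """Random hyper-parameter search with k-fold cross-validation.
    `param_space` maps each hyper-parameter name to a (low, high)
    interval sampled uniformly; the set with the lowest mean validation
    loss across folds is returned, along with its mean training loss
    (used later as the target when retraining on all CV data)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_iter):
        params = {name: float(rng.uniform(lo, hi))
                  for name, (lo, hi) in param_space.items()}
        tr, va = zip(*(train_and_eval(params, fold) for fold in range(k)))
        mean_tr, mean_va = float(np.mean(tr)), float(np.mean(va))
        if best is None or mean_va < best["val_loss"]:
            best = {"params": params, "train_loss": mean_tr, "val_loss": mean_va}
    return best
```

The final model would then be trained on the full cross-validation data until `best["train_loss"]` is reached, and evaluated once on the held-out fifth.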
The number of CV iterations is given in table S1; it is equal for all deep models and larger for 1-layer CNNs due to their shorter training time. 50 CV iterations were used for the 1-layer CNN RPKM model. For validating the bottleneck layer (box plots in Fig. 5), 100 iterations of CV were performed on the MEIS RPKM regression dataset, with the 1x layer absent, or present with x0.25, x0.5, or x0.75 dimensionality reduction, with equal probability. If the 1x layer was present, its activation was linear or ReLU with equal probability. These results were then sub-sampled to obtain 30 sets for the MEIS RPKM model with ReLU activation, which were used to train the serial model. For stability validation, every transfer and serial model also included a newly trained base RPKM model.

Receptive field calculation
The receptive field is calculated as the maximum span in the input sequence within which a change can affect the activation of neurons prior to the global pooling layer; see algorithm 1.
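For a plain stack of 1-D convolutions, this span follows from standard convolution arithmetic. The helper below is an illustrative sketch under that assumption (it is not algorithm 1 itself, which handles the paper's specific architecture):

```python
def receptive_field(layers):
    """Receptive field of a 1-D convolutional stack before global
    pooling. `layers` is a list of (kernel_size, stride, dilation)
    tuples in network order. Each layer widens the field by its
    effective kernel extent times the cumulative stride ("jump") of
    the layers beneath it."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += dilation * (kernel - 1) * jump
        jump *= stride
    return rf
```

For example, two stride-1 layers with kernel 3 and dilations 1 and 2 give a receptive field of 7 input positions.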

Reverse complement augmentation
Reverse complement (the sequence of the complementary DNA strand, obtained by swapping C with G and T with A, and reversing the order) is added to augment the least frequent class, BA1-down, doubling its example count. Other down-binding classes are only augmented if their example count is below that of the augmented BA1-down, resulting in partial augmentation of BA2-down. The remaining classes are not augmented. Augmentation is performed after splitting the data into training and validation folds, to avoid leaking examples between them.
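The transformation itself is straightforward; a minimal sketch for both string and one-hot representations (assuming the channel order A, C, G, T):

```python
import numpy as np

def reverse_complement(seq):
    """Reverse complement of a DNA string: swap A<->T and C<->G,
    then reverse the order."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def reverse_complement_onehot(x):
    """For a (length, 4) one-hot array ordered A,C,G,T, the reverse
    complement is a flip along both the position and channel axes."""
    return x[::-1, ::-1]
```

The one-hot form is convenient for augmentation, since it avoids re-encoding the sequence.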

K-mer counting methods
For Homer results, we used the findMotifsGenome.pl module to count motifs in the differential regions, setting the non-differential regions as background. A 200nt input length was used, with motif length k from 5 to 12. The HOXA2 motif is the most confident PWM in BA1-downbinding when counting de novo and appears as (Pdx1(Homeobox)/Islet-Pdx1-ChIP-Seq(SRA008281)/Homer) in the known results. To annotate the regions, the scanMotifGenomeWide.pl module was used with LOD=1 to find the motif genome-wide. Results were intersected with the MEIS BA1-down regions, and the single location with the highest LOD was selected in each region.
To obtain KSM/KMAC results, we ran KMAC with the MEIS BA1-down fasta file, using the non-differential regions fasta as background. Motif lengths 5 to 12 were used. Similarly to Homer, Hox is the first cluster in the resulting prediction. We only considered k-mers in the first cluster and, for each BA1-down region, identified the location of the k-mer with the lowest p-value.

Figure S2. Precision-recall curves for model selection with and without RC augmentation. To increase the class balance of BA1-down, only this class was fully augmented. BA2-down was partially augmented to match the count of the augmented BA1-down. PBA and non-differential classes were not augmented.

Mutagenesis
Mutagenesis scores are obtained by changing each base in a region to its three alternatives and using the scoring function defined in the DeepBind supplementary material (2). For classification models, attribution is obtained from the class-specific activations preceding the softmax layer. Regions larger than 2000nt are trimmed to this size around the centre; for smaller regions, mutations are evaluated only within the original span. If a region is smaller than the input size of the model, it is expanded and centred within the model input. For regions larger than the model input we use a sliding approach, in which the attributed nucleotide is always centred in the input, except when the nucleotide is positioned closer to the region start or end than half of the model input width, in which case the input is not moved past the region span.
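The core perturbation loop can be sketched as follows. The plain score difference used here is a simplified stand-in for the DeepBind scoring function cited above, and the `predict` interface (a callable mapping a one-hot array to a scalar pre-softmax class score) is a hypothetical simplification:

```python
import numpy as np

def mutagenesis_scores(predict, onehot):
    """In-silico mutagenesis sketch: substitute each base with its three
    alternatives and record the change in the model's class score.
    `onehot` is a (length, 4) array; the returned (length, 4) array holds
    score differences, with 0 at each reference base."""
    base_score = predict(onehot)
    length = onehot.shape[0]
    scores = np.zeros((length, 4))
    for pos in range(length):
        for alt in range(4):
            if onehot[pos, alt] == 1:
                continue  # reference base: no mutation, score stays 0
            mutated = onehot.copy()
            mutated[pos] = 0
            mutated[pos, alt] = 1
            scores[pos, alt] = predict(mutated) - base_score
    return scores
```

For a region of length L this costs 3L forward passes, which in practice are batched on the GPU rather than run one at a time as shown.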

Integrated gradients
Similarly to mutagenesis, pre-softmax activations are used; regions are trimmed to a maximum of 2000nt and centred if smaller than the model input. For regions larger than the model input, attribution is performed in strides, with an overlap greater than or equal to 0.25 of the model input width. Nucleotides with overlapping attribution have their scores averaged. When attributing using several references, scores are combined by arithmetic mean. Summation-to-delta in Fig. S5 is calculated as |S_attr − P_d|/|P_d|, where S_attr is the sum of attribution in a region and P_d is the difference in model prediction between the region and a reference. Values are calculated for 100 regions, with 10 references each, and averaged.

Figure S3. Attribution method comparison for BA1-downbinding for all deep learning models.
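The method itself averages gradients along the straight-line path from a reference to the input. A minimal sketch, with `grad_fn` standing in for the framework's automatic differentiation of the pre-softmax class score (a hypothetical interface):

```python
import numpy as np

def integrated_gradients(grad_fn, x, reference, steps=50):
    """Integrated gradients sketch: average the gradient of the class
    score at interpolation points on the straight path from `reference`
    to `x` (midpoint rule), then scale by the input difference."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean(
        [grad_fn(reference + a * (x - reference)) for a in alphas], axis=0)
    return (x - reference) * grads

def combine_references(grad_fn, x, references):
    """Combine attributions over several references by arithmetic mean,
    as described above."""
    return np.mean(
        [integrated_gradients(grad_fn, x, r) for r in references], axis=0)
```

For a linear score the attributions sum exactly to the prediction difference between the region and the reference, which is the property the summation-to-delta metric above measures.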

Attribution peak calling
For all attribution methods used, the results are equal in size to the one-hot input (region length x 4). Following attribution, one-dimensional per-nucleotide scores are obtained by summing the 4 channels. For a given set of regions, the strongest features are identified using a sliding-window approach, in which a range of window lengths can be evaluated (11 to 25 in our tests, in steps of 2). To identify the locations of the strongest features of length k across r regions, we convolve the per-nucleotide attribution scores of each region with a length-k vector of values all equal to 1/k, for all desired values of k, and keep the maximum value for each nucleotide. These scores are then sorted from strongest to weakest. To avoid the sliding window prioritising offsets that avoid strong negative scores, sorting is performed on absolute values of attribution first, after which only a fixed number of the strongest features is kept (20000 for BA1-down regions only and 50000 for all MEIS regions). The remaining features are sorted again based on their original (non-absolute) scores, and non-positive features are discarded (see Fig. S7 for ablation results). For the Poisson and stability tests, we find the single strongest feature per peak from the resulting list, if at least one feature is present in the peak, and select the 25nt region around its centre.
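The windowed scoring step can be sketched for a single region (an illustration of the described procedure, not the paper's code; `window_scores` is a hypothetical helper name):

```python
import numpy as np

def window_scores(per_nt, lengths=range(11, 26, 2)):
    """Per-nucleotide feature score for one region: the mean attribution
    over a centred sliding window, maximised over the evaluated window
    lengths. `per_nt` is the 1-D array obtained by summing the 4
    one-hot channels of the attribution."""
    best = np.full(len(per_nt), -np.inf)
    for k in lengths:
        kernel = np.ones(k) / k  # length-k vector of values 1/k
        smoothed = np.convolve(per_nt, kernel, mode="same")
        best = np.maximum(best, smoothed)
    return best
```

Sorting these scores on absolute value first, truncating, then re-sorting on the signed values gives the two-stage ranking described above.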

Poisson test
The Poisson test is performed on the identified feature locations using the approach from MACS2 (3), with a 500nt window around each feature as suggested in GEM (4). The test is performed separately on the two HOXA2 BA2 ChIP-seq replicates, and p < 0.05 is required in both in order to pass. As in MACS2, we calculate Λ_max = max(Λ_BG, Λ_1k, Λ_5k, Λ_10k), where Λ_BG is the expected lambda value for a random 500nt region, and the remaining lambdas are calculated for the tested region in the corresponding sequence span. We perform total-count normalisation across the IP and Input ChIP-seq to make the values comparable, and use the ppois command from R (3.5.1) to obtain the p-value: ppois(current window count, Λ_max, lower.tail = FALSE).
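The significance call can be sketched without R as follows. This is an illustrative re-implementation, assuming counts have already been total-count normalised between IP and Input; `poisson_sf` reproduces the upper-tail probability that `ppois(..., lower.tail = FALSE)` returns.

```python
import math

def poisson_sf(k, lam):
    """P(X > k) for X ~ Poisson(lam), computed as one minus the CDF
    series sum; equivalent to R's ppois(k, lam, lower.tail = FALSE)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(int(k) + 1))
    return max(0.0, 1.0 - cdf)

def poisson_test(window_count, lam_bg, local_lams, alpha=0.05):
    """A feature passes if its window count is significant against the
    most conservative of the background and local lambdas (a sketch of
    the MACS2-style local test described above)."""
    lam_max = max([lam_bg] + list(local_lams))
    return poisson_sf(window_count, lam_max) < alpha
```

In the pipeline this test would be applied per replicate, and only features passing in both replicates are kept.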

Comparison with LS-GKM SVM
The performance of BA1-downbinding classification was compared to the gapped k-mer SVM method LS-GKM (5), which was evaluated with two types of kernel: wgkm and its radial-basis expansion wgkmrbf. Since the SVM models are binary classifiers, predictive performance was compared with CNN models trained for the binary task of classifying BA1-downbinding regions against the non-differential background. The precision-recall plot is shown in Fig. S9. Model selection for the 1-layer CNN was performed for 50 iterations. The CNN models outperform the SVM on this task even without regularisation with regression data.
For attribution with the SVM, in-silico mutagenesis was used, as well as GkmExplain (6). Results are shown in Fig. S10. For this comparison, non-binary CNN models were trained on the same classification data fold as the SVM; therefore, unlike the main-paper results, this attribution was performed using a single CNN model (fold 0) for the entire BA1-downbinding set. LS-GKM performs well in identifying the most confident features and is similar to a shallow CNN, but is outperformed by the serial model as the number of included regions increases. GkmExplain outperforms the other methods in the top range, but its performance declines in the less confident regions to the advantage of mutagenesis. The SVM with the wgkm kernel outperforms wgkmrbf, and models trained on 600nt regions perform better than those trained on 200nt. Fig. S11 illustrates the strongest features obtained from the SVM model by mutagenesis and GkmExplain. While the latter produces smoother features, this smoothness is perhaps detrimental to precise ranking.

Time of model selection and attribution
Fig. S12 shows the total time it would take to perform model selection, including that of the RPKM models, if a single GPU were used. In practice, model selection was performed in parallel on several nodes. Training times of models with known hyper-parameters are given in table S1. The majority of the time spent creating serial and transfer-like models is consumed by model selection for the base models, which can be reused for subsequent tasks. In training the shallower models, a larger batch size can noticeably speed up the process due to better utilisation of GPU memory.
The comparison of model creation time between CNN, SVM, and Homer was performed on a binary dataset; the results are shown in table S5. Although deep learning models can take a long time to train, 1-layer CNNs for direct classification can be created from scratch in around an hour, which is much faster than LS-GKM on this dataset. However, comparing creation time between models is approximate, as the optimisation is performed on different hardware. In this work, neural networks were trained on a cluster of Nvidia V100-SXM2 GPUs and Intel Skylake CPUs, with 8 CPU cores per GPU, benefitting from GPU acceleration in both training and attribution. SVM and Homer were trained on Intel Broadwell CPUs with a total of 28 cores. Since these models do not require extensive hyper-parameter tuning, a single training time is reported.
The time of attribution per 1000 regions is given in table S6. CNN models were tested at their native input size, with a batch size close to the limit imposed by GPU memory (16GB).

MEIS up-binding features in PBA
HOX factors are known for posterior prevalence, a phenomenon in which posterior HOX TFs (expressed in more posterior tissues) override the function of more anterior HOX factors. The parallel 3-task model was used to identify MEIS co-binding partners in PBA. Unlike in BA2, MEIS exhibits several co-binding partners in the PBAs, examples of which are illustrated in Fig. S13.
To cluster the features based on the underlying sequences, the locations of the highest-ranked sites were identified in the PBA-upbinding regions using a 25nt sliding window and sorted by attribution sum. K-mer counting was performed in the 100 thousand strongest features. Each feature was assigned the Homer annotation of the PWM with the highest sequence correlation. Fig. S14 shows the maximum alignment of PBA-upbinding features with the PWMs.

Figure S13. Example features of MEIS up-binding in PBA. MEIS exhibits several co-binding partners in this tissue. HOXA3 and GATA were validated by ChIP-seq. Forkhead (FOX) and HAND were validated by alignment with the known PWM annotated by Homer.

Figure S14. Correlation of the strongest PBA-upbinding features with PWMs identified by Homer. Features were obtained from a 3-task parallel model using mutagenesis. Maximum cross-correlation with the underlying sequence and its reverse complement was calculated and averaged across 500 ordered regions.
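The PWM-alignment step can be sketched as follows (an illustrative helper, not the paper's code; both arrays are (length, 4) with channel order A, C, G, T, and the "correlation" shown is the raw sliding dot product between the PWM and the one-hot sequence):

```python
import numpy as np

def max_pwm_correlation(onehot, pwm):
    """Maximum cross-correlation of a PWM with a one-hot sequence and
    with its reverse complement, taken over all alignment offsets."""
    def best(x):
        length, width = x.shape[0], pwm.shape[0]
        return max(float((x[i:i + width] * pwm).sum())
                   for i in range(length - width + 1))
    rc = onehot[::-1, ::-1]  # reverse complement of the one-hot encoding
    return max(best(onehot), best(rc))
```

Assigning each feature the annotation of the PWM with the highest such score gives the clustering described above.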

MEF2D differential binding
MEF2D is a TF broadly expressed in a wide range of cells, regulating the development of tissues including neuronal, muscle, and retina (7). It was shown to cooperate with CRX in mouse retina, allowing binding to non-specific motifs (8). Mouse ChIP-seq data was analysed (retina: 3 replicates; cortical neurons: 4 experiments; myotubes: 2 replicates), and regions of up-binding and down-binding were identified. Sample counts are shown in table S8. For the baseline, 1-layer CNN classifiers were trained directly. For RPKM regression, a 1-layer CNN was trained for use with transfer-like models, and a deep CNN was trained for serial models, using between 1 and 5 dilation blocks in model selection (final setting = 4). The predictive performance of regression is shown in table S7, indicating that the deep CNN improves on the 1-layer model. Classification performance is shown in Fig. S15 and tables S9 and S10. Models regularised with regression achieve significantly higher AUC than the CNN trained directly. Deep, serial models exhibit higher PR-AUC than transfer-like models for most classes. The transfer-like CNN shows increased recall of the non-differential class and achieves the highest F1 score in the down-binding task. The cortical neuron up-binding and retina down-binding classes show smaller variation in performance, likely due to low example counts and weak motif enrichment (the top motif p-values identified by Homer are 1e-18 and 1e-19, respectively, in those classes, counting against a shuffled background).
Both transfer-like and serial models were subsequently used to obtain features of each class. In particular, Fig. S16 illustrates two example regions where differential features of MEF2D up-binding in retina consist of neighbouring MADS-domain and CRX binding sites. Example features for all classes are shown in Fig. S17. MADS binding sites appear as features for all up-binding classes, suggesting that co-factors and the surrounding sequence are important for specific differentiation. The models uncover known co-factors, such as CRX in retina (8) and MYOD in myotubes (7). Other possible co-factors were also identified, such as BATF in myotubes and cortical neurons (or another AP-1 factor, previously reported to interact with MEF2C (9)), OTX in myotube down-binding (a CRX-like factor), and IRF4 in cortical neuron up-binding (reported to interact with MEF2 in B-cells (10)).

Figure S15. Test-set precision/recall curves for the MEF2D differential tasks. The 1-layer CNN was trained directly on the classification data. Transfer and serial models used regression data for regularisation.

Figure S16. Example regions of MEF2D differential up-binding in mouse retina compared to myotube and cortical neuron tissues. Differential features were identified by a serial model using mutagenesis. PWMs for MEF2D and CRX were aligned to the underlying sequence and are shown at the position of highest cross-correlation. CRX RPKM features are shown below, as well as the CRX ChIP-seq profile in expanded regions. RPKM features were obtained from a model trained on CRX ChIP-seq data (not used to train the differential model).

Figure S17. Example features of MEF2D differential binding in mouse retina, myotube, and cortical neuron tissues, identified by transfer-like or serial models using mutagenesis. The Homer annotation of the enriched k-mers is shown below the differential tracks.