On the problem of confounders in modeling gene expression

Abstract Motivation Modeling of Transcription Factor (TF) binding from both ChIP-seq and chromatin accessibility data has become prevalent in computational biology. Several models have been proposed to generate new hypotheses on transcriptional regulation. However, there is no distinct approach to derive TF binding scores from ChIP-seq and open chromatin experiments. Here, we review biases of various scoring approaches and their effects on the interpretation and reliability of predictive gene expression models. Results We generated predictive models for gene expression using ChIP-seq and DNase1-seq data from DEEP and ENCODE. Via randomization experiments, we identified confounders in TF gene scores derived from both ChIP-seq and DNase1-seq data. We reviewed correction approaches for both data types, which reduced the influence of identified confounders without harm to model performance. Also, our analyses highlighted further quality control measures, in addition to model performance, that may help to assure model reliability and to avoid misinterpretation in future studies. Availability and implementation The software used in this study is available online at https://github.com/SchulzLab/TEPIC. Supplementary information Supplementary data are available at Bioinformatics online.


Details on TRAP
Extensive details on the mathematical background of TRAP can be found in Roider et al. (Bioinformatics, 2007).
Here, we only provide a brief summary of Section 2.3 of the aforementioned paper.
In TRAP, one assumes that the fraction of TFs bound to a certain genomic location S is at an equilibrium such that the fraction of bound sites p(S) can be denoted as Here K denotes a site-specific equilibrium constant, which depends on the site with highest affinity (S0), a TF specific mismatch energy E(S) and the Boltzmann constant kB: Thus, we can denote p(s) as: The mismatch energy E(S) is computed using a TF motif matrix according to: Here , is an indicator function evaluating to 1 if the considered sequence S has letter α at position i. The most frequent element in the motif matrix is denoted by , . is a parameter used to scale the mismatch energy.
Thus, there are only two sequence and TF-independent parameters R0 and . For details on how these parameters are determined, please consult Sections 2.3 and 3.1 of Roider et al. (Bioinformatics, 2007).
Overall, TRAP computes the expected number N of TFs bound to sequence s with length l by summing up the binding score for each individual binding site in s: Here, W denotes the length of the motif for the TF of interest.

Schematics of feature matrices
In the following, we sketch the content of the feature matrices used for the linear regression setups depending on the used annotation version. Note that the gene expression used as response is not contained in the feature matrix.

Permute
In addition to the feature matrices discussed in the main paper, which are listed in Sup. Section 4, we tested the performance of feature matrices that are scaled according to the maximum value per row (i.e. per gene).
In the DNase1 case the scaled score ̃, for gene g and TF t is computed according to ̃, = , max ∈ ( , ) .
We refer to ̃, as maximized D scores.
Comparably, scaled DS scores are computed as , which we refer to as maximized DS scores.
The corresponding feature matrices are: .

Predicted TF 1 … Predicted TF n
Here, the corresponding feature matrix can be sketched as: Chipped TF 1 … Chipped TF n Gene 1 ̃1 ,1 ̃1 , ... Gene m ̃, 1 ̃, Note that we do not additionally consider scaling per column (i.e. per TF), because the feature matrices are already scaled per column in our regression setup. Results based on these scores are shown in Sup. Figures. 18  and 19.