Universal correction of enzymatic sequence bias reveals molecular signatures of protein/DNA interactions

Abstract Coupling molecular biology to high-throughput sequencing has revolutionized the study of biology. Molecular genomics techniques are continually refined to provide higher resolution mapping of nucleic acid interactions and structure. Sequence preferences of enzymes can interfere with the accurate interpretation of these data. We developed seqOutBias to characterize enzymatic sequence bias from experimental data and scale individual sequence reads to correct intrinsic enzymatic sequence biases. SeqOutBias efficiently corrects DNase-seq, TACh-seq, ATAC-seq, MNase-seq and PRO-seq data. We show that seqOutBias correction facilitates identification of true molecular signatures resulting from transcription factors and RNA polymerase interacting with DNA.


INTRODUCTION
The field of molecular genomics emerged as classical molecular biology techniques were coupled to high-throughput sequencing technology to provide unprecedented genomewide measurements of molecular features. Molecular genomics assays, such as DNase-seq (1,2), ChIP-exo (3) and PRO-seq (4,5), are converging on single-nucleotide resolution measurements. The enzymes that are routinely used in molecular biology and cloning have inherent and often uncharacterized sequence preferences. These preferences manifest more prominently as the resolution of genomic assays increases. Therefore, we developed seqOutBias (https://github.com/guertinlab/seqOutBias) to characterize and correct enzymatic biases that can obscure proper interpretation of molecular genomics data.
Enzymatic hypersensitivity assays, such as DNase-seq (1,2), TACh-seq (6) and ATAC-seq (7), have the potential to measure transcription factor (TF) binding sites genome-wide in a single experiment. These assays strictly measure enzymatic (DNase, Tn5 transposase, Benzonase or Cyanase) accessibility to DNA and not a specific biological event, making data challenging to deconvolve. Standard algorithms scan for footprints, which are depletions of signal in larger regions of hypersensitivity (8)(9)(10)(11)(12). Many TFs, however, do not exhibit composite footprints if enzymatic cut frequency is averaged at all ChIP-seq validated binding sites with strong consensus motifs (10)(11)(12)(13). Moreover, the inability to detect a footprint at any individual TF binding site results in high false negative rates for footprinting algorithms (14). Accurate footprinting is also confounded by the artifactual molecular signatures that result from enzymatic sequence preference (10)(11)(12). DNase footprinting algorithms can incorporate DNase cut preference data to abrogate this bias (12,14,15). However, no existing tools specialize in correcting intrinsic sequence bias for a diverse set of enzymes and experimental methodologies.
We find that correcting for enzymatic sequence bias highlights true molecular signatures that result from TF/DNA interactions. Despite the limitations of enzymatic hypersensitivity footprinting and sequence bias signatures, hypersensitive regions reveal a near-comprehensive set of functional regulatory regions in the genome (16). Therefore, we present seqOutBias, which calculates sequence bias from an aligned BAM file and corrects individual reads accordingly. While this software does not directly infer TF binding, correction of sequence bias provides a more accurate measurement of three key features of enzymatic hypersensitivity data: (i) raw peak height; (ii) footprint depth; and (iii) true molecular signatures. These measurements, taken together with DNA sequence, can be used to develop algorithms that infer TF binding genome-wide. Moreover, footprint depth and the presence of true molecular signatures are unique to each TF and these features should be characterized for each TF using corrected data in order to optimize TF-binding inference algorithms.

Sequence bias correction
Enzymes that are commonly used in molecular biology have nucleic acid preferences for their substrates and the sequence at or near the active site of the enzyme typically dictates enzymatic preference. Let the sequence context be defined by a k-mer proximal to the start of the detected sequence read. A sequenced read corresponds to a k-mer observation if it occurs at a specific offset with respect to the edge of the k-mer ( Figure 1A). Assuming a systematic k-mer dependent bias, the true read count will be a scaled version of the observed read count, that is: where, for position i, R i, j is the true read count, O i, j is the observed read count and α( j ) is the scale factor, or bias, corresponding to k-mer j.
In an unbiased setting, the observed frequency of k-mers should be proportional to their genome-wide counts or a regional subset of the genome. For example, condensed chromatin may restrict enzyme access and these can be excluded from the genomic k-mer counts. Let I(m, j ) be the indicator function that genome position m is assigned to k-mer j, then we have: where M is the total number of observable genome positions and N R is the total read count. Further taking the observed total read count N O to be N O ≈ N R , then α( j ) can be approximated by: The seqOutBias software aims to correct sequence biases by scaling the aligned read counts by α( j ), which is effectively the ratio of genome-wide expected read counts to the observed sequence counts for each k-mer. The seqOutBias software additionally takes into account mappability, which means that observable positions can differ per strand, thus we compute a separate value of α( j ) on each strand. A B Figure 1. SeqOutBias overview and parameter definitions. (A) An enzymatic cleavage event that results in a blunt end can be detected by sequencing the upstream or downstream DNA (red bases). The hexamer sequence centered (red block) on the nick sites (dotted vertical lines) confers specificity; this parameter is referred to as the k-mer. The plus-offset and minusoffset parameters specify the nick site relative to the first position and last position of the k-mer. As opposed to specifying the immediate upstream base for the minus strand, we shift the base position by +1 to match the first position of the plus aligned read. (B) This panel illustrates the high-level overview of the inputs, intermediate files and output of the seqOutBias program and the computation steps that the program performs. The tallymer step indexes the reference sequence (FASTA) and computes mappability for the given read length. The seqTable step parses the reference sequence together with the mappability information to compute the k-mer that corresponds to each possible read alignment position. The tabulate step tallies the k-mer counts across the selected regions (or the full genome), as well as the k-mers corresponding to observed aligned reads (if a BAM file is supplied). At last, scale computes the genome-wide aligned read pile-ups, scaling sequence reads by the expected/observed k-mer frequency.

Computational workflow of seqOutBias
The seqOutBias software uses a genome FASTA file and an aligned and sorted BAM file as inputs. Each intermediate step within seqOutBias corresponds to a seqOutBias subcommand, which permits them to be run individually or separately on different machines. Due to the large size of genomic datasets, seqOutBias can read compressed FASTA files.

PAGE 3 OF 12
Nucleic Acids Research, 2018, Vol. 46, No. 2 e9 Computing genomic mappability. In the implementation of seqOutBias, our algorithm first calculates the expected sequence detection frequency for each k-mer by determining the positions in the supplied reference genome that are uniquely mappable for a given sequence read length. It is important to compute mappability for a read length because the algorithm should only count k-mers that correspond to genomic positions that have the potential to be detected based on their unique genomic mappability when it calculates the expected k-mer detection frequency from the FASTA file. SeqOutBias invokes GenomeTools' tallymer to compute mappability at each position in the genome (17,18). First, the reference genome FASTA file is indexed using GenomeTools' Tallymer program (17,18). Next, Tallymer computes unique mappability for each position in the genome for a given input read length. This step corresponds to a seqOutBias subcommand, tallymer, which can be run individually or on different machines. The tallymer subcommand computes the mappability information, parses the reference sequence to compute k-mer indexes, and creates a mappability file for a given read length ( Figure 1B). This process consists of three parts: (i) creating a suffix tree; (ii) creating a genome index; and (iii) creating the mappability file. These processes are the most computationally intensive steps and seqOutBias will recognize the existence of intermediate files in the directory to avoid unnecessary recomputation. For instance, if the seqOutBias tallymer step is executed for different read lengths, but using the same FASTA file, then the first suffix-tree portion is re-used across invocations.
Mapping k-mer indexes to the aligned read positions. The next step, seqOutBias seqtable, creates an intermediate table that combines mappability information, read length and plus/minus offsets ( Figure 1A) to map k-mer indexes to the aligned read positions ( Figure 1B). The seqtable subcommand parses the reference sequence (FASTA) together with the mappability information to compute the k-mer that corresponds to each possible read alignment position. The resulting binary file stores this information in a compressed form that can be easily used for subsequent computation steps, as well as storing the corresponding parameters (read length, k-mer size, and cut-site offsets). This intermediate file reduces the amount of computation needed when processing aligned read files and provides an intermediate TBL file that decouples the reference sequence processing from the remaining steps.
Tallying the k-mer counts in the reference sequence and the aligned reads. The resultant TBL file is an input for the seqOutBias tabulate subcommand, which tallies the k-mer counts across the selected regions (or full genome), as well as the k-mers corresponding to observed aligned reads from the BAM file. In contrast to other methods (10)(11)(12)15), these numbers are used to scale the reads without the need for Naked DNA to calibrate (14). This subcommand produces a k-mer count table based on the TBL sequence information and the optional sorted BAM file. Counts correspond to the entire genome by default, but counts can be constrained to specific regions by supplying a BED file with the regions option. When no BAM file is supplied, the output will have four columns: k-mer index, k-mer string, plus strand count and minus strand count. If a BAM file is supplied, the output will have two additional columns with the plus and minus strand counts of observed aligned reads.
Scaling individual sequence reads. The final subcommand, seqOutBias scale, computes the genome-wide aligned read pile-ups and scales them by the expected/observed k-mer detection frequency. This produces the corrected aligned read pile-ups, both as BED and bigWig files. This command provides flexibility in the output, including the -shift-counts and --tail-edge options. The --shift-counts option shifts minus strand pile-up positions to align with the plus strand pile-up, making reads from both sides of a cleavage site pile up at the same position regardless of whether the upstream or downstream sequence was detected by sequencing ( Figure 1A). This option is used when enzymatic cleavage of individual sites can result in a single base shift depending on whether the nicking event was detected by sequencing the upstream or downstream DNA (red nucleotides in Figure 1A). The tail-edge option outputs the 3 end of the reads; this option is used primarily for analysis of PRO-seq data (4,5). Therefore, seqOutBias reads compressed files (FASTA, mappability information and sorted BAM files), reuses intermediate results, and allows for flexibility in specifying sequence features for data correction.

K-mer mask optimization
In seqOutBias, the k-mer sequence that is recognized by the enzyme to confer specificity is characterized by three parameters: k-mer size and a pair of offsets for the plus and minus strands ( Figure 1A). These parameters enable flexibility and seqOutBias works with enzymes that have a variety of recognition site lengths. The k-mer mask parameter of seqOutBias restricts which positions contribute to the bias correction. A gapped k-mer enables the user to capture bias due to distal contributions and ignore uninformative positions that are more proximal. Positions within the mask that are ignored are represented by an X and informative positions by an N. This parameter provides an alternative way to specify the position intervening between the first base sequenced and the base directly upstream by inserting a C in the mask string. For example, a possible 8-mer that spans 16 bp could be represented as NNXX NNXX CX XNNX XNN; likewise, NNNCNNN, would represent a recognition site with kmer-size = 6, plus-offset = 3 and minus-offset = 3 ( Figure 1A). We developed two approaches to guide the choice of the k-mer mask.
The first of these approaches, implemented in the Rust program kmer mask em (https://github.com/guertinlab/ kmer mask em), is aimed at assays like DNase-seq and TACh-seq where enzymatic digestion liberates free DNA ends and the digestion event can be detected by sequencing the upstream or downstream sequence. The second approach, implemented in R (https: //github.com/guertinlab/seqOutBias/tree/master/docs/R), aims to flatten the composite profiles of the input molecular genomics data at TF position-specific weight matrices (PSWM) and is offered as an alternative that imposes less constraints on the k-mer mask. k-mer mask optimization via enzyme cut-site model. In the k-mer mask optimization via enzyme cut-site model method, we model the enzyme cut bias as a PSWM of length K, over the sequence surrounding the cut site. Each k-mer corresponds to a possible cut-site with unknown orientation. We apply expectation maximization (EM) to infer the PSWM from the table of k-mer counts.
Let the PSWM be defined as where p j, b be the probability of observing the nucleotide b at cut-site position j andb to be the complement of nucleotide b (e.g. b = A,b = T). Given a table of k-mer counts, let x i be the sequence of k-mer i occurring n i times, z i ∈ { f wd, rev} be unknown orientation of x i and P(Z i = f wd) = γ i . The full likelihood of the model is depicted in Supplementary  Figure S1, thus the full log-likelihood is given by: which results in the simple definitions for the EM steps: E-step: use Bayes rule to get the posterior of Z, i.e. com- M-step: update the model parameters: with: In practice, we add a pseudo-count to the values of A j (b) of 0.1. Furthermore, since EM does not guarantee a global optimum, the kmer mask em program computes the optimization from multiple random starting points, taking the best global result as the optimum.
The inferred PSWM is used to infer a sequence of k-mer masks of increasing complexity by making use of the Information Content score at each position. Positional Information Content refers to the relative difference in observed sequence at that position relative to what is expected, so high Information Content means that a position is likely to influence enzyme preference. Bases that have a value lower than a given Information Content threshold are excluded from the mask. Finally, the resulting 'forward' mask is combined with it's own 'reverse' mask. Therefore, a position in k-mer mask is unmasked if it is unmasked in either the forward or the reversed mask. This is necessary since during the execution of the seqOutBias program we do not know the orientation of each specific site.
This approach is ideal for assays where a single enzyme, such as DNase-seq and TACh-seq, will free DNA ends and sequencing will produce reads in both directions. The resulting matrix provides candidate positions that are influencing the enzyme specificity. This approach: (i) assumes that the mask is symmetric; (ii) requires a full counts table (all positions unmasked) as input; and (iii) requires multiple runs of the same computation with random starting sites, which are automatically done in parallel, to ensure a reasonably good global optimum for the PSWM.
k-mer mask optimization using profiles at PSWMs and hill climbing optimization. We implemented a hill climbing method to optimize the k-mer mask. This method takes a starting k-mer mask and a set of read count tables (one for each TF) as inputs to guide a greedy search over the space of possible k-mer masks. The metric we use to evaluate kmer masks aims to measure the effect these masks have on the TF composite profile at the binding sites.
A TF composite profile is the position-wise sum of the scaled read counts across the set of binding sites. That is, for k-mer mask m and a given TF t, let c i, j (t, m) be the scaled read count at position j of the binding site i, then the TF profile is the vector G t (m) = [g j (t, m)] where: We define our metric M t (m), as the sum of the per-TF profile standard deviation: Let H(m) be the set of masks that differ from m by only one X being changed to N (i.e. by unmasking an additional position). The search procedure, given a starting mask m 0 (for example the mask of all X, indicating that all positions in the mask are excluded from the k-mer), is simply the iteration: stopping when there are no more masked X positions in the current mask.
This empirical approach requires many k-mer mask evaluations, which correspond to complete runs of se-qOutBias. Our implementation allows running multiple instances of seqOutBias in parallel (see mc.cores parameter in https://raw.githubusercontent.com/guertinlab/ seqOutBias/master/docs/R/seqOutBias hcsearch.R).

SeqOutBias application programming interface (API)
We structured the code into two parts: a main library (seqoutbiaslib) and the command line program (seqOutBias) implemented using the main library. This split allows the code to be reused to implement different interfaces with similar functionality or as a component in a larger program. The library is exposed as both a Rust library and a C library.

PAGE 5 OF 12
Nucleic Acids Research, 2018, Vol. 46, No. 2 e9 The Rust interface includes everything used to build the main program. The code is split into a series of modules which correspond to the subcommands of seqOutBias: tallyrun--code to execute GenomeTools (17) to produce the mappability file tallyread--code to read and access the mappability information seqtable--code to read and write seqtable files to disk fasta--code to read in the FASTA file, combine it with the mappability information, and produce the seqtable file (via calls to the seqtable module) filter--code to filter BAM records based on features such as length, quality, etc. counts--code to tabulate k-mer counts scale--code to compute read pile-ups and scale them appropriately bigwig--code to write the chromInfo and wiggle files and convert them to a bigWig file using the wigToBigWig program The C API exposes the ability to generate a seqtable file and query pile-ups in memory without the need to write them to disk. The C API exposes two query functions, one to query specific genomic coordinates and one that returns a full chromosome as an array. The C library and the corresponding header files are built as part of the main compilation process and can be linked as typical for C libraries. Functions can be grouped into four sets: i) Functions to manage the seqtable generation parameters. ii) A function to create a default set of pile-up generation parameters. iii) A function to generate the seqtable file. iv) Functions to generate and query the pile-ups.

Deproteinized DNA ATAC-seq
The naked DNA ATAC-seq library was prepared as previously described (7) with several modifications: (i) we used purified genomic DNA, as opposed to crude nuclei isolations; (ii) we omitted IGEPAL CA-630 from all buffers; and (iii) we performed PCR cleanup using AMPure XP beads to select DNA <600 bp. The naked DNA ATAC-seq data were deposited in the Gene Expression Omnibus (GEO) database, with accession number GSE92674.

Correction of individual DNase-seq reads
DNase-seq measures the accessibility of the phosphodiester backbone of DNA at single-nucleotide resolution (1,2,9,19). Composite DNase-seq profiles that are centered on sequence motifs of TF binding sites accentuate molecular features that inform on TF binding properties. For example, DNase footprints are defined as depletions of sensitivity within large regions of hypersensitivity; footprints align with TF recognition sites and result from TF interactions with DNA (20,21). High-throughput DNase-seq experiments described a cleavage pattern at the footprint that was interpreted as a measure of TF/DNA interactions (9); however, subsequent work attributed these artifactual signatures to differential substrate specificity of DNase conferred by the presence of the TF motif (10-12). As a result, some footprint detection programs now incorporate sequence biases into their algorithms (12,14,15,22). SeqOut-Bias provides the option to correct enzymatic sequence bias prior to footprint detection and the output files can be used with future footprinting algorithms.
Previous studies used a hexamer (11) and a tetramer (10) centered on the DNase cut site to account for the intrinsic sequence bias of DNase. We systematically explored how the individual bases within a 10-mer contribute to the preference of DNase (see 'Materials and Methods' section). We found that there was little Information Content beyond position 3 from the DNase cut site (Supplementary Figure  S2A and B). This EM method identifies candidate positions that contribute to enzyme specificity based on sequence content, so we sought to directly test how each position in the mask contributes to the smoothing of the composite profiles. For each TF PSWM, we computed the standard deviation for the profile obtained by summing the scaled reads across all sites at each position in the PSWM (see Methods). We summed these standard deviations across a set of PSWMs as a metric to determine the contribution of each position to DNase preference. We found only a modest improvement beyond tetramer correction (Supplementary Figure S2C).
We scaled individual reads based on the preference of DNase using seqOutBias and a 6-mer correction factor ( Figure 2). Figure 2A illustrates that DNase prefers to nick the sequence CCTTGC and the read associated with this window was reduced to an intensity of 0.15. DNase disfavors nicking of GGGGAA, thus the read associated with this hexamer was scaled to an intensity of 5.6.
DNase sequence preferences are most apparent in composite profiles of DNase cut frequency surrounding TF motifs. We tested the efficacy of 6-mer correction on DNasedigested naked DNA (23); corrected profiles of naked DNA digestion should not exhibit footprints or molecular signatures that result from protein/DNA interactions. We observe that sharp peaks and troughs are smoothed in the corrected composite profiles for ELF1, GATA3 and MAX motifs ( Figure 3).
True signatures that result from TF/DNA interactions are not smoothed by seqOutBias. For instance, ChIP-seq validated CCCTC-binding factor (CTCF) binding sites ex-  hibit strong footprints and composite profiles at CTCF motifs highlight a sharp signature upstream of CTCF binding (24). This CTCF signature is unaffected after correcting for DNase intrinsic sequence preference (12). We plotted DNase-seq profiles at GATA3 and MAX binding sites to determine whether true molecular signatures are apparent after intrinsic bias correction (Figure 4). We observe a clear composite footprint at MAX binding sites in chromatin, as expected, this footprint is not present in the naked DNA digestion ( Figure 4A). The MAX footprint is obscured by sharp peaks of hypersensitivity (sequence artifact signatures) in the raw uncorrected traces ( Figure 4A). Individual footprints also exhibit these sharp peaks at the site of TF binding in raw data. At a ChIP-seq validated MAX binding site we observe a footprint only after correcting for DNase sequence bias (Supplementary Figure S3). We observe a sharp DNase signature upstream of GATA3 binding sites, which is present only in the chromatin digested samples ( Figure 4B). We conclude that this molecular signature is a result of GATA3/DNA interactions, because this peak is neither smoothed following seqOutBias correction nor present in the naked DNA DNase digested sample. Note that GATA3 does not have an appreciable composite footprint, but TF inference algorithms may use TF-specific signatures, as we observe for GATA3, to inform on TF occupancy and binding intensity. The sharp peak with raw DNase-seq data within the GATA3 motif obscures footprints at individual GATA3 binding sites ( Figure 2B and Supplementary Figure S4). Bias correction enhances both the footprint and the signature upstream of the GATA motif (Supplementary Figure S4). Therefore, correction of intrinsic DNase sequence bias highlights true molecular features: footprints and sharp hypersensitivity peaks. We propose that these features can be systematically characterized for all TFs and used as informative priors when inferring TF binding profiles genome-wide from enzymatic hypersensitivity data.

Correction of TACh-seq, MNase-seq, ATAC-seq and PROseq data
We characterized and corrected the biases of Benzonase and Cyanase using Tissue Accessible Chromatin (TAChseq) data (6). TACh-seq is a variant of traditional enzymatic hypersensitivity assays whereby frozen tissue samples are treated with either Benzonase or Cyanase endonuclease. Benzonase is an endonuclease cloned from Serratia marcescens that functions as a dimer and Cyanase is a non-Serratia monomeric enzyme. These enzymes are more highly active under high salt and high detergent conditions, so these enzymes are more suited for digestion of solid tissue sample, which requires harsh dissociation treatments. We corrected TACh-seq data generated from frozen mouse liver tissue (6). Composite profiles from CEBP-beta, FOXA2 and CTCF binding sites (25)(26)(27) in mouse liver indicate that an eight base pair mask centered on the nick site is sufficient to correct both Cyanase and Benzonase biases (Supplementary Figures S5 and S6). Next, we applied seqOut-Bias correction to MNase-seq data generated from MCF-7 cells (28). An eight base pair mask abrogates the intrinsic sequence bias of MNase-seq data (Supplementary Figure  S7). ATAC-seq is unique among enzymatic accessibility assays because each transposition event inserts two sequencing adapters into the chromatin. Each Tn5 molecule can be pre-loaded with any combination of the paired-end 1 and paired-end 2 adapter. Reads that align to the plus and minus strand are processed separately because the Tn5 recognition site is distinct for plus and minus reads. We applied seqOutBias correction to published ATAC-seq data from GM12878 cells (7). We generated and analyzed naked DNA libraries using the ATAC-seq work flow to measure Tn5 specificity in the absence of chromatin (GEO accession: GSE92674). We optimized the k-mer mask for ATAC-seq data by starting with a k-mer mask of 12 X bases flanking each side of the Tn5 insertion site, then we systematically changed each X position into a masked N using a hill climbing mask optimization method (see Methods). We chose the position that results in the lowest summed standard deviations across a set of PSWMs and iterated until we found the top 11 positions that contribute to Tn5 sequence bias (Supplementary Figure S8). We used the N positions of the 8mer NX NX XX CX XNNX NNNX XN for the ATAC mask because these are the most influential for Tn5 recognition of plus strand reads (Supplementary Figure S8). The reciprocal mask, NX XNNNXNNX XCXX XNXN, is the optimal 8-mer mask for minus strand reads. The sharp ATAC-seq spikes at the sites of TF binding for SP1, REST and EBF1 (29) are reduced in the corrected data ( Supplementary Figures S9 and 10). The complex nature of Tn5 recognition and dual loading of adapters, taken together with the incomplete smoothing of ATAC composite profiles, suggests that a simple spaced k-mer correction may not be sufficient to fully correct Tn5 bias.
PRO-seq couples terminating nuclear run-on assays with high-throughput sequencing to quantify engaged RNA polymerase molecules genome-wide at nucleotide resolution (4). Sequence composition of transcripts may affect run on efficiency, therefore, the sequence immediately downstream of RNA polymerase may influence detection of RNA molecules. The sequence upstream of RNA polymerase could affect ligation efficiency because T4 RNA ligase treatment may exhibit sequence preference. We used se-qOutBias to scale published PRO-seq data from K562 cells (30). We specifically used annotated transcripts to calculate expected k-mer frequency, as opposed to genomic kmer frequency, because the vast majority of transcription occurs within gene annotations (31). We found that a kmer mask that spans the last three bases of the ligated RNA molecule and the three bases downstream from RNA polymerase is sufficient to correct the PRO-seq data (Figures 5 and 6). RNA polymerase density decreases at the polypyrimidine tract upstream of the 3 splice site, which suggests an increased RNA Polymerase elongation rate at this tract ( Figure 6). These data indicate that in addition to U2AF (32,33) recognizing the pyrimidine residues in the pre-mRNA polypyrimidine tract, this tract increases RNA polymerase elongation rate.
We also observe a sharp peak at position −3 from the 5 end of exon. This sharp peak is absent using seqOutBiascorrected reads (Figure 6), therefore we hypothesized that  this peak results from either inefficient adenine incorporation during the nuclear run-on or a preference for cytosine or uracil during either the run-on or ligation reaction. We classified 3 splice acceptor sequences into AAG, CAG and TAG to generate composite RNA polymerase profiles-note that very few splice acceptor sequences are GAG, so they were excluded. The sharp peak at position -3 is exclusively found in the CAG splice acceptor profile (Supplementary Figure S11), which indicates that cytosine is preferentially incorporated during the run-on or preferentially ligated. We examined the composite profiles at corrected CAG and TAG splice acceptor sites and we observed that RNA polymerase density is higher following CAG splice acceptor sites (Supplementary Figure S12). This result indicates that the consensus splice acceptor site, CAG, decreases RNA polymerase elongation rate in the 5 region of the exon. Genome-wide binding data for CTCF is available for K562, GM12878, mouse liver and MCF-7 cells (27,29). Upon correcting for enzymatic sequence bias, the sharp signature artifacts at CTCF motifs are abrogated in each molecular genomics dataset we tested ( Figure 5). The naked DNA profiles for ATAC-seq and DNase-seq are not restricted to CTCF-bound sites; all genomic CTCF motifs are included in these composites ( Figure 5). In the chromatin TACh, DNase and ATAC experiments, we observed protection resulting in a footprint and a sharp peak upstream of the CTCF motif. Taken together, we show that seqOut-Bias effectively corrects enzymatic sequence bias resulting from a diverse set of molecular genomics experiments.  Figure 5. SeqOutBias corrects sequence bias at CTCF binding sites associated with DNase (9,23), Tn5 Transposase (ATAC) (7), Benzonase (TACh) (6), Cyanase (TACh) (6), MNase (28) and T4 RNA ligase (PRO) (4). Upon correcting for enzymatic sequence bias, the artifactual spikes at the CTCF binding site are abrogated in each molecular genomics dataset we tested. However, in cases of CTCF binding to chromatin, we observe protection that results in a footprint; note that MNase is not expected to result in a composite footprint. We observe the previously characterized sharp peak upstream of the CTCF motif and this molecular signature is likely caused by CTCF-mediated enhancement of cleavage activity. . SeqOutBias corrects sequence bias associated with the 3 splice site recognition motif. Upon correcting for enzymatic sequence bias, the artifactual signature at the 3 splice site is abrogated. The first base of the exon spans position 0-1 on the x-axis and the sequence bias peak is at position −3.

Enzymatic DNA end repair and ligation bias
We found that the bases upstream and downstream of a DNase nick site are not equally likely to be detected by sequencing (red nucleotides in Figure 1A). In Figure 1A, for GATGTC we would expect the ratio of reads that begin with GACCAGATGACA (plus strand) and ATCATATCCCGT (minus strand) to be approximately equal to one if this site was nicked repeatedly and there was no enzymatic end repair and ligation bias. We performed this analysis for all instances of each NNNGAC-mer in the genome ( Figure 7A) and all 4096 pairwise combinations of 3-mers ( Figure 7B). Palindromic 6-mers are balanced (Supplementary Figure  S13), but for most 3-mer combinations we identified a preference for which 3-mer is detected by sequencing, we term this 'detection bias. ' We took the top 5% most skewed 6- mers and generated a composite motif. This motif indicates that an Adenine in position 4 of the 6-mer is preferentially sequenced compared to the oppositely oriented nucleotide in position 3 (Supplementary Figure S14). Preparing digested DNA for Illumina high-throughput sequencing requires several enzymatic treatments. T4 DNA Polymerase treatment removes 3 overhangs and fills in 3 recessed (5 overhang) ends. T4 Polynucleotide kinase phosphorylates the 5 end and Klenow Fragment (3 to 5 exo-) adds a single 3 adenine overhang. We hypothesized that the overhanging sequences dictate the detection bias, because the detection bias is distinct for Benzonase and DNase (Figure 7C, top panel). Although four nick events are necessary to sequence a DNA molecule, enzymatic (DNase, Cyanase and Benzonase) hypersensitivity assays only detect one nick on each end of the molecule and it is impossible to determine the precise location of the other nicks. By assuming that two enzymes with similar nick specificity (Supplementary Figure S15) will have comparable distribution of sequence overhangs, we can test the hypothesis that the overhang sequences contribute to post-nicking enzymatic treatment biases. We compared this post-nicking bias using DNase-seq data from two different labs and two different organisms (Supplementary Figure S15). We also compared the detection bias of Cyanase and Benzonase, which have similar sequence preferences (Supplementary Figure S15). Indeed, digestions with enzymes that have similar nick preferences, which results in comparable distributions of overhanging sequences, have highly correlated detection biases ( Figure 7C, bottom two panels and Supplementary Figure