Correction of transposase sequence bias in ATAC-seq data with rule ensemble modeling

Abstract Chromatin accessibility assays have revolutionized the field of transcription regulation by providing single-nucleotide resolution measurements of regulatory features such as promoters and transcription factor binding sites. ATAC-seq directly measures how well the Tn5 transposase accesses chromatinized DNA. Tn5 has a complex sequence bias that is not effectively scaled with traditional bias-correction methods. We model this complex bias using a rule ensemble machine learning approach that integrates information from many input k-mers proximal to the ATAC sequence reads. We effectively characterize and correct single-nucleotide sequence biases and regional sequence biases of the Tn5 enzyme. Correction of enzymatic sequence bias is an important step in interpreting chromatin accessibility assays that aim to infer transcription factor binding and regulatory activity of elements in the genome.


INTRODUCTION
Chromatin accessibility assays measure the relati v e frequency that exogenous enzymes access DNA.Chromatin accessibility is not a direct measure of molecular features of chromatin such as transcription factor occupancy or histone modification status.Howe v er, accessibility is considered a proxy measurement of regulatory element activity irrespecti v e of the constellation of factors bound to the DNA ( 1 , 2 ).Accessible regions are enriched for transcription factor binding and histone modifications that are hallmarks of functional cis -regulatory elements (3)(4)(5)(6)(7).Deconvolution of genome-wide chromatin hypersensitivity data to infer individual transcription factor binding e v ents remains a challenge ( 8 ).
ATAC-seq revolutionized the chromatin accessibility field by providing a straightforward method that r equir es fewer than 5000 cells ( 9 ).ATAC-seq leverages a hyperacti v e Tn5 transposase that directly inserts high throughput sequencing adapters into accessible DNA to create sequencing libraries.The analysis of ATAC-seq data r equir es additional considerations beyond traditional hypersensitivity assays because the molecular biology of transposase function is distinct compared to DNase and other enzymes that are used to measure chromatin accessibility ( 10 , 11 ).
Although using Tn5 to measure chromatin accessibility is experimentally easier than using DNase, the dimeric Tn5 enzyme recognizes a wider region when interacting with DNA and this is reflected in the sequence bias of Tn5 ( 12 ).Characterizing and correcting enzymatic sequence bias is an essential step for accurate interpretation of chromatin accessibility data (13)(14)(15).Strategies to correct biases use a variety of sequence inputs to build models, including kmer instances, position weight matrices, and long stretches of DNA.DNase bias can be directly scaled based on the 6-mer sequence centered on the DNase cleavage site (16)(17)(18)(19).The advantage of direct k-mer scaling is the simplicity: r eads ar e scaled by the e xpected / observ ed k-mer count ratio.If a k-mer is found less often than expected by chance, then the read is scaled to a higher value.Howe v er, direct k-mer scaling does not effecti v ely correct Tn5 biases from ATAC-seq data ( 16 , 18 , 20 ), so other methods were de v eloped to correct Tn5 bias.ATACorrect scales individual ATAC-seq reads based upon a dinucleotide weight matrix r epr esenting Tn5 bias and introduces simulated reads to offset observed biases ( 21 ).Similar to weight matrix scaling, HINT-ATAC employs position dependency models, which account for interactions between weight matrix positions to model ATAC-seq bias ( 12 ).SELMA encodes k-mer data into a simple x v ector, which is incorporated into a Hadamard matrix for all mono and dinucleotide combinations.These data are input into a linear regression model to capture and correct both DNase and Tn5 bias ( 22 ).More sophisticated methods that correct ATAC-seq data are often less interpretable than k-mer and weight matrix scaling.Seqbias tr ains a Bay esian network to encode nucleotide pr efer ences using a 40 base window centered on the Tn5 recognition site ( 23 ).MsCentipede uses naked DNA cleavage to train a Bayesian multi-scale model of a Poisson distribution of reads to account for sequence bias ( 24 ).Another method 2 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 2 employs a convolutional neural network to account for intrinsic sequence biases ( 25 ).Among bias correction software packages, seqOutBias is the only existing method that scales each individual input read and does not couple the bias correction to downstream processing steps.
We previously found that direct k-mer scaling corrects the majority of nuclease sequence bias.We de v eloped seqOut-Bias to implement k-mer scaling for molecular genomics data.The seqOutBias package is stand-alone software that specializes in sequence bias correction for a range of kmer lengths and gapped k-mers.SeqOutBias output files can be piped into programs that specialize in peak calling and footprint inference ( 26 ), but seqOutBias performs poorly with ATAC-seq data.Here, we expanded the se-qOutBias package to accommodate ATAC-seq data by coupling seqOutBias output to a rule ensemble modeling frame wor k tha t ef fecti v ely scales indi vidual ATAC-seq reads to correct Tn5 bias.This modeling approach captures complex interactions between k-mers and quantifies the importance of individual positions that contribute to o verall Tn5 bias.Moreo ver, the importance of k-mers and positions are intrepretable locally for each position in the genome, because the rules and terms applied to each aligned r ead ar e explicit.This r eproducib le wor kflow efficiently corrects single-nucleotide resolution Tn5 sequence bias and addr esses r egional baseline bias determined by GC content.To facilitate bias correction of chromatin accessibility data, we de v eloped a wor kflow that can be applied to existing high-throughput sequencing analysis pipelines.The maintained and updated repository for this work will be within the lab's GitHub repository: https://github.com/guertinlab/Tn5bias .The archi v ed Zenodo link for the analysis at the time of submission will be stable at the following link: https: //doi.org/10.5281/zenodo.7757436 .

Reproducible analysis and plotting code
We provide a detailed and reproducible vignette to reproduce all analyses and figure panels directly from raw data her e: https://github.com/guertinlab/Tn5bias/tree/master/ Manuscript Vignette .We provide a workflow vignette using only chr21 ATAC-seq reads, the chr21 r efer ence genome, and estrogen receptor motifs as a quick-start reference document: https://github.com/guertinlab/Tn5bias/tree/master/seqOutATACBias workflow Vignette .

Determining sequences around each cleavage site
All aligned reads (plus, minus and unseparated) were used individually as input for unscaled seqOutBias ( 16 ) runs, which generated bigWig and bed format output files.We deconvolved independent reads that aligned to the same genomic coordinate into separate bed file entries, then converted to sequence files using fastaFromBed ( 35 ).Sequences corresponding to minus-aligned reads were reverse complemented before conca tena ting them with the sequences corresponding to plus aligned reads.

Counting positional nucleotide frequencies for each enzyme
We determined nucleotide counts at each position relati v e to the start of a read by loading all the sequences into R, separating sequences into columns, and tallying nucleotide bases at each position.Results were output in TRANSFAC format and input into Weblogo ( 36 ).

Plotting background nucleotide fr equency-corr ected sequence logos
We desired background corrected information content values for our sequence logos generated from Weblogo ( 36 ).We retrie v ed these values by modifying the source code of the Weblogo command line interface.A step-by-step guide on the modifications we made to retrie v e these values is included in the vignette on Github.As a coherence check, we calculated the background corrected information content values for each position.First, we calculated the Shannon entropy: Here, H ( l ) is the entropy at any gi v en position, and f ( b , l ) is the frequency of a base ( b ) at this position ( l ).Subtracting this value from 2 is the classic calculation for information content.We next calculated the background entropy: w here H ( bac kground ) is the correction for background nucleotide frequency.And f ( background , b ) is the background frequency ( background ) for a base ( b ) Thus, f or an y position the background corrected information content is: These values were plotted using the Weblogo command line interface.

Determining enzyme bias motif genomic coordinates
Starting with the TRANSFAC format nucleotide count files, we used transfac2meme ( 37 ) to convert to the meme file format.We then used these files as input into FIMO to generate region sets for each bias motif using the appropriate r efer er ence genome ( 38 ).The highest scoring 400 000 r egions wer e used for each composite profile plot.

Plotting composite signal of genomic coordinates
We visualized the average signal at genomic coordinates by aligning plus and minus locations on the same position and retrieving the signal in the plotted interval from a gi v en data set's bigWig file, using the bigWig R package ( https://github.com/andrelmartins/bigWig).We then divided all signal in this interval by the number of genomic coordinates within the plot to calculate the average cut or insertion (for Tn5) frequency at this position relati v e to the motif.

Determining background nucleotide frequencies of r efer ence genomes
Background nucleotide frequencies for reference genomes were calculated by using the grep command to count the occurrences for a gi v en nucleotide.

Calculating scale factors for k-mer position
Genomic and data set k-mer frequencies were determined by using the seqOutBias table ( 16 ) command to tally each k-mer in the r efer ence genome and those in the data sets for each k-mer size and position.We then used these k-mer counts to calculate scale factors for each k-mer by dividing the expected k-mer frequency in the genome by the observed k-mer frequency in the data set ( 16 ).

Calculating observed and expected upstream k-mers
Starting with the sequences flanking each insertion in the Tn5 data set in fasta format, we counted each 3-mer in relevant positions from the cut site to determine observed kmer occurrence.To determine expected k-mer occurrence, nucleotide frequency from the sequence logo motif at each position was m ultiplied to gether, in all possible combinations, to construct each possible k-mer.This value is the expected frequency for each k-mer; we plotted the log 2 ratio of observed to expected k-mers.

CAG peak direction
We generated a list of all 3-mer locations in the genome by invoking the seqOutBias dump ( 16 ) command using a 3-mer .tblfile (output from previous invocations of seqOutBias table ) for input.From this file, we determined the genomic location of all CAG instances in the reference genome.We then plotted ATAC signal at 400 000 random CAG instances from this bed file.Rationally designed masks were implemented using the seqOutBias ( 16 ) software with the noted k-mer masks.

T r anscription factor motif genomic interval determinations
Motifs for each transcription factor included in the test and training sets were downloaded from the JASPAR ( 39 ) database in meme format.These motifs were then used as input, along with the hg38 r efer ence genome, into FIMO to output genomic regions conforming best to the motif.We took the top 400 000 highest scoring genomic instances for each motif for both the plus and minus strands and used them as input for the rule ensemble model.

Rule ensemble target input
We calculated the target values by plotting the composite signal at each transcription factor's genomic coordinates using unscaled data pr oduced fr om seqOutBias ( 16 ), using the --no-scale option.bigWig files were accessed and plotted using the bigWig R package ( https://github.com/andrelmartins/bigWig ).These plotted values were normalized using a common factor, to preserve variation between motifs and allow for accurate rule ensemble prediction.

K-mer frequency calculation
We determined k-mer frequency using the 400 000 genomic locations for each strand of each transcription factor motif previously generated using FIMO .From these locations, we retrie v ed the sequences using fastaFromBed ( 35 ).We split each sequence into k-sized slices for each k-mer.For example, using a 5-mer, the sequence 'AAACCAAA' would be split into: AAA CC,AA CCA,A CCAA,CCAAA.Each of these sections are a position within the original sequence: position 1-AAACC; position 2-AACCA; etc.For each position, we then determined the frequency of each k-mer among the original 400 000 input sequences.

Rule ensemble independent variable input
We computed the rule ensemble input for each combination of transcription factor motif region set and kmer size / position.As we previously determined k-mer frequency at each position of a composite profile, we m ultipl y these frequencies by the inverse of the scale factor for each matching k-mer, which was also described above.Finally, we sum these values for all possible k-mers at a position for the input value.This process was repeated for each modeled position in the composite corresponding to each included k-mer size and location relati v e to the cut site.These values are the seqOutBias predicted values output for a k-mer size and position, and modeled genomic regions.

Rule ensemble modeling
After calculating the independent variable input, a rule ensemble model was trained to predict the training set biased target values using the prediction rule ensemble package in R ( 40 ).This package implements the original rule ensemble modeling frame wor k ( 41 ).The output rules and scaling coefficients were recorded as a single equation for later implementation.

Rule ensemble model implementation
We implemented the rule ensemble model by combining scaled seqOutBias ( 16 ) output as defined by the model.
We first aggregated the required seqOutBias output by combining output bed format files using unionbedg software ( 35 ).These output values were then combined by applying the modeling equa tion a t e v ery read to generate a rule ensemble-scaled bed file.This bed file was then scaled to the same total read depth as the original, unscaled data.Finally, we converted the bed file to bigWig f ormat f or subsequent analysis using the bedGraphToBigWig command line interface ( 42 ).

k-mer importance formalization
Calculations for the importance of a gi v en input variab le are conducted by the PRE package in R. The formalized method is imported with little modification from the accompanying paper ( 40 , 41 ).
In order to calculate input k-mer position importance, the importance of individual rules and linear terms in which it appears must be determined.We can calculate linear term importance using the formula: Here, l j ( x j ) indicates the linear term for predictor variable x j (k-mer frequencies of a gi v en size and position relati v e to cut site multiplied by inverse scale factors for the same kmer size and position), sd signifies standar d de viation, and y r epr esents output, or bias pr edicted by the model.The term | ˆ b j | is the absolute value of the k-mer's coefficient in the trained model.
To calculate the global importance of a gi v en rule, we use the equation: Similar to a linear term's importance, s v (1 − s v ) evaluates to the rule's sample standard deviation.This value is measured by first determining s v , or the support for rule v , which is the fraction of training data for which the rule's conditions are satisfied.
We can calculate the support for a gi v en rule ( r v ) using the expression: To evaluate the total importance of a predictor k-mer, we sum all rules in which it applies, in addition to its linear term.This is calculated in the following expression: In which the variable c v is equal to the number of r equir ements for rule v .This results in the importance of any gi v en rule being equally divided between the various k-mers included in the rule.

Calculating positional importances from rule ensemble models
We calculated variable importances, for each position within our range of inputs to visualize how the rule ensemble model combined input positions.Each k-mer's importance was deri v ed from the model and we calculated the importance of each position by summing the values of each position for all k-mers.We performed this calculation for DNase, howe v er Tn5 k-mers were not equally distributed at each input position and this caluclation was not appropriate.We scaled the normalized Tn5 positional importance by dividing the sum of values by the number of inputs that included the respecti v e position in the input masks.

Calculating impr ov ement of rule ensemble modeling compared to seqOutBias
We dir ectly compar ed the improv ement of rule ensemb le modeling over seqOutBias by first determining the absolute value difference between scaled output from either method and the calculated unbiased output.This value was determined for each position in our test set composite profiles.Each of these values is the difference between either method and perfect bias correction at a gi v en position --a value of 0 means the method perfectly corrected bias at this position.For each position we subtracted the rule ensemble output from the seqOutBias ; for e v ery value above 0, rule ensemble modeling outperforms seqOutBias correction.

Tn5 sequence bias is more extensive than other nuclease biases
ATAC-seq measures how well Tn5 transposase accesses DN A in chromatin, w hich is a proxy measurement for regulatory element activity.Although chromatin condensation is the main contributor that influences Tn5-mediated DNA cleavage, the local sequence content biases transposition activity ( 10 ).A dimeric Tn5 complex nicks opposite strands of DNA 9 base pairs apart, and each monomer inserts a sequencing-compa tible adapter a t each nick position (Supplementary Figure S1A) ( 43 , 44 ).The interaction of Tn5 with a precise genomic region results in reads aligning to two converging genomic coordinates 9 bases apart on opposite strands.Ther efor e the field shifts the r eads aligning to either r efer ence genome strand so that a common position anchors the Tn5 recognition site regardless of the orientation of the adapter.We refer to the center nucleotide of this 9 base insertion site as the central Tn5 recognition base , in contrast to a traditional nuclease which cleaves a single position between two adjacent bases.Shifting ATAC-seq reads results in base-pair resolution data that precisely identifies the center of the Tn5 recognition site.
Detection of a single Tn5-inserted adapter necessitates an independent Tn5 insertion within a pproximatel y 500 bases.Only half of the Tn5 insertion e v ents are capab le of amplification, e v en when an independent insertion occurs within 500 bases (Supplementary Figure S1B ,C ,D).To specify this central base, we shift the forward-strand aligned reads downstream by 4 bases and we shift the re v erse-strand aligned reads upstream by 4 bases (Supplementary Figure S1E).Most current ATAC-seq analysis workflows shift the re v erse-strand aligned reads upstream by 5 bases so that the data looks more similar to the chromatin accessibility predecessor DNase-seq.Howe v er, this approach r equir es an independent shift of one base for re v erse-aligned sequences to ensure that a single DNA cleavage e v ent is specified by a single genomic coordinate (Public Twitter correspondence: https://twitter.com/jef fvierstra/sta tus/1396900282634625025 ).
All enzymes that recognize nucleic acids as substrates have sequence biases.These sequence biases are quantified as the observed frequency of each base at individual positions relati v e to a cleavage site (or the centrall y reco gnized base in the case of transposases) compared to the expectation of random genomic cleavage.Enzyme bias motifs are r epr esented by a position probability ma trix tha t reports the fraction of each base observed at each position relati v e to the cut site, which is most easily visualized as a seqLogo ( 36 , 45 ).We confirm, as others have shown, that Tn5 favors CG-rich DNA and the seqLogo is a re v erse complement palindrome, which is because Tn5 recognizes DNA as a dimer (Figure 1 A) ( 10 , 12 , 21 , 23 ).The nucleases Benzonase , Cyanase , DNase , and MNase have distinct biases that span fewer positions compared to Tn5 (Figure 1 A) (30)(31)(32).In addition to Tn5 having a wider bias window, the cumulati v e information content within the window is high-est for Tn5 compared to the nucleases (Figure 1 A).Importantly, we consider background nucleotide frequency when calcula ting informa tion content.If we did not backgroundcorrect, then random cleavage would be interpreted as an AT bias because A / T bases account for 59% of the genome (Supplementary Figure S2A).Previous methods that accurately correct enzyme sequence biases use k-mer scaling ( 16 ).Howe v er, k-mer scaling does not effecti v ely correct Tn5 bias because of the wide bias window and the limitation that large k-mers occur rarely in both the observed and expected k-mer counts.
If nucleotide sequence far from the Tn5 recognition site center influences cleavage and insertion, then we would expect that k-mers found at positions distal from Tn5's centrall y reco gnized base would deviate substantially from random expectation.We quantified the influence of distal and proximal k-mers on cleavage bias by plotting the log 2 ratio of observed 5-mer frequency to expected 5mer frequency for all 1024 5-mers; expected frequency is based on the genomic frequency of each 5-mer.The log 2 (observ ed / e xpected) ratio is zero if a 5-mer is observed at the expected frequency.We find that the distribution of this ratio becomes more tightly centered around zero as we query more distal 5-mers (Figure 1 B).The distribution does not stabilize for Tn5 until we exceed 11 bases from the Tn5 reco gnition site, w hile the other four enzymes stabilize within 4 bases of the cleavage e v ent.This analysis confirms Figur e 2. K-mers ca ptur e mor bias complexity than matrices and provide a for which to build sophisticated bias-correction models.( ) We quantified the frequency that all 64 3-mers occur at positions -4 to -6 center of a recognition site and observed the weight matrix prediction.bar chart of log observed by (from k-mer prevalence in the bias position probability matrix) kmer frequency highlights the interdependency between sequences different positions.( B ) We plotted the signal from 400,000 randomly selected 'CAG' instances in the genome.Overlaid on top of the unscaled signal, each plot shows that the individual peaks from composite profile can be corrected by rationally designed spaced scaling, on the position of the scaling k-mer and the peak.that distal sequences more strongly influence Tn5-mediated cleavage compared to the other enzymes, and indicates that a k-mer spanning at least 21 base pairs is r equir ed to model Tn5 bias effecti v ely.
The background corrected sequence motif (Figure 1 A) for Tn5 is a re v erse palindrome, which is common for homodimeric DNA binding factors ( 46 , 47 ).The consensus Tn5 recognition sequence contains CAG trimers present on each strand of the DNA, each CAG is displaced from the adjacent re v erse complemented CAG by 5 bases.This unique feature suggested that Tn5 dimers may interact with this sequence motif in multiple orientations to direct the transposition of adapters with higher frequencies.We plotted the read-depth normalized composite profile of each enzyme's bias motif to compare their relati v e strength of dir ecting DNA cleavage (Figur e 1 C).The Tn5 recognition site directs a central peak that is substantially higher than the other enzymes.The composite profiles also highlight peaks 5 and 10 base pairs downstream and upstream of the centrall y reco gnized base (Figure 1 C).These flanking peaks are not observed in the other nucleases, although they are comparable in intensity as the central peak of other enzymes.Although not definiti v e, this pattern is consistent with Tn5 dimers interacting with opposite sides of the DNA helix, displaced by fiv e bases.Taken together, the length of the Tn5 bias motif, the strong influence of distal k-mers, the stronger cutting pr efer ence, and 5-base spaced cutting pat-tern all preclude conventional k-mer based approaches for correcting Tn5 sequence bias.

Tn5 sequence bias is complex and not modeled well by previous methods
A key assumption of r epr esenting sequence biases as probability matrices or seqLogos is that any nucleotide observed at a position within the motif does not influence the probability of observing nucleotides in other positions, this property is r eferr ed to as positional independence ( 6 , 48 ).If each position is independent from the others, observed k-mer frequencies should be equal to those expected based on nucleotide frequency in the position probability matrix.To test Tn5 sequence bias position independence, we determined the observed frequency of each 3-mer found in position -6 to -4 relati v e to the central Tn5 recognition base and divided by the expected fr equency (Figur e 2 A).We find that 3-mers are observed at frequencies that deviate from the calcula ted expecta tion.We can infer basic features describing which 3-mers are disfavored (NAA) and preferred (NCA), but characterizing a comprehensi v e set of basic rules f or man y k-mer combinations and k-mer positions is not straightforwar d.Howe v er, since k-mer scaling does consider dependencies between positions, k-mer scaling remains an attracti v e starting point for de v eloping more complex bias correction models.We generate the rule ensemble input by combining seqOutBias scaling output with k-mer frequencies observed at transcription factor motifs found throughout the genome.This input is then used to train a rule ensemble model to predict the enzymatic sequence bias.Finally, the model is tested for its ability to correct this bias.( B ) For each training motif, we identify the 400,000 occurrences in the genome that conform most stringently to the weight matrix.We extract sequences (200 bases) in the region for each identified motif and calculate the frequency of each k-mer at each position relati v e to the motif center.We perform independent runs of seqOutBias to calculate scale factors for each k-mer in each position within a defined window from the center of the motif.K-mer frequency for each position is then multiplied by the inverse scale factor for e v ery frame of r efer ence.These ar e the rule ensemble input values for each motif / k-mer (size and position) pairing.( C ) We train a rule ensemble model to predict the biased signal measured at each of these region sets in a deproteinized data set.( D ) Output from seqOutBias corresponding to all input frames of r efer ence ar e then combined according to the rule ensemble model to generate a single BED or bigWig file with bias-corrected values for each sequence read.Successful bias correction is evaluated using a held-out test set of TF motifs.
Next we tested the feasibility of using rationally spaced kmers to systematically reduce Tn5 bias.The 3-mer 'CAG' is over r epr esented in three positions in the Tn5 bias and each CAG bias is displaced by 5 bases (Figure 1 A).We measured the composite ATAC-seq signal in the window centered on 400 000 random CAG instances in the hg38 genome assemb ly.As e xpected, we observ e that the CAG 3-mer directs cutting in three peaks, each displaced from one another by 5 bases (Figure 2 B).Conventional k-mer scaling is performed with any reasonably short k-mer size ( < 9 bases) and the kmer can be positioned at any distance from the observed Tn5 recognition site.In an effort to determine whether kmer scaling would effecti v ely flatten these three peaks, we used seqOutBias to scale reads using 3-mers that are loca ted a t the Tn5 recognition site and up / down stream by 5 bases.Recall that each peak corresponds to a peak of Tn5 recognition and that each peak is relati v e to the CAG 3mer that anchors the plot at position x = 0. Ther efor e, we expect that scaling based on the k-mer located 5 bases upstream will flatten the +5 peak because the CAG directing this downstream peak is located 5 bases upstream.Likewise, the central k-mer scaling would abolish the central peak and downstream k-mer scaling would ablate the upstream peak.This ex er cise indica tes tha t we can rationally design k-mer scaling to correct known biases, but it also highlights the fact that multiple k-mers are needed to correct biases e v en within the context of this simple example.Ther efor e, we pursued a rule ensemble modeling approach to integrate multiple k-mer sizes from a range of positions relati v e to the Tn5 recognition site as input data and determine whether interactions between these inputs contribute to Tn5 insertion bias.

Rule ensemble modeling lever ages inter action terms and kmer scaling to correct sequence biases
We previously developed seqOutBias , which directly scales chromatin accessibility data by e xpected / observ ed kmer frequency.Although we specify seqOutBias to scale based on a 6-mer centered on the DNA cleavage e v ent for DNase, one can specify any k-mer length and relati v e kmer position.Our data indicate that Tn5 bias is too broad to scale with a single k-mer and this suggests that bias correction would need to incorporate information from se v eral k-mer positions.Ther efor e, we chose to use seqOutBias scaling outputs with dif ferent invoca tions of k-mer positions and lengths as the input for the rule ensemble model, and to test its capacity to correct Tn5 bias (Figure 3 A).Sequence biases were first characterized by quantifying enzymatic cut frequency in average composite profiles that are centered on transcription factor binding motifs ( 14 , 15 ).We pre viously de v eloped a model (seqToSign) that reproduced these composite profiles by scaling the genome-wide cut frequency of each k-mer found in a gi v en position in the composite by the number of occurrences of the k-mer at that position ( 15 ).We attempted a similar approach to predict biased composites by using the inverse of seqOutBias scale factors as an input, since this value r epr esents the genomic cut frequency for a k-mer.For each k-mer we multiplied its frequency at each position in the composite by the inverse of that k-mer's corresponding seqOutBias scaling factor.We summed these values for all 4 k k-mers at each position in the composite profile (Figure 3 B).Each of these values at each position in the composite is a covariate input for the rule ensemble model and we repeated the process for many k-mer length and relati v e k-mer position combinations (Supplementary Figure S3A).We trained a rule ensemble model using these covariate inputs to predict multiple transcription factor composite profiles.The output of this trained model is the formulaic combination of k-mer positions using linear r egr ession coefficients and decision tr ees (Figur e 3 C).To test the efficacy of the model, we first combined the scaled read values from each k-mer covariate according to the output formula.This generated a single rule ensemble-scaled read file which we tested for its ability to correct test group composite profiles to resemble theoretically unbiased genomic cleavage (Figure 3 D).

Choosing training and test transcription factor motifs
Transcription factors recognize a di v ersity of DNA sequences that vary by length, degeneracy, and nucleotide content.We chose training and test data for the model by selecting sequence motifs based on these features (Figure 4 A).In a sequence logo r epr esentation of DNA binding pr efer ence, the information content is a function of sequence length and degeneracy.We used hierarchical clustering of these values for 43 input motifs that r epr esent a wide range of information content and GC content, encompassing well-characterized binding motifs (Figure 4 B).Groupings consisted of either doublets or triplets, with a single singlet.Within a group, a single motif was assigned as testing data and the others were assigned as training data.

Rule ensemble modeling corrects DNase sequence bias
We chose rule ensemble modeling because the importance of covariate inputs are easily interpreted in the overall model and for any individual model prediction.Since DNase-seq biases are w ell-characterized, w e have an expectation of the importance of positions that lead to DNase specificity.We chose to test the feasibility of this model using naked DNA DNase-seq data.We invoked seqOut-Bias with staggered 5-mers within 9.5 bp of the cut site as input for the model.The DNase rule ensemble model confirmed that the 3 bases on either side of the cleavage e v ent most strongly influence DNase cleavage (Figure 5 A).
We visualized the correction of single nucleotide bias by calculating the log 2 ratio of theoretically unbiased signal to either unscaled, direct k-mer scaling (seqOutBias), or rule ensemble output within the region of ±10 base pairs for each motif (Figure 5 B & Supplementary Figure S4A).Our results indica te tha t dir ect k-mer scaling corr ects DNase bias well, but rule ensemble modeling outperforms k-mer scaling (Supplementary Table S1).This is most easily illustrated by observing individual composite profiles (Figure 5 C & Supplementary Figure S4B).Although improvement is comparable between direct k-mer scaling and rule ensemble modeling, we sought to directly compare the two more rigorously.We calculated the absolute value of the difference between each corrected value and random cleavage for each point within 10 bp of each motif for both bias correction methods.This metric represents how much the values deviate from perfect bias correction for each method ( seqOutBias and rule ensemble).For each position, we subtracted the rule ensemble deviation from unbiased cleav age v alue from the seqOutBias deviation from  We visualized improvement of rule ensemble modeling over seqOutBias by using a violin plot of distance from calculated random cut frequency for rule ensemble output subtracted from seqOutBias .Positions which rule ensemble outperforms seqOutBias will be above 0 (dashed red line); 68% of positions in this plot were improved.
unbiased cleavage and plotted the differ ence (Figur e 5 D).Each point r epr esents the differ ence in improvement between seqOutBias and rule ensemble modeling.For each point greater than 0, the rule ensemble modeling outperforms seqOutBias (Figure 5 D).We find that 68% of all points showed improvement, indicating that rule ensemble modeling generally outperforms direct k-mer scaling for correcting DNase bias.

Rule ensemble modeling corrects local Tn5 transposition bias
Since this modeling approach corrected DNase bias, we developed a rule ensemble model to correct bias from deproteina ted ATAC-seq da ta.Since Tn5 bias is more complex than DNase, we ran seqOutBias for all contiguous 5mers , 6-mers , 7-mers and spaced 6-mers within 19 bases of the DNA cleavage / insertion site for the model input.In total, we included 663 distinct k-mer combinations as inputs (Supplementary Figure S3A).The number of input variables was much higher compared to DNase, so we prioritized the input cov ariates b y first modeling the data using linear r egr ession to determine the most influential k-mers (formalized calculations are included in the Methods), then we incorporated the most influential 10% of these variables into a full rule ensemble model to predict the biased signal for each training motif (Figure 3 C).The relati v e importance values for each nucleotide position re v ealed three central peaks that are separated by 5 bases (Figure 6 A).Comparison of individual composite profile traces highlights the improvements (Figure 6 B & Supplementary Figure S5A&B).We systematically measured how bias-corrected signal deviated from random DNA cleavage at sequence motifs ( ±10 bases) and compared to unscaled and k-mer scaled correction (Figure 6 C).The rule ensemble bias correction outperforms k-mer scaling with respect to reducing variance and scaling reads to more accurately approximate random DNA cleavage (Supplementary Table S2).For each individual position, the rule ensemble model outperforms k-mer scaling 78% of the time (Figure 6 D).Examination of the 22% of positions with worse performance re v eals the magnitude by which the model is outperformed is modest compared to the magnitude of improvement for the other 78% of instances.This can be visualized by comparing the distribution above zer o (impr ov ement) v ersus below zero (outperformed by k-mer) in Figure 6 D. Visual inspection of the composite traces of the 18 test motif composite profiles illustrates that rule ensemble models effecti v ely flatten these profiles (Figure 6 B & Supplementary Figure S5B).The traces approach the theoretical random cleavage red dotted line more closely than k-mer scaling alone (Figure 6 B & Supplementary Figure S5B).Therefore, the local sequence bias, which we define as the region within 10 bases of sequence motifs, is more effecti v ely corrected with rule ensemble modeling.We visualized improvement of rule ensemble modeling over seqOutBias by using a violin plot of distance from calculated random cut frequency for rule ensemble output subtracted from seqOutBias .Positions which rule ensemble outperforms seqOutBias will be above 0 (dashed red line); 78% of positions in this plot were improved and those that are not improved have lower magnitude deviations from the expectation of random cleavage.

Rule ensemble modeling corr ects r egional sequence biases caused by GC-content
We de v eloped these models and measured correction of enzyme biases using these composite motif profiles because this visual r epr esentation is an intuiti v e way to observe enzyme biases in genomic assays.As we move away from the sequence 'anchor' in DNase-seq composite profiles, the traces flatten and approach the expectation of random cleavage because the sequence content in these regions is more random (Figure 7 A).Unlike DNase-seq, the composite profiles for ATAC-seq do not approach random expectation, e v en at distances 100 bases from the sequence motif center.The Tn5 recognition site is GC-rich (Figure 1 A) and it is known that AT / GC richness is clustered throughout the genome ( 27 ).We hypothesized that Tn5 pr efer ence for GCrich regions would lead to generally elevated signal within these regions, accompanied by depleted signal in AT-rich regions.Ther efor e, we determined whether there is a relationship between motif GC content and this regional 'baseline signal'.Motif GC content linearly correlates with baseline signal (Figure 7 B, C).Rule ensemble modeling outperforms seqOutBias for 88% of the positions when regional GC bias correction (Figure 7 D, E) and scaling within the motif (Figur e 6 C) ar e consider ed together.We find that rule ensemb le corrections improv e scaling compared to seqOut-Bias in terms of both reducing baseline mean and variance (Supplementary Table S3).Rule ensemble modeling effecti v ely corrects Tn5's regional and local sequence biases.

DISCUSSION
Transcription factor binding sites and promoters are the most inter esting r egions of the genome with respect to gene r egulation.ATAC-seq signal corr elates with the regulatory potential of these regions.ATAC-seq is a simple experimental assay, but analysis of the data r equir es dedicated pipelines , specialized software , and unique considerations ( 11 ).Importantly, Tn5 sequence bias was described in the first ATAC-seq paper ( 10 ) and many groups hav e de v eloped methods to characterize and correct these biases.Approaches often combine bias correction and footprinting or propose models that cannot be easily interpr eted.Her e, we provide a workflow that directly scales individual ATACseq reads to correct Tn5 bias and provides common output files that can be used for peak calling or footprinting.The rule ensemble modeling approach that we employed is interpretable globally, in that we identify the positions relati v e to Tn5 recognition that influence the model.Moreover, the model is interpretable locally for any individual scaled read.If a rule does not a ppl y to the local sequence read, then the rule drops out of the model and does not contribute to that individual instance of read scaling.Enzymatic sequence pr efer ences must be considered when interpreting molecular genomics data, particularly when the data is single-nucleotide resolution.We introduce this rule ensemb le frame wor k as a novel method that efficiently scales individual sequence reads to correct Tn5 bias.

Figure 1 .
Figure1.Tn5 sequence bias is more complex than nuclease sequence bias.( A ) The seqLogo sequence bias motifs ar e corr ected for background nucleotide content and they illustrate that Tn5 bias is wider and more complex than other commonly used chromatin accessibility enzymes.Nucleotide frequencies listed in the inset to the right correspond to the highest information content position; f or instance, G is f ound at position -4 45% of the instances that Tn5 inserts into DNA.Nuclease cleavage sites are indicated by dashed red line.Total information content (IC) from positions -10 to 10 is listed in the inset to the upper left.( B ) We plotted the log 2 (observ ed / e xpected 5-mer frequency) for all positions surrounding DNA cleavage sites or Tn5 recognition sites as box and whisker plots.( C ) We plotted read depth normalized composite signal from various molecular genomics assays for the top 400,000 sites that conform most stringently to the respecti v e enzyme's bias motif.

Figure 3 .
Figure3.Rule ensemble modeling of enzymatic bias combines k-mer scaling approaches to enhance bias correction.( A ) We generate the rule ensemble input by combining seqOutBias scaling output with k-mer frequencies observed at transcription factor motifs found throughout the genome.This input is then used to train a rule ensemble model to predict the enzymatic sequence bias.Finally, the model is tested for its ability to correct this bias.( B ) For each training motif, we identify the 400,000 occurrences in the genome that conform most stringently to the weight matrix.We extract sequences (200 bases) in the region for each identified motif and calculate the frequency of each k-mer at each position relati v e to the motif center.We perform independent runs of seqOutBias to calculate scale factors for each k-mer in each position within a defined window from the center of the motif.K-mer frequency for each position is then multiplied by the inverse scale factor for e v ery frame of r efer ence.These ar e the rule ensemble input values for each motif / k-mer (size and position) pairing.( C ) We train a rule ensemble model to predict the biased signal measured at each of these region sets in a deproteinized data set.( D ) Output from seqOutBias corresponding to all input frames of r efer ence ar e then combined according to the rule ensemble model to generate a single BED or bigWig file with bias-corrected values for each sequence read.Successful bias correction is evaluated using a held-out test set of TF motifs.

Figure 4 .
Figure 4. Clustering of transcription factor motifs creates di v erse test and training sets.( A ) We clustered transcription factor motifs into test and training sets based on motif information content and GC content.We indicate clusters with transparent ovals.Training and test sets are indicated as red or blue text tha t indica tes the motif name.( B ) The seqLogos of training and test transcription factor motifs illustrate their di v ersity.

Figure 5 .
Figure 5. Rule ensemble modeling effecti v ely corrects DNase bias.( A ) Rule ensemble modeling provides information about the importance of variable in the model.The positions proximal to DNase cleavage are most important.( B ) Violin plots quantify the log 2 ratio of unbiased signal to output for each transcription factor's composite profile gi v en the method of k-mer scaling.Regions measured in each plot are within ±10 base pairs of each motif, as indicated by the graphic above the figure panel.( C ) We plotted composite profiles of transcription factors from rule ensemble output compared with direct k-mer scaling of the 5-mer that encompasses the fiv e most influential positions in panel A. ( D )We visualized improvement of rule ensemble modeling over seqOutBias by using a violin plot of distance from calculated random cut frequency for rule ensemble output subtracted from seqOutBias .Positions which rule ensemble outperforms seqOutBias will be above 0 (dashed red line); 68% of positions in this plot were improved.

Figure 6 .
Figure 6.Rule ensemble modeling corrects Tn5 transposon single nucleotide bias in ATAC-seq data.( A ) The importance values for each position relati v e to Tn5's recognition site mirror the information content values of positions in the seqLogo r epr esentation of Tn5 bias.( B ) We directly compare bias correction of Tn5 data by seqOutBias and rule ensemble modeling by visualizing the composite corrected signals and comparing to the unscaled composite traces.( C ) We quantify the single nucleotide correction for each transcription factor's composite profile by measuring di v ergence from unbiased signal.Regions measur ed ar e within ±10 base pairs of each motif, as indicated by the graphic above the figur e panel.( D )We visualized improvement of rule ensemble modeling over seqOutBias by using a violin plot of distance from calculated random cut frequency for rule ensemble output subtracted from seqOutBias .Positions which rule ensemble outperforms seqOutBias will be above 0 (dashed red line); 78% of positions in this plot were improved and those that are not improved have lower magnitude deviations from the expectation of random cleavage.