Abstract

With the advent of next generation high-throughput DNA sequencing technologies, omics experiments have become the mainstay for studying diverse biological effects on a genome wide scale. Chromatin immunoprecipitation (ChIP-seq) is the omics technique that enables genome wide localization of transcription factor (TF) binding or epigenetic modification events. Since the inception of ChIP-seq in 2007, many methods have been developed to infer ChIP-target binding loci from the resultant reads after mapping them to a reference genome. However, interpreting these data has proven challenging, and as such these algorithms have several shortcomings, including susceptibility to false positives due to artifactual peaks, poor localization of binding sites and the requirement for a total DNA input control which increases the cost of performing these experiments. We present Ritornello, a new approach for finding TF-binding sites in ChIP-seq, with roots in digital signal processing that addresses all of these problems. We show that Ritornello generally performs equally or better than the peak callers tested and recommended by the ENCODE consortium, but in contrast, Ritornello does not require a matched total DNA input control to avoid false positives, effectively decreasing the sequencing cost to perform ChIP-seq. Ritornello is freely available at https://github.com/KlugerLab/Ritornello.

INTRODUCTION

Reliable and precise characterization of where proteins, such as transcription factors (TFs), interact with the genome, enables biologists to understand how gene expression is regulated at the molecular level. The human genome, for example, encodes about 1500 TFs (1) and many of them directly recognize and bind to specific DNA sequences to regulate gene expression. Therefore, identification of where each TF binds to the DNA is critical for reconstructing the complex regulatory network of gene expression. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) is a powerful tool for detecting protein–DNA interactions at the genome-wide scale and has become the method of choice. In a ChIP-seq experiment, first, proteins interacting with the DNA are chemically attached to the DNA using formaldehyde-mediated crosslinking. Then the DNA is fragmented into short pieces and antibodies specifically targeting the protein of interest are used to pull down DNA fragments bound by that protein. Finally, the immunoprecipitated DNA fragments are released from the protein of interest and subjected to high-throughput DNA sequencing. The resulting sequenced reads are mapped to a reference genome and computational peak calling algorithms are applied to process mapped reads and infer protein-binding positions.

TFs usually bind to short specific DNA sequences (motifs) and generate sharp point-source peaks (2). For most ChIP-seq experiments currently available, only one of the two 5′ ends of each double-stranded DNA fragment has been sequenced (single end sequencing), so the read coverage near the point-source peaks follow a characteristic bimodal shape. However, calling peaks accurately from a large quantity of mapped reads is nontrivial and over 40 algorithms have been developed (344) since the ChIP-seq technology was first introduced (45). Peak calling remains challenging due to the presence of artifactual-binding events (false positives) and background noise from reads outside of peaks, multi-binding events with overlapping read contributions and variability of experimental quality. Additionally, for most peak calling algorithms, matched negative controls, which are usually DNA samples obtained without performing immunoprecipitation (total DNA input control) or immunoprecipitated by non-specific antibodies (IgG control), are often required to control the false positive rate (FPR).

Performing a negative control experiment for each sample, effectively doubles the sequencing cost of ChIP-seq, limiting the number of samples that can be run per experiment. Peak calling algorithms that do not use the control (including those that have the option to run with or without it) have been developed; however, they underperform due to the lack of a detailed characterization of ChIP-seq signal and noise.

Binding events can also occur in close proximity to one another and it is often difficult to resolve how many binding sites are present and precisely where binding occurs. BRACIL (46) and CSDeconv (47) use blind deconvolution algorithms to resolve individual peaks at multi-binding loci but are not scalable for peak calling and are thus used for post processing when peaks have been identified by other peak callers. GEM (23) incorporates de novo motif discovery into the peak identification process aiding in resolving individual peaks, but may not be suitable if the TF of interest does not bind to DNA directly or does not have any specific motif.

ChIP-seq experiments can also be of varying quality. Collective efforts by large consortia have provided guidelines on how to evaluate the quality and signal-to-noise ratio of ChIP-seq experiments. The opposing strand cross-correlation between the read coverage on the positive and that on the negative strands has been used to assess experimental quality by ENCODE (2). The cross-correlations of ChIP-seq as well as input control experiments exhibit two modes, one at or around their respective average fragment lengths and an additional one at or around their respective read lengths. High quality experiments tend to have a greater contribution from the fragment length mode, while low quality experiments and input controls tend to have a larger contribution from the read length mode. Specifically, to assess the quality of ChIP-seq experiments, ENCODE recommends two metrics, the normalized strand coefficient (NSC) and the relative strand correlation (RSC)(2). The NSC is the ratio between the the fragment length mode and the baseline for large offsets of cross-correlation. The RSC is the ratio between the fragment length mode and the read length mode of the cross-correlation. If the NSC or RSC scores are low, indicating poor experimental quality, ENCODE recommends repeating the experiment. Given the considerable cost of repeating a ChIP-seq experiment, it is useful to be able to ‘rescue’ samples with suboptimal quality for use as additional replicates, rather than discarding them.

Here, we present Ritornello, a novel algorithm for finding binding sites of TFs. Ritornello is based on both digital signal processing (DSP) and statistical techniques. In the current work, we contribute the following innovations and insights:

  • a peak caller, that does not require a matched control and still maintains a low FPR, outperforming even algorithms that use the control.

  • an efficient method to perform full deconvolution of multi-binding events on a genome wide scale.

  • samples of low quality can be ‘rescued’, instead of being discarded.

  • a rigorous characterization of the binding signals and artifacts in the presence of noise in ChIP-seq data.

  • a non-parametric approach to calculate the fragment length distribution (FLD) for any single-end NGS experiment.

We benchmarked Ritornello against MACS2 (3) and GEM (23), two algorithms recommended by the ENCODE consortium (2). In the default modes each requires the matched control. We demonstrated that Ritornello, a matched control free method, outperformed MACS2 and GEM.

MATERIALS AND METHODS

We have developed Ritornello to find candidate peaks efficiently, with minimal use of memory and computation time, by using a DSP technique called a matched filter, classify candidate peaks as true-binding events or artifacts based on their shape and finally test candidate-binding positions for significance based on comparison to a model absent of binding at that position. The scheme of the Ritornello method is detailed in Figure 1.

Overview of the Ritornello approach. Step 1: the reads are mapped to the genome and the distributions of read starts $\Pr [R^p]$ and $\Pr [R^n]$ are calculated. Step 2: the FLD $\Pr [F]$ is calculated using only those areas identified as background coverage (free of peaks and artifacts). Step 3: the expected distribution of reads around a binding event is calculated from a model including the FLD $\Pr [F]$ as well as the distribution of relative-binding positions within fragments $\Pr [K]$. Initially, K is modeled with a uniform distribution or equivalently K ∼ β(α, α) with α = 1. Step 4: the expected distribution of reads around a binding event is used to locate candidate-binding event peaks (e.g. bk − 1, bk and bk + 1) by identifying the positions with the highest match with impulse response function. Step 5: candidate-binding events are classified as either read length artifacts (e.g. bk + 1) or are retained as putative-binding events (e.g. bk − 1 and bk) based on the shape of the local cross-correlation between opposing strands. Step 6: the binding intensity, $\beta ^p_k+\beta ^n_k$, of each peak, bk are deconvolved from the mixture of local peaks bk − 1 and background noise using a maximum likelihood approach. The likelihood ratio test is then applied to determine the significance of the peak. Step 7: the distribution of relative-binding positions within fragments, $\Pr [K]$, is updated using a maximum likelihood estimate of α, where K ∼ β(α, α). This is obtained using a combined likelihood model for the top 200 most significant peaks. Steps 3–6 are repeated using the new $\Pr [K]$.
Figure 1.

Overview of the Ritornello approach. Step 1: the reads are mapped to the genome and the distributions of read starts |$\Pr [R^p]$| and |$\Pr [R^n]$| are calculated. Step 2: the FLD |$\Pr [F]$| is calculated using only those areas identified as background coverage (free of peaks and artifacts). Step 3: the expected distribution of reads around a binding event is calculated from a model including the FLD |$\Pr [F]$| as well as the distribution of relative-binding positions within fragments |$\Pr [K]$|⁠. Initially, K is modeled with a uniform distribution or equivalently Kβ(α, α) with α = 1. Step 4: the expected distribution of reads around a binding event is used to locate candidate-binding event peaks (e.g. bk − 1, bk and bk + 1) by identifying the positions with the highest match with impulse response function. Step 5: candidate-binding events are classified as either read length artifacts (e.g. bk + 1) or are retained as putative-binding events (e.g. bk − 1 and bk) based on the shape of the local cross-correlation between opposing strands. Step 6: the binding intensity, |$\beta ^p_k+\beta ^n_k$|⁠, of each peak, bk are deconvolved from the mixture of local peaks bk − 1 and background noise using a maximum likelihood approach. The likelihood ratio test is then applied to determine the significance of the peak. Step 7: the distribution of relative-binding positions within fragments, |$\Pr [K]$|⁠, is updated using a maximum likelihood estimate of α, where Kβ(α, α). This is obtained using a combined likelihood model for the top 200 most significant peaks. Steps 3–6 are repeated using the new |$\Pr [K]$|⁠.

Derive fragment length distribution via deconvolution

For fragments overlapping a binding position, the positive strand mapping reads will be upstream of the binding site whereas negative strand reads will be downstream of the site. As such, the distance between a read and the binding position is dependent on the fragment length, which most peak calling algorithms must estimate to obtain accurate predictions for binding locations. Specifically, most current peak callers only estimate the average fragment length rather than the whole distribution. Our first innovation for Ritornello is calculating, not just the mean fragment length, but the entire sample specific empirical FLD from single-end reads (step 2 of Figure 1). Ritornello utilizes this FLD, as a key component, for more accurate peak predictions, which we will describe in detail below.

The opposing strand ‘cross-correlation’ |$\Psi _n = \Pr [R^n]\ast \Pr [-R^p]$| is the single strand ‘autocorrelation’ |$\Phi _p = \Pr [R^p]\ast \Pr [-R^p]$| convolved against the FLD as has previously been shown (48):
(1)
where, |$\Pr [R^p]$| is the probability of choosing a read starting at a position Rp on the positive strand, |$\Pr [R^n]$| is the probability of choosing a read start at a position Rn on the negative strand, |$\Pr [F]$| is the FLD and * is the convolution operator. The FLD can then be obtained by deconvolution as follows:
(2)
where, |$\mathcal {F}$| is the Fourier transform operator and |$\mathcal {F}^{-1}$| is its inverse.
In the current work, we clarify that Equation (1) assumes that Rp and F are statistically independent and thus |$\Pr [R^p+F]$| can be simplified to |$\Pr [R^p]*\Pr [F]$|⁠. If we assume that Rn (as opposed to Rp) and F are independent, then we can instead write the following relationship:
(3)
and deconvolve as follows:
(4)
Equations (2) and (4) describe how to obtain the FLD under two different and mutually exclusive assumptions (i.e. RpF or RnF where ⊥ denotes statistical independence). To estimate the FLD empirically, Ritornello must locate genomic positions where either Rp or Rn is roughly independent of F and invoke the corresponding equations locally.

We note that the local coverage on either Rp or Rn is unlikely to be uniform when the FLD changes as a function of genomic position on that strand. Therefore, to estimate the fragment lenght distribution, we exclude regions that clearly do not satisfy the assumptions made in Equations (1) and (3) by retaining only regions whose coverage is roughly locally uniform for Rp in Equation (1) and Rn in Equation (3).

We identify these candidate regions by looking for read coverage that is locally uniform on either strand, using a χ2 goodness of fit test (i.e. |$\Pr [R^p]\sim \mathcal {U}$| or |$\Pr [R^n]\sim \mathcal {U}$|⁠). For each window of size 2Fmax (twice the maximum fragment length) centered at position i on either strand, we calculate the χ2 test statistic as follows:
(5)
We then sum the local empirical autocorrelations, Φp, i or Φn, i, for those windows where either the positive or negative strand is independent of the fragment length, as determined by the χ2 test for local uniformity, across the genome G. Additionally, we sum the local opposing strand cross-correlations, Ψp, i or Ψn, i, associated with each autocorrelation according to Equations (1) and (3) as follows:
(6)
where,
(7)
Using the global autocorrelation, Φ and cross-correlation, Ψ, functions from Equation (6), we estimate the FLD as follows:
(8)

Why the fragment length distribution can be inferred from single end data?

We have just shown that the fragment length distribution can be derived via deconvolution from |$\Pr [R^P]$| and |$\Pr [R^n]$| when those distributions are related by Equations (1) or (3). This is intuitive in paired-end sequencing. However, in single-end sequencing, only one end is randomly selected from each fragment, and it is hard to imagine how the fragment length information can be inferred. For a given set of fragments, single end sequenced reads are a sub-sample of paired-end sequenced reads. Thus, |$\Pr [R^P]$| and |$\Pr [R^n]$|⁠, the only inputs to Equations (2) or (4), should be the same for the single-end reads as that in the paired-end, resulting in similar estimated FLDs, |$\Pr [F]$|⁠.

To directly demonstrate that the computed |$\Pr [F]$| of singled-end data is a reasonable approximation for the FLD, we have applied Equations (6) and (7) to a pseudo single-end data created by randomly sampling one end from each fragment of a paired-end experiment. In Figure 2, we show that the FLD calculated using Equations (6) and (7) from this pseudo single-end data (black) is similar to the true FLD of the paired-end sample (green).

Ritornello captures the FLD from single-end sequencing data. Given a set of fragments, singled end sequenced reads are a subsample of paired-end sequenced reads. When enough single end fragments are sampled the distribution of read coverage on the positive and negative strands $\Pr [R^p]$ and $\Pr [R^n]$ are equivalent to their paired-end counterparts. A single read from each paired-end fragment of an EZH2 sample was randomly chosen to simulate singled end data. The FLD calculated by Ritornello from this data (black curve) closely approximates the true FLD (green curve).
Figure 2.

Ritornello captures the FLD from single-end sequencing data. Given a set of fragments, singled end sequenced reads are a subsample of paired-end sequenced reads. When enough single end fragments are sampled the distribution of read coverage on the positive and negative strands |$\Pr [R^p]$| and |$\Pr [R^n]$| are equivalent to their paired-end counterparts. A single read from each paired-end fragment of an EZH2 sample was randomly chosen to simulate singled end data. The FLD calculated by Ritornello from this data (black curve) closely approximates the true FLD (green curve).

Local fragment length distribution varies around binding events

The fragments generated from a binding event overlap the binding site, thus any given fragment must be at least as long as the distance from its start position to the binding site. This creates dependence between the fragment length and genomic position on both strands because reads that are further from the binding site are necessarily longer on aggregate. Consequently, neither the positive nor the negative strand read coverage is independent of the fragment length within a binding event, making it inappropriate to recover the FLD using Equation (2), as shown by simulation in Figure 3D. In contrast, in simulated event free regions, it is shown that the FLD can be correctly recovered using Equation (2) as seen in Figure 3A.

The presence of binding events and read length artifacts hinders reconstruction of the FLD by deconvolution. Coverage patterns (positive strand in red and negative strand in blue) were generated from reads sampled randomly from either end of simulated fragments. Equation (2) was applied to infer the FLD, FLD (black). The actual FLD was calculated from the simulated fragments (green). The read length is shown in gray. (A) The FLD, inferred from reads in simulated binding regions, deviates from the true FLD. (B) The FLD, inferred from genomic background coverage outside of binding events, agrees with the true FLD. (C) In the presence of read length column artifacts, the inferred FLD deviates from the true FLD and exhibits a ‘phantom peak’ at read length. (D) In the presence of read length missing-column artifacts, the inferred FLD deviates from the true FLD and exhibits a ‘phantom peak’ at read length.
Figure 3.

The presence of binding events and read length artifacts hinders reconstruction of the FLD by deconvolution. Coverage patterns (positive strand in red and negative strand in blue) were generated from reads sampled randomly from either end of simulated fragments. Equation (2) was applied to infer the FLD, FLD (black). The actual FLD was calculated from the simulated fragments (green). The read length is shown in gray. (A) The FLD, inferred from reads in simulated binding regions, deviates from the true FLD. (B) The FLD, inferred from genomic background coverage outside of binding events, agrees with the true FLD. (C) In the presence of read length column artifacts, the inferred FLD deviates from the true FLD and exhibits a ‘phantom peak’ at read length. (D) In the presence of read length missing-column artifacts, the inferred FLD deviates from the true FLD and exhibits a ‘phantom peak’ at read length.

Local fragment length distribution varies around read length artifacts

In addition to binding events, we have observed local artifactual patterns that create dependence between fragment length and genomic position on both strands, preventing the reconstruction of FLD. These artifactual patterns fall into the following two categories:

  • a pile of reads whose start positions constitute a read length width column pattern on the positive strand, followed by a read length width column pattern on the negative strand, a read length downstream. We refer to this artifact as a column artifact. We simulate it in Figure 3B and show it in Figure 4A.

  • a binding peak (or background read coverage) but with a read length width column pattern of missing reads on the positive strand followed by a read length width column pattern of missing reads on the negative strand, a read length downstream. We refer to this as a missing-column artifact. We simulate it in Figure 3C and show it in Figure 4B.

Column and missing-column artifacts and the nonrandom FLD in their neighborhoods. Examples of read length column (A) and missing-column (B) artifacts in a single-end human GM12878 cell anti-Serum Response Factor (SRF) ChIP sample on chromosome 1. Examples of a read length column (C) and missing-column artifact (D) in a paired-end MEF input control sample on chromosomes 1 and 18 respectively. Positive strand reads are shown in red while negative strand reads are shown in blue. The artifact center positions x where the sample genome differs from the reference are marked with asterisks. The paired-end scatterplots show each read’s position (x-axis) and associated fragment length (y-axis) to demonstrate the dependence relationship between genomic position and fragment length. Every positive strand read (red point) is accompanied by a negative strand read (blue point) originating from the same fragment. For clarity, in the paired-end column artifact plot (C), we have plotted each fragment and separated them to two groups such that in one group all fragments have positive strand reads within the positive strand column (light red column) and in the other group all fragments have negative strand reads within the negative strand column (light blue column). Explicitly, reads from fragments contributing positive strand columns are shown between the pink lines, while those from fragments contributing to the negative strand column are shown between cyan lines. The paired-end missing-column artifact (D) has a column of missing reads on the positive strand followed by a column of missing reads on the negative strand. The positive strand column of missing reads is linked to a diagonal of missing reads on the negative strand, representing the associated missing downstream fragment ends and are together outlined in light red. Similarly, the negative strand column of missing reads is linked to a diagonal of missing reads on the positive strand, representing the associated missing upstream fragment ends, and are together outlined in light blue. We highlight two fragments to demonstrate the coupling between reads on the positive strand and reads on the negative strand.
Figure 4.

Column and missing-column artifacts and the nonrandom FLD in their neighborhoods. Examples of read length column (A) and missing-column (B) artifacts in a single-end human GM12878 cell anti-Serum Response Factor (SRF) ChIP sample on chromosome 1. Examples of a read length column (C) and missing-column artifact (D) in a paired-end MEF input control sample on chromosomes 1 and 18 respectively. Positive strand reads are shown in red while negative strand reads are shown in blue. The artifact center positions x where the sample genome differs from the reference are marked with asterisks. The paired-end scatterplots show each read’s position (x-axis) and associated fragment length (y-axis) to demonstrate the dependence relationship between genomic position and fragment length. Every positive strand read (red point) is accompanied by a negative strand read (blue point) originating from the same fragment. For clarity, in the paired-end column artifact plot (C), we have plotted each fragment and separated them to two groups such that in one group all fragments have positive strand reads within the positive strand column (light red column) and in the other group all fragments have negative strand reads within the negative strand column (light blue column). Explicitly, reads from fragments contributing positive strand columns are shown between the pink lines, while those from fragments contributing to the negative strand column are shown between cyan lines. The paired-end missing-column artifact (D) has a column of missing reads on the positive strand followed by a column of missing reads on the negative strand. The positive strand column of missing reads is linked to a diagonal of missing reads on the negative strand, representing the associated missing downstream fragment ends and are together outlined in light red. Similarly, the negative strand column of missing reads is linked to a diagonal of missing reads on the positive strand, representing the associated missing upstream fragment ends, and are together outlined in light blue. We highlight two fragments to demonstrate the coupling between reads on the positive strand and reads on the negative strand.

Based on paired end data, we have observed that the FLD is invariant with respect to genomic position with the exception of these artifacts and binding events which Ritornello excludes from the FLD calculation using Equations (6) and (7). Both of these artifacts cause local disturbances in the FLD. This is easier to see in paired-end sequenced reads. The paired-end column artifact shown in Figure 4C contains fragments with length distributed according to the sample’s FLD. However, the fragments are organized such that longer fragments extend further from center of the artifact (denoted with an asterisk) than shorter fragments, implying F is dependent on Rp and Rn. The paired end missing-column artifact shown in Figure 4D is composed of fragments organized such that the range of possible fragment lengths is restricted based on genomic position. Specifically, on the positive strand, lengths of fragments (such as fragments A and B) must lie outside the range [dp, dp + |$w$|] where dp is the distance from the positive strand read to the center of the missing-column artifact (denoted with an asterisk). Likewise, on the negative strand, the lengths of fragments (such as fragments C and D) must lie outside the range [dn, dn + |$w$|] where dn is the distance from the negative strand read to the center of the missing-column artifact (denoted with an asterisk). Thus, column and missing-column artifacts are composed of fragments organized such that both strands are dependent on fragment length.

We note that if we were using Equations (2) or (4) without invoking Equations (5)–(8) at these artifactual regions, the resulting FLD would have two modes, one associated with the ‘phantom peak’ near the read length, and one associated with the predominant fragment length. ENCODE observed these two modes in the opposing strand cross-correlation, which is related to the FLD (Equations (1) and (3)). However, when these artifactual regions are filtered out as is done in equations (5)–(8), the ‘phantom peak’ is greatly attenuated. Thus, these artifactual areas give rise to a low quality RSC (2,49), the ENCODE measures of the ‘phantom peak’ using cross-correlation. The cause of these artifacts is likely incorrect mapping which is described in the next section.

Incorrect mapping leads to read length artifacts

The read length used in any sequencing experiment is determined subsequent to the collection of fragments. Therefore, read length artifacts are associated with sequencing or post-sequencing procedures. Each read in a ChIP-seq experiment is sequenced from the sample’s genomic DNA, which can differ from the reference genome used in the alignment step. Comparative genome assembly algorithms used for sequence alignment work by comparing the nucleotide sequence for each read to the sequence of the reference genome and assigning the read to the location that gave the best alignment score, usually based on fuzzy string matching. If the nucleotide sequence of the sample’s genome, from which the reads are sampled, disagrees with the reference genome at location x, the mapping algorithm can fail to assign the appropriate reads to that location. This could occur if the number of mismatched bases (or indels) per read exceeds a predetermined cutoff or simply if the reads belonging to that location map better to another region with higher sequence similarity. When this occurs there will be a discontinuity in coverage across |$w$| nucleotides (where |$w$| is the read length) because there are exactly |$w$| possible read start positions where the read would overlap the mismatched base (or indel). On the positive strand the discontinuity is upstream of x and on the negative strand the discontinuity is downstream of x. This results in an missing-column artifactual coverage pattern of incorrectly mapped reads to the upstream sequence highlighted in yellow as seen in Figure 5. Further, reads that fail to map to the correct location can instead map to another location with higher sequence similarity as determined by the mapping algorithm. This would result in a column artifact as seen in the downstream sequence highlighted in yellow in Figure 5. These artifacts tend to occur in interspersed repetitive regions such as the sequences shown in yellow in Figure 5.

Read length artifacts likely stem from mapping problems. Reads that map to their correct locations do not give rise to artifacts (left). Differences between the reference and sequenced genomes can produce missing-column artifacts and additionally the coverage can be relocated to a region of higher sequence similarity forming a column artifact (right).
Figure 5.

Read length artifacts likely stem from mapping problems. Reads that map to their correct locations do not give rise to artifacts (left). Differences between the reference and sequenced genomes can produce missing-column artifacts and additionally the coverage can be relocated to a region of higher sequence similarity forming a column artifact (right).

For paired-end each relocated read in a column artifact, the associated read from the same fragment is also relocated as shown in Figure 4C. Likewise for each missing read in a missing-column artifact, the associated read from the same fragment is also missing as shown in Figure 4D. In principle, the missing column artifact problem may be mitigated by suggesting multiple possible mappings (which can be done using the bowtie −k option or efficiently using other algorithms (50)). However, while this may fill in some of the missing columns, it will create additional column artifacts, which would need to be resolved.

Derive the expected read coverage distribution around binding events

Once we estimate the FLD, we use it to derive a filter matched to the read coverage pattern characteristic to regions of true-binding events (step 3 of Figure 1). We denote putative ChIP target-binding sites by Bj, where j is an index representing the j-th putative-binding event along the genome. Each fragment originating from binding event j covers the binding position, Bj. The binding position Bj is then related to the read position as follows:
(9)
and
(10)
where |$R^p_j$| is the start position of a read on the positive strand resulting from event j and |$R^n_j$| is the end position of a read on the negative strand resulting from event j. K is a random variable taking values between zero and one, describing the relative position of the binding site within a fragment. If K equals 0, then the binding position is at the most upstream end of the fragment, and if K equals 1, then the binding position is at the most downstream end of the fragment. If K is between 1 and 0, the binding position is at that location relative the fragment length. We model K as a β distributed random variable |$\Pr [K] \sim B(\alpha , \beta )$|⁠. The β distribution is convenient for this purpose because it is flexible for modeling random variables which take values between 0 and 1. Additionally, we set α = β, assuming that K is symmetrically distributed. We initialize K to a uniform distribution (α = 1) as a natural choice in the absence of prior knowledge (step 3 of Figure 1) and as detailed subsequently we reevaluate it by optimizing α (step 7 of Figure 1). Next, applying algebra of random variables, it can easily be seen that:
(11)
and
(12)
where * is the convolution operator, |$\Pr [K]=\Pr [1-K]$| due to the symmetry mentioned above and |$\Pr [FK]$| is the product distribution as follows:
(13)
|$\Pr [-FK]$| is the distribution of local read coverage on the positive strand with support upstream of a binding position (negative offset). |$\Pr [FK]$| is the distribution of local read coverage on the negative strand with support downstream of a binding position (positive offset). We will use these local coverage patterns in both steps 4 and 6 of Figure 1 to locate and quantify candidate peaks respectively.

Matched filtering for rapid and accurate localization of candidate peaks

TFs bind to a small fraction of the genome, thus to improve efficiency we only test candidate peak positions (step 6 of Figure 1) that closely match our expected peak shape (⁠|$\Pr [-FK]$| and |$\Pr [FK]$|⁠), as identified by a matched filter (51) (step 4 of Figure 1).

In signal processing, a filter is a function which selects for the desired output signal vector, s and suppresses the undesirable noise vector, |$v$|⁠, of an observed input signal x = s + |$v$|⁠. A matched filter (51) is a specialized filter whose time inverse is the impulse response function, h, where h is optimally parallel to the desired signal (hs) and orthogonal to the noise (h|$v$|⁠). The matched filter has the favorable property that when convolved with an observed signal, it will maximize the output signal to noise ratio. The time inversed matched filter, h, is defined as follows:
(14)
where γ is a normalization constant and Σ−1 is the inverse covariance matrix of the noise.
In the context of ChIP-seq experiments, the observed input signal x is the read count at each position. For identifying peaks, the desired output signal s is associated with an impulse response of |$\Pr [-FK]$| for the positive strand and |$\Pr [FK]$| for the negative strand. The read count noise is not globally stationary, but locally it is approximately a stationary process and is thus independent and identically distributed (i.i.d.) with the level of noise changing relatively slowly on a much larger length scale than the support (2Fmax) of the desired signal. When the noise is i.i.d., (14) is reduced to
(15)
where the amplitude of the noise is absorbed in |$\tilde{\gamma }$|⁠. From Equation (15) we see that the filter h is proportional to the impulse response of s. Specificaly, we use the following filters hp and hn for the positive and negative strands respectively:
(16)
We define the filtered signal Y at each position by the formula:
(17)

Usually γ is chosen to normalize the expected power of the noise after application of the filter to one. However, a given γ that normalizes the noise in low coverage areas to one, necessarily will give higher power in areas of higher coverage. Therefore, it is infeasible to specify a single γ when the noise is not globally stationary. As a result, the standard way of detecting the desired output signal by thresholding using a fixed signal to noise ratio is not applicable.

Instead, we identify local maxima using a Gaussian derivative filter, a technique commonly used for detecting local maxima (edges) in images as follows:
(18)
where |$\mathcal {G}$| is the Gaussian distribution with variance σ2. Zero crossings (Equation (18)) from positive to negative of this smoothed first derivative are the local maxima. A minimum read count requirement is applied to avoid spurious low coverage local maxima. Those local maxima passing the threshold are selected as candidate-binding events, bj; j ∈ [1, N] (for N candidate events).

Remove false positive-binding events using cross-correlation

Genomic regions that have similarly high read coverage in both the ChIP sample and the negative control are false positive-binding events. Most current peak calling algorithms rely on negative controls (usually total DNA input) to control the FPR. To the best of our knowledge, if negative controls are not available, false positive events would not be filtered out by current peak calling algorithms. We have discovered that most of the significant false positive events are in fact the read length artifacts described earlier and exemplified in Figure 6. We have already shown that the cross-correlations and deconvolved FLDs of regions with read length artifacts exhibit the ‘phantom peak’ at read length (Figure 3).

Local cross-correlation differentiates true-binding events from read length artifacts (false positives). (A) Local cross-correlation of a binding event in anti-ATF2 K562 ChIP sample peaks near the average fragment length. (B) Local cross-correlation of a column artifact in anti-ATF2 K562 ChIP sample peaks near the read length. (C) A scatterplot of the maximum local log cross-correlation up to 10 bp beyond the read length versus the maximum local log cross-correlation in the range of 10 bp beyond the read length to 0.75Fmax (D) A scatterplot of the maximum local binarized cross-correlation (using unique reads) up to 10 bp beyond the read length versus the maximum local binarized cross-correlation in the range of 10 bp beyond the read length to 0.75Fmax.
Figure 6.

Local cross-correlation differentiates true-binding events from read length artifacts (false positives). (A) Local cross-correlation of a binding event in anti-ATF2 K562 ChIP sample peaks near the average fragment length. (B) Local cross-correlation of a column artifact in anti-ATF2 K562 ChIP sample peaks near the read length. (C) A scatterplot of the maximum local log cross-correlation up to 10 bp beyond the read length versus the maximum local log cross-correlation in the range of 10 bp beyond the read length to 0.75Fmax (D) A scatterplot of the maximum local binarized cross-correlation (using unique reads) up to 10 bp beyond the read length versus the maximum local binarized cross-correlation in the range of 10 bp beyond the read length to 0.75Fmax.

Ritornello identifies regions containing false positive events by detecting the ‘phantom peak’ in their cross-correlations; true-binding events contain no such ‘phantom peak’ in thier cross-correlations. To this end, we employ a machine learning approach, building a classifier to distinguish between read length artifacts and true-binding events. To extract features for the classifiers, we calculated cross-correlation locally from bjFmax to bj + Fmax around each candidate peak. The features include: (i) the maximum value of the cross-correlation function in the range between zero and read length, which is denoted by c1 and (ii) the maximum value of the cross-correlation function in the range between the read length and the maximum fragment length Fmax, which is denoted by c2. In the neighborhoods of binding events c2 is expected to be higher than c1, whereas in neighborhoods of read-length artifacts we expect c1 to be larger than c2. We added additional features to account for consecutive read length artifacts of varying amplitudes and large amplifications such as due to polymerase chain reaction (PCR). For this purpose, we binarized the coverage in the positive and negative strands by setting positions with read count >0 to 1. We then performed a running mean smoothing on the binarized coverage, calculated cross-correlation and extracted the following features: (i) the maximum value of the binarized smoothed cross-correlation function in the range between zero and read length, which is denoted by d1 and (ii) the maximum value of the binarized smoothed cross-correlation function in the range between the read length and the maximum fragment length Fmax, which is denoted by d2.

We build a classifier using logistic regression, with features: |$\lbrace \frac{c_2}{c_1},\frac{d_2}{d_1}\rbrace$|⁠. The instances used to build this classifiers include manually classified peaks obtained as follows: we first applied MACS2 (negative control free mode) to four TF ChIP-seq datasets generated by ENCODE, subsequently selected the top 200 peaks for each of four samples and finally manually labeled regions with typical binding shape as true positives (see Figure 6A) and regions with characteristic read length artifact as false positives (see Figure 6B). We trained this model using a five fold cross-validation and achieved high performance with AUROC of 0.993. This set of features is scale-free and thus our trained classifier is generalizable to any ChIP-seq sample and does not need to be retrained. Ritornello incorporates this trained classifier (step 5 of Figure 1) to flag artifactual locations as false positives without the need for a paired total DNA input or IgG control.

Deconvolving single events from local coverage

The read coverage near an event is a mixture of reads generated by that event, noise and any neighboring events. In order to accurately quantify each event, it is essential to deconvolve its binding intensity (number of reads originating from each event) from this mixture. Fragments originating from different events in close proximity may overlap. Consequently, it is difficult to quantify the number of reads coming from each binding event. Further, fragments originating from non ChIP-ed DNA, off targeted sequencing due to antibody inefficiency, as well as other sources, contribute to background noise. We model the read coverage around each candidate peak using a generalized linear model to deconvolve its binding intensity.

The binding intensity, βj, of each candidate peak, j, is only dependent on positions where that peak has support, i ∈ [bjFmax, bj + Fmax]. To efficiently deconvolve the signal at bj we first discard peaks that do not overlap with bj. We retain only the subset qk ∈ {q1qT} of T peaks in close proximity to j (including j), such that the support of each peak in q overlaps with the support of j. The locally uniformly distributed noise associated with this neighborhood is indexed by q0. Here we assume that read counts follow a Poisson distribution, a common assumption made by other algorithms, such as MACS and GEM (3,23). We can then model the number of reads on the positive and negative strands, |$C^p_{i,q_k}$| and |$C^n_{i,q_k}$|⁠, at position i due to event qk as follows:
(19)
and
(20)
where the parameters |$\beta ^p_{q_k}$| and |$\beta ^n_{q_k}$| denote the binding intensities (expected read counts) of event qk. The impulse response functions |$h^p(i-b_{q_k})$| and |$h^n(i-b_{q_k})$| are the probabilities of observing a read at position i from event qk. We note that different |$\beta ^p_{q_k}$| and |$\beta ^n_{q_k}$| values are used to account for local differences in read coverage between positive and negative strands.
To model the noise we will once again invoke our assumption of locally stationary noise, as in the discussion before Equation (16). Here we assume that the locally stationary noise is a uniformly Poisson distribution. We model the read counts due to noise at position i for the positive and negative strands as follows:
(21)
and
(22)
where U is a function that is locally uniform with support of 2Fmax around bj and |$\beta ^p_{q_0}$| and |$\beta ^n_{q_0}$| are the expected number of reads due to noise on the positive and negative strands respectively.
The read count at position i is then given by the sum of read counts from all sources qk as follows:
(23)
where
(24)
and
(25)
where
(26)

The relationships in Equations (23) and (25) use the following theorem: if X1Xn are independent Poisson distributed random variables, Pois1)…Poisn), then their sum X1 + … + Xn is Poisson distributed, Pois1 + … + λn).

In order to obtain the binding intensity for peak, bj, we maximize the likelihood for the models (Equations (23) and (26)) of all nucleotides around bj. The likelihood of local binding intensities |$\beta ^p_q$| and |$\beta ^n_q$| around the peak of interest, j, can be written as:
(27)

We then find the maximum likelihood estimates for parameters |$\beta ^p_q$| and |$\beta ^n_q$|⁠. The sum, |$\beta ^p_j + \beta ^n_j$|⁠, is reported as the binding intensity for the peak at bj.

We note that this is formally a Poissson generalized linear model with identity link function. Such a model has the advantage that it can resolve multiple peaks in close proximity, such as double-binding or triple-binding events. To our knowledge only BRACIL (46) and CSDeconv (47) are designed to deconvolve adjacent-binding events, in particular double-binding events and use different models. Those two algorithms are less efficient and therefore require as input a set of peaks from other peak callers.

Ritornello implements a dogleg optimization (the Newton–Raphson method coupled with initial gradient descent), which is much faster than traditional Expectation Maximization or Markov Chain Monte Carlo methods (52), enabling the rapid deconvolution of all loci detected in previous steps.

Testing candidate peaks for significance

In the previous section we quantified the intensity, βj, of each candidate peak. Here we determine the significance of each of these candidate-binding events using a likelihood ratio test based on the likelihood we derived in Equations (27). The null model |$H^0_j$| is obtained by setting both |$\beta ^p_j=0$| and |$\beta ^n_j=0$|⁠. We note that we use the term null model at each position bj to refer to the model involving a zero-binding intensity at bj but with potentially nonzero |$\beta _{q_0}^p$| and |$\beta _{q_0}^n$| as well as |$\beta _q^p$| and |$\beta _q^n$| in neighboring candidate events. The alternative model |$H^1_j$| uses full parameterization including non vanishing |$\beta _j^p$| and |$\beta _j^n$| at the location of interest bj.

Since the null model |$H^0_j$| is nested within |$H^1_j$|⁠, we can employ the likelihood ratio test statistic (D) in the form:
(28)
According to Wilke’s theorem (53) the likelihood ratio test statistic for this nested model is distributed according to a χ2 distribution with two degrees. We then calculate the p-value for each peak based on this χ2 distribution.

Up to this point, we have obtained an initial list of putative-binding events. This was based on inferring an impulse response function given by the product distribution of the FLD and a uniformly distributed K (step 3 Figure 1). To further refine the impulse response function, we find the estimate of α that maximizes the combined likelihood of the 200 most significant putative events (step 7 of Figure 1). As shown in Figure 7, the |$\Pr [-FK]$| and |$\Pr [FK]$| derived from this procedure closely match the shape of highly abundant peaks. Finally, we repeat the peak identification, artifact testing, and likelihood ratio testing (steps 4–6 of Figure 1) using the updated hp and hn and report final list of significant peaks.

The parameterized filter closely matches the peak shape. Shown is an example peak in a human SRF sample. The parameterized filter for this sample is shown in gray. The peak location as determined by the filter is shown by a dashed line.
Figure 7.

The parameterized filter closely matches the peak shape. Shown is an example peak in a human SRF sample. The parameterized filter for this sample is shown in gray. The peak location as determined by the filter is shown by a dashed line.

FDR correction

To control the false discovery rate, we adjust the P-values, pi, associated with each hypothesis test by applying the Benjamini–Hochberg (54) approach to obtain q-values, qi, which control the FDR as follows: |$q_i=p_i \frac{m}{k_i}$|⁠, where ki is the P-value rank by significance and m, is the total number of potential hypothesis tested. Finally, we ensure monotonicity using the Benjamini–Yekutieli correction (55).

Usually one performs all hypothesis tests and m is set to that number. For the likelihood ratio test that Ritornello performs, it is important to note that the null hypothesis is the absence of a peak at a position of interest in the presence of any amount of coverage due to uniform background signal or neighboring peaks. The appropriate m for Ritornello is then, not the number of candidate peaks tested, as these were previously enriched for peak shape coverage by the matched filter, but rather the total number of windows exceeding the minimum read count threshold before selecting those that match the filter.

RESULTS

To assess Ritornello’s performance, we compared it against the MACS2 (3) and GEM (23) peak callers, which have both been recommended by the ENCODE consortium (56,57). We use 14 single-end TF ChIP-seq experiments from the ENCODE project (57), each with two biological replicates (see Table 1 and Supplementary Table S1). Matched DNA input or IgG controls were also available (Supplementary Table S1).

Although matched input controls are optional for some methods, they are used by ChIP-seq peak calling algorithms to avoid calling many spurious false positives. Both MACS2 and GEM have options to run with or without the matched input control. To show that Ritornello, which is a matched control free approach, avoids calling these false positives we benchmark it against MACS2 (with and without the matched control), and GEM (with and without the matched control) on each of the 28 samples. We refer to MACS2 with the control as MACS2-I and without the control as MACS2. We refer to GEM with the control as GEM-I and without the control as GEM. To control for the variable size of reported peaks, we used the interval from 100 bp upstream to 100 bp downstream of each peak summit identified by MACS2 or GEM, and of each peak location identified by Ritornello. If multiple peaks called from a single algorithm overlapped each other using this 200 bp window, only the peak with the more significant q-value is considered. Mitochondria were excluded from all analysis. We compare the performance of these algorithms in terms of:

  • the percentage of peaks unique to Ritornello with respect to each control utilizing sample.

  • the similarity between characteristic coverages in binding events predicted uniquely by Ritornello, MACS2-I, or GEM-I to the coverages of strong binding events predicted by multiple algorithms.

  • motif enrichment.

Additionally, we demonstrate that Ritornello predicts very few false positives in negative controls suggesting that Ritornello has a low FPR. We also observe that Ritornello produces results that are reproducible among technical replicates as determined by the irreproducible discovery rate algorithm (58) (Supplementary Table S2).

Most peaks called by Ritornello are common to MACS2-I or GEM-I

Control samples are used to eliminate false positive peaks during ChIP-seq peak calling. It is therefore important to measure how many peaks are unique to Ritornello with respect to the control using algorithms. The percentages of peaks unique to Ritornello with respect to MACS2-I and GEM-I are shown in Table 2. Ritornello tends to call few (<10%) unique peaks in all samples with the exceptions of E2F4 and ATF2. Although Ritornello tends to report fewer peaks than MACS2-I and GEM-I, these peaks are most often a subset of the peaks reported by those input using algorithms. Peaks that are called by MACS2-I and GEM-I but not by Ritornello typically have less reads and hence the shape of the read coverage at these loci is not well defined.

TF ChIP-seq experiments (two replicates each)

Table 1.
TF ChIP-seq experiments (two replicates each)
TFCell type
NRF1K562
SRFGM12878
RESTH1
MAXK562
ATF2H1
E2F4GM12878
GATA1MEL
MYCMEL
CTCFMyotube
ELK1K562
SRFH1
MAXH1
YY1H1
RESTK562
TFCell type
NRF1K562
SRFGM12878
RESTH1
MAXK562
ATF2H1
E2F4GM12878
GATA1MEL
MYCMEL
CTCFMyotube
ELK1K562
SRFH1
MAXH1
YY1H1
RESTK562
Table 1.
TF ChIP-seq experiments (two replicates each)
TFCell type
NRF1K562
SRFGM12878
RESTH1
MAXK562
ATF2H1
E2F4GM12878
GATA1MEL
MYCMEL
CTCFMyotube
ELK1K562
SRFH1
MAXH1
YY1H1
RESTK562
TFCell type
NRF1K562
SRFGM12878
RESTH1
MAXK562
ATF2H1
E2F4GM12878
GATA1MEL
MYCMEL
CTCFMyotube
ELK1K562
SRFH1
MAXH1
YY1H1
RESTK562

Percentage of peaks unique to Ritornello as compared to MACS2-I and GEM-I

Table 2.
Percentage of peaks unique to Ritornello as compared to MACS2-I and GEM-I
graphic
graphic
Table 2.
Percentage of peaks unique to Ritornello as compared to MACS2-I and GEM-I
graphic
graphic

Comparing Ritornello and alternative methods based on unique peak coverage patterns

We next investigate the peak shape and motif content for samples where Ritonello called >10% unique peaks relative to MACS2-I or GEM-I. For this analysis, we compared the peaks that are uniquely reported by Ritornello to the corresponding algorithms in Figure 2. To be fair when comparing read coverage of unique peaks, we averaged the local distributions of read start positions (pileup) for the top 200 most significant unique peaks of each algorithm (pileup plots in Figure 8 and additionally Supplementary Figures S1 and 2) and found that the characteristic patterns (black curves whose shapes match the impulse response functions) associated with the highest intensity peaks reported by all algorithms tend to be similar to the patterns obtained by aggregating the pileups of peaks uniquely reported by Ritornello but in contrast are less similar to the aggregated pileups of peaks uniquely reported by MAC2-I or GEM-I. These results provide evidence that peaks uniquely reported by Ritornello may be more likely to be indicative of binding events than those uniquely reported by MACS2-I or GEM-I.

Pileup of read start positions for ATF2 rep2 peaks in H1 cells obtained by Ritornello only (blue) compared with that obtained from (A) MACS2 only(red) and (B) GEM only(green). The pileups of peaks common to Ritornello and MACS2 (A) or Ritornello and GEM (B) are shown in black. The pileups of read start positions for Ritornello best match the pileups of common peaks. Additionally Ritornello unique peaks have comparable levels of motif enrichment as compared to the other algorithms.
Figure 8.

Pileup of read start positions for ATF2 rep2 peaks in H1 cells obtained by Ritornello only (blue) compared with that obtained from (A) MACS2 only(red) and (B) GEM only(green). The pileups of peaks common to Ritornello and MACS2 (A) or Ritornello and GEM (B) are shown in black. The pileups of read start positions for Ritornello best match the pileups of common peaks. Additionally Ritornello unique peaks have comparable levels of motif enrichment as compared to the other algorithms.

Comparing Ritornello and alternative methods based on motif occurrence rate

Availability of genuine validations of TF-binding events inferred by TF ChIP-seq peak callers is limited. Therefore, one of the measures used by practitioners for assessing the quality of peak callers is the fraction of predicted TF-binding events that overlap with the characteristic binding motif of the relevant TF. Employing the same 28 public ChIP-seq samples, we compare the motif enrichment for the top 1000 peaks reported by each algorithm or all peaks when less than 1000 are reported as shown in (Table 3). The annotated motifs specific to the TFs are downloaded from the JASPAR CORE 2014 motif library (59). Genomic locations that match the position weight matrix of each JASPAR motif were identified using PWMScan with default P-value cutoff at 0.00001 (60). Predicted peaks whose binding centers are within 100 bp from the corresponding JASPAR motif are classified motif containing. The peaks found by Ritornello had the highest motif occurrence rate compared with MACS2-I, MACS2, GEM-I and GEM in 15 out of the 28 samples. MACS2-I, MACS2, GEM-I and GEM had the highest motif occurrence rate in 8, 1, 3 and 1 samples respectively. This suggests that Ritornello, which is a control free peak caller, is able to identify the most significant true-binding events at comparable rates to input using peak callers.

Percentage of the top 1000 peaks (or all peaks when <1000) within 100 bp of a motif obtained by Ritornello, MACS2-I, MACS, GEM-I and GEM

Table 3.
Percentage of the top 1000 peaks (or all peaks when <1000) within 100 bp of a motif obtained by Ritornello, MACS2-I, MACS, GEM-I and GEM
graphic
graphic
Table 3.
Percentage of the top 1000 peaks (or all peaks when <1000) within 100 bp of a motif obtained by Ritornello, MACS2-I, MACS, GEM-I and GEM
graphic
graphic

Additionally, we compare the number of motif containing peaks reported by each algorithm as a function of significance (Figure 9 and Supplementary Figures S3–9). We see that Ritornello exhibits improved motif enrichment, with respect to the other algorithms, for peaks reported in ATF2 rep2 and comparable enrichment to control utilizing algorithms in E2F4 rep1 and rep2. In general, Ritornello shows comparable motif enrichment to the input utilizing algorithms.

Motif containing peaks as a function of the rank of the q-value for (A) ATF2 rep2, (B) E2F4 rep1, (C) E2F4 rep2 and (D) CTCF rep1. The number of peaks displayed is slightly more than the number of peaks reported by Ritornello.
Figure 9.

Motif containing peaks as a function of the rank of the q-value for (A) ATF2 rep2, (B) E2F4 rep1, (C) E2F4 rep2 and (D) CTCF rep1. The number of peaks displayed is slightly more than the number of peaks reported by Ritornello.

Ritornello identifies true-binding events in low quality samples

ENCODE recommends discarding low quality samples as determined by the NSC and RSC scores. Ritornello discards read length artifacts that give rise to low RSC and uses a matched filtering approach that maximizes the signal to noise ratio measured by the NSC, we therefore conjectured that Ritornello may be able to rescue these low quality samples. We compared the performance of Ritornello with that of MACS2-I, MACS2, GEM-I and GEM on samples with suboptimal quality based on the NSC and RSC scores. The ENCODE Consortium has suggested repeating experiments with NSC values <1.05 and RSC values <0.8. Using these criteria we identified that out of the 28 samples we investigated, four samples have suboptimal quality. These four experiments include: ATF2 H1 replicate 1 (NSC = 1.04, RSC = 0.62), ATF2 H1 replicate 2 (NSC = 1.04, RSC = 0.74), ELK1 K562 replicate 1 (NSC = 1.03, RSC = 0.64) and ELK1 K562 replicate 2 (NSC = 1.05, RSC = 0.73).

We observed that in these four samples, the pileups of peaks predicted by Ritornello have a characteristic bimodal shape of TF binding and have much stronger read coverage than their matched input controls (Figure 10 and Supplementary Figures S10–12). This suggests that there are numerous significant binding events that can be captured in low quality ChIP-seq samples. Additionally, the pileups of peaks reported by MACS2 or GEM have either non-uniform read coverage or a narrow bimodal shape (similar to column artifacts) as in Figure 10 and Supplementary Figures S10–12. It is worth noting, however, that when provided controls, the MACS2-I and GEM-I reported peaks also have a strong bimodal shape with less signal in the control sample. This illustrates that Ritornello reliably rescues peaks from low quality samples, while MACS2-I and GEM-I might be able to avoid calling false positive artifacts from low quality samples as well.

Pileup of read start positions for peaks identified by: Ritornello, (A) and (D) (blue); MACS2-I, (B) and (E) (dark red); and MACS2, (C) and (F) (light red) in samples of low quality and their matched controls. Matched controls are shown in black. The pileups of peaks detected by Ritornello show smooth bimodal shapes and similarly to MACS2-I, have stronger read coverage in the ChIP as compared to their matched controls. In contrast, the pileups of peaks detected by MACS2 have a narrower shape and irregular spikes similar to the negative controls.
Figure 10.

Pileup of read start positions for peaks identified by: Ritornello, (A) and (D) (blue); MACS2-I, (B) and (E) (dark red); and MACS2, (C) and (F) (light red) in samples of low quality and their matched controls. Matched controls are shown in black. The pileups of peaks detected by Ritornello show smooth bimodal shapes and similarly to MACS2-I, have stronger read coverage in the ChIP as compared to their matched controls. In contrast, the pileups of peaks detected by MACS2 have a narrower shape and irregular spikes similar to the negative controls.

Ritornello obviates the need for a matched input control

To demonstrate that Ritornello calls few false positives that could otherwise be avoided with the aid of a match control, we show that Ritornello calls few peaks in control samples as seen in Table 4. We used the filter learned in the ChIP channel to look for regions in the corresponding control that match this pattern. This is performed by running Ritornello, supplying the filter found in the ChIP experiment rather than having it learned from the control data. The number of peaks Ritornello reported in each matched control is a proxy for the number of potential false positive peaks called by Ritornello in its corresponding ChIP-seq experiment. We found that Ritornello called very few peaks in the matched controls, which corresponds to a FPR in ChIP of <0.05 for 25 out of the 28 samples. The samples with the two highest FPR’s use IgG as a control, so it is likely that Ritornello is picking up non-specific binding events, which are unlikely to be a source of noise in an actual ChIP-seq sample.

Ritornello tends to call many more peaks in ChIP-seq samples than in control samples

Table 4.
Ritornello tends to call many more peaks in ChIP-seq samples than in control samples
graphic
graphic
Table 4.
Ritornello tends to call many more peaks in ChIP-seq samples than in control samples
graphic
graphic

DISCUSSION

In this work, we demonstrated that we could infer the entire FLD, rather than only the mean fragment length, using a deconvolution approach from single-end TF ChIP-seq experiments. We derived an experiment-specific probabilistic model to mathematically describe the well-known bimodal shape of TF binding. Using this bimodal shape, we applied the matched filter technique from signal-processing to identify potential TF-binding sites and used a Poisson GLM to deconvolve the binding intensities and test the significance of each putative-binding event. Our model efficiently deconvolves the effect of neighboring peaks as well as noise to resolve multiple adjacent-binding events. We compared Ritornello (a control-free approach) with two popular algorithms recommended by ENCODE, MACS2-I and GEM-I, which require matched controls to reduce false positives. We found that Ritornello outperforms these other methods when input is unavailable and performs similarly when it is in terms of reproducibility between biological replicates, motif enrichment and the coverage patterns of unique reproducible peaks.

We also identified artifactual-binding regions where the local cross-correlation peaks at read length instead of around fragment length. We elucidated that these artifactual regions contribute to the phantom peaks associated with poor experimental quality. Current peak calling algorithms, such as MACS2 and GEM, rely on matched control samples to remove a substantial fraction of these artifacts. We provide an extensive description of this specific category of artifacts and their origin, and offer an automated approach to filter out artifacts without requiring matched controls. Taken together, Ritornello offers an alternative that obviates the need for a match control, demonstrating that one can safely reduce the total experimental cost of TF ChIP-seq experiments, while providing superior analytic results.

ENCODE provides a blacklist of genomic regions which contains artifactually high read coverage in different ChIP-seq experiments (61). This manually curated blacklist largely overlaps with repetitive regions in the genome (61). The blacklist has several drawbacks: (i) it does not cover all artifactual regions, (ii) it is not generalizable to different cell types and (iii) it is only available for human and mouse. We note that a few peak calling tools, such as PePr (62), optionally remove artifacts in regions where the local read coverage in the ChIP is similar to that in the matched control. However, these methods still require a matched control. Ritornello is capable of removing artifacts independently without requiring either prior knowledge of a blacklist or matched negative controls.

Many variables influence the quality of ChIP-seq experiments and our ability to infer true-binding events from the data. These include factors such as antibody efficiency, DNA fragmentation, PCR amplification, sequencing depth, read mapping quality etc. Each of these factors may vary from one sample to another. Ritornello is designed to implicitly take into consideration these experiment specific parameters from raw data and is applicable to a wide variety of protocols. Additionally, it does not require any tuning of parameters. One limitation of Ritornello is that it is designed to detect point-source peaks such as TF-binding events. For broad-source peaks, such as epigenetic modifications, we recommend other peak callers.

We note that there are scenarios where Ritornello is expected to exceed the performance of control utilizing algorithms. These include: (i) when the control is of poor quality and (ii) positions where methods considering controls fail to eliminate read length artifacts. Additionally, in regions with very low coverage or high amounts of noise that obscure the binding shape of potential peaks, Ritornello will not call-binding events, and will thus report peaks more conservatively than MACS2-I and GEM-I.

We demonstrated that in TF ChIP-seq experiments that would otherwise be discarded due to low quality, Ritornello (as well as MACS2-I or GEM-I) might reliably recover true-binding events amidst the high levels of artifacts. Often repeat experiments with the same reagents are of poor quality according to ENCODE’s metric and thus algorithms capable of handling such data are required. If, however, the repeated experiments were to improve results, the previous poor quality samples may still be of value to strengthen the findings of the higher quality samples.

SOFTWARE AVAILABILITY

Ritornello was programmed in C++ using the FFTW (63) library for fast computation of the Fourier transform and the Samtools (64) library for interfacing with the sequence alignment/map format which has become the standard in high throughput sequencing. Ritornello is freely available for download at https://github.com/KlugerLab/Ritornello together with a detailed tutorial. Further analysis and graphics were made using the R statistical language (65).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank Francesco Strino, Vladimir Rokhlin, Ronen Talmon and Ronald Coifman for insightful discussions.

FUNDING

National Institute of Health (NIH) [U54HG006996-03 to S.W., 1R01HG008383-01A1 to Y.K., R01 GM086852 to Y.K.]. Funding for open access charge: NIH [1R01 HG008383-01A1 to Y.K, R01 GM086852 to Y.K.].

Conflict of interest statement. None declared.

REFERENCES

1.

Vaquerizas
J.M.
,
Kummerfeld
S.K.
,
Teichmann
S.A.
,
Luscombe
N.M.
A census of human transcription factors: function, expression and evolution
.
Nat. Rev. Genet.
2009
;
10
:
252
263
.

2.

Landt
S.G.
,
Marinov
G.K.
,
Kundaje
A.
,
Kheradpour
P.
,
Pauli
F.
,
Batzoglou
S.
,
Bernstein
B.E.
,
Bickel
P.
,
Brown
J.B.
,
Cayting
P.
et al.
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
.
Genome Res.
2012
;
22
:
1813
1831
.

3.

Zhang
Y.
,
Liu
T.
,
Meyer
C.A.
,
Eeckhoute
J.
,
Johnson
D.S.
,
Bernstein
B.E.
,
Nusbaum
C.
,
Myers
R.M.
,
Brown
M.
,
Li
W.
et al.
Model-based analysis of ChIP-Seq (MACS)
.
Genome Biol.
2008
;
9
:
R137
.

4.

Boyle
A.P.
,
Guinney
J.
,
Crawford
G.E.
,
Furey
T.S.
F-Seq: a feature density estimator for high-throughput sequence tags
.
Bioinformatics
.
2008
;
24
:
2537
2538
.

5.

Chen
X.
,
Xu
H.
,
Yuan
P.
,
Fang
F.
,
Huss
M.
,
Vega
V.B.
,
Wong
E.
,
Orlov
Y.L.
,
Zhang
W.
,
Jiang
J.
et al.
Integration of external signaling pathways with the core transcriptional network in embryonic stem cells
.
Cell
.
2008
;
133
:
1106
1117
.

6.

Valouev
A.
,
Johnson
D.S.
,
Sundquist
A.
,
Medina
C.
,
Anton
E.
,
Batzoglou
S.
,
Myers
R.M.
,
Sidow
A.
Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data
.
Nat. Methods
.
2008
;
5
:
829
834
.

7.

Fejes
A.P.
,
Robertson
G.
,
Bilenky
M.
,
Varhol
R.
,
Bainbridge
M.
,
Jones
S.J.
FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology
.
Bioinformatics
.
2008
;
24
:
1729
1730
.

8.

Zang
C.
,
Schones
D.E.
,
Zeng
C.
,
Cui
K.
,
Zhao
K.
,
Peng
W.
A clustering approach for identification of enriched domains from histone modification ChIP-Seq data
.
Bioinformatics
.
2009
;
25
:
1952
1958
.

9.

Rozowsky
J.
,
Euskirchen
G.
,
Auerbach
R.K.
,
Zhang
Z.D.
,
Gibson
T.
,
Bjornson
R.
,
Carriero
N.
,
Snyder
M.
,
Gerstein
M.B.
PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls
.
Nat. Biotechnol.
2009
;
27
:
66
75
.

10.

Guttman
M.
,
Garber
M.
,
Levin
J.Z.
,
Donaghey
J.
,
Robinson
J.
,
Adiconis
X.
,
Fan
L.
,
Koziol
M.J.
,
Gnirke
A.
,
Nusbaum
C.
et al.
Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs
.
Nat. Biotechnol.
2010
;
28
:
503
510
.

11.

Wang
C.
,
Xu
J.
,
Zhang
D.
,
Wilson
Z.A.
,
Zhang
D.
An effective approach for identification of in vivo protein-DNA binding sites from paired-end ChIP-Seq data
.
BMC Bioinformatics
.
2010
;
11
:
81
.

12.

Qin
Z.S.
,
Yu
J.
,
Shen
J.
,
Maher
C.A.
,
Hu
M.
,
Kalyana-Sundaram
S.
,
Yu
J.
,
Chinnaiyan
A.M.
HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data
.
BMC Bioinformatics
.
2010
;
11
:
369
.

13.

Boeva
V.
,
Surdez
D.
,
Guillon
N.
,
Tirode
F.
,
Fejes
A.P.
,
Delattre
O.
,
Barillot
E.
De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis
.
Nucleic Acids Res.
2010
;
38
:
e126
.

14.

Rashid
N.U.
,
Giresi
P.G.
,
Ibrahim
J.G.
,
Sun
W.
,
Lieb
J.D.
ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions
.
Genome Biol.
2011
;
12
:
R67
.

15.

Newkirk
D.
,
Biesinger
J.
,
Chon
A.
,
Yokomori
K.
,
Xie
X.
AREM: aligning short reads from ChIP-sequencing by expectation maximization
.
J. Comput. Biol.
2011
;
18
:
1495
1505
.

16.

Cairns
J.
,
Spyrou
C.
,
Stark
R.
,
Smith
M.L.
,
Lynch
A.G.
,
Tavare
S.
BayesPeak–an R package for analysing ChIP-seq data
.
Bioinformatics
.
2011
;
27
:
713
714
.

17.

Halbritter
F.
,
Vaidya
H.J.
,
Tomlinson
S.R.
GeneProf: analysis of high-throughput sequencing experiments
.
Nat. Methods
.
2012
;
9
:
7
8
.

18.

Muino
J.M.
,
Kaufmann
K.
,
van Ham
R.C.
,
Angenent
G.C.
,
Krajewski
P.
ChIP-seq analysis in R (CSAR): an R package for the statistical detection of protein-bound genomic regions
.
Plant Methods
.
2011
;
7
:
11
.

19.

Song
Q.
,
Smith
A.D.
Identifying dispersed epigenomic domains from ChIP-Seq data
.
Bioinformatics
.
2011
;
27
:
870
871
.

20.

Hower
V.
,
Evans
S.N.
,
Pachter
L.
Shape-based peak identification for ChIP-Seq
.
BMC Bioinformatics
.
2011
;
12
:
15
.

21.

Feng
X.
,
Grossman
R.
,
Stein
L.
PeakRanger: a cloud-enabled peak caller for ChIP-seq data
.
BMC Bioinformatics
.
2011
;
12
:
139
.

22.

Zhang
X.
,
Robertson
G.
,
Krzywinski
M.
,
Ning
K.
,
Droit
A.
,
Jones
S.
,
Gottardo
R.
PICS: probabilistic inference for ChIP-seq
.
Biometrics
.
2011
;
67
:
151
163
.

23.

Guo
Y.
,
Mahony
S.
,
Gifford
D.K.
High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints
.
PLoS Comput. Biol.
2012
;
8
:
e1002638
.

24.

Micsinai
M.
,
Parisi
F.
,
Strino
F.
,
Asp
P.
,
Dynlacht
B.D.
,
Kluger
Y.
Picking ChIP-seq peak detectors for analyzing chromatin modification experiments
.
Nucleic Acids Res.
2012
;
40
:
e70
.

25.

Narlikar
L.
,
Jothi
R.
ChIP-Seq data analysis: identification of protein-DNA binding sites with SISSRs peak-finder
.
Methods Mol. Biol.
2012
;
802
:
305
322
.

26.

Kumar
V.
,
Muratani
M.
,
Rayan
N.A.
,
Kraus
P.
,
Lufkin
T.
,
Ng
H.H.
,
Prabhakar
S.
Uniform, optimal signal processing of mapped deep-sequencing data
.
Nat. Biotechnol.
2013
;
31
:
615
622
.

27.

Wang
R.
,
Hsu
H.K.
,
Blattler
A.
,
Wang
Y.
,
Lan
X.
,
Wang
Y.
,
Hsu
P.Y.
,
Leu
Y.W.
,
Huang
T.H.
,
Farnham
P.J.
et al.
LOcating non-unique matched tags (LONUT) to improve the detection of the enriched regions for ChIP-seq data
.
PLoS One
.
2013
;
8
:
e67788
.

28.

Kruczyk
M.
,
Umer
H.M.
,
Enroth
S.
,
Komorowski
J.
Peak Finder Metaserver: a novel application for finding peaks in ChIP-seq data
.
BMC Bioinformatics
.
2013
;
14
:
280
.

29.

Wang
J.
,
Lunyak
V.V.
,
Jordan
I.K.
BroadPeak: a novel algorithm for identifying broad peaks in diffuse ChIP-seq datasets
.
Bioinformatics
.
2013
;
29
:
492
493
.

30.

Nakato
R.
,
Itoh
T.
,
Shirahige
K.
DROMPA: easy-to-handle peak calling and visualization software for the computational analysis and validation of ChIP-seq data
.
Genes Cells
.
2013
;
18
:
589
601
.

31.

Sun
G.
,
Chung
D.
,
Liang
K.
,
Keles
S.
Statistical analysis of ChIP-seq data with MOSAiCS
.
Methods Mol. Biol.
2013
;
1038
:
193
212
.

32.

Kim
N.K.
,
Jayatillake
R.V.
,
Spouge
J.L.
NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data
.
BMC Genomics
.
2013
;
14
:
349
.

33.

Taskesen
E.
,
Hoogeboezem
R.
,
Delwel
R.
,
Reinders
M.J.
Hypergeometric analysis of tiling-array and sequence data: detection and interpretation of peaks
.
Adv. Appl. Bioinform. Chem.
2013
;
6
:
55
62
.

34.

Chung
D.
,
Park
D.
,
Myers
K.
,
Grass
J.
,
Kiley
P.
,
Landick
R.
,
Keles
S.
dPeak: high resolution identification of transcription factor binding sites from PET and SET ChIP-Seq data
.
PLoS Comput. Biol.
2013
;
9
:
e1003246
.

35.

Ashoor
H.
,
Herault
A.
,
Kamoun
A.
,
Radvanyi
F.
,
Bajic
V.B.
,
Barillot
E.
,
Boeva
V.
HMCan: a method for detecting chromatin modifications in cancer samples using ChIP-seq data
.
Bioinformatics
.
2013
;
29
:
2979
2986
.

36.

Zeng
X.
,
Sanalkumar
R.
,
Bresnick
E.H.
,
Li
H.
,
Chang
Q.
,
Keles
S.
jMOSAiCS: joint analysis of multiple ChIP-seq datasets
.
Genome Biol.
2013
;
14
:
R38
.

37.

Kallio
A.
,
Elo
L.L.
Optimizing detection of transcription factor-binding sites in ChIP-seq experiments
.
Methods Mol. Biol.
2013
;
1038
:
181
191
.

38.

Bardet
A.F.
,
Steinmann
J.
,
Bafna
S.
,
Knoblich
J.A.
,
Zeitlinger
J.
,
Stark
A.
Identification of transcription factor binding sites from ChIP-seq data at high resolution
.
Bioinformatics
.
2013
;
29
:
2705
2713
.

39.

Li
Y.
,
Umbach
D.M.
,
Li
L.
T-KDE: a method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets
.
BMC Genomics
.
2014
;
15
:
27
.

40.

Wu
H.
,
Ji
H.
PolyaPeak: detecting transcription factor binding sites from ChIP-seq using peak shape information
.
PLoS One
.
2014
;
9
:
e89694
.

41.

Lund
E.
,
Oldenburg
A.R.
,
Collas
P.
Enriched domain detector: a program for detection of wide genomic enrichment domains robust against local variations
.
Nucleic Acids Res.
2014
;
42
:
e92
.

42.

Hansen
P.
,
Hecht
J.
,
Ibrahim
D.M.
,
Krannich
A.
,
Truss
M.
,
Robinson
P.N.
Saturation analysis of ChIP-seq data for reproducible identification of binding peaks
.
Genome Res.
2015
;
25
:
1391
1400
.

43.

Ibrahim
M.M.
,
Lacadie
S.A.
,
Ohler
U.
JAMM: a peak finder for joint analysis of NGS replicates
.
Bioinformatics
.
2015
;
31
:
48
55
.

44.

Jalili
V.
,
Matteucci
M.
,
Masseroli
M.
,
Morelli
M.J.
Using combined evidence from replicates to evaluate ChIP-seq peaks
.
Bioinformatics
.
2015
;
31
:
2761
2769
.

45.

Barski
A.
,
Cuddapah
S.
,
Cui
K.
,
Roh
T.Y.
,
Schones
D.E.
,
Wang
Z.
,
Wei
G.
,
Chepelev
I.
,
Zhao
K.
High-resolution profiling of histone methylations in the human genome
.
Cell
.
2007
;
129
:
823
837
.

46.

Gomes
A.L.
,
Abeel
T.
,
Peterson
M.
,
Azizi
E.
,
Lyubetskaya
A.
,
Carvalho
L.
,
Galagan
J.
Decoding ChIP-seq with a double-binding signal refines binding peaks to single-nucleotides and predicts cooperative interaction
.
Genome Res.
2014
;
24
:
1686
1697
.

47.

Lun
D.S.
,
Sherrid
A.
,
Weiner
B.
,
Sherman
D.R.
,
Galagan
J.E.
A blind deconvolution approach to high-resolution mapping of transcription factor binding sites from ChIP-seq data
.
Genome Biol.
2009
;
10
:
R142
.

48.

Stanton
K.P.
,
Parisi
F.
,
Strino
F.
,
Rabin
N.
,
Asp
P.
,
Kluger
Y.
Arpeggio: harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures
.
Nucleic Acids Res.
2013
;
41
:
e161
.

49.

Ramachandran
P.
,
Palidwor
G.A.
,
Porter
C.J.
,
Perkins
T.J.
MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data
.
Bioinformatics
.
2013
;
29
:
444
450
.

50.

Lederman
R.
A random-permutations-based approach to fast read alignment
.
BMC Bioinformatics
.
2013
;
14
(
Suppl. 5
):
S8
.

51.

North
D.O.
An analysis of the factors which determine signal/noise discrimination in pulsed-carrier systems
.
Proc. IEEE
.
1963
;
51
:
1016
1027
.

52.

Ruud
P.
A Comparison of the EM and Newton–Raphson Algorithms
.
Economics Working Papers 89-105
.
1989
;
Berkeley
:
University of California at Berkeley
.

53.

Wilks
S.S.
The large-sample distribution of the likelihood ratio for testing composite hypotheses
.
Ann. Math. Statist.
1938
;
9
:
60
62
.

54.

Benjamini
Y.
,
Hochberg
Y.
Controlling the false discovery rate: a practical and powerful approach to multiple testing
.
J. R. Stat. Soc. B Stat. Methodol.
1995
;
57
:
289
300
.

55.

Benjamini
Y.
,
Yekutieli
D.
The control of the false discovery rate in multiple testing under dependency
.
Ann. Stat.
2001
;
29
:
1165
1188
.

56.

Consortium
E.P.
The ENCODE (ENCyclopedia Of DNA Elements) Project
.
Science
.
2004
;
306
:
636
640
.

57.

Consortium
E.P.
An integrated encyclopedia of DNA elements in the human genome
.
Nature
.
2012
;
489
:
57
74
.

58.

Li
Q.
,
Brown
J.B.
,
Huang
H.
,
Bickel
P.J.
Measuring reproducibility of high-throughput experiments
.
Ann. App. Stat.
2011
;
5
:
1752
1779
.

59.

Mathelier
A.
,
Zhao
X.
,
Zhang
A.W.
,
Parcy
F.
,
Worsley-Hunt
R.
,
Arenillas
D.J.
,
Buchman
S.
,
Chen
C.-y.Y.
,
Chou
A.
,
Ienasescu
H.
et al.
JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
.
Nucleic Acids Res.
2014
;
42
:
D142
D147
.

60.

Iseli
C.
,
Ambrosini
G.
,
Bucher
P.
,
Jongeneel
C.V.
Indexing strategies for rapid searches of short words in genome sequences
.
PLoS One
.
2007
;
2
:
e579
.

61.

Carroll
T.S.
,
Liang
Z.
,
Salama
R.
,
Stark
R.
,
de Santiago
I.
Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data
.
Front. Genet.
2014
;
5
:
75
.

62.

Zhang
Y.
,
Lin
Y.H.
,
Johnson
T.D.
,
Rozek
L.S.
,
Sartor
M.A.
PePr: a peak-calling prioritization pipeline to identify consistent or differential peaks from replicated ChIP-Seq data
.
Bioinformatics
.
2014
;
30
:
2568
2575
.

63.

Frigo
M.
Ryder
BG
,
Zorn
BG
,
Berman
AM
A fast Fourier transform compiler
.
Acm Sigplan Notices
.
1999
;
34
:
NY
:
ACM
.
169
180
.

64.

Li
H.
,
Handsaker
B.
,
Wysoker
A.
,
Fennell
T.
,
Ruan
J.
,
Homer
N.
,
Marth
G.
,
Abecasis
G.
,
Durbin
R.
The sequence alignment/map format and SAMtools
.
Bioinformatics
.
2009
;
25
:
2078
2079
.

65.

R Core Team
R: A Language and Environment for Statistical Computing
.
2013
;
Vienna
:
R Foundation for Statistical Computing
.

Author notes

These authors contributed equally to the paper as first authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Supplementary data

Comments

0 Comments
Submit a comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.