Galaxy CLIP-Explorer: a web server for CLIP-Seq data analysis

Abstract Background Post-transcriptional regulation via RNA-binding proteins plays a fundamental role in every organism, but the regulatory mechanisms lack important understanding. Nevertheless, they can be elucidated by cross-linking immunoprecipitation in combination with high-throughput sequencing (CLIP-Seq). CLIP-Seq answers questions about the functional role of an RNA-binding protein and its targets by determining binding sites on a nucleotide level and associated sequence and structural binding patterns. In recent years the amount of CLIP-Seq data skyrocketed, urging the need for an automatic data analysis that can deal with different experimental set-ups. However, noncanonical data, new protocols, and a huge variety of tools, especially for peak calling, made it difficult to define a standard. Findings CLIP-Explorer is a flexible and reproducible data analysis pipeline for iCLIP data that supports for the first time eCLIP, FLASH, and uvCLAP data. Individual steps like peak calling can be changed to adapt to different experimental settings. We validate CLIP-Explorer on eCLIP data, finding similar or nearly identical motifs for various proteins in comparison with other databases. In addition, we detect new sequence motifs for PTBP1 and U2AF2. Finally, we optimize the peak calling with 3 different peak callers on RBFOX2 data, discuss the difficulty of the peak-calling step, and give advice for different experimental set-ups. Conclusion CLIP-Explorer finally fills the demand for a flexible CLIP-Seq data analysis pipeline that is applicable to the up-to-date CLIP protocols. The article further shows the limitations of current peak-calling algorithms and the importance of a robust peak detection.


Findings Background
RNA plays a fundamental role in many regulatory processes like splicing or translation. Yet, processes like translation also undergo regulatory steps involving proteins such as elongation factors. These RBPs (RNA-binding proteins) interact with their target RNA and form ribonucleoprotein complexes [1]. Studies have revealed the involvement of RBPs in stages like splicing, polyadenylation, localization, translation, stability, and degradation [2,3,4,5]. So far more than a thousand RBPs have been identi ed in human cells [2,6,7]. Various RBPs have been linked to neurodegenerative diseases and various types of cancer [2,4,8,9]. These observations emphasize the importance to explore the mechanisms behind the regulatory processes mediated by RBPs.
Crosslinking and immunoprecipitation (CLIP) facilitates the analysis of the interdependence between the proteome and transcriptome in vivo [10] by detecting binding sites for RBPs on a genome-wide level. Many CLIP protocols such as PAR-CLIP [11], iCLIP [12], or eCLIP [13] emerged over a short period of time and new methods are still in development [14]. All methods consist of three fundamental steps: crosslinking the RBP of interest to its target RNAs, puri cation and immunoprecipitation of the resulting complexes, and high-throughput sequencing of the resulting RNAs. Despite these commonalities, protocols such as iCLIP or eCLIP perform additional steps to increase the precision of the CLIP-Seq experiment [13,15,16,17], which have to be covered by additional analysis tasks. For example, iCLIP introduced random barcodes (unique molecular identi ers, short UMIs) to reduce the number of duplicated reads [12]. Protocols like eCLIP [13] and uvCLAP [18] adapted this procedure. A deduplication step is therefore imperative for iCLIP, eCLIP, FLASH, and uvCLAP [12,19].
Because of the complexity and variety of CLIP protocols, the computational analysis is still the critical bottleneck, both in time and reproducibility. Individual tools that perform quality control, mapping, peak calling, and motif detection for CLIP-Seq data exist. However, an automatic and complete data analysis pipeline has to deal with a big list of obstacles such as biases that are introduced by the CLIP-Seq protocol and experimental conditions. On top of this, additional problems arise from changing hardware and tool versions, practicality of the user interface, di erent library formats (e.g., biological replicates or multiplexed data), and di erent CLIP-Seq data formats for old, recent, or upcoming protocols. Furthermore, each tool for each subtask has di erent assumptions and parameters that need to be optimized for the underlying protocol [19]. The most challenging task is the binding site identi cation, where a couple of di erent peakcallers, such as Piranha [1], PEAKachu [20], CLIPper [21], and PureCLIP [22], exist. For example, biological replicates are not supported by some peakcallers like Piranha [1]. These obstacles lead to a lack of reproducibility for the CLIP-Seq data analysis.
One possible solution could be one big, but xed pipeline that can cope with every possible type of data. This solution was already tried in the case of PIPE-CLIP [23] or CLIPSeqTools [24]. Nevertheless, it is intractable to cover all possible combinations of di erent experimental settings, such as the number of replicates, the existence of a control library and others. Focussing instead on one speci c type of data is easier to handle, like analyzing only iCLIP data with iCount [25]. However, neither PIPE-CLIP, nor CLIPSeqTools and iCount can be quickly and simply expanded or modi ed. They lack the option for an extension to cover noncanonical experimental data or new CLIP-Seq data types such as eCLIP, FLASH, or uvCLAP.
CLIP-Explorer provides all necessary tools to analyze eCLIP, FLASH, uvCLAP and iCLIP data.
CLIP-Explorer is well documented through an online tutorial in the Galaxy training material (https://galaxyproject.github.io/trainingmaterial/topics/transcriptomics/tutorials/clipseq/tutorial.html) and the main domain. Both websites assist the user to understand the main steps and parameters of the pipeline and the featured tools. The user can then, for example, replace the peakcaller or read mapper. It is not required to have detailed knowledge about the tools. CLIP-Explorer works in a server environment, thus the user does not have to worry about varying hardware or tool versions. A constant maintenance of CLIP-Explorer makes the data analysis easy to reproduce.
We have validated CLIP-Explorer on eCLIP data of DROSHA, HNRNPK, IGF2BP1, KHDRBS1, LIN28B, PTBP1, QKI, SLBP and U2AF2. We compared the results with a di erent analysis pipeline and databases, nding great diversity in the number of predicted peaks and found motifs. A more comprehensive analysis including the peakcallers Piranha, PureCLIP, and PEAKachu was done for RBFOX2 [13] as it has well documented targets and motifs. The protein RBFOX2 encoded by the gene RBM9 is a tissue-speci c splicing factor involved in de-velopmental processes [21,27]. Studies have shown RBFOX2's binding preference for introns close to di erentially spliced exons [28,29]. The conserved sequence motif TGCATG has been shown to be enriched in RBPFOX2's binding sites [21,28,29]. Concerning the inconsistent results of the peak calling, we propose standard guidelines for the peak calling for di erent experimental setups. We con rm RBFOX2's binding characteristics from the literature as another validation for CLIP-Explorer.

CLIP-Explorer: A Versatile Pipeline for the Analysis of CLIP-Seq Data
Di erent experimental settings require di erent analysis pipelines, since pre-processing, mapping, peak calling and motif detection have to be adapted. For that reason, CLIP-Explorer integrates several pipelines for analyzing di erent protocols, namely eCLIP, iCLIP, FLASH and uvCLAP. Common to all pipelines in CLIP-Explorer is the division into four major steps ( Figure 1). In the pre-processing, the read library is demultiplexed and, if necessary, adapter sequences as well as inline barcodes and UMIs are removed. In the post-processing, the reads are aligned and deduplicated. CLIP-Explorer then identi es di erentially enriched regions (peaks) that are further analyzed according to genomic localization and other criteria to investigate the precise function of the protein and properties of its targets. All subtasks are accompanied by quality control steps. The versatility of CLIP-Explorer allows the user to select three di erent peak calling pipelines for three different data speci cations. The methods section covers CLIP-Explorer in more detail. Additional information can be found in the Galaxy training material.

Recommendations for PEAKachu, Piranha, and Pure-CLIP
To this day, a benchmarking dataset for CLIP-Seq data analysis does not exist because of missing experimental methods to verify predicted binding sites. It is therefore recommended to test more than one peakcaller and to test more than one parameter set, which is easily possible with CLIP-Explorer.
The newest version of PEAKachu [20] is well-suited for an experimental setup with at least two replicates for the CLIP experiment and at least two replicates for the control experiment, because it uses DESeq2 [43]. It is therefore best to turn on the DESeq2 normalization. If the user has less replicates, the user can still use PEAKachu, but DESeq2 requires at least two replicates for both experiment and control to calculate p-values. With less than two replicates, PEAKachu lters the peaks based on the fold change and the mad (median absolute deviation) multiplier. It is therefore wise to check the peaks with another peakcaller or peak calling pipeline. PEAKachu works best in the adaptive mode with a mad multiplier of 0.0, a log2 fold change threshold of 2.0, and an adjusted p-value threshold of 0.05. PEAKachu needs the parameter of the maximum insert size, identi ed beforehand by Picard [44]. The estimation of the insert size is only necessary if the user provides paired-end read data. The window mode of PEAKachu is not recommended, because it is rather unstable. The mad multiplier might reduce the number of peaks, as it works as a second cuto . To get the full peak set, it is wise to leave it at zero and lter the peaks by the log2 fold change together with the adjusted pvalue. One key parameter is the minimum block overlap that has to be tested with the default of 0.5 in the beginning. The user has to increase this parameter if the results show a lot of peaks in a close vicinity. Another critical parameter is the minimum cluster expression fraction and the minimum block expression. These parameters can change the total number of  CLIP-Explorer has three major steps. In the pre-processing, the read library is demultiplexed, if necessary, and adapter sequences as well as in-line barcodes and UMIs are removed. CLIP-Explorer uses Je [30], UMI-Tools [31], bctools [32], and Cutadapt [33] for that purpose. A quality control step using FastQC [34] follows the pre-processing. In the post-processing, the reads are aligned with STAR [35], Bowtie [36] or segemehl [37], ltered using SAMtools [38], bedtools [39], and bctools, and deduplicated with UMI-tools. Another quality control, mainly with deeptools [40], checks the batch quality. Finally, CLIP-Explorer identi es di erentially enriched regions using either PEAKachu [20], Piranha [1], or PureCLIP [22]. The binding regions are then analyzed with RCAS [41] and MEME-ChIP [42]. predicted peaks. Leave them in default with 0.01 and 0.1 and adjust the parameters if some interesting binding regions are not covered by PEAKachu's prediction. PEAKachu can be applied to iCLIP, eCLIP, FLASH, uvCLAP and even older protocols like PAR-CLIP.
Piranha [1] can be applied to an experimental setup with or without control, but it does not support replicates for the CLIP experiment. Each replicate has to be treated separately or further validated with a robust peak detection. However, if the user has only one replicate and no control, we recommend to use Piranha. If a control is provided, Piranha uses a zerotruncated negative binomial regression by default. Without a control it is wise to stick to a negative binomial. The distance to merge signi cant bins is one of the most crucial parameters of Piranha, similar to the minimum block overlap of PEAKachu. If the user observes a lot of peaks in a close vicinity, then this parameter has to be increased (e.g, 10). The bin size of the signal and control is another crucial parameter of Piranha that needs to be optimized. If the bin size is quite big (e.g., 200), then Piranha might miss a few good candidates. If the bin size is very small (e.g., 5), then Piranha predicts a lot of false positives. Piranha can also be applied to iCLIP, eCLIP, FLASH, uvCLAP, and even older protocols like PAR-CLIP.
PureCLIP [22] can be applied to an experimental setup with or without control, but it does not support replicates by the time of our analysis. It is therefore best to apply PureCLIP, as well as Piranha, to each replicate separately and nd robust peaks by intersection, merging or calculating an IDR. Therefore, we recommend to use PureCLIP if the user has only one replicate for the CLIP and control experiment. PureCLIP already incorporates two default parameters sets. One set can be used if the protein is assumed to bind low complex motifs, which results in more broader and unspeci c binding sites. PureCLIP predicts not only the binding region, but also the crosslinking sites. In our tests, PureCLIP quite often reported very small binding regions, almost identical to the crosslinking sites. It is therefore recommended to slightly extend the predicted binding sites (e.g, ve to ten bases to the left and right) to cover the whole binding region. Furthermore, if the user provides paired-end reads, the mate containing the crosslinking event has to be provided explicitly. For iCLIP, FLASH, and uvCLAP this corresponds to the rst mate, while for eCLIP it is the second mate. PureCLIP was speci cally designed for eCLIP and iCLIP [22]. We recommend to use the peakcaller only for those protocols or other variants such as FLASH or uvCLAP.
It is not easy to nd a standard peak calling algorithm with a standard parameter set. We tried to cover possible cases and recommendations for Piranha, PEAKachu, and PureCLIP, but these tools can change over time, or a new peakcaller might outrank them. The user can therefore nd permanently updated recommendations and guidelines for a CLIP-Seq data analysis on CLIP-Explorer's main domain.

E ects of Using Di erent Peakcallers for RBFOX2
We use eCLIP data from DROSHA, HNRNPK, IGF2BP1, KHDRBS1, LIN28B, PTBP1, QKI, RBFOX2, SLBP, and U2AF2 from the study by Van Nostrand et al. [13] to validate CLIP-Explorer (see Methods). We rst use the data of RBFOX2 to check the robustness and quality of the predicted binding sites of CLIP-Explorer, since the protein has a well-known binding motif. We intersect the peaks of Piranha, PEAKachu, and PureCLIP with bedtools [39] (see Methods). Piranha detects the highest number of potential binding regions for RBFOX2 ( Figure 2a). Yet, more than one third of Piranha's peaks are not included in PEAKachu's and PureCLIP's peak set. PureCLIP has the highest fraction of peaks shared with the other two peakcallers, but it also has the lowest total number of peaks. PEAKachu on the other hand also has a high number of individual peaks, but signi cantly less than Piranha. To check for the origin of the di erence in the number of peaks, we compare all three peakcallers (PEAKachu, Piranha, PureCLIP), including the CLIPper peaks of the study by Van Nostrand et al. [13] for RBFOX2, regarding potential CLIP-Seq artifacts and biases. The table 1 shows that less than 1% of the peaks from all peakcallers overlap with 3' or 5' UTRs. PEAKachu has 362 peaks that overlap with repeats (2.45%), however all peakcallers have a rate below 1% of peaks that over-lap with intron repeats or any RNA pseudogene region. We also check the peak length and the distance between the peaks of the di erent peakcallers (Supplementary Table 3). PEAKachu has on average larger peaks with 131 nucleotides. On the other hand, PureCLIP has the smallest peaks with an average of 36 nucleotides. The distribution of Piranha is a constant of 20 nucleotides because the peak length is a parameter the user de nes for the tool. Furthermore, the distance between the a) b) A:7714  (a) We intersected the binding regions identi ed by the peakcallers with bedtools [39], paying attention to the strand (intersect -s), to assess the robustness of each method. (b) We then annotated the binding regions of each peakcaller and plotted the fraction of each target. We also analyzed separately the peak set of the individual peakcallers (O = peak set 7714/177/10589) and the peak set of the intersection with all peakcallers (I = peak set 2134/2117/2769). RBFOX2 prevalently binds introns, but also 3' and 5' UTRs as well as lincRNAs. The plot was generated with the hg19 script of targetdist [45]. Investigating the mean coverage of these binding regions identi ed by (c) PEAKachu [20] reveals an occupancy drop around the 5' and 3' ends of the exons. (d) Looking further at the mean coverage of the binding regions identi ed by PEAKachu in the overall transcript as well as for the 5' and 3' UTRs. The thickness of the ribbon around the mean coverage indicates the 95% con dence interval (mean +-standard error of the mean times 1.96). Each feature is divided into 100 bins of equal length, whereas features smaller than 100 bp are excluded [41].  Table 3). Only Piranha has slightly more peaks, which are close together.
To identify the type of bound genomic regions, we annotate the peaks discovered by the three di erent peakcallers. The distributions of binding sites generated by these three tools show a similar trend and prevalence for introns as the main target of RBFOX2 (Figure 2b). Yet, introns are not the only target. Our ndings suggest that some binding regions of RBFOX2 lie in 3' and 5' UTRs, but this fraction is not as big as for introns. PEAKachu, Piranha, and PureCLIP further detect another chunk of target sites in lincRNAs, but the portion detected by PEAKachu is smaller, probably because of a low coverage of lin-cRNAs in CLIP protocols. In addition, we annotated separately the peak set of the individual peakcallers (7714/177/10589, from Figure 2a) and the peak set of the intersection with all peakcallers (2134/2117/2769). The target distribution of Piranha and PureCLIP changes for the individual peak set. The portion of intron coverage shrinks and the portion exon targets increases as well as lincRNA, rRNA, snoRNA, and snRNA. In contrast, the target distributions of PEAKachu looks similar. However, the target distribution of the intersection of all three peakcallers looks di erent in comparison to the distribution of all peaks. PEAKachu's distribution has a smaller intron and a bigger exon and pseudogene portion. On the other hand, the peaks from the intersection set of Piranha and PureCLIP overlap more with introns and less with exons or any snRNA, snoRNA, tRNA, or rRNA.
To verify that RBFOX2 is a splicing factor, we investigate the peak pro le of the binding sites by looking at the coverage plot of RBFOX2. PEAKachu depict a drop in the binding region coverage around the exon-intron-boundaries at the 5' and 3' ends ( Figure 2c). The drop is more intense for the sites found by PEAKachu. Checking the mean coverage of the binding regions (Figure 2d), the results suggest a binding prevalence of RBFOX2 in the upstream region of the transcripts. Furthermore, RBFOX2 seems to target the beginning of the 5' UTR. In contrast, the binding coverage is homogeneous for the 3' UTR. The sequence UGCAUG seems to play an important part for the binding of RBFOX2 as it is among the top ve motifs detected by the PEAKachu, Piranha and PureCLIP pipeline (see Table 2). The second motif shows guanine richness and the third motif cytosine and uracil richness for all peakcallers.
At the end, we check the function of RBFOX2 to clarify the role in human liver cancer cells (Hep G2). Looking at the top one hundred genomic regions that have RBFOX2 binding sites, we nd Shank2 and Shank3 among the top hits as potential targets. Furthermore, a gene ontology (GO) analysis with RCAS for the targets of RBFOX2 identi es the protein to be relevant for the regulation of RNA splicing (with a Benjamini-Hochberg (BH) adjusted p-value of < 10 -4 ), as well as regulation of transcription by RNA polymerase I (BH adjusted p-value < 10 -3 ). In addition, the GO analysis identi es RBFOX2 to be involved in the regulation of histone modi cations (BH adjusted p-value < 10 -4 ), nucleosome and nucleosomal binding (BH adjusted p-value < 10 -4 and 0.04, respectively) and methyl-CpG binding (BH adjusted p-value < 0.04).
Checking the results for RBFOX2, Piranha found the highest number of peaks. A lot of these peaks, however, might represent false positives since Piranha was executed without the information from the control experiments (see Methods). More than one third of Piranha's peaks were not included in PEAKachu's and PureCLIP's peak set, which endorses the supposition. The observation is also substantiated by the change of the target distribution of the Piranha peaks that do not intersect with the peaks of the other two peakcallers. These peaks might cover false positives that lie in unspeci c regions such as rRNAs. Furthermore, the distance between the peaks is almost identical between the peakcallers. Only Piranha has slightly more peaks, which are closer together indicating a higher number of false positives because it calls many peaks (local maxima) in close proximity and does not combine them into a bigger peak (global maximum). In contrast, PureCLIP had the lowest number of predicted peaks and the highest fractions of peaks shared with the other peakcallers. This indicates that PureCLIP selects the peaks based on very stringent criteria. However, PureCLIP might therefore also have a high false negative rate. PureCLIP calls peaks for each replicate separately, which is likely the reason for missing some good candidates that have been jointly found by PEAKachu. Furthermore, the change of the target distribution for the PureCLIP peaks that do not intersect with the peaks of the other two peakcallers might show possible false positives. Like Piranha's distribution, peaks might overlap with unspeci c regions such as rRNAs. PEAKachu, on the other hand, had the largest peaks. We cannot make a general judgement about a good value for the peak length, because each RBP and each binding region can be speci c and a benchmarking dataset is missing for CLIP-Seq data. PEAKachu's target distribution for peaks that do not intersect with the peaks of the other two peakcallers might suggest some false positives, because of the bigger portion of peaks in pseudogenes in that peak set. We also checked for potential CLIP-Seq biases (annotation from Ensembl) to answer the question of the di erent number in peaks, but we could not nd any evidence regarding a signi cant UTR or repeat region bias for any of the implemented peakcallers.
Despite the disparate numbers of peaks between PEAKachu, Piranha, and PureCLIP, the main motif of RBFOX2 with the sequence UGCAUG was identi ed with CLIP-Explorer for all used peakcallers. The motif seems to be very robust to varying peak calling conditions, which is endorsed by the fact that it can be even found with PureCLIP with a di erent pre-processing (see Table 2). We tested the peakcallers PEAKachu, Piranha, and PureCLIP on the alignments les from the study by Van Nostrand et al. [13]. With CLIP-Explorer all three peakcallers found the main motif with lower background noise as in comparison to the already processed alignments les from ENCODE. Furthermore the motif set from CLIP-Explorer looks more similar between all three peakcallers (see Table 2). It is therefore possible to nd a better ground truth, that is to say, a benchmark set with CLIP-Explorer, since additionally more than one peakcaller can be tested with just a few clicks.
We further tried to verify RBFOX2's known role as splicing factor and that it preferably binds to introns [21,27,28,29] to substantiate the credibility and versatility of CLIP-Explorer. The distributions of binding sites generated by PEAKachu, Piranha, and PureCLIP showed a prevalence for introns as the main target of RBFOX2 in accordance with the literature [28,29], which is further supported by the target distribution of the intersected peaks for all three peakcallers. The drop of the binding occupancy of RBFOX2 around the exon-intron boundaries can also be seen in other studies [28,29]. The binding coverage of RBFOX2 around the exon-intron-boundaries Table 2. Top ve RBFOX2 sequence motifs for each peakcaller identi ed by MEME-ChIP with E-value and the fraction of sequences with that speci c motif, based on the RBFOX2 eCLIP data.

Pre-processing CLIP-Explorer
Pre-processing Van Nostrand et al. [           suggests an involvement of RBFOX2 in the regulation of splicing. The GO term analysis corroborates the hypothesis linking RBFOX2 to various splicing and structure-related processes, such as histone modi cations. Besides, the Shank gene family members, as a potential target of RBFOX2, play an important part in neuronal functions, where alterations in the encoded proteins may be connected to autism [21]. Shank2 and Shank3 might hereby be regulated by alternative splicing [46,47]. Another study has found the same interdependence of RBFOX2 and the Shank protein family [21].

Comparison of CLIP-Explorer's Results
We compare the sequence motifs detected by CLIP-Explorer with two di erent databases [48,49], and then with the peaks identi ed in the study by Van Nostrand et al. [13] (Supplementary Table 1). Here, the CLIPper algorithm [21] was used to identify potential binding regions of the same proteins. For the CLIPper peaks, we predict the sequence motifs with MEME-ChIP in the same way as implemented in CLIP-Explorer.
As a rst step, we focus on the sequence motifs that resulted from CLIP-Explorer using PEAKachu for peak calling. The motifs are similar and sometimes nearly identical to the motifs listed in the databases for HNRNPK, KHDRBS1, PTBP1, QKI, RB-FOX2, and U2AF2 (Supplementary Table 1). For example, the QKI-motif ACUAA [50] or the known motif UGCAUG of RBFOX2 [21,28,29] can be found in the databases and is also detected by CLIP-Explorer and CLIPper. Some proteins such as DROSHA, LIN28B, and SLBP are not listed in the databases, and the proteins IGF2BP1 and RBFOX2 have only one or two motifs. CLIP-Explorer identi es new motifs for these proteins. Several of the new motifs detected by CLIP-Explorer are also detected by CLIPper but not covered by the databases [48,49] or any other literature to the best of our knowledge (Supplementary Table  1). For example, we nd the new sequence motif CAGGCUGG for PTBP1, or the motif ACAG for U2AF2.
For some proteins, such as DROSHA and HNRNPK, the motifs detected by CLIP-Explorer deviate from the corresponding CLIPper motifs, but still show similar sequence compositions (Supplementary Table 1). In addition, CLIPper slightly misses the known motif GGAGA of LIN28B [51].
For a more ne-grained comparison between the PEAKachu pipeline in CLIP-Explorer and the CLIPper pipeline in Van Nostrand et al. [13], we intersect the peaks for each protein with bedtools [39] to check for common binding sites. We require an overlap for bedtools of at least one base (see Methods). This comparison reveals a discrepancy between some proteins. Less then ten percent of all PEAKachu peaks overlap with the CLIPper peaks of the proteins KHDRBS1 (Supplementary Table 2). Quite often, PEAKachu nds more peaks in comparison to CLIPper, except for QKI and U2AF2. For example, CLIPper nds 161 peaks and CLIP-Explorer (PEAKachu) 220 for the protein SLBP. Using PEAKachu, CLIP-Explorer predicts nearly sixty more peaks than the CLIPper pipeline. We suspect that this di erence is due to the computational model of CLIPper, which calls peaks for every replicate separately. PEAKachu, on the other hand, calls with all replicates in mind. Therefore, we investigate the di erence in the CLIPper peak sets for each replicate. The rst replicate encompasses 7, 942 CLIPper peaks and the second replicate 1, 530 peaks, when we lter for entries with a p-value of 0.05 and a log2 fold change of 1, which makes a total di erence of 6, 412 peaks. Intersecting the peaks without an irreducible discovery rate (IDR) results in 291 peaks, more than PEAKachu, and with an IDR CLIPper has the aforementioned 161 peaks, so less than PEAKachu. Thus, roughly 20% of the CLIPper peaks in replicate one are contained in replicate two. Consequently, the di erence in the peak set of PEAKachu (CLIP-Explorer) might come from a di erent noise estimation between replicates, thus detecting other binding sites. Because of the di erent number of peaks, we check for potential CLIP-   Table 4). Both PEAKachu and CLIPper have no peaks in repeat or pseudogene regions, except for one PEAKachu peak in a ncRNA pseudogene. PEAKachu also nds more peaks that overlap with ncRNA genes (48 peaks) than CLIPper (17 peaks). We also check the number of peaks overlapping with histone genes since SLBP targets mainly histone RNAs. PEAKachu nds slightly more peaks in histone regions (163 peaks) than CLIPper (135 peaks). Consequently, PEAKachu covers slightly more peaks in histone regions, but also more peaks in ncRNAs genes, which might represent false positives since SLBP is known to target histone mRNA. The eCLIP pipeline from Van Nostrand et al. [13] already removed repeat regions in a double mapping approach.

Seq biases (Supplementary
To check the correctness of the identi ed new motifs such as CAGGCUGG for PTBP1, we took independent RNA-Seq data from ENCODE from a knockdown and control experiment for PTBP1 (see Methods). We intersected the identi ed binding sites of PTBP1 from CLIP-Explorer (PEAKachu) and CLIPper with the genes of hg38 and calculated the log2 fold change of all genes. Figure 3 clearly shows a signi cant shift in the cumulative density function (CDF) of the fold changes. This observation was expected since true binding sites should have di erent RNA rates in a knockdown experiment. Thus, the identi ed sites from PEAKachu and CLIPper might be real binding sites of PTBP1 as both peakcallers show similar CDFs even though the processing of the data was di erent. Thus we checked the motif CAGGCUGG (from CLIP-Explorer), CCAGGCUG (from CLIPper), and another motif UUCCUUUC (from CLIP-Explorer and CLIPper). All three CDFs are signi cantly shifted. Consequently, CAGGCUGG might not be a false positive.

Potential Implications
CLIP-Explorer is a valuable tool for researchers working with CLIP-Seq data as it simpli es and integrates many processing steps in a well-tested and optimized pipeline. CLIP-Explorer provides the user with an extensive overview of the potential function of the RBP and its target RNAs. It can be easily extended or modi ed, and has no installation overhead as it is integrated in Galaxy. CLIP-Explorer is thus the rst general and fully automatic pipeline for eCLIP, FLASH, and uvCLAP data, which can also be applied to iCLIP and other types of CLIP-Seq protocols. Besides, it is permanently maintained, and new tools can be implemented or exchanged with existing ones to warrant a highly e cient data analysis.
We analyzed di erent eCLIP datasets and compared our ndings with di erent databases [48,49] and the results from the study by Van Nostrand et al. [13]. The analysis of di erent proteins, such as DROSHA, HNRNPK, and in more detail RB-FOX2, showed the strength of the exibility provided by CLIP-Explorer. We could identify similar and even new sequence motifs for PTBP1 with the motif CAGGCUGG, and U2AF2 with ACAG. We also veri ed the sequence motif UGCAUG for RBFOX2 and found other guanine, cytosine and uracil-rich binding regions. The most signi cant sequence motif UGCAUG of RBFOX2 was found by Piranha, PEAKachu, and PureCLIP, even though the three peakcallers predicted three substantially di erent peak sets. Based on our results we recommend to test more than one peak calling algorithm for other RBPs to assess the robustness of the motifs. We showed that di erent peak calling tools might result in di erent peak sets and thus encompass di erent false positives and negatives. Yet, CLIP-Explorer allows this very easily, because of its user-friendly interface. An exchange of the peakcaller can be done in an instant. CLIP-Explorer is also essential for a later data investigation such as sequence motif detection or GO term analysis, since it reduces noise in the data. CLIP-Explorer enabled us to identify the involvement of RBFOX2 in splicing because of the availability of speci c annotation tools like RCAS [41]. We could identify a decrease of RBFOX2 occupancy around the exon-intron boundaries, linked RBFOX2 to the regulation of splicing and DNA structural modi cations and veri ed other observation from previous studies.
We recently have integrated GraphProt [52] into our pipeline, which is one of the popular tools for RBP binding prole predictions. We will also include StoatyDive [53] in a later version for a newly developed peak shape clustering for a better noise reduction, as it showed promising results for CLIP-Seq data. All in all, CLIP-Explorer is a exible and easily extendable pipeline, which greatly simpli es CLIP-Seq data analysis on a transcriptome-and genome-wide scale.

Methods
CLIP-Explorer includes all major processes that are required to analyze CLIP-Seq data. All tools for each step were selected based on review [54,19,17] or benchmark articles (stated for each tool later in the paper), or experiences. The analysis involves three major steps as shown in Figure 1, each followed by a speci c quality control. In the pre-processing step, the data is demultiplexed into the read libraries stemming from di erent experiments. If necessary, adapter sequences (using Cutadapt [33,55]) as well as in-line barcodes and UMIs are removed. FastQC [34] performs during that step a standard quality check for the read and library quality.
The post-processing for CLIP experiments is similar to RNAseq experiments, that is, the reads are aligned and ltered. An additional deduplication step is required in the case of recent CLIP-Seq protocols such as iCLIP and eCLIP. The deduplication removes PCR duplication artifacts. CLIP-Explorer performs another quality control during the post-processing using mainly FastQC [34], and deeptools [40].
The nal and major part of CLIP-Explorer is the analysis step. CLIP-Explorer searches in that step for sequence and coverage motifs, and potential targets of the investigated RBP. Peak calling and motif detection are the fundamental and most critical processes. The quality and amount of detected binding sites can vary signi cantly based on the used tools, whereas the tools depend heavily on the experimental setup. Di erent tools therefore lead in part to di erent results. For that reason, CLIP-Explorer provides several peak calling and motif detection tools to fathom the robustness of the results.

Input and Output of CLIP-Explorer
The user needs to provide their experimental data in FASTA or FASTQ format. Nonstandard adapter sequences can be provided by the user. Otherwise, CLIP-Explorer automatically detects them. The pipeline is designed for multiplexed or demultiplexed paired read data and supports replicates and control experiments. Barcode sequences are required in case of demultiplexing. Other additional les are provided by the Galaxy database. CLIP-Explorer can be easily changed, for example, to allow for single end reads and di erent tool settings.
The user obtains a MultiQC [56] report for the raw, trimming, alignment, and deduplication quality to assess the quality of the raw data and important processing steps of CLIP-Explorer. MultiQC collects the FastQC reports made during the pre-and post-processing of CLIP-Explorer to inspect the mapping quality, elaborating on the amount of unmapped and multiply mapped reads, the length of mapped reads, and other important characteristics of the read library. Further quality control is provided by deeptools, as it can elicit di erences in signal and control experiments. A heatmap and a ngerprint plot assess the correlation between the signal and control libraries providing evidence for a correct execution of the CLIP experiment. CLIP-Explorer also generates coverage les (bigWig and bedGraph) for the alignments and the crosslinking sites to inspect the peak calling quality. Most importantly, CLIP-Explorer will produce a bed and gtf le of signi cantly enriched regions, representing the binding sites of the protein on the transcriptome or genome. A MEME-ChIP [42] report will further analyze the peaks, detecting potential sequence motifs of the protein. A FIMO [42] report then lists reference sequences, which where not covered by the peak calling, but contain the detected sequence motifs. Finally, a RCAS (RNA centric annotation system) [41] report determines the target distribution of the protein over RNA classes and transcript regions. It also includes a GO term analysis and plots to fathom the coverage of the protein binding around splice junctions, along the transcripts and along various other regions. CLIP-Explorer can also generate a list of robust peaks (shared between all input les). This feature is useful for peakcallers that do not support replicated data such as Piranha.

Mapping and Deduplication
We integrated STAR [35] into CLIP-Explorer to map reads against the genome based on the good performance and usability of STAR for RNA-Seq data [57,58,59]. STAR is an annotation and splice aware aligner, which is important for transcriptomic data. Thus, we used extra information about the transcriptome. The data of RBFOX2 was mapped against hg19 to better reproduce the literature results, all other proteins such as DROSHA and HNRNPK were mapped against hg38. STAR was executed with the two pass mode turned on and in the endto-end alignment scheme. CLIP-Explorer further checks for incomplete pairs, ambiguously mapped and low quality reads. CLIP-Explorer also includes a deduplication step to lower the false positive rate for the identi cation of binding regions. PCR duplicates are often collapsed into one representative [2,19]. CLIP-Explorer identi es potential PCR duplicates with the help of UMI-tools [31]. Duplicated reads are identi ed after the alignment step, searching for reads with identical genomic positions (begin and end) and orientation. Yet, sequencing errors can also occur in the UMIs. UMI-tools clusters the reads based on their UMI to handle these sequencing errors. Thus, we merged sequences with a high node count and a small Hamming distance between unique UMIs [31].

Identi cation of Enriched Regions and Sequence Motifs
Searching for enriched regions and motifs is the most challenging task because of the di erences in the gene expression between CLIP experiments and background controls. Hence, high false negative rates are a common result in the detection of di erentially enriched regions (peaks) [19]. CLIP-Explorer allows to choose between three di erent peakcallers, namely PEAKachu, PureCLIP, and Piranha, which we picked based on their performance, availability (i.e., open source and conda package) and support of di erent CLIP-Seq data sets with replicates or controls [1,20,22,60,61,62,63].
PEAKachu was executed in adaptive mode with a mad multiplier of 0.0, a log2 fold change threshold of 2.0, an adjusted pvalue (Benjamini-Hochberg procedure) threshold of 0.05 and a maximum insert size of 200, identi ed beforehand by Picard. All other parameters are set to their default values. We used for Piranha a negative binomial distribution with a bin size of 20 and a 0.05 p-value threshold. We did not include the control data for Piranha to test for possible experiments without control datasets. PureCLIP was trained on chromosome one, two, and three of hg38 (for RBFOX2 we used hg19, respectively), and executed with -bc 0 as the default option. The resulting binding regions of PEAKachu, Piranha, and PureCLIP were then extended by 20 nucleotides because many peakcallers often call peaks that stop before the motif itself. The resulting regions were analyzed with RCAS [41] to determine the target distribution over genomic regions and possible binding patterns of RBFOX2. The peaks were also analyzed with the MEME Suite [42] tool package (MEME-ChIP) to nd sequence motifs in the peaks. We used MEME-ChIP because of its versatility and performance for known motifs [64,65]. Because of a missing ground truth for CLIP-Seq data, a benchmarking for motif nding tools is still missing. MEME-ChIP was set to nd zero or one occurrence of the motif sites per sequence (zoops model).

Intersecting Peaks
We used the intersect module of bedtools [39] to assess the occurrence of the predicted peaks between CLIPper and CLIP-Explorer with PEAKachu, and between the three di erent CLIP-Explorer pipelines with Piranha, PureCLIP, and PEAKachu. We set bedtools intersect with the option -s to search for intersections on the same strand and kept the default value for -f, resulting in a minimum overlap of one base for overlapping regions to be reported. Further, we used the ag -u to consider only unique overlaps.
RBFOX2 was analyzed by CLIP-Explorer with PEAKachu, PureCLIP, and Piranha based on hg19. We took the peaks from the CLIPper pipeline (hg19 peaks ENCFF154DRN) and analyzed them with MEME-ChIP.
All other proteins were analyzed by CLIP-Explorer with PEAKachu on hg38. For the CLIPper pipeline we used the peaks from the IDR (signal normalization) from hg38 and analyzed them with MEME-ChIP. We investigated SLBP more thoroughly and used a more stringent parameter set for PEAKachu (i.e., no default) to nd more robust peaks between replicates, since CLIPper does the same with an IDR calculation. We set the fold change to 4, the MAD to 1.0, the minimum block overlap to 0.5, minimum cluster expression fraction to 0.001, and the minimum block expression to 0.6. This parameter set was optimal for the SLBP data, but it can be sub-optimal for other proteins or CLIP-Seq data. To better compare the two tools, we also ltered for peaks that do not overlap with repeat regions and that have an annotated feature, because CLIPper (i.e., the CLIPper pipeline) implements similar lter steps.
We also analyzed data from an shRNA knockdown experiment against PTBP1 in Hep G2 cells followed by RNA-seq (ENCSR064DXG), including a control shRNA against no target (ENCSR603TCV). Both data samples are from ENCODE and encompass two replicates for both experiments. We took the alignments that were mapped with STAR against the genome (hg38). We calculated the coverage for each gene in hg38 with htseq-count [66]. We then obtained the log2 fold change for each gene from DESeq2 [43], taking both the knockdown and the control experiment into account. We then intersected the identi ed sites from PEAKachu and CLIPper for PTBP1 with the genes of hg38 and plotted the CDF of the log2 fold changes.