Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning

Abstract

Motivation: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given dataset. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome.

Results: We created 7 new histone mark datasets with 12 826 visually determined labels, and analyzed 3 existing transcription factor datasets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms.

Availability and Implementation: Labeled histone mark data: http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/; R package to compute the label error of predicted peaks: https://github.com/tdhock/PeakError

Supplementary information: Supplementary data are available at Bioinformatics online.

• Counts of aligned reads for negative control ChIP-seq samples (also known as background or input samples). These samples are important for determining which peaks are specific to the antibody used in the ChIP-seq experiment. Non-specific peaks appear in both experimental and negative control samples.
• Other genome browser tracks such as alignability/mappability and GC content. These tracks may influence how to interpret the aligned read coverage signal in terms of peaks and background.
• Peak calls can be displayed in order to visualize peak calling errors that need correction via labels.
• Labels can be displayed in order to see which genomic regions have already been labeled.
If your computer screen is too small to display all relevant tracks, then labeling can be used on subsets of the data. For example, if there are 500 ChIP-seq samples to analyze, we would recommend starting by visualizing and labeling a subset of 10-30 samples. If you are interested in differences between samples, make sure that the subset contains examples of several different sample types, so you can observe and label differences between sample types. For example, the figure below shows 27 samples of three cell types (bcell, tcell, monocyte).
Genome browser axis settings. Genome browsers typically have two options for the display of the y-axis scale. If all samples are of the same sequencing depth (for example 10x) and are expected to have about the same amount of coverage, then the y axis can be forced to have the same scale across all samples. For samples sequenced at different depths (for example 5x and 50x), the y-axis should be set to automatically rescale to the maximum of each sample. For example, this autoscale setting is useful for seeing both peaks and background in the figure above (the normalized coverage counts range from 5 to 93, but peaks and background are still visually obvious when the y axis is autoscaled).
Choosing genomic regions to label. There are several methods that can be used to find genomic regions to label: • If you expect peaks in certain genomic regions (e.g. genes, promoters), then you can start by looking in those parts of the genome.
• If you have preliminary peak calls for the samples you want to label, then you can view a genome subset with peaks that have certain properties (e.g. small p-value, large length, large height).
• Otherwise, you can start in a random region of the genome, and then zoom in or out until you locate some peaks that are visually obvious.
Continue browsing through the genome until you are confident that you have labeled an unbiased, representative subset of relevant peaks. Grouping nearby labels into windows. We recommend creating "labeled windows," which are genomic regions containing groups of nearby labels. We create windows to make it easy to visually check the labels after labeling has been performed. A single label by itself is not easy to check visually in the genome. For example, a positive peaks label may not be obvious unless the genome browser is zoomed out enough to see the nearby background noise (and perhaps a negative noPeaks label). Therefore, each labeled window should • Contain at least one region where all samples have a positive label (peaks, peakStart, peakEnd). Thus the maximum of an autoscaled y axis will equal the maximum coverage of a peak.
• Contain at least one region where all samples have a negative label (noPeaks). Thus the contrast between peaks and background will be easy to verify.
• Ideally contain at least one genomic region which has positive labels for some samples, and negative labels for the other samples. These labels further clarify the definition of a significant difference between peaks and background.
Examples of labeled windows that satisfy all three criteria are shown in the figures above and below. Since the labels in each figure constitute a single window, it is easy to verify them by simply plotting the coverage data and labels in the entire window.
Labeling broad domains. Broad ChIP-seq marks such as H3K36me3 are sometimes interpreted in terms of "domains," which may consist of one or more "peaks," perhaps separated by small regions of background. To label these data, there are several possibilities: • If the data appear to be one broad peak covering the entire domain, then one large peakStart and one large peakEnd label can be used to indicate that one large peak should be called in this region (and not several smaller peaks).
• If the data appear to be several smaller peaks separated by small spaces, which visually make up a domain when concatenated, then such peaks are still informative of the domain extent. In this case, one small peakStart label and one small peakEnd label can be used to indicate that a consistent peak model should at least recover the start and end of the domain.
• If there is one small peak (or very few) located somewhere in the domain, then the start and end of the domain may not be obvious. In this case, one can add peakStart and peakEnd labels to indicate that a small peak should be called, or perhaps not add any labels at all (if the interpretation of the data is not clear).
Example of a labels file. Below is an example of a labels file that TDH created by manual visual inspection of the H3K4me3 immune cell samples (tcell, bcell, and monocyte). Genomic regions were copied from the UCSC genome browser web page, and pasted into a text file. It is divided into 4 genomic windows of nearby labels, each separated by two returns/newlines. Each line represents a label for several cell types. The cell types that are not listed all get a noPeaks label in the indicated region. For example, the first line means that monocyte and tcell samples get a peakStart label at the indicated region, and the bcell samples get a noPeaks label in the same region.
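As a hypothetical illustration of this format (the coordinates below are invented for illustration, not TDH's actual labels), a labels file with two windows might look like:

```
chr11:118,092,000-118,095,000 peakStart monocyte tcell
chr11:118,095,300-118,096,600 peakEnd monocyte tcell
chr11:118,120,000-118,129,000 noPeaks

chr1:33,110,000-33,114,000 peaks bcell monocyte tcell
chr1:33,120,000-33,125,000 noPeaks
```

Here the first window gives bcell samples a noPeaks label in the first two regions (since bcell is not listed), and a blank line separates the two windows.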

Supplementary Text 2
Details of peak calling algorithms. In the following paragraphs, we describe the details of the peak calling algorithms we tested.
We used Model-based Analysis of ChIP-Seq for the macs algorithm Zhang et al. [2008]. We downloaded MACS version 2.0.10 12162013 (commit ca806538118a85ec338674627f0ac53ea17877d9 on GitHub). We used the qvalue cutoff parameter -q for λ, with a grid of 59 values from 0.8 to 10^-15. We used the --broad command line option for the macs.broad algorithm. With broad settings there are two separate qvalue cutoff parameters (-q and --broad-cutoff), so we used the same grid of 59 qvalue cutoffs for -q, and then defined broad-cutoff = q × 1.122, since broad-cutoff is required to be larger than -q. The default macs algorithms use a qvalue cutoff of λ = 0.05. We downloaded CCAT version 3.0 from http://cmb.gis.a-star.edu.sg/ChIPSeq/paperCCAT.htm Xu et al. [2010]. The example/config_histone.txt file was used for the ccat.histone algorithm, and the example/config_TF.txt file was used for the ccat.tf algorithm. We set minScore to 0 in both files, then analyzed the significant.region files after running CCAT. Column 7 in these files is the "fold-change score." We defined a grid of fold-change score thresholds λ ∈ {0.1, …, 400}, and we defined the CCAT model with parameter λ as all rows of the significant.region file with a fold-change score greater than λ. The default CCAT algorithm uses a minScore of λ = 5.
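The qvalue grid above can be sketched as follows. This is a minimal illustration assuming the 59 values are evenly spaced on a log scale between 0.8 and 10^-15 (the exact spacing is not stated in the text); all variable names are illustrative.

```python
# Sketch of the macs qvalue cutoff grid: 59 values, log-spaced
# from 0.8 down to 1e-15 (assumed spacing; not stated in the text).
n = 59
hi, lo = 0.8, 1e-15
qvalue_grid = [hi * (lo / hi) ** (i / (n - 1)) for i in range(n)]

# For macs.broad, --broad-cutoff must be larger than -q, so the
# text defines broad-cutoff = q * 1.122 for each q in the grid.
broad_cutoff_grid = [q * 1.122 for q in qvalue_grid]
```

Each (q, broad-cutoff) pair then defines one macs.broad model to evaluate against the labels.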
The HMCan algorithm was proposed to correct for GC-content and copy number differences Ashoor et al. [2013]. We downloaded HMCan commit 9d0a330d0a873a32b9c4fa72c94d00968132b9ef from BitBucket. We used the default GC content normalization file provided by the authors. We used two different parameter files to test two different peak detectors (hmcan and hmcan.broad). We then ran HMCan with finalThreshold=0, and defined λ as a threshold on column 5 of the regions.bed file. Default models use a finalThreshold of λ = 10. We downloaded RSEG version 0.4.8 from http://smithlabresearch.org/software/rseg/ Song and Smith [2011]. Upon the recommendation of the authors, we saved computation time by running rseg-diff with options -training-size 100000 and -i 20. We used the -d option to specify a dead regions file for hg19 based on our alignment pipeline. We defined the significance threshold λ as the sum of posterior scores (column 6 in the output .bed file). For the default RSEG algorithm, we used all the peaks in the output file, i.e. a posterior score threshold of λ = 0.
We downloaded SICER version 1.1 from http://home.gwu.edu/~wpeng/SICER_V1.1.tgz Zang et al. [2009]. We defined the significance threshold λ as the FDR (column 8) in the islands-summary output file. For the default SICER algorithm, we used an FDR of λ = 0.01 as suggested in the README and example.
The HOMER set of tools was proposed for DNA motif detection Heinz et al. [2010]. We used the findPeaks program in HOMER version 4.1 with the -style histone option. We defined the significance threshold λ as the "p-value vs Control" column. We defined the default model as all peaks in the output file.

Supplementary Text 3
Definition of the label error. In this section we give the precise mathematical definition of the label error.
Data definition. Let there be n labeled training samples, all of the same histone mark type. For simplicity, and without loss of generality, let us consider just one chromosome with b base pairs. Let x_1, …, x_n ∈ Z_+^b be the vectors of coverage across that chromosome. For example, b = 249,250,621 is the number of base pairs on chr1, and x_i ∈ Z_+^b is the H3K4me3 coverage profile on chr1 for one sample i ∈ {1, …, n}. We also have exactly four sets of labeled regions per sample: weak labels R_i^+ (peaks) and R_i^- (noPeaks), and strong labels R_i^start (peakStart) and R_i^end (peakEnd). For each sample i ∈ {1, …, n}, each of these is a set of regions, and each region r is an interval of base pairs, e.g. r = (100, 200).
A peak detection function or peak caller is a function c : Z_+^b → {0, 1}^b, which maps a coverage profile to a binary peak call for each base (1 = peak, 0 = background). The goal is to learn a peak caller c such that the peak calls c(x_i) agree with the labeled regions. To quantify the error of the peak calls with respect to the labels, we define the following functions.
Weak label error. The weak labels R_i^+, R_i^- count whether or not there are any peaks overlapping a region. They are called weak because each "peaks" label r ∈ R_i^+ can only produce a false negative (but not a false positive), and each "noPeaks" label r ∈ R_i^- can only produce a false positive (but not a false negative). Fig. 3 shows how the "peaks" label counts false negatives and the "noPeaks" label counts false positives. Let O(c(x_i), r) be the number of bases which have peaks overlapping region r. Then for a sample i, the number of weak false positives is

WFP_i = Σ_{r ∈ R_i^-} I[ O(c(x_i), r) > 0 ],

where I is the indicator function. The weak false positive (WFP) function counts the number of "noPeaks" regions r ∈ R_i^- which have at least one overlapping peak. The number of weak true positives is

WTP_i = Σ_{r ∈ R_i^+} I[ O(c(x_i), r) > 0 ].

The weak true positive (WTP) function counts the number of "peaks" regions r ∈ R_i^+ which have at least one overlapping peak.
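The weak error counts can be sketched in a few lines of code. This is an illustrative implementation assuming peaks and label regions are given as (start, end) half-open base-pair intervals; the function names are ours, not PeakError's API.

```python
def bases_overlapping(peaks, region):
    """Number of bases in `region` covered by any predicted peak.

    `peaks` is a list of (start, end) intervals, `region` one interval.
    """
    r_start, r_end = region
    return sum(max(0, min(p_end, r_end) - max(p_start, r_start))
               for p_start, p_end in peaks)

def weak_errors(peaks, peaks_regions, noPeaks_regions):
    """Return (WTP, WFP) for one sample's predicted peaks.

    WTP counts "peaks" regions with at least one overlapping peak;
    WFP counts "noPeaks" regions with at least one overlapping peak.
    """
    WTP = sum(1 for r in peaks_regions if bases_overlapping(peaks, r) > 0)
    WFP = sum(1 for r in noPeaks_regions if bases_overlapping(peaks, r) > 0)
    return WTP, WFP
```

For example, a single predicted peak (100, 200) gives WTP = 1 for a "peaks" label (150, 250) and WFP = 0 for a "noPeaks" label (300, 400).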
Strong label error. The strong labels R_i^start, R_i^end count the number of peak starts and ends occurring in the given regions. They are called strong because each label can produce either a false positive (more than one peak start/end predicted in the region) or a false negative (no peak start/end predicted). Fig. 3 shows how these "peakStart" and "peakEnd" labels count both false negatives and false positives.
For a sample i, let S(c(x_i), r) be the number of predicted peak starts in region r, and E(c(x_i), r) the number of predicted peak ends. The number of strong false positives is

SFP_i = Σ_{r ∈ R_i^start} I[ S(c(x_i), r) > 1 ] + Σ_{r ∈ R_i^end} I[ E(c(x_i), r) > 1 ].

The strong false positive (SFP) function counts the number of "peakStart" and "peakEnd" regions which contain more than one peak start/end. The number of strong true positives is

STP_i = Σ_{r ∈ R_i^start} I[ S(c(x_i), r) ≥ 1 ] + Σ_{r ∈ R_i^end} I[ E(c(x_i), r) ≥ 1 ].

The strong true positive (STP) function counts the number of "peakStart" and "peakEnd" regions which contain at least one peak start/end.
Total label error. For a sample i, the total number of false positives is FP_i = WFP_i + SFP_i, the total number of true positives is TP_i = WTP_i + STP_i, and the total number of false negatives is FN_i = |R_i^+| + |R_i^start| + |R_i^end| − TP_i. The label error E quantifies the number of incorrect labels:

E = Σ_{i=1}^{n} (FP_i + FN_i).

The label error E can be easily computed using the C code in the R package PeakError on GitHub: https://github.com/tdhock/PeakError.
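The strong and total error counts can be sketched similarly. This is an illustrative implementation, not PeakError's API; peaks are (start, end) intervals and regions are treated as half-open.

```python
def count_in_region(positions, region):
    """Count how many of `positions` fall inside the (start, end) region."""
    r_start, r_end = region
    return sum(1 for p in positions if r_start <= p < r_end)

def strong_errors(peaks, peakStart_regions, peakEnd_regions):
    """Return (STP, SFP) for one sample's predicted peaks.

    A peakStart (peakEnd) region is a true positive if it contains at
    least one predicted peak start (end), and a false positive if it
    contains more than one.
    """
    starts = [s for s, e in peaks]
    ends = [e for s, e in peaks]
    STP = (sum(1 for r in peakStart_regions if count_in_region(starts, r) >= 1)
           + sum(1 for r in peakEnd_regions if count_in_region(ends, r) >= 1))
    SFP = (sum(1 for r in peakStart_regions if count_in_region(starts, r) > 1)
           + sum(1 for r in peakEnd_regions if count_in_region(ends, r) > 1))
    return STP, SFP
```

Combining with the weak counts gives the totals: FP_i = WFP_i + SFP_i, TP_i = WTP_i + STP_i, FN_i is the number of positive labels minus TP_i, and E sums FP_i + FN_i over samples.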

ROC analysis
For a peak detection parameter λ, we define the false positive rate as

FPR(λ) = Σ_{i=1}^{n} FP_i / Σ_{i=1}^{n} (|R_i^-| + |R_i^start| + |R_i^end|),

and the true positive rate as

TPR(λ) = Σ_{i=1}^{n} TP_i / Σ_{i=1}^{n} (|R_i^+| + |R_i^start| + |R_i^end|).

ROC curves are traced by plotting TPR(λ) versus FPR(λ) for all possible values of the peak detection parameter λ.
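Tracing the curve amounts to sweeping λ and normalizing the error counts. The following sketch assumes a caller-supplied `error_at(lam)` returning total false positives, true positives, and the numbers of labels that could possibly produce each (all names here are illustrative):

```python
def roc_curve(lambdas, error_at):
    """Trace (FPR, TPR) points over a grid of detection thresholds.

    `error_at(lam)` must return (FP, TP, possible_FP, possible_TP),
    summed over all labeled samples at threshold `lam`.
    """
    points = []
    for lam in lambdas:
        FP, TP, possible_FP, possible_TP = error_at(lam)
        points.append((FP / possible_FP, TP / possible_TP))
    return points  # one (FPR, TPR) pair per lambda
```

Note that, as observed in the figure caption above, these curves need not be monotonic when λ also changes peak sizes.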

Supplementary Figure 1
Default and learned peak detectors in H3K4me3 data. It is clear that default parameter values yield false negatives and false positives. Parameter training for macs yields a perfect peak detection model.

Supplementary Figure 2
Default and learned peak detectors in H3K36me3 data. It is clear that default parameter values yield false positives. Parameter training for hmcan.broad yields a perfect peak detection model.

Supplementary Figure 3
Test error for all algorithms and all data sets. It is clear that each unsupervised algorithm yields accurate peak detection only for specific experiments. For example, triform is accurate in transcription factor data sets (max, nrsf, srf) but not in histone marks. In contrast, the supervised PeakSeg method yields state-of-the-art accuracy in all data sets (except max).

Supplementary Figure 4
[Figure panels: H3K36me3 AM immune; H3K36me3 TDH immune; H3K36me3 TDH other; H3K4me3 PGP immune; H3K4me3 TDH immune; H3K4me3 TDH other; H3K4me3 XJ immune; max NTNU k562; nrsf NTNU k562; srf NTNU Gm12878.] ROC curves for 4-fold cross-validation test error, for all algorithms and all data sets. Unlike usual ROC curves, these are not monotonic, since some of the parameters (e.g. the macs qvalue parameter) affect peak size as well as the number of peaks. It is clear that default parameters tend to have higher false positive rates than learned parameters. It is also clear that some algorithms are better than others for all possible thresholds.

Models trained on same or different person
Train on one person, test on another. H3K4me3 immune cell data sets were labeled by TDH and PGP. We trained models using labels from one or the other person, and then tested those models on labels from the same or a different person. It is clear that for all models, test error changes little whether training uses labels from one person or the other. Train on some cell types, test on others. TDH labeled H3K4me3 data sets of immune and other cell types. We trained models on one or the other group of cell types, and then tested those models on the same or a different set of cell types. It is clear that for some models, such as macs and macs.broad, a model trained on different cell types yields higher test error than a model trained on the same cell types. It is also clear that the training data make little difference for other models.

Models trained on same or different experiments
Train on one histone mark, test on another. TDH labeled H3K4me3 and H3K36me3 profiles for the same set of immune cell samples. We trained models on one or the other histone mark, and then tested those models on the same or a different histone mark. It is clear that a model trained on a different histone mark yields higher test error than a model trained on the same histone mark.