PeaKDEck: a kernel density estimator-based peak calling program for DNaseI-seq data

Summary: Hypersensitivity to DNaseI digestion is a hallmark of open chromatin, and DNaseI-seq allows the genome-wide identification of regions of open chromatin. Interpreting these data is challenging, largely because of inherent variation in signal-to-noise ratio between datasets. We have developed PeaKDEck, a peak calling program that distinguishes signal from noise by randomly sampling read densities and using kernel density estimation to generate a dataset-specific probability distribution of random background signal. PeaKDEck uses this probability distribution to select an appropriate read density threshold for peak calling in each dataset. We benchmark PeaKDEck using published ENCODE DNaseI-seq data and other peak calling programs, and demonstrate superior performance in low signal-to-noise ratio datasets. Availability and implementation: PeaKDEck is written in standard Perl and runs on any platform with Perl installed. PeaKDEck is also available as a standalone application written in Perl/Tk, which does not require Perl to be installed. Files, including a user guide, can be downloaded at: www.ccmp.ox.ac.uk/peakdeck. Contact: chris.ocallaghan@ndm.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

Section S1: The DNaseI-seq method Supplementary Figure S1. The DNaseI-seq method. In regions of DNaseI-insensitive chromatin, genomic DNA interacts tightly with the histone proteins in the nucleosomes (green ellipses). The histone proteins protect the DNA from digestion by the enzyme DNaseI. By contrast, in regions of open chromatin, such as those at promoter or enhancer sites, nucleosomes are absent and the genomic DNA is accessible to transcription factors, which bind more loosely than nucleosomes. In open chromatin, the DNA is also accessible to the DNaseI enzyme and so is 'DNaseI hypersensitive'. Where DNaseI cuts twice within a single accessible site, a short fragment of regulatory DNA is released. Adapters are ligated to the released fragments and small fragments are amplified and isolated by gel purification. These fragments are sequenced using high throughput methods. DHS -DNaseI hypersensitive site; TFtranscription factor; HTS -high throughput sequencing

Section S2: Difficulties quantifying DNaseI digestion
Although there are several methods for assessing DNaseI digestion (outlined below), there is no universal genome-wide objective measurement that can accurately quantify this digestion, or the signal-to-noise ratio for the digestion before the digested DNA is sequenced. The central problem is that, when differential digestion is detected between samples, it is difficult to distinguish differences in true hypersensitivity from differences in the degree of experimental digestion that is achieved with each sample.
There are two principle methods described for assessing the extent of DNaseI-digestion. Digestion can be qualitatively assessed by pulsed-field gel electrophoresis (Song and Crawford, 2010) (Supplementary Figure S2). While this can be useful, it does not allow accurate quantification of digestion. An alternative approach is DNaseI-Southern blotting which can be combined with qPCR to estimate the degree of digestion at a defined genomic location (McArthur et al., 2001). This approach can be used to estimate DNaseIhypersensitivity for extended genomic regions (referred to as 'chromatin profiling') (Dorschner et al., 2004;Taylor et al., 2008). Sabo et al suggest that optimal digestion for DNaseI hypersensitive site detection is 50-70% copy number loss at hypersensitive sites with respect to insensitive sites (Sabo et al., 2006). We have found this approach helpful for approximate comparison of digestion within experiments (Supplementary Figure S3).
However, a key problem with all these approaches is that DNaseI-hypersensitivity is a dynamic characteristic, and is subject to cell type-specific, experimental and environmental influences (Hernando et al., 2013;Sheffield et al., 2013). Therefore, when differential 'digestion' is detected between samples, it is difficult to distinguish differences in hypersensitivity from differences in digestion. This is particularly relevant when attempting to equalize digestion in different treatment conditions, or between different cell types, and in different experiments, where it is plausible that hypersensitivity itself has changed. The difficulty in quantifying digestion in a standard way is illustrated by the fact that signal-tonoise ratios vary widely between samples in NCBI-deposited DNaseI-seq data (Supplementary Figure S4, Supplementary Table S1; see below for method of signal-to-noise ratio calculation). The use of specific sites as calibration controls is not reliable given these observations.
Phenol-chloroform extracted genomic DNA has been used to generate control data (Sabo et al., 2006;Hesselberth et al., 2009). However, this approach requires the samples to be prepared quite differently, with digestion occurring after DNA extraction, rather than before. While this may have some utility, such a control is unlikely to have an identical background noise pattern to the input sample, and is therefore unlikely to allow for correction of differences in signal-to-noise ratio in the resulting dataset. This is in contrast to similar high throughput methods such as ChIP-seq, where isotype antibodies can be employed in a parallel experiment to obtain highly meaningful control data. Figure S2. Qualitative assessment of digestion by pulsed-field gel electrophoresis. Digestion of genomic DNA can be assessed qualitatively by pulsed field gel electrophoresis. The above gel shows genomic DNA extracted from MCF7 cells, either undigested (2 lanes) or digested with 1, 2, 4, 8, 16, 32 or 64 units of DNaseI prior to extraction. With increasing digestion, the fragment sizes of the genomic DNA smear decreases gradually. However precise quantification of digestion is not possible.

Supplementary Figure S3. DNaseI activity influences signal intensity.
Isolated nuclei from a single population of cells were aliquoted, and digested with different amounts of DNaseI. The DNA was isolated by phenol-chloroform extraction. Primers were made for a known DNaseI hypersensitive site 5' to the c-myc gene (MYC), and for two known DNaseI insensitive sites near the rhodopsin (RHO) and hemoglobin beta (HBB) genes. SYBR green-based qPCR was used to assess relative digestion at the c-myc site compared to digestion at either the insensitive rhodopsin site (blue) or the insensitive betaglobin site (red). This demonstrated increasing digestion of the DNaseI hypersensitive MYC site with higher DNaseI concentrations. Thus signal intensity is influenced by DNaseI activity. No change was observed when we compared the two insensitive sites (green). Because hypersensitivity at any individual target site (e.g. at the c-myc promoter site) is not a universally stable property, and can change in different cell types, or with different environmental conditions, it is difficult to synchronize digestion levels in different DNaseI-seq libraries prior to sequencing. All samples were digested at 37 o C, with the exception of those labeled 0[ice] which were kept on ice throughout (DNaseI is inactive at 0 ˚C).

Supplementary Figure S4. Signal-to-noise ratio varies in DNaseI-seq datasets.
To quantify signal-to-noise ratio in 10 DNaseI-seq datasets downloaded from the NCBI Short Reads Archive, we used DNaseI-seq data from ENCODE covering 125 cell types to compile a list of open chromatin elements in the genome, and generate a map of the number of cell types with open chromatin at each site (Supplementary Figure S5). We measured the mean 'signal' intensity, by randomly selecting 50,000 sites that were designated as open chromatin in 60 or more (out of 125) different cell types, and measured the corrected read density at the center of each of these sites. We similarly measured 'noise' by selecting 50,000 loci which were not designated as DNaseI hypersensitive in any ENCODE cell lines. The values depicted are the ratios between mean 'signal' and 'noise' for each of the 10 NCBI sample datasets.

Section S3: Sample datasets
We downloaded datasets from the limited collection of DNaseI-seq datasets available at the time from the NCBI SRA archive (see Supplementary Table S1). Datasets were chosen sequentially, omitting obvious experimental replicates, and, where possible, duplicate cell types to provide sample heterogeneity. Having observed wide variation in signal-to-noise ratio in our preliminary analysis, we wanted to ensure that the test datasets displayed this range of variation. To avoid biases that might arise from choosing biologically identical datasets, and to maintain cell-type heterogeneity, we excluded datasets from duplicate cell lines within the same experimental set. The dataset SRR1013752 is a cell line dataset and was included to widen the heterogeneity of the samples. The full set of datasets, cell types and signal-to-noise ratios are described in Supplementary Table S1. The distribution of signal-to-noise ratio is plotted in Supplementary Figure S9, with included datasets in blue, and excluded datasets in red. For comparison, the ENCODE DNaseI-seq datasets used in our analysis are those described by Thurman et al (Thurman et al., 2012).

Section S4: Estimation of signal-to-noise ratio in DNaseI-seq data
While PeaKDEck does not depend on signal-to-noise ratio calculation for peak calling, the estimation of signal-to-noise ratio in DNaseI-seq datasets has some analytical utility. The problem with the calculation of a precise signal-to-noise ratio for DNaseI-seq data is that it would require measurements of signal intensity at sites of true signal (DNaseI hypersensitivity) and at sites of true noise. Unfortunately, no method exists for the precise identification of such sites and so true signal and true noise are not directly measurable.
For this reason, to assess signal-to-noise ratio in a dataset, we instead determine the genomic locations of sites where a body of consistent experimental data indicates that there are high or low probabilities of true signal (DNaseI hypersensitivity) being present. The ratio between the signal at sites of high probability and the sites of low probability is then used to estimate the signal-to-noise ratio.

Estimation of signal.
Our approach to the identification of sites with a high probability of true signal is based on analysis of 125 ENCODE DNaseI-seq datasets from 125 different cell lines. We divided the genome into discrete sites, and annotated each site with a count of the number of the ENCODE datasets which had DNaseI hypersensitivity designated within that site (see Supplementary Figure S5). From this annotation, we created 12 lists of genomic sites. The first list contained all sites that were designated as DNaseI hypersensitive in 10 or more cell lines, the next list contained all the sites that were designated as DNaseI hypersensitive in 20 or more cell lines and so on in increments of 10 up to a list of sites that were designated as hypersensitive in 120 or more cell lines. 372,201 sites were designated as hypersensitive in 10 more cell lines, but the number of shared sites fell such that there were only 70,646 hypersensitive sites shared by 60 or more cell lines and only 9039 hypersensitive sites shared by 120 or more cell lines (see Supplementary Table 2).
For each of the 125 individual ENCODE datasets, we then determined the percentage of the sites in each of the 12 groups that were designated as hypersensitive in that individual dataset. So, for example, in dataset #1, 19.5% (72,790 of 372,201) of the sites that were designated as hypersensitive in 10 or more cell lines were hypersensitive in this dataset, 28.8% (63,053 of 218,808) of the sites that were designated as hypersensitive in 20 or more cell lines were hypersensitive in this cell line and so on until 61.4% (5546 of 9039) of the sites that were designated as hypersensitive in 120 or more cell lines were hypersensitive in this cell line (Supplementary Table S2). Having done this for each of the 125 ENCODE datasets we then calculated the mean percentages ± SD across all 125 datasets for the number of sites that were hypersensitive in 10 or more, 20 or more cell lines and so on for each of the 12 lists (Supplementary Figure S6A). The results of this analysis are plotted in Supplementary Figure S6B, where the total number of sites in each of the 12 groups are represented by the full height of the bar (red plus blue) and the mean percentages are shown in red. For example, of the 70,647 sites that are designated as hypersensitive in 60 or more datasets, a mean of 44.0 ± 9.4% of these sites are designated as hypersensitive across the 125 datasets.

Supplementary Figure S5. Identification of known hypersensitive sites from published ENCODE DNaseI-seq datasets.
To define existing 'known' hypersensitive and insensitive sites, we used published ENCODE DNaseI-seq data from 125 cell types (Thurman et al., 2012). The hypersensitive sites from all 125 cell types were amalgamated into one single global track of hypersensitivity. The process of amalgamation is illustrated above. In this example, the hypersensitive sites for 5 cell types at a single locus on chromosome 6 are depicted. Areas of the genome that do not contain any hypersensitive sites are defined, and tagged with a score of zero, indicating no known hypersensitive chromatin at that site. Sites that do contain ENCODE-designated hypersensitivity are assigned a score corresponding to the number of cell types with hypersensitivity in that interval. This data was made into a single track in bed format, as illustrated in the 'all data sets' track above. From this analysis, we chose to estimate 'signal' across the sites that are designated as hypersensitive in 60 or more cell lines. This was because above the 60 level, there was little gain in the percentage of sites that were hypersensitive in each dataset (Supplementary Figure S6A), but the number of sites available for analysis diminished substantially (Supplementary Figure S6B). Ideally, 100% of the sites used to estimate signal in any given dataset would be truly DNaseI hypersensitive. While it is not possible to reach 100% certainty in this respect, in choosing the 70,647 sites that are designated as hypersensitive in 60 or more ENCODE datasets, we know that 44.0 ± 9.4% of these sites are likely to be DNaseI hypersensitive in any given cell type, based on the ENCODE data.

Estimation of noise.
To estimate the noise in each dataset, we selected sites with a low probability of containing truly hypersensitive DNA. These 'noise' sites were identified in an analogous manner from the ENCODE data as genomic sites that contain no hypersensitivity signal in any ENCODE cell line.
Calculation of estimated signal-to-noise ratio. The list of 'signal' sites, and the list of 'noise' sites were then used to estimate the signal-to-noise ratio by randomly selecting 50,000 sites from the list of 70,647 signal sites, and 50,000 sites from the list of 'noise' sites. A bin of 300bp was created centered on each of these sites, and any overlapping bins were discarded. The number of reads per base within each bin was calculated for the 'signal' sites, and also for the 'noise, and the mean score across all ~50,000 non-overlapping 'signal' sites was divided by the mean score across all ~50,000 non-overlapping 'noise' sites to arrive at an estimation of the signal-to-noise ratio for that dataset.
If our estimation of signal-to-noise ratio was dominated by the percentage of designated hypersensitive sites in that cell line, then we would expect a linear relationship between the estimated signal-to-noise ratio, and the percentage of the 70,647 sites (designated as hypersensitive in 60 or more ENCODE cell lines) that are designated as hypersensitive in that cell line. This relationship was analyzed for the 10 test datasets. We found no significant correlation between these values ( Figure S6C) which is consistent with our signal-to-noise ratio estimate being driven not by the percentage of designated hypersensitive sites, but by the true signal-to-noise ratio for a dataset.
The choice of 50,000 sites was based on an analysis of our sample datasets, where we found that variability in our estimate of mean signal and noise scores for any given dataset was minimal (coefficient of variation < 1%) when 50,000 or more sites were used for the calculation at both low and high signal-to-noise ratios (Supplementary Figure S7).
To determine whether the estimation of signal-to-noise ratio varies substantially according to which of the original 12 lists of sites were used to estimate 'signal' strength, we repeated the calculation of signal-to-noise ratio as described above for each list of sites. Whether we calculated the signal strength at sites selected from the first list (present in 10 or more cell types) or from the final list (present in 120 or more cell types), the ranking of signal-to-noise ratios from our 10 test dataset changed minimally (Supplementary Figure S8).
Further support for the utility of our approach is obtained by analyzing signal dispersion in the same datasets from the number of reads (or tags) per base with coverage. In high signalto-noise data sets, the reads will aggregate closely together, and therefore have a higher reads per base score than low signal-to-noise ratio datasets, where reads are more dispersed and spread over a larger number of genomic bases. Plotting these two independent genome-wide measurements for each of our 10 sample datasets (blue), plus 6 additional datasets (red) demonstrated a linear correlation (Supplementary Figure S9), again suggesting that our estimate of signal-to-noise ratio is reasonable.
In summary, precise determination of the signal-to-noise ratio in DNaseI-seq data is not feasible for the reasons outlined above. However, it is possible to estimate signal-to-noise ratio by evaluating signal strength at sites with a high or low probability of being hypersensitive (based on the available experimental data).
Supplementary Figure S6. Estimation of signal-to-noise ratio.
To identify a suitable list of genomic sites at which to estimate signal for a given dataset for signal-to-noise ratio calculation, we generated 12 lists of genomic sites from the ENCODE-125 dataset, beginning with a list of sites common to 10 or more ENCODE datasets, in increments of 10, up to a list of sites common to 120 or more datasets.
(A) To determine the percentage of sites in each of the 12 lists that were designated as hypersensitive sites, we calculated these percentages for each ENCODE dataset (see Supplementary Information text). Of the 70,647 sites common to 60 or more ENCODE cell types, a mean of 44.0% ± 9.4% of these were hypersensitive across the 125 ENCODE datasets.
(B) The mean number of sites (red bar) ± SD shown in (A) is plotted with the total number of available sites in that category represented by the whole bar (blue plus red). The mean ± SD for each list is itemized in red text.
(C) The estimated signal-to-noise ratio for each sample dataset is plotted against the percentage of the 70,647 sites common to 60 or more ENCODE datasets that contained a designated hypersensitive site in that sample dataset. No significant correlation was detected.   (A and B) The mean read density scores were calculated for 300bp segments at either randomly selected sites through the genome (blue), sites with no hypersensitivity in the ENCODE dataset (red), or at sites that were hypersensitive in 60 or more ENCODE datasets (green). In each case, the random selection, and measurement of mean read density scores was repeated five times, and the mean ± 95% confidence interval was calculated for both low (A) and high (B) signal-to-noise ratio datasets. (C and D) When 50,000 or more sites were randomly selected, the coefficient of variation for each of the measured read density scores dropped below 1%, demonstrating that 50,000 sites is a reasonable choice for reproducible SNR estimation. To test whether measurement of 'signal' at sites common to 60 or more ENCODE datasets had a large impact on our estimation of the signal-to-noise ratio, we plotted signal-to-noise ratio estimates for each of our 10 sample datasets where signal was measured at hypersensitive sites shared by 10 or more ENCODE cell lines, 20 or more cell lines, and so on in increments of 10 up to sites shared by 120 or more ENCODE cell lines. While the magnitude of the signal-to-noise ratio estimate changed, the ranking of signal-to-noise ratio estimates for our 10 sample datasets changed very little, suggesting that the method of estimation of signal-to-noise ratio remains useful, independently of the threshold chosen for defining 'signal' sites To test whether our estimate of signal-to-noise ratio was a valid indicator of signal dispersion, we plotted the signal-to-noise ratio estimate for each sample dataset against the number of reads per base covered for the entire dataset. A strong linear correlation was observed, suggesting that our signal-to-noise ratio estimate is a valid measurement of signal dispersion. Datasets included in the analysis in the main text are depicted in blue, and datasets that were excluded due to duplication of cell type or signal-to-noise ratio (to maintain sample heterogeneity) are shown in red. The critical advantage of using the kernel density estimate (Supplementary Figure S10) over the empirical probability estimate is that it provides a mechanism for smoothing the probability estimate. This smoothed estimate can then be used for rational interpolation of density estimates for points at which data is absent. To demonstrate this, the actual empirical probabilities, and the corresponding kernel estimated probabilities were calculated for low and high signal-to-noise ratio datasets using default PeaKDEck settings (Supplementary Figure S11). A key strength of PeaKDEck is its ability to identify 'safe' thresholds in noisy data, and calling the probabilities in this range is critical. It is clear from Supplementary Figure S11 that threshold estimation would be erratic without smoothing at these probability values. In rare cases, the absence of empirical data for a given read count value may even mean that calculation of an empirical probability would not be possible. The detrimental influence of such (expected) fluctuations in the empirical data is limited by the use of kernel density estimation.

Supplementary Figure S10. Kernel density estimation.
Kernel density estimation is a method for empirically calculating a probability distribution from non-parametric data. PeaKDEck uses Gaussian kernel density estimation to generate a probability distribution based on ~50,000 randomly selected corrected read density measurements from a given dataset. This allows estimation of the probability that any measured corrected read density in the dataset belongs to the random background signal, or is greater than random background signal. By default, PeaKDEck sets the corrected read density threshold for peak calling to a value where the probability that such a density belongs to the random background signal distribution is < 0.001.

Supplementary Figure S11. Kernel density estimation allows smoothing and interpolation of probability estimates.
To illustrate the distinction between the empirical probability distribution of background read density scores (red) and the calculated kernel density estimated probabilities (green), both these parameters are plotted for a low (A) and high (B) signal-to-noise ratio dataset at the lower end of the read density score probability spectrum. As is evident, a key advantage of the kernel estimated probabilities is smoothing of probability estimates in the lower ranges, and the ability to interpolate results where empirical data is limited.

A B
Section S6: Peak-calling The approach to peak-calling with PeaKDEck is graphically depicted in Supplementary Figure S12. A quantitative description of PeaKDEck identified peaks is shown in Supplementary Figure S13, and a qualitative description of this data is given in Supplementary Figure S14. Summary results for analysis of our 10 sample datasets are using PeaKDEck are shown in Supplementary Table S3.
The process of peak calling by PeaKDEck can be divided into four steps.
A. First, 50,000 random genomic sites are generated, and overlapping sites are discarded. B. The read density (RD) at each of these non-overlapping sites (default 300bp) is calculated (central RD). This value is corrected by subtracting the expected background RD calculated from the local background RD (default 3000bp), to give the corrected RD. For the sample calculations illustrated, the background bin is 10 times the size of the central bin. C. The ~50,000 random corrected RDs are used to calculate a probability distribution of randomly selected corrected RDs, using kernel density estimation (Supplementary Figure 10). The threshold corrected RD for peak calling is calculated from this distribution as the RD for which the probability (P) drops below a cut-off (by default, this value is set at p < 0.001). D. Finally, with this density threshold, the data set is scanned systematically using overlapping bins, and at each position, the corrected RD is calculated as in panel B. A peak is called when this corrected RD is higher than the threshold value.

Code
Name Signal (mean) Section S7: Analysis of dataset-specific unique sites.
As discussed in the section on 'Estimation of signal-to-noise ratio in DNaseI-seq data', there is no 'gold-standard' approach to the experimental identification (and quantification) of DNaseI hypersensitivity signal on a genome-wide basis, which is independent of DNaseI-seq itself. Because an objectively 'correct' genome-wide quantification of DNaseI hypersensitivity cannot be determined, the validation of peak calling strategies is challenging We approached the evaluation of peak calling strategy by measuring the percentage of uniquely occurring sites (i.e. sites that do not contain DNaseI hypersensitivity in any of the 125 ENCODE datasets) in each dataset. From our analysis of the ENCODE data, a mean of 0.87 ± 0.73% of the genomic bases in each dataset are designated as hypersensitive. Because the a majority of the genome is not hypersensitive, poorer peak calling strategies are more likely to call peaks randomly in areas of 'noise' as opposed to areas of hypersensitivity signal. Therefore, the percentage of unique peaks in a given dataset is likely to increase as the accuracy of the strategy decreases.
To understand what proportion of sites per dataset is likely to be unique to that dataset, we calculated the mean percentage of unique sites in each of the ENCODE 125 datasets. We found that 3.14 ± 4.13% of ENCODE-designated DNaseI hypersensitive sites per dataset were unique to that dataset. This figure represents a description of the spread of percentages occurring in the large set of ENCODE data which is currently the most standardized and complete data set from which to make this calculation.
When we called peaks in our 10 sample datasets with PeaKDEck, we found that the percentage of unique peaks per dataset varied within the range observed in the larger ENCODE dataset. As discussed above, if our peak calling strategy was poor, we would expect a higher percentage of peaks called by PeaKDEck to be unique to each dataset. Thus our analysis indicates the validity of the PeaKDEck approach.
Threshold calculation and peak calling run times for our 10 sample datasets are shown in Supplementary Table S4.

Section S8: Strengths of PeaKDEck
PeaKDEck has several distinguishing features compared to other peak calling applications.
The key features include: • Use of global background signal frequency distribution for peak threshold calculation Extensive sampling of background signal leads to improved ability to distinguish signal from noise at the whole genome and whole dataset level. This facilitates meaningful comparison of different DNaseI-seq datasets.
• Improved peak calling performance at low signal-to-noise ratios Kernel density smoothing of sampled background signal allows accurate threshold setting in noisy, low signal-to-noise ratio data. For example, assuming the range of variation in the percentage of unique peaks per dataset in the ENCODE data to be a reasonable estimation of the true variation in the number of unique hypersensitive sites in any given cell type, we calculated the z-scores and probabilities for the percentage of unique peaks in the DNase001 dataset (low signal-to-noise ratio) using each peak caller. For PeaKDEck, 6.95% of peaks were unique, with a 20.9% probability that this value 'belongs' to the normal distribution of values in the ENCODE data set. For MACS, with 7.38% unique peaks, the probability is 18.1%, for HOMER with 9.64%, the probability is 7.2%, and for FSEQ, with 31.40% unique peaks, the probability approaches zero.
• Implementation o A full range of tools for analysis of mapped DNaseI-seq data Both command line and GUI implementations of PeaKDEck include tools for sorting, filtering and random selection of mapped DNaseI-seq data. In addition, there are tools for generating a density track of the mapped data, and sorting called peaks. To achieve this level of analysis with other peak callers would require several additional software applications.
o A multi-platform GUI-based implementation The GUI application allows usage of PeaKDEck tools without the need for command line access.
o High accessibility The wide variety of command line and GUI multi-platform implementations, combined with its minimal dependencies makes PeaKDEck a widely accessible tool to users of widely varying computational abilities. PeaKDEck does not require any libraries or packages to be installed other than standard Perl for the command line version.

Section S9: Examples of use
The tools available in PeaKDEck are described in Supplementary S5. Examples of use are as follows: Filters out reads with a mapq score less than (q) and mismatch score greater than (-uq Analyzes an ordered mapped reads file in sam format. Assigns a unitless probability density score to each base in the genome, representing its probability of being a hypersensitive site. Only chromosomes given in the chromosome size file are scored. Identifies peaks in mapped target *.sam files. The number of sites sampled to calculate the background signal probability distribution is set using -npBack, and the cutoff threshold for peak calling is calculated at a probability selected by -sig. by default, peaks are scored with the maximum background corrected bin density. These scores can be converted to pvalues based on the background signal probability distribution by setting -PVAL to 'ON'.

Random Read Selector (-R) -nr [number of reads]
Randomly selects a target (-nr) number of reads from a given sam file.

Top Peak Selector (-T)
-n [number of peaks] Selects the top (-n) scoring peaks in given bed peak file. Defaults to 'ALL' peaks if -n is not set.

Numerical Sam Sort (-NS)
Sorts all reads in given sam file into ascending numerical order by base position.