damidseq_pipeline: an automated pipeline for processing DamID sequencing datasets

Summary: DamID is a powerful technique for identifying regions of the genome bound by a DNA-binding (or DNA-associated) protein. Currently, no method exists for automatically processing next-generation sequencing DamID (DamID-seq) data, and the use of DamID-seq datasets with normalization based on read-counts alone can lead to high background and the loss of bound signal. DamID-seq thus presents novel challenges in terms of normalization and background minimization. We describe here damidseq_pipeline, a software pipeline that performs automatic normalization and background reduction on multiple DamID-seq FASTQ datasets. Availability and implementation: Open-source and freely available from http://owenjm.github.io/damidseq_pipeline. The damidseq_pipeline is implemented in Perl and is compatible with any Unix-based operating system (e.g. Linux, Mac OSX). Contact: o.marshall@gurdon.cam.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.


Targeted DamID
Targeted DamID was performed as previously described (Southall et al., 2013), with the following modifications. Expression of the Targeted DamID proteins was driven using the neural stem cellspecific driver worniu-GAL4 in the presence of tub-GAL80ts and induced for 16 hrs at 29°C in third instar larvae. DNA was obtained from the anterior portions of forty torn larvae per sample.
Following the DamID procedure, samples were sonicated in a Diagenode Bioruptor to reduce the average DNA fragment size to 350bp, and DamID adaptors were removed via overnight Sau3AI digestion. The resulting DNA was purified using a PCR purification kit (Qiagen) and processed for Illumina sequencing using a Truseq-LT (Illumina) kit as per manufacturers instructions. 50bp single-end reads were obtained via a HiSeq 2500 (Illumina). Libraries were multiplexed such as to yield at least 20 million mappable reads per sample. Datasets were visualised using Gbrowse (Stein et al., 2002).

Accession numbers
The RNA pol II DamID-seq raw and processed data have been deposited under NCBI GEO accession number GSE69184.
SRA datasets were converted to FASTQ via the SRA-tools software package (https://github.com/ncbi/sra-tools). Reads were trimmed to remove DamID adaptors, DamID adaptor dimers and low quality (<q30) ends using a custom Perl script (available upon request) before processing with the damidseq_pipeline software.

Software pipeline implementation
The software uses the following approach for processing NGS data in FASTQ format (processing of BAM files instead of FASTQ files starts at step 4).
2. The resulting SAM file reads with a Q-score > 30 are extended to the average length of sequenced fragments (300nt by default). This is achieved through modifying the SAM file, removing the sequence and quality lines (replacing with '*' in both instances) and setting the cigar string to the requisite length.
3. The modified SAM file is converted to a BAM file via the SAMtools software suite (Li et al., 2009).
4. SAMtools is used to read the BAM file. Total readcounts are saved, chromosome names and sizes are determined from the header information and a hash of the genomic coverage is calculated for a user-specified bin size. For the D. melanogaster genome, we recommend using 75nt bins (the default setting) as a compromise between accuracy, coverage and file size; for the Mus musculus genome we recommend 500nt bins.
5. The binned coverage is reduced to GATC fragment resolution using a pre-built file containing the location of all GATC sites. Several pre-built files are provided online; files for other genomes can be built with a companion Perl script that is also provided.
6. Binned counts are normalised and pseudocounts are added as described.
7. Log 2 (Dam-fusion/Dam) ratio files in either GFF or bedGraph format are generated for all samples excluding the Dam control, both at the GATC fragment resolution and (optionally) at the full resolution of the bins used to generate read counts.

Data analysis
Gene calls from RNA pol II DamID-seq data were performed using an R implementation of the gene-calling algorithm described in (Southall et al., 2013) with the following modifications (R script available upon request). Briefly, for a defined set of log 2 thresholds in the range [0.1 ... 2], we take a random sample of the dataset over 50000 iterations, and determine the number of times the average occupancy is greater than the threshold for varying numbers of GATC fragments. Dividing the number of times the count exceeds the threshold by the number of iterations gives the FDR for this threshold/fragment number. Given that the relationship between GATC fragment number and log(FDR) for any threshold is linear, we determine the linear regression for each threshold. Since the relationship between the slope of each linear regression and the number of GATC fragments is also linear and the relationship between the number of GATC fragments and the intercept is quasi-linear, we obtain linear regressions for both these values. Knowing these final two regressions, it is possible to predict the FDR for a gene spanning any number of GATC fragments, with any average occupancy. Peak calls were performed via the algorithm described in (Southall et al., 2014) (Perl script available upon request).
Mean correlation between Dam-only and Dam-fusion protein binding was based on four DamID-seq datasets: RNA pol II in larval neuroblasts (this study); Dichaete in whole embryos (Carl and Russell, 2015); Dsx-1 in male fat body (Clough et al., 2014); and Brm in larval neural stem cells (Marshall and Brand, unpublished).
All other analyses were performed using R (R Development Core Team, 2011). Table S1. Effect of read depth on DamID-seq data. Random samples of the neuroblast RNA pol II DamID-seq dataset were compared for the signal:noise ratio, total peak coverage and the number of expressed genes called. Pseudocounts were added using the default value of 10 for c.   Figure S1. Effect of differing values of pseudocounts (as governed through differing values of c) on sample signal:noise ratio (SNR). SNR was calculated as (mean binding over peaks)/(SD of unbound regions). The red diamond represents the value obtained from previously published binding tracks (Carl and Russell, 2015).  Figure S2. Effect of differing values of pseudocounts on total peak coverage (FDR < 0.01). The red diamond represents the value obtained from previously published binding tracks (Carl and Russell, 2015).  Figure S3. Effect of differing values of pseudocounts on the total number of detectable peaks with an FDR < 0.01. The red diamond represents the value obtained from previously published binding tracks (Carl and Russell, 2015).

Dichaete binding in D. melanogaster embyos
Value of c Number of bound genes (within 1kb of a peak) Figure S4. Effect of differing values of pseudocounts on the number of genes lying within 1kb of a peak (FDR < 0.01). The red diamond represents the value obtained from previously published binding tracks (Carl and Russell, 2015). Figure S5. Processing of previously published DamID-seq data for Dicheate binding in Drosophila melanogaster embryos using the damidseq_pipeline software. (A) Published output from Carl and Russell (2015). (B) Re-processing the sequencing data using damidseq_pipeline removes the negative bias at unbound regions and generates binding profiles with reduced noise.