QuasR: quantification and annotation of short reads in R

Summary: QuasR is a package for the integrated analysis of high-throughput sequencing data in R, covering all steps from read preprocessing, alignment and quality control to quantification. QuasR supports different experiment types (including RNA-seq, ChIP-seq and Bis-seq) and analysis variants (e.g. paired-end, stranded, spliced and allele-specific), and is integrated in Bioconductor so that its output can be directly processed for statistical analysis and visualization. Availability and implementation: QuasR is implemented in R and C/C++. Source code and binaries for major platforms (Linux, OS X and MS Windows) are available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/QuasR.html). The package includes a ‘vignette’ with step-by-step examples for typical work ﬂows. Contact: michael.stadler@fmi.ch Supplementary information: Supplementary data are available at Bioinformatics online.


Recipe 3: Removing of adaptor sequences from raw reads
In this recipe, preprocessReads is used to process the raw sequence files in infiles and store the processed sequences in outfiles. In the example here the processing consists of truncating the last three bases, removing poly-adenine from the start and filtering out all reads that are shorter than 14 bases (after truncation and trimming) or contain two or more undetermined nucleotides (nBases). As illustrated below, input and output files can be compressed. preprocessReads returns a matrix with the number of processed sequences for each file.
# create alignments sampleFile <-"extdata/samples_rna_single.txt" genomeFile <-"extdata/hg19sub.fa" proj <-qAlign ( For the definition of gene models, a GRanges object is constructed containing exon ranges named by gene. Multiple ranges (exons) with the same name will be automatically combined by qCount in a non-redundant manner, assuring that a single read is not counted multiple times. As an alternative, a TranscriptDb object can be used to define gene models for qCount (see QuasR vignette for examples).
# read in gene annotation library(rtracklayer) gr <-import("extdata/hg19sub_annotation.gtf") gr <-gr[gr$type=="exon"] names(gr) <-gr$gene_name The third step is the call to qCount that counts for each gene the alignments overlapping with any of its exons. The resulting count table also contains the cumulative length for each feature or combination of features (here: the cumulative, non-redundant number of exonic bases in any transcript of that gene), making it simple to scale the counts by length and sequencing depth and to calculate for example RPKM (or FPKM) values or similar. Please note the RPKM values in our example are higher than what you would usually get for a real RNA-seq dataset. The values here are artificially scaled up because our example data contains reads only for a small number of genes.
The count table returned by qCount is also the basis for downstream statistical analysis using packages such as edgeR ( ). The code below shows how it can be converted into a edgeR::DEGList or a DESeq2::DESeqDataSet object, although in this case no replicates are available and thus the example data is not well suited for a statistical analysis.

Recipe 5 (ChIP-seq): Export wig files and create a metagene profile plot
In this recipe, reads from a ChIP-seq experiment are aligned, and the alignment density along the genome is exported in wig format for visualization in public genome browsers.

Recipe 6 (Bis-seq): Calculate methylation states
The methylation of cytosines can be analyzed with QuasR using bisulphite sequencing (Bis-seq) data. QuasR supports both directed and undirected Bis-seq libraries, as specified using the bisulfite parameter. If this parameter is set, qAlign will bisulphite-convert both reads and genome and align the reads with a three-letter alphabet. The in silico-converted cytosines are then put back into the resulting alignments, and methylation states are quantified using qMeth. The resulting output can for example be used to identify partially methylated domains (PMDs) and regions of low methylation (LMRs and UMRs) using the MethylSeekR package (Burger et al., 2013).

Recipe 7: Allele-specific analysis
All experiment types supported by QuasR (ChIP-seq, RNA-seq and Bis-seq; only alignments to the genome, but not to auxiliaries) can also be analyzed in an allele-specific manner. For this, a file containing genomic location and the two alleles of known sequence polymorphisms has to be provided to qAlign. The recipie uses the file available from system.file(package="QuasR", "extdata", "hg19sub_snp.txt"). To avoid an alignment bias, all reads are aligned separately to each of the two new genomes, which QuasR generates by injecting the polymorphisms listed in snpFile into the reference genome, only retaining the best alignment for each read. While combining alignments from the two genomes into the final BAM file, each read is classified into one of three groups: • R: the read aligned better to the reference genome • U: the read aligned equally well to both genomes (unknown origin) • A: the read aligned better to the alternative genome Using this alignment classification, the qCount, qProfile and qMeth functions will produce three counts instead of a single count for each feature. The column names are suffixed by _R, _U and _A.

Hardware and data used in performance measurements
The QuasR running times reported below were obtained on an IBM ® x3850 X5 computer with Intel ® Xeon ® X7542 CPUs with a clock rate of 2.67GHz. Reported times correspond to elapsed time as measured by system.time (the minimum over three independent trials) and were obtained on three samples selected from an RNA-seq dataset that is publicly available from GEO (http://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE33252) under the series accession GSE33252 (Tippmann et al., 2012)

Typical qAlign running time
We first measured the time to generate alignments with qAlign against the mouse genome (BSgenome package pre-indexed with QuasR) using default parameters and different numbers of CPU cores (n): cl <-makeCluster(n) proj <-qAlign("samples.txt", "BSgenome.Mmusculus.UCSC.mm10", clObj=cl) The following figure shows the observed execution times as a function of the number of CPU cores (left panel, logarithmic axes) and the resulting speedup (ratio of serial over parallel time, right panel). The ∼ 100 Mio. reads are aligned in about 5 hours on a single CPU, and that time linearly decreases up to about four cores. Starting with eight cores, the speedup is sub-linear due to sequential parts in the alignment generation process with contribute a constant amount of execution time. Typical qCount running time Next, we measured execution time of counting alignments for each gene with qCount. We first created a GRanges query object gr extracted from the TxDb.Mmusculus.UCSC.mm10.knownGene package, which contained a total of 218,671 exons for 23,725 genes. This object was used as a query in qCount to count alignments in all three samples using: qCount(proj, gr, clObj=cl) The figure below shows qCount execution times (left) and speedup (right) for alignment counting. For qCount, the speedup starts to become sub-linear already with four cores, which may be due to the larger fraction of serial code in qCount compared to qAlign. Furthermore, the performance of the storage system when accessing the alignment files can also limit the speedup.

QuasR performance summary
The