viRome: an R package for the visualization and analysis of viral small RNA sequence datasets

Summary: RNA interference (RNAi) is known to play an important part in defence against viruses in a range of species. Second-generation sequencing technologies allow us to assay these systems, and the small RNAs that play a key role in them, in unprecedented depth. However, scientists need access to tools that can condense, analyse and display the resulting data. Here, we present viRome, a package for R that takes aligned sequence data and produces a range of essential plots and reports. Availability and implementation: viRome is released under the BSD license as a package for R, available for both Windows and Linux at http://virome.sf.net. Additional information and a tutorial are available on the ARK-Genomics website: http://www.ark-genomics.org/bioinformatics/virome. Contact: mick.watson@roslin.ed.ac.uk

Second-generation sequencing allows scientists to assay RNAi systems in unprecedented depth, and short reads capture both the 21-22 nt siRNAs and the 24-30 nt piRNAs. However, scientists need to be able to summarize, analyse and visualize the results of such experiments. Here, we present viRome, a package for R, which takes aligned sequencing data in the BAM format (Li et al., 2009) and produces a variety of plots and reports that are essential to the analysis of viral siRNA datasets.
Software packages to analyse viral siRNA data exist. Paparazzi (Vodovar et al., 2011) is designed to reconstruct viral genomes from siRNA data and produces some plots similar to those of viRome. Alternatively, Visitor (Antoniewski, 2011), an informatic pipeline for analysing short-read viRNA data, also produces several similar plots. However, both are implemented in Perl and are limited to Linux/Unix operating systems. Both include alignment as part of the analysis, so using an alternative aligner would require programming skills. Finally, the plots are generated in batch mode, so there is no interaction between the user and the software.
As a package for R, viRome improves on these software packages in several ways: (i) it allows interaction between the user and the software during report and graph generation; (ii) it is available on any operating system that supports R and has been tested on Microsoft Windows and several Linux distributions; (iii) it separates visualization from alignment, so the user is free to use any alignment software they wish; and (iv) as an R package, it integrates seamlessly with other R packages from the Bioconductor project (Gentleman et al., 2004).

ANALYSIS AND VISUALIZATION
As input, viRome takes aligned sequence data in the BAM format. Many tools exist for alignment (Fonseca et al., 2012) and provided they support the SAM/BAM format, viRome is capable of working with their output. Many of the functions within viRome attempt to summarize millions of data points into tables and plots that allow biological interpretation. One of the benefits of viRome is that most functions return the summarized data, as well as creating a plot. This allows users to create their own plots if they wish. Figure 1 shows a selection of plots produced by viRome.
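To illustrate what "aligned sequence data" contains, the following minimal Python sketch parses a few records in the text SAM format (the uncompressed counterpart of BAM) into the fields used by the analyses described here: reference name, position, strand, sequence and length. It is not viRome code and does not reproduce viRome's API; the names `AlignedRead`, `parse_sam` and `EXAMPLE_SAM` are hypothetical, and the example records are invented for illustration. It relies only on two conventions of the SAM format: flag bit 0x4 marks an unmapped read, and flag bit 0x10 marks a read aligned to the reverse strand, whose sequence is stored reverse complemented.

```python
from typing import Iterable, List, NamedTuple


class AlignedRead(NamedTuple):
    """One mapped read (hypothetical record type, not a viRome structure)."""
    ref: str      # reference sequence name (SAM field RNAME)
    pos: int      # 1-based leftmost aligned position (SAM field POS)
    strand: str   # "+" or "-", derived from flag bit 0x10
    seq: str      # read sequence as it was sequenced
    length: int   # read length in nucleotides


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_sam(lines: Iterable[str]) -> List[AlignedRead]:
    """Parse SAM text lines, skipping header lines and unmapped reads."""
    reads = []
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        strand = "-" if flag & 0x10 else "+"
        seq = fields[9]
        if strand == "-":
            # SAM stores reverse-strand reads reverse complemented;
            # undo that to recover the read as sequenced.
            seq = reverse_complement(seq)
        reads.append(AlignedRead(fields[2], int(fields[3]), strand, seq, len(seq)))
    return reads


# Invented example records: one forward read, one reverse read, one unmapped read.
EXAMPLE_SAM = [
    "@SQ\tSN:virus\tLN:100",
    "r1\t0\tvirus\t5\t255\t21M\t*\t0\t0\tACGTACGTACGTACGTACGTA\t*",
    "r2\t16\tvirus\t10\t255\t4M\t*\t0\t0\tAACC\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGG\t*",
]

reads = parse_sam(EXAMPLE_SAM)
for read in reads:
    print(read)
```

In practice a binary BAM file would be read with a dedicated library rather than parsed by hand; the sketch only shows which fields the downstream summaries depend on.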
© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Global analyses: One of the first requirements is to plot a histogram of the lengths of mapped reads: a peak at 21-22 nt implies an siRNA response, and a high frequency of 24-30 nt reads with a peak at 28 nt implies a piRNA response. In viRome, this histogram can be created using the barplot.bam function. Users may also create a report using the sequence.report function. This produces a data.frame in R that summarizes and counts the sequences aligned to each base in a given reference sequence. Users can see the exact sequence, its length, the location and strand of the alignment plus a count of how many times that sequence occurs. As a data.frame, this can be easily exported to Excel or other spreadsheet software.
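The two global summaries described above, a read-length histogram and a per-sequence report, can be sketched in a few lines of Python. This is an illustrative sketch of the underlying counting only, not the implementation of barplot.bam or sequence.report; the function names, the tuple layout `(sequence, position, strand)` and the example reads are all assumptions made for the example.

```python
from collections import Counter

# Invented example reads: (sequence, 1-based leftmost position, strand).
reads = (
    [("A" * 21, 10, "+")] * 3
    + [("C" * 22, 40, "-")] * 2
    + [("G" * 28, 70, "+")]
)


def length_histogram(reads, min_len, max_len):
    """Count mapped reads of each length between min_len and max_len inclusive."""
    counts = Counter(len(seq) for seq, _pos, _strand in reads)
    return {n: counts.get(n, 0) for n in range(min_len, max_len + 1)}


def sequence_report(reads):
    """Collapse identical alignments into rows with a count, most frequent first."""
    counts = Counter(reads)
    rows = [
        {"seq": seq, "length": len(seq), "pos": pos, "strand": strand, "count": n}
        for (seq, pos, strand), n in counts.items()
    ]
    rows.sort(key=lambda row: (-row["count"], row["pos"], row["seq"]))
    return rows


hist = length_histogram(reads, 18, 30)
report = sequence_report(reads)

# Simple text rendering of the histogram, one row per read length.
for length, count in hist.items():
    print(f"{length:>2} nt | {'#' * count}")

for row in report:
    print(row)
```

Each report row carries the same columns the text lists (sequence, length, location, strand and count), so it could be written to a CSV file with the standard `csv` module for use in a spreadsheet.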
Location-based analyses: Although many viruses are targeted by the siRNA pathway throughout the genome, others are targeted only in limited regions (Sabin et al., 2013). A heatmap representing the occurrence of all mapped read lengths across all genomic locations can be produced using the size.position.heatmap function, and barplots showing counts for each genomic location for each read length can be generated using the stacked.barplot function.
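The data behind such a heatmap is a matrix of counts indexed by read length and genomic position, and the stacked barplot corresponds to its per-position column totals. The Python sketch below builds that matrix from invented reads. It is not the implementation of size.position.heatmap or stacked.barplot, and it assumes that each read is counted once, at its leftmost aligned position; the text does not state which position viRome uses.

```python
# Invented example reads: (1-based leftmost position, read length).
reads = [(3, 21), (3, 21), (3, 22), (8, 21), (5, 35)]


def size_position_matrix(reads, ref_length, min_len, max_len):
    """Return {read length: [count at position 1, ..., count at ref_length]}."""
    matrix = {n: [0] * ref_length for n in range(min_len, max_len + 1)}
    for pos, length in reads:
        # Ignore reads outside the requested length range or the reference.
        if min_len <= length <= max_len and 1 <= pos <= ref_length:
            matrix[length][pos - 1] += 1
    return matrix


def position_totals(matrix):
    """Sum over read lengths at each position (the height of each stacked bar)."""
    return [sum(column) for column in zip(*matrix.values())]


matrix = size_position_matrix(reads, ref_length=10, min_len=21, max_len=22)
totals = position_totals(matrix)

for length, row in matrix.items():
    print(length, row)
print("total", totals)
```

The 35 nt read is dropped because it falls outside the requested 21-22 nt range, which mirrors the common practice of restricting such plots to the small RNA size classes of interest.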
Read-based analyses: Read-based analyses allow users to focus on patterns in particular subsets of reads. Single barplots showing the location, strand and count of reads mapping throughout the genome can be visualized using the position.barplot function. The base composition of subsets of reads can be calculated with the make.pwm function. Sequence signatures of the piRNA pathway include a strong U1 bias in primary, antisense piRNAs and, following 'ping-pong' cycle amplification involving AGO3 and Aub, a strong A10 bias in secondary sense piRNAs in Drosophila (Brennecke et al., 2007). Similar motifs have been found in piRNAs and viral piRNA-like molecules in mosquitoes or derived cell lines (Morazzani et al., 2012; Schnettler et al., 2013; Vodovar et al., 2012). The output of make.pwm can be plotted as a heatmap using the pwm.heatmap function, or used with external packages such as seqLogo and motifStack to produce sequence logos. Finally, the 5′-ends of complementary piRNAs are most frequently separated by 10 nt (Brennecke et al., 2007; Vodovar et al., 2012) because of the earlier described 'ping-pong' amplification. The distance between 5′-ends of piRNAs mapping to opposite strands can be summarized and visualized using the read.dist.plot function.
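Both read-based summaries reduce to simple counting, sketched below in Python on invented data. The first function computes per-position base frequencies, in which a U1 bias appears as a high T frequency at position 1 (sequencing reads use the DNA alphabet) and an A10 bias as a high A frequency at position 10. The second tallies the offset between the 5′-ends of reads on opposite strands. Neither is the implementation of make.pwm or read.dist.plot. The function names are hypothetical, and the offset is defined here inclusively as (5′-end of the antisense read) − (5′-end of the sense read) + 1, so that a 10 nt ping-pong overlap gives the value 10; the text does not say which convention viRome uses.

```python
from collections import Counter


def base_frequencies(seqs, width):
    """Per-position frequency of A, C, G and T over the first `width` bases."""
    counts = {base: [0] * width for base in "ACGT"}
    used = 0
    for seq in seqs:
        if len(seq) < width:
            continue  # too short to contribute to every position
        used += 1
        for i, base in enumerate(seq[:width].upper()):
            if base in counts:
                counts[base][i] += 1
    if used == 0:
        raise ValueError("no sequence is at least `width` bases long")
    return {base: [c / used for c in column] for base, column in counts.items()}


def five_prime_offsets(reads, max_offset=30):
    """Tally inclusive 5'-end offsets between every sense/antisense read pair.

    `reads` holds (1-based leftmost position, length, strand) tuples.
    """
    # The 5'-end of a "+" read is its leftmost base; of a "-" read, its rightmost.
    plus_ends = [pos for pos, _length, strand in reads if strand == "+"]
    minus_ends = [pos + length - 1 for pos, length, strand in reads if strand == "-"]
    tally = Counter()
    for x in plus_ends:
        for y in minus_ends:
            offset = y - x + 1
            if 1 <= offset <= max_offset:
                tally[offset] += 1
    return tally


# Invented 10 nt sequences: three start with T (U in RNA), all end with A.
seqs = ["TACGTACGTA", "TGGGTCCCCA", "TTTTTTTTTA", "CAAAAAAAAA"]
pwm = base_frequencies(seqs, width=10)

# Invented alignments: one sense read and two antisense reads.
alignments = [(100, 27, "+"), (83, 27, "-"), (90, 25, "-")]
offsets = five_prime_offsets(alignments)

print(pwm["T"][0], pwm["A"][9])
print(sorted(offsets.items()))
```

The pairwise loop is quadratic in the number of reads, which is acceptable for a sketch; for millions of reads one would tally 5′-end positions per strand first and then combine the two tallies.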

CONCLUSIONS
Deep sequencing experiments have revealed a variety of interesting and unique signatures of the miRNA, siRNA and piRNA pathways, and there is a need for software that allows scientists to process such data. We have developed viRome, a package for R that allows the interactive generation of a range of informative plots and reports. As an R package, viRome is available on a range of operating systems. viRome is released under an open-source license and can be downloaded from http://virome.sf.net, where a tutorial is also available.