Dashboard-style interactive plots for RNA-seq analysis are R Markdown ready with Glimma 2.0

Abstract Glimma 1.0 introduced intuitive, point-and-click interactive graphics for differential gene expression analysis. Here, we present a major update to Glimma that brings improved interactivity and reproducibility using high-level visualization frameworks for R and JavaScript. Glimma 2.0 plots are now readily embeddable in R Markdown, thus allowing users to create reproducible reports containing interactive graphics. The revamped multidimensional scaling plot features dashboard-style controls allowing the user to dynamically change the colour, shape and size of sample points according to different experimental conditions. Interactivity was enhanced in the MA-style plot for comparing differences to average expression, which now supports selecting multiple genes, export options to PNG, SVG or CSV formats and includes a new volcano plot function. Feature-rich and user-friendly, Glimma makes exploring data for gene expression analysis more accessible and intuitive and is available on Bioconductor and GitHub.


INTRODUCTION
RNA-sequencing (RNA-seq) is a high-throughput method for characterizing transcriptomes (1). Researchers commonly leverage RNA-seq technology to compare the transcription levels of genes across experimental conditions, in a workflow known as differential expression (DE) analysis. As there are tens of thousands of genes involved in DE analyses, it can be difficult to pinpoint information on genes of interest in densely populated static R (2) plots. Static graphics necessarily provide a flattened perspective on the data; version 1 of Glimma (3) aimed to remedy this by allowing users to interactively explore data at the sample-level through dimensionality reduction plots and at the gene-level in plots of summary statistics obtained from popular DE analysis tools limma (4), edgeR (5) and DESeq2 (6).
While many powerful tools exist for producing interactive plots for DE analysis (7)(8)(9)(10)(11), they require users to run a Shiny server (12), which may be difficult to navigate for those with minimal experience in R. The Glimma software does not depend on Shiny and can produce portable outputs that can be viewed without an active installation of R. This allows the package to cater to its main user base of biologists and end-users who would like drop-in graphics for popular gene expression workflows.
Glimma version 1 allowed the creation of two interactive versions of limma-style plots: a multidimensional scaling (MDS) plot used to assess variability between samples and a mean-difference (MD) plot used for identifying differentially expressed genes between experimental conditions. These could be exported as HTML files and shared with collaborators, allowing biologists to investigate interesting features in the data with minimal coding required.
The responsive and user-friendly layout of Glimma 1.0 has proven very popular among the Bioconductor community, amassing over 19 000 downloads in 2020 alone. The software is commonly used for exploration of transcriptional data from raw gene expression-levels to summarized results obtained from DE analysis (13).
Intended as a drop-in interactive visualization tool for common RNA-seq workflows, Glimma was built on d3.js (14) and relied on custom-built functions for connecting R code with a web-based frontend. The low-level nature of d3.js and the high complexity of the codebase made it difficult to add improved interactivity and new output formats to the package. For instance, simple improvements such as adding plot legends and scaling point sizes became intractable tasks. Version 2.0 of Glimma was built to address these deficiencies by reproducing all existing functionality in 1.0 using existing high-level libraries such as htmlwidgets (15) and Vega (16). This would make it easier for bioinformaticians and software engineers to rapidly develop new features in response to new developments in gene expression studies, such as single cell RNA-seq (scRNAseq) analysis.
Another important theme in the second iteration of Glimma was reproducibility. Previous versions instantiated plots within a new HTML window when Glimma R functions were called, creating a separation between interactive plots and the code that created them. It was therefore a requirement that plots could be embedded in R markdown (17) in version 2, allowing bioinformaticians to share reports with graphics interwoven between blocks of code.
Glimma 2.0 offers three main interactive plots--an MDS plot via the glimmaMDS function, an MD plot (which has been renamed to MA plot) via the glimmaMA function and a volcano plot via the glimmaVolcano function. Additionally, a generalized version of an interactive plot that comprises of the two components seen in the MD and volcano plots, which includes a plot of summary results from a DE analysis and a plot of expression data from which the DE analysis was performed on, is available via the glim-maXY function (see Figure 1). We refer to the style of these plots as 'summary-expression' plots for the remainder of the article.
Relative to the original version of the software, the volcano plot is a new addition to Glimma. It was included for ease of use since this it is a common plot created in a RNAseq workflow. The new plot simplifies coding by Glimmausers who would have otherwise had to used the more general XY plot to create this plot.
The MD plot has been renamed to 'MA plot' to reflect the original name given to this style of plot in the limma package. A plot of mean gene expression (or averages) against log 2 -fold change or logFC (also called 'M-values') were referred to as an MA plot. Glimma's MA plot is equivalent to limma's MD plot.
In this article, we will demonstrate a suite of interactivity and functionality upgrades made in Glimma 2.0. The embedded plots offer a seamless user interface integrated with the workflow of DE analysis. Amidst fervent demand for improved reproducibility in bioinformatics research, we anticipate that version 2 will be even more popular than the initial version.

Availability
Functions from Glimma 1.0, as well as the improvements made to Glimma 2.0 are available in the Glimma software package. The stable version of the software is released on Bioconductor (18) and is available at https://bioconductor. org/packages/release/bioc/html/Glimma.html. Both stable and developmental releases of Glimma are available on GitHub at https://github.com/hasaru-k/GlimmaV2.

Frameworks
Glimma 2.0 was re-developed using htmlwidgets for R (15) (https://CRAN.R-project.org/package=htmlwidgets), a framework for creating web-based visualizations which behave like R plots, which offers embeddability in R Markdown out of the box, addressing reproducibility issues in the previous version. Some popular packages using htmlwidgets as a foundation include Plotly (19,20) (https://CRAN. R-project.org/package=plotly) and dygraphs (21) (https:// CRAN.R-project.org/package=dygraphs) for R. The htmlwidgets package provides a convenient API (application programming interface) for software developers to bind R functions to a JavaScript visualization front-end codebase. This means the process of connecting R function calls to web-based visualizations including serialization of R data structures is automated, greatly reducing the complexity of the development process. To ensure backwards compatibility with previous versions, version 2 was built to support data structures from edgeR, limma and DESeq2 workflows as inputs (see Figure 1).

Visualization
Interactive plots were generated within an instantiated widget using the Vega visualization grammar (16) which is based on d3.js (14) but provides higher level graphical primitives and allows convenient linkage of graphs. Declarative JSON specifications for each genre of plot were created, parameterized and passed to the Vega API which renders the desired interactive web-based graphics. The Vega API was also used to display tooltips, respond to on-click events and dynamically update plots.
In conjunction with external libraries, native JavaScript was used to link data across various plots and tables. In the front-end interface for summary statistic plots, an interaction state machine ensures that the response to any sequence of inputs from the user is well-defined. For instance, selecting genes on the summary plot triggers a transition into the graph selection state. This event filters the table to display the selected genes and updates other visual controls. The open-source FileSaver.js package (22) (https://github.com/ eligrey/FileSaver.js) was used for exporting gene summary statistics and expression data as CSV files for summaryexpression plots.

Embedded Plot or stand-alone file
A major new feature of Glimma 2.0 is that it allows plots to be embedded in R markdown. Users still have the option to export widgets as a standalone HTML file with   inlined JavaScript, CSS and assets. This is another improvement on version 1.0 where exported HTML had to be manually zipped with scripts and assets before sharing interactive plots with collaborators.

Image export
Glimma 2.0 widgets contain save buttons which allow the user to export plots in their current state as static PNG or SVG files. The latter filetype is an infinitely scalable vector format which allows users to further manipulate plots in Adobe Illustrator. For MDS plots, both the MDS plot itself and the 'Variance Explained' plot can be exported separately using the 'Save Plot' button. Similarly, summaryexpression plots have the capability to export either of the summary or expression plots separately. This is a marked improvement over version 1.0 which lacked image saving capabilities.

MDS plot improvements
Each point on the MDS plot represents an experimental sample for which the transcriptional output across thousands of genes has been recorded. MDS reduces the dimensionality of the data so that the first few dimensions repre-sent most of the variation, which allows us to visually examine similarities and differences in the transcription profiles of samples (4). Samples that cluster together are interpreted as more similar than those which are far apart, and each dimension of the MDS plot explains a decreasing proportion of the total variation. This is used for exploratory analysis of an experiment to determine which factors drive the variation between samples and the degree of their effects. It is useful for observing both expected and unexpected differences in the samples' transcriptional profiles (e.g. separation between biological groups versus unknown batch effects), and serves as a predictor of what downstream DE analysis results may look like (e.g. poor separation between biological groups in the MDS plot tends to result in no or low numbers of differentially expressed genes).
Sizing, shaping and colouring points. Typically, one would simply examine the first 2 dimensions in a static MDS plot--the first 2 dimensions explain most of the variation in the data; and move onto the next steps in data exploration if satisfied with the clustering of biological and/or experimental groups there. Whilst this task is fairly straight-forward, one can easily end up with several versions (4 or more) of the same dimension-1-and-2 MDS plot by the time they apply different sizing, shapes and colours to the plot to reflect experimental factors in the study. Applying these custom visual transformations to static R plots can be timeintensive and require coding experience. We therefore designed dashboard-style inputs which apply a range of visual modifications to the MDS plot. Users can encode information about each sample by adjusting the colour, shape and size of each point using drop-down menus (see Figure 2). Points can be coloured by a given feature using a continuous or discrete colour scheme of choice which can be changed on the fly using a drop-down menu. These modifications to the MDS plot help users determine the experimental factors by which samples cluster. They also allow for the encoding of redundancy into a visualization which can emphasize the visual groupings of a data display (25). For instance in Figure 2, samples are coloured and shaped by the same experimental factor, whilst library sizes are reflected in its sizing.
Plotting dimensions. Aside from dimension-1-and-2 plots, the examination of higher dimensions allows one to check for unwanted batch effects in the dataset (13). In many cases, this next step is skipped over if dimension-1-and-2 plots do not raise any questions, especially when production of the first set of plots was already quite time-intensive. Occasionally, one would create higher dimension R plots within the R console as a visual inspection, but not save or document the plot unless an interesting feature is discovered (e.g. samples cluster by sample collection date when sample collection date was not expected to influence gene expression). Glimma 1.0 only allowed the user to plot consecutive pairs of dimensions together (i.e. 1 and 2 in conjunction, 2 and 3, etc). Version 2.0 remedies this limitation by providing drop-down menus which allow the user to plot any combination of dimensions on the x and y axes. This allows the creation of plots that show separation of samples based on two experimental factors when there is a confounding factor driving an intermediate MDS dimension. Moreover, the interactive plots allow one to check variations of the plot (higher dimensions or combination of dimensions) without having to retrospectively code and save extra plots to an analysis. The new 'Save Plot' button (see Figure 2) is also useful for adding any variation of the MDS plots to reports, presentations and publications when the interactive form is not needed or when working outside of R Markdown documents.

Summary-expression plot improvements
Summary-expression plots in Glimma, which include MA, volcano and XY plots, share a common front-end user interface (see Figure 3), with a main plot of gene-wise summary statistics on the top-left and an expression plot showing the abundance of the selected gene in each sample on the top-right. Beneath these plots lies a table containing annotation data and DE analysis statistics which is interactively linked to both the summary and expression plots.
The MA plot, which is called via the glimmaMA function, is used to visualize logFC (M) on the y-axis versus average abundance (A) on the x-axis. The volcano plot, called via the glimmaVolcano function, displays a measure of statistical significance on the y-axis, specifically this is the -log 10 -P-value of genes where higher values reflect greater significance, versus logFC on the x-axis. The generic XY plot, which is called via glimmaXY, displays two userparameterized vectors on the x-and y-axes of the summary plot. The vectors are required to be same length as the number of genes in the dataset, such that the points in the summary plot must match the number of columns in the expression data.
Numerous improvements have been made to the R function interfaces and common front-end for this trio of summary-expression plots. These are summarized below.
Volcano plot. In previous versions of Glimma, volcano plots could only be generated by manually extracting and log-transforming the relevant x and y axis vectors, and then passing these to the glimmaXY() function. Version 2.0 provides a new glimmaVolcano function that automatically extracts and transforms statistical significance alongside logFC from edgeR, limma and DESeq2 objects to generate an interactive volcano plot widget as per Figure 3.
Multiple gene selections. In DE analysis we are often interested in the transcription profile of a set of multiple genes. In the initial release of Glimma, researchers had to query genes one-by-one in order to locate them on the plot. Version 2.0 remedies this by allowing for multiple genes to be simultaneously highlighted. Users can search and sort genes in the table of summary statistics and annotation data. Selecting any number of rows on the table will highlight the corresponding points on the graph. For instance, the three most significant genes on the X chromosome have been highlighted in Figure 3. Interaction is bi-directional: users can also click on points in the summary plot to toggle their selection, which will filter the table beneath to display data for just the selected genes. The expression plot on the right shows the expression values measured in log 2 -counts-permillion (log-CPM) across all samples for the most recently selected gene, stratified by experimental group.
Automatic highlighting of significant genes. Version 2.0 auto-highlights significant genes on behalf of the user by calling limma::decideTests() on MArrayLM objects and edgeR::decideTestsDGE() on DGEExact and DGELRT objects. These sensible defaults can be overridden using the status argument.
User-specified colours for expression plot. Summaryexpression plots take an optional sample.cols vector argument which allows users to vary the colours of samples in the expression plot on. This creates a visualization whereby samples are separated by treatment group but coloured by another factor such as gender or batch.
X-axis label rotation. Feedback from version 1.0 users suggested that when a large number of groups were plotted on the expression plot, distinguishing between the labels of each group became impossible. Version 2 allows a larger number of clearly readable x-axis labels by rotating each label by 45 degrees anti-clockwise. Bioinformatics, 2021, Vol. 3, No. 4 Fixing expression Y-axis. By default, selecting a gene causes the vertical axis of the expression plot to re-scale itself according to the maximum expression value for that gene. Feedback from biologists indicated that this made it difficult to notice when there were large changes in expression relative to another gene they had clicked on. Version 2.0 remedies this by adding the max y axis input form allowing the user to fix the y-axis maximum to the numeric value specified (see Figure 3), which overrides the autoscaling behaviour.

NAR Genomics and
User interface improvements. Plots now display gene symbol text above selected points, allowing genes to be visually identified without additional manipulation in Illustrator. Legends were added to the summary plot indicating the colour associated with the DE status of the gene (1 for upregulated, 0 for non-DE, -1 for downregulated). A list of gene identifiers beneath the plots indicates the current selection of genes highlighted across the entire table or plot, helping the user to keep track of what they have highlighted.
Data export. The 'Save Data' button allows researchers to export the table data concatenated with expression data as a CSV file. Researchers can optionally save only the selected subset of genes, or all genes in the table. This is useful for further interrogation of the data using external programs and/or manipulation into other plotting styles outside of Glimma.
Streamlined function arguments. Plots in version 1.0 had lengthy function prototypes that required the user to manually extract gene counts, test results and experimental groups from data structures which already contained this information. Version 2.0 significantly reduces the verbosity of function calls by automating data extraction for the user. For instance, function calls for the MArrayLM data structure from limma now require two arguments down from four: # version 1 glMDPlot(fit, counts=rnaseq$counts, status=limma::decideTests(fit), groups=rnaseq$samples$group) # version 2 glimmaMA(fit, dge=rnaseq) In a similar fashion, analysis plot function prototypes for edgeR and DESeq2 data structures now require 2-3 fewer arguments to generate the equivalent plots.

Plots for scRNA-seq data
The Glimma package was originally designed and developed with bulk RNA-seq data in mind. However, we have found that users are also applying the plots to scRNA-seq data. In light of this, we have provided preliminary support for single cells as a new vignette to demonstrate the use of Glimma version 2.0 for single cell data. Briefly, MDS plots can be created using the full set of single cells, while summaryexpression plots can be created for pseudo-bulk samples or a sub-sample of the single cells. Sub-sampling of the single cells is currently necessary to meet performance constraints, but a random sub-sample is still informative to the distribution of the full set of samples. Rotation of x-axis labels has also improved the applicability of Glimma to scRNAseq analyses which often contain a high number of sample groups or cell types.

DISCUSSION
In this paper we presented R markdown-ready interactive plots for gene expression analysis, designed for convenient use for biologists and end users. A highly requested feature pack has been implemented which improves the reproducibility and interactivity of the package, with major inroads also made to supporting single cell RNA-seq analysis. Re-implementing Glimma using high-level libraries has greatly increased the extensibility of the package, futureproofing it for new capabilities to be added rapidly in response to user demand. For instance, version 2 achieved a 63.8% reduction in lines of R code (1175 versus 3246 in version 1.0) despite incorporating many more features due to the use of external libraries and modularization. This translates into a vastly more manageable codebase.
Another noteworthy concern is the file size of knitted HTML from R markdown containing Glimma 2.0 interactive plots. For the plots demonstrated in this paper drawn from a dataset with nine samples and 27,000 genes, we recorded file sizes of 1.1 and 4.5 megabytes for the MDS and volcano plots respectively. As we expect file sizes to increase linearly with the cardinality of the datasets used in analysis, compressing exported plots may be prudent in order to share them around more efficiently.
A Bioconductor tool that shares similar goals to Glimma is the iSEE package (9) which allows users to dynamically configure data displays such that one panel transmits certain features of the data to another receiving panel. For instance, panels can be arranged such that selecting rows in a transmitting table filters points in a receiving graph, achieving a similar effect to the Glimma analysis plots. In Glimma, transmission and reception of data between displays is preconfigured and immutable, sacrificing flexibility for ease of use and minimal set-up. This allows the package to cater to its main user base of biologists and end-users desiring dropin graphics for limma, edgeR and DESeq2 workflows, rather than bioinformaticians requiring powerful configurability.

Future work
Glimma's interactive analysis plots can be prohibitively slow to use on scRNA-seq analyses. This is due to the size of scRNA-seq counts matrices which can contain many thousands of cells. Taking advantage of the fact that the vast majority of counts are zero in single cell expression data, we are exploring sparse matrix objects for JavaScript to improve performance. The main challenge in this space is selecting and testing an open-source sparse matrix implementation which maximizes the efficiency for our use case (storing scRNA-seq data) while adding minimal bulk to the package.
In addition, the direct support of SingleCellExperiments objects from Bioconductor is challenging due to the lack of consensus among different packages on how final summary statistics are to be stored. Direct support of SingleCellExperiment objects can be added when a consensus or dominant method emerges.
In future versions of Glimma, we aim to extend the interactive front-end for the MDS plot to support UMAP (26) and tSNE (27) dimensionality reduction visualizations. Further improvements to reproducibility of the MDS plot could also be achieved by allowing users to pre-set the initial colour, shape or size of points using function arguments.