ggCyto: next generation open-source visualization software for cytometry

Abstract Motivation Open source software for computational cytometry has gained in popularity over the past few years. Efforts such as FlowCAP, the Lyoplate and Euroflow projects have highlighted the importance of efforts to standardize both experimental and computational aspects of cytometry data analysis. The R/BioConductor platform hosts the largest collection of open source cytometry software covering all aspects of data analysis and providing infrastructure to represent and analyze cytometry data with all relevant experimental, gating and cell population annotations enabling fully reproducible data analysis. Data visualization frameworks to support this infrastructure have lagged behind. Results ggCyto is a new open-source BioConductor software package for cytometry data visualization built on ggplot2 that enables ggplot-like functionality with the core BioConductor flow cytometry data structures. Amongst its features are the ability to transform data and axes on-the-fly using cytometry-specific transformations, plot faceting by experimental meta-data variables and partial matching of channel, marker and cell populations names to the contents of the BioConductor cytometry data structures. We demonstrate the salient features of the package using publicly available cytometry data with complete reproducible examples in a Supplementary Material. Availability and implementation https://bioconductor.org/packages/devel/bioc/html/ggcyto.html Supplementary information Supplementary data are available at Bioinformatics online.


Background
Open source software for computational cytometry has gained in popularity over the past few years. Efforts such as FlowCAP, the Lyoplate and Euroflow projects have highlighted the importance of efforts to standardize both experimental and computational aspects of cytometry data analysis. The R/BioConductor platform hosts the largest collection of open source cytometry software covering all aspects of data analysis. It provides complex data structures and uses state of the art technologies to store cytometry data with all the relevant experimental, gating, and cell population annotations. While the number of algorithms available in Bioconductor for cytometry data analysis has significantly increased over the past decade, the cytometry specific visualization frameworks available in Bioconductor for these types of data have lagged behind.

2.4
This Supplementary Information document for the paper "ggcyto: Next generation visualization software for flow cytometry" shows how to use the BioConductor package ggcyto to visualize flow cytometry data and reproduces the plots from the original publication, together with code.

BioConductor flow cytometry tools
The BioConductor project has, at the time of writing, 47 different packages tagged as "Flow Cytometry" related.
ggcyto supported data structures.
ggcyto supports the core flow cytometry data structures in R/Bioconductor: flowFrame and flowSet (defined in the flowCore package) store ungated data and are created when an FCS or set of FCS files are read into R/BioConductor. The GatingSet and GatingHierarchy (defined in the flowWorkspace package) store gated data. They are created when FCS files together with a workspace defining the gating (e.g. from FloJo, DIVA, or CytoBank) are imported together. The GatingHierarchy stores a single sample, while the GatingSet stores a collection of samples.
Interaction with ggplot and the core BioConductor FCM data structures.
To facilitate the use of core data structures with ggcyto, the package wraps the ggplot S3 class in an abstract S4 class that provides support for both ungated ( flowFrame or flowSet ) or gated ( GatingSet or GatingHierarchy ) data sources and overloads ggplot's fortify() method to generate plots from these four core infrastructure objects. In ggcyto, the default "ggplot::+" operator is also overloaded by the ggcyto_GatingSet and ggcyto_flowSet classes. The cytometry data structures are processed by the "+" operator, extracting and data and generating ggplot-compatible objects in the following sequence: 2.5 2.6 2.7 2.8 1. Appropriate data structures are created from the arguments of t he `geom_gate()` layer.
2. The appropriate `geom_density` or `geom_point` layers are creat ed from `geom_overlay()` to display the data.
3. Cell population statistics are computed labels are created from the `geom_stats()` layer.
4. Marker or channel names are displayed on the axes and the corre sponding label layers are created.

Principles of ggcyto.
In ggplot users map plot aesthetics to variables and build a plot in layers. ggCyto follows those conventions, but the application domain constrains requirements so many options are set to sensible defaults. A user specifies a data source (a flowSet or flowFrame of ungated data, or a GatingSet or GatingHierarchy of gated data), map plot axes to flow parameters (e.g. channels or markers), specify which cell population to plot, specify an axis transformation, and optionally add one or more gates to a plot. Many of these quantities are complex objects defined in the core cytometry data structures in BioConductor ( Figure 1).

Lazy loading of data
For large data sets ggcyto implements context-dependent lazy loading. When plotting with the ggcyto or autplot APIs, data loading is automatically deferred to later stages of plotting.

Additional strategies for plotting large data
The latest version of ggcyto (1.9.4) provides speeds up plotting of large data sets through a subsampling strategy. Additional arguments to autoplot and ggcyto named sample.ratio and sample.threshold specify what proportion of events to subsample (default 1%, 0.01) and at what total number of events should subsampling kick in (default: 100,000 events). This substantially speeds up plotting of large cell populations such as lymphocytes (https://github.com/RGLab/ggcyto/issues/32).

3.1.1
The original manuscript shows three sets of plots created with ggcyto . To reproduce these, we load the GvHD data and use the code presented in the published figure to generate the plots.

The autoplot API
Here autoplot is used to visualize the populations named "CD3" and "CD19" from the GatingHierarchy . Since cell populations are defined by gates, the vector of population names is passed to the gate argument of autoplot when the data are a GatingSet (first argument). The bins argument specifies the size of the hexagonal binning grid.
The geom_overlay layer allows us to highlight an additional cell population on top of an existing plot. In this case, we are highlighting the "IgD+CD27+" population.
All cell populations in the gating hierarchy can be printed via the getNodes() API from flowWorkspace . autoplot(gs[1], gate = c("CD3","CD19"), bins = 64) + geom_overlay(data = "IgD+CD27+",size=0.25) When using the autoplot API, most plotting decisions are set to sensible defaults, depending on the context of the input (e.g. whether the data is a flowFrame or flowSet, or GatingSet or GatingHierarchy). When the autoplot API takes a flowFrame or flowSet, the user specifies the channels or markers to plot on the axes, rather than a cell population, since these structures represent ungated data.
For GatingSets and GatingHierarchies, the first input is a cell population name (or vector of several populations, provided they are defined in the same dimensions, see below), stored in the data source. This determines the selection of axes for plotting, the subset of the cells that are plotted (generally the cells in the parent of the selected population are plotted), the data transformation (by default the transformation stored in the data source is used and raw data values are plotted on tick marks with a non-linear spacing, analogous to a logarithmic plot). Any gates that define the chosen cell populations are drawn. When using autoplot, the additional parameter axis_inverse_trans switches between the transformed and untransformed data scale. Finally, the proportion of the gated populations (as a fraction of the parent) are displayed on the plot.
The plots that are constructed are also "aware" of the dimensionality of a cell population. If a cell population is defined (e.g. gated) in only one dimension, ggcyto will produce a 1D density, whereas if it is defined in two dimensions, a 2D density estimate of the cells (using geom_hex) will be produced. Passing in a single flowFrame with no markers creates a set of 1D densities for each marker in the data (see below).
The ggcyto API.
The ggcyto API allows for more control and flexibility in producing plots.
# The mapping of channels to axes is explicit via the mapping = aes (). argument. # The subset argument specifies the cell subset / subset of the dat a to plot. # A geom_hex layer with the bins argument specifies the type of plo t. # A geom_gate layer specifies which gates / cell populations to add to plot. # the user must ensure they are defined in the same channels as t he mapping. # The geom_stats layer specifies here that we want to show the coun ts. # The "percent" argument to "type" in "geom_stats" is also allow ed. ggcyto(gs[1:2], mapping = aes(x="IgD",y="CD27"), subset = c("CD19an dCD20")) + geom_hex(bins=128) + geom_gate(c("IgD+CD27+","IgD+CD27-","IgD-CD27+","IgD-CD27-")) + geom_stats(type = "count") Above, the geom_gate layer will add one or two-dimensional gates to a plot and accepts a gate object (eg. "rectangleGate") or a filterList (collection of gate objects) defined in the flowCore package, or named cell population if using a gated data source. The most recent version of ggcyto (1.9.4) also supports quadGate. Cells from a different cell population can be highlighted and overlaid on an existing plot using the geom_overlay (see below) layer to produce an effect of backgating (i.e. visualizing a defined cell population on data projected onto a different set of parameters). Cell population summaries such as cell counts, percentages, or proportions can be added to figures of gated data using geom_stats, which takes the name of a gated population in a GatingSet , or a filterList . With no input, ggcyto tries to parse the gate information from the first geom_gate layer.
Data transformations can be performed using the custom scale_x and scale_y layers. Common cytometry specific transformations are supported, including the default parameterization of the biexponential and the default parameterization of the hyperbolic arcsine transforms implemented in FlowJo, and the logicle and default hyperbolic arcsine transform parameterizations implemented in flowCore .
Another useful feature is the overloaded '%+%' method, which is used to replace the data used in a (potentially complex) plot construct. For example, one may wish to produce a graphic using one GatingSet object, then the same graphic using a different data from a separate GatingSet object. To produce the new plot, the '%+%' method is applied with the old plot object on the left and the new data on the right hand side of the operator (see below).
Transforming the data.

Additional ggcyto features.
The ggcyto package is built on ggplot2 and uses the "grammar of graphics". Below we demonstrate how its features can be used to easily generate high quality plots of cytometry data.
Following common ggplot2 usage, ggcyto defines an autoplot() method. Calling autoplot() on a supported data structure automatically results in a default plot that is sensible for the data being displayed.
For example, supplying a flowSet and a channel name result in a 1D density plot using geom_density . If more than one sample is passed in, a faceted plot of all samples is created. The facet_grid and facet_wrap geoms can be used on variables in the pData slot of the flowFrame or flowSet object. Below we generate a faceted 1D density grid of the forward scatter channel from the first four samples of the GvHD data.
# we select the first four samples # We plot the forward scatter channel. # facet_grid is specified to facet by Patient and Visit, which are defined in # the pData() slot of the flowSet. autoplot(fs[1:4], x="FSC-H") + facet_grid(Patient~Visit) Changing geom s.
An existing plot can be easily modified to use a different geometry by appending the appropriate geom to the plot construction call, as below.
Calling autoplot on a GatingSet or GatingHierarchy and supplying a population name (also an alias for a gate since the former are defined by the latter) results in a 1D or 2D density (depending on how the gate is defined) using geom_hex or geom_density . Provided the specified gates are compatible (defined on the same channels, have the same dimensionality), they will be projected onto the plot: # CD9 and CD19 are gates / populations defined in the gating set. autoplot(gs [[1]], gate = c("CD3", "CD19"), bins = 64)

4.6
Using the ggcyto interface.
The same plots created with autoplot can be generated using ggcyto() . Below, we show how to create the previous plot using the ggcyto API. Note below, the gate names and channel names are the same, but this is not necessarily always the case.