MatrixQCvis: shiny-based interactive data quality exploration for omics data

Abstract Motivation First-line data quality assessment and exploratory data analysis are integral parts of any data analysis workflow. In high-throughput quantitative omics experiments (e.g. transcriptomics, proteomics and metabolomics), after initial processing, the data are typically presented as a matrix of numbers (feature IDs × samples). Efficient and standardized data quality metrics calculation and visualization are key to track the within-experiment quality of these rectangular data types and to guarantee for high-quality datasets and subsequent biological question-driven inference. Results We present MatrixQCvis, which provides interactive visualization of data quality metrics at the per-sample and per-feature level using R’s shiny framework. It provides efficient and standardized ways to analyze data quality of quantitative omics data types that come in a matrix-like format (features IDs × samples). MatrixQCvis builds upon the Bioconductor SummarizedExperiment S4 class and thus facilitates the integration into existing workflows. Availability and implementation MatrixQCVis is implemented in R. It is available via Bioconductor and released under the GPL v3.0 license. Supplementary information Supplementary data are available at Bioinformatics online.


Jiang et al. (2019): Proteomics identifies new therapeutic targets of early-stage hepatocellular carcinoma
In their study, Jiang et al. [2019] analyzed primary tumor tissues and paired non-tumor liver tissues from 110 early-stage cases of hepatocellular carcinoma (according to the Barcelona Clinic Liver Cancer stages 0 and A) related to hepatitis B virus infection. The samples were taken prior to chemotherapy or radiotherapy. The samples were analyzed using a label-free proteomics mass spectrometry approach. The analysis of Jiang et al. [2019] resulted in the stratification into the subtypes S-I, S-II, and S-III, each associated with a specific clinical outcome.
The data set was downloaded from the PRIDE database (accession number PXD006512, available at www.ebi.ac.uk/pride/archive).
Since MatrixQCvis requires a SummarizedExperiment object, we will create in the following such an object from the files containing the intensity values and the metadata of the samples.

shinyQC(se)
The MatrixQCvis package offers several visualizations to explore the data quality and underlying statistical properties of the data set. Some of these visualizations will be plotted in the following paragraphs.

Normalization, transformation and missing value imputation of the data set of Jiang et al. (2019)
MaxQuant assigns the value 0 to the intensity  Lazar et al. [2016]).
The aim of a normalization strategy is to reduce the effects of technical variability on the data set. Jiang et al. [2019] performed quantile normalization in their analysis.
Normalization of data is complicated and there is no universal, a priori normalization strategy available for different data sets. Rather, the normalization technique has to be chosen based on the properties of the data set and the choice is likely to produce erroneous results when the underlying assumptions are not met [Wang et al., 2011, Wu et al., 2014. In the case of quantile normalization, it is assumed that all samples have the same distribution regardless of the sample type. This assumption, however, is only valid when a small number of proteins is disregulated [Wu et al., 2014]. Violations of these assumptions may lead to false positive results, effect-size reduction, and masking of true effects; more nuanced methods, such as class-specified normalization, in which the data is split based on the sample class/type, after which quantile normalization is performed on each split separately, might be more suitable in these cases [Zhao et al., 2020].
Using the naive quantile normalization approach and the MaxQuant data set from Jiang et al. [2019] containing 0s distorted the distributions of the underlying data (see We argue that visualization of the normalization results, as offered by MatrixQCvis, e.g. by looking at the per-sample distribution before and after conducting normalization or visualizations of trend lines of the per-sample median values or sum of values, will provide valuable insight into the expected execution of normalization strategies (i.e. reduction of technical variability and distribution properties of the underlying data).

Missing values in the data set of Jiang et al. (2019)
When detecting missing values in the SummarizedExperiment object, MatrixQCvis will load a dedicated interface to analyze missing and measured values. Missing values are an inherent characteristic of proteomics and metabolomics data sets acquired by mass spectrometry. Poor sample quality, however, might be characterized by a higher number of missing values and different sample types might show a different distribution of missing values. MatricQCvis offers visualizations that help to analyze these relations.
In addition, proper handling of missing values is required for sound biological information inference. MatrixQCvis includes several imputation methods [Lazar et al., 2016] for missing values; furthermore, MatrixQCvis enables to perform differential testing on the non-imputed data sets using the proDA package [Ahlmann-Eltze and Anders, 2019].

Trends/drifts in the data set of Jiang et al. (2019)
A diagnostic plot to identify trends in the data set stemming from technical variation is the trend/drift plot that shows the ordered set of samples on the x-axis and aggregated information (median or sum of values per sample) on the y-axis. This plot is valuable to identify systematic trends in the data set (e.g. due to fluctuations in mass spectrometry sensitivity dependent on acquisition time or batch effects). The trend lines should be checked after application of normalization, transformation, batch correction, and imputation methods.

Analysis of homogeneity of variance of the data set of Jiang et al. (2019)
Mean vs. standard deviation plots (mean-sd plots) display the (ranked) means of the features on the x-axis and the standard deviation of the features on the y-axis (see Figure 5). The combination of normalization, transformation, batch correction and imputation should result in a data set where the standard deviation is independent of the mean, i.e. that the features have the same finite variance (homogeneity of variance). The red lines in the plots indicate the running median estimator (window width 10%), and, in the case of mean-standard deviation independence, the running median estimator should be approximately horizontal. Imputation leads to the distortion of homogeneity of variance (see Figure 5 c and d). The effect has to be taken into account when performing differential expression analysis by including information on variance of the probed features.

Dimension reduction of the data set of Jiang et al. (2019)
MatrixQCvis includes functions to create dimension reduction plots (principal component analysis, PCA; principal coordinates analysis/ multidimensional scaling, PCoA/MDS; nonmetric multidimensional scaling, NMDS; t-distributed stochastic neighbor embedding, tSNE, van der Maaten and Hinton [2008]; uniform manifold approximation and projection, UMAP, McInnes et al. [2018]). Dimension reduction is a technique to show and explore the relations between variables of the underlying data in a reduced dimension space. All dimension reduction techniques employ geometrical projections that take points from a high-into a low-dimensional space. Next to the linear techniques (PCA, PCoA), non-linear dimension reduction techniques (NMDS, tSNE, UMAP) are implemented in the MatrixQCvis package.
As an example, PCA will reduce the 248-dimensional space from the data set of Jiang et al.
[2019] (consisting of 248 sample points) to a 2-dimensional plane (principal components 1 and 2) that explain 16.9% and 4.9%, respectively, of the variance of the 248 dimensional data set. MatrixQCvis enables to highlight the different factors of columns in colData(se) by different colors (see Figure 7).

Differential expression analysis of the data set of Jiang et al. (2019)
Using the MatrixQCvis package, it is possible to recapitulate the differential expression analysis of Jiang et al. [2019] (see Supplementary Table 6 therein). Before conducting the expression analysis, the (paired) samples that showed an increased number of missing proteins will be removed prior to conducting the differential expression analysis (this will be the samples XL21P, XL21T, XL18P, and XL18T). To conduct a differential expression analysis of paired samples, the following design formula will be used: where patient is the column in the colData of the SummarizedExperiment that contains the identifier of a patient and cell_type will be either T or P depending if the proteins were measured from tumor (T) or paired non-tumor (P) samples.
The specified contrast will be "cell_typeT", i.e. the effect between the cell_type T and P will be tested.
MatrixQCvis will display the result of a differential expression analysis both in a tabular format (similar to Table 1) and in the form of a volcano plot (see Figure 8). The employed normalization, transformation and imputation strategies and exclusion of the above-mentioned sample pairs due to increased number of missing values resulted in comparable results to Jiang et al. [2019] based on moderated t-tests (see Table 1).

Brueffer et al. (2018): Clinical Value of RNA Sequencing-Based Classifiers for Prediction of the Five Conventional Breast Cancer Biomarkers
In a second application, we will highlight the visualization functions of MatrixQCvis on a RNA-sequencing data set from breast cancer biopsies.
The data set stems from the population-based multicenter Sweden Cancerome Analysis Network -Breast Initiative [Brueffer et al., 2018[Brueffer et al., , 2020 and included Illumina paired-end RNA-sequencing data from 405 patients. The data set was accessed via the GEO database (accession number GSE81538). The data set contains comprehensive multi-rater histopathologic evaluation and was used as a training set for a classifier for a larger set consisting of 3273 samples (accession number GSE96058).
In a first step, the data set is loaded and converted to a SummarizedExperiment object that is readable by the shinyQC function. Furthermore, the SummarizedExperiment will be truncated in this analysis here: FPKM (fragments per kilobase of transcript per million mapped reads)-normalized transcripts with a mean expression below zero and invariant features (with standard deviation below 0.5) will be removed.

Transformation of the data set of Brueffer et al. (2018) and analysis of homogeneity of variance
The expression values are already FPKM-normalized. Here, the FPKM values will be subjected to variance stabilizing normalization/transformation only.
To check the statistical properties and the effects of variance stabilizing transformation, mean vs. standard deviation plots will give information on the homogeneity of variance across the mean-ranked features. Ideally, these plots show an approximately horizontally running median estimator (red line, window width 10%, cf. Figure 9).

MA plots and Hoeffding's D statistic values for the data set of Brueffer et al. (2018)
Next to the visualization plots shown above for the proteomics data set of Jiang et al. [2019], MatrixQCvis contains also other visualizations, such as MA plots (see Figure 10), to assess the data quality. Commonly, MA plots are used to identify abnormalities in the sample. The MA plot facilitates to identify samples that show abnormal values compared to a group of other samples (e.g. other samples of disease/healthy type).
Here, M is defined as and A is defined as where I i and I j are the logarithms of intensity or count values. The values for I i are taken from the sample i. For I j , the feature-wise means are calculated from the values of the group j of samples. The sample for calculating I i is excluded from the group j .   T7  T8  T9   T4  T5  T6   T1  T2  T3   0  5

Additional diagnostic plots to detect outlier samples in the data set of Brueffer et al. (2018)
Other diagnostic plots in exploratory data analysis to identify outlier samples are empirical cumulative distribution function (ECDF) plots (see Figure 12). The ECDF gives information on the distribution properties of a single sample and enables to identify the proportion of values below (or above) a certain threshold. MatrixQCvis includes functionality to calculate an averaged ECDF based on the specified group. Comparing the ECDF of a sample with the ECDF of the specified group, enables to add another layer of information if a sample deviates from the expected distribution (based on the group properties). Another diagnostic plot is based on distances between samples. MatrixQCvis enables to calculate distances between samples and provides visualizations to display these. Outlier samples might form (singleton) clusters within the heatmap, i.e. they might have a higher distance to other samples, and will show higher sums of distances. Taken together, the QC analysis conducted here using the MatrixQCvis package suggested that there is no indication that the data set of Brueffer et al. [2018] contains any outlier samples and that the data set shows favored statistical properties after FPKM-normalization and variance-stabilizing normalization/transformation (e.g. variance homogeneity across the features).