Abstract

Motivation

There is a need for easily accessible implementations that measure the strength of both linear and non-linear relationships between metabolites in biological systems as an approach for data-driven network development. While multiple tools implement linear Pearson and Spearman methods, there are no such tools that assess distance correlation.

Results

We present here SIgned Distance COrrelation (SiDCo). SiDCo is a GUI platform for calculation of distance correlation in omics data, measuring linear and non-linear dependencies between variables, as well as correlation between vectors of different lengths, e.g. different sample sizes. By combining the sign of the overall trend from Pearson’s correlation with distance correlation values, we further provide a novel “signed distance correlation” of particular use in metabolomic and lipidomic analyses. Distance correlations can be selected as one-to-one or one-to-all correlations, showing relationships between each feature and all other features one at a time or in combination. Additionally, we implement “partial distance correlation,” calculated using the Gaussian Graphical Model approach adapted to distance covariance. Our platform provides an easy-to-use software implementation that can be applied to the investigation of any dataset.

Availability and implementation

The SiDCo software application is freely available at https://complimet.ca/sidco, where supplementary help pages are also provided. Supplementary Material shows an example of an application of SiDCo in metabolomics.

1 Introduction

The analysis of biological networks, as a parallel investigation to the study of individual feature characteristics, requires robust quantification of the interconnections between features within biological systems (Ma'ayan 2011). Several methods for data-driven determination of feature interconnections have been used in the analysis of metabolomic data. Pearson and Spearman correlation-based methods are arguably the most prevalent (Amara et al. 2022). While providing critical information about the direction of dependencies, these methods measure only linear or monotonic correlations and cannot detect other non-linear feature interactions (Rosato et al. 2018). Distance correlation, a non-parametric approach for correlation analysis, can measure various types of data relationships (linear and non-linear) as well as the correlations between vectors of different lengths (Gábor and Maria 2009; Székely and Rizzo 2013; Edelmann et al. 2021). In metabolomic and lipidomic datasets, distance correlation can accommodate the sparse coverage of feature data, the potential for non-linear relationships, and the possibly random network topologies associated with metabolism that are inherent to such datasets, with zero correlation obtained only for fully independent features.

Despite these advantages, few publications have used distance correlation to analyze metabolomic data (Oliveira et al. 2015; Tang et al. 2019; Cuperlovic-Culf et al. 2021). We suggest that this is, in part, due to the lack of easily accessible implementations. Moreover, no existing implementation, to our knowledge, allows users to assess partial correlations, calculated as the measure of association between pairs of features while removing the confounding effects of other variables. To address this need and thereby provide new methods for the reconstruction of regulatory metabolomic and lipidomic networks, we present here SIgned Distance COrrelation (SiDCo), a web-based application for both signed distance correlation and partial distance correlation, the latter calculated using the Gaussian Graphical Model (GGM) approach previously applied to other correlation measures (Lauritzen 1996).

2 Implementation

SiDCo is implemented in Python with an RShiny front-end. It is compatible with all web browsers. Two analytical tabs allow users to perform either distance correlation or partial distance correlation. In both applications, users define their desired correlation threshold and P value cut-off. Data are automatically z-score normalized across all samples prior to analysis. Users are reminded that missing values must be imputed according to their specifications; otherwise, the data will be returned with a descriptive error message.
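The preprocessing step described above can be sketched as follows. This is a minimal illustration rather than SiDCo's own code: the function name `zscore_features`, the samples-in-rows layout, and the use of the sample standard deviation (`ddof=1`) are assumptions made for the example.

```python
import numpy as np


def zscore_features(data):
    """Z-score each feature (column) across all samples (rows).

    Hypothetical helper: layout and ddof=1 are assumptions, not SiDCo's API.
    """
    data = np.asarray(data, dtype=float)
    if np.isnan(data).any():
        # Mirrors the requirement that missing values are imputed beforehand.
        raise ValueError("Missing values found; impute them before analysis.")
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    if np.any(std == 0):
        raise ValueError("Constant feature found; z-score is undefined.")
    return (data - mean) / std
```

After this step every feature has zero mean and unit standard deviation, so features measured on different scales contribute comparably to the pairwise distances.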

Distance correlations and P values are calculated and presented as described below and a correlation directionality sign is derived from Pearson correlation analysis as an indication of the overall linear trend in the data. Distance correlation calculations in SiDCo are provided in three forms: (i) “one-to-one,” calculating correlations between each pair of features, (ii) “one-to-all,” providing correlations for each feature with all other features combined, and (iii) partial correlation calculated for each pair of features while controlling for the contributions of other features, i.e. covariates.

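Distance correlation, $\mathrm{dCor}_{X,Y}$, between features X and Y and distance covariance, $\mathrm{dCov}_{X,Y}$, are calculated as:

$$\mathrm{dCor}_{X,Y} = \frac{\mathrm{dCov}_{X,Y}}{\sqrt{\mathrm{dVar}_X \, \mathrm{dVar}_Y}} \quad \text{and} \quad \mathrm{dCov}_{X,Y} = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k} B_{j,k}$$

where A and B represent doubly centered distance matrices for variables X and Y, respectively, measured in n samples. Distance variances (dVar) are:

$$\mathrm{dVar}_X = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k}^2 \quad \text{and} \quad \mathrm{dVar}_Y = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} B_{j,k}^2$$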

In a one-to-one correlation calculation, an array of values for each feature is compared with an array of values for all the other features one at a time. In this case, a doubly centered distance matrix is calculated as:

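$$A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{\bar{a}} \quad \text{and} \quad B_{j,k} = b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{\bar{b}}$$

where the Euclidean distance from $x_j$ to $x_k$ or from $y_j$ to $y_k$ is calculated as $a_{j,k} = \sqrt{(x_j - x_k)(x_j - x_k)}$ and $b_{j,k} = \sqrt{(y_j - y_k)(y_j - y_k)}$. Here $\bar{a}_{j\cdot}$, $\bar{b}_{j\cdot}$ and $\bar{a}_{\cdot k}$, $\bar{b}_{\cdot k}$ are the j-th row and k-th column mean values, and $\bar{\bar{a}}$, $\bar{\bar{b}}$ are the overall means of the distance matrices a and b.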
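The one-to-one calculation can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the SiDCo source: the function names are hypothetical, and the code takes dCov as the mean of the element-wise product of the doubly centered matrices and dCor as dCov divided by the square root of the product of the two distance variances. Some references define distance correlation as the square root of this ratio.

```python
import numpy as np


def distance_matrix(values):
    """Pairwise Euclidean distances a[j, k] between samples."""
    v = np.asarray(values, dtype=float)
    if v.ndim == 1:
        v = v[:, None]  # treat a single feature as an (n, 1) array
    diff = v[:, None, :] - v[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Subtract row and column means, then add back the overall mean."""
    return (d - d.mean(axis=1, keepdims=True)
              - d.mean(axis=0, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """dCov / sqrt(dVar_X * dVar_Y) for two features measured in n samples."""
    A = double_center(distance_matrix(x))
    B = double_center(distance_matrix(y))
    dcov = (A * B).mean()    # (1/n^2) * sum_jk A_jk * B_jk
    dvar_x = (A * A).mean()  # (1/n^2) * sum_jk A_jk^2
    dvar_y = (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return float(dcov / denom) if denom > 0 else 0.0
```

As a usage example, for `x = np.array([-2., -1., 0., 1., 2.])` the Pearson correlation between `x` and `x**2` is zero, whereas `distance_correlation(x, x**2)` is clearly positive, which illustrates the sensitivity to non-linear dependence.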

In the one-to-all case, each of the m features, measured in n-dimensional sample space, is compared with all the other features combined in n × (m−1) dimensional space. The distance matrix for variable Y, from which the doubly centered matrix used in the calculation of dCov is obtained, is here:

$$b_{j,k} = \sqrt{\sum_{s=1}^{m-1} (y_{js} - y_{ks})(y_{js} - y_{ks})}$$

and equivalently for $a_{j,k}$ for X.
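A sketch of the one-to-all calculation is given below. It is an illustration under assumptions rather than the SiDCo implementation: the function names and the samples-in-rows, features-in-columns layout are hypothetical, and the multivariate distance is taken as the Euclidean distance over the remaining m−1 features.

```python
import numpy as np


def centered_distances(values):
    """Doubly centered Euclidean distance matrix for an (n,) or (n, p) array."""
    v = np.asarray(values, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    diff = v[:, None, :] - v[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))  # sums over the p feature columns
    return (d - d.mean(axis=1, keepdims=True)
              - d.mean(axis=0, keepdims=True) + d.mean())


def one_to_all_dcor(data):
    """Distance correlation of each feature with all other features combined."""
    data = np.asarray(data, dtype=float)
    n_features = data.shape[1]
    result = np.empty(n_features)
    for i in range(n_features):
        A = centered_distances(data[:, i])                      # feature i alone
        B = centered_distances(np.delete(data, i, axis=1))      # the other m-1 features
        denom = np.sqrt((A * A).mean() * (B * B).mean())
        result[i] = (A * B).mean() / denom if denom > 0 else 0.0
    return result
```

With only two features the result coincides with the one-to-one value for that pair, which is a convenient sanity check.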

The distance correlation P value is calculated using the Student’s t cumulative distribution function with t value calculated as:

$$t(X,Y) = \mathrm{dCor}(X,Y) \sqrt{\frac{n-2}{1-\mathrm{dCor}(X,Y)^2}}$$

with the corresponding two-sided P value obtained from the t-distribution with n−2 degrees of freedom. The sign of the distance correlation is given by the sign of the Pearson correlation, following Pardo-Diaz et al. (2021). The final output is provided as an .xlsx file and includes distance correlation and corresponding P values. Output of one-to-one analysis also includes the Pearson and Spearman correlations and their corresponding P values for completeness.
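The P value and sign assignment can be sketched as follows, assuming SciPy is available. The function names are hypothetical, and the handling of the boundary case dCor = 1 (reported as P = 0) is an assumption of this sketch, not a documented SiDCo behaviour.

```python
import numpy as np
from scipy import stats


def dcor_p_value(dcor, n_samples):
    """Two-sided P value from t = dCor * sqrt((n - 2) / (1 - dCor^2))."""
    if n_samples <= 2:
        raise ValueError("At least three samples are needed.")
    dcor = abs(float(dcor))
    if dcor >= 1.0:
        return 0.0  # assumption: a perfect dependence is reported as P = 0
    t_value = dcor * np.sqrt((n_samples - 2) / (1.0 - dcor ** 2))
    # Survival function of Student's t with n - 2 degrees of freedom.
    return float(2.0 * stats.t.sf(t_value, df=n_samples - 2))


def signed_dcor(dcor, x, y):
    """Attach the sign of the Pearson correlation to a distance correlation."""
    pearson_r = np.corrcoef(x, y)[0, 1]
    return float(np.sign(pearson_r) * dcor)
```

The sign only summarizes the overall linear trend; for a symmetric non-linear relationship the Pearson correlation can be zero, in which case this sketch returns zero regardless of the distance correlation value.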

Partial correlation, the correlation between two features corrected for contributions of other features, is calculated as (following GGM):

$$\rho_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}$$

where the matrix $\omega = \Sigma^{-1}$ is the inverse of the distance covariance matrix $\Sigma$. The distance covariance matrix is inverted using the Moore–Penrose pseudo-inverse, which is equivalent to standard inversion for non-singular square matrices and provides a generalized inverse for singular matrices, where standard inversion is not possible. Partial distance correlation should only be calculated when the number of features is smaller than the number of samples. Here, P values are calculated using the Fisher z-transformed correlation values, $z_{ij} = 0.5 \log\frac{1+\rho_{ij}}{1-\rho_{ij}}$, and the cumulative standard normal distribution function (cdtf): $p_{i,j} = 2\left(1 - \mathrm{cdtf}\left(|z_{ij}| \sqrt{N-M-1}\right)\right)$, where N is the number of samples and M is the number of features. Results are provided as .xlsx downloads.
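The GGM step can be illustrated with the sketch below. It assumes that the matrix of pairwise distance covariances has already been computed and is passed in as `cov`; the function names are hypothetical, the partial correlation is taken as the negative off-diagonal precision entry scaled by the square roots of the corresponding diagonal entries, and the absolute value of z is used so that the P value is two-sided.

```python
import math

import numpy as np


def partial_correlations(cov):
    """Partial correlations from the (pseudo-)inverse of a covariance matrix."""
    omega = np.linalg.pinv(np.asarray(cov, dtype=float))  # Moore-Penrose pseudo-inverse
    scale = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(scale, scale)  # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(rho, 1.0)
    return rho


def fisher_z_p_value(rho, n_samples, n_features):
    """Two-sided P value: 2 * (1 - Phi(|z| * sqrt(N - M - 1)))."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("The number of samples must exceed the number of features + 1.")
    rho = min(max(float(rho), -0.999999), 0.999999)  # keep the logarithm finite
    z = 0.5 * math.log((1.0 + rho) / (1.0 - rho))
    statistic = abs(z) * math.sqrt(dof)
    # 2 * (1 - Phi(s)) equals erfc(s / sqrt(2)) for the standard normal CDF Phi.
    return math.erfc(statistic / math.sqrt(2.0))
```

For a covariance matrix with a chain structure, such as `[[1, 0.5, 0.25], [0.5, 1, 0.5], [0.25, 0.5, 1]]`, the partial correlation between the first and third features is zero, because their association is fully explained by the second feature.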

3 Conclusion

SiDCo is an open-access Web-based application for the calculation of signed and partial distance correlations between features available at https://complimet.ca/SiDCo where detailed instructions are provided.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported in part by RGPIN-2019-06796 to S.A.L.B. from the NSERC, as well as an operating grant [AI-4D-102-3] to S.A.L.B. and M.C.C. from the NRC AI4D Program.

References

Amara A, Frainay C, Jourdan F et al. Networks and graphs discovery in metabolomics data analysis and interpretation. Front Mol Biosci 2022;9:841373. https://doi.org/10.3389/fmolb.2022.841373.

Cuperlovic-Culf M, Cunningham EL, Teimoorinia H et al. Metabolomics and computational analysis of the role of monoamine oxidase activity in delirium and SARS-COV-2 infection. Sci Rep 2021;11:10629. https://doi.org/10.1038/s41598-021-90243-1.

Edelmann D, Móri TF, Székely GJ. On relationships between the Pearson and the distance correlation coefficients. Stat Probab Lett 2021;169:108960. https://doi.org/10.1016/j.spl.2020.108960.

Gábor JS, Maria LR. Brownian distance covariance. Ann Appl Stat 2009;3:1236–65. https://doi.org/10.1214/09-AOAS312F.

Lauritzen SL. Graphical Models. Oxford: Clarendon Press, 1996.

Ma'ayan A. Introduction to network analysis in systems biology. Sci Signal 2011;4:tr5. https://doi.org/10.1126/scisignal.2001965.

Oliveira AP, Dimopoulos S, Busetto AG et al. Inferring causal metabolic signals that regulate the dynamic TORC1-dependent transcriptome. Mol Syst Biol 2015;11:802. https://doi.org/10.15252/msb.20145475.

Pardo-Diaz J, Bozhilova LV, Beguerisse-Díaz M et al. Robust gene coexpression networks using signed distance correlation. Bioinformatics 2021;37:1982–9. https://doi.org/10.1093/bioinformatics/btab041.

Rosato A, Tenori L, Cascante M et al. From correlation to causation: analysis of metabolomics data using systems biology approaches. Metabolomics 2018;14:37. https://doi.org/10.1007/s11306-018-1335-y.

Székely GJ, Rizzo ML. Partial distance correlation with methods for dissimilarities. arXiv: Methodology 2013;1310.2926.

Tang ZZ, Chen G, Hong Q et al. Multi-omic analysis of the microbiome and metabolome in healthy subjects reveals microbiome-dependent relationships between diet and metabolites. Front Genet 2019;10:454. https://doi.org/10.3389/fgene.2019.00454.

Author notes

Francesco Monti and David Stewart are equal first authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Jonathan Wren