Francesco Monti, David Stewart, Anuradha Surendra, Irina Alecu, Thao Nguyen-Tran, Steffany A L Bennett, Miroslava Čuperlović-Culf, Signed Distance Correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad210, https://doi.org/10.1093/bioinformatics/btad210
Abstract
There is a need for easily accessible implementations that measure the strength of both linear and non-linear relationships between metabolites in biological systems as an approach for data-driven network development. While multiple tools implement linear Pearson and Spearman methods, there are no such tools that assess distance correlation.
We present here SIgned Distance COrrelation (SiDCo). SiDCo is a GUI platform for calculation of distance correlation in omics data, measuring linear and non-linear dependencies between variables, as well as correlation between vectors of different lengths, e.g. different sample sizes. By combining the sign of the overall trend from Pearson’s correlation with distance correlation values, we further provide a novel “signed distance correlation” of particular use in metabolomic and lipidomic analyses. Distance correlations can be selected as one-to-one or one-to-all correlations, showing relationships between each feature and all other features one at a time or in combination. Additionally, we implement “partial distance correlation,” calculated using the Gaussian Graphical model approach adapted to distance covariance. Our platform provides an easy-to-use software implementation that can be applied to the investigation of any dataset.
The SiDCo software application is freely available at https://complimet.ca/sidco. Supplementary help pages are provided at https://complimet.ca/sidco. Supplementary Material shows an example of an application of SiDCo in metabolomics.
1 Introduction
The analysis of biological networks, as a parallel investigation to the study of individual feature characteristics, requires robust quantification of the interconnections between features within biological systems (Ma'ayan 2011). Several methods for data-driven network determination of feature interconnections have been used in the analysis of metabolomic data. Pearson or Spearman correlation-based methods are arguably the most prevalent (Amara et al. 2022). While providing critical information about the direction of dependencies, both methods measure linear or monotonic correlations and cannot detect non-linear feature interactions (Rosato et al. 2018). Distance correlation, a non-parametric approach for correlation analysis, can measure various types of data relationships (linear and non-linear) as well as the correlations between vectors of different lengths (Gábor and Maria 2009; Székely and Rizzo 2013; Edelmann et al. 2021). In metabolomic and lipidomic datasets, distance correlation can accommodate the sparse coverage of feature data, the potential for non-linear relationships, and the possibly random network topologies associated with metabolism, with zero correlation obtained only for fully independent features.
Despite these advantages, few publications have used distance correlation to analyze metabolomic data (Oliveira et al. 2015; Tang et al. 2019; Cuperlovic-Culf et al. 2021). We suggest that this is, in part, due to the lack of easily accessible implementations. Moreover, no parallel implementation, to our knowledge, allows users to assess partial correlations, calculated as the measure of association between pairs of features while removing the confounding effects of other variables. To address this need and thereby provide new methods for the reconstruction of regulatory metabolomic and lipidomic networks, we present here SIgned Distance COrrelation (SiDCo), a web-based application of both signed distance correlation and partial distance correlation, the latter implemented using the Gaussian Graphical Model (GGM) previously applied to other correlation approaches (Lauritzen 1996).
2 Implementation
SiDCo is implemented in Python with an RShiny front-end. It is compatible with all web browsers. Two analytical tabs allow users to perform either distance correlation or partial distance correlation. In both applications, users define their desired threshold values and P values. Data are automatically z-score normalized across all samples prior to analysis. Users are reminded that missing values must be imputed according to their specifications; otherwise, the data will be returned with a descriptive error message.
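The preprocessing step described above can be sketched in a few lines of Python. This is an illustration only, not the SiDCo source: the function name `zscore_columns` is hypothetical, each feature (column) is assumed to be scaled across samples (rows), and the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator is used.

```python
import math
import statistics


def zscore_columns(table):
    """Z-score each feature (column) across samples (rows).

    Raises ValueError on missing values (None or NaN), mirroring the
    requirement that data be imputed before analysis.
    """
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is None or math.isnan(value):
                raise ValueError(
                    f"missing value at sample {i}, feature {j}: impute before analysis"
                )
    scaled_columns = []
    for column in zip(*table):
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)  # assumption: sample SD (n - 1)
        scaled_columns.append([(v - mean) / sd for v in column])
    # Transpose back to samples-by-features layout.
    return [list(row) for row in zip(*scaled_columns)]


normalized = zscore_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

A feature with zero variance would cause a division by zero in this sketch; a production implementation would need to handle that case explicitly.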
Distance correlations and P values are calculated and presented as described below and a correlation directionality sign is derived from Pearson correlation analysis as an indication of the overall linear trend in the data. Distance correlation calculations in SiDCo are provided in three forms: (i) “one-to-one,” calculating correlations between each pair of features, (ii) “one-to-all,” providing correlations for each feature with all other features combined, and (iii) partial correlation calculated for each pair of features while controlling for the contributions of other features, i.e. covariates.
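Distance correlation (dCor) between features X and Y, and distance covariance (dCov), are calculated as:

$$\mathrm{dCor}(X,Y) = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}, \qquad \mathrm{dCov}^2(X,Y) = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} A_{j,k}B_{j,k},$$

where A and B represent doubly centered distance matrices for variables X and Y, respectively, measured in n samples. Distance variances (dVar) are:

$$\mathrm{dVar}^2(X) = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} A_{j,k}^2 \quad\text{and}\quad \mathrm{dVar}^2(Y) = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} B_{j,k}^2.$$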
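In a one-to-one correlation calculation, an array of values for each feature is compared with an array of values for all the other features one at a time. In this case, a doubly centered distance matrix is calculated as:

$$A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot} \quad\text{and}\quad B_{j,k} = b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot},$$

where Euclidean distance is used to calculate the distance from $X_j$ to $X_k$ or from $Y_j$ to $Y_k$: $a_{j,k} = \lVert X_j - X_k \rVert$ and $b_{j,k} = \lVert Y_j - Y_k \rVert$. $\bar{a}_{j\cdot}$, $\bar{b}_{j\cdot}$ and $\bar{a}_{\cdot k}$, $\bar{b}_{\cdot k}$ are the j-row and k-column mean values, and $\bar{a}_{\cdot\cdot}$, $\bar{b}_{\cdot\cdot}$ are the overall means of the distance matrices for X and Y.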
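As a minimal sketch of the one-to-one calculation, the following pure-Python functions build doubly centered distance matrices and combine them into a distance correlation. The function names are illustrative and are not taken from the SiDCo code base, and the sample form of dCov with a 1/n² factor is assumed.

```python
import math


def centered_distances(values):
    """Doubly centered matrix of pairwise distances |v_j - v_k| for one feature."""
    n = len(values)
    dist = [[abs(vj - vk) for vk in values] for vj in values]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # dist is symmetric, so column means equal row means.
    return [[dist[j][k] - row_mean[j] - row_mean[k] + grand_mean
             for k in range(n)] for j in range(n)]


def dcov2(a, b):
    """Squared sample distance covariance from two centered matrices."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2


def distance_correlation(x, y):
    """Distance correlation between two features measured in the same samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(dcov2(a, a) * dcov2(b, b))  # dVar(X) * dVar(Y)
    if denom == 0.0:
        return 0.0  # a constant feature carries no dependence information
    return math.sqrt(max(dcov2(a, b), 0.0) / denom)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = distance_correlation(x, [2 * v + 1 for v in x])  # close to 1
quadratic = distance_correlation(x, [v * v for v in x])   # between 0 and 1
```

The quadratic example has a Pearson correlation of zero, yet its distance correlation is clearly positive, which is the kind of non-linear dependence the method is meant to detect.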
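In a one-to-all case, the distance covariance for each feature out of m features in n-dimensional sample space is compared to that of all the other features in n × (m-1)-dimensional space. The doubly centered distance matrix for variable Y used in the calculation of dCov is here:

$$B_{j,k} = b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}, \qquad b_{j,k} = \lVert Y_j - Y_k \rVert = \sqrt{\sum_{l=1}^{m-1}\left(Y_{j,l} - Y_{k,l}\right)^2},$$

where $Y_j$ is the vector of the m-1 other feature values in sample j, and the equivalent expression holds for X.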
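The one-to-all case can be sketched by treating every sample as a point whose coordinates are the remaining m-1 features and using the Euclidean distance between such points. The helper names below are hypothetical, and the layout of `table` (rows are samples, columns are features) is an assumption made for illustration.

```python
import math


def centered_distances(points):
    """Doubly centered Euclidean distance matrix for a list of sample vectors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    return [[dist[j][k] - row_mean[j] - row_mean[k] + grand_mean
             for k in range(n)] for j in range(n)]


def dcov2(a, b):
    """Squared sample distance covariance from two centered matrices."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2


def one_to_all_dcor(table, feature):
    """Distance correlation of one feature against all other features combined."""
    x = [(row[feature],) for row in table]
    y = [tuple(v for col, v in enumerate(row) if col != feature) for row in table]
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(dcov2(a, a) * dcov2(b, b))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov2(a, b), 0.0) / denom)


# Three features that are all linear functions of the same underlying value.
table = [[v, 2 * v, -v] for v in (1.0, 2.0, 3.0, 4.0, 5.0)]
score = one_to_all_dcor(table, 0)  # close to 1
```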
The distance correlation P value is calculated using the Student’s t cumulative distribution function with t value calculated as

$$t = \mathrm{dCor}\,\sqrt{\frac{n-2}{1-\mathrm{dCor}^2}}$$

and corresponding two-sided P value for the t-distribution with n-2 degrees of freedom. The sign of the distance correlation is given by the sign of the Pearson correlation, following Pardo-Diaz et al. (2021). The final output is provided as an .xlsx file and includes distance correlation and corresponding P values. Output of one-to-one analysis also includes the Pearson and Spearman correlations and their corresponding P values for completeness.
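The significance test and the sign assignment might look as follows. This sketch assumes the Pearson-style statistic t = dCor · sqrt((n - 2) / (1 - dCor²)); the function names are hypothetical. To stay within the standard library, the t-distribution tail is obtained here by Simpson integration of the density, which is a stand-in for a library routine such as `scipy.stats.t.sf` and is not claimed to be what SiDCo uses.

```python
import math


def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value for Student's t via Simpson integration of the density."""
    if math.isinf(t):
        return 0.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def signed_dcor_with_p(dcor, pearson_r, n):
    """Attach the Pearson sign to dCor and compute its two-sided P value."""
    if n < 3:
        raise ValueError("at least three samples are needed for n - 2 degrees of freedom")
    if dcor >= 1.0:
        t = math.inf
    else:
        t = dcor * math.sqrt((n - 2) / (1.0 - dcor ** 2))
    # Assumption: a Pearson correlation of exactly zero is reported as positive.
    sign = -1.0 if pearson_r < 0 else 1.0
    return sign * dcor, t_two_sided_p(t, n - 2)


signed, p_value = signed_dcor_with_p(dcor=0.6, pearson_r=-0.3, n=12)
```

The fixed-step integration is adequate for moderate t values; for very large statistics a dedicated tail function is the safer choice.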
Partial correlation, the correlation between two features corrected for contributions of other features, is calculated as (following GGM):

$$\mathrm{pdCor}_{i,j} = -\frac{\Omega_{i,j}}{\sqrt{\Omega_{i,i}\,\Omega_{j,j}}},$$

where matrix $\Omega$ is the inverse of $\Sigma$, the distance covariance matrix. The inverse of the distance covariance matrix is computed with the Moore–Penrose pseudo-inverse, which coincides with the standard inverse for non-singular square matrices and provides a generalized inverse for singular matrices, for which a standard inverse does not exist. Partial distance correlation calculation should only be performed when the number of features is smaller than the number of samples. Here P values are calculated using the Fisher z-transformed correlation values, $z = \frac{1}{2}\ln\frac{1+\mathrm{pdCor}}{1-\mathrm{pdCor}}$, and the cumulative standard normal distribution function (cdf). Results are provided as .xlsx downloads.
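The GGM-style calculation can be sketched with NumPy as below. Several details are assumptions made for illustration and are not confirmed by the text: the matrix entries are taken to be squared sample distance covariances, the function names are hypothetical, and the Fisher z statistic is scaled by sqrt(n - m - 1), the conventional choice when conditioning on the remaining m - 2 features.

```python
import math

import numpy as np


def centered_distance_matrix(x):
    """Doubly centered matrix of pairwise distances for one feature vector."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def partial_distance_correlation(data):
    """data: (n samples, m features) array -> (m, m) partial correlations, P values."""
    n, m = data.shape
    if m >= n:
        raise ValueError("number of features must be smaller than number of samples")
    flat = np.stack([centered_distance_matrix(data[:, j]).ravel() for j in range(m)])
    dcov = flat @ flat.T / n ** 2        # assumption: squared distance covariances
    omega = np.linalg.pinv(dcov)         # Moore-Penrose pseudo-inverse
    scale = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    # Fisher z-transform; clip to keep arctanh finite.
    z = np.arctanh(np.clip(pcor, -0.999999, 0.999999))
    stat = math.sqrt(n - m - 1) * np.abs(z)  # assumption: conditioning on m - 2 features
    # Two-sided normal P value: 2 * (1 - Phi(s)) = erfc(s / sqrt(2)).
    pvals = np.vectorize(math.erfc)(stat / math.sqrt(2.0))
    np.fill_diagonal(pvals, np.nan)      # diagonal entries are not tests
    return pcor, pvals


rng = np.random.default_rng(0)
pcor, pvals = partial_distance_correlation(rng.normal(size=(30, 4)))
```

Because the matrix built this way is a Gram matrix, it is positive semi-definite, so the resulting partial correlations stay within [-1, 1] up to rounding.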
3 Conclusion
SiDCo is an open-access Web-based application for the calculation of signed and partial distance correlations between features available at https://complimet.ca/SiDCo where detailed instructions are provided.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was supported in part by RGPIN-2019-06796 to S.A.L.B. from the NSERC, as well as an operating grant [AI-4D-102-3] to S.A.L.B. and M.C.C. from the NRC AI4D Program.
Author notes
Francesco Monti and David Stewart are equal first authors.