distance correlation

Abstract

Motivation: Even within well-studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes/proteins, using a network of gene coexpression data that includes functional annotations. However, the lack of trustworthy functional annotations can impede the validation of such networks. Hence, there is a need for a principled method to construct gene coexpression networks that capture biological information and are structurally stable even in the absence of functional information.

Results: We introduce the concept of signed distance correlation as a measure of dependency between two variables, and apply it to generate gene coexpression networks. Distance correlation offers a more intuitive approach to network construction than commonly used methods, such as Pearson correlation and mutual information. We propose a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information. We analyse data from three different organisms to illustrate how networks generated with our method are more stable and capture more biological information compared to networks obtained from Pearson correlation or mutual information.

Availability and implementation: Code is available online (https://github.com/javier-pardodiaz/sdcorGCN).

Supplementary information: Supplementary data are available at Bioinformatics online.

2 Spearman correlation networks for R. leguminosarum

We construct a Spearman correlation matrix R from the pre-processed gene expression matrix M* by computing the Spearman correlation between every pair of gene expression vectors. We threshold R to construct two networks N_R(d_S) and N_R(d_P) with edge density d_S and d_P, respectively. Like the networks obtained using Pearson correlation, N_P(d_S) and N_P(d_P), the networks based on Spearman correlation have a larger and less densely connected largest connected component than N_S(d_S) and N_S(d_P). The summaries of these two networks can be found in Table S2. We evaluate the Spearman networks using STRING as described in the paper. The scores obtained are higher than those obtained by the Pearson correlation networks but lower than those obtained using signed distance correlation. Table S3 and Figure S1 show the results of the STRING evaluation for the Spearman networks.

3 Gene coexpression network analysis for the study of Yeast RNA-Seq data

We use our signed correlation pipeline to generate a gene coexpression network from a dataset obtained using RNA-Seq of yeast (Saccharomyces cerevisiae) expressing pathways designed to increase ATP or GTP consumption. We obtain all the raw counts for experiment E-MTAB-5174 in Expression Atlas [5] and remove the genes with zero expression variance. The final dataset, which we feed into our pipeline, includes the expression of 6,930 genes across 209 samples. Following the pipeline described in the methodology section, we pre-process the data and obtain the distance correlation matrix D, the Pearson correlation matrix P, the signed correlation matrix S, and the matrix |P| of absolute values of the Pearson correlation. Table S4 and Figure S2 show the summaries and distributions of these matrices. We estimate the optimal threshold values θ* and θ to construct the unweighted gene coexpression networks A_S(θ*) and A_P(θ) using COGENT [2]. We follow our pipeline to calculate the self-consistency of the networks. We adjust the similarity score by subtracting the network density. Figure S3 shows the variation of the score function for the correlation matrices across different edge densities. For both signed distance correlation and Pearson correlation, there is an edge density value for which the score function reaches its maximum. This value is d_S = 0.0131 for the signed distance correlation (score of 0.746, achieved for θ* = 0.84) and d_P = 0.0146 for the Pearson correlation (score of 0.733, achieved for θ = 0.81). The results show that signed distance correlation yields more stable networks. However, the difference in performance between the two correlations is smaller than that observed in the study of R. leguminosarum.

We analyse the networks N_S(d_S) (edge density d_S) and N_S(d_P) (edge density d_P) retrieved from S, the networks N_P(d_S) (edge density d_S) and N_P(d_P) (edge density d_P) retrieved from P, and the networks N_R(d_S) (edge density d_S) and N_R(d_P) (edge density d_P) obtained using Spearman correlation. The summaries of the six networks are detailed in Table S5.
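To make the construction of the matrices and networks above concrete, the following is a minimal sketch rather than the pipeline's actual code. It assumes that the signed correlation is the distance correlation carrying the sign of the Pearson correlation (S = sign(P) · D), and that a network with a given edge density keeps the largest entries of the thresholded matrix. Both points, and all function names, are assumptions made for illustration.

```python
import numpy as np


def _double_centred_distances(x):
    """Pairwise absolute differences of a 1-D vector, double-centred."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form) of two 1-D vectors."""
    a = _double_centred_distances(np.asarray(x, dtype=float))
    b = _double_centred_distances(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))


def signed_distance_correlation_matrix(expr):
    """expr: genes x samples array. Returns (D, P, S).

    Assumption of this sketch: S = sign(P) * D.
    """
    n_genes = expr.shape[0]
    D = np.eye(n_genes)
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            D[i, j] = D[j, i] = distance_correlation(expr[i], expr[j])
    P = np.corrcoef(expr)
    S = np.sign(P) * D
    return D, P, S


def threshold_to_density(M, density):
    """Unweighted adjacency matrix keeping the largest off-diagonal entries
    of M so that the requested edge density is reached (ties broken arbitrarily)."""
    n = M.shape[0]
    iu = np.triu_indices(n, k=1)
    n_edges = int(round(density * len(iu[0])))
    keep = np.argsort(M[iu])[::-1][:n_edges]
    A = np.zeros((n, n), dtype=int)
    A[iu[0][keep], iu[1][keep]] = 1
    return A + A.T
```

The double loop is quadratic in the number of genes and is meant only to show the definition; for thousands of genes a vectorised or compiled implementation would be needed.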

We evaluate the six networks using STRING [9]. We use three different sets of confidence scores: those obtained using all the evidence in STRING (C), those obtained using only coexpression information (C†), and those obtained using all the evidence except the coexpression information (C‡). Table S6 and Figure S4 present the results. In all the studied cases, the signed distance correlation networks obtain the highest scores. Unlike with the R. leguminosarum dataset, the Pearson correlation networks perform better than the Spearman correlation networks. In fact, the results for the signed distance correlation and Pearson correlation are almost identical (as in the case of the self-consistency study). These results suggest that a high similarity in the self-consistency of the networks may imply a similarity in the amount of biological information that the networks are able to capture. We generate 60 random networks, of which 30 have edge density d_S and the other 30 have edge density d_P. We evaluate these networks with the STRING information and find that the six networks in Table S5 outperform all of them. The largest difference from random is obtained for the signed distance correlation networks when using only coexpression information (C†) to evaluate the networks; for N_S(d_S) the score is 17.69 times higher than the mean score obtained by random networks with matching densities.
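The evaluation against random networks can be sketched as follows. This is an illustrative outline, not the code used for the reported results: it assumes the confidence scores are stored in a symmetric gene-by-gene matrix with zeros for pairs absent from STRING, and that random networks are drawn uniformly among networks with the same number of edges. The function names are hypothetical.

```python
import numpy as np


def string_score(adjacency, confidence):
    """Sum of confidence scores over the edges of an unweighted, undirected network."""
    iu = np.triu_indices(adjacency.shape[0], k=1)
    return float(np.sum(adjacency[iu] * confidence[iu]))


def random_network(n_nodes, n_edges, rng):
    """Random undirected network with exactly n_edges edges (assumed null model)."""
    iu = np.triu_indices(n_nodes, k=1)
    chosen = rng.choice(len(iu[0]), size=n_edges, replace=False)
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    A[iu[0][chosen], iu[1][chosen]] = 1
    return A + A.T


def compare_to_random(adjacency, confidence, n_random=30, seed=0):
    """Return the observed score and the mean and standard deviation of the
    scores of n_random random networks with matching edge density."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    n_edges = int(adjacency[np.triu_indices(n, k=1)].sum())
    random_scores = [
        string_score(random_network(n, n_edges, rng), confidence)
        for _ in range(n_random)
    ]
    observed = string_score(adjacency, confidence)
    return observed, float(np.mean(random_scores)), float(np.std(random_scores))
```

Dividing the observed score by the mean random score gives the kind of ratio quoted in the text.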

We conclude that the yeast gene coexpression networks obtained using signed distance correlation are more stable and recover more biological information than those based on Pearson correlation or Spearman correlation.

4 Gene coexpression network analysis for the study of human liver single-cell RNA-Seq data

We use our signed correlation pipeline to generate a gene coexpression network from a dataset obtained using single-cell RNA-Seq of human liver cells [4]. The original dataset measures the expression of 15,353 genes in 1,622 cells. This dataset is considerably different from the previous two, both in the data itself (the measurements correspond to different cells instead of different samples) and in the organism (while both R. leguminosarum and S. cerevisiae are unicellular organisms, humans are not). Hence, a considerable proportion of genes are not expressed in the studied cells (for example, genes specific to neurons will not be expressed in liver cells). For this reason, we employ a different pre-processing strategy in this case. First, we quantile-normalise the data [1] to make the measurements in the different cells comparable. Afterwards, as in [7], we identify the "non-changing genes". These are the genes for which the difference between their highest and lowest expression values (the "expression difference") is lower than the median of the expression differences of all genes, and for which, in addition, the mean expression signal across samples is lower than the median of the mean expression signals of all genes. After removing the "non-changing genes" we obtain an already quantile-normalised dataset with information for 8,585 genes. We do not apply further pre-processing steps to this dataset (we do not quantile-normalise the data again, and we do not set the lowest expressed genes from each sample to the lowest expression value).
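The two pre-processing steps described above can be sketched as follows, assuming a genes-by-cells expression array. The function names are hypothetical, and the quantile normalisation shown here breaks ties arbitrarily instead of averaging them, which may differ from the implementation in [1].

```python
import numpy as np


def quantile_normalise(expr):
    """expr: genes x cells array. Gives every cell (column) the same
    empirical distribution: the mean of the sorted columns."""
    order = np.argsort(expr, axis=0)
    ranks = np.argsort(order, axis=0)               # rank of each value within its column
    reference = np.sort(expr, axis=0).mean(axis=1)  # mean value at each rank across cells
    return reference[ranks]


def non_changing_mask(expr):
    """True for genes whose expression difference (max - min) and mean
    expression are both below the respective medians over all genes."""
    diff = expr.max(axis=1) - expr.min(axis=1)
    mean = expr.mean(axis=1)
    return (diff < np.median(diff)) & (mean < np.median(mean))


def preprocess(expr):
    """Quantile-normalise, then drop the non-changing genes."""
    normalised = quantile_normalise(np.asarray(expr, dtype=float))
    return normalised[~non_changing_mask(normalised)]
```

Requiring both conditions keeps genes that are highly expressed but stable, as well as lowly expressed genes that still vary across cells.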
We calculate the distance correlation matrix (D), the Pearson correlation matrix (P), the signed correlation matrix (S), and the matrix |P| of absolute values of the Pearson correlation. Table S7 and Figure S5 show the summaries and distributions of these matrices. We estimate the optimal threshold values θ* and θ to construct unweighted gene coexpression networks A_S(θ*) and A_P(θ) using COGENT [2]. We follow a pipeline similar to the one employed for R. leguminosarum to calculate and adjust the self-consistency of the networks. The only difference is that, as the input expression matrix is already quantile-normalised and low expressed genes have already been filtered out, the pre-processing steps are omitted in each of the 25 iterations. Figure S6 shows the variation of the score function for the correlation matrices across different edge densities. In both cases, there is an edge density value for which the score function reaches its maximum. This value is d_S = 0.00009 for the signed distance correlation (score of 0.896, achieved for θ* = 0.44) and d_P = 0.000089 for the Pearson correlation (score of 0.781, achieved for θ = 0.43). The results show that, for all tested edge densities, the use of signed distance correlation offers more stable networks.

We analyse the networks N_S(d_S) (edge density d_S) and N_S(d_P) (edge density d_P) retrieved from S, and the networks N_P(d_S) (edge density d_S) and N_P(d_P) (edge density d_P) retrieved from P. We also construct two networks from a correlation matrix obtained using Spearman correlation: N_R(d_S) (edge density d_S) and N_R(d_P) (edge density d_P). The summaries of the six networks are detailed in Table S8. We note that the largest connected components of the networks contain a lower proportion of the total vertices than in the previous datasets. The networks obtained using signed distance correlation have a smaller and denser largest connected component than those obtained using Pearson correlation. In contrast to the two previous datasets, the largest connected component of the networks obtained using Spearman correlation is smaller and denser than those of the signed distance correlation and Pearson correlation networks; however, the signed distance correlation networks still have the highest global clustering coefficient.
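The density-adjusted self-consistency score used to choose the thresholds can be illustrated with a rough sketch. It is not COGENT's implementation: it assumes that each of the 25 iterations splits the cells at random into two halves, builds one network per half at threshold θ, and measures their similarity as the Jaccard index of the two edge sets before subtracting the mean edge density. The splitting scheme, the similarity measure, and the function names are all assumptions made for illustration.

```python
import numpy as np


def edge_indicator(corr, theta):
    """Boolean vector over gene pairs (upper triangle): True where corr >= theta."""
    iu = np.triu_indices(corr.shape[0], k=1)
    return corr[iu] >= theta


def density_adjusted_consistency(expr, theta, corr_fn=np.corrcoef, n_iter=25, seed=0):
    """expr: genes x samples. corr_fn maps such an array to a genes x genes
    correlation matrix (Pearson by default; any other correlation can be passed)."""
    rng = np.random.default_rng(seed)
    n_samples = expr.shape[1]
    half = n_samples // 2
    scores = []
    for _ in range(n_iter):
        perm = rng.permutation(n_samples)
        e1 = edge_indicator(corr_fn(expr[:, perm[:half]]), theta)
        e2 = edge_indicator(corr_fn(expr[:, perm[half:]]), theta)
        union = np.logical_or(e1, e2).sum()
        jaccard = np.logical_and(e1, e2).sum() / union if union else 0.0
        density = 0.5 * (e1.mean() + e2.mean())
        scores.append(jaccard - density)  # subtracting the density penalises dense networks
    return float(np.mean(scores))
```

Evaluating this function over a grid of θ values and keeping the maximiser mirrors the way the optimal thresholds and their edge densities are reported in the text.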
As for the two previous datasets, we evaluate the constructed networks using three sets of

We conclude that the liver gene coexpression networks obtained using signed distance correlation are more stable and recover more biological information than those based on Pearson correlation, and that our signed distance correlation method is suitable for studying single-cell gene expression data.

5 Network evaluation using information from STRING

We use the interactions between proteins reported in STRING to evaluate the amount of biological information that the constructed networks capture. The association evidence in STRING is categorized into independent channels, weighted, and integrated, resulting in a confidence score for all recorded protein interactions. For each organism, we use three different sets of interactions and confidence scores: the overall confidence score C provided by STRING, the score C† obtained using only coexpression information, and the score C‡ obtained excluding all coexpression information. Each of the scores was recomputed using the Python script available on the STRING webpage (Figure 8), accessed on the 20th of April 2020 at https://string-db.org/cgi/help.pl?&subpage=faq%23how-are-the-scores-computed. The script can be found at the end of this section. We commented out lines 96-98 and 100-102 in the script to compute the C† values, and line 99 to obtain the C‡ values.
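To illustrate what excluding channels means for the three scores, the following simplified sketch combines per-channel scores after removing a prior and adds the prior back once at the end. It is not the STRING script and does not reproduce its line numbering. It omits the homology correction and the transferred evidence, and the prior value of 0.041 and the channel names are assumptions that should be checked against the STRING documentation.

```python
PRIOR = 0.041  # assumed prior probability of an interaction; treat as a parameter


def remove_prior(score, prior=PRIOR):
    """Rescale a channel score in [0, 1] so that the prior maps to 0."""
    score = max(score, prior)
    return (score - prior) / (1.0 - prior)


def combine_channels(channel_scores, include=None, prior=PRIOR):
    """channel_scores: dict mapping channel name -> score in [0, 1].
    include: channels to use (all channels if None)."""
    names = list(channel_scores) if include is None else list(include)
    one_minus = 1.0
    for name in names:
        one_minus *= 1.0 - remove_prior(channel_scores[name], prior)
    # Add the prior back exactly once after combining the channels.
    return (1.0 - one_minus) * (1.0 - prior) + prior


# Hypothetical channel scores for one protein pair.
scores = {"coexpression": 0.6, "experiments": 0.3, "textmining": 0.2}
c_all = combine_channels(scores)                                 # analogue of C
c_coexp = combine_channels(scores, include=["coexpression"])     # analogue of C†
c_rest = combine_channels(
    scores, include=[k for k in scores if k != "coexpression"]
)                                                                # analogue of C‡
```

Restricting `include` plays the same role as commenting out the corresponding factors in the product: a single channel scoring above the prior is returned unchanged, and every additional channel can only increase the combined score.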

Figure 1: Scores obtained for the R. leguminosarum gene coexpression networks using STRING. All panels show the score for the different networks on the y-axis and the network density on the x-axis. The scores are the sum of the STRING confidence scores associated with the edges in the networks, computed using all evidence (C), only coexpression evidence (C†), and all evidence except coexpression (C‡). The black box plots correspond to the scores obtained by 30 random networks. Blue circles, red triangles and yellow squares represent signed distance correlation, Pearson correlation and Spearman correlation, respectively.

Figure 2: Density plots of the distribution of the values of the correlation matrices from the Yeast dataset. Panel A shows the distribution of values in the distance correlation matrix (blue) and of the absolute value of the values in the Pearson correlation matrix (red). Panel B shows the distribution of values in the signed distance correlation matrix (blue) and of the values in the Pearson correlation matrix (red).

Figure 3: Score function values for different edge densities using the Yeast dataset. The blue line with circles shows the scores obtained using signed distance correlation and the red line with triangles those obtained using Pearson correlation. The dotted lines indicate the position of the highest score point for each line. This value is 0.746 for the signed distance correlation network (giving edge density 0.0131, which is achieved with θ* = 0.84) and 0.733 for the Pearson correlation network (giving edge density 0.0146, which is achieved with θ = 0.81).

Figure 4: Scores obtained for the yeast gene coexpression networks using STRING. All panels show the score for the different networks on the y-axis and the network density on the x-axis. The scores are the sum of the STRING confidence scores associated with the edges in the networks. Each plot corresponds to a different set of values: using all evidence (C), only coexpression evidence (C†), and excluding coexpression evidence (C‡). The black box plots correspond to the scores obtained by 30 random networks. Blue circles, red triangles and yellow squares represent signed distance correlation, Pearson correlation and Spearman correlation, respectively.

Figure 6: Plot of the score function values for different edge densities using the human liver single-cell RNA-Seq dataset. The blue line shows the scores obtained using signed distance correlation and the red line those obtained using Pearson correlation. The dotted lines indicate the position of the highest score point for each line. This value is 0.896 for the signed distance correlation network (giving edge density 9.01e-05, which is achieved with θ* = 0.44) and 0.781 for the Pearson correlation network (giving edge density 8.90e-05, which is achieved with θ = 0.43).

Figure 7: Scores obtained for the single-cell human liver gene coexpression networks using STRING. All panels show the score for the different networks on the y-axis and the network density on the x-axis. The scores are the sum of the STRING confidence scores associated with the edges in the networks. Each plot corresponds to a different set of values: using all evidence (C), only coexpression evidence (C†), and excluding coexpression evidence (C‡). The black box plots correspond to the scores obtained by 30 random networks. Blue circles, red triangles and yellow squares represent signed distance correlation, Pearson correlation and Spearman correlation, respectively.

Figure 8: STRING FAQ webpage, accessed on the 20th of April 2020. The content describes how STRING combines the scores of the different channels. The end of the section links to the script that was modified and used in this work.

Table 2: Summaries of Spearman networks for R. leguminosarum. LCC denotes largest connected component.

Table 3: Evaluation of the biological content of the Spearman networks with STRING.

Table 4: Summaries for the correlation matrices of the Yeast dataset.

Table 5: Summaries of yeast networks. LCC denotes largest connected component.

Table 6: Evaluation of the biological content of the networks with STRING. RE indicates the expected (mean) result based on random networks with the indicated edge density and its standard deviation.

Table 7: Summaries for the correlation matrices of the single-cell dataset from human liver.

Table 9: Evaluation of the biological content of the networks with STRING. RE indicates the expected (mean) result based on random networks with the indicated edge density and its standard deviation.