-
PDF
- Split View
-
Views
-
Cite
Cite
Thi Thanh Yen Nguyen, Warith Harchaoui, Lucile Mégret, Cloé Mendoza, Olivier Bouaziz, Christian Neri, Antoine Chambaz, Optimal transport-based machine learning to match specific patterns: application to the detection of molecular regulation patterns in omics data, Journal of the Royal Statistical Society Series C: Applied Statistics, Volume 73, Issue 3, June 2024, Pages 639–657, https://doi.org/10.1093/jrsssc/qlae005
- Share Icon Share
Abstract
We present several algorithms designed to learn a pattern of correspondence between 2 data sets in situations where it is desirable to match elements that exhibit a relationship belonging to a known parametric model. In the motivating case study, the challenge is to better understand micro-RNA regulation in the striatum of Huntington’s disease model mice. The algorithms unfold in 2 stages. First, an optimal transport plan P and an optimal affine transformation are learned, using the Sinkhorn–Knopp algorithm and a mini-batch gradient descent. Second, P is exploited to derive either several co-clusters or several sets of matched elements. A simulation study illustrates how the algorithms work and perform. The real data application further illustrates their applicability and interest.
1 Introduction
The analysis of numerous omics data is a challenging task in biological research (Benayoun et al., 2019) and disease research (Langfelder et al., 2016; Maniatis et al., 2019). In disease research, omics data are increasingly available for the analysis of molecular pathology. This is notably illustrated by research on Huntington’s disease (HD): messenger-RNA (mRNA), micro-RNA (miRNA), and protein data collectively quantifying several layers of molecular regulation in the brain of HD model knock-in mice (Langfelder et al., 2016, 2018) now compose one of the largest data set available to date to understand how neurodegenerative processes may work on a system level. The data set is publicly available through the database repository Gene Expression Omnibus and the HDinHD portal.1
Encouraged by the promising findings of Mégret et al. (2020), our ultimate goal is to shed light on the interaction between mRNAs and miRNAs based on data collected in the striatum (a brain region) of HD model knock-in mice (Langfelder et al., 2016, 2018). Each data point takes the form of multi-dimensional profile. The strong biological hypothesis is that if a miRNA induces the degradation of a target mRNA or blocks its translation into proteins, or both, then the profile of the former, say y, should be similar to minus the profile of the latter, say . We relax the hypothesis and consider that y is similar to where θ is an affine transformation in a parametric class Θ that includes minus the identity and whose definition translates expert knowledge about the experiment that yields the data. Our study straightforwardly extends to the case that the relationship is known to belong to any parametric model. In order to identify groups of mRNAs and miRNAs that interact, we develop a co-clustering algorithm and a matching algorithm based on optimal transport (Peyré & Cuturi, 2019), spectral and block co-clustering, and a matching procedure tailored to our needs.
Spectral co-clustering (SCC; Dhillon, 2001) and block clustering (BC; Brault et al., 2014; Govaert & Nadif, 2010) are two ways among many others to carry out co-clustering, an unsupervised learning task to cluster simultaneously the rows and columns of a matrix in order to obtain homogeneous blocks. There are many efficient approaches to solving the problem, often characterized as model-based or metric-based methods (Pontes et al., 2015).
In an enlightening article, Nazarov and Kreis (2021) review a variety of computational approaches to study how miRNAs ‘come together to regulate the expression of a gene or a group of genes’. They identify three different families of methods: data-driven methods based on similarities, data-driven methods based on matrix factorization, and hybrid methods. Our algorithms belong to the first family. In view of Nazarov and Kreis (2021, Section 2.5 and Figure 2), we do not rely on the standard similarity measures (Pearson’s and Spearman’s correlation coefficients, cosine similarity, and mutual information) to define our similarity matrix but, instead, use optimal transport to derive it. Moreover, as in canonical correlation analysis, we do not compare the raw mRNA and miRNA profiles but, instead, we compare a data-driven transformation and y, where θ is an affine transformation of x. Finally, as explained by Nazarov and Kreis (2021), our algorithms cannot discriminate between true interactions and fake interactions originating from common hidden regulators such as transcription factors. It is necessary to conduct a further biological analysis to identify the relevant findings.
The rest of the article is organized as follows. Section 2 describes the data we use. Section 3 presents a modicum of optimal transport theory. Section 4 introduces our algorithms. Section 5 evaluates the performances of the algorithms in various simulation settings. Section 6 illustrates the real data application. Section 7 closes the study on a discussion.
2 Data
2.1 Presentation
The data analysed herein cover RNA-seq data obtained in the striatum of the allelic series of HD knock-in mice (poly Q lengths: Q20, Q80, Q92, Q111, Q140, and Q175) at 2-month, 6-month, and 10-month of age. For each combination of poly Q length and age, eight mice were sacrificed (four females and four males). After preprocessing (Mégret et al., 2020, Methods section), the final data set consists of mRNA profiles, , and of miRNA profiles, with .
Informally, we look for couples such that the nth miRNA induces the degradation of the mth mRNA or blocks its translation into proteins, or both. We are guided by the strong biological hypothesis that if that is the case, then the profile of the former is similar to minus the profile of the latter—then and exhibit what we call a mirroring relationship. Of note, it is expected that a single miRNA can target several mRNAs.
The actual mirroring relationships can be more or less acute, for instance, because of threshold effects, or of multiple miRNAs targeting the same mRNA, or of a single miRNA targeting several mRNAs. Therefore, instead of rigidly using comparisons between and , our algorithms will learn from the data a relevant transformation (in a parametric class Θ of transformations that includes minus the identity) and use comparisons between and .
Figure 1 exhibits two profiles and that showcase a mirrored similarity. The corresponding miRNA and mRNA, Mir20b (which may inhibit cerebral ischemia-induced inflammation in rats; Zhao et al., 2019), and the aryl-hydrocarbon receptor repressor (Ahrr) are believed to interact in the striatum of HD model knock-in mice (Mégret et al., 2020).

(a): profile of a mRNA (Ahrr). (b): profile of a miRNA (Mir20b). It is believed that Mir20b targets Ahrr. mRNA = messenger-RNA; miRNA = micro-RNA.
2.2 A brief data analysis
So as to give a sense of the distribution of the data, we propose two kinds of visual summaries. The first one uses Lloyd’s k-means algorithm (Lloyd, 1982) to build synthetic profiles representing the real profiles on the one hand and on the other hand. The second one uses kernel density estimators of the jth component of on the one hand and of on the other hand, for each . The visual summaries are presented in Section A of the online supplementary material.
3 Elements of optimal transport
Let be the -dimensional simplex and , where is the vector with all its entries equal to 1. For any , define
and let and be the ω-weighted empirical measure attached to X and the empirical measure attached to Y, respectively. An element P of represents a joint law on with marginals and .
The celebrated Monge–Kantorovich problem (Peyré & Cuturi, 2019, Chapter 2) consists in finding a joint law over with marginals and that minimizes the expected cost of transport with respect to some cost function . We focus on c given by (the squared Euclidean norm in ). Specifically, denoting , the cost matrix given by for each , the problem consists in solving where is the P-specific expected cost of transport from X to Y.
It is well known that it is very rewarding from a computational viewpoint to consider a regularized version of the above problem (Peyré & Cuturi, 2019, Chapter 4). The penalty term is proportional to the discretized entropy of P, that is, to . The regularized problem (presented here for any beyond the case ) consists, for some user-supplied , in finding that solves
One of the advantages of entropic regularization is that one can solve (1) efficiently using the Sinkhorn–Knopp matrix scaling algorithm.
Finally, following Genevay et al. (2018), we use to define the so-called Sinkhorn loss between (any ) and as
This loss interpolates between and the maximum mean discrepancy of relative to (Genevay et al., 2018, Theorem 1). Paraphrasing the abstract of Genevay et al. (2018), the interpolation allows to find ‘a sweet spot’ leveraging the geometry of optimal transport and the favourable high-dimensional sample complexity of maximum mean discrepancy, which comes with unbiased gradient estimates.
4 Optimal transport-based machine learning
In this section, we introduce two co-clustering algorithms and one matching algorithm, all based on the solution of a master optimization programme. The optimization programme is presented in Section 4.1 and the algorithms are presented in Section 4.2.
4.1 Stage 1: the master optimization programme and how to solve it
We introduce a parametric model Θ consisting of affine mappings of the form , where and . The formal definition of Θ is given in Section C of the online supplementary material. Each is a candidate to formalize the aforementioned mirroring relationship. The set Θ imposes constraints on the matrices , in particular that their diagonals are made of negative values. Of course, minus identity belongs to Θ. The parametrization is identifiable, in the sense that implies . It is noteworthy that any identifiable, regular model Θ could be used. We focus on Θ as defined in Section C of the online supplementary material because of the application that we consider in Section 6 (and in Section 5).
By analogy with Section 3 we introduce, for any , , and , the image of X by θ; the ω-weighted empirical measure attached to , ; the cost matrix given by for each ; and
where is the P-specific expected cost of transport from to Y.
Fix arbitrarily . The first programme that we introduce is the ω-specific programme
where we are interested in the minimizer that solves (3) and in the optimal joint matrix that solves
In words, we look for an ω-specific optimal mirroring function and its ω-specific optimal transport plan .
How to choose ω? We decide to optimize with respect to ω as well. This additional optimization is relevant because we do not expect to associate a to every eventually at the co-clustering stage. So, our master programme is
where we are interested in the minimizer and in the optimal matrix that solves
We propose to solve (4) iteratively by updating ω and then θ. At round t, given , we make one step of mini-batch gradient descent to derive from (here, we notably rely on the Sinkhorn–Knopp algorithm). Given , is chosen proportional to the vector in whose mth component equals where is the standard normal density and h is the arithmetic mean of the for all . Eventually, once the final round T is completed, we compute that solves
(again, we rely on the Sinkhorn–Knopp algorithm).
The algorithm to solve (4) is summarized in Procedure 1 of the online supplementary material. We have no guarantee that it converges. Note, however, that using the Sinkhorn–Knopp algorithm to solve (5) for a given is known to converge (Peyré & Cuturi, 2019, Theorem 4.2).
In light of Alvarez-Melis (2019, Section 1.3, page 25), we inject problem-specific knowledge onto two of the three main components of the transportation problem: the representation spaces (via the mapping θ) and the marginal constraints (via the weight ω), leaving aside the cost function. Furthermore, we resort to mini-batch gradient descent because the algorithmic complexity prevents the direct computation using the whole data set. A theoretical analysis of this practice is proposed in Fatras et al. (2020).
We can now exploit so as to derive relevant associations between mRNAs and miRNAs. We propose two approaches. On the one hand, the first approach outputs bona fide co-clusters. We expect that the co-clusters can associate many mRNAs with many miRNAs, thus making it difficult to interpret and analyse the results. On the other hand, the second approach rather matches each mRNA with at most k miRNAs and each miRNA with at most mRNAs (k and are user-supplied integers). Details follow.
4.2 Stage 2: co-clustering or matching
4.2.1 Co-clustering
To carry out the co-clustering task once has been derived, we propose to rely either on SCC (Dhillon, 2001), applying it once or twice, or co-clustering based on latent block models (Govaert & Nadif, 2010). Of course, any other co-clustering algorithm could be used as well. Specifically, we develop the following algorithms (the acronym WTOT stands for weighted transformation optimal transport).
WTOT-SCC1. Algorithm WTOT-SCC1 applies SCC once to build bona fide co-clusters based on . It is required to provide a number of clusters. We rely on a criterion involving graph modularity to learn from the data a relevant number of clusters (Ailem et al., 2016, Sections 2 and 4).
In our simulation study, we also consider algorithm WTOT-SCC1 , an oracular version of WTOT-SCC1 that benefits from relying on the true number of clusters. This allows to assess how relevant is the learned number of clusters in WTOT-SCC1.
WTOT-SCC2. Algorithm WTOT-SCC2 applies SCC twice to build bona fide co-clusters based on . It proceeds in three successive steps.
In step 1, WTOT-SCC2 applies SCC a first time to derive an initial co-clustering. A relevant number of co-clusters is learned as in WTOT-SCC1.
In step 2, WTOT-SCC2 selects and removes some rows and columns corresponding to mRNAs and miRNAs that are deemed irrelevant. The selection is based on a numerical criterion computed from . In our simulation study (Section 5), all rows and columns that correspond to diagonal blocks with a variance larger than two times the overall variance of are selected and removed. In the real data application (Section 6), we implement and use a different procedure.
In step 3, WTOT-SCC2 applies SCC a second time, the relevant number of co-clusters being learned as in WTOT-SCC1.
In our simulation study, we also consider algorithm WTOT-SCC2 , an oracular version of WTOT-SCC2, that provided the true number of clusters for its third step. This allows to assess how relevant is the sub-procedure to learn the numbers of clusters in WTOT-SCC2.
WTOT-BC. Algorithm WTOT-BC applies the so-called block clustering algorithm to build bona fide co-clusters based on . It is required to provide the row- and column-specific numbers of clusters. We rely on an integrated completed likelihood criterion (Brault et al., 2014) to learn relevant values from the data.
The co-clusters obtained via WTOT-SCC1, WTOT-SCC2, or WTOT-BC should reveal the interplay between the (remaining, as far as WTOT-SCC2 is concerned) mRNAs and miRNAs in HD.
4.2.2 Matching
The larger is, the more we are encouraged to believe that the profiles and reveal a strong relationship between the mth mRNA and the nth miRNA. This simple rule prompts the following matching procedure applied once has been derived.
WTOT-matching. Fix two integers and let be the quantile of order q of all the entries of . For every and , we introduce
where are the k largest values among and are the largest values among . For instance, identifies the miRNAs that are the k more likely to have a strong relationship with the mth mRNA. However, this does not qualify them as relevant matches yet. In order to keep only matches that are really relevant, we also introduce, for each and ,
Algorithm WTOT-matching outputs the collections and .
Now if, for instance, , then is among the k miRNA profiles upon which puts more mass when it ‘transports’ onto Y and is among the mRNA profiles upon which puts more mass when it ‘transports’ onto X.
Note that we expect that some and will be empty, depending on k and . The mRNAs and miRNAs worthy of interest are those for which and are not empty. The integers k and should be chosen relatively small, to make their interpretation and analysis feasible, but not too small because otherwise few matchings will be made.
In the simulation study, we use between 2 and 200, depending on the simulation scheme. Moreover, we choose so that is the median of the entries of .
4.3 Implementation
Our code is written in python and is available here.2 We adapt the Sinkhorn algorithm implemented by Aude Genevay and is available here.3 The stochastic gradient descents rely on the machine learning framework pytorch. We use the implementation of SCC available in the sklearn python module.4 To learn a relevant number of clusters, we rely on the coclustpython module.5 Finally, we rely on the blockcluster R package6 to carry out block clustering.
Our algorithms bear a similarity to the one developed in Laclau et al. (2017). The main differences are (i) our use of the parametric model Θ and weights ω and (ii) the fact that we apply SCC or block clustering to the approximation of the optimal transport matrix . Our algorithms also bear a similarity to Yang et al. (2021), a fast and certifiable point cloud registration algorithm. We plan to study the similarities and differences closely.
5 Simulation study
To assess the performances of the algorithms described in Section 4.1, we conduct a simulation study in three parts. As we go on, the task gets more difficult. In all cases, the laws of the synthetic observations are mixtures of Gaussian laws. Overall, 12 simulation scenarios are considered.
Section 5.1 lists all the algorithms that compete in the simulation study. Section 5.2 presents the measure of discrepancy between two co-clusterings and the matching criteria that we rely on to assess how well the algorithms perform. Sections 5.3–5.5 present in turn the data-generating mechanisms and report the results in terms of co-clustering and matching performances.
We think that the first two simulation schemes produce unrealistic data and, on the contrary, that the third simulation scheme produces somewhat realistic data. The diversity of the synthetic mRNA and miRNA profiles obtained using Lloyd’s k-means algorithm in order to summarize the variety of real profiles, see Section A.1 of the online supplementary material, encouraged us to rely on mixtures in order to simulate data. We chose mixtures of Gaussian laws because of their ubiquity and versatility.
In Section 5.3, the weights of the mixtures and parameters of the Gaussian laws are chosen by us. Moreover, the two mixtures (to simulate X and Y) share the same weights and induce a perfect mirroring relationship (details below), thus making the co-clustering task less difficult. In Section 5.4, the weights of the mixtures and parameters of the Gaussian laws are randomly generated. Moreover, the two mixtures do not share the same weights and do not induce a perfect mirroring relationship anymore, so that the co-clustering task is much more difficult. Finally, in Section 5.5, we use plus or minus real, randomly chosen miRNA profiles and as means of the Gaussian laws to simulate X and Y, in such a way that there is no perfect mirroring relationship. We think that the corresponding co-clustering task is the most difficult of the three.
5.1 Listing all competing algorithms
We run and compare algorithms WTOT-SCC1, WTOT-SCC2 (and their oracular counterparts WTOT-SCC1 and WTOT-SCC2 ), and WTOT-BC on the one hand (see Section 4.2.1) and Co-clustering through Optimal Transport with Gromov-Wassertein Distance (CCOT-GWD) and Co-clustering through Optimal Transport with Gromov-Wassertein Barycenters (CCOT-GWB ) on the other hand (see Section B.1 of the online supplementary material). In addition, we also run algorithm WTOT-matching (see Section 4.2.2).
For CCOT-GWD, we set in Equation (1) of the online supplementary material. For CCOT-GWB, we set in Equation (2) of the online supplementary material. We tried several values and chose the ones that yielded the smallest errors.
In view of Procedure 1 of the online supplementary material, we choose and equal to approximately and , respectively, (no decay), , and an initial mapping drawn randomly (see Section C of the online supplementary material for details).
We checked that varying and around and had little impact if any. Likewise, the randomly drawn initial mapping had little impact if any. Moreover, varying in with also had little impact if any. We did not rigorously check the impact of the total number of iterations T, but we observed that numerical convergence seemed to be reached for fewer iterations than T. Finally, we did not challenge the choice of .
5.2 Assessing performances
5.2.1 A measure of discrepancy between two co-clusterings
In order to assess the quality of the co-clusterings that we derive, and to compare performances, we propose to rely on a commonly used measure of discrepancy between two co-clusterings. Its definition extends that of a measure of discrepancy between partitions that we first present.
Let z and be two partitions of the set into K components, taking the form of matrices and with convention (respectively, ) if m belongs to component k of z (respectively, ) and 0 otherwise. The corresponding confusion matrix is given by (every ). Suppose that the labels of the partitions z and are such that
where is the set of permutations of the elements of . Then the proportion
is a natural measure of discrepancy between z and . As suggested earlier, the measure can be extended to compare pairs of partitions.
Consider now and two pairs of partitions, z and partitioning into K components, w and partitioning into L components. We represent and with
and
where and (for every ), supposing again that the labels of the partitions z, on the one hand and w, on the other hand maximize the traces of the confusion matrices and as above (then two pairs of partitions define without ambiguity a co-clustering). By analogy with (6), the proportion
is a measure of discrepancy between and . It can be shown that
In the rest of this section, we report means and standard deviations, computed across 30 independent replications of each analysis, of the above measure of discrepancy between the derived partition/co-clustering and the true one.
5.2.2 Matching criteria
Set arbitrarily and suppose that we have derived the subset that matches to . Suppose moreover that in reality is matched to for some . We propose to use three real-valued criteria to compare with .
Let , , , and be the numbers of true positives, false positives, true negatives, and false negatives, respectively. The so-called m-specific
precision: ,
sensitivity: ,
specificity:
quantify how similar are and , larger values indicating better concordance.
In the rest of this section, we report means and standard deviations, computed across 30 independent replications of each analysis, of the average of the m-specific precision, sensitivity, and specificity. We also report means and standard deviations, computed across the same 30 independent replications of each analysis, of
the row- and column-specific averages of the cardinalities of the sets and that are not empty.
5.3 First simulation study
For four different choices of the hyperparameters , , , such that , we sample independently from the mixture of Gaussian laws
and from
One way to sample x from the mixture (9) consists in sampling a latent label u in from the multinomial law with parameter then in sampling x from the Gaussian law . Similarly, sampling y from the mixture (10) can be carried out by sampling a latent label v in from the multinomial law with parameter then by sampling y from the Gaussian law . We think of x and y as having a mirrored relationship if . In this light, the challenge that we tackle consists in finding such relationships without having access to the latent labels.
The four configurations that we investigate and the results we comment upon can be found in Section B.2 of the online supplementary material.
5.4 Second simulation study
The second simulation scheme also relies on mixtures of Gaussian laws, but the means and weights are generated randomly from a Gaussian determinantal point process (DPP) for the former and from a Dirichlet law for the latter. More specifically, given the hyperparameters , ,
we sample from a Gaussian DPP on with a kernel proportional to conditionally on obtaining exactly K points (Baddeley & Turner, 2005; Lavancier et al., 2015);
independently, we sample and from the Dirichlet laws with parameters and ;
- we sample independently from the mixture of Gaussian lawsand from
We use a DPP to generate to avoid the arbitrary choice of the mean parameters in such a way that the randomly picked are dispersed in (because the DPP is a repulsive point process).
The four configurations that we investigate and the results we comment upon can be found in Section B.3 of the supplementary material.
5.5 Third simulation study
5.5.1 Simulation scheme
The third simulation scheme aspires to generate synthetic data sets X and Y that are more similar to the real data sets than those generated in the two first simulation studies. Once again, we rely on mixtures of Gaussian laws. This time, however, the various means are neither chosen arbitrarily (unlike in the first simulation study) nor drawn randomly (unlike in the second simulation study) but are sampled in the real collection of miRNAs. Moreover, the weights of the mixtures are random.
Specifically, given the hyperparameters , , , and (with much larger than σ),
we sample uniformly without replacement from the collection of observed miRNA profiles conditionally on ;
independently, we sample independently from the Poisson law with parameter , from the Poisson law with parameter , and from the Poisson laws with parameter and ;
for each , we sample independently from the Gaussian law and from the Gaussian law . Moreover, we also sample independently and from the Gaussian law .
Here, we think of x and y as having a mirrored relationship if there exists such that x and y are drawn from the laws and . Furthermore, we view x and y drawn from the law as noise.
Table 1 describes the four configurations that we investigate. The larger K is the more challenging the configuration is.
Configuration . | . | . | K . | . |
---|---|---|---|---|
C1 | 3 | |||
C2 | 15 | |||
C3 | 15 | |||
C4 | 15 |
Configuration . | . | . | K . | . |
---|---|---|---|---|
C1 | 3 | |||
C2 | 15 | |||
C3 | 15 | |||
C4 | 15 |
Note. The larger is, the more challenging configuration Cℓ is.
Configuration . | . | . | K . | . |
---|---|---|---|---|
C1 | 3 | |||
C2 | 15 | |||
C3 | 15 | |||
C4 | 15 |
Configuration . | . | . | K . | . |
---|---|---|---|---|
C1 | 3 | |||
C2 | 15 | |||
C3 | 15 | |||
C4 | 15 |
Note. The larger is, the more challenging configuration Cℓ is.
5.5.2 Results
Thirty times, independently, we simulated synthetic data sets X and Y under the simulation scheme described above, then we applied the various algorithms as presented in Section 5.1. Table 2 summarizes the results of the seven algorithms listed in Section 5.1 that rely on bona fide co-clustering algorithms (see Section 4.2.1). Tables 3 and 4 summarize the results of our algorithm that relies on matching (see Section 4.2.2).
Mean (± standard deviation) computed across the 30 independent replications of the co-clustering discrepancy obtained for configurations C1, C2, C3, and C4
. | The WTOT(…) algorithms . | The CCOT-(…) algorithms . | |||||
---|---|---|---|---|---|---|---|
. | WTOT-SCC1 . | WTOT-SCC1 . | WTOT-SCC2 . | WTOT-SCC2 . | WTOT-BC . | CCOT-GWD . | CCOT-GWB . |
C1 | |||||||
C2 | |||||||
C3 | |||||||
C4 |
. | The WTOT(…) algorithms . | The CCOT-(…) algorithms . | |||||
---|---|---|---|---|---|---|---|
. | WTOT-SCC1 . | WTOT-SCC1 . | WTOT-SCC2 . | WTOT-SCC2 . | WTOT-BC . | CCOT-GWD . | CCOT-GWB . |
C1 | |||||||
C2 | |||||||
C3 | |||||||
C4 |
Note. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.
Mean (± standard deviation) computed across the 30 independent replications of the co-clustering discrepancy obtained for configurations C1, C2, C3, and C4
. | The WTOT(…) algorithms . | The CCOT-(…) algorithms . | |||||
---|---|---|---|---|---|---|---|
. | WTOT-SCC1 . | WTOT-SCC1 . | WTOT-SCC2 . | WTOT-SCC2 . | WTOT-BC . | CCOT-GWD . | CCOT-GWB . |
C1 | |||||||
C2 | |||||||
C3 | |||||||
C4 |
. | The WTOT(…) algorithms . | The CCOT-(…) algorithms . | |||||
---|---|---|---|---|---|---|---|
. | WTOT-SCC1 . | WTOT-SCC1 . | WTOT-SCC2 . | WTOT-SCC2 . | WTOT-BC . | CCOT-GWD . | CCOT-GWB . |
C1 | |||||||
C2 | |||||||
C3 | |||||||
C4 |
Note. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.
Mean (± standard deviation) computed across the 30 independent replications of , precision, sensitivity, and specificity of the m-specific matchings averaged across all mRNAs for configurations C1 (left) and C4 (right)
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 10 | ||||
C1 | 30 | ||||
C1 | 50 | ||||
C1 | 55 | ||||
C1 | 60 | ||||
C1 | 70 | ||||
C4 | 5 | ||||
C4 | 10 | ||||
C4 | 15 | ||||
C4 | 20 | ||||
C4 | 25 | ||||
C4 | 30 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 10 | ||||
C1 | 30 | ||||
C1 | 50 | ||||
C1 | 55 | ||||
C1 | 60 | ||||
C1 | 70 | ||||
C4 | 5 | ||||
C4 | 10 | ||||
C4 | 15 | ||||
C4 | 20 | ||||
C4 | 25 | ||||
C4 | 30 |
Note. mRNA = messenger-RNA.
Mean (± standard deviation) computed across the 30 independent replications of , precision, sensitivity, and specificity of the m-specific matchings averaged across all mRNAs for configurations C1 (left) and C4 (right)
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 10 | ||||
C1 | 30 | ||||
C1 | 50 | ||||
C1 | 55 | ||||
C1 | 60 | ||||
C1 | 70 | ||||
C4 | 5 | ||||
C4 | 10 | ||||
C4 | 15 | ||||
C4 | 20 | ||||
C4 | 25 | ||||
C4 | 30 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 10 | ||||
C1 | 30 | ||||
C1 | 50 | ||||
C1 | 55 | ||||
C1 | 60 | ||||
C1 | 70 | ||||
C4 | 5 | ||||
C4 | 10 | ||||
C4 | 15 | ||||
C4 | 20 | ||||
C4 | 25 | ||||
C4 | 30 |
Note. mRNA = messenger-RNA.
Mean (± standard deviation) computed across the 30 independent replications of or , precision, sensitivity, and specificity of the m-specific matchings (left) and n-specific matchings (right) averaged across all mRNAs (left) and all miRNAs (right)
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
Note. mRNA = messenger-RNA; miRNA = micro-RNA.
Mean (± standard deviation) computed across the 30 independent replications of or , precision, sensitivity, and specificity of the m-specific matchings (left) and n-specific matchings (right) averaged across all mRNAs (left) and all miRNAs (right)
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
. | . | . | Precision . | Sensitivity . | Specificity . |
---|---|---|---|---|---|
C1 | 55 | ||||
C2 | 20 | ||||
C3 | 20 | ||||
C4 | 20 |
Note. mRNA = messenger-RNA; miRNA = micro-RNA.
Table 2. We first focus on configuration C1. We note that WTOT-SCC1 and WTOT-SCC2 perform similarly, much better than CCOT-GWD and CCOT-GWB, better than the oracular algorithm WTOT-BC , but not as well as the oracular algorithms WTOT-SCC1 and WTOT-SCC2 .
We now turn to configurations C2, C3, and C4. Configuration C3 is more challenging than configuration C2 because it shares the same hyperparameters as C2 except for (which drives the number of noisy data points), set to in C2 and to in C3. Similarly, configuration C4 is more challenging than configuration C3 because it shares the same hyperparameters as C3 except for σ (the standard deviation of the Gaussian variations around the mean profiles), set to in C3 and to in C4. The comparisons will not concern algorithms WTOT-BC (which never converges in these simulations), CCOT-GWD, and CCOT-GWB (which perform very poorly).
In configuration C2, in the absence of noisy data points, algorithm WTOT-SCC1 performs slightly better than WTOT-SCC2, as well as the oracular algorithm WTOT-SCC2 , and almost as well as the oracular algorithm WTOT-SCC1 (in average). In configurations C3 and C4, the introduction of noisy data points and then the increase in variability strongly degrade the performances of WTOT-SCC1, WTOT-SCC1 and, to a lesser extent, those of WTOT-SCC2 and WTOT-SCC2 . Algorithm WTOT-SCC2 outperforms WTOT-SCC1 and the oracular algorithm WTOT-SCC1 too.
Tables 3 and 4. Table 3 illustrates the influence of on the performances of algorithm WTOT-matching in configurations C1 and C4. In each configuration, the values are chosen in the vicinity of or (50 in configuration C1 and 15 in configuration C4). For specificity and sensitivity, we observe the same patterns in configurations C1 and C4: specificity is not impacted much as grows whereas sensitivity increases dramatically. Precision remains high in configuration C1 for all choices of . In configuration C4, precision remains high for ranging between 5 and 20, then it decreases when grows from 25 to 30.
Table 4 summarizes the results of WTOT-matching in configurations C1, C2, C3, and C4 for a specific choice of in terms of the row- and column-specific averages and , precision, sensitivity, and specificity. In each configuration, we chose the value of among many retrospectively, so that the overall performance (in terms of precision, sensitivity, and specificity) is good. The left-hand-side (m-specific) and right-hand-side (n-specific) tables in Table 4 are very similar. In configurations C1 and C2, all precision, sensitivity, and specificity are quite satisfying. In configurations C3 and C4, sensitivity and specificity are quite satisfying as well while precision falls below 0.86.
6 Illustration on real data: matching mRNA and miRNA in HD mice
Next, we apply algorithms WTOT-SCC2 and WTOT-matching to discover patterns hidden in RNA-seq data obtained in the striatum of HD model mice. As explained in Section 1, multi-dimensional mRNA and miRNA sequencing data were obtained in the striatum of these mice (Langfelder et al., 2016, 2018) and an earlier analysis of these data using shape analysis concepts (Mégret et al., 2020) has demonstrated their value.
6.1 Tuning
Specifically, in view of Procedure 1 of the online supplementary material, we choose , , and . The entries of the matrices are constrained to take their values in (for WTOT-SCC2) or (for WTOT-matching), and , respectively. We also choose . Finally, the initial mapping is drawn randomly.
Furthermore, regarding step 2 of algorithm WTOT-SCC2, we remove rows and columns based on the following loop: 100 times successively, (i) we compute the Kullback–Leibler divergence between each row (renormalized) and the uniform distribution, and then remove the 100 rows with the smallest divergences; then (ii) we compute the Kullback–Leibler divergence between each column (renormalized) and the uniform distribution, and then remove the 5 columns with the smallest divergences. By doing so, we successively get rid of the rows and columns which, viewed as distributions, are too uniform and therefore deemed irrelevant. Finally, we remove all rows for which the (columnwise) sum of the remaining entries of is smaller than one tenth of the maximal (columnwise) sum, and all columns for which the (rowwise) sum of the remaining entries of is smaller than 1/10th of the maximal (rowwise) sum.
6.2 Results
6.2.1 Co-clustering
The selection procedure (step 2 of WTOT-SCC2) keeps 3,409 mRNA profiles (among the 13,616 available in the data set) and 602 miRNA (among the 1,143 available in the data set). Eventually, algorithm WTOT-SCC2 outputs eight co-clusters. The co-clusters’s sizes (numbers of mRNA and miRNA gathered in each co-cluster) are , , , , , , , and . Figure 2 represents the averages, computed across all blocks, of the entries of the matrix derived from the optimal transport matrix during step 2 of algorithm WTOT-SCC2 and after its rearrangement. Squares located on the diagonal tend to be slightly darker than the other squares. This reveals that, in average, a pair of mRNA and miRNA gathered in a diagonal co-cluster tends to exhibit a mirrored relationship that is slightly stronger than those of the form or which do not fall in the same co-cluster. However, few of the off-diagonal averages are small in comparison to the on-diagonal averages, a disappointing observation that comes on top of the fact that the co-clusters’ sizes are so large that it is difficult to interpret the results. This makes it even more relevant to focus on algorithm WTOT-matching.

Logarithms of the averages, computed across all blocks, of the entries of the matrix derived from the optimal transport matrix during step 2 of algorithm WTOT-SCC2 and after its rearrangement. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.
6.2.2 Matching
We run the WTOT-matching algorithm with and . For the anecdote, we observe (recall that are the row- and column-specific averages of the cardinalities of the sets and that are not empty). We report the parameters that characterize the mapping in Section C of the online supplementary material.
As an illustration, the mirrored profile (the opposite value of ) of the Mir20b miRNA is displayed in Figure 3 along with its three matched mRNAs (Ahrr, Cnih3, and Relb) obtained by running algorithm WTOT-matching with . Recall that the original profile of Mir20b can be found in Figure 1.

Minus the profile of the Mir20b miRNA (a), and profiles of its matched mRNAs, Ahrr (b), Relb (c), and Cnih3 (d). mRNA = messenger-RNA; miRNA = micro-RNA.
6.3 Biological analysis of the results
In an effort to guarantee biological relevance to the matchings, we only retain those showing evidence for binding sites as indicated in the databases TargetScan (Lewis et al., 2005), MicroCosm (Betel et al., 2010), and miRDB (Ding et al., 2016). Specifically, a pair is retained if and only if the mRNA whose profile is x and the miRNA whose profile is y are both among the 27,355 mRNAs and 1,478 miRNAs appearing in the TargetScan, MicroCosm, and miRDB databases. The 1,247 matchings retained out of the 7,521 output by the WTOT-matching algorithm are all presented on the companion website.7 We stress that we would have obtained fewer matchings if we had excluded from the collections X and Y the profiles of mRNA or miRNA which do not appear in the databases.
Furthermore, we build upon two previous analyses of miRNA regulation in the striatum of HD knock-in-mice (Langfelder et al., 2018; Mégret et al., 2020) to comment on the biological relevance and novelty of our findings. The first analysis (Langfelder et al., 2018) relies on the Weighted Correlation Network Analysis (WGCNA) algorithm, a weighted gene co-expression network analysis which yields clusters of genes whose expression profiles are correlated. The second analysis (Mégret et al., 2020) relies on the MiRAMINT algorithm. MiRAMINT is a pipeline whose main steps consist in (a) carrying out a weighted gene co-expression network analysis, (b) using random forests to select candidate matchings, and (c) using Spearman’s correlation test and a multiple testing procedure to identify the more reliable matchings. We highlight that WGCNA outputs 1,583 mRNA–miRNA matchings showing evidence for binding sites in the databases TargetScan, MicroCosm, and miRDB, which involve only 46 different miRNAs. As for MiRAMINT, it only outputs 31 matchings of which 20 show evidence for binding sites in the databases TargetScan, MicroCosm, and miRDB, involving 14 different miRNAs. The 31 mRNA–miRNA matchings output by MiRAMINT are all presented on the webpage.8
6.3.1 Analysing the overlaps
Three mRNA–miRNA matchings are retained both by the WTOT-matching and WGCNA algorithms: Mir186-Chl1, Mir132-Fam196b, and Mir212-Fam196b. No matchings are retained both by the WTOT-matching and the MiRAMINT algorithms. One pair is retained both by the MiRAMINT and WGCNA algorithms: Mir132-Pafah121.
Figures 5–8 of the online supplementary material present several Venn diagrams that summarize the overlaps between the sets of miRNAs (respectively, mRNAs) which belong to a pair output by the WGCNA, MiRAMINT, and WTOT-matching algorithms. Figure 5 considers all pairs. In an attempt to identify miRNAs that are particularly susceptible to play a distinct role in HD in mice, Figures 6–8 focus on pairs where each miRNA is labelled as ‘peaked’, ‘monotonic’, or ‘neither peaked nor monotonic’ (referring to the shape of the miRNA’s profile).
Section C.1 of the online supplementary material presents a detailed analysis of the Venn diagrams. The take-home messages are
the findings of the WTOT-matching algorithm are not in agreement with those of the WGCNA and MiRAMINT algorithms, except when focusing on the mRNAs which belong to a pair output by the WTOT-matching and MiRAMINT algorithms;
the WTOT-algorithm retains mRNA–miRNA matchings that we label as peaked whereas neither the WGCNA nor the MiRAMINT algorithms do;
in matchings that we label as monotonic, the agreement between the WTOT-matching and WGCNA algorithms is better than that between the WTOT-matching and MiRAMINT algorithms;
in matchings that we label as neither peaked nor monotonic, the agreement between the WTOT-matching and WGCNA algorithms is better than that between the WTOT-matching and MiRAMINT algorithms.
6.3.2 Enrichment analysis
Next, we assess and compare the biological significance of the mRNAs retained by the WGCNA, MiRAMINT, and WTOT-matching algorithms. To do so we carry out an enrichment analysis using the EnrichR tools (Chen et al., 2013; Kuleshov et al., 2016; Xie et al., 2021). We consider only top annotations (balancing a small p-value and a large number of hits) as provided by Gene Ontology data9 (biological process and cellular content) and Kyoto Encyclopedia of Genes and Genomes (KEGG) data.10 When necessary, only the top 40 hits are considered so as to guarantee a sufficient level of biological precision. Pubmed searches are also used to assess the biological significance of predicted miRNA regulation.
Figures 9–11 of the online supplementary material present the mRNA–miRNA networks based on the mRNA–miRNA matchings output by the WTOT-matching algorithm, focusing on the matchings which are labelled as peaked, monotonic and neither peaked nor monotonic (in that order). The mRNAs and miRNAs retained by the WGCNA and MiRAMINT algorithms are coloured. The enrichment analysis reveals
that the mRNA–miRNA matchings output by the WGCNA algorithm are primarily annotated for axonogenesis,11 which relates to cytoskeleton dynamics and cell morphology;
that the matchings output by the MiRAMINT algorithm are primarily annotated for regulation of defense response to virus by host,12 which relates to stress response and innate immunity;
that the matchings output by the WTOT-matching algorithm are primarily annotated for extracellular matrix organization (which relates to cell identity),13 due to the matchings labelled as neither peaked nor monotonic, and secondarily annotated for mitigation of host antiviral defense response,14 due to the matchings labelled as monotonic, and for conventional motile cilium,15 due to the matchings labelled as peaked.
Although the numbers of hits in some of these annotations are small, they suggest that the WTOT-matching algorithm is able to uncover a role of miRNA regulation in responding to mutant huntingtin that was not detected by the WGCNA and MiRAMINT algorithms (despite the large number of mRNAs retained by the former).
Section C.2 of the online supplementary material presents a biological analysis of the results. We believe that it substantiates our claim that the WTOT-matching algorithm strikes a good balance between the low and high selectivity of the WGCNA and MiRAMINT algorithms. Moreover, our findings related to striatal alterations in HD mice lead to reconsidering the formerly expressed view on a limited role of miRNA regulation in the striatum of HD mice on a systems level (Mégret et al., 2020).
7 Discussion
We have developed two co-clustering algorithms (WTOT-SCC1 and WTOT-SCC2) and a matching algorithm (WTOT-matching) for the purpose of identifying groups of mRNAs and miRNAs that interact. The algorithms apply in any situation where it is of interest to cluster or match the elements of two data sets based on a parametric model Θ expressing what it means to interact for any two pair of elements from the two data sets. The algorithms rely on optimal transport, SCC, and a matching procedure. In light of Alvarez-Melis (2019, Section 1.3, page 25), problem-specific knowledge is injected onto two of the three main components of the transportation problem: the representation spaces (via Θ) and the marginal constraints, leaving aside the cost function.
During the first stage, an optimal optimal transport plan P and mapping in Θ are learned from the data using the Sinkhorn–Knopp algorithm and a mini-batch gradient descent. During the second stage, P is exploited to derive either co-clusters or several sets of matched elements.
As in Mégret et al. (2020), the motivation of our study is to shed light on the interaction between mRNAs and miRNAs based on data collected in the striatum of HD model knock-in mice (Langfelder et al., 2016, 2018). Each data point takes the form of multi-dimensional profile. The strong biological hypothesis is that if a miRNA induces the degradation of a target mRNA or blocks its translation into proteins, or both, then the profile of the former should be similar to minus the profile of the latter—this particular form of affine relationship drives the formulation of a loosened hypothesis and definition of model Θ. The fact that the algorithm learns from the data is that a best element in Θ provides more flexibility than in Mégret et al. (2020).
The simulation study reveals on the one hand that WTOT-SCC2 works overall better than WTOT-SCC1, but that the co-clustering task can be very challenging in the presence of many irrelevant data points (data points that do not interact). On the other hand, it shows that the performances of WTOT-matching are satisfying.
An illustration on real data is given. The results are biologically relevant and illustrate how our algorithm strikes a good balance between two moderately and highly selective, competing algorithms. Our findings lead to reconsidering the formerly expressed view on a limited role of miRNA regulation in the striatum of HD mice on a systems level (Mégret et al., 2020).
In conclusion, there are several directions for future work. First, we will develop a similar study to better understand miRNA regulation in the cortex of HD model mice (ongoing project). Second, we will evaluate the performances of our algorithms by simulation studies based on a simulation scheme learned from the real data so as to better mimic their law (ongoing project). Third, we will put our algorithms into the general context of co-clustering and matching of datasets and carry out more benchmark tests and comparisons.
Funding
T.T.Y.N. is funded by Université Paris Cité, thanks to a Ph.D. fellowship granted by Domaine d’Intérêt Majeur Math Innov (Région Île-de-France and Fondation Sciences Mathématiques de Paris). O.B. and A.C. are funded by Université Paris Cité, and W.H. by Déraison.ai.16 C.M., L.M., and C.N. are funded by the CHDI Foundation (grant no. A-14814), Sorbonne Université, CNRS, and INSERM.
Data availability
The omics data used in this study are publicly available through the database repository Gene Expression Omnibus and the HDinHD portal.17 Overlaps between the results obtained by applying the WTOT-matching algorithm and results previously obtained based on two other algorithms, and full display of biological annotations, are available on the companion website.18 The code is available at http://www.broca.inserm.fr/WTOT/ and https://github.com/yen-nguyen-thi-thanh/wtot_coclust_match.
Authors’ contributions
C.N., O.B., and A.C. conceived the study. T.T.Y.N., O.B., and A.C. developed the methodology, formally and computationally, and performed the data analysis based on insights from L.M. and C.N. on how mutant huntingtin may significantly influence expression patterns across CAG repeat alleles and age points in the brain of HD mice. W.H. provided guidance on computational aspects. L.M., C.M., and C.N. performed the biological analysis of the results, the comparison to other algorithms, and the data base construction. T.T.Y.N, O.B., and A.C. wrote the first draft of the manuscript. All authors commented on subsequent versions. All authors read and approved the final manuscript.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series C.
References
Footnotes
GO:0007409, de novo generation of a long process of a neuron, including the terminal branched region. Refers to the morphogenesis or creation of shape or form of the developing axon, which carries efferent (outgoing) action potential from the cell body towards target cells.
GO:0050691, any host process that modulates the frequency, rate, or extent of the antiviral response of a host cell or organism.
GO:0030198, a process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an extracellular matrix.
GO:0050690, evasion by virus of host immune response.
GO:0097729, a motile cilium where the axoneme has a ring of nine outer microtubules doublets plus two central micro tubules.
Author notes
O.B., C.N., and A.C. contributed equally.
Conflicts of interest: None declared.