Abstract

We present several algorithms designed to learn a pattern of correspondence between 2 data sets in situations where it is desirable to match elements that exhibit a relationship belonging to a known parametric model. In the motivating case study, the challenge is to better understand micro-RNA regulation in the striatum of Huntington’s disease model mice. The algorithms unfold in 2 stages. First, an optimal transport plan P and an optimal affine transformation are learned, using the Sinkhorn–Knopp algorithm and a mini-batch gradient descent. Second, P is exploited to derive either several co-clusters or several sets of matched elements. A simulation study illustrates how the algorithms work and perform. The real data application further illustrates their applicability and interest.

1 Introduction

The analysis of numerous omics data is a challenging task in biological research (Benayoun et al., 2019) and disease research (Langfelder et al., 2016; Maniatis et al., 2019). In disease research, omics data are increasingly available for the analysis of molecular pathology. This is notably illustrated by research on Huntington’s disease (HD): messenger-RNA (mRNA), micro-RNA (miRNA), and protein data collectively quantifying several layers of molecular regulation in the brain of HD model knock-in mice (Langfelder et al., 2016, 2018) now compose one of the largest data set available to date to understand how neurodegenerative processes may work on a system level. The data set is publicly available through the database repository Gene Expression Omnibus and the HDinHD portal.1

Encouraged by the promising findings of Mégret et al. (2020), our ultimate goal is to shed light on the interaction between mRNAs and miRNAs based on data collected in the striatum (a brain region) of HD model knock-in mice (Langfelder et al., 2016, 2018). Each data point takes the form of multi-dimensional profile. The strong biological hypothesis is that if a miRNA induces the degradation of a target mRNA or blocks its translation into proteins, or both, then the profile of the former, say y, should be similar to minus the profile of the latter, say x. We relax the hypothesis and consider that y is similar to θ(x) where θ is an affine transformation in a parametric class Θ that includes minus the identity and whose definition translates expert knowledge about the experiment that yields the data. Our study straightforwardly extends to the case that the relationship is known to belong to any parametric model. In order to identify groups of mRNAs and miRNAs that interact, we develop a co-clustering algorithm and a matching algorithm based on optimal transport (Peyré & Cuturi, 2019), spectral and block co-clustering, and a matching procedure tailored to our needs.

Spectral co-clustering (SCC; Dhillon, 2001) and block clustering (BC; Brault et al., 2014; Govaert & Nadif, 2010) are two ways among many others to carry out co-clustering, an unsupervised learning task to cluster simultaneously the rows and columns of a matrix in order to obtain homogeneous blocks. There are many efficient approaches to solving the problem, often characterized as model-based or metric-based methods (Pontes et al., 2015).

In an enlightening article, Nazarov and Kreis (2021) review a variety of computational approaches to study how miRNAs ‘come together to regulate the expression of a gene or a group of genes’. They identify three different families of methods: data-driven methods based on similarities, data-driven methods based on matrix factorization, and hybrid methods. Our algorithms belong to the first family. In view of Nazarov and Kreis (2021, Section 2.5 and Figure 2), we do not rely on the standard similarity measures (Pearson’s and Spearman’s correlation coefficients, cosine similarity, and mutual information) to define our similarity matrix but, instead, use optimal transport to derive it. Moreover, as in canonical correlation analysis, we do not compare the raw mRNA and miRNA profiles x,y but, instead, we compare a data-driven transformation θ(x) and y, where θ is an affine transformation of x. Finally, as explained by Nazarov and Kreis (2021), our algorithms cannot discriminate between true interactions and fake interactions originating from common hidden regulators such as transcription factors. It is necessary to conduct a further biological analysis to identify the relevant findings.

The rest of the article is organized as follows. Section 2 describes the data we use. Section 3 presents a modicum of optimal transport theory. Section 4 introduces our algorithms. Section 5 evaluates the performances of the algorithms in various simulation settings. Section 6 illustrates the real data application. Section 7 closes the study on a discussion.

2 Data

2.1 Presentation

The data analysed herein cover RNA-seq data obtained in the striatum of the allelic series of HD knock-in mice (poly Q lengths: Q20, Q80, Q92, Q111, Q140, and Q175) at 2-month, 6-month, and 10-month of age. For each combination of poly Q length and age, eight mice were sacrificed (four females and four males). After preprocessing (Mégret et al., 2020, Methods section), the final data set consists of M=13,616 mRNA profiles, X:={x1,,xM}Rd, and of N=1,143 miRNA profiles, Y:={y1,,yN}Rd with d=15.

Informally, we look for couples (m,n)[[M]]×[[N]]:={1,,M}×{1,,N} such that the nth miRNA induces the degradation of the mth mRNA or blocks its translation into proteins, or both. We are guided by the strong biological hypothesis that if that is the case, then the profile yn of the former is similar to minus the profile xm of the latter—then xm and yn exhibit what we call a mirroring relationship. Of note, it is expected that a single miRNA can target several mRNAs.

The actual mirroring relationships can be more or less acute, for instance, because of threshold effects, or of multiple miRNAs targeting the same mRNA, or of a single miRNA targeting several mRNAs. Therefore, instead of rigidly using comparisons between xm and yn, our algorithms will learn from the data a relevant transformation θΘ (in a parametric class Θ of transformations that includes minus the identity) and use comparisons between θ(xm) and yn.

Figure 1 exhibits two profiles xm and yn that showcase a mirrored similarity. The corresponding miRNA and mRNA, Mir20b (which may inhibit cerebral ischemia-induced inflammation in rats; Zhao et al., 2019), and the aryl-hydrocarbon receptor repressor (Ahrr) are believed to interact in the striatum of HD model knock-in mice (Mégret et al., 2020).

(a): profile xm of a mRNA (Ahrr). (b): profile yn of a miRNA (Mir20b). It is believed that Mir20b targets Ahrr. mRNA = messenger-RNA; miRNA = micro-RNA.
Figure 1.

(a): profile xm of a mRNA (Ahrr). (b): profile yn of a miRNA (Mir20b). It is believed that Mir20b targets Ahrr. mRNA = messenger-RNA; miRNA = micro-RNA.

2.2 A brief data analysis

So as to give a sense of the distribution of the data, we propose two kinds of visual summaries. The first one uses Lloyd’s k-means algorithm (Lloyd, 1982) to build synthetic profiles representing the real profiles x1,,xM on the one hand and y1,,yN on the other hand. The second one uses kernel density estimators of the jth component of x1,,xM on the one hand and of y1,,yN on the other hand, for each 1jd. The visual summaries are presented in Section A of the online supplementary material.

3 Elements of optimal transport

Let Ω:={ω(R+)M|m[[M]]ωm=1} be the (M1)-dimensional simplex and ω¯:=M11M, where 1MRM is the vector with all its entries equal to 1. For any ωΩ, define

and let μXω:=m[[M]]ωmδxm and νY:=N1n[[N]]δyn be the ω-weighted empirical measure attached to X and the empirical measure attached to Y, respectively. An element P of Π(ω) represents a joint law on X×Y with marginals μXω and νY.

The celebrated Monge–Kantorovich problem (Peyré & Cuturi, 2019, Chapter 2) consists in finding a joint law over X×Y with marginals μXω¯ and νY that minimizes the expected cost of transport with respect to some cost function c:Rd×RdR+. We focus on c given by c(x,y):=xy22 (the squared Euclidean norm in Rd). Specifically, denoting CX,YRM×N, the cost matrix given by (CX,Y)mn:=c(xm,yn) for each (m,n)[[M]]×[[N]], the problem consists in solving minPΠ(ω¯)CX,Y,PF where CX,Y,PF:=(m,n)[[M]]×[[N]](CX,Y)mnPmn is the P-specific expected cost of transport from X to Y.

It is well known that it is very rewarding from a computational viewpoint to consider a regularized version of the above problem (Peyré & Cuturi, 2019, Chapter 4). The penalty term is proportional to the discretized entropy of P, that is, to E(P):=(m,n)[[M]]×[[N]]]Pmn(logPmn1). The regularized problem (presented here for any ωΩ beyond the case ω=ω¯) consists, for some user-supplied γ>0, in finding Pγ that solves

(1)

One of the advantages of entropic regularization is that one can solve (1) efficiently using the Sinkhorn–Knopp matrix scaling algorithm.

Finally, following Genevay et al. (2018), we use Wγ to define the so-called Sinkhorn loss between μXω (any ωΩ) and νY as

This loss interpolates between W0(μXω,νY) and the maximum mean discrepancy of μXω relative to νY (Genevay et al., 2018, Theorem 1). Paraphrasing the abstract of Genevay et al. (2018), the interpolation allows to find ‘a sweet spot’ leveraging the geometry of optimal transport and the favourable high-dimensional sample complexity of maximum mean discrepancy, which comes with unbiased gradient estimates.

4 Optimal transport-based machine learning

In this section, we introduce two co-clustering algorithms and one matching algorithm, all based on the solution of a master optimization programme. The optimization programme is presented in Section 4.1 and the algorithms are presented in Section 4.2.

4.1 Stage 1: the master optimization programme and how to solve it

We introduce a parametric model Θ consisting of affine mappings θ:RdRd of the form xθ(x)=θ1x+θ2, where θ1Rd×d and θ2Rd. The formal definition of Θ is given in Section C of the online supplementary material. Each θΘ is a candidate to formalize the aforementioned mirroring relationship. The set Θ imposes constraints on the matrices θ1, in particular that their diagonals are made of negative values. Of course, minus identity belongs to Θ. The parametrization is identifiable, in the sense that θ=θ implies (θ1,θ2)=(θ1,θ2). It is noteworthy that any identifiable, regular model Θ could be used. We focus on Θ as defined in Section C of the online supplementary material because of the application that we consider in Section 6 (and in Section 5).

By analogy with Section 3 we introduce, for any θΘ, ωΩ, and γ>0, θ(X):={θ(x1),,θ(xM)} the image of X by θ; the ω-weighted empirical measure attached to θ(X), μθ(X)ω:=m[[M]]ωmδθ(xm); the cost matrix Cθ(X),Y given by (Cθ(X),Y)mn:=c(θ(xm),yn) for each (m,n)[[M]]×[[N]]; and

(2)

where Cθ(X),Y,PF:=(m,n)[[M]]×[[N]](Cθ(X),Y)mnPmn is the P-specific expected cost of transport from θ(X) to Y.

Fix arbitrarily ωΩ. The first programme that we introduce is the ω-specific programme

(3)

where we are interested in the minimizer θ^ that solves (3) and in the optimal joint matrix P^Π(ω) that solves

In words, we look for an ω-specific optimal mirroring function θ^ and its ω-specific optimal transport plan P^.

How to choose ω? We decide to optimize with respect to ω as well. This additional optimization is relevant because we do not expect to associate a yn to every xm eventually at the co-clustering stage. So, our master programme is

(4)

where we are interested in the minimizer (ω^,θ^)  and in the optimal matrix P^Π(ω^) that solves

(5)

We propose to solve (4) iteratively by updating ω and then θ. At round t, given ωt, we make one step of mini-batch gradient descent to derive θt+1 from θt (here, we notably rely on the Sinkhorn–Knopp algorithm). Given θt+1, ωt+1 is chosen proportional to the vector in (R+)M whose mth component equals h1n[[N]]φ((ynθt+1(xm))/h) where φ is the standard normal density and h is the arithmetic mean of the c(yn,yn) for all nn[[N]]. Eventually, once the final round T is completed, we compute P~Π(ωT) that solves

(again, we rely on the Sinkhorn–Knopp algorithm).

The algorithm to solve (4) is summarized in Procedure 1 of the online supplementary material. We have no guarantee that it converges. Note, however, that using the Sinkhorn–Knopp algorithm to solve (5) for a given (ω^,θ^) is known to converge (Peyré & Cuturi, 2019, Theorem 4.2).

In light of Alvarez-Melis (2019, Section 1.3, page 25), we inject problem-specific knowledge onto two of the three main components of the transportation problem: the representation spaces (via the mapping θ) and the marginal constraints (via the weight ω), leaving aside the cost function. Furthermore, we resort to mini-batch gradient descent because the algorithmic complexity prevents the direct computation using the whole data set. A theoretical analysis of this practice is proposed in Fatras et al. (2020).

We can now exploit P~ so as to derive relevant associations between mRNAs and miRNAs. We propose two approaches. On the one hand, the first approach outputs bona fide co-clusters. We expect that the co-clusters can associate many mRNAs with many miRNAs, thus making it difficult to interpret and analyse the results. On the other hand, the second approach rather matches each mRNA with at most k miRNAs and each miRNA with at most k mRNAs (k and k are user-supplied integers). Details follow.

4.2 Stage 2: co-clustering or matching

4.2.1 Co-clustering

To carry out the co-clustering task once P~ has been derived, we propose to rely either on SCC (Dhillon, 2001), applying it once or twice, or co-clustering based on latent block models (Govaert & Nadif, 2010). Of course, any other co-clustering algorithm could be used as well. Specifically, we develop the following algorithms (the acronym WTOT stands for weighted transformation optimal transport).

WTOT-SCC1. Algorithm WTOT-SCC1 applies SCC once to build bona fide co-clusters based on P~. It is required to provide a number of clusters. We rely on a criterion involving graph modularity to learn from the data a relevant number of clusters (Ailem et al., 2016, Sections 2 and 4).

In our simulation study, we also consider algorithm WTOT-SCC1 *, an oracular version of WTOT-SCC1 that benefits from relying on the true number of clusters. This allows to assess how relevant is the learned number of clusters in WTOT-SCC1.

WTOT-SCC2. Algorithm WTOT-SCC2 applies SCC twice to build bona fide co-clusters based on P~. It proceeds in three successive steps.

  • In step 1, WTOT-SCC2 applies SCC a first time to derive an initial co-clustering. A relevant number of co-clusters is learned as in WTOT-SCC1.

  • In step 2, WTOT-SCC2 selects and removes some rows and columns corresponding to mRNAs and miRNAs that are deemed irrelevant. The selection is based on a numerical criterion computed from P~. In our simulation study (Section 5), all rows and columns that correspond to diagonal blocks with a variance larger than two times the overall variance of P~ are selected and removed. In the real data application (Section 6), we implement and use a different procedure.

  • In step 3, WTOT-SCC2 applies SCC a second time, the relevant number of co-clusters being learned as in WTOT-SCC1.

In our simulation study, we also consider algorithm WTOT-SCC2 *, an oracular version of WTOT-SCC2, that provided the true number of clusters for its third step. This allows to assess how relevant is the sub-procedure to learn the numbers of clusters in WTOT-SCC2.

WTOT-BC. Algorithm WTOT-BC applies the so-called block clustering algorithm to build bona fide co-clusters based on P~. It is required to provide the row- and column-specific numbers of clusters. We rely on an integrated completed likelihood criterion (Brault et al., 2014) to learn relevant values from the data.

The co-clusters obtained via WTOT-SCC1, WTOT-SCC2, or WTOT-BC should reveal the interplay between the (remaining, as far as WTOT-SCC2 is concerned) mRNAs and miRNAs in HD.

4.2.2 Matching

The larger P~mn is, the more we are encouraged to believe that the profiles xm and yn reveal a strong relationship between the mth mRNA and the nth miRNA. This simple rule prompts the following matching procedure applied once P~ has been derived.

WTOT-matching. Fix two integers k,k1 and let τ~ be the quantile of order q of all the entries of P~. For every m[[M]] and n[[N]], we introduce

where P~m(1),,P~m(k) are the k largest values among P~m1,,P~mN and P~(1)n,,P~(k)m are the k largest values among P~1n,,P~Mn. For instance, Nm0 identifies the miRNAs that are the k more likely to have a strong relationship with the mth mRNA. However, this does not qualify them as relevant matches yet. In order to keep only matches that are really relevant, we also introduce, for each m[[M]] and n[[N]],

Algorithm WTOT-matching outputs the collections {Nm:m[[M]]} and {Mn:n[[N]]}.

Now if, for instance, nNm, then yn is among the k miRNA profiles upon which P~ puts more mass when it ‘transports’ xm onto Y and  xm is among the k mRNA profiles upon which P~ puts more mass when it ‘transports’ yn onto X.

Note that we expect that some Nm and Mn will be empty, depending on k and k. The mRNAs and miRNAs worthy of interest are those for which Nm and Mn are not empty. The integers k and k should be chosen relatively small, to make their interpretation and analysis feasible, but not too small because otherwise few matchings will be made.

In the simulation study, we use k=k between 2 and 200, depending on the simulation scheme. Moreover, we choose q=50% so that τ~ is the median of the entries of P~.

4.3 Implementation

Our code is written in python and is available here.2 We adapt the Sinkhorn algorithm implemented by Aude Genevay and is available here.3 The stochastic gradient descents rely on the machine learning framework pytorch. We use the implementation of SCC available in the sklearn  python module.4 To learn a relevant number of clusters, we rely on the coclustpython module.5 Finally, we rely on the blockcluster  R package6 to carry out block clustering.

Our algorithms bear a similarity to the one developed in Laclau et al. (2017). The main differences are (i) our use of the parametric model Θ and weights ω and (ii) the fact that we apply SCC or block clustering to the approximation of the optimal transport matrix P~. Our algorithms also bear a similarity to Yang et al. (2021), a fast and certifiable point cloud registration algorithm. We plan to study the similarities and differences closely.

5 Simulation study

To assess the performances of the algorithms described in Section 4.1, we conduct a simulation study in three parts. As we go on, the task gets more difficult. In all cases, the laws of the synthetic observations are mixtures of Gaussian laws. Overall, 12 simulation scenarios are considered.

Section 5.1 lists all the algorithms that compete in the simulation study. Section 5.2 presents the measure of discrepancy between two co-clusterings and the matching criteria that we rely on to assess how well the algorithms perform. Sections 5.35.5 present in turn the data-generating mechanisms and report the results in terms of co-clustering and matching performances.

We think that the first two simulation schemes produce unrealistic data and, on the contrary, that the third simulation scheme produces somewhat realistic data. The diversity of the synthetic mRNA and miRNA profiles obtained using Lloyd’s k-means algorithm in order to summarize the variety of real profiles, see Section A.1 of the online supplementary material, encouraged us to rely on mixtures in order to simulate data. We chose mixtures of Gaussian laws because of their ubiquity and versatility.

In Section 5.3, the weights of the mixtures and parameters of the Gaussian laws are chosen by us. Moreover, the two mixtures (to simulate X and Y) share the same weights and induce a perfect mirroring relationship (details below), thus making the co-clustering task less difficult. In Section 5.4, the weights of the mixtures and parameters of the Gaussian laws are randomly generated. Moreover, the two mixtures do not share the same weights and do not induce a perfect mirroring relationship anymore, so that the co-clustering task is much more difficult. Finally, in Section 5.5, we use plus or minus real, randomly chosen miRNA profiles and  0d as means of the Gaussian laws to simulate X and Y, in such a way that there is no perfect mirroring relationship. We think that the corresponding co-clustering task is the most difficult of the three.

5.1 Listing all competing algorithms

We run and compare algorithms WTOT-SCC1, WTOT-SCC2 (and their oracular counterparts WTOT-SCC1 * and WTOT-SCC2 *), and WTOT-BC on the one hand (see Section 4.2.1) and Co-clustering through Optimal Transport with Gromov-Wassertein Distance (CCOT-GWD) and Co-clustering through Optimal Transport with Gromov-Wassertein Barycenters (CCOT-GWB ) on the other hand (see Section B.1 of the online supplementary material). In addition, we also run algorithm WTOT-matching (see Section 4.2.2).

For CCOT-GWD, we set γ=0.1 in Equation (1) of the online supplementary material. For CCOT-GWB, we set γ=0.05 in Equation (2) of the online supplementary material. We tried several values and chose the ones that yielded the smallest errors.

In view of Procedure 1 of the online supplementary material, we choose M~ and N~ equal to approximately M/2 and N/2, respectively, (η,γ0)=(1,0) (no decay), T=500, and an initial mapping θ0 drawn randomly (see Section C of the online supplementary material for details).

We checked that varying M~ and N~ around M/2 and N/2 had little impact if any. Likewise, the randomly drawn initial mapping θ0 had little impact if any. Moreover, varying γ_ in [(1/2)×γ*;2×γ*] with γ*=mean{xx2:x,xX} also had little impact if any. We did not rigorously check the impact of the total number of iterations T, but we observed that numerical convergence seemed to be reached for fewer iterations than T. Finally, we did not challenge the choice of h=mean{yy2:y,yY}.

5.2 Assessing performances

5.2.1 A measure of discrepancy between two co-clusterings

In order to assess the quality of the co-clusterings that we derive, and to compare performances, we propose to rely on a commonly used measure of discrepancy between two co-clusterings. Its definition extends that of a measure of discrepancy between partitions that we first present.

Let z and z be two partitions of the set [[M]] into K components, taking the form of matrices z=(zmk)m[[M]],k[[K]] and z=(zmk)m[[M]],k[[K]] with convention zmk=1 (respectively, zmk=1) if m belongs to component k of z (respectively, z) and 0 otherwise. The corresponding confusion matrix C(z,z)=(ck)k,[[K]] is given by ck:=m[[M]]zmkzm (every k,[[K]]). Suppose that the labels of the partitions z and z are such that

where ΣK is the set of permutations of the elements of [[K]]. Then the proportion

(6)

is a natural measure of discrepancy between z and z. As suggested earlier, the measure can be extended to compare pairs of partitions.

Consider now (z,w) and (z,w) two pairs of partitions, z and z partitioning [[M]] into K components, w and w partitioning [[N]] into L components. We represent (z,w) and (z,w) with

and

where umnk:=zmk×wn and umnk:=zmk×wn (for every m[[M]],n[[N]],k[[K]],[[L]]), supposing again that the labels of the partitions z, z on the one hand and w, w on the other hand maximize the traces of the confusion matrices C(z,z) and C(w,w) as above (then two pairs of partitions define without ambiguity a co-clustering). By analogy with (6), the proportion

(7)

is a measure of discrepancy between (z,w) and (z,w). It can be shown that

(8)

In the rest of this section, we report means and standard deviations, computed across 30 independent replications of each analysis, of the above measure of discrepancy between the derived partition/co-clustering and the true one.

5.2.2 Matching criteria

Set arbitrarily m[[M]] and suppose that we have derived the subset Nm[[N]] that matches xm to {yn:nNm}. Suppose moreover that in reality xm is matched to {yn:nNm} for some Nm[[N]]. We propose to use three real-valued criteria to compare Nm with Nm.

Let TPm:=card(NmNm), FPm:=card(Nm(Nm)c), TNm:=card((Nm)c(Nm)c), and FNm:=card((Nm)cNm) be the numbers of true positives, false positives, true negatives, and false negatives, respectively. The so-called m-specific

  • precision: TPm/(TPm+FPm),

  • sensitivity: TPm/(TPm+FNm),

  • specificity: TNm/(TNm+FPm)

quantify how similar are Nm and Nm, larger values indicating better concordance.

In the rest of this section, we report means and standard deviations, computed across 30 independent replications of each analysis, of the average of the m-specific precision, sensitivity, and specificity. We also report means and standard deviations, computed across the same 30 independent replications of each analysis, of

the row- and column-specific averages of the cardinalities of the sets Nm and Mn that are not empty.

5.3 First simulation study

For four different choices of the hyperparameters M200,N200,K2,d2, μ1,,μKRd, σR+*, α(R+)K such that k[[K]]αk=1, we sample independently x1,,xM from the mixture of Gaussian laws

(9)

and y1,,yN from

(10)

One way to sample x from the mixture (9) consists in sampling a latent label u in [[K]] from the multinomial law with parameter (1;α1,,αK) then in sampling x from the Gaussian law N(μu,σ2Idd). Similarly, sampling y from the mixture (10) can be carried out by sampling a latent label v in [[K]] from the multinomial law with parameter (1;α1,,αK) then by sampling y from the Gaussian law N(μv,σ2Idd). We think of x and y as having a mirrored relationship if u=v. In this light, the challenge that we tackle consists in finding such relationships without having access to the latent labels.

The four configurations that we investigate and the results we comment upon can be found in Section B.2 of the online supplementary material.

5.4 Second simulation study

The second simulation scheme also relies on mixtures of Gaussian laws, but the means and weights are generated randomly from a Gaussian determinantal point process (DPP) for the former and from a Dirichlet law for the latter. More specifically, given the hyperparameters M200,N200,KL3, σR+*,

  1. we sample μ1,,μK from a Gaussian DPP on [0,1]2 with a kernel proportional to xexp(x/0.0522) conditionally on obtaining exactly K points (Baddeley & Turner, 2005; Lavancier et al., 2015);

  2. independently, we sample α(R+)K and β(R+)L from the Dirichlet laws with parameters 71K and 71L;

  3. we sample independently x1,,xM from the mixture of Gaussian laws
    and y1,,yN from

We use a DPP to generate μ1,,μK to avoid the arbitrary choice of the mean parameters in such a way that the randomly picked μ1,,μK are dispersed in [0,1]2 (because the DPP is a repulsive point process).

The four configurations that we investigate and the results we comment upon can be found in Section B.3 of the supplementary material.

5.5 Third simulation study

5.5.1 Simulation scheme

The third simulation scheme aspires to generate synthetic data sets X and Y that are more similar to the real data sets than those generated in the two first simulation studies. Once again, we rely on mixtures of Gaussian laws. This time, however, the various means are neither chosen arbitrarily (unlike in the first simulation study) nor drawn randomly (unlike in the second simulation study) but are sampled in the real collection of miRNAs. Moreover, the weights of the mixtures are random.

Specifically, given the hyperparameters K3, λx,λx0, λy,λy0, and σ,σR+* (with σ much larger than σ),

  1. we sample μ1,,μK uniformly without replacement from the collection of observed miRNA profiles conditionally on minkkμkμk22;

  2. independently, we sample independently (m11),,(mK1) from the Poisson law with parameter λx, (n11),,(nK1) from the Poisson law with parameter λy, (mK+11) and (nK+11) from the Poisson laws with parameter λx and λy;

  3. for each 1kK, we sample independently xk,1,,xk,mk from the Gaussian law N(μk,σ2Id18) and yk,1,,yk,nk from the Gaussian law N(μk,σ2Id18). Moreover, we also sample independently xK+1,1,,xK+1,mK+1 and yK+1,1,,yK+1,nK+1 from the Gaussian law N(018,(σ)2Id18).

Here, we think of x and y as having a mirrored relationship if there exists k[[K]] such that x and y are drawn from the laws N(μk,σ2Id18) and N(μk,σ2Id18). Furthermore, we view x and y drawn from the law N(018,(σ)2Id18) as noise.

Table 1 describes the four configurations that we investigate. The larger K is the more challenging the configuration is.

Table 1.

Four different configurations for the third simulation scheme

Configuration(λx,λy)(λx,λy)K(σ,σ)
C1(50,50)(50,10)3(0.1,5)
C2(15,15)(0,0)15(0.01,5)
C3(15,15)(30,30)15(0.01,5)
C4(15,15)(30,30)15(0.1,5)
Configuration(λx,λy)(λx,λy)K(σ,σ)
C1(50,50)(50,10)3(0.1,5)
C2(15,15)(0,0)15(0.01,5)
C3(15,15)(30,30)15(0.01,5)
C4(15,15)(30,30)15(0.1,5)

Note. The larger [[4]] is, the more challenging configuration Cℓ is.

Table 1.

Four different configurations for the third simulation scheme

Configuration(λx,λy)(λx,λy)K(σ,σ)
C1(50,50)(50,10)3(0.1,5)
C2(15,15)(0,0)15(0.01,5)
C3(15,15)(30,30)15(0.01,5)
C4(15,15)(30,30)15(0.1,5)
Configuration(λx,λy)(λx,λy)K(σ,σ)
C1(50,50)(50,10)3(0.1,5)
C2(15,15)(0,0)15(0.01,5)
C3(15,15)(30,30)15(0.01,5)
C4(15,15)(30,30)15(0.1,5)

Note. The larger [[4]] is, the more challenging configuration Cℓ is.

5.5.2 Results

Thirty times, independently, we simulated synthetic data sets X and Y under the simulation scheme described above, then we applied the various algorithms as presented in Section 5.1. Table 2 summarizes the results of the seven algorithms listed in Section 5.1 that rely on bona fide co-clustering algorithms (see Section 4.2.1). Tables 3 and 4 summarize the results of our algorithm that relies on matching (see Section 4.2.2).

Table 2.

Mean (± standard deviation) computed across the 30 independent replications of the co-clustering discrepancy obtained for configurations C1, C2, C3, and C4

The WTOT(…) algorithmsThe CCOT-(…) algorithms
WTOT-SCC1 *WTOT-SCC1WTOT-SCC2 *WTOT-SCC2WTOT-BC *CCOT-GWDCCOT-GWB
C10.106±0.10.203±0.1350.101±0.0560.194±0.1160.265±0.2550.496±0.160.902±0.007
C20.209±0.1310.252±0.1820.262±0.1410.345±0.2050.938±0.0230.971±0.026
C30.609±0.1130.693±0.1540.385±0.1510.521±0.1980.926±0.0270.987±0.002
C40.63±0.1410.751±0.1450.435±0.1970.6±0.2330.939±0.0270.987±0.002
The WTOT(…) algorithmsThe CCOT-(…) algorithms
WTOT-SCC1 *WTOT-SCC1WTOT-SCC2 *WTOT-SCC2WTOT-BC *CCOT-GWDCCOT-GWB
C10.106±0.10.203±0.1350.101±0.0560.194±0.1160.265±0.2550.496±0.160.902±0.007
C20.209±0.1310.252±0.1820.262±0.1410.345±0.2050.938±0.0230.971±0.026
C30.609±0.1130.693±0.1540.385±0.1510.521±0.1980.926±0.0270.987±0.002
C40.63±0.1410.751±0.1450.435±0.1970.6±0.2330.939±0.0270.987±0.002

Note. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.

Table 2.

Mean (± standard deviation) computed across the 30 independent replications of the co-clustering discrepancy obtained for configurations C1, C2, C3, and C4

The WTOT(…) algorithmsThe CCOT-(…) algorithms
WTOT-SCC1 *WTOT-SCC1WTOT-SCC2 *WTOT-SCC2WTOT-BC *CCOT-GWDCCOT-GWB
C10.106±0.10.203±0.1350.101±0.0560.194±0.1160.265±0.2550.496±0.160.902±0.007
C20.209±0.1310.252±0.1820.262±0.1410.345±0.2050.938±0.0230.971±0.026
C30.609±0.1130.693±0.1540.385±0.1510.521±0.1980.926±0.0270.987±0.002
C40.63±0.1410.751±0.1450.435±0.1970.6±0.2330.939±0.0270.987±0.002
The WTOT(…) algorithmsThe CCOT-(…) algorithms
WTOT-SCC1 *WTOT-SCC1WTOT-SCC2 *WTOT-SCC2WTOT-BC *CCOT-GWDCCOT-GWB
C10.106±0.10.203±0.1350.101±0.0560.194±0.1160.265±0.2550.496±0.160.902±0.007
C20.209±0.1310.252±0.1820.262±0.1410.345±0.2050.938±0.0230.971±0.026
C30.609±0.1130.693±0.1540.385±0.1510.521±0.1980.926±0.0270.987±0.002
C40.63±0.1410.751±0.1450.435±0.1970.6±0.2330.939±0.0270.987±0.002

Note. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.

Table 3.

Mean (± standard deviation) computed across the 30 independent replications of k~r, precision, sensitivity, and specificity of the m-specific matchings averaged across all mRNAs for configurations C1 (left) and C4 (right)

k=kk~rPrecisionSensitivitySpecificity
C1107.748±0.4460.973±0.030.156±0.011.0±0.0
C13025.888±1.4180.972±0.0290.526±0.0321.0±0.0
C15045.521±2.4410.944±0.0250.916±0.041.0±0.001
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C16051.365±3.3350.919±0.0240.993±0.0110.997±0.004
C17055.296±3.3120.881±0.0341.0±0.00.985±0.01
C453.293±0.0960.895±0.0230.185±0.0121.0±0.0
C4107.278±0.3030.899±0.0220.474±0.0291.0±0.0
C41511.982±0.5780.888±0.020.787±0.041.0±0.0
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
C42519.138±0.890.762±0.0320.997±0.0050.989±0.003
C43022.578±1.110.671±0.041.0±0.00.978±0.004
k=kk~rPrecisionSensitivitySpecificity
C1107.748±0.4460.973±0.030.156±0.011.0±0.0
C13025.888±1.4180.972±0.0290.526±0.0321.0±0.0
C15045.521±2.4410.944±0.0250.916±0.041.0±0.001
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C16051.365±3.3350.919±0.0240.993±0.0110.997±0.004
C17055.296±3.3120.881±0.0341.0±0.00.985±0.01
C453.293±0.0960.895±0.0230.185±0.0121.0±0.0
C4107.278±0.3030.899±0.0220.474±0.0291.0±0.0
C41511.982±0.5780.888±0.020.787±0.041.0±0.0
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
C42519.138±0.890.762±0.0320.997±0.0050.989±0.003
C43022.578±1.110.671±0.041.0±0.00.978±0.004

Note. mRNA = messenger-RNA.

Table 3.

Mean (± standard deviation) computed across the 30 independent replications of k~r, precision, sensitivity, and specificity of the m-specific matchings averaged across all mRNAs for configurations C1 (left) and C4 (right)

k=kk~rPrecisionSensitivitySpecificity
C1107.748±0.4460.973±0.030.156±0.011.0±0.0
C13025.888±1.4180.972±0.0290.526±0.0321.0±0.0
C15045.521±2.4410.944±0.0250.916±0.041.0±0.001
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C16051.365±3.3350.919±0.0240.993±0.0110.997±0.004
C17055.296±3.3120.881±0.0341.0±0.00.985±0.01
C453.293±0.0960.895±0.0230.185±0.0121.0±0.0
C4107.278±0.3030.899±0.0220.474±0.0291.0±0.0
C41511.982±0.5780.888±0.020.787±0.041.0±0.0
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
C42519.138±0.890.762±0.0320.997±0.0050.989±0.003
C43022.578±1.110.671±0.041.0±0.00.978±0.004
k=kk~rPrecisionSensitivitySpecificity
C1107.748±0.4460.973±0.030.156±0.011.0±0.0
C13025.888±1.4180.972±0.0290.526±0.0321.0±0.0
C15045.521±2.4410.944±0.0250.916±0.041.0±0.001
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C16051.365±3.3350.919±0.0240.993±0.0110.997±0.004
C17055.296±3.3120.881±0.0341.0±0.00.985±0.01
C453.293±0.0960.895±0.0230.185±0.0121.0±0.0
C4107.278±0.3030.899±0.0220.474±0.0291.0±0.0
C41511.982±0.5780.888±0.020.787±0.041.0±0.0
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
C42519.138±0.890.762±0.0320.997±0.0050.989±0.003
C43022.578±1.110.671±0.041.0±0.00.978±0.004

Note. mRNA = messenger-RNA.

Table 4.

Mean (± standard deviation) computed across the 30 independent replications of k~r or k~c, precision, sensitivity, and specificity of the m-specific matchings (left) and n-specific matchings (right) averaged across all mRNAs (left) and all miRNAs (right)

k=kk~rPrecisionSensitivitySpecificity
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C22016.203±0.9560.955±0.0160.965±0.0210.997±0.001
C32015.552±0.8770.854±0.0240.968±0.0190.997±0.001
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
k=kk~rPrecisionSensitivitySpecificity
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C22016.203±0.9560.955±0.0160.965±0.0210.997±0.001
C32015.552±0.8770.854±0.0240.968±0.0190.997±0.001
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
k=kk~cPrecisionSensitivitySpecificity
C15549.056±3.4610.898±0.060.971±0.0260.981±0.009
C22016.371±0.8120.953±0.0180.963±0.0230.997±0.001
C32015.879±0.6910.804±0.0250.969±0.0180.993±0.001
C42015.867±0.6350.812±0.0320.961±0.0210.993±0.002
k=kk~cPrecisionSensitivitySpecificity
C15549.056±3.4610.898±0.060.971±0.0260.981±0.009
C22016.371±0.8120.953±0.0180.963±0.0230.997±0.001
C32015.879±0.6910.804±0.0250.969±0.0180.993±0.001
C42015.867±0.6350.812±0.0320.961±0.0210.993±0.002

Note. mRNA = messenger-RNA; miRNA = micro-RNA.

Table 4.

Mean (± standard deviation) computed across the 30 independent replications of k~r or k~c, precision, sensitivity, and specificity of the m-specific matchings (left) and n-specific matchings (right) averaged across all mRNAs (left) and all miRNAs (right)

k=kk~rPrecisionSensitivitySpecificity
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C22016.203±0.9560.955±0.0160.965±0.0210.997±0.001
C32015.552±0.8770.854±0.0240.968±0.0190.997±0.001
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
k=kk~rPrecisionSensitivitySpecificity
C15549.108±3.0180.93±0.0210.972±0.0250.999±0.002
C22016.203±0.9560.955±0.0160.965±0.0210.997±0.001
C32015.552±0.8770.854±0.0240.968±0.0190.997±0.001
C42015.935±0.8640.843±0.0220.96±0.0230.997±0.001
k=kk~cPrecisionSensitivitySpecificity
C15549.056±3.4610.898±0.060.971±0.0260.981±0.009
C22016.371±0.8120.953±0.0180.963±0.0230.997±0.001
C32015.879±0.6910.804±0.0250.969±0.0180.993±0.001
C42015.867±0.6350.812±0.0320.961±0.0210.993±0.002
k=kk~cPrecisionSensitivitySpecificity
C15549.056±3.4610.898±0.060.971±0.0260.981±0.009
C22016.371±0.8120.953±0.0180.963±0.0230.997±0.001
C32015.879±0.6910.804±0.0250.969±0.0180.993±0.001
C42015.867±0.6350.812±0.0320.961±0.0210.993±0.002

Note. mRNA = messenger-RNA; miRNA = micro-RNA.

Table 2. We first focus on configuration C1. We note that WTOT-SCC1 and WTOT-SCC2 perform similarly, much better than CCOT-GWD and CCOT-GWB, better than the oracular algorithm WTOT-BC *, but not as well as the oracular algorithms WTOT-SCC1 * and WTOT-SCC2 *.

We now turn to configurations C2, C3, and C4. Configuration C3 is more challenging than configuration C2 because it shares the same hyperparameters as C2 except for (λx,λy) (which drives the number of noisy data points), set to (0,0) in C2 and to (30,30) in C3. Similarly, configuration C4 is more challenging than configuration C3 because it shares the same hyperparameters as C3 except for σ (the standard deviation of the Gaussian variations around the mean profiles), set to 0.01 in C3 and to 0.1 in C4. The comparisons will not concern algorithms WTOT-BC * (which never converges in these simulations), CCOT-GWD, and CCOT-GWB (which perform very poorly).

In configuration C2, in the absence of noisy data points, algorithm WTOT-SCC1 performs slightly better than WTOT-SCC2, as well as the oracular algorithm WTOT-SCC2 *, and almost as well as the oracular algorithm WTOT-SCC1 * (in average). In configurations C3 and C4, the introduction of noisy data points and then the increase in variability strongly degrade the performances of WTOT-SCC1, WTOT-SCC1 * and, to a lesser extent, those of WTOT-SCC2 and WTOT-SCC2 *. Algorithm WTOT-SCC2 outperforms WTOT-SCC1 and the oracular algorithm WTOT-SCC1 * too.

Tables 3 and 4. Table 3 illustrates the influence of k=k on the performances of algorithm WTOT-matching in configurations C1 and C4. In each configuration, the values k=k are chosen in the vicinity of λx or λy (50 in configuration C1 and 15 in configuration C4). For specificity and sensitivity, we observe the same patterns in configurations C1 and C4: specificity is not impacted much as k=k grows whereas sensitivity increases dramatically. Precision remains high in configuration C1 for all choices of k=k. In configuration C4, precision remains high for k=k ranging between 5 and 20, then it decreases when k=k grows from 25 to 30.

Table 4 summarizes the results of WTOT-matching in configurations C1, C2, C3, and C4 for a specific choice of k=k in terms of the row- and column-specific averages k~r and k~c, precision, sensitivity, and specificity. In each configuration, we chose the value of k=k among many retrospectively, so that the overall performance (in terms of precision, sensitivity, and specificity) is good. The left-hand-side (m-specific) and right-hand-side (n-specific) tables in Table 4 are very similar. In configurations C1 and C2, all precision, sensitivity, and specificity are quite satisfying. In configurations C3 and C4, sensitivity and specificity are quite satisfying as well while precision falls below 0.86.

6 Illustration on real data: matching mRNA and miRNA in HD mice

Next, we apply algorithms WTOT-SCC2 and WTOT-matching to discover patterns hidden in RNA-seq data obtained in the striatum of HD model mice. As explained in Section 1, multi-dimensional mRNA and miRNA sequencing data were obtained in the striatum of these mice (Langfelder et al., 2016, 2018) and an earlier analysis of these data using shape analysis concepts (Mégret et al., 2020) has demonstrated their value.

6.1 Tuning

Specifically, in view of Procedure 1 of the online supplementary material, we choose M~=1,024, N~=512, and T=500. The entries of the 3×5 matrices θ~1a,θ~1b,θ~1c are constrained to take their values in ]10,0[ (for WTOT-SCC2) or ]2,0[ (for WTOT-matching), ]0.2,0.2[ and ]0.2,0.2[, respectively. We also choose (η,γ0)=(0.95,3). Finally, the initial mapping θ0 is drawn randomly.

Furthermore, regarding step 2 of algorithm WTOT-SCC2, we remove rows and columns based on the following loop: 100 times successively, (i) we compute the Kullback–Leibler divergence between each row (renormalized) and the uniform distribution, and then remove the 100 rows with the smallest divergences; then (ii) we compute the Kullback–Leibler divergence between each column (renormalized) and the uniform distribution, and then remove the 5 columns with the smallest divergences. By doing so, we successively get rid of the rows and columns which, viewed as distributions, are too uniform and therefore deemed irrelevant. Finally, we remove all rows for which the (columnwise) sum of the remaining entries of P~ is smaller than one tenth of the maximal (columnwise) sum, and all columns for which the (rowwise) sum of the remaining entries of P~ is smaller than 1/10th of the maximal (rowwise) sum.

6.2 Results

6.2.1 Co-clustering

The selection procedure (step 2 of WTOT-SCC2) keeps 3,409 mRNA profiles (among the 13,616 available in the data set) and 602 miRNA (among the 1,143 available in the data set). Eventually, algorithm WTOT-SCC2 outputs eight co-clusters. The co-clusters’s sizes (numbers of mRNA and miRNA gathered in each co-cluster) are (321,86), (333,30), (261,6), (498,125), (127,5), (708,203), (703,119), and (458,28). Figure 2 represents the averages, computed across all blocks, of the entries of the matrix derived from the optimal transport matrix P~ during step 2 of algorithm WTOT-SCC2 and after its rearrangement. Squares located on the diagonal tend to be slightly darker than the other squares. This reveals that, in average, a pair (xm,yn) of mRNA and miRNA gathered in a diagonal co-cluster tends to exhibit a mirrored relationship that is slightly stronger than those of the form (xm,yn) or (xm,yn) which do not fall in the same co-cluster. However, few of the off-diagonal averages are small in comparison to the on-diagonal averages, a disappointing observation that comes on top of the fact that the co-clusters’ sizes are so large that it is difficult to interpret the results. This makes it even more relevant to focus on algorithm WTOT-matching.

Logarithms of the averages, computed across all blocks, of the entries of the matrix derived from the optimal transport matrix P~ during step 2 of algorithm WTOT-SCC2 and after its rearrangement. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.
Figure 2.

Logarithms of the averages, computed across all blocks, of the entries of the matrix derived from the optimal transport matrix P~ during step 2 of algorithm WTOT-SCC2 and after its rearrangement. WTOT-SCC2 = weighted transformation optimal transport-spectral co-clustering.

6.2.2 Matching

We run the WTOT-matching algorithm with k=k=10 and q=90%. For the anecdote, we observe (k~r,k~c)(1.82,6.04) (recall that k~r,k~c are the row- and column-specific averages of the cardinalities of the sets Nm and Mn that are not empty). We report the parameters that characterize the mapping θ^ in Section C of the online supplementary material.

As an illustration, the mirrored profile (the opposite value of yn) of the Mir20b miRNA is displayed in Figure 3 along with its three matched mRNAs (Ahrr, Cnih3, and Relb) obtained by running algorithm WTOT-matching with k=k=10. Recall that the original profile of Mir20b can be found in Figure 1.

Minus the profile −yn of the Mir20b miRNA (a), and profiles xm of its matched mRNAs, Ahrr (b), Relb (c), and Cnih3 (d). mRNA = messenger-RNA; miRNA = micro-RNA.
Figure 3.

Minus the profile yn of the Mir20b miRNA (a), and profiles xm of its matched mRNAs, Ahrr (b), Relb (c), and Cnih3 (d). mRNA = messenger-RNA; miRNA = micro-RNA.

6.3 Biological analysis of the results

In an effort to guarantee biological relevance to the matchings, we only retain those showing evidence for binding sites as indicated in the databases TargetScan (Lewis et al., 2005), MicroCosm (Betel et al., 2010), and miRDB (Ding et al., 2016). Specifically, a pair (x,y) is retained if and only if the mRNA whose profile is x and the miRNA whose profile is y are both among the 27,355 mRNAs and 1,478 miRNAs appearing in the TargetScan, MicroCosm, and miRDB databases. The 1,247 matchings retained out of the 7,521 output by the WTOT-matching algorithm are all presented on the companion website.7 We stress that we would have obtained fewer matchings if we had excluded from the collections X and Y the profiles of mRNA or miRNA which do not appear in the databases.

Furthermore, we build upon two previous analyses of miRNA regulation in the striatum of HD knock-in-mice (Langfelder et al., 2018; Mégret et al., 2020) to comment on the biological relevance and novelty of our findings. The first analysis (Langfelder et al., 2018) relies on the Weighted Correlation Network Analysis (WGCNA) algorithm, a weighted gene co-expression network analysis which yields clusters of genes whose expression profiles are correlated. The second analysis (Mégret et al., 2020) relies on the MiRAMINT algorithm. MiRAMINT is a pipeline whose main steps consist in (a) carrying out a weighted gene co-expression network analysis, (b) using random forests to select candidate matchings, and (c) using Spearman’s correlation test and a multiple testing procedure to identify the more reliable matchings. We highlight that WGCNA outputs 1,583 mRNA–miRNA matchings showing evidence for binding sites in the databases TargetScan, MicroCosm, and miRDB, which involve only 46 different miRNAs. As for MiRAMINT, it only outputs 31 matchings of which 20 show evidence for binding sites in the databases TargetScan, MicroCosm, and miRDB, involving 14 different miRNAs. The 31 mRNA–miRNA matchings output by MiRAMINT are all presented on the webpage.8

6.3.1 Analysing the overlaps

Three mRNA–miRNA matchings are retained both by the WTOT-matching and WGCNA algorithms: Mir186-Chl1, Mir132-Fam196b, and Mir212-Fam196b. No matchings are retained both by the WTOT-matching and the MiRAMINT algorithms. One pair is retained both by the MiRAMINT and WGCNA algorithms: Mir132-Pafah121.

Figures 5–8 of the online supplementary material present several Venn diagrams that summarize the overlaps between the sets of miRNAs (respectively, mRNAs) which belong to a pair output by the WGCNA, MiRAMINT, and WTOT-matching algorithms. Figure 5 considers all pairs. In an attempt to identify miRNAs that are particularly susceptible to play a distinct role in HD in mice, Figures 6–8 focus on pairs where each miRNA is labelled as ‘peaked’, ‘monotonic’, or ‘neither peaked nor monotonic’ (referring to the shape of the miRNA’s profile).

Section C.1 of the online supplementary material presents a detailed analysis of the Venn diagrams. The take-home messages are

  • the findings of the WTOT-matching algorithm are not in agreement with those of the WGCNA and MiRAMINT algorithms, except when focusing on the mRNAs which belong to a pair output by the WTOT-matching and MiRAMINT algorithms;

  • the WTOT-algorithm retains mRNA–miRNA matchings that we label as peaked whereas neither the WGCNA nor the MiRAMINT algorithms do;

  • in matchings that we label as monotonic, the agreement between the WTOT-matching and WGCNA algorithms is better than that between the WTOT-matching and MiRAMINT algorithms;

  • in matchings that we label as neither peaked nor monotonic, the agreement between the WTOT-matching and WGCNA algorithms is better than that between the WTOT-matching and MiRAMINT algorithms.

6.3.2 Enrichment analysis

Next, we assess and compare the biological significance of the mRNAs retained by the WGCNA, MiRAMINT, and WTOT-matching algorithms. To do so we carry out an enrichment analysis using the EnrichR tools (Chen et al., 2013; Kuleshov et al., 2016; Xie et al., 2021). We consider only top annotations (balancing a small p-value and a large number of hits) as provided by Gene Ontology data9 (biological process and cellular content) and Kyoto Encyclopedia of Genes and Genomes (KEGG) data.10 When necessary, only the top 40 hits are considered so as to guarantee a sufficient level of biological precision. Pubmed searches are also used to assess the biological significance of predicted miRNA regulation.

Figures 9–11 of the online supplementary material present the mRNA–miRNA networks based on the mRNA–miRNA matchings output by the WTOT-matching algorithm, focusing on the matchings which are labelled as peaked, monotonic and neither peaked nor monotonic (in that order). The mRNAs and miRNAs retained by the WGCNA and MiRAMINT algorithms are coloured. The enrichment analysis reveals

  • that the mRNA–miRNA matchings output by the WGCNA algorithm are primarily annotated for axonogenesis,11 which relates to cytoskeleton dynamics and cell morphology;

  • that the matchings output by the MiRAMINT algorithm are primarily annotated for regulation of defense response to virus by host,12 which relates to stress response and innate immunity;

  • that the matchings output by the WTOT-matching algorithm are primarily annotated for extracellular matrix organization (which relates to cell identity),13 due to the matchings labelled as neither peaked nor monotonic, and secondarily annotated for mitigation of host antiviral defense response,14 due to the matchings labelled as monotonic, and for conventional motile cilium,15 due to the matchings labelled as peaked.

Although the numbers of hits in some of these annotations are small, they suggest that the WTOT-matching algorithm is able to uncover a role of miRNA regulation in responding to mutant huntingtin that was not detected by the WGCNA and MiRAMINT algorithms (despite the large number of mRNAs retained by the former).

Section C.2 of the online supplementary material presents a biological analysis of the results. We believe that it substantiates our claim that the WTOT-matching algorithm strikes a good balance between the low and high selectivity of the WGCNA and MiRAMINT algorithms. Moreover, our findings related to striatal alterations in HD mice lead to reconsidering the formerly expressed view on a limited role of miRNA regulation in the striatum of HD mice on a systems level (Mégret et al., 2020).

7 Discussion

We have developed two co-clustering algorithms (WTOT-SCC1 and WTOT-SCC2) and a matching algorithm (WTOT-matching) for the purpose of identifying groups of mRNAs and miRNAs that interact. The algorithms apply in any situation where it is of interest to cluster or match the elements of two data sets based on a parametric model Θ expressing what it means to interact for any two pair of elements from the two data sets. The algorithms rely on optimal transport, SCC, and a matching procedure. In light of Alvarez-Melis (2019, Section 1.3, page 25), problem-specific knowledge is injected onto two of the three main components of the transportation problem: the representation spaces (via Θ) and the marginal constraints, leaving aside the cost function.

During the first stage, an optimal optimal transport plan P and mapping in Θ are learned from the data using the Sinkhorn–Knopp algorithm and a mini-batch gradient descent. During the second stage, P is exploited to derive either co-clusters or several sets of matched elements.

As in Mégret et al. (2020), the motivation of our study is to shed light on the interaction between mRNAs and miRNAs based on data collected in the striatum of HD model knock-in mice (Langfelder et al., 2016, 2018). Each data point takes the form of multi-dimensional profile. The strong biological hypothesis is that if a miRNA induces the degradation of a target mRNA or blocks its translation into proteins, or both, then the profile of the former should be similar to minus the profile of the latter—this particular form of affine relationship drives the formulation of a loosened hypothesis and definition of model Θ. The fact that the algorithm learns from the data is that a best element in Θ provides more flexibility than in Mégret et al. (2020).

The simulation study reveals on the one hand that WTOT-SCC2 works overall better than WTOT-SCC1, but that the co-clustering task can be very challenging in the presence of many irrelevant data points (data points that do not interact). On the other hand, it shows that the performances of WTOT-matching are satisfying.

An illustration on real data is given. The results are biologically relevant and illustrate how our algorithm strikes a good balance between two moderately and highly selective, competing algorithms. Our findings lead to reconsidering the formerly expressed view on a limited role of miRNA regulation in the striatum of HD mice on a systems level (Mégret et al., 2020).

In conclusion, there are several directions for future work. First, we will develop a similar study to better understand miRNA regulation in the cortex of HD model mice (ongoing project). Second, we will evaluate the performances of our algorithms by simulation studies based on a simulation scheme learned from the real data so as to better mimic their law (ongoing project). Third, we will put our algorithms into the general context of co-clustering and matching of datasets and carry out more benchmark tests and comparisons.

Funding

T.T.Y.N. is funded by Université Paris Cité, thanks to a Ph.D. fellowship granted by Domaine d’Intérêt Majeur Math Innov (Région Île-de-France and Fondation Sciences Mathématiques de Paris). O.B. and A.C. are funded by Université Paris Cité, and W.H. by Déraison.ai.16 C.M., L.M., and C.N. are funded by the CHDI Foundation (grant no. A-14814), Sorbonne Université, CNRS, and INSERM.

Data availability

The omics data used in this study are publicly available through the database repository Gene Expression Omnibus and the HDinHD portal.17 Overlaps between the results obtained by applying the WTOT-matching algorithm and results previously obtained based on two other algorithms, and full display of biological annotations, are available on the companion website.18 The code is available at http://www.broca.inserm.fr/WTOT/ and https://github.com/yen-nguyen-thi-thanh/wtot_coclust_match.

Authors’ contributions

C.N., O.B., and A.C. conceived the study. T.T.Y.N., O.B., and A.C. developed the methodology, formally and computationally, and performed the data analysis based on insights from L.M. and C.N. on how mutant huntingtin may significantly influence expression patterns across CAG repeat alleles and age points in the brain of HD mice. W.H. provided guidance on computational aspects. L.M., C.M., and C.N. performed the biological analysis of the results, the comparison to other algorithms, and the data base construction. T.T.Y.N, O.B., and A.C. wrote the first draft of the manuscript. All authors commented on subsequent versions. All authors read and approved the final manuscript.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series C.

References

Ailem
 
M.
,
Role
 
F.
, &
Nadif
 
M.
(
2016
).
Graph modularity maximization as an effective method for co-clustering text data
.
Knowledge-Based Systems
,
109
,
160
173
. https://doi.org/10.1016/j.knosys.2016.07.002

Alvarez-Melis
 
D.
(
2019
).
Optimal transport in structured domains: Algorithms and applications [PhD thesis]. Massachusetts Institute of Technology
.

Baddeley
 
A.
, &
Turner
 
R.
(
2005
).
Spatstat: An R package for analyzing spatial point patterns
.
Journal of Statistical Software
,
12
(
6
),
1
42
. https://doi.org/10.18637/jss.v012.i06

Benayoun
 
B.
,
Pollina
 
E.
,
Singh
 
P.
,
Mahmoudi
 
S.
,
Harel
 
I.
,
Casey
 
K.
,
Dulken
 
B.
,
Kundaje
 
A.
, &
Brunet
 
A.
(
2019
).
Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses
.
Genome Research
,
29
(
4
),
697
709
. https://doi.org/10.1101/gr.240093.118

Betel
 
D.
,
Koppal
 
A.
,
Agius
 
P.
,
Sander
 
C.
, &
Leslie
 
C.
(
2010
).
Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites
.
Genome Biology
,
11
(
8
),
R90
. https://doi.org/10.1186/gb-2010-11-8-r90

Brault
 
V.
,
Keribin
 
C.
,
Celeux
 
G.
, &
Govaert
 
G.
(
2014
).
Estimation and selection for the latent block model on categorical data
.
Statistics and Computing
,
25
,
1201
1216
. https://doi.org/10.1186/gb-2010-11-8-r90

Chen
 
E. Y.
,
Tan
 
C. M.
,
Kou
 
Y.
,
Duan
 
Q.
,
Wang
 
Z.
,
Meirelles
 
G. V.
,
Clark
 
N. R.
, &
Ma’ayan
 
A.
(
2013
).
Enrichr: Interactive and collaborative html5 gene list enrichment analysis tool
.
BMC Bioinformatics
,
14
(
1
),
1
14
. https://doi.org/10.1186/1471-2105-14-128

Dhillon
 
I. S.
(
2001
).
Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01 (pp. 269–274). Association for Computing Machinery
.

Ding
 
J.
,
Li
 
X.
, &
Hu
 
H.
(
2016
).
TarPmiR: A new approach for microRNA target site prediction
.
BMC Bioinformatics
,
32
(
18
),
2768
2775
. https://doi.org/10.1093/bioinformatics/btw318

Fatras
 
K.
,
Zine
 
Y.
,
Flamary
 
R.
,
Gribonval
 
R.
, &
Courty
 
N.
(
2020
).
Learning with minibatch Wasserstein: Asymptotic and gradient properties. In The 23nd international conference on artificial intelligence and statistics, volume 108 of PMLR
.

Genevay
 
A.
,
Peyré
 
G.
, &
Cuturi
 
M.
(
2018, April 9–11
).
Learning generative models with Sinkhorn divergences. In A. Storkey & F. Perez-Cruz (Eds.), Proceedings of the twenty-first international conference on artificial intelligence and statistics, volume 84 of Proceedings of machine learning research (pp. 1608–1617). PMLR
.

Govaert
 
G.
, &
Nadif
 
M.
(
2010, December
).
Model-based co-clustering for continuous data. In Fourth international conference on Machine learning and applications (pp. 175–180). IEEE Computer Society
.

Kuleshov
 
M. V.
,
Jones
 
M. R.
,
Rouillard
 
A. D.
,
Fernandez
 
N. F.
,
Duan
 
Q.
,
Wang
 
Z.
,
Koplev
 
S.
,
Jenkins
 
S. L.
,
Jagodnik
 
K. M.
,
Lachmann
 
A.
, &
McDermott
 
M. G.
(
2016
).
Enrichr: A comprehensive gene set enrichment analysis web server 2016 update
.
Nucleic Acids Research
,
44
(
W1
),
W90
W97
. https://doi.org/10.1093/nar/gkw377

Laclau
 
C.
,
Redko
 
I.
,
Matei
 
B.
,
Bennani
 
Y.
, &
Brault
 
V.
(
2017, August
).
Co-clustering through optimal transport. In 34th international conference on machine learning (Vol. 70, pp. 1955–1964)
.

Langfelder
 
P.
,
Cantle
 
J.
,
Chatzopoulou
 
D.
,
Wang
 
N.
,
Gao
 
F.
,
Al-Ramahi
 
I.
,
Lu
 
X.
,
Ramos
 
E.
,
Merz
 
K.
,
Zhao
 
Y.
,
Deverasetty
 
S.
,
Tebbe
 
A.
,
Schaab
 
C.
,
Lavery
 
D.
,
Howland
 
D.
,
Kwak
 
S.
,
Botas
 
J.
,
Aaronson
 
J.
,
Rosinski
 
J.
, &
Yang
 
X.
(
2016
).
Integrated genomics and proteomics define Huntingtin CAGlength-dependent networks in mice
.
Nature Neuroscience
,
19
(
4
),
622
633
. https://doi.org/10.1038/nn.4256

Langfelder
 
P.
,
Gao
 
F.
,
Wang
 
N.
,
Howland
 
D.
,
Kwak
 
S.
,
Vogt
 
T.
,
Aaronson
 
J.
,
Rosinski
 
J.
,
Coppola
 
G.
,
Horvath
 
S.
, &
Yang
 
X.
(
2018
).
MicroRNA signatures of endogenous Huntingtin CAG repeat expansion in mice
.
PLoS One
,
13
(
1
),
e0190550
. https://doi.org/10.1371/journal.pone.0190550

Lavancier
 
F.
,
Møller
 
J.
, &
Rubak
 
E.
(
2015
).
Determinantal point process models and statistical inference
.
Journal of the Royal Statistical Society Series B: Statistical Methodology
,
77
(
4
),
853
877
. https://doi.org/10.1111/rssb.12096

Lewis
 
B. P.
,
Burge
 
C. B.
, &
Bartel
 
D. P.
(
2005
).
Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets
.
Cell
,
120
(
1
),
15
20
. https://doi.org/10.1016/j.cell.2004.12.035

Lloyd
 
S. P.
(
1982
).
Least squares quantization in PCM
.
IEEE Transactions on Information Theory
,
28
(
2
),
129
137
. https://doi.org/10.1109/TIT.1982.1056489

Maniatis
 
S.
,
Äijö
 
T.
,
Vickovic
 
S.
,
Braine
 
C.
,
Kang
 
K.
,
Mollbrink
 
A.
,
Fagegaltier
 
D.
,
Andrusivová
 
Ž.
,
Saarenpää
 
S.
,
Saiz-Castro
 
G.
,
Cuevas
 
M.
,
Watters
 
A.
,
Lundeberg
 
J.
,
Bonneau
 
R.
, &
Phatnani
 
H.
(
2019
).
Spatiotemporal dynamics of molecular pathology in amyotrophic lateral sclerosis
.
Science
,
364
(
6435
),
89
93
. https://doi.org/10.1126/science.aav9776

Mégret
 
L.
,
Sasidharan Nair
 
S.
,
Dancourt
 
J.
,
Aaronson
 
J.
,
Rosinski
 
J.
, &
Neri
 
C.
(
2020
).
Combining feature selection and shape analysis uncovers precise rules for miRNA regulation in Huntington’s disease mice
.
BMC Bioinformatics
,
21
(
1
),
75
. https://doi.org/10.1186/s12859-020-3418-9

Nazarov
 
P. V.
, &
Kreis
 
S.
(
2021
).
Integrative approaches for analysis of mRNA and microRNA high-throughput data
.
Computational and Structural Biotechnology Journal
,
19
,
1154
1162
. https://doi.org/10.1016/j.csbj.2021.01.029

Peyré
 
G.
, &
Cuturi
 
M.
(
2019
).
Computational optimal transport: With applications to data science
.
Foundations and trends in machine learning series
.
Now Publishers
.

Pontes
 
B.
,
Giráldez
 
R.
, &
Aguilar-Ruiz
 
J. S.
(
2015
).
Biclustering on expression data: A review
.
Journal of Biomedical Informatics
,
57
,
163
180
. https://doi.org/10.1016/j.jbi.2015.06.028

Xie
 
Z.
,
Bailey
 
A.
,
Kuleshov
 
M.
,
Clarke
 
D.
,
Evangelista
 
J.
,
Jenkins
 
S.
, &
Lachmann
 
A.
(
2021
).
Gene set knowledge discovery with Enrichr
.
Current Protocols
,
1
(
3
),
e90
. https://doi.org/10.1002/cpz1.90

Yang
 
H.
,
Shi
 
J.
, &
Carlone
 
L.
(
2021
).
TEASER: Fast and certifiable point cloud registration
.
IEEE Transactions on Robotics
,
37
(
2
),
314
333
. https://doi.org/10.1109/TRO.8860

Zhao
 
J.
,
Wang
 
H.
,
Dong
 
L.
,
Sun
 
S.
, &
Li
 
L.
(
2019
).
miRNA-20b inhibits cerebral ischemia-induced inflammation through targeting NLRP3
.
International Journal of Molecular Medicine
,
43
(
3
),
1167
1178
. https://doi.org/10.3892/ijmm.2018.4043

Footnotes

GO:0007409, de novo generation of a long process of a neuron, including the terminal branched region. Refers to the morphogenesis or creation of shape or form of the developing axon, which carries efferent (outgoing) action potential from the cell body towards target cells.

GO:0050691, any host process that modulates the frequency, rate, or extent of the antiviral response of a host cell or organism.

GO:0030198, a process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an extracellular matrix.

GO:0050690, evasion by virus of host immune response.

GO:0097729, a motile cilium where the axoneme has a ring of nine outer microtubules doublets plus two central micro tubules.

Author notes

O.B., C.N., and A.C. contributed equally.

Conflicts of interest: None declared.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)

Supplementary data