Finite mixtures of matrix variate Poisson-log normal distributions for three-way count data

Abstract

Motivation: Three-way data structures, characterized by three entities (the units, the variables and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data, and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks.

Results: In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo-based approach, a variational Gaussian approximation-based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.

Availability and implementation: The R package for this work is available at https://github.com/anjalisilva/mixMVPLN and is released under the open source MIT license.


Introduction
Finite mixture models are popular for clustering applications and are widely used on two-way data (McLachlan and Peel, 2000; McNicholas, 2016). Three-way data are becoming increasingly commonplace in several fields, including bioinformatics. Three-way data structures are characterized by three entities or modes: the units (rows), the variables (columns), and the occasions (layers). For two-way data, each observation is represented as a vector, whereas for three-way data each observation can be regarded as a matrix. A random matrix T_n is said to contain k ∈ {1, . . . , p} responses over i ∈ {1, . . . , r} occasions, and n = 1, . . . , N such units are considered. This provides N independent and identically distributed random matrices T_1, T_2, . . . , T_N.
Matrix variate distributions offer a natural approach for modeling three-way data. Extensions of matrix variate distributions in the context of mixture models have given rise to mixtures of matrix variate distributions, which have been used to cluster three-way data (Viroli, 2011; Anderlucci and Viroli, 2015; Dogru et al., 2016; McNicholas, 2017, 2018). Here, the interest lies in clustering the N observed matrices into G clusters, while utilizing all information from the other two modes (Viroli, 2011). It is assumed that the matrices T_1, T_2, . . . , T_N are conditionally independent and identically distributed observations coming from a mixture model with G possible groups in proportions π_1, . . . , π_G (Viroli, 2011). The density of the G-component mixture is

f(T_n | π_1, . . . , π_G, ϑ_1, . . . , ϑ_G) = Σ_{g=1}^G π_g f^{(r×p)}(T_n | ϑ_g),

where the parameters of the distribution function f^{(r×p)}(·) are represented by ϑ_g, and π_g > 0, such that Σ_{g=1}^G π_g = 1, is the mixing proportion of the gth component. The multivariate Poisson-log normal (MPLN) distribution (Aitchison and Ho, 1989) arises from a hierarchy in which, conditional on a latent vector θ, the counts are independently distributed as P(exp(θ_j)) and θ follows an rp-dimensional normal distribution N_rp(µ, Σ), where P denotes the Poisson distribution. To account for the differences in library sizes across each sample c of an RNA-seq study, a fixed, known constant s_c, representing the normalized library sizes, is added to the mean of the Poisson distribution. In this work, mixtures of MPLN distributions and matrix variate normal distributions are extended to give rise to mixtures of matrix variate Poisson-log normal (MVPLN) distributions for clustering three-way count data. Details of parameter estimation are provided, and both real and simulated data illustrations are used to demonstrate the clustering ability.
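The MPLN hierarchy with a library-size offset can be sketched in a few lines of code. A minimal simulation, with all parameter values illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3                                # dimension of each count vector
mu = np.array([1.0, 0.5, 2.0])       # mean of the latent normal
Sigma = np.array([[0.3, 0.1, 0.0],
                  [0.1, 0.2, 0.0],
                  [0.0, 0.0, 0.4]])  # latent covariance (illustrative)
log_s = np.log(1.2)                  # normalized library size offset (illustrative)

# Hierarchy: theta ~ N_d(mu, Sigma), then Y_j | theta_j ~ Poisson(exp(log_s + theta_j))
theta = rng.multivariate_normal(mu, Sigma, size=5000)
Y = rng.poisson(np.exp(log_s + theta))

print(Y.shape)         # (5000, 3)
print(Y.mean(axis=0))  # close to exp(log_s + mu + diag(Sigma) / 2)
```

The log-normal mixing inflates both the mean and the variance of the counts relative to a plain Poisson model, which is what makes the MPLN family suitable for overdispersed RNA-seq counts.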

Matrix variate Poisson-log normal distribution
Mathematical properties of the matrix variate normal distribution can be found in Gupta and Nagar (2000). By considering a matrix variate structure, the number of free covariance parameters to be estimated is reduced from ½rp(rp + 1) to ½[r(r + 1) + p(p + 1)]. The matrix variate normal distribution can be extended to give rise to the MVPLN distribution using a hierarchical structure. Consider N independent and identically distributed random matrices Y_n, n = 1, . . . , N, each of dimension r × p. In the MVPLN framework, Y_nik | θ_nik follows a Poisson distribution with mean exp(θ_nik), and θ_n follows an r × p matrix variate normal distribution N_{r×p}(M, Φ, Ω), where M is an r × p matrix of means, Φ is an r × r covariance matrix containing the variances and covariances between the r occasions, and Ω is a p × p covariance matrix containing the variances and covariances of the p variables. Figure 1 provides a graphical representation of a mixture of MVPLN distributions. The vectorization of Y_n, denoted vec(Y_n), is rp-dimensional. Given all vec(Y_n), i.e., for n = 1, . . . , N, the library sizes vec(s) can be calculated. Both vec(s) and vec(θ_n) are rp-dimensional. The covariance matrix of vec(θ_n) is Σ = Φ ⊗ Ω, where ⊗ denotes the Kronecker product. Note that Σ has dimension rp × rp, and the probability mass function of the MVPLN distribution is

f(Y_n | ϑ) = ∫ [ Π_{i=1}^r Π_{k=1}^p f(y_nik | exp(θ_nik)) ] g^{(r×p)}(θ_n | M, Φ, Ω) dθ_n,

where ϑ = (M, Φ, Ω), f(·) is the probability mass function of the Poisson distribution, and g^{(r×p)}(·) is the probability density function of the matrix variate normal distribution.
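The MVPLN hierarchy can be simulated through the vec representation: with row-major vectorization, vec(θ_n) has covariance Φ ⊗ Ω. A minimal numpy sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(7)

r, p = 2, 3                          # occasions x variables
M = np.zeros((r, p))                 # matrix of means (illustrative)
Phi = np.array([[1.0, 0.3],
                [0.3, 0.5]])         # r x r occasion covariance
Omega = np.array([[0.4, 0.1, 0.0],
                  [0.1, 0.3, 0.1],
                  [0.0, 0.1, 0.5]])  # p x p variable covariance

# vec trick: with row-major vectorization, Cov(vec(theta_n)) = Phi (x) Omega
Sigma = np.kron(Phi, Omega)          # rp x rp
theta_vec = rng.multivariate_normal(M.ravel(), Sigma, size=4)
theta = theta_vec.reshape(-1, r, p)  # back to four r x p latent matrices

Y = rng.poisson(np.exp(theta))       # one r x p matrix of counts per unit
print(Sigma.shape)  # (6, 6)
print(Y.shape)      # (4, 2, 3)
```

Note that numpy's default C (row-major) ordering is what pairs the flattened vector with the Kronecker product in the order Φ ⊗ Ω; a column-stacking vec convention would pair with Ω ⊗ Φ instead.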
The unconditional mean and covariance of the MPLN distribution can be calculated using the properties of the log-normal distribution and of the conditional expectation (Aitchison and Ho, 1989; Tunaru, 2002). For the MVPLN distribution, the unconditional mean and covariance are

E(Y_ik) = exp(M_ik + ½ Φ_ii Ω_kk)

and

cov(Y_ik, Y_jl) = E(Y_ik) E(Y_jl) [exp(Φ_ij Ω_kl) − 1] for (i, k) ≠ (j, l), with var(Y_ik) = E(Y_ik) + E(Y_ik)² [exp(Φ_ii Ω_kk) − 1],

respectively. The MVPLN distribution can account for both the correlations between variables and the correlations between occasions, as two different covariance matrices are used for the two modes. This makes the model ideal for modeling RNA-seq data when expression measurements for different conditions at different occasions, e.g., time points or replicates, are available.
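The unconditional mean of an MVPLN variable, exp(M_ik + Φ_ii Ω_kk / 2) under the Kronecker covariance, can be checked by simulation. A minimal numpy sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

r, p = 2, 2
M = np.array([[0.5, 1.0],
              [0.0, 1.5]])                      # illustrative mean matrix
Phi = np.array([[0.6, 0.2], [0.2, 0.4]])        # occasion covariance
Omega = np.array([[0.5, 0.1], [0.1, 0.3]])      # variable covariance

# Simulate many units from the MVPLN hierarchy (row-major vec convention)
n = 200_000
theta = rng.multivariate_normal(M.ravel(), np.kron(Phi, Omega),
                                size=n).reshape(n, r, p)
Y = rng.poisson(np.exp(theta))

# Theoretical mean: E(Y_ik) = exp(M_ik + Phi_ii * Omega_kk / 2)
expected = np.exp(M + 0.5 * np.outer(np.diag(Phi), np.diag(Omega)))
print(np.round(Y.mean(axis=0), 2))
print(np.round(expected, 2))
```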

Finite mixtures of MVPLN distributions
In the mixture model context, a random matrix Y_n is assumed to come from a population with G subgroups, each distributed according to an MVPLN distribution. Then N such matrices Y_1, Y_2, . . . , Y_N are observed, each of which belongs to one of g ∈ {1, . . . , G} different sub-populations with mixing proportions π_1, . . . , π_G. The probability density function of a G-component mixture of MVPLN distributions can then be written as

f(Y_n | Θ) = Σ_{g=1}^G π_g ∫ [ Π_{i=1}^r Π_{k=1}^p f(y_nik | exp(θ_nik)) ] g_g^{(r×p)}(θ_n | M_g, Φ_g, Ω_g) dθ_n,

where f(·) is the probability mass function of a Poisson distribution and g_g^{(r×p)}(·) is the probability density function of the matrix variate normal distribution. The cluster membership of all units is assumed to be unknown, and z_ng is used to denote cluster membership, where z_ng = 1 if Y_n is in component g and z_ng = 0 otherwise. The complete-data consist of the observed and missing data, i.e., (Y_1, . . . , Y_N, z_1, . . . , z_N, θ_1, . . . , θ_N). The complete-data likelihood is

L(Θ | Y, z, θ) = Π_{n=1}^N Π_{g=1}^G [ π_g f(Y_n | θ_n) g_g^{(r×p)}(θ_n | M_g, Φ_g, Ω_g) ]^{z_ng},

and the complete-data log-likelihood is

l_c(Θ) = Σ_{g=1}^G n_g log π_g + Σ_{n=1}^N Σ_{g=1}^G z_ng [ log f(Y_n | θ_n) + log g_g^{(r×p)}(θ_n | M_g, Φ_g, Ω_g) ],

where n_g = Σ_{n=1}^N z_ng. Compared to the mixtures of MPLN distribution, the number of free parameters to be estimated is reduced by considering a matrix variate structure (see Figures 2 and 3). For the mixtures of MPLN model, the number of free parameters is K = (G − 1) + Grp + ½Grp(rp + 1), whereas for the mixtures of MVPLN model it is K = (G − 1) + Grp + ½G[r(r + 1) + p(p + 1)]. Three frameworks are proposed for parameter estimation: an MCMC based approach, a variational Gaussian approximation (VGA) based approach, and a hybrid approach that combines the variational approximation based approach and the MCMC based approach.
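The two parameter counts can be compared directly; a small helper implementing the two formulas for K:

```python
def k_mpln(G, r, p):
    """Free parameters for a G-component mixture of MPLN distributions:
    (G - 1) mixing proportions, G rp-dimensional means, and G unstructured
    rp x rp covariance matrices."""
    rp = r * p
    return (G - 1) + G * rp + G * rp * (rp + 1) // 2

def k_mvpln(G, r, p):
    """Free parameters for a G-component mixture of MVPLN distributions:
    the rp x rp covariance is replaced by an r x r and a p x p matrix."""
    return (G - 1) + G * r * p + G * (r * (r + 1) + p * (p + 1)) // 2

# e.g. G = 2, r = 3, p = 4: the matrix variate structure needs far fewer parameters
print(k_mpln(2, 3, 4))   # 181
print(k_mvpln(2, 3, 4))  # 57
```

The gap widens quickly with r and p, since the unstructured count grows like (rp)² while the matrix variate count grows like r² + p².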

MCMC based approach
In the MCMC based approach, the Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm is used to estimate the model parameters (see Silva et al., 2019, for details). Using an MCMC-EM algorithm, the expected values of θ_ng and Z_ng conditional on the parameter updates from the t-th iteration are updated in the expectation (E-) step as follows:

E(θ_ng | Y_n, z_ng = 1, ϑ_g^{(t)}) ≈ (1 / W) Σ_{f=B−W+1}^{B} θ_ng^{(f)}

and

Ẑ_ng^{(t)} = π_g^{(t)} f(Y_n | ϑ_g^{(t)}) / Σ_{h=1}^G π_h^{(t)} f(Y_n | ϑ_h^{(t)}),    (1)

where θ_ng^{(f)} is a random sample simulated via the RStan package for iterations f = 1, . . . , B. In the E-step, the expectation is taken conditional on the current parameter estimates; hence the use of (t) on the parameters in (1). As the values from initial iterations are discarded from further analysis to minimize bias, the number of iterations used for parameter estimation is W, where W < B. The conditional expected value of the complete-data log-likelihood is

Q(Θ | Θ^{(t)}) = Σ_{g=1}^G n_g^{(t)} log π_g + Σ_{n=1}^N Σ_{g=1}^G Ẑ_ng^{(t)} E[ log f(Y_n | θ_ng) + log g_g^{(r×p)}(θ_ng | M_g, Φ_g, Ω_g) ] + C,

where C is a constant with respect to M_g, Φ_g and Ω_g, and n_g^{(t)} = Σ_{n=1}^N Ẑ_ng^{(t)}. During the M-step, the updates for the parameters are obtained as follows:

π_g^{(t+1)} = n_g^{(t)} / N,
M_g^{(t+1)} = (1 / n_g^{(t)}) Σ_{n=1}^N Ẑ_ng^{(t)} E(θ_ng | Y_n, z_ng = 1),
Φ_g^{(t+1)} = (1 / (p n_g^{(t)})) Σ_{n=1}^N Ẑ_ng^{(t)} E[ (θ_ng − M_g^{(t+1)}) Ω_g^{-1} (θ_ng − M_g^{(t+1)})ᵀ ],
Ω_g^{(t+1)} = (1 / (r n_g^{(t)})) Σ_{n=1}^N Ẑ_ng^{(t)} E[ (θ_ng − M_g^{(t+1)})ᵀ Φ_g^{-1} (θ_ng − M_g^{(t+1)}) ].
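The two E-step quantities can be sketched directly; here the array of draws is a stand-in for the RStan samples, and all numeric values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

B, W = 1000, 600          # total iterations and post-burn-in iterations kept
r, p = 2, 3

# Stand-in for RStan draws of theta_ng: B samples of an r x p matrix
draws = rng.normal(loc=1.0, scale=0.2, size=(B, r, p))

# E-step estimate of E(theta_ng | ...): average the last W draws
e_theta = draws[B - W:].mean(axis=0)

# E-step estimate of Z_ng: normalized weighted densities, computed on the log
# scale for numerical stability (log_dens stands in for log f(Y_n | vartheta_g))
log_pi = np.log(np.array([0.6, 0.4]))
log_dens = np.array([-10.2, -11.0])     # illustrative component log-densities
w = log_pi + log_dens
z_hat = np.exp(w - w.max())
z_hat /= z_hat.sum()

print(e_theta.shape)      # (2, 3)
print(z_hat)              # responsibilities for the two components
```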

VGA based approach
Variational approximations (Wainwright et al., 2008) are approximate inference techniques in which a computationally convenient approximating density is used in place of a more complex but 'true' posterior density. The approximating density is obtained by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities.
Suppose we have an approximating density q(θ); then the marginal log of the probability mass function can be written as

log f(Y) = D_KL(q ‖ f(θ | Y)) + F(Y, q),

where D_KL(q ‖ f(θ | Y)) = ∫ q(θ) log[ q(θ) / f(θ | Y) ] dθ is the KL divergence between f(θ | Y) and the approximating distribution q(θ), and F(Y, q) = ∫ [log f(Y, θ) − log q(θ)] q(θ) dθ is our evidence lower bound (ELBO). Thus, to minimize the KL divergence, we maximize the ELBO. For variational Gaussian approximations, q(θ) is assumed to be a Gaussian distribution.
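The decomposition log f(Y) = KL + ELBO can be verified numerically on a toy conjugate model where the marginal is known in closed form. A self-contained sketch; the model and all values are illustrative, not the paper's:

```python
import numpy as np

# Toy model where everything is tractable: Y | theta ~ N(theta, 1), theta ~ N(0, 1),
# so the marginal is Y ~ N(0, 2) and log f(Y) is known exactly.
def log_norm(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v

Y = 0.7
theta = np.linspace(-12.0, 12.0, 40001)
dx = theta[1] - theta[0]

log_joint = log_norm(Y, theta, 1.0) + log_norm(theta, 0.0, 1.0)  # log f(Y, theta)
log_evidence = log_norm(Y, 0.0, 2.0)                             # log f(Y)
log_post = log_joint - log_evidence                              # log f(theta | Y)

# An arbitrary Gaussian approximating density q(theta)
log_q = log_norm(theta, 0.3, 0.8 ** 2)
q = np.exp(log_q)

elbo = ((log_joint - log_q) * q).sum() * dx   # F(Y, q)
kl = ((log_q - log_post) * q).sum() * dx      # KL(q || posterior)

print(abs(elbo + kl - log_evidence) < 1e-6)   # True: log f(Y) = KL + ELBO
```

Because the KL term is non-negative, the ELBO never exceeds log f(Y), which is why maximizing F(Y, q) over q tightens the approximation.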
The complete-data log-likelihood for the mixtures of MVPLN distributions can be written as

l_c(Θ) = Σ_{n=1}^N Σ_{g=1}^G z_ng [ log π_g + F(Y_n, q_ng) + D_KL(q_ng ‖ f_ng) ],

where D_KL(q_ng ‖ f_ng) = ∫ q(θ_ng) log[ q(θ_ng) / f(θ_n | Y_n, Z_ng = 1) ] dθ_ng is the KL divergence between f(θ_n | Y_n, Z_ng = 1) and the approximating distribution q(θ_ng). Assuming q(θ_ng) = N_{r×p}(ξ_ng, ∆_ng, κ_ng), the ELBO for each observation Y_n becomes

F(Y_n, q_ng) = E_q[log f(Y_n | θ_ng)] + E_q[log g_g^{(r×p)}(θ_ng | M_g, Φ_g, Ω_g)] − E_q[log q(θ_ng)].

The variational parameters that maximize the ELBO will minimize the KL divergence between the true posterior and the approximating density. Thus, parameter estimation can be done in an iterative EM-type approach in which the following steps are iterated. At the (t + 1)-th step,

1. Conditional on the variational parameters ξ_ng, ∆_ng and κ_ng, and on M_g, Φ_g and Ω_g, E(Z_ng) is computed. Given π_g, M_g, Φ_g and Ω_g,

E(Z_ng) = π_g f(Y_n | ϑ_g) / Σ_{h=1}^G π_h f(Y_n | ϑ_h).

Note that this involves the marginal distribution of Y_n, which is difficult to compute. Hence, we use an approximation of E(Z_ng), where we replace the marginal density with the exponent of the ELBO, such that

Ẑ_ng = π_g exp[F(Y_n, q_ng)] / Σ_{h=1}^G π_h exp[F(Y_n, q_nh)].

This approximation is computationally convenient, and a similar framework has been previously utilized (Gollini and Murphy, 2014; Tang et al., 2015). The approximation works well in both simulation studies and real data analyses.

2. Given Ẑ_ng, the variational parameters are updated.

(a) A fixed-point method is used for updating ∆_ng:

∆_ng^{(t+1)} = p [ tr(Ω_g^{-1} κ_ng^{(t)}) Φ_g^{-1} + diag( exp[ ξ_ng + ½ diag(∆_ng^{(t)}) diag(κ_ng^{(t)})ᵀ ] diag(κ_ng^{(t)}) ) ]^{-1},

where exp[A] denotes the elementwise exponential of a matrix, diag(κ) = (κ_11, . . . , κ_pp)ᵀ puts the diagonal elements of the p × p matrix κ into a p-dimensional vector, diag(∆) = (∆_11, . . . , ∆_rr)ᵀ is defined analogously for the r × r matrix ∆, the outer diag(·) places a vector on the diagonal of a square matrix, and diag(∆_ng) diag(κ_ng)ᵀ is the r × p outer product of these two vectors.
(b) A fixed-point method is used for updating κ_ng:

κ_ng^{(t+1)} = r [ tr(Φ_g^{-1} ∆_ng^{(t+1)}) Ω_g^{-1} + diag( exp[ ξ_ng + ½ diag(∆_ng^{(t+1)}) diag(κ_ng^{(t)})ᵀ ]ᵀ diag(∆_ng^{(t+1)}) ) ]^{-1},

with the same notation as above; the transpose makes the elementwise exponential a p × r matrix, so that its product with the r-dimensional vector diag(∆_ng^{(t+1)}) is a p-dimensional vector.
(c) Newton's method is used to update ξ_ng, based on the gradient

∂F(Y_n, q_ng) / ∂ξ_ng = Y_n − exp[ ξ_ng + ½ diag(∆_ng) diag(κ_ng)ᵀ ] − Φ_g^{-1} (ξ_ng − M_g) Ω_g^{-1}.

3. Given Ẑ_ng, ξ_ng^{(t+1)}, ∆_ng^{(t+1)} and κ_ng^{(t+1)}, the parameters π_g, M_g, Φ_g and Ω_g can be solved for as

π_g^{(t+1)} = n_g / N, where n_g = Σ_{n=1}^N Ẑ_ng,

M_g^{(t+1)} = (1 / n_g) Σ_{n=1}^N Ẑ_ng ξ_ng^{(t+1)},

Φ_g^{(t+1)} = (1 / (p n_g)) Σ_{n=1}^N Ẑ_ng [ (ξ_ng^{(t+1)} − M_g^{(t+1)}) Ω_g^{-1} (ξ_ng^{(t+1)} − M_g^{(t+1)})ᵀ + tr(Ω_g^{-1} κ_ng^{(t+1)}) ∆_ng^{(t+1)} ],

Ω_g^{(t+1)} = (1 / (r n_g)) Σ_{n=1}^N Ẑ_ng [ (ξ_ng^{(t+1)} − M_g^{(t+1)})ᵀ Φ_g^{-1} (ξ_ng^{(t+1)} − M_g^{(t+1)}) + tr(Φ_g^{-1} ∆_ng^{(t+1)}) κ_ng^{(t+1)} ].
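The ELBO-based approximation to E(Z_ng) amounts to a softmax of log π_g plus the per-component ELBO; a small numerically stable sketch, with all numeric values illustrative:

```python
import numpy as np

def z_update(log_pi, elbo):
    """Approximate E(Z_ng) by replacing the marginal density of Y_n with
    exp(ELBO): Z_ng proportional to pi_g * exp(F(Y_n, q_ng)), normalized
    over g. Computed on the log scale for numerical stability."""
    w = log_pi[np.newaxis, :] + elbo          # N x G matrix of log-weights
    w -= w.max(axis=1, keepdims=True)         # guard against underflow
    z = np.exp(w)
    return z / z.sum(axis=1, keepdims=True)

# Illustrative values: 2 observations, G = 2 components
log_pi = np.log(np.array([0.5, 0.5]))
elbo = np.array([[-50.0, -60.0],
                 [-80.0, -70.0]])
z = z_update(log_pi, elbo)
print(z.argmax(axis=1))  # [0 1]
```

Subtracting the row maximum before exponentiating leaves the normalized ratios unchanged, which matters here because per-observation ELBOs are typically large negative numbers whose raw exponentials underflow.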

Hybrid Approach
While the MCMC based approach can generate exact results, fitting such models can take a substantial amount of time. The VGA based approach is computationally efficient; however, it does not guarantee an exact posterior (Ghahramani and Beal, 1999). Thus, we provide a computationally efficient hybrid approach in which:
- Step 1: Fit the model using the VGA based approach.
- Step 2: Estimate the component indicator variable Z_ng conditional on the parameter estimates from the VGA based approach.
- Step 3: Using the parameter estimates from Step 1 as the initial values for the parameters and using the classification from Step 2, compute the MCMC based expectation for the latent variable θ_ng and obtain the final estimates of the model parameters.
The hybrid approach comes with a substantial reduction in computational overhead compared to a traditional MCMC-EM approach, yet it can generate samples from the exact posterior distribution. Fitting a model using the hybrid approach on a single dataset from Simulation 1, with N = 1000 and rp = 6, takes on average about 6 minutes on a standard laptop with an Apple M1 chip. When the primary goal is to detect the underlying clusters (which is the case for our real data analysis), the VGA based approach is sufficient. However, when the primary goal is posterior inference, we recommend the hybrid approach, as it yields an exact posterior similar to the MCMC-EM approach while remaining computationally efficient.
Details on the convergence criteria, initialization, and parallel implementation for an MCMC-EM approach are provided in Appendix C.

Identifiability
Model identifiability is vital to obtaining unique and consistent parameter estimates. The identifiability of univariate and multivariate finite mixtures of normal distributions has been proved (Teicher, 1963; Yakowitz and Spragins, 1968). With regard to mixtures of MVPLN distributions, the estimates for Φ_g and Ω_g are only unique up to a strictly positive constant.
To eliminate identifiability issues, a constraint needs to be imposed: e.g., the trace of Ω_g can be set equal to p, the trace of Φ_g can be set equal to r, or the first diagonal element of Φ_g can be set equal to 1. The latter solution, which is used by Gallaugher and McNicholas (2017), is used for all analyses in this paper. To obtain the final parameter estimates, the resulting Φ_g is divided by its first diagonal element and Ω_g is multiplied by the same value, leaving Σ_g = Φ_g ⊗ Ω_g unchanged.
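The scale non-identifiability of (Φ_g, Ω_g) and the rescaling fix can be demonstrated directly; all matrix values here are illustrative:

```python
import numpy as np

Phi = np.array([[2.0, 0.5],
                [0.5, 1.0]])
Omega = np.array([[1.0, 0.2, 0.0],
                  [0.2, 0.8, 0.1],
                  [0.0, 0.1, 0.6]])

# (c * Phi) (x) (Omega / c) gives the same Sigma for any c > 0,
# so Phi and Omega are only identified up to a positive constant.
c = 3.7
same = np.allclose(np.kron(c * Phi, Omega / c), np.kron(Phi, Omega))
print(same)  # True

# One fix: rescale so the first diagonal element of Phi equals 1
c = Phi[0, 0]
Phi_c, Omega_c = Phi / c, Omega * c
print(Phi_c[0, 0])  # 1.0
print(np.allclose(np.kron(Phi_c, Omega_c), np.kron(Phi, Omega)))  # True
```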

Model selection and performance assessment
Four model selection criteria are offered: the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), a variation of the AIC used by Bozdogan (1994) called AIC3, and the integrated completed likelihood (ICL; Biernacki et al., 2000). These criteria are given by

AIC = −2 log L(ϑ̂_MLE | y) + 2K,
BIC = −2 log L(ϑ̂_MLE | y) + K log N,
AIC3 = −2 log L(ϑ̂_MLE | y) + 3K,
ICL ≈ BIC − 2 Σ_{n=1}^N Σ_{g=1}^G MAP(ẑ_ng) log ẑ_ng,

where L(ϑ̂_MLE | y) is the maximized likelihood, K is the number of free parameters, and N is the number of observations. In situations where the true classes are known but, for clustering purposes, are ignored, the adjusted Rand index (ARI; Hubert and Arabie, 1985) can be used for performance assessment. The ARI takes a value of 1 under perfect class agreement and has expected value 0 under random classification.
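The four criteria can be computed from the maximized log-likelihood; a sketch following the formulas above, with the ICL using the MAP-hardened entropy penalty and all input values illustrative:

```python
import numpy as np

def information_criteria(loglik, K, N, z_hat):
    """AIC, BIC, AIC3 and ICL from the maximized log-likelihood `loglik`,
    number of free parameters K, sample size N, and the N x G matrix of
    estimated posterior membership probabilities z_hat."""
    aic = -2 * loglik + 2 * K
    bic = -2 * loglik + K * np.log(N)
    aic3 = -2 * loglik + 3 * K
    # ICL adds to BIC an entropy penalty based on the MAP classification
    map_z = (z_hat == z_hat.max(axis=1, keepdims=True)).astype(float)
    icl = bic - 2 * np.sum(map_z * np.log(np.clip(z_hat, 1e-300, None)))
    return aic, bic, aic3, icl

# Illustrative inputs: N = 200 observations, K = 10 free parameters
loglik, K, N = -1234.5, 10, 200
z_hat = np.array([[0.9, 0.1], [0.2, 0.8]] * 100)
aic, bic, aic3, icl = information_criteria(loglik, K, N, z_hat)
print(aic)   # 2489.0
print(aic3)  # 2499.0
```

Because the entropy penalty is non-negative, ICL is never smaller than BIC; it favors models whose clusters are well separated, not just well fitting.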

Simulations
Simulation studies were conducted to illustrate the ability of the mixtures of MVPLN algorithm to recover the true underlying parameters. For Simulation 1, datasets with G = 1 component were generated with N = 1000 observations, r = 2 and p = 3. Further settings were considered for Simulations 2 and 3. Comparative studies were also conducted. Because no comparable methods capable of clustering three-way count data were found in the current literature, datasets from Simulations 1, 2 and 3 were vectorized and analyzed with clustering methods designed for two-way data. For this purpose, a model-based clustering technique for count data, HTSCluster (Rau et al., 2011, 2015), and a distance-based method, k-means clustering (MacQueen, 1967), were used. For HTSCluster, the initialization and clustering ranges were the same as those used for the mixtures of MVPLN algorithm. For k-means, the algorithm was run by specifying the true number of clusters, and only the partitioning of the observations into clusters was evaluated. Because Simulation 1 contains a single component, k-means was not performed for it.
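The vectorization used for the comparative studies is just a reshape from N × r × p to N × rp; a minimal sketch with simulated counts and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(11)

N, r, p = 100, 2, 3
counts = rng.poisson(5.0, size=(N, r, p))   # three-way count array

# Vectorize each r x p matrix into an rp-vector so that methods designed
# for two-way data (e.g. HTSCluster, k-means) can be applied
flat = counts.reshape(N, r * p)
print(flat.shape)  # (100, 6)

# The entries are preserved, but the occasion/variable structure is no
# longer distinguished by the two-way methods; reshaping back is lossless
assert (flat.reshape(N, r, p) == counts).all()
```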

Clustering transcriptome data
To illustrate the applicability of mixtures of MVPLN distributions for detecting underlying cluster structure, the VGA based approach was applied to an RNA-seq dataset. Typically, only a subset of genes from an experiment is used for cluster analysis, in order to reduce noise; here, only differentially expressed genes were retained. The study identified 1336 differentially expressed genes, which were used for clustering.
The raw read counts for genes were obtained from Binary Alignment/Map files using samtools (Li et al., 2009) and HTSeq (Anders et al., 2015). The median value of the 3 replicates for each developmental stage was used. On the resulting three-way data of dimensions 1336 × 2 × 3, a clustering range of G = 1, . . . , 10 was considered using k-means initialization (100 runs).
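Collapsing replicates by their median can be sketched with numpy; the array dimensions mirror the analysis above, but the counts are simulated, not the real data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dimensions mirroring the analysis: genes x conditions x stages,
# with 3 replicates per developmental stage collapsed by their median
n_genes, n_conditions, n_stages, n_reps = 1336, 2, 3, 3
raw = rng.poisson(20.0, size=(n_genes, n_conditions, n_stages, n_reps))

data = np.median(raw, axis=-1)   # median across the 3 replicates
print(data.shape)  # (1336, 2, 3)
```

The median is less sensitive to a single outlying replicate than the mean, which is one reason it is a common choice for collapsing a small number of replicates.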
Furthermore, we repeated the analysis ten times. Since BIC and ICL both performed well in recovering the underlying cluster structure in the simulated data, they were also used here for model selection. Because no true labels are available for assessing the transcriptome cluster analysis results, the clustering results were explored using the cluster-specific means M_g, which capture the mean trends of the expression levels of genes in the different clusters (Figure 5). As evident in Figure 5, each cluster has its own distinctive expression signature. In some clusters (for example, Cluster 4), the mean gene expression signatures are similar between the darkening and non-darkening beans but vary between developmental stages; in Cluster 8, the mean gene expression signatures are similar across the developmental stages but vary slightly between the darkening and non-darkening beans. In Cluster 3, the mean gene expression signatures vary both across the developmental stages and between the darkening and non-darkening beans.
For all simulation and transcriptome data analyses, the normalization factors representing the library sizes of the samples were obtained using the trimmed mean of M values (TMM) method.

Discussion
A mixture of MVPLN distributions is introduced for clustering three-way count data, targeted at expression data arising from RNA-seq experiments. This is the first use of a mixture of MVPLN distributions for clustering within the literature. By allowing for a direct analysis of three-way data, the model makes simultaneous use of the information on conditions and occasions while reducing the number of covariance parameters to be estimated. The MCMC based approach can be computationally intensive; therefore, we also propose a hybrid approach that is computationally efficient and samples from the true posterior. Through simulation studies, we show that the VGA approach provides good clustering performance.
- B: total number of iterations
- W: number of iterations used in RStan for parameter estimation, after discarding values from initial iterations to minimize bias
- Q: expected value of the complete-data log-likelihood
- C: a constant with respect to some parameters
- l_c: complete-data log-likelihood
- Θ: vector of model parameters and mixing proportions
- tr: trace
- MAP: maximum a posteriori classification
- K: number of free parameters
- exp: exponential function