-
PDF
- Split View
-
Views
-
Cite
Cite
Andrea Rau, Cathy Maugis-Rabusseau, Marie-Laure Martin-Magniette, Gilles Celeux, Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models, Bioinformatics, Volume 31, Issue 9, May 2015, Pages 1420–1427, https://doi.org/10.1093/bioinformatics/btu845
- Share Icon Share
Abstract
Motivation: In recent years, gene expression studies have increasingly made use of high-throughput sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression (DGE) has flourished, primarily in the context of normalization and differential analysis.
Results: In this work, we focus on the question of clustering DGE profiles as a means to discover groups of co-expressed genes. We propose a Poisson mixture model using a rigorous framework for parameter estimation as well as the choice of the appropriate number of clusters. We illustrate co-expression analyses using our approach on two real RNA-seq datasets. A set of simulation studies also compares the performance of the proposed model with that of several related approaches developed to cluster RNA-seq or serial analysis of gene expression data.
Availability and and implementation: The proposed method is implemented in the open-source R package HTSCluster, available on CRAN.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The application of high-throughput sequencing (HTS) to the study of gene expression has revolutionized the scope and depth of understanding of the genome, epigenome and transcriptome of dozens of organisms. In particular, the recent use of HTS technologies to sequence ribonucleic acid content (RNA-seq), has rivaled the use of microarrays for transcriptomic studies as it offers a way to quantify gene expression without prior knowledge of the genome sequence by providing counts of transcripts. Although both technologies seem to be complementary (Naghavachari et al., 2012; SEQC/MAQC-III Consortium, 2014; Wang et al., 2014), RNA-seq can provide information about the transcriptome at a level of detail not possible with microarrays, including allele-specific expression and transcript discovery.
Although a variety of different protocols exist for HTS studies, the same broad preprocessing steps are followed. Namely, if an appropriate genome sequence reference is available, reads are mapped to the genome or transcriptome; otherwise, de novo assembly may be used. After alignment or assembly, read coverage for a given biological entity (e.g. a gene) is subsequently calculated. The quantification of gene expression in RNA-seq data remains an active area of research (Trapnell et al., 2010), and in this work, we focus on measures of digital gene expression (DGE) (counts). These count-based measures of gene expression differ substantially from data produced with microarrays. For example, RNA-seq data are discrete, positive, and highly skewed, with a very large dynamic range. In addition, due to the sampling nature of sequencing, low precision tends to be observed for weakly to moderately expressed genes (McIntyre et al., 2011). Finally, sequencing depth (i.e. the library size) and coverage vary between experiments, and read counts are known to be correlated with gene length (Łabaj et al., 2011; Oshlack and Wakefield, 2009).
To date, most developments concerning the statistical analysis of RNA-seq data have dealt with the issues of experimental design (Auer and Doerge, 2010), normalization (Robinson and Oshlack, 2010) and the analysis of differential expression (Anders and Huber, 2010; Auer et al., 2012; Law et al., 2014; McCarthy et al., 2012; Zhou et al., 2014). In this work, we focus on the question of co-expression analyses for RNA-seq data. Identifying biological entities that share similar profiles across several treatment conditions, such as co-expressed genes, may help identify groups of genes that are involved in the same biological processes (Eisen et al., 1998; Jiang et al., 2004). Clustering analyses based on metric criteria, such as the K-means algorithm (MacQueen, 1967) and hierarchical clustering (Ward, 1963), have been used to cluster microarray-based measures of gene expression as they are rapid, simple and stable. However, such methods require both the choice of metric and criterion to be optimized, as well as the selection of the number of clusters. An alternative to such methods are probabilistic clustering models, where the objects to be classified (genes) are considered to be a sample of a random vector and a clustering of the data is obtained by analyzing the density of this vector (McLachlan et al., 2004; Yeung et al., 2001).
Presently, most proposals for clustering RNA-seq data have focused on grouping together biological samples rather than biological entities (e.g. genes). For example, Anders and Huber (2010) perform a hierarchical clustering with a Euclidean distance of samples following a variance-stabilizing transformation, and Severin et al. (2010) cluster fourteen diverse tissues of soybean using hierarchical clustering with Pearson correlation after normalizing the data using a variation of the Reads Per Kilobase per Million mapped reads (RPKM) measure. Witten (2011) discussed the clustering of samples using hierarchical clustering with a modified loglikelihood ratio statistic as distance measure based on a Poisson loglinear model; this model is similar to that of Cai et al. (2004) for the clustering of Serial Analysis of Gene Expression (SAGE) gene profiles using a K-means algorithm and a Poisson loglinear model. More recently, Si et al. (2014) considered Poisson and negative binomial mixture models to develop a model-based hybrid-hierarchical clustering algorithm.
In this work, like Cai et al. (2004) and Si et al. (2014), we focus on the use of Poisson loglinear models for the clustering of count-based HTS expression profiles; however, rather than using such a model to define a distance metric to be used in a K-means or hierarchical clustering algorithm, we make use of finite mixtures of Poisson loglinear models. This framework has the advantage of providing a straightforward procedure for parameter estimation and model selection, as well as a per-gene conditional probability of belonging to each cluster.
2 Methods
Let Yijl be the random variable corresponding to the DGE measure for biological entity i () of condition j () in biological replicate l (), with yijl being the corresponding observed value of Yijl. Let be the total number of variables (all replicates in all conditions) in the data, such that is the n × q matrix of the DGE for all observations and variables, and is the q-dimensional vector of DGE for all variables of observation i. We use dot notation to indicate summations in various directions, e.g. , and so on.
2.1 Poisson mixture model
2.2 Inference
Finally, based on , each observation i is assigned to the component maximizing the conditional probability (i.e. using the so-called MAP rule).
2.3 HTSCluster R package
Our proposed clustering procedure based on a Poisson mixture model is implemented in the R package HTSCluster, freely available on CRAN; in this section, we describe some of the options available in this package.
2.3.1 Normalization factors
In the model described in Equation (2), the cluster-specific parameters λjk are assumed to be shared among replicates within the same condition j; as such, our model assumes that differences in mean expression for a given gene among replicates within the same condition may be explained by differences in library sizes. As described in Section 2.1, these library size normalization factors sjl are estimated from the data and considered to be fixed parameters in the Poisson mixture model. Several options are available in HTSCluster to provide estimates for these normalization factors, including the Trimmed Mean of M-values (TMM) normalization (Robinson and Oshlack, 2010) in the edgeR Bioconductor package (Robinson et al., 2010) and the median ratio normalization developed in the DESeq Bioconductor package (Anders and Huber, 2010). Although Dillies et al. (2012) performed an evaluation of the impact of these normalization factors in the context of differential analyses, such a comparison remains an open research question for co-expression analyses.
2.3.2 Parameter estimation and initialization
For parameter estimation in HTSCluster, in addition to the EM algorithm described above, it is also possible to use the so-called CEM (Clustering EM) algorithm (Celeux and Govaert, 1992). The CEM algorithm estimates both the mixture parameters and the cluster labels by maximizing the completed likelihood. In the E-step of the algorithm, the conditional probabilities are computed as in Equation (3). In the C-step, the MAP rule is used to assign each observation to a component. Finally, in the M-step, the mixture parameter estimates are updated by maximizing the completed likelihood. Contrary to the EM algorithm, the CEM algorithm converges in a finite number of iterations, although it does provide biased estimates of the mixture parameters [see for instance McLachlan and Peel (2000), Section 2.21]. Because it aims to maximize the completed likelihood, where the component label of each sample point is included in the data, the CEM may be seen as a K-means-like algorithm.
2.3.3 Additional options
Finally, HTSCluster provides flexibility to the user through a variety of graphical representations as well as a set of additional options, including the following: (i) cluster proportions can be variable (the default option) or fixed to be equal for all clusters; and (ii) one or more clusters may be included with a fixed value for . The latter option may be particularly useful in the context of differential analyses, where a group of genes may be assumed to have identical expression across experimental conditions. We recently proposed an approach and an associated R package HTSDiff that make use of this particular functionality.
3 Results
3.1 Real data analysis
In the following, we illustrate the use of the HTSCluster package for a co-expression analysis of two real RNA-seq datasets. We stress that in both cases, it is not possible to compare the co-expression results obtained using HTSCluster to a ‘true’ clustering of the data, as such a classification does not generally exist. However, in order to identify whether co-expressed genes appear to be implicated in similar biological processes, we conduct functional enrichment analyses of gene ontology (GO) terms for the clusters identified by HTSCluster.
3.1.1 Dynamic expression of the transcriptome in embryonic flies
As part of the modENCODE project, which aims to provide functional annotation of the Drosophila melanogaster genome, Graveley et al. (2011) characterized the expression dynamics over 27 distinct stages of development during the life cycle of the fly using RNA-seq. In this work, we focus on a subset of these data from 12 embryonic samples that were collected at 2-h intervals for 24 h, with one biological replicate for each time point. The phenotype tables and raw read counts for the 13 164 genes with at least one non-zero count among the 12 time points were obtained from the ReCount online resource (Frazee et al., 2011).
Over three independent runs, we used the HTSCluster package with default settings and the splitting small-EM initialization strategy (described in Section 2.3.2) to fit a sequence of Poisson mixture models with clusters; for each number of clusters, the model corresponding to the largest loglikelihood among the three runs was retained. To ensure that the collection of models considered is large enough to apply the slope heuristics model selection, one additional set of Poisson mixture models was fit for (in steps of 5) and (in steps of 10). Using the slope heuristics, the number of clusters was determined to be ; see the Supplementary Materials for more detail.
Visualizing the results of a co-expression analysis for RNA-seq data can be somewhat complicated by the extremely large dynamic range of DGE and the fact that more highly expressed genes tend to exhibit greater variability (though much smaller coefficients of variation) than weakly expressed genes. Two possibilities to avoid this issue are to apply a logarithmic transformation to obtain pseudocounts (Robinson et al., 2010) or a variance-stabilizing transformation (Anders and Huber, 2010) prior to graphical representation; however, the choice of the data to be visualized (e.g. raw, scaled, or transformed data) as well as the most appropriate manner in which they should be graphically displayed are still an open matter of research. For the purposes of co-expression, rather than directly visualizing the data themselves, we propose an alternative visualization of the overall behavior of each cluster, as shown in Figure 1. In this plot, bar widths correspond to the estimated proportion of genes in each cluster (), and the proportion of reads attributed to each developmental time point in each cluster are represented by the colored segments within each bar. The advantage of such a visualization is that it enables a straightforward comparison of typical gene profiles among clusters. For instance, it can be seen that clusters characterized by higher relative expression in the early embryonic stages, such as Clusters 6 and 13 (composed of 70 and 60 genes, respectively) tend to be much smaller than those with higher relative expression in later stages, e.g. Clusters 4, 18, 19 and 21 (composed of 567, 680, 485 and 475 genes, respectively).

Visualization of overall cluster behavior for the Drosophila melanogaster developmental data. For each cluster, bar plots of are drawn for each developmental time point, where the width of each bar corresponds to the estimated proportion
A functional enrichment analysis of GO biological processes revealed that of the 48 clusters identified by HTSCluster, 33 were associated with at least one GO term. For example, cluster 39 was found to be associated with terms pertaining to morphogenesis (GO:0009653, GO:0048858) and cell development (GO:0048468), while cluster 6 is associated with muscle attachment (GO:0016203). As a comparison, we also fit the closely related Poisson and negative binomial mixture models proposed by Si et al. (2014) for K = 48 clusters; for these models, a total of 22 and 25 clusters, respectively, were associated with at least one GO term. Additional details may be found in the Supplementary Materials.
3.1.2 Sex-specific expression using RNA-seq in human liver cells
High-throughput transcriptome RNA-seq data were obtained (Sultan et al., 2008) from a human embryonic kidney (HEK293T) and a Ramos B cell line, with two biological replicates in each experimental condition. The raw read counts for 9010 genes and phenotype tables were obtained from the ReCount online resource (Frazee et al., 2011). Following filtering using the HTSFilter package (Rau et al., 2013) to remove weakly expressed genes across the two conditions, 4956 were retained for the subsequent coexpression analysis.
As for the previous dataset, we applied the HTSCluster package with default parameters and the splitting small-EM initialization strategy over three independent runs to fit a sequence of Poisson mixture models with , where only the model with the highest loglikelihood for each number of clusters was retained. One additional set of models was fit for (in steps of 5) to ensure the applicability of the slope heuristics for model selection. Using the slope heuristics, the number of clusters was determined to be ; additional details may be found in the Supplementary Materials.
As before, a visualization of the overall behavior of genes belonging to each cluster may be found in Figure 2. It can be noted that Cluster 1 represents a fairly large number of genes (540) with dominant expression in the HEK293T cell line, whereas the much smaller Cluster 2 (192 genes) represents those genes primarily expressed in the Ramos B cell line; on the other hand, Clusters 12–15 represent groups of genes (212, 546, 511 and 365 genes, respectively) with largely balanced expression in the two cell lines. We note that such a visualization may be useful as a complement to a full differential expression analysis, as it enables a global characterization of the differences that are present between the two conditions.

Visualization of overall cluster behavior for the human liver RNA-seq data. For each cluster, bar plots of are drawn for each experimental condition, where the width of each bar corresponds to the estimated proportion
A functional enrichment analysis of GO biological processes revealed that, of the 15 clusters identified by HTSCluster, 11 were associated with at least one GO term. For example, Cluster 4 is associated with terms related to morphogenesis (GO:0048644, GO:0055008) and mesenchyme tissue development (GO:0060485), while Cluster 10 is associated with the negative regulation of action potential, nitric oxide synthase activity, oxidoreductase activity and leukocyte activity (GO:0045759, GO:0051001, GO:0051354, GO:0002695). In comparison, the closely related Poisson and negative binomial mixture models proposed by Si et al. (2014) were also estimated for K = 15 clusters; for these models, a total of 10 and 5 clusters, respectively, were associated with at least one GO term. Additional details may be found in the Supplementary Materials.
3.2 Simulation study
In this section, we perform a set of simulation experiments in order to compare the performance of the proposed Poisson mixture model to that of several alternative related approaches, described below.
3.2.1 Description of alternative clustering approaches
PoisL: Originally proposed for SAGE data, the PoisL approach (Cai et al., 2004) assumes that, given the cluster k, genes follow a Poisson distribution with mean , under the constraint that for all k; the existence of replicates within each condition is not taken into account in the original method. Using this model, a K-means algorithm is proposed, where each gene i is assigned to the cluster k at iteration b if . This procedure is exactly equivalent to the Poisson mixture model implemented in HTSCluster with equiprobable Poisson mixtures (i.e. for all k), parameter estimation via the CEM algorithm, and unreplicated data. For comparison with the other methods described here, we also include normalization factors sjl in the model as replicates are present.
Witten: Witten (2011) recently considered the issue of clustering samples, rather than genes, using RNA-seq data. After fitting a Poisson loglinear model to the power-transformed data (Li et al., 2012), complete linkage hierarchical clustering is applied to the dissimilarity matrix calculated using a modified loglikelihood ratio statistic to compare Poisson distributions. Although this method was originally proposed to cluster samples, Witten (2011) claims that it may also be used to cluster genes in RNA-seq data. This procedure is available in the R package PoiClaClu.
Si-Pois and Si-NB: Si et al. (2014) consider Poisson mixture models (Si-Pois) where with the constraint for all k, where aijl is a normalization factor that simultaneously accounts for the length of gene i and the library size of replicate l in condition j. An EM algorithm and two stochastic versions are proposed to estimate the remaining parameters. Following parameter estimation, a model-based hybrid-hierarchical clustering algorithm is developed to build a hierarchical tree. In addition, Si et al. (2014) also consider negative binomial mixture models (Si-NB) parameterized by the same mean as the Si-Pois method described above. A per-gene dispersion parameter is estimated by quasi-likelihood prior to fitting the model and considered to be fixed. The remainder of the Si-NB procedure is similar to Si-Pois, and both may be implemented using the R package MBCluster.Seq.
K-means: As a comparison, we also consider a classic K-means algorithm (MacQueen, 1967) with the usual Euclidian distance, which is applied to the gene expression profiles such that .
We note that model selection is not addressed by any of the methods described above; for this purpose, we use the index proposed by Caliński and Harabasz (1974) (Witten, Si-Pois, and Si-NB), which is a pseudo F-statistic that compares between and within-cluster dispersion, or the slope heuristics (PoisL) to select the appropriate number of clusters and deduce a data clustering. Although the slope heuristics could potentially be used for the Si-Pois and Si-NB methods, the output provided by their implementation in the MBCluster.Seq package renders this calculation difficult in practice. In addition, we note that all methods (with the exception of the Si-NB and classic K-means) are based on the use of an underlying Poisson model.
3.2.2 Simulation strategy
Using the parameter estimates obtained by HTSCluster in the fly and human liver RNA-seq datasets (described in Section 3.1), we simulated 50 datasets for each setting under a Poisson mixture model as in Equation (2) in the following manner.
For the simulations based on each real dataset (fly or human liver), the numbers of conditions and replicates per condition were fixed to be equivalent to the experimental design of each real dataset. The number of clusters K was fixed to be equivalent to 15; for the human liver data, this corresponds to the model selected via the slope heuristics, while for the fly data, 15 clusters were randomly chosen among the 48 estimated in the selected model. In addition, normalization factors sjl, cluster proportions πk (renormalized to sum to 1, in the case of the fly data), and cluster parameters λjk were fixed to be their estimated values for each dataset, and overall expression levels wi were fixed to be equal to the observed values. A total of n = 3000 genes were randomly sampled from the fly or human liver data, weighted by their maximum conditional probability from the selected HTSCluster model. For each sampled gene i, we sampled from the appropriate Poisson distribution , where if . In all datasets, we verified that simulated data were indeed represented by K = 15 clusters.
3.2.3 Results
For each simulated dataset in the two settings (fly and human liver), HTSCluster and the methods described in Section 3.2.1 were fit over a range of possible numbers of clusters (), with model selection performed using the slope heuristics (HTSCluster, PoisL) or the CH index (Witten, Si-NB, Si-Pois). In addition, for all competing approaches, the model with the true number of clusters (K = 15) was also included. Models were subsequently compared using the adjusted Rand index (ARI) (Hubert and Arabie, 1985) and the estimated number of clusters , shown in Tables 1 and 2, respectively. The oracle ARI is also included for comparison, based on the assignment of observations to components maximizing the conditional probability using the true parameter values in the Poisson mixture model.
Mean (SD) ARI for simulations with parameters based on the fly and human liver
Method . | Model selection . | Fly . | Human . |
---|---|---|---|
HTSCluster | capushe | 0.93 (0.05) | 0.61 (0.02) |
True K | 0.84 (0.09) | 0.60 (0.02) | |
PoisL | capushe | 0.79 (0.15) | 0.53 (0.05) |
True K | 0.82 (0.05) | 0.53 (0.04) | |
Witten | CH index | 0.15 (0.07) | 0.11 (0.03) |
True K | 0.67 (0.09) | 0.39 (0.04) | |
Si-Pois | CH index | 0.26 (0.17) | 0.48 (0.04) |
True K | 0.95 (0.02) | 0.61 (0.02) | |
Si-NB | CH index | 0.23 (0.16) | 0.47 (0.04) |
True K | 0.94 (0.02) | 0.60 (0.02) | |
K-means | True K | 0.79 (0.08) | 0.42 (0.02) |
Oracle | True K | 0.95 (0.01) | 0.63 (0.01) |
Method . | Model selection . | Fly . | Human . |
---|---|---|---|
HTSCluster | capushe | 0.93 (0.05) | 0.61 (0.02) |
True K | 0.84 (0.09) | 0.60 (0.02) | |
PoisL | capushe | 0.79 (0.15) | 0.53 (0.05) |
True K | 0.82 (0.05) | 0.53 (0.04) | |
Witten | CH index | 0.15 (0.07) | 0.11 (0.03) |
True K | 0.67 (0.09) | 0.39 (0.04) | |
Si-Pois | CH index | 0.26 (0.17) | 0.48 (0.04) |
True K | 0.95 (0.02) | 0.61 (0.02) | |
Si-NB | CH index | 0.23 (0.16) | 0.47 (0.04) |
True K | 0.94 (0.02) | 0.60 (0.02) | |
K-means | True K | 0.79 (0.08) | 0.42 (0.02) |
Oracle | True K | 0.95 (0.01) | 0.63 (0.01) |
Note: The largest values in each simulation setting are highlighted in bold font.
Mean (SD) ARI for simulations with parameters based on the fly and human liver
Method . | Model selection . | Fly . | Human . |
---|---|---|---|
HTSCluster | capushe | 0.93 (0.05) | 0.61 (0.02) |
True K | 0.84 (0.09) | 0.60 (0.02) | |
PoisL | capushe | 0.79 (0.15) | 0.53 (0.05) |
True K | 0.82 (0.05) | 0.53 (0.04) | |
Witten | CH index | 0.15 (0.07) | 0.11 (0.03) |
True K | 0.67 (0.09) | 0.39 (0.04) | |
Si-Pois | CH index | 0.26 (0.17) | 0.48 (0.04) |
True K | 0.95 (0.02) | 0.61 (0.02) | |
Si-NB | CH index | 0.23 (0.16) | 0.47 (0.04) |
True K | 0.94 (0.02) | 0.60 (0.02) | |
K-means | True K | 0.79 (0.08) | 0.42 (0.02) |
Oracle | True K | 0.95 (0.01) | 0.63 (0.01) |
Method . | Model selection . | Fly . | Human . |
---|---|---|---|
HTSCluster | capushe | 0.93 (0.05) | 0.61 (0.02) |
True K | 0.84 (0.09) | 0.60 (0.02) | |
PoisL | capushe | 0.79 (0.15) | 0.53 (0.05) |
True K | 0.82 (0.05) | 0.53 (0.04) | |
Witten | CH index | 0.15 (0.07) | 0.11 (0.03) |
True K | 0.67 (0.09) | 0.39 (0.04) | |
Si-Pois | CH index | 0.26 (0.17) | 0.48 (0.04) |
True K | 0.95 (0.02) | 0.61 (0.02) | |
Si-NB | CH index | 0.23 (0.16) | 0.47 (0.04) |
True K | 0.94 (0.02) | 0.60 (0.02) | |
K-means | True K | 0.79 (0.08) | 0.42 (0.02) |
Oracle | True K | 0.95 (0.01) | 0.63 (0.01) |
Note: The largest values in each simulation setting are highlighted in bold font.
Mean (SD) estimated number of clusters determined by the slope heuristics (HTSCluster and PoisL) or the CH index (Witten, Si-Pois, Si-NB) for simulations with parameters based on the fly and human liver data
. | HTSCluster . | PoisL . | Witten . | Si-Pois . | Si-NB . |
---|---|---|---|---|---|
Criterion | capushe | capushe | CH | CH | CH |
Fly | 19.9 (5.3) | 14.1 (5.1) | 2.3 (0.6) | 3.9 (2.4) | 3.3 (1.9) |
Human | 21.2 (5.5) | 15.9 (4.0) | 2.8 (0.4) | 8.4 (2.1) | 8.4 (2.4) |
. | HTSCluster . | PoisL . | Witten . | Si-Pois . | Si-NB . |
---|---|---|---|---|---|
Criterion | capushe | capushe | CH | CH | CH |
Fly | 19.9 (5.3) | 14.1 (5.1) | 2.3 (0.6) | 3.9 (2.4) | 3.3 (1.9) |
Human | 21.2 (5.5) | 15.9 (4.0) | 2.8 (0.4) | 8.4 (2.1) | 8.4 (2.4) |
Note: The true number of clusters for all simulations was fixed to K = 15.
Mean (SD) estimated number of clusters determined by the slope heuristics (HTSCluster and PoisL) or the CH index (Witten, Si-Pois, Si-NB) for simulations with parameters based on the fly and human liver data
. | HTSCluster . | PoisL . | Witten . | Si-Pois . | Si-NB . |
---|---|---|---|---|---|
Criterion | capushe | capushe | CH | CH | CH |
Fly | 19.9 (5.3) | 14.1 (5.1) | 2.3 (0.6) | 3.9 (2.4) | 3.3 (1.9) |
Human | 21.2 (5.5) | 15.9 (4.0) | 2.8 (0.4) | 8.4 (2.1) | 8.4 (2.4) |
. | HTSCluster . | PoisL . | Witten . | Si-Pois . | Si-NB . |
---|---|---|---|---|---|
Criterion | capushe | capushe | CH | CH | CH |
Fly | 19.9 (5.3) | 14.1 (5.1) | 2.3 (0.6) | 3.9 (2.4) | 3.3 (1.9) |
Human | 21.2 (5.5) | 15.9 (4.0) | 2.8 (0.4) | 8.4 (2.1) | 8.4 (2.4) |
Note: The true number of clusters for all simulations was fixed to K = 15.
In both simulation settings considered here, we note that a major difficulty for the alternative methods is the choice of the number of clusters to be included; the Witten, Si-Pois, and Si-NB methods all exhibit significantly lower ARI values when model selection is performed using the CH index as compared to when the true number of clusters is fixed. This difficulty appears to be especially pronounced for the Witten approach, where ARI values for the model selected using the CH index is less than 0.2 in both simulation settings and the selected number of clusters is significantly underestimated. For the Si-Pois and Si-NB methods, the CH index also tends to underestimate the number of clusters present in the data, although this trend is less marked than for the Witten approach, particularly in the human simulated data. In the case of the PoisL approach, although the slope heuristics approach appears to generally yield an appropriate estimate of K in both simulation settings, the corresponding ARI values tend to be lower than those attained by the HTSCluster approach.
Even when the number of clusters is fixed to the true value, the competiting methods tend to have equivalent or smaller ARI values than the models selected via slope heuristics for the proposed HTSCluster approach, in spite of the fact that all approaches (with the exception of Si-NB and K-means) also make use of an underlying Poisson model. In other words, if the true number of clusters is known, the performance of the Si-Pois and Si-NB approaches is quite good and very nearly attains that of the oracle model; however, when the number of clusters must be estimated from the data (as is typically the case in real applications), HTSCluster has much stronger performance than the competing methods on these simulated data. Model selection via the slope heuristics for the HTSCluster approach leads to a slight overestimation of the number of clusters, but these slightly more complex models have ARI values close to those found using the oracle Poisson mixture model. Finally, we note that the clustering task in the human liver setting (where only two conditions are present) appears to be much more difficult for all methods considered here than in the fly setting (where 12 unreplicated conditions are present), as evidenced by the smaller oracle ARI value. This is perhaps unsurprising, as it is more difficult to discern differing cluster profiles for only two conditions than when multiple conditions are available; this can be seen in the overall cluster behavior in the two real data analyses (Figs 1 and 2).
4 Discussion
In this work, we have proposed a method and associated R package HTSCluster to cluster count-based DGE profiles based on a Poisson mixture model that enables the use of a rigorous framework for parameter estimation (through the EM algorithm) and model selection (through the slope heuristics). The model is parameterized to account for several characteristics of RNA-seq data, including: (i) a set of normalization factors (sjl) to account for systematic differences in library size among biological replicates, (ii) a per-gene offset parameter (wi) to account for differences among genes due to overall expression level and (iii) a condition-specific cluster effect (λjk). As the marginal sums of each gene are fixed in the model, variations in expression among experimental conditions may be modeled throughout the extremely large dynamic range of DGE typical of RNA-seq data. In particular, this parameterization enables a straightforward interpretation of the model, as corresponds to the proportion of reads attributed to condition j in cluster k. A co-expression analysis on two sets of real RNA-seq data highlighted the functionality of HTSCluster in practice, in particular with respect to model selection and visualization of overall cluster behavior. Finally, the processing time and memory requirements of HTSCluster reflect the fact that parameter estimation must be performed over a large set of models to enable model selection; one run of HTSCluster (version 2.0.4) took about 50 minutes and used about 450 MB of memory for the human liver data (), and about 2 h with 1800 MB of memory for the fly developmental data (). (All analyses were run on a Dell Latitude E6530 quad-core 2.70 GHz Intel(R) Core(TM) with 10GB RAM, running a 64-bit version of Windows 7 Professional.)
As previously mentioned, HTSCluster shares some similarities with other related approaches, although there are several key differences. First, we note that both PoisL (Cai et al., 2004) and Witten (2011) also make use of an underlying Poisson model; however, rather than using a finite mixture model, the former uses a K-means algorithm based on the loglikelihood and the latter applies a hierarchical clustering procedure based on a pairwise dissimilarity matrix of dimension (n × n). On the other hand, Si et al. (2014) suggest the construction of a hierarchical tree of either Poisson (Si-Pois) or negative binomial (Si-NB) mixture models with an alternative parameterization to that proposed here. Contrary to all of these alternative related approaches, the HTSCluster approach provides a straightforward and robust way to choose the number of clusters present in a given dataset.
A set of simulation studies, with parameters selected based on two real datasets, allowed a comparison of HTSCluster with the aforementioned related approaches in a controlled scenario. These simulations highlighted the importance of an appropriate procedure to perform model selection, as well as the satisfactory performance of HTSCluster in the objective of clustering and estimating the number of clusters. In addition, even when the number of clusters was fixed to the true value, we found that the alternative methods were generally observed to have similar or lower ARI values than HTSCluster. However, conclusions from these simulations should be drawn with some caution, particularly as the data were simulated based on a mixture of Poisson distributions. A great deal of discussion has focused on the most appropriate way to simulate RNA-seq data in the context of differential expression (Soneson and Delorenzi, 2013), and for the time being this remains an open question for co-expression analyses.
Finally, we note that in the context of differential expression analyses, the scientific community has generally focused on the use of negative binomial models due to the large variability typically observed among replicates for a fixed gene. This so-called overdispersion is modeled via the inclusion of a common dispersion parameter or a per-gene dispersion parameter , typically estimated using a shrinkage approach (Robinson and Smyth, 2007) or a parametric regression fit across all genes (Anders and Huber, 2010). The Si-NB approach (Si et al., 2014) recently attempted to apply a similar approach to the task of co-expression analysis through a finite mixture of negative binomial models, where is estimated from the data using a quasi-likelihood approach and treated as fixed in the mixture. However, for co-expression analyses it is difficult to estimate these per-gene dispersion parameters in practice due to the small number of replicates typically available in experiments concerning multiple conditions. A useful direction for future research may be to define a mixture of negative binomial models in which information about this dispersion parameter is shared among genes belonging to the same cluster.
Acknowledgements
The authors thank the members of the Statistics for Systems Biology (SSB) working group for their helpful and insightful comments, as well as the three anonymous reviewers for their helpful comments. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant MixStatSeq (ANR-13-JS01-0001-01).
Funding
A.R. was funded as a postdoctoral researcher at Inria Saclay - Île-de-France for a portion of this work.
Conflict of Interest: none declared.
References