QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Illustration of QUBIC2 algorithm design. (a) Discretization of gene expression data. The LTMG model was applied to fit the expression profile of a gene. A representing row for each gene was generated with integers denoting the most likely component distribution that an expression value belongs to, and then this representing row was split into multiple rows. Finally, a binary representing matrix was generated after applying the above procedure to all the genes; (b) Core biclustering. By sorting all the gene pairs (seeds) in decreasing order of their weight, an initial seed list was obtained. For a feasible seed, QUBIC2 selected highly correlated genes with the seed in support of the build of an initial Core. (c) Dual biclustering. QUBIC2 expanded the Core vertically and horizontally to recruit more genes and conditions under a preset consistency level, respectively. The intersected zone created by extended genes and conditions formed a searching pool. QUBIC2 identified a Core in the pool (denoted as Dual) using the same procedure in b and output the bicluster with genes and conditions that came from the Core and Dual

All the gene pairs (seeds) were sorted in decreasing order of their weight, which is defined as the number of conditions under which the two genes have 1 s in M_R(Fig. 1b). The algorithm will iterate this list of seeds. Specifically, given a feasible seed, it will recruit genes and conditions to build an initial Core. We assumed that a bicluster should consist of genes receiving the same TRSs in a subset of conditions, while the genes outside the bicluster may receive different TRSs. According to our LTMG model, genes receiving the same TRSs under certain conditions will be labeled as 1 s under those conditions, giving rise to a clue to group these genes. The Kullback–Leibler divergence score (KL score) is designed to select candidate genes and conditions such that the 1 s concentrate within biclusters while 0 s mainly appear in the outside, i.e. it makes sure the difference between a bicluster and its background is more significant than the difference between an arbitrary same-size submatrix and its background. The KL score can quantify this difference and facilitate bicluster optimization and candidate selection (Fig. 1b. Details in KL score for biclusters optimization).

The above process will generate a Core bicluster consisting of 1 s. We believe that some 0 s outside the cores might be dropouts, therefore the core needs to be expanded. Since it is difficult to determine cutoffs for the expansion, we first expand the Core both horizontally and vertically, and then we heuristically search for another Core in the expanded region, so-called a Dual bicluster (Fig. 1c). The genes and conditions from both the Core and Dual constitute a submatrix (I, J) of M_R, which is one bicluster ready for output. We assumed that 0 s induced in this way are more likely to be dropouts and can support gene expression recovery and further FGM identification.

Furthermore, a statistical framework based on the size of the biclusters was implemented to calculate a P-value for each of the identified biclusters. The problem of assessing the significance of identified biclusters was formulated as calculating the probability of finding at least one submatrix enriched by 1 from a binary matrix with a given size, with a beta distribution employed during the process. This P-value framework enables users to evaluate the statistical significance of all the identified biclusters, especially for those from less-annotated organisms.

2.3 Left-truncated Mixed Gaussian (LTMG) model for gene expression data discretization

To accurately model the gene expression profile of RNA-Seq and scRNA-Seq data, we specifically developed a mixed Gaussian model with a left truncation assumption. Denoting the log-transformed FPKM, RPKM, or CPM expression values of gene

X

over

N

conditions as

X = \{x_{1}, \dots x_{n}\}

⁠, we assumed that

x_{j} \in X

follows a mixture of

k

Gaussian distributions corresponding to

k

possible TRSs. The density function of

x_{j}

is:

p (x_{j}; Θ) = \sum_{i = 1}^{k} α_{i} p (x_{j}; θ_{i}) = \sum_{i = 1}^{k} α_{i} \frac{1}{\sqrt{2 π} σ_{i}} e^{\frac{- {(x_{j} - μ_{i})}^{2}}{2 {σ_{i}}^{2}}}

And the density function of X is:

p (X; Θ) = \prod_{j = 1}^{n} p {(x}_{j}; Θ) = \prod_{j = 1}^{n} \sum_{i = 1}^{k} α_{i} p (x_{j}; θ_{i}) = \prod_{j = 1}^{n} \sum_{i = 1}^{k} α_{i} \frac{1}{\sqrt{2 π} σ_{i}} e^{\frac{- {(x_{j} - μ_{i})}^{2}}{2 {σ_{i}}^{2}}} = L (Θ; X)

where

α_{i}

is the mixing weight,

μ_{i}

and

σ_{i}

are the mean and standard deviation of ith Gaussian distribution, which can be estimated by an EM algorithm with given

X

⁠:

Θ^{*} =_{Θ}^{\arg \max L (Θ; X)}

To model the errors at zero and the low expression values, we introduce a parameter

Z_{cut}

for each gene expression profile and consider the expression values smaller than

Z_{cut}

as left-censored data. With the left truncation assumption, the gene expression profile is split into

M

truly measured expression values (>

Z_{cut}

⁠) and

N - M

left-censored gene expression values (≤

Z_{cut}

⁠) for the

N

conditions. Latent variables

y_{j}

and

Z_{j}

are introduced to estimate

Θ

by the following Q function:

Q (Θ; Θ^{t - 1}) = \sum p (y_{j} | x_{j}; Θ^{t - 1}) \sum_{j = 1}^{m} \sum_{i = 1}^{k} \log (α_{i} p (x_{j}; μ_{i}, σ_{i})) + \sum p (y_{j} | z_{j}; Θ^{t - 1}) \sum_{j = m + 1}^{n} \sum_{i = 1}^{k} \log (α_{i} p (z_{j}; μ_{i}, σ_{i}))

The parameters

Θ

that maximize the likelihood function can be estimated by an EM algorithm (Supplementary Method S2), and the number of Gaussian components is selected by the Bayesian Information Criterion (Supplementary Method S3). Then the original gene expression values are labeled to the most likely distribution under each cell. In detail, the probability that

x_{j}

belongs to distribution

i

is formulated by:

p (x_{j} \in TRS i || K, Θ^{*}) \propto \frac{α_{i}}{\sqrt{2 π {σ_{j}}^{2}}} e^{\frac{- {(x_{j} - μ_{i})}^{2}}{{2 σ_{i}}^{2}}}

And $x_{j}$ is labeled by TRS $i$ if $p (x_{j} \in TRS i || K, Θ^{*}) = \max_{i = 1, \dots, K} (p (x_{j} \in TRS i || K, Θ^{*}))$ ⁠. In such a way, a row consisting of discrete values (1, 2, …, $K$ ⁠) for each gene will be generated.

2.4 KL score for biclusters optimization

A KL score is introduced in QUBIC 2 to guide candidate-selection and biclustering optimization. The KL score of a bicluster is defined as:

{KL}_{B} = \frac{1}{N} \sum_{j = 1}^{N} \sum_{i \in {0, 1}} R (i, j) \times \log \frac{R (i, j)}{Q (i, j)} + \frac{1}{M} \sum_{k = 1}^{M} \sum_{i \in {0, 1}} C (i, k) \times \log \frac{C (i, k)}{P (i, k)}

where

N

and

M

are the numbers of rows and columns of a submatrix B in M_R, respectively.

R (i, j)

represents the proportion of element

i

in row

j

of B,

Q (i, j)

is the proportion of

i

in the corresponding entire row,

C (i, k)

is the proportion of

i

in column

k

of B and

P (i, k)

is the proportion of

i

in the entire corresponding column. Meanwhile, the KL score for a gene quantifies the similarity between a candidate gene

j

and a bicluster, which is defined as follows:

{KL}_{j} = \sum_{i \in {0, 1}} R (i, j) \times \log \frac{R (i, j)}{Q (i, j)}

where

R (i, j)

represents the proportion of

i

under corresponding columns of the current bicluster.

2.5 Evaluation of the functional modules

The capability of algorithms to recapitulate known functional modules is assessed using precision and recall. First, for each identified bicluster, we use the P-value of its most enriched functional class (biological pathway) as the P-value of the bicluster (Li et al., 2009). The bicluster is deemed enriched with that function if its P-value is smaller than a specific cutoff (e.g. 0.05).

Given a group of biclusters identified by a tool under a parameter combination, the precision is defined as the fraction of observed biclusters whose genes are significantly enriched with the one biological pathway/known modules (Benjamini-Hochberg adjusted p < 0.05),

Precis i on = \frac{# of significant biclusters}{# of biclusters}

For recall, we compute the fraction of known modules that were rediscovered by the algorithms among all known modules in a functional annotation database,

Recall = \frac{# of significant modules}{# of modules}

Precision and recall reflect relevance and diversity of the biclusters, respectively. To balance these two aspects, the harmonic mean of precision and recall was calculated to represent the performance of an algorithm on a given dataset and parameter setting, denoted as f score:

f = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}

Note that the number of biclusters used to calculate precision and recall may affect the results. To make sure the evaluation is as fair as possible, for each dataset, we select the first 30 biclusters.

2.6 Size-based P-value framework

For well-annotated organisms, the P-value of an identified bicluster enriched with a specific regulatory pathway can be calculated based on a hypergeometric distribution. However, known experimental annotations are currently limited, even for most well-studied model organisms (for example, half of the protein-coding genes of E.coli have solid experimental evidence for their functions in KEGG and GO) (Monk et al., 2014). This status still limits the capability of a systematic evaluation of all the identified biclusters. To fill this gap, we calculate an alternative size-based P-value as follows. For a binary representing matrix M_R containing

m_{0}

rows and

n_{0}

columns, suppose we obtain an

m_{1}

-by-

n_{1}

bicluster M₁ with all the elements being 1 s. The probability of

n_{1} \geq W

can be assessed by the following formula (Sun and Nobel, 2008), giving rise to a P-value for the bicluster M₁:

\begin{matrix} P (n_{1} \geq W) = \lim_{n \to \infty} {n_{0}}^{- (β + 1) (W - s (n_{1}, n_{0}, β))} {({log}_{b} n_{0})}^{β + 1} \\ where α = \frac{m_{0}}{n_{0}}, β = \frac{m_{1}}{n_{1}}, b = \frac{1}{p}, p = P (M_{i, j} = 1) = 1 - P (M_{i, j} = 0) for \forall i, j \\ s (n_{1}, n_{0}, β) = \frac{β + 1}{β} {log}_{b} n_{0} - \frac{β + 1}{β} {log}_{b} (\frac{β + 1}{β} {log}_{b} n_{0}) + {log}_{b} α \\ + \frac{(1 + β) {log}_{b} e - β {log}_{b} β}{β} \end{matrix}

2.7 Pseudo algorithm design of QUBIC2

Step 1 (Data discretization and qualitative representation): Given an expression matrix with log-transformed FPKM, RPKM or CPM values for genes, use the LTMG model to fit the data. Label the values to the most likely distribution to get a representing row for each gene. Split these rows into multiple rows to get the representative matrix M_R(Supplementary Fig. S2a).

Step 2 (Graph construction and seed selection): Construct a weighted graph for M_R, where nodes correspond to genes and edges connect every pair of genes. The edge weight indicates the similarity between the two corresponding genes, which is defined as the number of conditions in which the two genes have 1 s in M_R. Sort the list of edges in $G$ in decreasing order of their weights (i.e. $w {(e}_{1}) \geq w {(e}_{2}) \geq \dots, w {(e}_{| E |})) .$ An edge $e_{ij} = g_{i} g_{j}$ is selected as a seed if and only if at least one of $g_{i} and g_{j}$ is not in any previously identified biclusters, or $g_{i} and g_{j}$ are in two nonintersecting biclusters in terms of genes. Select a feasible seed from the seed list. Go to Step 6 if the seed list is empty (Supplementary Fig. S2b).

Step 3 (Build core bicluster): Build an initial bicluster by finding all the conditions under which the two genes of the seed have 1 s in M_R. Set these columns of the two genes as the current bicluster B = (I, J). Expand B by adding a new gene that has the most 1 s in J, giving rise to a new bicluster B’ = (I’, J’), where I’ is I after adding the new gene and J’ is J by deleting those columns with 0 s. If two genes have the same number of 1 s in J, choose the one with larger KL similarity with B (Supplementary Fig. S2c). If KL_B_′ > KL_B, set B to B′ and repeat Step 2, otherwise stop and denote B as Core. Go to Step 4.

Step 4 (Core expansion): Expand the Core horizontally and vertically under preset consistency level as follows: for each gene (row) i not in B, if the ratio between the number of 1 s in row i under J and |J| is ≥c, mark it as an extended gene; for each condition (column) j not in B, if the ratio between the number of 1 s in the column j among I and |I| is ≥c, mark it as an extended condition. (Supplementary Fig. S2d). Mark the intersected zone created by extended genes and conditions as a Dual searching pool (light blue box in Supplementary Fig. S2e). Go to Step 5.

Step 5 (Search Dual): Search Dual in the intersected expanded zone, using the same process in Step 3, output the bicluster with genes and conditions that come from Core and Dual (red box in Supplementary Fig. S2e). Delete current seed, go to step 2.

Step 6 (Biclusters filtering): Rank all the identified biclusters in decreasing order of their sizes (#row x #column). Output the first bicluster. Go to the second bicluster, skip it if its overlap with the first bicluster exceeds f, otherwise output it. Go to the next bicluster, skip it if it overlaps with the either of the biclusters in output exceeds f, otherwise output it. Continue until the last bicluster is checked. Go to Step 7.

Step 7 (Enrichment validation): For each output bicluster, we use the P-value of its most enriched functional class (biological pathway) as the P-value of the bicluster (details in Evaluation of the functional modules section). A bicluster is deemed enriched with a function if its P-value is smaller than a specific cutoff (e.g. 0.05).

3 Results

3.1 QUBIC2 achieves a better performance in the identification of FGMs compared to other methods

We compared QUBIC2 with eight biclustering algorithms, which include those evaluated as the top performed algorithms in the public domain (Bimax, ISA, FABIA, Plaid and QUBIC) and those published in the recent three years [BicMix (Gao et al., 2016), Runibic and EBIC (Orzechowski et al., 2018)] (Supplementary Method S4 and Supplementary Tables S3 and S4). The comparison performance in FGMs identification was evaluated by the precision, recall and f-score defined by the harmonic mean of the precision and recall. Eight gene expression datasets were used in the evaluation covering simulated, microarray, bulk RNA-seq and scRNA-Seq data in E.coli and Human. To minimize the biases in performance comparison among multiple tools, for each of the eight datasets, we run the nine tools using more than 50 parameter combinations by adjusting their critical parameters around default/recommended values (Details in the Section 2).

As showcased in Figure 2a, QUBIC2 achieved the highest median f-scores (f) on almost all the eight datasets. On simulation data, QUBIC2, QUBIC and FABIA were the top three algorithms with very close median f-scores. On microarray data and bulk RNA-Seq data from E.coli, QUBIC2 demonstrated particularly outstanding advantages. Specifically, the f-scores of QUBIC2 were significantly higher than the second-best algorithms, i.e. FABIA, Plaid and QUBIC, with one-tailed Wilcoxon test P-values as 1.67e-10, 6.72e-30 and 2.23e-62, respectively. It is also noteworthy that QUBIC2 and ISA were the best choices on the scRNA-Seq datasets. EBIC achieved the best performance on only one scRNA-Seq dataset, while its performance on the other scRNA-Seq dataset was disappointing.

Fig. 2.

Overall performance of QUBIC2. (a) Bar plot: Distribution of f-scores on each of the eight datasets under multiple runs (with at least 50 parameter combinations). Bars denote the median of the dots. Q2, E, Bic, R, F, Q, B, I and P represent QUBIC2, EBIC, BicMix, Runibic, FABIA, QUBIC, Bimax, ISA and Plaid, respectively; Heatmap: Relative performance of six algorithms regarding d, v, p and r, respectively (normalized over six algorithms). Note that v depends on the increment of parameters and therefore is only indicative; (b) Lollipop plot: Overall scores of each algorithm summed across eight datasets; Spider plot: averaged scores of each algorithm in terms of f, d, v, p, r

In regards to f-scores under the default parameters (d), QUBIC2 consistently had the highest values across all eight datasets, suggesting that users can trust QUBIC2’s default settings in practical studies for getting satisfactory results. QUBIC ranked as the second-best on two datasets, namely one simulated dataset and one RNA-Seq dataset from E.coli; ISA was suboptimal on two scRNA-Seq datasets; and Bimax, FABIA and BicMix occupied second place on one simulated dataset, microarray dataset from E.coli and RNA-Seq dataset from human, respectively. QUBIC2 performed well in both precision (p) and recall (r), indicating that the identified FGMs are relevant and diverse. QUBIC2 also had a relatively small variance of f-score (v), while the performance of some algorithms on specific datasets was susceptible to parameter changes (e.g. FABIA, Bimax and EBIC). ISA, QUBIC and BicMix were generally very stable, and their variances were often the second smallest among datasets. As for Bimax, although its recall was relatively low, it was characterized with high precision on four datasets.

3.2 QUBIC2 reaches robust performance across various datasets and different sequencing technologies

To intuitively show the performance of a given algorithm, we further assigned scores to it based on its rank among all the six algorithms, in terms of f, d, v, p and r, respectively. Specifically, a score of 9 was assigned to the algorithm that has the best performance, and a score of 1 was assigned to the algorithm that has the worst performance. For f, d, p and r, algorithms with higher values had higher scores, while for v, those with lower values had higher scores. We then calculated each algorithm’s summed and average scores on eight datasets to get an idea about their overall performance and their respective strengths and weaknesses. Specifically, QUBIC2, QUBIC and BicMix were the top three algorithms across the eight datasets according to the above scores (Fig. 2b). Overall, QUBIC2 achieved the highest median f-scores and default f-scores, and for most of the cases, it also had higher precision and recall as well as lower variance than the others. The rest of the algorithms had unique advantages: QUBIC and BicMix had higher default and median f-scores, and QUBIC also had higher recall. ISA was remarkable for its low variance, and both Plaid and Bimax had relatively high precision. EBIC and FABIA were rather balanced tools, performing neither badly nor well in the five aspects.

The performance of these nine algorithms across various datasets and different sequencing technologies was also evaluated. QUBIC2 is recommended among simulation, microarray, RNA-Seq and scRNA-Seq data due to its highest median f-score, default f-score, median precision and median recall, as well as its lowest variance. QUBIC had obvious advantages on simulation and RNA-Seq data; BicMix was also a good choice for RNA-Seq data; FABIA performed well on microarray data; both ISA and EBIC proved to be outstanding on scRNA-Seq data; and Plaid ranked the third-best on simulation and microarray data. While Bimax had no comprehensive advantages, it is known for high precision on simulation and RNA-Seq data. As far as species/organisms are concerned, QUBIC2 had the best performance on E.coli and human data, with its ranks in terms of f, d, v, p and r almost always at the top; Plaid was extraordinary in terms of low variance on data from E.coli and high precision on data from human; and ISA was balanced and second only to QUBIC2’s performance on data from human (Supplementary Fig. S3).

3.3 QUBIC2 identifies FGMs with more biological characteristics

The above section demonstrated the outstanding performance of QUBIC2 on FGMs identification, and in this section, we used a specific case to illustrate why QUBIC2 has such excellent power. Since QUBIC was overall the second-best algorithm, we took it as a reference in the illustration. A bicluster with 963 genes and 30 embryonic cells (BC001_Q) was identified from Yan’s data using QUBIC, with 16 cells being in the eight-cell stage and 14 cells being in the Morulae stage; and QUBIC2 found a bicluster consisting of 1, 264 genes and 15 embryonic cells (BC018_Q2), with all the cells being in the eight-cell stage. Six hundred thirty-one genes appeared in both BC018_Q2 and BC002_Q (shared genes) (Fig. 3a). Due to the considerable overlap between the two biclusters, we concluded that BC001_Q and BC018_Q2 share the same Core. Functional enrichment analysis was conducted for the 631 shared genes based on the Reactome database, and the top ten most enriched pathways were identified (Fig. 3b). For each of the ten pathways, the adjusted P-values were significantly lower in the genes of BC018_Q2 than in the shared genes (one-tailed paired Wilcoxon-test P-value = 0.000976), while genes of BC001_Q always had a significantly higher adjusted P-value than did the shared genes (one-tailed paired Wilcoxon-test P-value = 0.000977). The above phenomena suggest that given a pathway, QUBIC2 tends to capture a higher proportion of genes involved in that pathway than does QUBIC. Hence, it is appropriate to claim that given the same Core, QUBIC2 is more likely than QUBIC to recruit genes that are functionally related to the core.

Fig. 3.

Comparison of two biclusters identified by QUBIC2 and QUBIC, respectively. (a) Visualization of the two biclusters. Red box denotes bicluster from QUBIC2, and a blue box indicates bicluster from QUBIC. The overlapped genes are marked as shared genes. The colors of the bar at the top denote cell types, with blue representing the eight-cell stage and red being the Morulae stage; (b) Comparison of adjusted P-values for the top ten most enriched pathways of shared genes, QUBIC genes and QUBIC2 genes

Among the top ten most enriched pathways, DNA methylation encountered the most apparent change in terms of adjusted P-values. DNA methylation provides a potential epigenetic mechanism for maintaining cellular memory during self-renewal. The DNA methylation pathway consists of 32 genes, and QUBIC2 captured 20 of them. Compared to QUBIC, QUBIC2 additionally secured seven genes DNMT1, DNMT3L, HIST1H2BO, HIST1H2BJ, HIST1H2AJ, HIST1H2BB and HIST1H2AB, all of which play essential roles in the DNA methylation pathway. It is also noteworthy that QUBIC2 has more comprehensive coverage of the histone family genes, which play essential roles in DNA strands compaction and chromatin regulation. An overview of pathways that BC018_Q2 genes are involved in are provided in Supplementary Figure S4.

3.4 QUBIC2 provides a robust statistical evaluation framework for all the identified biclusters

The significance of gene modules from the identified biclusters is usually evaluated by pathway enrichment analysis. However, many organisms (including E.coli and human) have limited functional annotations supported by experimental verifications, which makes a systematic evaluation of all identified biclusters non-trivial. To fill this gap, a statistical method was proposed in this study, which can calculate a P-value for a bicluster purely based on its size (number of genes and conditions).

To validate the rationality of our statistical method, QUBIC2 was run on the RNA-Seq data from E.coli under multiple parameter settings, and a correlation test was conducted between the P-values of biclusters calculated via pathway enrichment analysis (named knowledge-based P-value) and the corresponding size-based P-values (Fig. 4a, details in Supplementary Method S5). Note that to distinguish the two, italic lowercase p was used to denote the p-value of the Spearman correlation test, while italic uppercase P was used to denote the significance of biclusters. Interestingly, we found that there was a strong correlation between the size-based P-values and the knowledge-based P-values. The average Spearman correlation coefficients (ρ) were higher than 0.40 (ComTF_ρ = 0.50, TF_ρ = 0.57, KEGG_ρ = 0.47, SEED_ρ = 0.46 and ECO_ρ = 0.44), and the average p-values for the correlation tests were smaller than 0.01 (Fig. 4b and c), suggesting that the correlations between knowledge-based P-values and size-based P-values were statistically significant at 0.01 level. As showcased in Figure 4, all the ρs in the five groups were positive. Also, ρs related to regulatory pathways (i.e. TF_ ρ and ComTF_ ρ) were generally larger than ρs those related to metabolic pathways (i.e. KEGG_ ρ and SEED_ ρ). This indicated that the size-based P-value seemed to be more suitable for the evaluation of biclusters in terms of their regulatory significance. Besides, the parameter f, which controls the level of overlaps between biclusters, had a negative association with ρ (Supplementary Fig. S5), signifying that the size-based P-values had a stronger association with knowledge-based P-values when the overlaps between biclusters are relatively low.

Fig. 4.

Validation on the rationality of the statistical framework. (a) The workflow of correlation test between knowledge-based P-value and size-based P-value; (b) The distribution of correlation coefficients (ρ) between P-value obtained from enrichment analysis and size-based P-value. We run QUBIC2 under 70 different parameter settings, and ρ was calculated under each run; (c) Scatter plot of correlation coefficients and p-values. The y-axis denotes the correlation coefficient for the spearman association test, and the x-axis denotes the p-value of the association test

4 Conclusion and discussion

QUBIC2 is a novel biclustering algorithm developed for FMGs detection from both simulated and real gene expression data and shows superior and robust performance across diverse biotechnology platforms, including microarray, bulk RNA-Seq and scRNA-Seq. It is empowered by three unique features, which contribute to its significant advantages in functional gene modules detection. The proposed significance evaluation framework based on bicluster size will provide a solid foundation for the systematic evaluation of all identified biclusters, especially for those from non-model species. We believe that QUBIC2 can serve for biologists and bioinformaticians as a useful tool to extract novel biological insights from large-scale gene expression data, and we believe that it can facilitate the study of distinct responses by different cell types in the same population when encountered by the same stimuli or stresses. However, most of the biclustering methods, including QUBIC2, encountered a dramatic performance drop on scRNA-Seq data, especially 10X genomics based scRNA-Seq data, compared to other RNA-Seq data. To fully excavate the potential of scRNA-Seq data, we must overcome several challenges.

First, as sequencing costs decrease, more massive scRNA-Seq datasets will become increasingly common [e.g. researchers can easily obtain scRNA-Seq data covering tens of thousands of cells via 10x Genomics (Freytag et al., 2018)]. Thus, the scalability to large datasets and the efficiency of tools will become the bottlenecks. Currently, the discretization and dual searching functions of QUBIC2 are time-consuming (Supplementary Note S3 and Supplementary Fig. S6). To improve the efficiency of QUBIC2, the OpenMP method will be implemented in the EM steps for discretization.

Another challenge lies in the interpretation of time series data, which provides the potential to elucidate the cell trajectory process and reconstruct causal gene regulatory networks from observational data. However, current scRNA-Seq protocols lose the true temporal coupling between measurements; thus existing tools fail to identify causal interactions to a satisfactory level (Qiu et al., 2018). QUBIC2 was applied to another temporal dataset (Supplementary Result S1 and Supplementary Fig. S7) and discovered biclusters specific to time point. However, QUBIC2 could only separate cells collected at different time points, and the further finer differentiation was not captured. We are developing a new framework to rebuild the timeline of cell differentiation from scRNA-Seq data, followed by a new algorithm to extract causal interactions from time-series expression data. The new algorithm will first group the expression change tendencies into several representative patterns to reduce the dimension of the timeline but keep the correlation with time delay. Then it will detect significant biclusters by taking expression values and change tendency patterns into consideration simultaneously.

Acknowledgements

We thank Minxuan Sun and Weiliang Liu for testing the QUBIC2 program.

Funding

This work was supported by an R01 award #1R01GM131399-01 from the National Institute of General Medical Sciences of the National Institutes of Health. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation #ACI-1548562. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation and the National Institutes of Health.

Conflict of Interest: none declared.

References

Bacher

R.

,

Kendziorski

C.

(

2016

)

Design and computational analysis of single-cell RNA-sequencing experiments

.

Genome Biol

.,

17

,

63.

Bengtsson

M.

et al. (

2005

)

Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels

.

Genome Res

.,

15

,

1388

–

1392

.

Bergmann

S.

et al. (

2003

)

Iterative signature algorithm for the analysis of large-scale gene expression data

.

Phys. Re. E

,

67

,

031902.

Cao

S.

et al. (

2018

) Bi-clustering based biological and clinical characterization. bioRxiv, 508275.

Chen

X.

et al. (

2016

)

Genome-scale identification of cell-wall-related genes in switchgrass through comparative genomics and computational analyses of transcriptomic data

.

BioEnergy Res

.,

9

,

172

–

180

.

Cohen

A.C.

(

1959

)

Simplified estimators for the normal distribution when samples are singly censored or truncated

.

Technometrics

,

1

,

217

–

237

.

Eren

K.

et al. (

2013

)

A comparative analysis of biclustering algorithms for gene expression data

.

Brief. Bioinf

.,

14

,

279

–

292

.

Faith

J.J.

et al. (

2008

)

Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata

.

Nucleic Acids Res

.,

36

,

D866

–

870

.

Freytag

S.

et al. (

2018

)

Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data [version 1; referees: 1 approved, 2 approved with reservations]

.

F1000Research

,

7

,

1297

.

Gao

C.

et al. (

2016

)

Context specific and differential gene co-expression networks via Bayesian biclustering

.

PLoS Comput. Biol

.,

12

,

e1004791

.

Gierahn

T.M.

et al. (

2017

)

Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput

.

Nat. Methods

,

14

,

395

.

Goodwin

S.

et al. (

2016

)

Coming of age: ten years of next-generation sequencing technologies

.

Nat. Rev. Genet

.,

17

,

333.

Hebenstreit

D.

et al. (

2011

)

RNA sequencing reveals two major classes of gene expression levels in metazoan cells

.

Mol. Syst. Biol

.,

7

,

497

.

Hochreiter

S.

et al. (

2010

)

FABIA: factor analysis for bicluster acquisition

.

Bioinformatics

,

26

,

1520

–

1527

.

Hwang

B.

et al. (

2018

)

Single-cell RNA sequencing technologies and bioinformatics pipelines

.

Exp. Mol. Med

.,

50

,

96

–

96

.

Keseler

I.M.

et al. (

2017

)

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12

.

Nucleic Acids Res

.,

45

,

D543

–

D550

.

Kiselev

V.Y.

et al. (

2017

)

SC3: consensus clustering of single-cell RNA-seq data

.

Nat. Methods

,

14

,

483

–

486

.

Lazzeroni

L.

,

Owen

A.

(

2002

)

Plaid models for gene expression data

.

Stat. Sin

.,

61

–

86

.

Leinonen

R.

et al. (

2011

)

The sequence read archive

.

Nucleic Acids Res

.,

39

,

D19

–

D21

.

Li

G.

et al. (

2009

)

QUBIC: a qualitative biclustering algorithm for analyses of gene expression data

.

Nucleic Acids Res

.,

37

,

e101

.

Lun

A.T.

et al. (

2016

)

Pooling across cells to normalize single-cell RNA sequencing data with many zero counts

.

Genome Biol

.,

17

,

75.

Mei

S.

et al. (

2017

)

Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse

.

Nucleic Acids Res

.,

45

,

D658

–

D662

.

Monier

B.

et al. (

2019

) IRIS-EDA: An integrated RNA-Seq interpretation system for gene expression data analysis.

PLoS Comput. Biol

.,

15

,

e1006792

.

Monk

J.

et al. (

2014

)

Optimizing genome-scale network reconstructions

.

Nat. Biotechnol

.,

32

,

447

–

452

.

Orzechowski

P.

et al. (

2018

)

EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery

.

Bioinformatics

,

34

,

3719

–

3726

.

Overbeek

R.

et al. (

2005

)

The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes

.

Nucleic Acids Res

.,

33

,

5691

–

5702

.

Prelić

A.

et al. (

2006

)

A systematic comparison and evaluation of biclustering methods for gene expression data

.

Bioinformatics

,

22

,

1122

–

1129

.

Qiu

X.

et al. (

2018

) Towards inferring causal gene regulatory networks from single cell expression measurements. bioRxiv 2018: 426981.

Saelens

W.

et al. (

2018

)

A comprehensive evaluation of module detection methods for gene expression data

.

Nat. Commun

.,

9

,

1090.

Stegle

O.

et al. (

2015

)

Computational and analytical challenges in single-cell transcriptomics

.

Nat. Rev. Genet

.,

16

,

133

–

145

.

Sun

X.

,

Nobel

A.B.

(

2008

)

On the size and recovery of submatrices of ones in a random binary matrix

.

J. Mach. Learn. Res

.,

9

,

2431

–

2453

.

Ulitsky

I.

et al. (

2010

)

Expander: from expression microarrays to networks and functions

.

Nat. Protoc

.,

5

,

303

–

322

.

Wan

C.

et al. (

2019

) LTMG: a novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data. Nucleic Acids Res, gkz655.

Wang

S.

et al. (

2012

)

Genome-scale identification of cell-wall related genes in Arabidopsis based on co-expression network analysis

.

BMC Plant Biol

.,

12

,

138

.

Xie

J.

et al. (

2018

)

It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data

.

Brief. Bioinf.

, doi: 10.1093/bib/bby014.

Zhang

Y.

et al. (

2016

)

QUBIC: a bioconductor package for qualitative biclustering analysis of gene co-expression data

.

Bioinformatics

,

33

,

450

–

452

.