- Split View
-
Views
-
Cite
Cite
Siamak Zamani Dadaneh, Mingyuan Zhou, Xiaoning Qian, Bayesian negative binomial regression for differential expression with confounding factors, Bioinformatics, Volume 34, Issue 19, October 2018, Pages 3349–3356, https://doi.org/10.1093/bioinformatics/bty330
- Share Icon Share
Abstract
Rapid adoption of high-throughput sequencing technologies has enabled better understanding of genome-wide molecular profile changes associated with phenotypic differences in biomedical studies. Often, these changes are due to multiple interacting factors. Existing methods are mostly considering differential expression across two conditions studying one main factor without considering other confounding factors. In addition, they are often coupled with essential sophisticated ad-hoc pre-processing steps such as normalization, restricting their adaptability to general experimental setups. Complex multi-factor experimental design to accurately decipher genotype-phenotype relationships signifies the need for developing effective statistical tools for genome-scale sequencing data profiled under multi-factor conditions.
We have developed a novel Bayesian negative binomial regression (BNB-R) method for the analysis of RNA sequencing (RNA-seq) count data. In particular, the natural model parameterization removes the needs for the normalization step, while the method is capable of tackling complex experimental design involving multi-variate dependence structures. Efficient Bayesian inference of model parameters is obtained by exploiting conditional conjugacy via novel data augmentation techniques. Comprehensive studies on both synthetic and real-world RNA-seq data demonstrate the superior performance of BNB-R in terms of the areas under both the receiver operating characteristic and precision-recall curves.
BNB-R is implemented in R language and is available at https://github.com/siamakz/BNBR.
Supplementary data are available at Bioinformatics online.
1 Introduction
High-throughput sequencing technologies have become the basic practice for genomic studies in life science research (Wang et al., 2009). In particular, RNA sequencing (RNA-seq), which measures the expression of each gene or genomic feature of interest by counting the number of sequence reads mapped to them, has been widely adopted for genotype-phenotype association studies. To identify the genes that are differentially expressed between different groups of samples as candidate biomarkers across different phenotypes or treatment conditions, a large number of statistical methods and tools have been developed (Anders and Huber, 2010; Dadaneh et al., 2017; Law et al., 2014; Li and Tibshirani, 2013; Love et al., 2014; Robinson et al., 2010).
While the majority of differential expression (DE) analyses are conducted with respect to a main treatment factor, the presence of potential confounding factors in real-world experiments makes it desirable to take them into account in the developed tools to derive unbiased genotype-phenotype association results. There exists a rich set of methods on addressing this problem in microarray data analysis, such as the ones developed based on linear models (Smyth, 2004, 2005).
Unlike microarray data that are based on continuous intensity measurements, RNA-seq data have unique properties. The sequencing read counts are often skewed and highly over-dispersed (Datta and Nettleton, 2014), and most of the existing data are with a small number of samples due to high data collection cost. Analyzing RNA-seq data are challenging, especially if taking into account the potential confounding effects. A number of methods extend the statistical tools developed for microarray data to analyze RNA-seq read counts. For instance, voom (Law et al., 2014) estimates the mean-variance relationship of the logarithmic transformation of counts, and after generating precision weights for each observation, exploits the empirical Bayes pipeline of limma (Smyth, 2005) for downstream analyses. Other statistical methods are specifically designed for RNA-seq count data. One of the most popular solutions in this category to account for over-dispersion due to biological variations is using the negative binomial (NB) distribution. Several DE analysis methods have employed generalized linear models (GLMs) to adapt the NB distribution to experiments with complex design. For example, two widely used methods, edgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014), both use GLMs to model the mean of the NB distribution as a log-linear function of the covariates. The gene-wise dispersion parameters are then estimated using adjusted profile likelihood and GLM coefficients are estimated using Fisher scoring iterations.
For all these existing RNA-seq analysis methods, a common pre-processing step is to normalize the sequencing counts to compensate the variations of the sequencing depths across samples (Soneson and Delorenzi, 2013). For instance, edgeR (Robinson et al., 2010) either calculates a trimmed mean of M-values between each pair of samples or uses an upper quantile of samples for normalization, while DESeq2 (Love et al., 2014) takes the median of the ratios of observed sample counts to the geometric mean across samples as a scaling factor for that specific sample. Normalizing the sequencing counts, however, makes the performance depend on whether the introduced normalization is appropriate for the structure of the RNA-seq data under study (Zyprych-Walczak et al., 2015). In addition to normalization, another common pre-processing practice is to perform the surrogate variable analysis (SVA) to identify potential unknown factors that may help model batch effects, and then incorporate these surrogate variables (SVs) as additional covariates to adjust for the consequent DE analysis (Leek et al., 2012; Leek, 2014).
Obviating the need to pre-process the data, BNP-Seq (Dadaneh et al., 2017) uses a stochastic process based approach to model the observed sample-gene random count matrix of each group in a Bayesian non-parametric framework. More specifically, BNP-Seq algorithms model the gene counts using the gamma-negative binomial process (GNBP), which mixes the NB shape parameter for each gene with the distribution of the weight of an atom of a gamma process, or beta-negative binomial process (BNBP), which mixes the NB probability parameter of each gene with the distribution of the weight of an atom of a beta process (Zhou et al., 2016). A limitation of the BNP-Seq methods is that they are designed for two-group comparative analysis, which only considers the main treatment factor, and cannot be applied to experiments with more complex design.
Given the prevalence of model uncertainty in genomic studies, a Bayesian approach is often the only course possible (Boluki et al., 2017a, b; Karbalayghareh et al., 2017). In this paper, we propose a fully Bayesian negative binomial regression (BNB-R) method for DE analysis of RNA-seq data from experiments with complex multiple-factor design. Unlike all the existing DE methods based on the NB distribution, our method does not rely on ad-hoc approximations of various kinds, such as the fact that many statistical tests are only asymptotically valid (Law et al., 2014). BNB-R quantifies the uncertainty of the estimations, and also allows for the incorporation of prior information. BNB-R directly models the influence from covariates of interest for DE analysis and therefore it does not need the SVA pre-processing step. Moreover, this new approach does not require the ad-hoc normalization step either, as the model accounts for the sequencing-depth heterogeneity of different samples automatically, similar to the mechanisms employed in the BNP-Seq algorithms.
By exploiting two novel data augmentation techniques (Zhou et al., 2012), closed-form posterior inference of BNB-R model parameters is derived in a Gibbs sampling procedure. Specifically, the dispersion parameter of NB distribution is inferred using the augmentation technique of Zhou and Carin [2015], and regression coefficients are inferred in closed-forms by utilizing the Polya-Gamma (PG) distributed auxiliary variable technique of Polson and Scott [2011], removing the need for non-trivial Metropolis–Hastings correction steps (Chib and Greenberg, 1995). Comprehensive simulation results using synthetic and real-world RNA-seq datasets demonstrate the dominance of the proposed BNB-R over existing state-of-the-art tools.
The remainder of this paper is organized as follows. In Section 2, after presenting notations and a brief review of count regression methods, we introduce the model, inference, and DE procedure of BNB-R for genotype-phenotype association with multi-factor experiment design. In Section 3, we present experimental results on both synthetic and real-world benchmark RNA-seq data and show that the proposed NB regression algorithm outperforms the state of the art. We conclude the paper in Section 4.
2 Materials and methods
2.1 Notations and backgrounds
2.1.1 Count regression
2.2 BNB-R: NB regression DE analysis
One important task of RNA-seq data analysis is to identify the genes that show significant changes in their expression levels under different phenotypes or treatment conditions. In this section, a BNB-R model is developed to discover differentially expressed genes, while taking into account biological variability, sequencing depth heterogeneity and experimental confounding factors simultaneously.
We denote the number of sequencing reads mapped to gene in sequencing sample by nkj, and model this count as a NB random variable . The dispersion parameter rj, which only depends on the sample index, can be considered as a parameter reflecting the heterogeneity of counts, due to the variation of the sequencing depths across different samples. This can be justified by the gene count expectation , which is directly proportional to rj. To establish the dependence between the gene expression and covariates (e.g. phenotypes, treatments and other potential confounding factors) in different experimental setups, we impose a linear relationship between the logit function of the probability and covariates as , where is the covariate vector for sample j and is the regression coefficient vector for gene k. In our proposed model, the covariate variables can be numerical or categorical. Consequently, the expected gene expression can be expressed as , which resembles the familiar form of NB GLM (Gardner et al., 1995). Thereby, the effects of different experimental factors on gene expression are captured through the regression coefficients . In particular, by utilizing the Bayesian framework, the posterior distributions of different combinations of the regression coefficients can be estimated via a Markov chain Monte Carlo (MCMC; Andrieu et al., 2003) inference procedure to assess how the covariates impact the expression changes.
In addition to controlling the effects of multiple experimental factors via the regression coefficients , in BNB-R, the precision parameters of the normal distributions over these coefficients are shared between all genes to borrow signal strengths, a desirable property of the model that makes it robust especially in RNA-seq data analysis with a small sample size. In the following, we present our efficient MCMC inference of model parameters, which takes advantage of two novel data augmentation techniques, leading to closed-form parameter updates.
2.2.1 Parameter inference
The Gibbs sampling steps in Equations (3)–(9) are summarized in Supplementary Algorithm 1.
2.2.2 DE analysis
3 Results
To evaluate our BNB-R DE analysis algorithm, referred to as BNB-R, we compare its performance on both synthetic and real-world benchmark data with those of edgeR (Robinson et al., 2010), DESeq2 (Love et al., 2014) and voom included in the package limma (Law et al., 2014), three widely used methods capable of handling biomedical studies with complex experimental design. As it is common in practice, before applying these methods to real-world RNA-seq data, we first perform a SVA to introduce SVs as additional covariates to model potential unwanted batch effects (Leek, 2014) and then use them to adjust for these artifacts for unbiased DE analysis. We first consider synthetic RNA-seq data in simulated experiments with multiple factors, and we demonstrate that the proposed BNB-R consistently outperforms the other approaches. We then consider the real-world benchmark RNA-seq data extracted from the SEquencing Quality Control (SEQC) project (SEQC/MAQC-III Consortium, 2014). While this dataset does not possess explicit confounding factors, the results support the outstanding performance of BNB-R for DE analysis method in general. On both synthetic and real-world RNA-seq count data, different methods are compared in terms of both the receiver operating characteristic (ROC) and precision-recall (PR) curves and the area under these curves (AUC). Finally, we test BNB-R on a RNA-seq dataset of Th17 cell differentiation to study how incorporating the temporal information can lead to more meaningful biological discoveries.
3.1 Synthetic data
3.1.1 Incorporating covariates improves DE detection
We generate synthetic RNA-seq data with the NB regression generative model. To make the synthetic data closely resemble real-world RNA-seq data, the parameters of the NB regression model are first inferred from the SEQC dataset and then synthetic sequencing counts are generated using these inferred model parameters. Throughout the simulations, we consider three experimental factors as condition, gender and dosage, where condition and gender are categorical covariates with labels {treated, untreated} and {male, female}, respectively, and dosage is a numeric covariate in the interval , generated uniformly at random for each sample.
In the first simulation setting, the expression of gene k in sample j is simulated from , where for sample , the covariate vector is . The variable xjv represents the value of covariate v for sample j. In the first simulation setup, v = 0 corresponds to the intercept term, and v = 1, 2, 3 correspond to condition, gender and dosage covariates respectively. We use a binary scheme for coding the categorical covariates and . More precisely, if no treatment has been applied to sample j, and if this sample is under treatment. Also, if sample j belongs to a female individual and if it belongs to a male.
In the first simulation setting, the gene-expression counts for a total of 5000 genes and 12 samples, with three males and three females in each of the two conditions, are generated. We evaluate the performance of BNB-R based on this synthetic data, and compare it to edgeR, DESeq2 and voom. For BNB-R, model parameters are inferred via Gibbs sampling, where in each run of the algorithm, we collect 1000 MCMC samples after 1000 burn-in iterations and then rank the genes using the symmetric KL-divergence measure developed in Section 2.2.2. For edgeR, DESeq2 and voom, we follow the standard analysis pipelines and rank the genes using the computed P-values.
Panels in the top row of Figure 1 illustrate the ROC and PR curves of BNB-R, edgeR, DESeq2 and voom under the first simulation setting, when all covariates are employed. The AUCs of these curves are presented in Table 1. The panels in the bottom row of Figure 1 represent the performance of BNB-R, edgeR, DESeq2 and voom on the synthetic data when using the condition covariate as the single experimental factor, while neglecting all the other covariates. Table 2 provides the AUCs of the curves in the latter scenario. Methods that exploit covariates’ information clearly outperform the ones that only rely on the condition factor to identify differentially expressed genes, in terms of both the ROC and PR curves. This observation demonstrates the benefit of incorporating available experimental design information to better capture the heterogeneity of gene expression counts. In particular, BNB-R with covariates has the best performance with a significant margin over all the other algorithms. This may be explained by the hierarchical structure of BNB-R, where borrowing information from all genes to estimate precision parameters makes it robust in modeling overdispersed count data. In addition, we have also applied the BNBP and GNBP methods (Dadaneh et al., 2017), which use only the condition factor to determine DE, to the synthetic data in this simulation (results not included in Fig. 1 to not overwhelm it but it can be found in the Supplementary Material). These two methods also perform closely to the algorithms exploiting only the condition factor, confirming the observation that integrating additional covariates into a DE model can achieve more accurate and robust DE analysis for genotype-phenotype association.
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7952 | 0.3922 |
edgeR-GLM | 0.7563 | 0.3622 |
DESeq2 | 0.7533 | 0.3587 |
voom | 0.7450 | 0.3499 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7952 | 0.3922 |
edgeR-GLM | 0.7563 | 0.3622 |
DESeq2 | 0.7533 | 0.3587 |
voom | 0.7450 | 0.3499 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7952 | 0.3922 |
edgeR-GLM | 0.7563 | 0.3622 |
DESeq2 | 0.7533 | 0.3587 |
voom | 0.7450 | 0.3499 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7952 | 0.3922 |
edgeR-GLM | 0.7563 | 0.3622 |
DESeq2 | 0.7533 | 0.3587 |
voom | 0.7450 | 0.3499 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7343 | 0.3188 |
edgeR-GLM | 0.7302 | 0.3087 |
DESeq2 | 0.7193 | 0.2999 |
voom | 0.6832 | 0.2617 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7343 | 0.3188 |
edgeR-GLM | 0.7302 | 0.3087 |
DESeq2 | 0.7193 | 0.2999 |
voom | 0.6832 | 0.2617 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7343 | 0.3188 |
edgeR-GLM | 0.7302 | 0.3087 |
DESeq2 | 0.7193 | 0.2999 |
voom | 0.6832 | 0.2617 |
Method . | AUC-ROC . | AUC-PR . |
---|---|---|
BNB-R | 0.7343 | 0.3188 |
edgeR-GLM | 0.7302 | 0.3087 |
DESeq2 | 0.7193 | 0.2999 |
voom | 0.6832 | 0.2617 |
3.1.2 Sensitivity to experimental design
To assess the sensitivity of BNB-R to the experimental design assumption employed in the DE analysis model, we consider a simulation setting with a more complex combination of experimental factors, including an interaction term between the gender and condition covariates. Similar to the previous simulation, the expression of gene k in sample j is drawn from , where for sample , the covariate vector is . In this simulation setup, the elements xjv in the covariate vector for correspond to intercept, gender, condition, dosage and the interaction between gender and condition, respectively. We employ the same binary coding scheme for the categorical covariates as those used in the previous simulation setting. Thus, for example, if sample j has been under treatment and belongs to a male individual, and otherwise. We also generate the dosage covariates from a uniform distribution in interval .
RNA-seq counts for a total of 5000 genes and 12 samples, with three males and three females in each treatment condition, are generated. In this synthetic dataset, 516 genes are differentially expressed across treatment conditions for females and 653 genes are differentially expressed for males. First, we evaluate the performance of BNB-R, edgeR, DESeq2 and voom on this synthetic data, assuming that the true design matrix used for data generation is provided for all algorithms. Differentially expressed genes are identified using the same protocol as described in the previous sub-section. The top and middle panels of Figure 2 illustrate the ROC and PR curves for the detection of differentially expressed genes across conditions in males and females, respectively. BNB-R clearly outperforms the other methods in terms of both ROC and PR, for gender-specific DE analyses.
Next, instead of assuming knowing the true underlying data generation mechanism, we exclude the interaction term used for data generation for DE analysis with different methods and use the covariate vector for sample j, where the elements xjv for represent the same covariates as in the data generation procedure. As a consequence of using this design, detected differentially expressed genes are not specific to a gender. Hence to evaluate the performance of BNB-R, edgeR, DESeq2 and voom when using this design matrix, we need to compare the detected genes to those that are truly differentially expressed across conditions independent of gender. In this simulation, there are 400 genes that are differentially expressed across the treatment conditions for both male and female groups. We consider these genes as truly differentially expressed independent of gender, and the rest of the genes as not differentially expressed. The ROC and PR curves plotted based on this setting are shown in the bottom row of Figure 2. In this case, BNB-R again exhibits the best performance in terms of the ROC and PR curves, confirming its superior performance even if the true mechanism of data generation is not fully known.
3.2 SEQC benchmark
In this section, we evaluate the performance of the proposed BNB-R method using SEQC benchmark (SEQC/MAQC-III Consortium, 2014). Specifically, we use the RNA-seq data from BGI provided in the R package SEQC on Bioconductor (Gentleman et al., 2004), containing the counts for about 26 000 genes. In our experiments, we employ sample groups A and B, which are derived from the Agilent’s Universal Human Reference RNA and Life Technologies’ Human Brain Reference RNA cell lines, respectively. We collect the counts from the first flow cells of the sequencing machines on five replicates for each group.
To evaluate the DE analysis methods, we note that in the SEQC project, the same RNA samples for a comprehensive group of control genes are analyzed based on quantitative Reverse Transcription Polymerase Chain Reaction (qRT-PCR) using TaqMan assays (Joyce, 2002), which is referred as the TaqMan benchmark data (Maqc Consortium and Others, 2006; SEQC/MAQC-III Consortium, 2014). More precisely, for sample groups A and B, the expression intensity values of 955 selected control genes have been derived in the TaqMan qPT-PCR analysis for sequencing benchmarking. In the absence of the knowledge on the genes that are truly differentially expressed across different conditions, we follow the approach in Rapaport et al. [2013] to threshold the qRT-PCR expression ratios across different conditions at a certain value to define the ground-truth set of differentially expressed genes. Based on these 955 genes in the TaqMan data, we evaluate the performance of different DE analysis pipelines.
Before applying edgeR, DESeq2 and voom to this dataset, we first perform a SVA to adjust for un-modeled artifacts such as batch effects (Leek, 2014). More precisely, we use svaseq function of R package sva (Leek et al., 2012) with two introduced SVs. In the downstream DE analysis, we use these two SVs as extra confounding factors for edgeR, DESeq2 and voom. Our experiment shows that incorporation of the SVs slightly improves the performance of these methods (Supplementary Fig. S5). Note that although for BNB-R no explicit experimental factor other than a sample’s group is used in this experiment, our results suggest the performance of the proposed BNB-R DE analysis method is superior to those of stochastic processes inspired models in BNP-Seq, all of which achieve better ROC and PR curves than edgeR, DESeq2 and voom in conjunction with SVA, as described in detail below.
While truly differentially expressed genes are unknown for the SEQC RNA-seq data, we rely on the qRT-PCR expression intensity of the 955 genes in the TaqMan data and set different cut-offs for the binary logarithm (log2) of the qRT-PCR expression ratio to define ‘truly’ differentially expressed genes. We increase this log2 cut-off value gradually from 0.5 to 2, and calculate both AUC-ROC and AUC-PR. For the analysis of the dataset BGI on a single cluster node with Intel Xeon 2.5 GHz E5-2670 v2 processor, it took around 2 h for BNB-R method with 2000 MCMC iterations. The posterior distributions of the regression coefficients are used to assess DE. In addition to the methods used for synthetic data, we also include BNBP and GNBP (Dadaneh et al., 2017), both of which are generative models designed specifically for a single factor setting. As shown in the bottom panels of Figure 3, the BNB-R method outperforms all the other methods in both ROC and PR analyses, followed very closely by BNBP and GNBP. Note that the performance gains of the three generative models over the other methods become more significant as one increase the log2 cut-off for the qRT-PCR expression ratio, which reduces the number of genes that are considered as truly differentially expressed.
To further investigate the experimental results, we fix the log2 cut-off value at 2 for the qRT-PCR expression intensity of the 955 genes in the TaqMan data, and illustrate the ROC and PR curves for the BGI dataset in the top panels of Figure 3. It is clear the BNB-R method along with GNBP and BNBP not only have higher AUC-ROC and AUC-PR, but also outperform edgeR, DESeq2, and voom used together with SVA in almost all regions of the ROC and PR curves.
3.3 Case study: Th17 cell differentiation
To further illustrate its potential biological significance when integrating other covariates in BNB-R for biomarker identification applications, we provide a case study with our BNB-R method on a RNA-seq dataset of early human T helper 17 (Th17) cell differentiations and T-cell activation (Th0). Th17 cells play an essential role in the pathogenesis of auto-immune and inflammatory diseases, and have been the focus of many recent research efforts (Tuomela et al., 2012). In particular, the knowledge of the early phase of Th17 differentiation helps to gain insight into the process of signal propagation through various pathways and gene regulatory networks (Äijö et al., 2014). We use the RNA-seq dataset of Tuomela et al. [2016] and Chan et al. [2016], which contains gene expression profiling of Th0 and Th17 cells at the following five time points: 0 h, 12 h, 24 h, 48 h and 72 h after cell activation and stimulation, with three biological replicates at each time point. The data is obtained from Gene Expression Omnibus, with accession GSE52260.
The design matrix of the analysis is formed from an additive model formula as in our simulation studies, accounting for condition and time point factors. More precisely, for sample the covariate vector is , where is the intercept, is the cell category (i.e. Th0 versus Th17) and is the sample time point. We apply BNB-R to identify differentially expressed genes, where after 1000 burn-in iterations, 1000 posterior samples are collected to calculate the symmetric KL-divergence between the posterior distributions of and to rank the genes. The run-time of BNB-R with 2000 MCMC sampling iterations for the Th17 dataset on the cluster node with configuration provided in Section 3.2 is around 6 h.
We consider the top 100 genes ranked by the symmetric KL-divergence and perform Gene Ontology (GO) analysis using LAGO (Boyle et al., 2004) software (available at http://go.princeton.edu/cgi-bin/LAGO), focusing on the ontology of biological processes. The top five significantly enriched GO terms discovered by LAGO, with their corresponding adjusted P-values shown in Table 3, illustrating the association between the differentially expressed genes and immune system activation and response to stimulus.
GO-ID . | Term . | P-value . |
---|---|---|
GO: 0002376 | Immune system process | 4.74695e–13 |
GO: 0046649 | lymphocyte activation | 3.33415e–11 |
GO: 0006955 | Immune response | 3.90728e–11 |
GO: 0045321 | Leukocyte activation | 1.6007e–10 |
GO: 0050896 | Response to stimulus | 1.89798e–10 |
GO-ID . | Term . | P-value . |
---|---|---|
GO: 0002376 | Immune system process | 4.74695e–13 |
GO: 0046649 | lymphocyte activation | 3.33415e–11 |
GO: 0006955 | Immune response | 3.90728e–11 |
GO: 0045321 | Leukocyte activation | 1.6007e–10 |
GO: 0050896 | Response to stimulus | 1.89798e–10 |
GO-ID . | Term . | P-value . |
---|---|---|
GO: 0002376 | Immune system process | 4.74695e–13 |
GO: 0046649 | lymphocyte activation | 3.33415e–11 |
GO: 0006955 | Immune response | 3.90728e–11 |
GO: 0045321 | Leukocyte activation | 1.6007e–10 |
GO: 0050896 | Response to stimulus | 1.89798e–10 |
GO-ID . | Term . | P-value . |
---|---|---|
GO: 0002376 | Immune system process | 4.74695e–13 |
GO: 0046649 | lymphocyte activation | 3.33415e–11 |
GO: 0006955 | Immune response | 3.90728e–11 |
GO: 0045321 | Leukocyte activation | 1.6007e–10 |
GO: 0050896 | Response to stimulus | 1.89798e–10 |
In a closer look at the results, the top differentially expressed gene identified by BNB-R is gene COL6A3, an important organizer of the extracellular matrix proteins, contributing to adipose tissue inflammation (Pasarica et al., 2009). Also, the up-regulation of COL6A3 gene in Th17-polarizing cells is confirmed by microarray and RT-PCR assays in Tuomela et al. [2012]. The third ranked gene, Leukemia Inhibitory Factor (LIF), belongs to the IL-6 family of cytokines and resides within the core regulatory circuitry of T cells (Metcalfe, 2011). The fourth gene, RORC, is a Th17 lineage-specific transcription factor (Diveu et al., 2009), whose DE is also verified in the microarray study in Tuomela et al. [2012]. In addition, Western blotting results in Tuomela et al. [2012] show that genes BATF, CTSL1, VDR, KDSR, ATP1B1 and BASP1 were highly expressed in Th17 cells compared with their expression in Th0 cells at various time points during the first 3 days of polarization. The rankings of these genes obtained by our BNB-R are 11, 13, 15, 20, 24 and 41, respectively, which confirms the significance of their expression changes. Moreover, microarray studies of Tuomela et al. [2012] found out the up-regulation of CXCR5 and LMNA in CD4 T cells cultured under Th17-polarizing conditions compared with Th0 cells, and flow cytometric detection of CD52 at 48 h and 72 h showed down-regulation of this protein in CD4 T cells cultured under Th17-polarizing conditions. These genes are ranked 14, 44 and 60, respectively, in our DE analysis, supporting their potential roles in Th17 cells differentiation process.
Next, to examine how incorporating the time course information changes the DE analysis results, we apply BNB-R on the Th17 dataset, considering only the condition factor but ignoring the temporal information of different samples. Although out of the top 100 differentially expressed genes, there are 84 genes common between these two differential analysis results, the GO analysis, when the time factor is neglected, results in a total of 36 significantly enriched terms with known annotations, which is less than 40 annotated enriched terms when including the time factor. Some of the GO terms missed include cytokine-mediated signaling pathway, positive regulation of JAK-STAT, STAT cascades and T cell activation involved in immune response, which are all related to the immune system and can potentially lead to new hypotheses. In addition, BNB-R considering the time factor leads to smaller P-values overall in comparison to the analysis without time information, and hence more significantly enriched GO terms. For instance, the adjusted P-value obtained for T cell differentiation by the former analysis is , while the latter returns .
4 Conclusions
We propose a BNB-R method for DE analysis of sequencing count data. On one hand, BNB-R is capable of handling complex experiments involving multiple factors. On the other hand, it does not require an ad-hoc normalization pre-processing step. By taking advantage of novel data augmentation techniques, BNB-R possesses efficient closed-form Gibbs sampling update equations and ranks differentially expressed genes based on a symmetric KL-divergence measure, exploiting the full posterior distributions of the model parameters. Experimental results on both synthetic and real-world RNA-seq data demonstrate the state-of-the-art performance of BNB-R in DE analysis of RNA-seq data.
Acknowledgements
This project was partially supported by the USDA-SCRI competitive grant 2017-51181-26834. XQ was partially supported by Award CCF-1553281 from the National Science Foundation and the USDA NIFA Award 06-505570-01006. The authors thank Texas A&M High Performance Research Computing and Texas Advanced Computing Center for providing computational resources to perform experiments in this paper.
Funding
This work was supported by Award CCF-1553281 from the National Science Foundation and the USDA NIFA Award 06-505570-01006.
Conflict of Interest: none declared.
References