-
PDF
- Split View
-
Views
-
Cite
Cite
Koji Iwayama, Yuri Aisaka, Natsumaro Kutsuna, Atsushi J Nagano, FIT: statistical modeling tool for transcriptome dynamics under fluctuating field conditions, Bioinformatics, Volume 33, Issue 11, 1 June 2017, Pages 1672–1680, https://doi.org/10.1093/bioinformatics/btx049
Close -
Share
Abstract
Considerable attention has been given to the quantification of environmental effects on organisms. In natural conditions, environmental factors are continuously changing in a complex manner. To reveal the effects of such environmental variations on organisms, transcriptome data in field environments have been collected and analyzed. Nagano et al. proposed a model that describes the relationship between transcriptomic variation and environmental conditions and demonstrated the capability to predict transcriptome variation in rice plants. However, the computational cost of parameter optimization has prevented its wide application.
We propose a new statistical model and efficient parameter optimization based on the previous study. We developed and released FIT, an R package that offers functions for parameter optimization and transcriptome prediction. The proposed method achieves comparable or better prediction performance within a shorter computational time than the previous method. The package will facilitate the study of the environmental effects on transcriptomic variation in field conditions.
Freely available from CRAN (https://cran.r-project.org/web/packages/FIT/).
Supplementary data are available at Bioinformatics online
1 Introduction
Variation in environmental factors affects various aspects of organisms. The comprehensive quantification of environmental effects on organisms is an emerging problem. For example, gene-environment interactions have been studied to explain the heritability of complex diseases that could not be explained by conventional genome-wide association studies (Thomas, 2010). In addition, concerns about organisms’ adaptation and response to changes in environmental conditions has been growing because of climate change (Ahuja et al., 2010; Hoffmann and Sgrò, 2011; Nicotra et al., 2010; Weston et al., 2008; Xu, 2016).
There is a trade-off between precise control of and minimal intervention to the environment (Jones, 2013). Any results obtained in the field more accurately reflect a plant’s environmental response in natural conditions. However, these results are difficult to interpret, because environmental factors are continuously changing in a complex manner and exhibit diurnal oscillations, seasonal changes, and long-term trends. On the other hand, experiments conducted in controlled conditions provide results that are more precise but not necessarily reflective of the plant’s behavior in natural conditions. For example, photosynthetic responses to fluctuating, environments like natural conditions differ from those to controlled, constant environmental conditions (Yamori, 2016). At the molecular level, traits observed in laboratory conditions are not always consistent with those observed in natural conditions (Malmberg et al., 2005; Mishra et al., 2012; Weinig et al., 2002). Further, even if similar physiological plasticity is observed, the molecular responses of plants to drought stress can vary between the controlled conditions of greenhouses and the uncontrolled conditions of the field (Lovell et al., 2016).
To reveal the effects of such fluctuating environments, transcriptome data in field environments have been collected (Alvarez et al., 2015). In particular, the transcriptomic variation of plants in fields has been studied (Hayes et al., 2010; Izawa et al., 2011; Nagano et al., 2012; Plessis et al., 2015; Sato et al., 2011; Richards et al., 2012). To bridge the gap between natural and laboratory conditions, Nagano et al. (2012) proposed a model to relate the transcriptomic variation of plants in a field to environmental conditions and applied it to large-scale transcriptome data of samples collected in a field. They demonstrated that the model can precisely predict transcriptome variations using meteorological data. However, the vast computational cost required to optimize parameters and select the best model has restricted the use of the model.
To accelerate transcriptomic studies in fields, we propose a new statistical model based on that proposed by Nagano et al. (2012). The previous model contains some distinct functions representing diurnal changes in sensitivity and the characteristics of responses to environmental stimuli. The new model reduces the computational cost by unifying these functions. We also propose a cluster-based approach for optimizing parameters, in which we reuse the optimization result of one gene for other genes in the same cluster. This approach significantly reduces the cost of searching for the optimum parameter value. Both the previous and new models are intended for application to microarray data; however, the use of RNA-Seq (Wang et al., 2009) technology is rapidly increasing. To apply our model to RNA-Seq data, we incorporated precision weights for normalized log-counts into the model as the voom method (Law et al., 2014). In addition, we developed and released FIT, an R package that provides efficient parameter optimization, model selection, and transcriptome prediction of unsequenced samples. This is the first tool to integrate the transcriptome data from field samples and meteorological data by modeling their relation.
2 Methods
2.1 Previous model
Parameters are optimized to maximize the likelihood by using the Nelder–Mead algorithm (Nelder and Mead, 1965). Because the likelihood function has multiple local maxima, a grid search is performed before the optimization. Regression coefficients and the phase of the circadian clock are optimized using the nonlinear least-squares method and the likelihood is calculated on each grid point of the remaining parameters. The optimization by the Nelder–Mead algorithm starts from the parameter values of the grid point with the largest likelihood. The response function to environmental stimuli with the largest likelihood is selected from four response model types (, and ). Because different gate function types have different degrees of freedom, the optimum gate function type is selected by performing likelihood-ratio tests. It can be considered that some variables in Equation (1) may contribute to neither the explanation nor the prediction for the expression patterns of many genes. Thus, a parameter reduction process that repeats parameter optimizations and likelihood-ratio tests is performed to obtain the simplest models.
2.2 New model
Although the previous model achieved a detailed description of transcriptome fluctuations in complex environments, the parameter optimization for all genes is computationally expensive. The computational cost of the optimization is due mainly to the existence of different model types, i.e., different types of response models and gate functions, and variable selection. We need to optimize and compare each model repeatedly. Therefore, to reduce computational cost, we unified the different model types.
Response functions (a) and gate functions (b) for various parameter values
Response functions (a) and gate functions (b) for various parameter values
2.3 Cluster-based optimization for computational time reduction
For further computational reduction, we omit the grid search (which is the most time-consuming step in the optimization) for almost all genes. Similar models would be optimum for genes exhibiting similar expression patterns. Hence, it is expected that the optimum model for one representative gene of a cluster can be used as the initial value of the Nelder–Mead optimization for genes, the expression pattern of which is similar to the pattern of the representative gene of the cluster.
Before optimization, we perform clustering of expression patterns using the affinity-propagation method (Frey and Dueck, 2007), which automatically provides an appropriate number of clusters and their exemplars. For each cluster, we optimize the parameter values for the exemplar of the cluster using the procedure mentioned above, and use the optimized value as the initial value for the Nelder–Mead method for other genes in the cluster.
2.4 Application to RNA-seq data
3 Datasets
3.1 Meteorological data
We used meteorological data measured every at a meteorological station (Tsukuba (Tateno), attitude ) of the Japan Meteorological Agency. The data consists of wind intensity = (), air temperature (), relative humidity (%), atmospheric pressure (), precipitation (), and global radiation ().
3.2 Synthetic gene expression data
In order to confirm that correct models can be selected by the proposed method, we synthetically generated RNA-Seq data assuming the following situation. Rice plants were transplanted into a paddy field on June 1. Samples were collected every week from June 12 to September 18, 2008 for . In order to verify influence of sampling design on model selection and parameter optimization, five types of sampling were considered: one sample at each time at intervals of , two samples at each time at intervals of , three samples at each time at intervals of , four samples at each time at intervals of , and six samples at each time at intervals of . The total number of samples for each type was 180. For evaluation of the predictive capability, we also computed gene expressions of samples assumed to be collected every week from June 12 to September 18, 2009 for at intervals of .
To decide dispersion parameters, we estimated the mean-dispersion trend from the Pickrell real RNA-Seq data set (Pickrell et al., 2010), which is available from the tweeDESeqCountData R package (http://www.creal.cat/jrgonzalez/software.htm), using the method implemented in the edgeR package (Zhou et al., 2014). From the estimated trend, we decided the value of the dispersion parameter corresponding to the average expression value for each gene αj. Because we use the log-count per million values for analysis, 9969 constantly expressed genes () were also considered in order to suppress the influence of the variation of the total read counts. The dispersion parameters of these constantly expressed genes were also decided according to the estimated mean-dispersion trend.
3.3 Real gene expression data
We also analyzed the same data as Nagano et al. (2012), which consisted of microarray data from 461 samples of mature leaves of rice plants in a paddy field at Tsukuba collected in 2008, and 108 samples collected in 2009. We used data from the samples collected in 2008 for model selection and parameter fitting, and those from 2009 for evaluation of the model’s predictive capability. The samples collected in 2008 can be categorized into six groups: of samples collected at intervals of starting at 7:00 am on August 12, 2008 (8 samples at each time × 13 time points samples); of samples collected at intervals of starting at 10:00 am on June 5, June 19, July 3, July 17, August 7, August 14, August 21, August 28, and September 11, 2008 (225 samples); samples collected at noon every other week from June 3 to September 23, 2008 (three samples at each time × 17 time points samples); samples collected at 12:00 am every other week from June 4 to September 24, 2008 (2 samples at each time × 17 time points samples); samples collected from 5:00 pm to 8:00 pm at intervals of on August 7, 2008 (19 samples); samples collected from 3:50 am to 6:00 am at intervals of on August 8, 2008 (two samples at each time × 14 time points samples). The samples collected in 2009 can be categorized into six groups: of samples collected at intervals of starting at noon on August 10, 2009 (two samples at each time × 9 time points samples); of samples collected at intervals of starting at 7:00 am on August 24, 2009 (6 samples at each time × 13 time points samples); two samples collected at noon on August 31, 2009; two samples collected at 6:00 pm on August 31, 2009; four samples collected at 11:00 am on October 8, 2009; four samples collected at 11:00 am on October 9, 2009. This information is also summarized in Supplementary Tables S2 and S3. We extracted 17 616 genes having log2-transformed signals larger than 5 in 80% of the samples from 2008.
4 Results
4.1 Synthetic gene expression data
We used synthetic RNA-Seq data collected in 2008 for parameter optimization and model selection; we used synthetic RNA-Seq data collected in 2009 to evaluate the prediction capabilities of the optimized model. Mean squared errors (MSE) and correlation coefficients between the predicted and synthetic gene expressions are plotted in Supplementary Figures S1 and S2. These figures indicate that sampling at intervals of provides better predictive performance than sampling at longer intervals. It can be considered that sampling at effectively captures diurnal variation of gene expression patterns. Although the predictive performance of sampling at intervals was comparable to that of sampling at intervals, longer intervals are preferable in terms of sampling labor. Hence, only results of sampling at intervals are shown below.
The selected models for variably expressed genes are summarized in Supplementary Table S4. This results indicate that the correct models were selected for 29 out of 31 genes. It is important to note that absolute values of coefficients do not always correspond to those of the true models in Supplementary Table S1, because of the normalization and nonlinear transformation of inputs. Further, the constant model, where all coefficients are zero, was correctly selected for 8441 out of 9969 constantly expressed genes. These results indicate that the variable selection through L1 regularization worked well.
4.2 Real gene expression data
We optimized the parameters of the model using the microarray data of samples collected in 2008 by normal optimization, in which we performed both the grid search and the nonlinear optimization for all genes, and cluster-based optimization (section 2.3). We also applied cluster-based optimization to pseudo RNA-Seq data generated from the same microarray data. We performed all optimizations on an Amazon EC2 m3.medium instance (one Intel Xeon E5-2670 v2 processor, memory). The normal optimization took per gene on average. If we performed the normal optimization using a single m3.medium instance, we could optimize parameters for 17 616 genes within 17 days. The affinity-propagation method yielded 500 clusters and their exemplars. After normal optimization for these exemplars, we performed nonlinear optimization for other genes using the Nelder–Mead method. The nonlinear optimization took per gene. We could optimize parameters for 17 616 genes within 5 days by using a single instance. In the previous study (Nagano et al., 2012), the parameter optimization took about 30 days using a high-performance cluster computer. Even after taking into consideration advances in computer technology, the new model improved computational efficiency.
Comparison of coefficients of determination (R2). Distributions of R2 from the previous model are compared to those of the new model with normal optimization (a: "normal"), cluster-based optimization (b: "cluster"), and cluster-based optimization with pseudo RNA-Seq data (c: "RNA-Seq"). (d) Boxplot of R2 values
Comparison of coefficients of determination (R2). Distributions of R2 from the previous model are compared to those of the new model with normal optimization (a: "normal"), cluster-based optimization (b: "cluster"), and cluster-based optimization with pseudo RNA-Seq data (c: "RNA-Seq"). (d) Boxplot of R2 values
To assess the new model’s capability to predict gene expression, we predicted the microarray data or pseudo RNA-Seq data of samples collected in 2009. We compared gene-wise mean squared errors across samples. The mean squared errors of the new model with normal optimization, cluster-based optimization, and cluster-based optimization using pseudo RNA-Seq data were smaller than those of the previous model for 12 125, 12 209 and 9305 out of 17 616 genes, respectively. Hence, the new model yields better predictions than the previous model, regardless of whether the parameters are optimized by the normal or the cluster-based methods and whether the data are measured using a microarray or RNA-Seq. We also calculated gene-wise correlation across samples and sample-wise correlation across genes between prediction and observation. The gene-wise correlation is illustrated in Figure 3. The distributions of the correlation coefficients of the new model were comparable to that of the previous model, regardless of the methods of measurement and parameter optimization. Further, the sample-wise correlations of all three tests were higher than 0.9 for most samples and improved in comparison to those of the previous model (Fig. 4).
Comparison of gene-wise correlation across samples. Each panel is illustrated as in Fig. 2. The null model, in which the expression is the constant, was selected for some genes in the previous study. Because correlation coefficients for such genes cannot be defined, the total number of genes contained in the plots of the previous study (dashed) is smaller than that of the new model (solid)
Comparison of gene-wise correlation across samples. Each panel is illustrated as in Fig. 2. The null model, in which the expression is the constant, was selected for some genes in the previous study. Because correlation coefficients for such genes cannot be defined, the total number of genes contained in the plots of the previous study (dashed) is smaller than that of the new model (solid)
Comparison of sample-wise correlation across genes. Each panel is illustrated as in Figure 2
Comparison of sample-wise correlation across genes. Each panel is illustrated as in Figure 2
Next, we compared the models selected by the normal optimization to those obtained in the previous study (Nagano et al., 2012). Table 1 shows the numbers of genes for which each environmental factor was selected in the previous study and the normal optimization of the new model. Each row and column shows the number of genes for which the corresponding environmental factor was selected in the previous study and the normal optimization of the new model, respectively. The bottom row represents the model without the environmental effects. In the normal optimization, if the weight for the environmental effects is zero, the gene is counted as "new none." The bold-faced numbers in the diagonal cells of the table indicate the numbers of genes for which the same environmental factor was selected in the previous study and the normal optimization of the new model. The same environmental factor was selected in both models for 4523 (sum of the first six diagonal cells) out of 7496 genes, for which the selected models contain the response to any environmental factors in both the previous and new studies. Meanwhile, in the previous study, the model without the environmental effects was selected for 10 120 genes (sum of the bottom row); it was selected by the normal optimization of the new model for only 5714 genes (sum of the rightmost column). The previous model tends to select the model without the environmental effects about twice as frequently as the new model. This difference is probably due to differences in the model selection, i.e., likelihood ratio tests and group lasso.
Comparison of selected models
| . | New wind . | New temperature . | New humidity . | New atmosphere . | New radiation . | New precipitation . | New none . |
|---|---|---|---|---|---|---|---|
| Previous wind | 56 | 20 | 8 | 3 | 7 | 4 | 113 |
| Previous temperature | 36 | 3762 | 99 | 68 | 92 | 15 | 877 |
| Previous humidity | 10 | 49 | 340 | 7 | 27 | 14 | 237 |
| Previous atmosphere | 6 | 18 | 6 | 38 | 7 | 6 | 78 |
| Previous radiation | 26 | 238 | 91 | 49 | 309 | 20 | 647 |
| Previous precipitation | 4 | 8 | 13 | 5 | 10 | 18 | 55 |
| Previous none | 233 | 4434 | 641 | 343 | 643 | 119 | 3707 |
| . | New wind . | New temperature . | New humidity . | New atmosphere . | New radiation . | New precipitation . | New none . |
|---|---|---|---|---|---|---|---|
| Previous wind | 56 | 20 | 8 | 3 | 7 | 4 | 113 |
| Previous temperature | 36 | 3762 | 99 | 68 | 92 | 15 | 877 |
| Previous humidity | 10 | 49 | 340 | 7 | 27 | 14 | 237 |
| Previous atmosphere | 6 | 18 | 6 | 38 | 7 | 6 | 78 |
| Previous radiation | 26 | 238 | 91 | 49 | 309 | 20 | 647 |
| Previous precipitation | 4 | 8 | 13 | 5 | 10 | 18 | 55 |
| Previous none | 233 | 4434 | 641 | 343 | 643 | 119 | 3707 |
Comparison of selected models
| . | New wind . | New temperature . | New humidity . | New atmosphere . | New radiation . | New precipitation . | New none . |
|---|---|---|---|---|---|---|---|
| Previous wind | 56 | 20 | 8 | 3 | 7 | 4 | 113 |
| Previous temperature | 36 | 3762 | 99 | 68 | 92 | 15 | 877 |
| Previous humidity | 10 | 49 | 340 | 7 | 27 | 14 | 237 |
| Previous atmosphere | 6 | 18 | 6 | 38 | 7 | 6 | 78 |
| Previous radiation | 26 | 238 | 91 | 49 | 309 | 20 | 647 |
| Previous precipitation | 4 | 8 | 13 | 5 | 10 | 18 | 55 |
| Previous none | 233 | 4434 | 641 | 343 | 643 | 119 | 3707 |
| . | New wind . | New temperature . | New humidity . | New atmosphere . | New radiation . | New precipitation . | New none . |
|---|---|---|---|---|---|---|---|
| Previous wind | 56 | 20 | 8 | 3 | 7 | 4 | 113 |
| Previous temperature | 36 | 3762 | 99 | 68 | 92 | 15 | 877 |
| Previous humidity | 10 | 49 | 340 | 7 | 27 | 14 | 237 |
| Previous atmosphere | 6 | 18 | 6 | 38 | 7 | 6 | 78 |
| Previous radiation | 26 | 238 | 91 | 49 | 309 | 20 | 647 |
| Previous precipitation | 4 | 8 | 13 | 5 | 10 | 18 | 55 |
| Previous none | 233 | 4434 | 641 | 343 | 643 | 119 | 3707 |
We investigated the associations between gene annotations and time-of-day characteristics of expressions to assess the consistency between model parameters and biological knowledge. For each gene ontology term with which more than 10 genes are annotated, we compared the distribution of clock phases of genes annotated with the term to its background distribution, which is that of genes not annotated with the term. The left-hand panel in Figure 5 indicates that the distribution of of genes annotated with protein serine/threonine kinase activity (GO:0004674) was significantly different from its background distribution (, Watson–Wheeler test, Bonferroni correction). The values of of many genes annotated with the term were distributed from before to after midnight. Because there were only four genes annotated with flavonoid biosynthetic process (GO:0009813), we did not perform the statistical test. However, the distribution of of those genes was clearly concentrated in the early morning (right-hand panel in Fig. 5). These results are consistent with those of the previous study and a laboratory study of Arabidopsis in which the genes implicated in phenylpropanoid biosynthesis showed expressional peaks before subjective dawn (Harmer et al., 2000). Figure 6 shows the distributions of of genes annotated with rRNA processing (GO:0006364), small ribosomal subunit (GO:0015935), ribosome (GO:0005840), large ribosomal subunit (GO:0015934), and aminoacyl-tRNA ligase activity (GO:0004812). Although the distribution of genes annotated with rRNA processing (GO:0006364) was not significantly different from its background distribution (P = 0.0659; Watson–Wheeler test, Bonferroni correction), the values of seem to concentrate in the afternoon. Other distributions (GO:0015935, GO:0005840, GO:0015934, and GO:0004812) were significantly different from their background distributions (, P = 0.0168, and , respectively) and the peaks of the distributions were shifted from afternoon to evening in the same order as their biological order in rRNA processing. These results are also consistent with the previous results, which implies that the entrained circadian clock in the field controls the order of the acceleration of translation during the same time period.
Distributions of clock phases of genes annotated with protein serine/threonine kinase activity (GO:0004674) and flavonoid biosynthetic process (GO:0009813). Fractions of genes annotated with the term and not annotated with the term are shown in red and gray, respectively
Distributions of clock phases of genes annotated with protein serine/threonine kinase activity (GO:0004674) and flavonoid biosynthetic process (GO:0009813). Fractions of genes annotated with the term and not annotated with the term are shown in red and gray, respectively
Distributions of clock phases of genes annotated with GO:0006364 (rRNA processing), GO:0015935 (small ribosomal subunit), GO:0005840 (ribosome), GO:0015934 (large ribosomal subunit), and GO:0004812 (aminoacyl-tRNA ligase activity), shown as in Figure 5
Distributions of clock phases of genes annotated with GO:0006364 (rRNA processing), GO:0015935 (small ribosomal subunit), GO:0005840 (ribosome), GO:0015934 (large ribosomal subunit), and GO:0004812 (aminoacyl-tRNA ligase activity), shown as in Figure 5
We can also use model parameters to form a hypothesis about biological processes occurring in a field. The associations between gene annotations and regression coefficients were also investigated. Before the investigation, we divided the regression coefficients for each gene by the standard deviation of the observed expression of the gene to normalize coefficients. The absolute values of the normalized coefficients of genotype for genes annotated with photosynthesis (GO:0015979), thylakoid (GO:0009579), and photosysthetic membrane (GO:0034357) were significantly larger than the background distributions of those values (, and , respectively, Wilcoxon rank sum test, Bonferroni correction). This result implies that the difference in genotype affects photosynthesis. Further, the absolute values of normalized coefficients of the response for genes, for which temperature was selected as an environmental factor, indicated a significant association with phosphorus metabolic process (GO:0006793) (, Wilcoxon rank sum test, Bonferroni correction). This association suggests that phosphorum metabolism may be affected by fluctuations in temperature.
5 Conclusion
In this paper, we proposed a new gene expression model for organisms in a field, based on a previous model (Nagano et al., 2012). The new model vastly reduces the computational time required for parameter optimization and model selection by unifying various types of gate functions and response functions, and introducing group lasso (Yuan and Lin, 2006). By applying the model to synthetically generated RNA-Seq data, we confirmed that the optimized model was consistent with the true gene expression dynamics for most genes. Further, to assess the capability of the new model, we applied it to the same ric plant data that were used in the previous study. The new model offered a comparable or slightly better prediction for most genes.
The model parameters were consistent with those of the previous study and the biological knowledge. This consistency indicates the model’s capability to provide biological insights. In fact, the investigation of model parameters found associations between genotypes and photosynthesis and between the response to temperature and phosphorum metabolism, which were not discovered in molecular biological studies. We can form a hypothesis based on such associations; it will be validated by further experimental studies.
Whereas the previous model was targeted only for microarray data, the new model is applicable to RNA-Seq data. The results of applying the new model to synthetic RNA-Seq data assuming known true models and pseudo RNA-Seq data generated from real microarray data indicated that the model may be useful for the analysis of RNA-Seq data. However, it should be noted that, in this study, the applicability was verified with only simulated data rather than real data. Further verification of performance with real data is required.
In this study, we focused on the time variation of gene expression and analyzed transcriptome data sampled over time. However, the proposed model is applicable to data collected by other sampling strategies, such as multiple treatments at a single time point, only by preparing meteorological data of sufficient length. Further, although we applied the model to the transcriptome data of rice plants in this study, it is applicable to those of other organisms.
The developed package (FIT) offers efficient parameter optimization and model selection. While the parameter optimization and model selection processes for all genes in the previous study required 30 days when a high-performance cluster computer was used, our new package does not incur such a high computational cost. Because the model describes the expression of each gene independently, the parameter optimization and model selection processes can be easily performed in parallel by dividing genes into several groups and performing these processes for each group. For example, it is expected that the parameter optimization and model selection processes can be completed by 10 Amazon EC2 m3.medium instances in half a day. This package will accelerate field transcriptomic studies.
Funding
This work was supported by Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Agency (JST) to A.J.N.; Core Research for Evolutional Science and Technology (CREST), JST to A.J.N.; Accelerated Innovation Research Initiative Turning Top Science and Ideas into High-Impact Values (ACCEL), JST to A.J.N.; and KAKENHI [JP16H06171, JP16H01473 to A.J.N.].
Conflict of Interest: none declared.






