A Bayesian model for genomic prediction using metabolic networks

Abstract Motivation Genomic prediction is now an essential technique in breeding and medicine, and it is interesting to see how omics data can be used to improve prediction accuracy. Precedent work proposed a metabolic network-based method in biomass prediction of Arabidopsis; however, the method consists of multiple steps that possibly degrade prediction accuracy. Results We proposed a Bayesian model that integrates all steps and jointly infers all fluxes of reactions related to biomass production. The proposed model showed higher accuracies than methods compared both in simulated and real data. The findings support the previous excellent idea that metabolic network information can be used for prediction. Availability and implementation All R and stan scripts to reproduce the results of this study are available at https://github.com/Onogi/MetabolicModeling.


Introduction
Genomic prediction is a statistical method for predicting phenotypes or genetic merits using genome-wide DNA markers (Meuwissen et al. 2001).Genomic prediction is now essential in animal breeding (e.g.Garc ıa-Ruiz et al. 2016, Scott et al. 2021) and is becoming increasingly important in plant breeding (e.g.Dreisigacker et al. 2021).In breeding, selection strategies employing genomic prediction are often referred to as genomic selection.The framework of genomic prediction is also used in medicine to predict disease risks, which are known as polygenic or genetic risk scores (Wray et al. 2019, Lello et al. 2020).
One of the primary goals of genomic prediction research has been to improve prediction accuracy (e.g.Heslot et al. 2012, Onogi et al. 2015, Dreisigacker et al. 2021).This interest has motivated various statistical methods and models, including Bayesian regressions (reviewed in de los Campos et al. 2013, Dreisigacker et al. 2021).It has also been explored how to utilize information other than the genome to enhance prediction accuracy.The typical additional information is "omics data."For example, transcriptome (Li et al. 2019, Perez et al. 2022) and metabolome (Riedelsheimer et al. 2012, Campbell et al. 2021) are often used together with the genome, and some studies have used both (Xu et al. 2016, Schrag et al. 2018).Furthermore, because data in biology are essentially multivariate (i.e.multiple traits can be measured for a genotype at multiple environments), it has been of interest to learn how to utilize multivariate information to improve prediction accuracy for target traits (e.g.Jia and Jannink 2012, Jarquin et al. 2016, Atanda et al. 2022).
Upon such trends, Tong et al. (2020) proposed utilizing metabolic networks to predict the biomass of Arabidopsis thaliana (Arabidopsis).The method is based on the concept of flux balance: which states that the fluxes (amounts) of each reaction related to biomass production are determined in such a way that the production and consumption of each metabolite are balanced.Under this constraint, the flux of biomass production is estimated using quadratic programming (QP) such that it matches the observed biomass.When predicting the biomass of a new genotype, the fluxes of all reactions and biomass production are first predicted using ordinary genomic prediction methods, and the predicted fluxes are then adjusted using QP to fill the flux balance.
Although their findings suggest that metabolic network information can be used to predict phenotypes, their approaches still have room for improvement.First, because the flux is estimated for each genotype independently, dependencies (or covariances/similarities) between genotypes are not used.Second, because flux estimation and prediction are done separately, uncertainty in estimation increases noise during prediction.To overcome these drawbacks, the current study proposes a Bayesian model that estimates fluxes of all genotypes simultaneously and conducts flux estimation and prediction jointly.A fundamental question of the present study is, as well as that of Tong et al. (2020), whether, once metabolite networks are clarified, the information on networks is helpful for phenotype prediction.The performance of the proposed model was compared with other methods, including the approach of Tong et al. (2020).The results demonstrated the superiority of the proposed model and supported the excellent idea presented by Tong et al. (2020) that metabolic network information can aid in phenotype prediction.

Systems and methods
Let M i denote the stoichiometry matrix of genotype i (Fig. 1A).M i is a K by J matrix containing reaction coefficients, where K and J denote the number of metabolites and reactions, respectively.Let the Jth reaction be the one related to the target trait, and the other reactions (reaction j where 1 j < J) represent processes that produce metabolites related to the target trait.Here, the target trait is the biomass of Arabidopsis.Let V i;j denote the flux of reaction j of genotype i which represents the amount that the reaction occurs in the genotype's cells/tissues and let V be an N by J matrix containing all fluxes of all genotypes, where N is the number of genotypes.V i is the ith rows of V which denotes the fluxes of genotype i.The stoichiometry matrix M i is referred to as N by Tong et al. (2020).
Each metabolite's production and consumption are balanced according to the flux balance analysis principle.This can be written as an equation such as where M i;k denotes the kth row of M i .Equation ( 1) is also illustrated in Fig. 1A.The flux of the Jth reaction (i.e.reaction for the target trait), V ;J , is assumed to be proportional to the observed phenotypes of the target trait, Y. Here, Y is an Nlength vector that contains the phenotypes of the target trait of all genotypes.
To summarize, Tong et al. (2020) estimates V i independently for each genotype so that Equation (1) is filled and the ratio between V i;J and V ColÀ0;J , a flux of the reference genotype Columbia-0, is matched the ratio between Y i and Y Col-0 .(Fig. 1B).The flux estimated in this manner is referred to as Vt in the following.[See Section 6.2 for the explanation of prediction by Tong et al. (2020).] In the present study, V ;J is connected with Y as where 1 indicates the vector of 1 and e is a gaussian noise, that is, e $ N 0; Ir 2 e À Á where 0 indicates the vector of 0 and I is an identity matrix.
The purpose of the proposed model is to estimate V ;J under the constraint of Equation (1).To treat the constraint in a Bayesian hierarchical model, Equation ( 1) is rewritten as where e ik $ N 0; r 2 e À Á .Equations ( 2) and (3) constitute the likelihood of the model.The prior distributions of r 2 e and r 2 e are half-Cauchy distributions written as, r 2 e $ HC 0; S e À Á ; S e $ Gamma s gamma ; r gamma À Á ; respectively.Here, s gamma and r gamma denote the shape and rate parameters set to 0.1 and 1.0, respectively.Positive values of flux (V) mean that the reaction flows from the left to the right hand, whereas negative values mean the opposite direction.Some reactions can occur in both directions, whereas some only occur in either direction (Fig. 1A).Let J n denote the set of reactions that can proceed in both directions and J r denote the set of reactions that have a positive flux restriction.Then, the prior distributions of V are written as, where U ;j defines the genetic component of V ;j ., and TruncN denotes truncated normal distributions.The prior distribution of U ;j is written as where l j is a known mean value described later and G is a genomic relationship matrix (VanRaden 2008) that is defined with single nucleotide polymorphisms (SNPs).
The unobserved phenotype Y i of new genotype i can be predicted as Here, V i;J can be estimated from the constraints of Equation (3) even when the phenotype Y i is unavailable, and from U i;J which can be predicted with G.In Fig. 1B, the outline of the proposed model is illustrated with the precedent approach by Tong et al. (2020).
Based on prior physiological knowledge, some fluxes may be constrained by other fluxes.For example, in the biomass prediction proposed by Tong et al. (2020), it was supposed that 0:94V i;oxg V i;car 3:81V i;oxg for 1 i N (4) and 0:79V i;suc V i;sta 3:37V i;suc for 1 i N; (5) where V i;oxg , V i;car , V i;suc , and V i;sta denote the fluxes of oxygenation, carboxylation, sucrose synthesis, and starch synthesis reactions, respectively.Such constraints are referred to as "additional constraints" in the following.However, in the case of Arabidopsis biomass, these additional constraints resulted in erroneous predictions.This issue is mentioned in Section 7.

Algorithm and implementation
All statistical inferences were conducted with R (ver.4.1.3on Windows machines or ver.4.2.2 on Ubuntu 18.4 machines) (R Core Team 2022).The posterior distributions of the proposed models were calculated using Hamiltonian Monte Carlo (HMC), which was implemented by rstan (ver.2.21.5 or 2.21.7,Stan Development Team 2022).The number of iterations was 5000, the length of the warm-up was 4000, and the thinning was 1. NUTS (No-U-Turn Sampler) was used for sampling.Because both V and U are high dimensional (N by J matrices), to facilitate exploration of the posterior distribution, two modifications were made from the base model defined above when implementing the model using stan.These modifications are illustrated in Sections 3.1 and 3.2.

Scale and location
The scales (r 2 Vj ) and locations (l j ) of flux are considerably different among reactions, which can hamper the efficient exploration of posterior distributions.Thus, using prior knowledge, the scales and locations were changed to be comparable across reactions.Suppose that rough estimates for r 2 Vj , r 2 Uj , and , and l j $ , can be obtained from the literature and/or preceding experiments.Define d j as then divide V and U with d j as V ;j 0 ¼ V ;j =d j and U ;j 0 ¼ U ;j =d j yielding prior distributions of V ;j 0 and U ;j 0 represented as where r 2 Uj 0 and r 2 Vj 0 correspond with r 2 Uj =d 2 j and r 2 Vj =d 2 j , respectively.The location of V ;j 0 was then arbitrarily adjusted to 10 using l j $ as Metabolic network-based prediction where a j ¼ 10 À l j $ =d j : Inference was conducted using V ;j 00 and U ;j 0 .The constraint of Equation (1) becomes that is, Thus, Equation (3) can be written as where e ik $ N 0; r 2 e À Á .Equation ( 2) is rewritten using V ;J 00 as The prior distributions of a 0 , b 0 , and variances are half-Cauchy distributions written as,

Weight in the likelihood
Equations ( 6) and ( 7), terms for observations and constraints, respectively, define the likelihood of the model.However, the difference in dimensions (the observations consist of N components while the constraints consist of N Â K components) can cause underfitting to Y if these two terms are combined naively in the log likelihood function.Thus, a parameter to control the relative weight between the observations and constraints was introduced to the log likelihood: where W is the weight parameter.The effect of W was investigated using simulated and real data.

Real data analysis
The real data used in this study were the same as the data used by Tong et al. (2020).The target trait is the biomass of Arabidopsis evaluated under the 12-h-light optimal nitrogen conditions (Sulpice et al. 2013) The prediction ability of the proposed model was assessed with 3-fold cross-validation (CV).The genotype divisions were done following Tong et al. (2020).This precedent study conducted a 3-fold CV 50 times and opened the genotype divisions of all CVs at the above-mentioned site.Following the first 20 CVs of Tong et al. (2020), a 3-fold CV was carried out 20 times in the current study.R package snow (ver.0.4-4) was used to parallelize CVs.
Modifying scales and locations as described in Section 3.1 requires rough estimates for r 2 Vj , r 2 Uj , and l j (i.e.r 2 Vj $ , r 2 Uj $ , and l j $ ).To obtain these estimates, the following linear model was fitted to the training data: where Vt ;j is the fluxes of reaction j estimated by Tong et al.
(2020), U ;j $ N 0; Gr 2 Uj , and r j is the residual following r j $ N 0; Ir 2 Vj .Note that Vt ;j was estimated using QP for each genotype without information on SNP genotypes.It is also noteworthy that in this real data analysis, Vt ;j only took into account the genotype fluxes from the training data and did not use those from the testing data for this estimation.The solutions of r 2 Uj , r 2 Vj , and l j were obtained employing an R package rrBLUP ver.4.6.1 (Endelman 2011).
The unobserved phenotype Y i of new genotype i was predicted as Prediction accuracy was investigated using the Pearson correlation coefficient (R) between Y i and Y i ^.Following Tong et al.
(2020), R was calculated for each fold of each CV.As the weight parameter W, six values (0.16, 0.32, 0.48, 0.64, 0.80, and 0.96) were compared.Differences in accuracy between methods were tested with the Mann-Whitney-Wilcoxon test using the R function wilcox.test.

Simulation analysis
To assess the prediction ability of the proposed model, data were simulated by mimicking the real data.First, employing the model ( 9), the location (l j ) and variance components (r 2 Vj and r 2 Uj ) were estimated from all genotypes (N ¼ 67).Then, U ;j and V ;j were simulated from U ;j $ N 1l j ; Gr 2 Uj and V ;j $ N U ;j ; Ir 2 Vj ; respectively.If the flux of reaction j is restricted to be positive, samples were drawn until the restriction was met.M was the stoichiometry matrix of the reference genotype (Columbia-0).However, for each genotype, the last nonzero reaction of each metabolite was modified as follows to fill the constraints of Equation (1): where J k;Àlast indicates the set of reactions in M that are nonzero for metabolite k except for the last reaction, J k;last .For example, when the metabolite k has nonzero coefficients in M at reactions 4, 18, and 125, The observed phenotypes (Y) were then generated by Herein, e was drawn from the normal distribution.The variance of the normal distribution was set to 25% of that of V ;J .
To facilitate model fitting in simulations, a small noise, w ik , was applied to the left hand of Equation (6).That is, where w ik $ N 0; 0:0001 À Á of which the variance was determined arbitrarily.
With this process, 10 simulated datasets were generated.Then, a 3-fold CV was carried out for each dataset.Genotype divisions of CVs followed the first 10 CVs of Tong et al. (2020).As performed in the real data analysis, prediction accuracy was evaluated using R, which was calculated for each fold of each CV.Six values (0.16, 0.32, 0.48, 0.64, 0.80, and 0.96) were compared as the weight parameter W. Differences in accuracy between methods were tested with the Mann-Whitney-Wilcoxon test.
6 Methods compared

Genomic BLUP
Genomic BLUP (GBLUP), the most widely employed method in genomic prediction, was carried out employing the observed phenotypes (Y) as the response variable.The model can be written as where u $ N 0; Gr 2 u À Á and e $ N 0; Ir 2 e À Á , respectively.No information about metabolites and metabolic networks was used.The rrBLUP package ver.4.6.1 (Endelman 2011) was used to execute.

Quadratic programming
This method was proposed by Tong et al. (2020).First, using the model ( 9), GBLUP predicts fluxes of genotypes in testing data (see also Fig. 1B).The predicted fluxes, U test;j , however, do not fill the constraints of Equation ( 1).Thus, the steady state fluxes, V test;j , are estimated using QP in the R package CVXR (ver.1.0-10) (Fig. 1B).The objective function and constraints of this optimization are V test;j !0 for j 2 J r : V test;J at the steady state was then treated as the predicted value for Y.Although the constraints of Equations ( 4) and ( 5) were applied in the original method proposed by Tong et al. (2020), here QP was performed without these additional constraints due to the reasons described in Section 7.

MegaLMM
MegaLMM is a multitrait mixed model that is based on the factorization of genetic covariances between traits (Runcie et al. 2021).This method treated estimated fluxes (Vt ;j ) and Y as traits, and the following multitrait mixed model was fitted: where The second and third arguments in the normal distribution N denote the covariances between rows and columns, respectively.X represents the design matrix, which only included the intercepts for each trait, and B represents the fixed effects.

Metabolic network-based prediction
F is an N Â P latent factor matrix, and K is a P Â J þ 1 ð Þfactor loading matrix.W R and U R are diagonal matrices of dimension J þ 1, and W F and U F are diagonal matrices of dimension P. Here, P is the dimension of the latent variables.The prior distributions of parameters followed the vignette of the R package MegaLMM, which can be found at https://github.com/deruncie/MegaLMM/blob/master/vignettes/Running_MegaLMM.Rmd.Y was scaled before fitting.The total number of iterations was 10 000, thinning was 2, and the posterior inference was performed on the last 500 samples.P was set to 50, 80, or 160.The R package MegaLMM (ver.0.1.0)was used to fit the model.

Results and discussion
The outcome of CVs employing simulated data is presented in Fig. 2. QP and the proposed models generally showed greater accuracy than GBLUP, suggesting that the information of metabolic networks is advantageous in predicting the target trait phenotype.The accuracy of the proposed model depended on the weight parameter W. The accuracy showed a convex curve peaking at W ¼ 0.64.The proposed models were able to predict more accurately than QP at W ¼ 0.48 and 0.64, suggesting that the proposed models potentially have superior prediction ability than QP as intended.MegaLMM was less accurate than GBLUP, although MegaLMM also used the multivariate information of fluxes.Perhaps, the factorization of genetic covariance has no meaning in these data and might make inference unstable because genetic covariances among fluxes and the target trait were not simulated.
The results of CVs employing the real data are presented in Fig. 3. Here, the QP was significantly inferior to GBLUP.The medians of the accuracies were 0.012 and 0.239, respectively.The accuracy of the proposed model showed a convex curve with W as observed in the simulations, but the peak shifted to W ¼ 0.32.At this W value, the proposed model performed significantly better than GBLUP (the median 0.367 versus 0.239).MegaLMM was equivalent to GBLUP in terms of accuracy.Figure 4 illustrates how strictly the constraint of Equation ( 6) was filled.The Pearson correlation between P J j¼1 M i;k;j d j a j and P J j¼1 M i;k;j d j V i;j 00 was almost 1.0 at each W value, suggesting that the fluxes were at steady state.
When comparing the real data results to the simulation results, two major differences were observed: (i) fewer advantages in using flux balance information and (ii) shift of optimal W values.The former may arise because of inadequate or inaccurate information on the metabolic networks of biomass production of Arabidopsis.That is, not all metabolites and reactions may have been listed, and/or some stoichiometry coefficients may have been calculated incorrectly.It will be a question for future studies as up to what extent such deficiencies in network information affect prediction accuracy.
Regarding the second point, the relative importance between the observation (Y) and constraints (flux balance) may differ between simulated and real data because the weight parameter W controls the relative contributions of these two terms to likelihood, as shown in Equation (8).Intuitively, if the observations are more reliable as an information source to infer fluxes (V) than the constraints, lower W values may be preferable and if the observations are less reliable, higher W may be preferable.Supposing that the reliability as an information source could be quantified with heritability, the outcome may suggest that the heritabilities of fluxes in real data were lesser than those employed in the simulations.Although the heritabilities in the simulations were taken from those estimated from real data, the estimates must have been overestimated due to the small data size, and overestimation for the fluxes would be more influential on inference than overestimation for the observations (Y) due to the difference in dimensions (J versus 1).
In the present applications of the proposed model and QP, the stoichiometry equation of the biomass production (i.e. the Jth column of M i ) was estimated using metabolite data Pearson correlation coefficients between the predicted and observed phenotypes are presented.There are significant differences between different characters (P < .05,after Bonferroni correction).GBLUP, genomic BLUP; QP, quadratic programming; PM-X, proposed model with the weight parameter W ¼ X; MegaLMM-X, MegaLMM with the number of latent factors ¼ X.
Figure 3. Accuracy in cross-validations employing real data.Pearson correlation coefficients between the predicted and observed phenotypes are presented.There are significant differences between different characters (P < .05,after Bonferroni correction).GBLUP, genomic BLUP; QP, quadratic programming; PM-X, proposed model with the weight parameter W ¼ X; MegaLMM-X, MegaLMM with the number of latent factors ¼ X. (Arnold andNikoloski 2014, Tong et al. 2020).That is, metabolite data of the tested genotypes were used for prediction.Thus, the CVs conducted in the present study were equivalent with the so-called CV2 scheme (Burgueño et al. 2012).Nevertheless, because both the genetic and residual correlations between the biomass and metabolites were estimated to be small and close to each other (posterior means were À0.062 6 0.031 and À0.244 6 0.051, respectively, Supplementary Table ), prediction accuracies of these methods would not be inflated by this CV scheme (Runcie and Cheng 2019).When using the 32 metabolites, which were used for constructing the stoichiometry equations of the biomass production, as indicator traits in MegaLMM, prediction accuracy obtained from the CV was better than PM-0.32 (the median was 0.405) although the difference was insignificant.This may suggest that directly using metabolites in the prediction models is better than converting them to the network information when metabolite data are available.This also suggests that the proposed model still has room for improvement as discussed later.
The unexpectedly low accuracy of QP in the real data analyses was drastically advanced by introducing the additional constraints illustrated in Equations ( 4) and ( 5).The median accuracy was 0.414, which was significantly greater than that of GBLUP and the proposed models.However, it was found that the predicted values hardly fluctuated among CVs.The mean of the Pearson correlations between predicted values obtained in different CVs was 0.999, indicating that the fluctuations in prediction accuracy reported by Tong et al. (2020) were caused by genotype division variation in CVs.
Moreover, even when the genomic relationship matrix (G) was permutated, the prediction accuracy was still high (median 0.424), suggesting that the prediction accuracy was achieved without using genomic information.Further research confirmed that the constraint of Equation ( 4) is enough to cause these phenomena (data not shown).Although it is unclear why this constraint causes the phenomenon, these findings suggest that when using flux balance for prediction, it is important to be cautious about putting constraints on fluxes.
Both the simulated and real data analyses revealed that the proposed model is capable of improving the prediction accuracy of quantitative traits.However, the proposed model has four issues.The first issue is the multimodality of the solutions.The estimates of fluxes differed between the proposed model and QP, although both methods achieved flux balance, suggesting the presence of multiple sets of solutions that can fill the constraints of Equation (3) (i.e.multimodality of solutions).Probably, both the proposed model and QP failed to find global optima, although the proposed model could find better solutions than QP, which may be due to the joint estimation of all genotypes.Thus, finding better solutions is a key to increasing the prediction accuracy.As the constraints of Equations ( 4) and ( 5) gained prediction accuracy in the study by Tong et al. (2020), imposing additional constraints may be a remedy; however, cautions are required to avoid spurious prediction.Other remedies will modify prior distributions and/or scale fluxes and phenotypes to induce the chains to explore better solutions.Informative prior distributions will be suitable for this purpose rather than P J j¼1 M i;k;j d j a j and y-axis is P J j¼1 M i;k;j d j V i;j 00 of Equation ( 6).PM-X denotes the proposed model with the weight parameter W ¼ X. Results from all CVs were pooled.The diagonal line is the 1:1 line.
Metabolic network-based prediction noninformative prior distributions.The second issue is the computational cost.HMC simulations typically took 21 h for model fitting [Windows 11 machine with Core (TM) i7-12700K 3.61 GHz], which is significantly slower than GBLUP (took <1 s.).The high computational cost of the proposed model is not surprising because the model includes J (336) random effects of which each dimension is N (67) (U).Despite this long calculation time, R ^values reported by rstan were often larger than 1.1, a criteria of convergence suggested by Gelman et al. (2014), signifying that the proposed model requires longer chains for convergence.In this study, the chain length was limited to 5000 to save computational cost.However, the unconverged results of the model will not be an issue, at least in this study, as long as this study aims to compare the proposed model with other prediction methods and as the proposed model shows better accuracy.However, this will be an issue when the proposed model is used for more practical purposes where the reliability of prediction should be considered.The third complication is how to determine W before analysis.A grid search using CV could be a viable solution, albeit at a higher computational cost.Another solution will be to put a prior distribution on W and infer from the data.The last problem is the assumption of independency among the fluxes and target traits.Because fluxes would be correlated with each other, ignoring this dependency can make inference unstable.The factorization adopted in MegaLMM will be useful to mitigate this issue.Such improvement in models will be needed to make the proposed models more applicable to various traits and species.
In summary, the proposed model was shown to be capable of improving the prediction accuracy using metabolic network information without employing additional constraints on fluxes.Although there are still problems to be addressed, this research supports the idea of Tong et al. (2020) that metabolic network information can be used to predict phenotypes of quantitative traits.

Figure 1 .
Figure 1.An overview of the proposed model and its analysis.(A) Diagram of the stoichiometry matrix and flux balance.Here, two reactions (RuBisCO and PGA kinase) and eight metabolites related to the reactions are shown.Flux balance analysis assumes that all the metabolites in the stoichiometry matrix are balanced.Note that biomass itself is not included in the stoichiometry matrix.(B) Strategy comparison between the previous study (Tong et al. 2020) and the current study.The precedent study strategy consists primarily of three steps: (i) estimation of the flux of each genotype under flux balance constraints employing quadratic programming, (ii) prediction of fluxes of new genotypes employing genomic BLUP (GBLUP), and (iii) making predicted fluxes balanced employing quadratic programming and predicting the biomass of new genotypes with the predicted flux of the biomass production.In contrast, the proposed model in the present research conducts these steps in one step employing a Bayesian hierarchical model.

Figure 2 .
Figure 2. Accuracy in cross-validations employing simulated data.Pearson correlation coefficients between the predicted and observed phenotypes are presented.There are significant differences between different characters (P < .05,after Bonferroni correction).GBLUP, genomic BLUP; QP, quadratic programming; PM-X, proposed model with the weight parameter W ¼ X; MegaLMM-X, MegaLMM with the number of latent factors ¼ X.

Figure 4 .
Figure4.Fitting status in the constraint term.The x-axis is P J j¼1 M i;k;j d j a j and y-axis is P J j¼1 M i;k;j d j V i;j 00 of Equation (6).PM-X denotes the proposed model with the weight parameter W ¼ X. Results from all CVs were pooled.The diagonal line is the 1:1 line.
Arnold and Nikoloski (2014)N) was 67, including Columbia-0.Arnold and Nikoloski (2014)created the stoichiometry matrix (M) of the reference genotype relevant to biomass.The matrix originally contained 407 metabolites and 549 reactions.A stoichiometry matrix with 350 metabolites and 336 reactions (i.e.K ¼ 350 and J ¼ 336) was obtained by eliminating reactions with zero flux.The matrix was sparse; the medians of the number of metabolites per reaction and the number of reactions per metabolite were 4 and 2, respectively.The Jth column of the stoichiometry matrix contains genotype-specific coefficients for biomass production.Arnold and Nikoloski (2014)estimated the coefficients for the reference genotype (Columbia-0), andTong et al. (2020)estimated the coefficients for the genotypes other than the reference.Both studies used measurements of 32 metabolites, including total protein, soluble metabolites, and starch which were provided bySulpice et al. (2013)together with biomass.The coefficients in the other columns (i.e.first to J À 1th columns) of the stoichiometry matrix were determined byArnold and Nikoloski (2014)mainly based on literature survey, and common for all genotypes.