Functional data analysis-based yield modeling in year-round crop cultivation

Abstract Crop yield prediction is essential for effective agricultural management. We introduce a methodology for modeling the relationship between environmental parameters and crop yield in longitudinal crop cultivation, exemplified by year-round strawberry and tomato production. Employing functional data analysis (FDA), we developed a model to assess the impact of these factors on crop yield, particularly in the face of environmental fluctuation. Specifically, we demonstrate that a varying-coefficient functional regression model (VCFRM) can be used to analyze time-series data, enabling visualization of seasonal shifts and of the dynamic interplay between environmental conditions, such as solar radiation and temperature, and crop yield. The interpretability of our FDA-based model yields insights for optimizing growth parameters, thereby augmenting resource efficiency and sustainability. Our results demonstrate the feasibility of VCFRM-based yield modeling, offering strategies for stable, efficient crop production that are pivotal in addressing the challenges of climate adaptability in plant factory-based horticulture.


Transforming observed data into functional data
We describe the method for transforming the observed time-course data for environmental factors, such as temperature and solar radiation, into functional data. Suppose we have $N_{ij}$ repeated measurements $x_{ij1}, \ldots, x_{ijN_{ij}}$ at time points $s_{ij1}, \ldots, s_{ijN_{ij}}$, respectively, for the $i$-th observation and the $j$-th variable. In our analysis, $i$ and $j$ correspond to the day the crop yield is observed and the environmental factor, respectively, and $N_{ij}$ is the number of days before the $i$-th day. Our goal here is to express the data $\{(s_{ij\alpha}, x_{ij\alpha});\ \alpha = 1, \ldots, N_{ij},\ i = 1, \ldots, n,\ j = 1, 2,\ s_{ij\alpha} \in \mathcal{S} \subset \mathbb{R}\}$ as a set of smooth functions $\{x_{ij}(s);\ i = 1, \ldots, n,\ j = 1, 2,\ s \in \mathcal{S}\}$ by an appropriate smoothing technique.
Suppose the observations for the $i$-th observation and the $j$-th variable, $\{(x_{ij\alpha}, s_{ij\alpha});\ \alpha = 1, \ldots, N_{ij}\}$, are drawn from the regression model
$$x_{ij\alpha} = u_{ij}(s_{ij\alpha}) + e_{ij\alpha}, \quad \alpha = 1, \ldots, N_{ij}, \tag{1}$$
where $u_{ij}(s)$ is an unknown smooth function and the errors $e_{ij\alpha}$ are independently, normally distributed with mean zero and variance $\tau_{ij}^2$. We also assume that the function $u_{ij}(s)$ can be expressed by the basis expansion
$$u_{ij}(s) = \sum_{k=1}^{m_j} w_{ijk}\,\phi_{jk}(s), \tag{2}$$
where $w_{ij1}, \ldots, w_{ijm_j}$ are unknown coefficients to be estimated, and $\phi_{j1}(s), \ldots, \phi_{jm_j}(s)$ are basis functions. Several families of basis functions, such as Fourier series or B-splines, can be used for $\phi_{jk}(s)$ $(k = 1, \ldots, m_j)$; see Green and Silverman (1994) for details. It follows from (1) that the statistical model based on the regression model via the basis expansion for the $i$-th observation and the $j$-th variable is given by
$$x_{ij\alpha} = \boldsymbol{w}_{ij}^{\top}\boldsymbol{\phi}_j(s_{ij\alpha}) + e_{ij\alpha}, \quad e_{ij\alpha} \sim N(0, \tau_{ij}^2), \tag{3}$$
where $\boldsymbol{w}_{ij} = (w_{ij1}, \ldots, w_{ijm_j})^{\top}$ and $\boldsymbol{\phi}_j(s) = (\phi_{j1}(s), \ldots, \phi_{jm_j}(s))^{\top}$ are vectors of coefficients and basis functions, respectively, and the superscript $\top$ denotes the transpose of a vector or matrix.
In the functional data analysis context, functional data are obtained using common basis functions for all $i$. On the other hand, the smoothness of the functions differs from one observation to another, and for this reason the maximum likelihood method can be inappropriate for obtaining estimators of $\boldsymbol{w}_{ij}$ and $\tau_{ij}^2$. Instead, we apply the penalized likelihood method, which imposes a different amount of smoothness penalty on each subject (Araki et al., 2009). The penalized likelihood method obtains estimators by maximizing the following penalized log-likelihood function:
$$\ell_{\zeta_{ij}}(\boldsymbol{w}_{ij}, \tau_{ij}^2) = \sum_{\alpha=1}^{N_{ij}} \log f(x_{ij\alpha} \mid s_{ij\alpha};\ \boldsymbol{w}_{ij}, \tau_{ij}^2) - \frac{\zeta_{ij}}{2}\,\boldsymbol{w}_{ij}^{\top}\Omega_{ij}\boldsymbol{w}_{ij}, \tag{4}$$
where $f$ is the Gaussian density implied by the model (3), $\boldsymbol{x}_{ij} = (x_{ij1}, \ldots, x_{ijN_{ij}})^{\top}$, and $\Phi_{ij}$ is the $N_{ij} \times m_j$ matrix defined by $\Phi_{ij} = (\boldsymbol{\phi}_j(s_{ij1}), \ldots, \boldsymbol{\phi}_j(s_{ijN_{ij}}))^{\top}$. Furthermore, $\zeta_{ij} > 0$ is a regularization or smoothing parameter, which adjusts the amount of smoothness and also avoids ill-posed problems, and $\Omega_{ij}$ is an $m_j \times m_j$ positive semi-definite matrix. Several forms are available for the matrix $\Omega_{ij}$; here, for all $i$, we set $\Omega_{ij} = D_j^{(2)\top} D_j^{(2)}$ with the $(m_j - 2) \times m_j$ matrix
$$D_j^{(2)} = \begin{pmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{pmatrix} \tag{5}$$
that represents the second-order difference operator, so that the penalty term is given as
$$\boldsymbol{w}_{ij}^{\top}\Omega_{ij}\boldsymbol{w}_{ij} = \sum_{k=3}^{m_j} (\Delta^2 w_{ijk})^2,$$
where $\Delta$ is the difference operator given by $\Delta w_{ijk} = w_{ijk} - w_{ij(k-1)}$. The first term on the right-hand side of (4) is the log-likelihood of the probability model (3), whereas the second term is the penalty for the smoothness of the function $x_{ij}(s)$. By maximizing (4) with respect to $\boldsymbol{w}_{ij}$ and $\tau_{ij}^2$, we obtain the penalized maximum likelihood estimators
$$\hat{\boldsymbol{w}}_{ij} = (\Phi_{ij}^{\top}\Phi_{ij} + \zeta_{ij}\hat{\tau}_{ij}^2\Omega_{ij})^{-1}\Phi_{ij}^{\top}\boldsymbol{x}_{ij}, \qquad \hat{\tau}_{ij}^2 = \frac{1}{N_{ij}}(\boldsymbol{x}_{ij} - \Phi_{ij}\hat{\boldsymbol{w}}_{ij})^{\top}(\boldsymbol{x}_{ij} - \Phi_{ij}\hat{\boldsymbol{w}}_{ij}). \tag{6}$$
Then the observed longitudinal data $\{(x_{ij\alpha}, s_{ij\alpha});\ \alpha = 1, \ldots, N_{ij}\}$ are transformed into functional data expressed as
$$x_{ij}(s) = \hat{\boldsymbol{w}}_{ij}^{\top}\boldsymbol{\phi}_j(s). \tag{7}$$
The values of the penalized maximum likelihood estimators (6), and hence the smoothness of the functional data (7), strongly depend on the number of basis functions $m_j$ and the value of the regularization parameter $\zeta_{ij}$, so the selection of these values is an important issue. To obtain functions with appropriate smoothness for each individual, model selection criteria (Konishi and Kitagawa, 2008) can be used; here we use cross-validation to select $m_j$ and $\zeta_{ij}$.
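As a concrete illustration of this smoothing step, the following Python sketch fits a penalized B-spline expansion to one noisy series using the second-order difference penalty described above. The basis size, smoothing parameter, and toy data are illustrative assumptions, not values from our analysis.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(s, n_basis, degree=3):
    """Evaluate a clamped B-spline basis on [0, 1] at points s (design matrix Phi)."""
    interior = np.linspace(0.0, 1.0, n_basis - degree + 1)
    knots = np.r_[np.zeros(degree), interior, np.ones(degree)]
    Phi = np.empty((len(s), n_basis))
    for k in range(n_basis):
        coef = np.zeros(n_basis)
        coef[k] = 1.0
        Phi[:, k] = BSpline(knots, coef, degree)(s)
    return Phi

# toy stand-in for one environmental series (e.g. daily temperature, rescaled time)
rng = np.random.default_rng(0)
s = np.arange(60) / 60.0                      # time points s_ij1, ..., s_ijN
x = np.sin(2 * np.pi * s) + 0.05 * rng.normal(size=60)

m = 10                                        # number of basis functions m_j (assumed)
Phi = bspline_design(s, m)                    # N_ij x m_j matrix Phi_ij
D2 = np.diff(np.eye(m), n=2, axis=0)          # second-order difference operator D_j^(2)
Omega = D2.T @ D2                             # penalty matrix Omega_ij
zeta = 1e-4                                   # smoothing parameter zeta_ij (assumed)

# for Gaussian errors, maximizing the penalized likelihood reduces to
# penalized least squares: w = (Phi'Phi + zeta * Omega)^{-1} Phi' x
w_hat = np.linalg.solve(Phi.T @ Phi + zeta * Omega, Phi.T @ x)
x_smooth = Phi @ w_hat                        # functional datum x_ij(s) on the grid
```

In practice one would repeat this fit over a grid of $m_j$ and $\zeta_{ij}$ and keep the values selected by cross-validation.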

Details of the varying-coefficient functional linear model
Suppose we have $n$ sets of observations $\{x_{ij}(s), y_i(t_i), t_i;\ i = 1, \ldots, n,\ j = 1, \ldots, p,\ s \in \mathcal{S} \subset \mathbb{R},\ t \in \mathcal{T} \subset \mathbb{R}\}$, where $x_{i1}(s), \ldots, x_{ip}(s)$ are $p$ functional predictors and $y_i(t_i)$ is a scalar response that depends on another variable $t_i$. In our analysis, $y_i(t_i)$ corresponds to the average crop yield from day $t_i$ to one week after. Furthermore, $x_{ij}(s)$ $(j = 1, 2)$ are the temperature and solar radiation, respectively, transformed into functional data over the 80 days before day $t_i$. We then model the relationship between the predictors and the response by the varying-coefficient functional regression model (VCFRM; Cardot and Sarda, 2008; Wu et al., 2010):
$$y_i(t_i) = \sum_{j=1}^{p} \int_{\mathcal{S}} x_{ij}(s)\,\beta_j(s, t_i)\,ds + \varepsilon_i, \tag{8}$$
where $\beta_j(s, t)$ $(j = 1, 2)$ are coefficient surfaces, and $\varepsilon_i$ is a normally distributed error such that $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^{\top}$ has mean zero vector and variance-covariance matrix $\Sigma$.
The coefficient function $\beta_j(s, t)$ provides insights into how the functional predictor $x_{ij}(s)$ relates to the response $y_i$. The data corresponding to the environmental factors are observed longitudinally and are not functions in nature. Thus we apply the procedure described in the previous section to transform the longitudinal data into functional data before applying the VCFRM. The longitudinal data for the environmental factors are then expressed as
$$x_{ij}(s) = \boldsymbol{w}_{ij}^{\top}\boldsymbol{\phi}_j(s) = \sum_{k=1}^{m_j^{(1)}} w_{ijk}\,\phi_{jk}(s), \tag{9}$$
where $m_j^{(1)}$ is the number of basis functions for the $j$-th variable, $\boldsymbol{\phi}_j(s) = (\phi_{j1}(s), \ldots, \phi_{jm_j^{(1)}}(s))^{\top}$ is a vector of basis functions, and $\boldsymbol{w}_{ij} = (w_{ij1}, \ldots, w_{ijm_j^{(1)}})^{\top}$ is a vector of coefficients of the basis functions $\boldsymbol{\phi}_j(s)$. The $\boldsymbol{w}_{ij}$ are obtained by the penalized likelihood method described in the previous section and are therefore treated as known here.
To estimate the coefficient functions $\beta_j(s, t)$ in the VCFRM (8), we assume that the $\beta_j(s, t)$ are expressed by the basis expansion
$$\beta_j(s, t) = \sum_{k=1}^{m_j^{(1)}} \sum_{l=1}^{m_j^{(2)}} b_{jkl}\,\phi_{jk}(s)\,\psi_{jl}(t) = \boldsymbol{\phi}_j(s)^{\top} B_j \boldsymbol{\psi}_j(t), \tag{10}$$
where $\boldsymbol{\psi}_j(t) = (\psi_{j1}(t), \ldots, \psi_{jm_j^{(2)}}(t))^{\top}$ is a vector of basis functions in the $t$ direction and $B_j = (b_{jkl})_{kl}$ is an $m_j^{(1)} \times m_j^{(2)}$ matrix of unknown coefficients.
By applying the basis expansions (9) and (10), the VCFRM is expressed as
$$y_i(t_i) = \sum_{j=1}^{p} \boldsymbol{w}_{ij}^{\top}\Phi_j B_j \boldsymbol{\psi}_j(t_i) + \varepsilon_i = \boldsymbol{z}_i^{\top}\boldsymbol{\theta} + \varepsilon_i, \tag{11}$$
where $\Phi_j = \int_{\mathcal{S}} \boldsymbol{\phi}_j(s)\boldsymbol{\phi}_j(s)^{\top}\,ds$, $\boldsymbol{z}_i = \left((\boldsymbol{\psi}_1(t_i) \otimes \Phi_1^{\top}\boldsymbol{w}_{i1})^{\top}, \ldots, (\boldsymbol{\psi}_p(t_i) \otimes \Phi_p^{\top}\boldsymbol{w}_{ip})^{\top}\right)^{\top}$ is a known vector, and $\boldsymbol{\theta} = ((\mathrm{vec}\,B_1)^{\top}, \ldots, (\mathrm{vec}\,B_p)^{\top})^{\top}$ is a vector of unknown parameters. Here $\otimes$ denotes the Kronecker product, which for arbitrary matrices $A = (a_{ij})_{ij}$ $(i = 1, \ldots, n,\ j = 1, \ldots, m)$ and $B$ gives $A \otimes B = (a_{ij}B)_{ij}$, and $\mathrm{vec}$ is the operator that stacks the columns of a matrix into a vector. Therefore, the problem of estimating the VCFRM (8) becomes that of estimating $\boldsymbol{\theta}$ in (11), which can be done by traditional estimation strategies. Furthermore, this provides a statistical model for the response vector $\boldsymbol{y} = (y_1(t_1), \ldots, y_n(t_n))^{\top}$ as follows:
$$\boldsymbol{y} = Z\boldsymbol{\theta} + \boldsymbol{\varepsilon},$$
where $Z = (\boldsymbol{z}_1, \ldots, \boldsymbol{z}_n)^{\top}$.
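The reduction to a linear model in $\boldsymbol{\theta}$ rests on the vec-Kronecker identity $\boldsymbol{w}^{\top}\Phi B \boldsymbol{\psi}(t) = (\boldsymbol{\psi}(t) \otimes \Phi^{\top}\boldsymbol{w})^{\top}\mathrm{vec}\,B$. The following Python snippet checks this numerically for one predictor, using random stand-ins for the basis quantities (the dimensions and values are arbitrary, chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2 = 4, 3                      # basis sizes in the s and t directions (assumed)
w = rng.normal(size=m1)            # w_ij: basis coefficients of x_ij(s)
B = rng.normal(size=(m1, m2))      # B_j: coefficient matrix of beta_j(s, t)
Phi = rng.normal(size=(m1, m1))    # stands in for the Gram matrix of phi_j(s)
psi_t = rng.normal(size=m2)        # psi_j(t_i) evaluated at one time point

# left-hand side: the integral term w' Phi B psi(t_i)
lhs = w @ Phi @ B @ psi_t

# right-hand side: z_i' vec(B_j) with z_i = psi_j(t_i) kron (Phi_j' w_ij)
z_i = np.kron(psi_t, Phi.T @ w)
rhs = z_i @ B.flatten(order="F")   # column-wise vec of B_j
```

The two quantities agree to machine precision, which is exactly why estimating the functional model reduces to estimating the finite-dimensional vector $\boldsymbol{\theta}$.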
We estimate the unknown parameter $\boldsymbol{\theta}$ by the penalized likelihood method, that is, by maximizing the penalized log-likelihood function
$$\ell_{\lambda}(\boldsymbol{\theta}, \Sigma) = \ell(\boldsymbol{\theta}, \Sigma) - \frac{n}{2}\boldsymbol{\theta}^{\top}\Omega_{\lambda}\boldsymbol{\theta},$$
where $\ell(\boldsymbol{\theta}, \Sigma) = \log f(\boldsymbol{y}; \boldsymbol{\theta}, \Sigma)$ is the log-likelihood function and $\Omega_{\lambda}$ is a non-negative definite matrix that controls the smoothness of the unknown functions $\beta_j(s, t)$ and depends on tuning parameters. In our analysis, we set $\Omega_{\lambda} = \mathrm{blockdiag}\{\Omega_1, \ldots, \Omega_p\}$, where $\mathrm{blockdiag}$ denotes a block diagonal matrix whose diagonal blocks are the matrices in brackets. Furthermore, $\Omega_j$ is given by
$$\Omega_j = \lambda_s \Omega_j^{(s)} + \lambda_t \Omega_j^{(t)}.$$
The matrices $\Omega_j^{(s)}$ and $\Omega_j^{(t)}$ serve to smooth the coefficient functions $\beta_j(s, t)$ in the $s$ and $t$ directions, respectively, and $\lambda_s, \lambda_t > 0$ are tuning parameters that control the degrees of smoothness in the $s$ and $t$ directions, respectively.
Maximizing $\ell_{\lambda}(\boldsymbol{\theta}, \Sigma)$ with respect to $\boldsymbol{\theta}$, we obtain the maximum penalized likelihood estimator
$$\hat{\boldsymbol{\theta}} = (Z^{\top}\hat{\Sigma}^{-1}Z + n\Omega_{\lambda})^{-1}Z^{\top}\hat{\Sigma}^{-1}\boldsymbol{y}. \tag{12}$$
The estimator $\hat{\Sigma}$ of the variance-covariance matrix $\Sigma$ depends on the assumed covariance structure.
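Under the simplest covariance structure $\Sigma = \sigma^2 I$, the penalized maximum likelihood estimator reduces to a generalized ridge solve. The sketch below illustrates this on synthetic data; the design matrix, penalty, and tuning value are assumptions for the sake of the example, not our fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 200, 12                     # sample size; length of theta = (vec B_1, ..., vec B_p)
Z = rng.normal(size=(n, q))        # rows z_i', in practice built from psi_j(t_i) kron Phi_j' w_ij
theta_true = rng.normal(size=q)
y = Z @ theta_true + 0.05 * rng.normal(size=n)

# roughness penalty: a plain ridge lambda * I stands in here for
# blockdiag{lambda_s * Omega_j^(s) + lambda_t * Omega_j^(t)}
lam = 1e-6
Omega_lam = lam * np.eye(q)

# maximum penalized likelihood estimator, Sigma = sigma^2 I assumed:
# theta = (Z'Z + n * Omega_lambda)^{-1} Z' y
theta_hat = np.linalg.solve(Z.T @ Z + n * Omega_lam, Z.T @ y)
```

With correlated errors one would instead whiten by $\hat{\Sigma}^{-1/2}$ before solving, which recovers the general form of the estimator.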
Furthermore, the predicted crop yield for a future time point $t_i$ is given by $\hat{y}_i(t_i) = \boldsymbol{z}_i^{\top}\hat{\boldsymbol{\theta}}$.

Selection of tuning parameters
To obtain an estimator of $\beta_j(s, t)$ smooth enough that the relationship between environmental factors and crop yield can be easily interpreted, we have to select optimal values of the tuning parameters $\lambda_s$ and $\lambda_t$. Here we select these values by an information criterion that measures the goodness of the statistical model from the prediction viewpoint.
For more details of the information criteria, see Konishi and Kitagawa (2008).
Given the maximum penalized likelihood estimators of the coefficient parameter $\boldsymbol{\theta}$ in (12) and of the variance-covariance matrix $\Sigma$ of the VCFRM, a modified version of Akaike's Information Criterion (mAIC) for evaluating the VCFRM is given by
$$\mathrm{mAIC} = -2\,\ell(\hat{\boldsymbol{\theta}}, \hat{\Sigma}) + 2\,\mathrm{edf}, \tag{13}$$
where $\mathrm{edf}$ is the effective degrees of freedom, given by $\mathrm{edf} = \mathrm{tr}\{Z(Z^{\top}Z + n\Omega_{\lambda})^{-1}Z^{\top}\}$.
The edf represents a measure of the complexity of the estimated VCFRM. In this work, we analyze time-series data for the response, which requires cross-validation to be used with care; information criteria such as the mAIC are therefore useful tools for evaluating the statistical model. We repeatedly calculate the mAIC (13) for several different values of the tuning parameters $\lambda_s$ and $\lambda_t$ and then choose the values that minimize the mAIC.
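The selection procedure can be sketched as a grid search over the tuning parameter. The following Python fragment computes the mAIC under the simplifying assumption $\Sigma = \sigma^2 I$ (so the log-likelihood has a closed form) and, for brevity, uses a single ridge-type penalty $\lambda I$ in place of the separate $\lambda_s$ and $\lambda_t$ penalties; the data are synthetic.

```python
import numpy as np

def maic(Z, y, Omega_lam):
    """mAIC of the penalized fit: -2 * loglik + 2 * edf (Sigma = sigma^2 I assumed)."""
    n = len(y)
    A = np.linalg.solve(Z.T @ Z + n * Omega_lam, Z.T)  # (Z'Z + n*Omega_lambda)^{-1} Z'
    theta = A @ y
    resid = y - Z @ theta
    sigma2 = resid @ resid / n
    edf = np.trace(Z @ A)                              # tr{Z (Z'Z + n*Omega_lambda)^{-1} Z'}
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * edf

# toy data and a one-dimensional grid standing in for the (lambda_s, lambda_t) grid
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 6))
y = Z @ rng.normal(size=6) + 0.1 * rng.normal(size=100)
grid = [10.0 ** k for k in range(-8, 3)]
best = min(grid, key=lambda lam: maic(Z, y, lam * np.eye(6)))
```

In the two-parameter case the same loop simply runs over all $(\lambda_s, \lambda_t)$ pairs in the grid, forming $\Omega_{\lambda} = \mathrm{blockdiag}\{\lambda_s \Omega_j^{(s)} + \lambda_t \Omega_j^{(t)}\}$ at each step.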