Multimodal deep learning methods enhance genomic prediction of wheat breeding

Abstract While several statistical machine learning methods have been developed and studied for assessing the genomic prediction (GP) accuracy of unobserved phenotypes in plant breeding research, few methods have linked genomics and phenomics (imaging). Deep learning (DL) neural networks have been developed to increase the GP accuracy of unobserved phenotypes while simultaneously accounting for the complexity of genotype–environment interaction (GE); however, unlike conventional GP models, DL has not been investigated when genomics is linked with phenomics. In this study we used 2 wheat data sets (DS1 and DS2) to compare a novel DL method with conventional GP models. Models fitted for DS1 were GBLUP, gradient boosting machine (GBM), support vector regression (SVR), and the DL method. Results indicated that for 1 year, DL provided better GP accuracy than the other models. However, GP accuracy obtained for other years indicated that the GBLUP model was slightly superior to DL. DS2 comprises only genomic data from wheat lines tested for 3 years in 2 environments (drought and irrigated) with 2-4 traits. DS2 results showed that when predicting the irrigated environment with the drought environment, DL had higher accuracy than the GBLUP model in all analyzed traits and years. When predicting the drought environment with information on the irrigated environment, the DL and GBLUP models had similar accuracy. The DL method used in this study is novel and generalizes well, as several modules can potentially be incorporated and concatenated to produce an output for a multi-input data structure.


Introduction
The principle of genomic selection (GS) and genomic prediction (GP) is to build an accurate prediction model based on a training population comprised of individuals with both genotypic and phenotypic data to predict unobserved lines. Extensive research has been conducted, and novel statistical methods that incorporate pedigree, genomic, and environmental covariates (e.g. weather data) into statistical-genetic prediction models have been developed, studied, and employed (Crossa et al. 2021). Genomic Best Linear Unbiased Predictor (GBLUP) models are widely used in GP, and the extension of GBLUP to incorporate genotype × environment (GE) interaction improved the accuracy of predicting unobserved cultivars in specific environments. Jarquín et al. (2014) found that the prediction accuracy of models including the GE term (reaction norm model) was substantially higher (17-34%) than that of models based only on main effects. For a maize ordinal data set, A. Montesinos-López et al. (2017) reported that models including GE achieved gains of 9-14% in prediction accuracy over models including only main effects. Using wheat data, Cuevas et al. (2016) found that models with the GE term were up to 60-68% better in terms of GP accuracy (association between the true genotypic values and the genomic genotypic values) than the corresponding single-environment models.
The genomic model that includes GE has been successfully used in several wheat and cotton applications, as exemplified by Pérez-Rodríguez and de los Campos (2014), who used only pedigree relationship information and environmental covariables, and Crossa et al. (2016), who used only markers in large sets of wheat gene bank landraces. Velu et al. (2016) applied the reaction norm model in wheat lines from CIMMYT's biofortification breeding program, with Zn and Fe content in the grain and other agronomic traits measured during two successive crop seasons (2011-2012 and 2012-2013). The prediction accuracy of Zn ranged between 0.331 and 0.694, with an average of 0.542. Prediction ability for Zn was more effective in high Zn environments when compared to low Zn environments. Prediction accuracy for Fe ranged from 0.324 to 0.734 with a mean prediction of 0.529 across environments.
Continued increases in GP accuracies reflect the success of genomic data for establishing genomic-assisted plant breeding (Bonnett et al. 2022; Beyene et al. 2015) and, subsequently, for increasing crop productivity in less time. Recently, studies have shown the benefits of integrating modern technologies into breeding programs (Crossa et al. 2021; A. Montesinos-López et al. 2017; O.A. Montesinos-López et al. 2019; Costa-Neto et al. 2021a). This includes both genomics and detailed trait analysis using advances in phenomics (e.g. image or high-throughput phenotype, HTP) and environment characterization to increase the genomic prediction, thus accelerating future increases in genetic gains and crop production. There is empirical evidence showing that integrating HTP information with GS data has the potential to complement GS and increase crop productivity. One of the advantages of recent phenotyping technology is that it can quickly and accurately obtain data on many agronomic traits.
The primary goal of plant imaging HTP is to measure the different physiological stages of the plant through automated processes using digital camera technology that collects reflectance data at different wavelengths. This data can predict physiological or agronomic traits in plants with scores of spectral vegetative indices that are used as predictors. The reflectance data can also be used directly to predict the traits of interest. Using images to improve GP accuracy was highlighted by A. Montesinos-López et al. (2017), who proposed Bayesian functional regression models considering all available image bands, genomic or pedigree information, the main effects of lines and environments, as well as GE and the band × environment (BE) interaction effects. The authors observed that the models with BE interaction terms were the most accurate, while the functional regression models and the conventional models performed similarly in terms of prediction accuracy. Functional regression models are more parsimonious and computationally very efficient.
Interdisciplinary researchers in computing, statistics, and data science focus on achieving more accurate GP using various statistical models. In this regard, deep neural network methods are powerful prediction tools in machine learning. Recent developments in deep neural networks and faster computer processing have made it possible to add new layers to the neural network (deep learning, DL) to capture small cryptic correlations between inputs, which in GP are interactions between markers that reflect gene × gene and higher-order interactions (genetic epistasis). Initial applications of machine learning and neural networks in animal and plant GP were demonstrated by Gianola et al. (2011), González-Camacho et al. (2012, 2016), González-Recio and Forni (2011), and O.A. Montesinos-López et al. (2019).
Recently, several applications of GP using DL have been studied and compared with other methods (O.A. Montesinos-López et al. 2019). O.A. Montesinos-López et al. compared the prediction performance of DL applied to a multi-trait model with that of the Bayesian multi-trait and multi-environment (BMTME) model. The authors found that when the GE term was excluded, the best predictions were observed under the DL multi-trait model; however, when the GE term was considered, BMTME outperformed the DL multi-trait model. Using 9 data sets, O.A. Montesinos-López et al. found that when the GE interaction term was excluded, the DL method was better than the GBLUP model in 6 out of the 9 data sets; however, with the inclusion of GE, the GBLUP model was the best in 8 out of 9 data sets. O.A. Montesinos-López et al. also compared GBLUP with univariate and multi-trait DL and found that when GE was included, the best predictions were observed under the GBLUP model, while the worst were under the univariate DL model; when GE was ignored, the best predictions were observed under the GBLUP method and the multi-trait DL model.
The most applied DL architectures in GS are the multilayer perceptron and the convolutional neural network (CNN) (Jiang and Li 2020; O.A. Montesinos-López et al. 2021a). Although thousands of single nucleotide polymorphisms (SNPs) can be used in both DL methods, to avoid training a model with a vast number of parameters, the square root matrix or the Cholesky factor of the genomic relationship matrix is used. Another less computationally intensive option is the recently proposed method of Nazzicari and Biscarini (2022), which uses genomic data as an image input to the CNN by first ordering the data through hierarchical clustering. This last option could be used under a multimodal strategy where, for each type of information, an individual neural network is created and its outputs are used as inputs in another output layer to produce the prediction value (Ouyang et al. 2014; Baltrušaitis et al. 2018). For an early overview of deep multimodal learning models, see Ramachandram and Taylor (2017), who explore applications in diverse research fields. For example, in agriculture, Danilevicz et al. (2021) employed multimodal DL to integrate management and genotype information with multispectral data to describe plant conditions during a field trial. A recent study integrated genomic, environment, and management data in a multimodal DL and found that multimodal DL and a GBLUP that included several interactions had the best overall performance (Kick et al. 2023). Likewise, in the field of healthcare, multimodal learning models have been widely used (Huang et al. 2020; Venugopalan et al. 2021; Stahlschmidt et al. 2022; Kline et al. 2022).
Based on the previous considerations, the main objectives of this study were to compare the GP accuracy of several statistical machine learning models: (1) the conventional Bayesian GBLUP model, implemented by means of the Bayesian generalized linear regression (BGLR) R package (Pérez-Rodríguez and de los Campos 2014); (2) a novel multimodal genomic DL (MMDL) method that creates an individual neural network for each type of input (as opposed to combining all available feature types to feed a single network) and combines their outputs to produce the final prediction; (3) the gradient boosting machine (GBM), which can be used for regression (Hastie et al. 2009); and (4) support vector regression (SVR), a scheme to predict continuous values (Drucker et al. 1996).
Two wheat data sets were used in this study. Data set 1 (DS1) includes 350 wheat lines and incorporates genomic and image (NDVI) values as covariables; the NDVI was measured several times during the vegetative (VG, 4 times) and grain filling (GF, 2 times) periods, and the measurements were used either aligned on the same dates of each year or averaged over each period. Furthermore, DS1 was analyzed over 2 years (cycles 2015-2016 and 2016-2017) using 2 traits: grain yield (YLD, grams per square meter) and thousand grain weight (TGW). Genomic prediction for DS1 was performed with GBLUP, MMDL, GBM, and SVR. Data set 2 (DS2) was analyzed over 3 years; between 145 and 155 lines were tested in 2 environments (drought and irrigated), and 2 or 4 traits were measured. DS2 includes genomic but not image data. We predicted 1 entire year with the other using the GBLUP and deep learning (DL) methods with genomic information as the unique predictor.

Materials and methods
This section has 5 subsections, each one describing (1) the structure of the phenotypic data sets, (2) how the metrics for assessing the prediction accuracy were computed, (3) the genotypic data, (4) the aerial high-throughput phenotyping used for DS1, and (5) the statistical models. The Appendix details how the two phenotypic data sets, DS1 and DS2, were adjusted for the experimental design in each year and environment and for each trait.

Data set 1 (DS1)
The data set is a panel of 350 wheat cultivars and wheat accessions from a wheat gene bank, evaluated by the Physiology Unit of the Global Wheat Program at the International Maize and Wheat Improvement Center (CIMMYT) over 2 years (cycles 2015-2016 and 2016-2017) for 2 traits, grain yield (YLD, grams per square meter) and thousand grain weight (TGW), under yield potential conditions (fully irrigated, timely sown). As already mentioned, DS1 used genomic and image (NDVI) values as covariables; the NDVI was measured several times during the vegetative and grain filling periods, and these measurements were used either (1) aligned on the same dates of each year or (2) averaged over each measurement period. An alpha-lattice design was used with 2 replicates and 6 incomplete blocks of size 5. The bands were measured 4 times during the vegetative (VG) period and 2 times during grain filling (GF).

Data set 2 (DS2)
This data set is a panel of synthetic hexaploid wheat lines, formed by hybridizing durum wheat with Aegilops tauschii and doubling the chromosomes of the cross, that were evaluated over 3 consecutive years (2015-2016, 2016-2017, and 2017-2018) by the Physiology Unit of the Global Wheat Program at CIMMYT. For each year, between 145 and 155 lines were tested in 2 kinds of environments, drought (suboptimal yield potential conditions) and irrigated (optimal yield potential conditions), and 2 or 4 traits were measured (biomass = BM, harvest index = HI, thousand kernel weight = PMG, and grain yield = YLD). To predict each trait in one environment with information from the other environment for each year, the GBLUP and DL methods were used with G as the unique predictor.

Assessing prediction accuracy
The prediction accuracy of the several models was assessed in two ways: (1) predicting the performance of 1 entire year using all the lines from another year (leave-one-year-out, LOO) and (2) a 5-fold cross-validation (5FCV), in which a training set of 80% of the wheat lines is used to predict a testing set comprising the remaining 20% of unobserved wheat lines. Note that the multimodal DL model is only used under the 5FCV, where the genomic and year predictors were included.
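As an illustration, the 5FCV scheme just described can be sketched in a few lines. The `fit`/`predict` callables and the ordinary-least-squares toy model below are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def five_fold_cv(X, y, fit, predict, seed=0):
    """5-fold cross-validation: train on 4/5 of the lines, predict the
    held-out 1/5, and report the average correlation between observed
    and predicted values across the 5 folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 5)
    cors = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        model = fit(X[train], y[train])
        y_hat = predict(model, X[test])
        cors.append(np.corrcoef(y[test], y_hat)[0, 1])
    return float(np.mean(cors))

# Toy check with ordinary least squares as the stand-in model
# (noiseless linear data, so the fold correlations are ~1):
X = np.random.default_rng(1).normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
predict = lambda b, Xq: Xq @ b
print(round(five_fold_cv(X, y, fit, predict), 2))
```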
For DS2, no image information was available, and for each year the prediction accuracy was assessed by predicting the performance of 1 entire environment (irrigated or drought) using all the lines from the other environment (LOO). Consequently, for this data set the multimodal DL reduces to the standard DL, because the year (environment) module and its interaction term do not apply under the LOO cross-validation method.

Genotyping data
For both panels, DNA was extracted using a commercial kit, and the lines were genotyped using the 35K Axiom Wheat Breeders array (Affymetrix, High Wycombe, UK) (Allen et al. 2017). For each marker, the genotype of each line was coded as the number of copies of a designated marker-specific allele carried by the line (absence = 0 and presence = 1). SNP markers with unexpected heterozygosity >10% were removed; remaining heterozygous genotypes were recoded as either AA or BB according to the allele with the higher frequency. Markers with more than 20% missing values or with minor allele frequency (MAF) < 0.05 were also removed. Residual missing genotypes were imputed using mean imputation. A total of 10,560 and 10,064 SNPs remained available for analysis after quality control and imputation for DS1 and DS2, respectively.
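The quality-control steps above (missingness and MAF filters followed by mean imputation) can be sketched as follows; the function name is illustrative, with the thresholds taken from the text.

```python
import numpy as np

def qc_and_impute(M, max_missing=0.20, min_maf=0.05):
    """Filter a (lines x markers) 0/1 genotype matrix (np.nan = missing)
    by per-marker missingness and minor allele frequency, then
    mean-impute residual missing genotypes. Column order is preserved."""
    M = np.asarray(M, dtype=float).copy()
    miss = np.isnan(M).mean(axis=0)
    freq = np.nanmean(M, axis=0)              # allele frequency per marker
    maf = np.minimum(freq, 1.0 - freq)
    keep = (miss <= max_missing) & (maf >= min_maf)
    M = M[:, keep]
    col_mean = np.nanmean(M, axis=0)          # mean imputation
    idx = np.where(np.isnan(M))
    M[idx] = col_mean[idx[1]]
    return M, keep

# Tiny example: marker 2 is monomorphic (MAF = 0) and marker 3 has 25%
# missing values, so only marker 1 survives.
G_raw = np.array([[0, 1, np.nan],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 1, 0]], dtype=float)
Mq, keep = qc_and_impute(G_raw)
print(Mq.shape, list(keep))  # (4, 1) [True, False, False]
```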

Aerial high-throughput phenotyping for DS1
Aerial multispectral images were collected using UAVs over the trials. In year 2015-2016, multispectral images were collected at 65 m above the ground using a MultiSpec 4C camera (SenseFly/Airinov, France) mounted on an eBee fixed-wing UAV (SenseFly Ltd., Switzerland). During year 2016-2017, the aerial imagery was collected at 60 m above ground level using the AscTec Falcon 8 multirotor UAV (Ascending Technologies, Germany) equipped with an ADC Lite multispectral camera (Tetracam, USA).
The multiple images were collected over the trials flying the drones in north-south flight lines with a lateral and longitudinal overlap of ∼80% of the field of view. Data collection took place under clear skies around noon. The images were used to reconstruct mosaics of the trials using the Pix4D Mapper software (Pix4D, Switzerland). The geo-referencing of each mosaic was achieved using ground control points distributed in the field. The spectral reflectance was calculated using calibration panels and then the NDVI was calculated using the RED and NIR bands. The NDVI orthomosaics were exported to ArcGIS (ESRI, USA) where a grid of polygons representing each plot was adjusted on top of the image. Before data extraction, the polygons were buffered 30 cm from each side to avoid the border effect and soil pixels. Using the "raster" package in R, the value of NDVI per plot was calculated by averaging all the pixels enclosed by each polygon and discarding possible outliers and soil pixels.
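A minimal sketch of the per-plot NDVI computation described above; the soil-pixel threshold is an assumed placeholder, since the exact outlier and soil-discarding rules are not given in the text.

```python
import numpy as np

def ndvi(red, nir):
    """NDVI = (NIR - RED) / (NIR + RED), computed per pixel."""
    red, nir = np.asarray(red, float), np.asarray(nir, float)
    return (nir - red) / (nir + red)

def plot_mean_ndvi(ndvi_img, plot_mask, soil_thresh=0.2):
    """Mean NDVI over the pixels of one (buffered) plot polygon,
    discarding likely soil pixels below `soil_thresh` (illustrative
    threshold, not the paper's exact rule)."""
    vals = ndvi_img[plot_mask]
    vals = vals[vals > soil_thresh]
    return float(vals.mean())

# 2x2 toy image: reflectance 0.1 (RED) and 0.3 (NIR) gives NDVI = 0.5.
nd = ndvi([[0.1, 0.1], [0.1, 0.1]], [[0.3, 0.3], [0.3, 0.3]])
print(plot_mean_ndvi(nd, np.ones((2, 2), dtype=bool)))  # 0.5
```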

Statistical and machine learning models with cross-validation
The statistical GBLUP model used in this study assumed that each response variable is modeled as

$$Y_{ij} = \mu + \mathrm{year}_i + g_j + (\mathrm{year} \times g)_{ij} + \sum_{l=1}^{4} \beta_{vg,l}\, x_{vg,jl} + \sum_{l=1}^{2} \beta_{gf,l}\, x_{gf,jl} + \epsilon_{ij}, \quad (1)$$

where Y_ij is the value of the response variable of the jth line measured in the ith year, μ is the general mean, year_i is the fixed effect of the ith year (i = 1, ..., I), g_j (j = 1, ..., J) is the random effect of lines, (year × g)_ij is the random year × line interaction effect, β_vg,l (l = 1, ..., 4) and β_gf,l (l = 1, 2) are the coefficients of the NDVI covariable (averages or all date measurements) measured for the jth line at the vegetative (x_vg,jl) and grain filling (x_gf,jl) periods, respectively, and ϵ_ij are independent random errors assumed to be normal with mean 0 and variance σ². Furthermore, it is assumed that the random effect of lines g = (g_1, ..., g_J)^T is distributed N_J(0_J, σ²_g G), while the year × line interaction random effects Yg = ((year × g)_11, ..., (year × g)_1J, ..., (year × g)_I1, ..., (year × g)_IJ)^T follow N_IJ(0_IJ, σ²_Yg (I_I ⊗ G)), with 0_IJ a null vector of size IJ, I_I the identity matrix of dimension I × I, ⊗ the Kronecker product, and G the genomic relationship matrix (VanRaden 2008).
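For reference, the genomic relationship matrix G entering the model can be computed following VanRaden (2008). This sketch assumes the usual 0/1/2 allele-count coding; the study's markers are coded 0/1, for which the same centering-and-scaling idea applies with p estimated from the column means.

```python
import numpy as np

def vanraden_G(M):
    """Genomic relationship matrix G = W W' / (2 * sum p_k (1 - p_k))
    (VanRaden 2008), where W centers the (lines x markers) genotype
    matrix M (0/1/2 coding assumed) by twice the allele frequency."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0                 # estimated allele frequencies
    W = M - 2.0 * p                          # centered genotypes
    denom = 2.0 * np.sum(p * (1.0 - p))
    return W @ W.T / denom

# Toy example with 3 lines and 3 markers:
M = np.array([[0., 1., 2.],
              [2., 1., 0.],
              [1., 1., 1.]])
G = vanraden_G(M)
print(G.shape)  # (3, 3)
```

G is symmetric by construction, and a line whose centered genotypes are all zero (the third line here) has zero relationship entries.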
A Bayesian estimation of this model (GBLUP) was implemented with the BGLR R package using the default priors for the parameters (Pérez-Rodríguez and de los Campos 2014), where a flat prior was used for the regression coefficients of both covariables (x_vg,jl and x_gf,jl). This model was used to evaluate GP accuracy under the 5FCV: the available data set was partitioned into five balanced subsets, 4/5 of which were used to train the model, while the remaining 1/5 (testing) was used to evaluate its performance. This process was repeated until each 1/5 of the partition had been left out of the training process, and the average of the metric values obtained across all five partitions was reported as the global performance of the model being evaluated. This model was also used to evaluate the prediction performance of 1 entire year using all the lines from the other years (LOO); in that case, however, the year_i and year × g_ij terms were removed.

Deep learning under the 5-fold cross-validation (5FCV)
The same information described previously was used under the DL model, where the genomic data and the NDVI covariables measured in the 2 periods (VG and GF) were used as inputs. The DL model used for assessing GP accuracy under the 5FCV is a multimodal DL (MMDL; Ramachandram and Taylor 2017) with 3 modalities (types of information used; see Fig. 1) under a multilayer perceptron deep neural network (MP), given by

$$\hat{Y}_{ij} = f(x_{ij1}, x_{ij2}, x_{ij3}; W), \quad (2)$$

where x^T_ij1, x^T_ij2, and x^T_ij3 are the 3 modality (type of input) vectors and are, respectively, the ij-th rows of the following matrices: X_year, the design matrix of years; Z*_L = Z_L L^T, with Z_L the design matrix of lines and L the upper triangular factor of the Cholesky decomposition of G (G = L^T L); and X_NDVIs = [X_vg, X_gf], where X_vg and X_gf are the design matrices of the NDVI measurements in the vegetative and grain filling periods, respectively. For the qth modality, an individual neural network is created with n^(q)_HL hidden layers (l = 1, ..., n_HL); its inputs are all the predictors of that modality, activation functions are applied in all its hidden layers, and W is the vector of all weights in the model. After each dense layer and before activation, a batch normalization layer was applied to approximately standardize (mean close to 0 and standard deviation close to 1) the corresponding outputs (see more details in Fig. 1).
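A toy forward pass illustrating the multimodal idea, assuming simplified dimensions and random weights (no batch normalization or training): each modality passes through its own MLP branch, and the branch outputs are concatenated and fed to a linear output layer, as in model (2).

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def branch(x, n_units):
    """One modality-specific MLP branch: dense layers with relu,
    with the per-layer widths given in `n_units` (bias omitted)."""
    h = x
    for n in n_units:
        W = rng.normal(scale=0.1, size=(h.shape[1], n))
        h = relu(h @ W)
    return h

# Toy inputs standing in for the 3 modalities: year design matrix,
# Z_L L^T, and NDVIs. Dimensions are illustrative, not the paper's.
n = 8
x_year = rng.normal(size=(n, 2))
x_geno = rng.normal(size=(n, 16))
x_ndvi = rng.normal(size=(n, 6))

h = np.concatenate([branch(x_year, [4, 2]),
                    branch(x_geno, [8, 4]),
                    branch(x_ndvi, [4, 2])], axis=1)
W_out = rng.normal(scale=0.1, size=(h.shape[1], 1))
y_hat = h @ W_out                 # linear output for a quantitative trait
print(y_hat.shape)                # (8, 1)
```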

Deep learning under leave-one-year out (LOO)
To evaluate DL models for predicting the performance of 1 entire year using all the lines from the other year, the same model (2) was used; however, in this version only 2 modalities were used: Z*_L and X_NDVIs. All DL models applied are versions of model (2) with a stacked residual network (ResNet) composed of 2 sequential layers (He et al. 2016) and implemented with the TensorFlow library in Python. The model was fitted using a batch size of 32, 128 epochs, and the Adam optimizer (a popular stochastic gradient descent method for minimizing the penalized loss function in a DL), using the callback options of the Keras fit function. We specified an adaptive exponential-decay learning-rate scheduler.
For the training process, we set the validation split argument equal to 0.10, and again by means of the callback option, an early stopping rule was set with monitor = "loss," mode = "min," and patience = Pat: if the loss function on the training data ceases to decrease at the end of an epoch and fails to improve for an additional Pat epochs, training is stopped. Likewise, to prevent overfitting, dropout and L2 regularization were added at each layer involved, except for the output layer, where only the latter was applied. Learning decay (wd), patience (Pat), dropout rate (DO), and the regularization parameter (λ) were also treated as hyperparameters to be tuned. L2 regularization penalizes the loss function (e.g. the sum of squared errors) with the sum of squared weights via the regularization parameter (λ), which controls the degree to which the weights are shrunk toward zero, reducing the complexity of the model and avoiding fitting the training data too closely. Dropout consists of setting a random fraction of the units to 0 at each step during training.
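The early-stopping rule described above (monitor = "loss", mode = "min", patience = Pat) can be mimicked with a few lines of plain Python; this is a behavioral sketch, not the Keras callback itself.

```python
def early_stop(losses, patience):
    """Return the epoch (index) at which training would stop under a
    min-mode early-stopping rule: stop once the monitored loss has not
    improved for `patience` consecutive epochs; otherwise run to the
    last epoch."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(losses) - 1

# Loss stops improving after epoch 2; with patience=2, training halts at epoch 4.
print(early_stop([1.0, 0.8, 0.7, 0.7, 0.75, 0.72], patience=2))  # 4
```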
In all, for each modality (type of input) under an MP, the number of units from the second hidden layer onward was equal to half the units of the preceding layer; that is, if N^(q)_1 is the number of units of the first hidden layer in neural network q, then the units of layer l were taken as N^(q)_l = ⌊N^(q)_{l−1}/2⌋, l = 2, ..., n_HL, where ⌊x⌋ denotes the largest integer less than or equal to x. Additionally, the rectified linear unit (relu) activation function was used in all hidden layers, except the output layer, where a linear activation was applied because the traits analyzed were quantitative with an approximately normal distribution.
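The halving rule for hidden-layer widths can be written out explicitly; `layer_widths` is an illustrative helper name.

```python
import math

def layer_widths(n_first, n_layers):
    """Units per hidden layer when each layer after the first has
    floor(half) the units of the preceding one, as described above."""
    widths = [n_first]
    for _ in range(n_layers - 1):
        widths.append(math.floor(widths[-1] / 2))
    return widths

print(layer_widths(100, 4))  # [100, 50, 25, 12]
```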
The hyperparameters of the MMDL model were tuned using the bayes_opt library with 150 iterations, where the objective was to find the "optimal" hyperparameter values that minimize the validation mean squared error. The complete list of hyperparameters and the domain space over which the "optimal" values were searched is shown in Table 1. Further references in the manuscript use MMDL for the specific DL only where 5FCV was used in DS1. The models were run on 1 computer node with 16 GB of RAM and 8 cores, using Python version 3.8.0 and TensorFlow 2.7.0, requiring an average of 8 hours to train a model with the characteristics described in the paper.

Gradient boosting machine (GBM) with LOO cross-validation
The GBM is a machine learning algorithm originally proposed for improving weak classifiers that can be used with many statistical learning methods, including regression (Hastie et al. 2009). It is a modification of the original boosting algorithm introduced by Friedman (2001), who proposed fitting a base learner on a subsample of the training set drawn at random without replacement at each iteration of the algorithm.
We implemented the GBM algorithm proposed by Friedman (2001), where (y_i, x_i), i = 1, 2, ..., n, are the data (output and input), M is the number of iterations, φ(y, f) is the chosen loss function, and h(x, θ) is the chosen base-learner model.

Algorithm: initialize f_0 with a constant, and for t = 1 to M repeat Steps 1-4:
Step 1: compute the negative gradient $g_t(x_i) = -\left[\partial \varphi(y_i, f(x_i)) / \partial f(x_i)\right]$ evaluated at $f = f_{t-1}$.
Step 2: fit a new base-learner function h(x, θ_t) for predicting g_t(x_i) from the covariables x_i.
Step 3: find the best gradient-descent step size $\rho_t = \arg\min_{\rho} \sum_{i=1}^{n} \varphi\big(y_i, f_{t-1}(x_i) + \rho\, h(x_i, \theta_t)\big)$.
Step 4: update the function estimate: $f_t = f_{t-1} + \rho_t\, h(x, \theta_t)$.

The GBM method was implemented using a Gaussian loss function (based on the normal distribution) and decision trees as the base-learner model h(x, θ_t), with interaction depth equal to 1, number of trees equal to 5,000, shrinkage value 0.001, minimum number of observations in the terminal nodes of the tree equal to 10, and 0.5 as the fraction of training observations randomly selected to propose the next tree in the expansion. This was done using the gbm R package (Greenwell et al. 2022). As in the GBLUP and DL models, the inputs for the 5FCV evaluation contained the environment information, the genotype information (markers), and the NDVI measurements. The LOO evaluation was similar.
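A compact, self-contained sketch of Friedman's stochastic gradient boosting with a Gaussian (squared-error) loss and depth-1 trees, mirroring Steps 1-4 above on a toy problem. This is not the gbm R package, and the hyperparameters differ from those used in the study; for squared error the terminal-node means already minimize the loss, so the shrinkage factor plays the role of a fixed, small step size.

```python
import numpy as np

def fit_stump(x, g):
    """Depth-1 regression tree (stump) on a single covariable:
    pick the split minimizing squared error to the targets g."""
    best = None
    for s in np.unique(x):
        left, right = g[x <= s], g[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return lambda xq: np.where(xq <= s, lv, rv)

def gbm(x, y, M=200, shrinkage=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with Gaussian loss: the negative
    gradient is the residual y - f(x) (Step 1), so each round fits a
    stump to residuals on a random subsample (Step 2) and adds a
    shrunken update (Steps 3-4)."""
    rng = np.random.default_rng(seed)
    f0 = float(y.mean())
    pred = np.full_like(y, f0, dtype=float)
    learners = []
    n = len(y)
    for _ in range(M):
        idx = rng.choice(n, size=max(2, int(subsample * n)), replace=False)
        resid = y[idx] - pred[idx]          # negative gradient (Step 1)
        h = fit_stump(x[idx], resid)        # base learner (Step 2)
        pred += shrinkage * h(x)            # shrunken update (Steps 3-4)
        learners.append(h)
    return lambda xq: f0 + shrinkage * sum(h(xq) for h in learners)

# Toy problem: a step function, which stumps recover easily.
x = np.linspace(0, 1, 40)
y = np.where(x > 0.5, 2.0, 0.0)
model = gbm(x, y)
```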

Support vector regression (SVR) with LOO cross-validation
Support vector regression is an extension of the support vector machine developed for classification and was proposed by Drucker et al. (1996) as a scheme to predict continuous values. In the linear case, SVR seeks beta coefficients that minimize the regularized sum of deviation errors not larger than some positive constant ϵ,

$$\min_{\beta} \; \frac{1}{2}\lVert \beta \rVert^2 + C \sum_{i=1}^{n} L_\epsilon\big(y_i - f(x_i)\big),$$

where L_ϵ(e) = 0 if |e| < ϵ and L_ϵ(e) = |e| − ϵ otherwise, and C (cost of constraint violations) is the inverse of the regularization parameter. Equivalently, SVR solves the constrained optimization problem of minimizing ½‖β‖² subject to |y_i − f(x_i)| ≤ ϵ for all i, in case this is feasible. When this problem does not have a solution (is infeasible), slack variables are usually introduced to allow for some errors and obtain a less rigid model, allowing an ϵ-precision approximation for the remaining observations.
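The ϵ-insensitive loss L_ϵ defined above is straightforward to compute: errors inside the ϵ tube cost nothing, and errors outside it cost their excess over ϵ.

```python
import numpy as np

def eps_insensitive(e, eps=0.1):
    """epsilon-insensitive loss: zero inside the epsilon tube,
    |e| - epsilon outside it (the L_eps used by SVR)."""
    e = np.abs(np.asarray(e, dtype=float))
    return np.maximum(e - eps, 0.0)

print(eps_insensitive([0.05, -0.05, 0.3, -0.4], eps=0.1))  # [0.  0.  0.2 0.3]
```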

Results
As already mentioned, DS1 was fitted with the GBLUP, DL, GBM, and SVR models. Genomic prediction ability was assessed using 5FCV and LOO with genomic and image information; the results are summarized in Tables 1-10.

DS1
Several models were fitted for the 2 traits (YLD and TGW) with different predictors and training schemes (5FCV and LOO), including genomic (G), year, G × year, and the different types of NDVIs (NDVI averaged across dates or NDVI aligned by dates), using GBLUP with BGLR (R software) and DL (Python software). We considered models with only the genomic component (G, no NDVI included) and models that included NDVI covariables measured at the vegetative stage (NDVI_vg), at the grain filling stage (NDVI_gf), or at both stages (NDVI_vg_gf). Each of these cases was tested under 2 different schemes: NDVI with dates aligned and NDVI averaged across all dates.

(Figure caption: NRMSEP for the prediction of a complete year using information from the other year (leave-one-year-out, LOO) for traits TGW and YLD, fitted with GBLUP with BGLR (R software) and with DL (Python software), with the predictors G (red, no use of NDVI), G + NDVIs_gf (green), G + NDVIs_vg (blue), and G + NDVIs_vg_gf (purple), for the NDVI covariate in either of its 2 types, averages and dates aligned.)

The different genomic prediction accuracies for the 5FCV obtained from fitted sub-models of models 1 and 2 are shown in Tables 2 and 3. In the last 2 blocks of Table 3 (GBM-TGW and GBM-YLD) and in Table 4, prediction accuracies are shown under the same 5FCV strategy for 2 additional machine learning models, gradient boosting machine (GBM) and support vector regression (SVR). The first was implemented using the R package gbm (Greenwell et al. 2022) with 5,000 trees and default values for interaction depth and learning rate (1 and 0.001); the second was implemented with the R package e1071 (Meyer et al. 2022) using a linear kernel and default values for the remaining parameters (ϵ = 0.1 in the insensitive loss function, tolerance 0.01, and cost of constraint violations equal to 1). Trait TGW had higher GP accuracies than trait YLD, and the mean correlation between observed and predicted values (Cor Mean) was consistently higher (and NRMSEP consistently smaller) for NDVI average than for NDVI with dates aligned (Table 2). Moreover, neither the GE model nor the inclusion of the various NDVIs (NDVI_vg, NDVI_gf, and NDVI_vg_gf) improved GP accuracy, except for NDVI_average. Similar patterns were found for DL (first 2 blocks in Table 3, TGW-DL and YLD-DL); however, GP accuracy was slightly lower than that achieved by the models fitted with GBLUP.
Note that the term year × G was not included in DL, as previous results (O.A. Montesinos-López et al. 2019) indicated that including GE reduced the GP accuracy of DL. Furthermore, the GBM results for trait TGW (Table 3) showed lower accuracy under both metrics than the DL model, while for trait YLD, DL showed slightly lower accuracy in almost all cases, except with the predictors year + G + NDVI_gf and year + G + NDVI_vg_gf when the average NDVI was used. With respect to SVR (Table 4), the DL results under the Cor Mean metric were slightly better than SVR, except for trait TGW with models that included aligned NDVIs among their predictors; in those cases, the differences were moderately in favor of DL. Similar results were found with the NRMSEP metric, but for trait YLD the difference was more marked and favored DL, except for the predictor year + G + NDVI_vg, where the difference was almost indistinguishable.
The GP accuracy for the LOO prediction including the various NDVI data, when models were fitted with the BGLR R package (model 1) and with the DL approach, is shown in Tables 5 and 6; the corresponding results for GBM and SVR are listed in Tables 7 and 8. Under sub-models of model 1, for both traits and all NDVI cases (dates aligned and average), predictions of cycle (year) 2016-2017 using cycle 2015-2016 as training were much higher than those obtained when predicting 2015-2016 using 2016-2017 as training (Table 5); this holds for both metrics, Cor Mean and NRMSEP. This pattern does not occur with DL (Table 6), whether the model includes NDVI dates aligned or averaged. With DL, GP accuracy is slightly lower than that found when fitting models with GBLUP for both traits.
Compared with the GBLUP models, the GBM model showed the opposite behavior: for both traits, slightly better accuracy was obtained when predicting 2015-2016 using 2016-2017 as training data than when predicting 2016-2017 with 2015-2016 as training data. However, in all cases, GBM yielded lower accuracy than the DL models and than GBLUP when predicting 2016-2017. Gradient boosting machine also provided lower accuracy than SVR, and SVR gave slightly better accuracy than DL when predicting trait TGW in year 2015-2016 but not when predicting the same trait in year 2016-2017, where DL was superior in accuracy. When predicting the second trait (YLD) in year 2015-2016, DL was superior in accuracy to SVR. The same was true when predicting the same trait in year 2016-2017 with the predictor excluding NDVI covariables (G), the predictor averaging both kinds of NDVI measurements (G + NDVI_vg_gf), and the predictors including aligned NDVIs at the vegetative stage (G + NDVI_vg) or at the grain filling stage (G + NDVI_gf). Figure 2 shows the NRMSEP results when fitting the models G (no NDVI used), G + NDVI_gf, G + NDVI_vg, and G + NDVI_gf_vg, for the cases of NDVI average and NDVI dates aligned, when predicting one year using the other as the training set for both traits TGW and YLD. Most of the results show that when NDVI was used to predict an entire year with the other year as training, the DL method had higher GP accuracy (lower NRMSEP) than that obtained from fitting the GBLUP model. Exceptions were predicting with genomic information only (G) under GBM and SVR, predicting TGW in year 2016-2017 with NDVI dates aligned, and predicting YLD with average NDVI data, where the better accuracy was obtained with GBLUP and SVR, respectively.
When predicting the trait YLD in year 2016-2017 with NDVI average, all fitted SVR models, followed by the GBLUP models, had better accuracy than DL. Another exception was found in the same context (predicting trait YLD in year 2016-2017) but with NDVI data aligned, where the GBLUP models outperformed all other models, followed by the DL model in two out of three predictors.
Similarly, Fig. 3 displays the results of TGW and YLD for the metric Cor Mean when fitting the models G (no NDVI used), G + NDVI_gf, G + NDVI_vg, and G + NDVI_gf_vg for the cases of NDVI average and NDVI data aligned, predicting one cycle using the other as training set for both traits. Predicting trait TGW in year 2015-2016 with G and with G + NDVI_vg (average) gave higher GP accuracy for DL than for GBLUP, GBM, and SVR, while for the same trait in the same year with G + NDVI_gf (average), G + NDVI_vg (dates aligned), and G + NDVI_vg_gf (average and dates aligned), SVR had higher GP accuracy than all other models (DL, GBLUP, and GBM), followed by DL. For G + NDVI_gf (dates aligned), DL and SVR had similar accuracy. When predicting trait TGW in year 2016-2017, GBLUP had the highest GP accuracy followed by DL, except in G + NDVI_vg_gf, where DL was superior. Here, the worst performance was obtained with GBM.
Continuing with the results displayed in Fig. 3, when predicting YLD in year 2015-2016, DL showed the highest accuracy with G, G + NDVI_gf (dates aligned), G + NDVI_vg (dates aligned), and G + NDVI_vg_gf (average), followed by SVR, while in the remaining cases SVR had the highest GP accuracy, followed by DL. When predicting YLD in year 2016-2017, GBLUP had the highest accuracy followed by DL, except with G + NDVI_vg_gf (dates aligned), where DL was superior. Here again, the worst performance was obtained with GBM.
Another set of results was obtained by fitting 8 different models that combine the predictors year + G and year + G + year × G with no NDVI and with the various types of NDVI (NDVI_vg, NDVI_gf, and NDVI_vg_gf), included either as averages or with dates aligned, for predicting TGW and YLD.

[Table: Results of 5-fold cross-validation (5FCV) for the 2 traits TGW (thousand grain weight) and YLD (grain yield) for 2 metrics, Normalized Root Mean Squared Error of Prediction [NRMSEP (SD)] and correlation [Cor (SD)] between observed and predicted values, for the fitted sub-models of multimodal DL 2 including year, genomic (G), and NDVI. Two types of NDVI were included, NDVI vegetative (vg) and NDVI grain filling (gf), one aligning the dates (dates aligned) and the other averaging the dates (average). SD is the standard deviation across folds.]

The GP accuracy results are shown for 5FCV for the metrics NRMSEP (Fig. 4) and Cor Mean (Fig. 5) for models fitted under GBLUP and DL for both traits, TGW and YLD.
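The 5FCV partitioning can be sketched as follows; `k_folds` is an illustrative helper with a made-up number of lines, not the authors' code.

```python
import random

def k_folds(n_lines, k=5, seed=1):
    """Split line indices 0..n_lines-1 into k shuffled, near-equal folds.
    In each round, one fold is the validation set and the rest train."""
    idx = list(range(n_lines))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_folds(250, k=5)
for v, val in enumerate(folds):
    # fit the model on `train`, evaluate NRMSEP / Cor on `val`
    train = [i for f, fold in enumerate(folds) if f != v for i in fold]
```

Accuracies (and their SDs) are then averaged across the k validation folds, which is what the Cor (SD) and NRMSEP (SD) columns summarize.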
Results displayed in Fig. 4 show similar GP accuracy based on NRMSEP for the 8 models, with no increase in GP accuracy when adding image data to the genomic (G) prediction for TGW and YLD. Another result based on 5FCV is that the DL, GBM, and SVR methods slightly decreased in GP accuracy compared with the models fitted by GBLUP for the various G and NDVI combinations considered for both traits, most notably for GBM in trait TGW and for SVR in trait YLD. Furthermore, Fig. 5 shows similar GP accuracy based on Cor Mean for the 8 models, with a slight increase in GP accuracy when adding image data to the genomic prediction. The 5FCV results based on Cor Mean also showed that the DL, GBM, and SVR methods slightly decreased in GP accuracy compared with the models fitted by GBLUP for the various G and NDVI combinations considered; these results hold for both TGW and YLD.

[Tables 5-8: Results of predicting 1 entire year using the other (leave one environment out, LOO) for the 2 traits TGW (thousand grain weight) and YLD (grain yield) for 2 metrics, Normalized Root Mean Squared Error of Prediction (NRMSEP) and correlation (Cor Mean) between observed and predicted values, for models including genomic (G) and NDVI terms; fitted sub-models of model 1 (GBLUP, BGLR R package), of DL model 2, of GBM, and of SVR. Two types of NDVI were included, NDVI vegetative (vg) and NDVI grain filling (gf), one aligning the dates (dates aligned) and the other averaging the dates (average); SD is the standard deviation.]

Gradient boosting machine (GBM) was fitted using 5,000 trees, with interaction depth and learning rate at their default values (1 and 0.001) in the gbm R package.
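The boosting scheme behind the gbm fit (many shallow trees combined through a small learning rate) can be illustrated with a toy pure-Python version using depth-1 stumps on made-up 1-D data. The tree count and learning rate below are scaled down from the reported settings purely for illustration; this is not the gbm package's implementation.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature x."""
    best = None
    for s in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= s]
        right = [r for xi, r in zip(x, residuals) if xi > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm

def gbm_fit(x, y, n_trees=200, lr=0.05):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_trees):
        res = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, res)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]          # made-up 1-D "marker score"
y = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]  # made-up trait values
fitted = gbm_fit(x, y)
```

With interaction depth 1, as in the reported fit, each tree is exactly such a stump, and a very small learning rate (0.001) is what makes thousands of trees necessary.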
It is interesting to observe how the mean values of the optimal hyperparameters found for each DL model, for the 7 models under 5FCV and for both traits TGW and YLD, changed across models (Table 9). The average values of each hyperparameter were very similar for all models, except in a few cases. For example, for trait TGW, the average weight decay of the model with predictor year + G is approximately 81 to 1,411 times the value of this hyperparameter in the rest of the 6 models (rows 3-8, Table 9), while the model with predictor year + G + NDVIs_gf with dates aligned has an average value of the patience hyperparameter (Pat) at least 10 units larger than the rest of the models (rows 2 and 4-8, Table 9). For trait YLD, a similar but slightly more ambiguous pattern was observed in the hyperparameters shared by all models (rows 9-15, Table 9). For example, the average neuron values of the models with predictors year + G + NDVIs_gf and year + G + NDVIs_vg_gf, both with dates aligned, differed from those of the rest of the models by at least 226 units. We also observed a notable difference across models in the average number of hidden layers for input 2 (G), ranging from 1.2 to 4.2; the largest average was obtained in the first model (row 9), which is near the upper bound of the specified domain space for this hyperparameter (see Table 1). In this same model (row 9), the average weight decay differed greatly from that of the other models, being between 31 and 520 times greater.

Support vector regression (SVR) was fitted using a linear kernel and default values for the rest of the parameters in the e1071 R package (ε = 0.1 in the insensitive loss function, tolerance 0.01, and cost of constraint violations equal to 1).

[Table 9: Average across partitions of the optimal hyperparameter values found for each multimodal DL model under the 5-fold cross-validation (5FCV) evaluation strategy; an empty cell means not applicable.]
Similarly, for the LOO genomic prediction (one year predicting all the lines in another year), the optimal hyperparameter values found are shown in Table 10 for each trait, each model, and each predicted year; greater variation can be observed here than in the 5FCV case.

Results DS2
Figures 6 (NRMSEP) and 7 (Cor) show the results, where each row panel corresponds to a trait and each column panel to the predicted environment. When predicting the irrigated environment with the drought environment, DL had higher accuracy (lower NRMSEP) than the GBLUP model in all analyzed traits and all years. The largest differences were obtained in traits YLD and BM (Fig. 6). When predicting the drought environment with information on the irrigated environment, the DL and GBLUP models had similar accuracy, except in the first year (2015-2016), where a larger difference was obtained.
A more ambiguous pattern is found for the Pearson's correlation metric (Cor, Fig. 7). For trait BM, when predicting the drought environment with the irrigated environment, DL had higher accuracy in 2 of the 3 years, while when predicting the irrigated environment, the GBLUP model had higher accuracy in 2 of the 3 years; in one of these years, however, the difference was marginal. For trait HI, the GBLUP model was more accurate than DL when predicting the drought environment. When predicting this same trait in the irrigated environment, DL performed better in 1 of the 2 years (2015-2016 and 2016-2017). For trait PMG, GBLUP performed better than DL in all cases, but with a marginal difference in both years when predicting the drought environment. For the last trait (YLD), the DL model was less accurate than the GBLUP model when predicting the drought environment, although in 2 of the 3 years the difference was small. When predicting this same trait in the irrigated environment, DL was better in 2 of the 3 years.

Bases of the novelty of the proposed multimodal DL
The novelty of the DL method developed in this study includes 3 modalities corresponding to the 3 types of input data: genomic, year, and image. The superiority of the DL methods in some scenarios can in part be attributed to the novel architecture used in the deep learning models implemented. The novelty of the architecture is due to three components.
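The three-modality, late-fusion structure can be sketched schematically. This is a structural toy with made-up layer sizes and random, untrained weights, not the authors' implementation; it only shows how per-modality subnetworks feed a shared fusion layer.

```python
import random

random.seed(0)

def dense_relu(in_dim, out_dim):
    """A dense layer with random weights and ReLU activation; stands in
    for a trained per-modality subnetwork (sizes are illustrative only)."""
    w = [[random.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]

# One small subnetwork per modality
year_net = dense_relu(3, 4)    # one-hot year effect -> 4 features
geno_net = dense_relu(10, 8)   # marker-derived genomic input -> 8 features
ndvi_net = dense_relu(6, 4)    # NDVI covariates -> 4 features

# Linear output layer over the concatenated 4 + 8 + 4 = 16 features
fusion_w = [random.uniform(-0.5, 0.5) for _ in range(16)]

def predict(x_year, z_geno, x_ndvi):
    concat = year_net(x_year) + geno_net(z_geno) + ndvi_net(x_ndvi)  # late fusion
    return sum(w * h for w, h in zip(fusion_w, concat))

yhat = predict([1, 0, 0], [0.1] * 10, [0.4] * 6)
```

The key design point is that each modality is encoded by its own subnetwork before fusion, rather than concatenating raw inputs at the first layer.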
The first component is the architecture of the multimodal deep learning models (MMDL; see Fig. 1) with 3 types of inputs corresponding to the effects of environments (X_year), the genomic information (Z*_L = Z_L Lᵀ), and the effect of NDVIs (X_NDVIs). For each modality (type of input), a separate DL model is trained in the initial phase; in a second stage, the outputs of all 3 DL models are concatenated, forming another multilayer perceptron (MP) model with few layers, from which the final predictions are obtained. The training process of this multimodal DL model is done simultaneously; however, it is more challenging, since a careful tuning process needs to be implemented to guarantee reasonable predictions. Nevertheless, multimodal learning helps researchers understand and analyze situations in which various sensors are engaged in the processing of information. For this reason, multimodal DL has been successfully applied to multiple types of modalities, i.e. image, video, text, audio, body gestures, facial expressions, and physiological signals. Because multimodal DL models can combine multiple heterogeneous sources of information, prediction accuracy is improved (Ramachandram and Taylor 2017).

The second novelty of the multimodal DL used in this study is that it belongs to the family of residual networks (ResNet) (He et al. 2016), which reduce (or avoid) the vanishing gradient problem, in which the gradient back-propagated to earlier layers becomes vanishingly small. As a result, as the network goes deeper, its performance saturates or even starts degrading rapidly, making it difficult to train; it is therefore not enough to simply stack layers together. ResNet architectures allow deep neural networks of up to hundreds or even thousands of layers to be trained successfully by introducing the so-called "identity shortcut connection" that skips one or more layers, thereby reducing or avoiding the vanishing gradient problem.
Under ResNet, stacking additional layers should not degrade network performance, since the "identity shortcut connection" can simply stack identity mappings (layers that contribute nothing; He et al. 2016). For this reason, ResNet architectures have powerful representational ability and have been employed in many regression and classification problems.
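The identity shortcut can be illustrated with a minimal residual block: with all weights at zero, the block reduces exactly to the identity map, which is why adding residual layers need not degrade performance. This is a toy sketch, not the paper's architecture.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def residual_block(x, W1, W2):
    """y = W2 · relu(W1 · x) + x  — the '+ x' is the identity shortcut."""
    return [h + xi for h, xi in zip(matvec(W2, relu(matvec(W1, x))), x)]

x = [0.5, -1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]
out = residual_block(x, zeros, zeros)  # zero weights: the block is the identity
```

During backpropagation, the shortcut also gives the gradient a direct path around the weighted layers, which is what mitigates vanishing gradients in very deep stacks.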
The third novelty of this research refers to the tuning of the hyperparameters under Bayesian optimization (library bayes_opt), which does not perform an exhaustive search of the space like grid search and random search, which evaluate all (or some) hyperparameter configurations resulting from the Cartesian product of the discretized search space of each hyperparameter. Bayesian optimization (Mockus 2012) is more efficient since it builds a probabilistic model of the given function and analyzes this model to decide where next to evaluate the function. The two main components of Bayesian optimization are (1) a prior function that captures the main patterns of the unknown objective function, together with a probabilistic model that describes the data-generation mechanism, and (2) an acquisition function (loss function) that describes how optimal a sequence of queries is.
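The surrogate-plus-acquisition loop can be illustrated with a deliberately simplified toy. Real Bayesian optimization (as in bayes_opt) uses a Gaussian-process surrogate and acquisition functions such as expected improvement; this sketch substitutes a nearest-neighbor surrogate and an upper-confidence-bound rule, and the objective function is made up.

```python
def bayes_opt_sketch(f, lo, hi, n_init=3, n_iter=12, kappa=1.0):
    """Toy sequential model-based optimization on [lo, hi].

    Surrogate: the prediction at x is the value of the nearest evaluated
    point, and the uncertainty is the distance to that point.
    Acquisition: upper confidence bound, mean + kappa * uncertainty.
    """
    xs = [lo + (hi - lo) * i / (n_init - 1) for i in range(n_init)]
    ys = [f(x) for x in xs]
    grid = [lo + (hi - lo) * i / 200 for i in range(201)]
    for _ in range(n_iter):
        def ucb(x):
            d, y = min((abs(x - xi), yi) for xi, yi in zip(xs, ys))
            return y + kappa * d
        x_next = max(grid, key=ucb)   # evaluate where the model is promising
        xs.append(x_next)
        ys.append(f(x_next))
    best = max(zip(ys, xs))
    return best[1], best[0]

# Maximize a made-up "validation accuracy" peaking at x = 1.7 on [0, 4]
x_best, y_best = bayes_opt_sketch(lambda x: -(x - 1.7) ** 2, 0.0, 4.0)
```

The essential behavior carries over: each new query balances exploiting regions where the surrogate predicts good values against exploring regions far from any evaluated configuration, so far fewer evaluations are needed than in a grid search.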
Our results show that even with small- to medium-sized data sets, the deep learning methodology is very competitive; nevertheless, more sophisticated architectures could take better advantage of the data. Validation on additional data is needed to better understand the behavior of the multimodal DL model, including larger data sets with denser NDVI data and combinations of types of neural networks distinct from the MP used here in each modality. Other late-fusion strategies (Ramachandram and Taylor 2017; Baltrušaitis et al. 2018) also need to be explored, such as weighting the individual outputs based on their uncertainty, as proposed by Wang et al. (2021), or including more layers after the individual outputs of each submodel are combined (Huang et al. 2020), with the potential advantage of learning interactions between features of different modalities (Stahlschmidt et al. 2022).

Comparing multimodal DL with other results
Interestingly, Danilevicz et al. (2021) employed multimodal DL using 2 tabular inputs (modules): 1 module has information on the parents, the seed stock, fertilizer, and planting date (using a sigmoid layer), and the other is a spectral module (employing a linear layer). The outputs from these 2 modules were concatenated and combined in a fusion module that produces the multimodal yield prediction. The multimodal DL proposed by Danilevicz et al. (2021) functions as a support tool for early identification of good maize lines. However, the authors noted that since the model was trained on data from a single location (and thus no GE was incorporated into the multimodal DL), further training with new data from other environments would be required to improve the prediction accuracy. A recent study by Kick et al. (2023) integrated genomics, environments, and management into a multimodal DL and examined the effect of various types of interactions. Results showed that although the DL and GBLUP provided similar prediction accuracy, the GBLUP had the lowest average error, whereas the multimodal DL performed more consistently, with similar average errors.
Our study shows that the prediction of cultivars in new environments and future years is one of the key issues in genomic-enabled prediction. The prediction accuracies for new years from the DL proposed in this study are similar to or higher than those obtained from GBLUP when using genomic and high-throughput phenotype (image) data for the traits grain yield and thousand grain weight from DS1 and for grain yield under irrigation from DS2. Support vector regression also performs very well when predicting grain yield in future years.

Conclusions
In this study, we presented outcomes from several genomic prediction models (GBLUP, GBM, and SVR) and compared them with a novel DL method using a wheat data set comprised of genomic and high-throughput phenotype (image) data for 2 traits measured in 2 years on multiple wheat lines (DS1). Results show that the novel DL approach was accurate for predicting 1 year based on the other (LOO) and outperformed the other models for both the YLD and TGW traits, for NDVI average and NDVI dates aligned, and on both metrics used to assess GP accuracy (NRMSEP and Cor Mean). However, GP accuracy results obtained for the 5FCV indicated that the GBLUP model in DS1 was slightly superior to DL when genomic data and the various types of NDVI measurements were included. The genomic prediction of 4 traits (BM, HI, PMG, and YLD) for drought and irrigated environments in 3 years (DS2) showed slight increases in GP accuracy of DL over GBLUP for traits HI and YLD under irrigated conditions in some years, but not for the 3 years under drought conditions. Results from this study show that the multimodal DL method has a more robust degree of generalization than other, very data-specific DL models previously reported. It is also important to stress that the BGLR R package used to fit the GBLUP model is very robust parametric statistical software for GP. The DL method used in this study is novel and presents a good degree of generalization and notable accuracy for predicting new years. One reason is that, instead of concatenating all feature types and feeding them to a single network, each type of information feeds its own neural network and the individual outputs are then combined to produce the final output value.

Data availability
The phenotypic and genomic data corresponding to the two data sets used in this study (DS1 and DS2) can be downloaded from the following link https://hdl.handle.net/11529/10548885.