A zero altered Poisson random forest model for genomic-enabled prediction

Abstract In genomic selection, choosing the statistical machine learning model is of paramount importance. In this paper, we present an application of a zero altered Poisson random forest model in two versions (ZAP_RF and ZAPC_RF) to deal with excess zeros in count response variables. The proposed models were compared with the conventional random forest (RF) model and with the conventional generalized Poisson Ridge regression (GPR) using two real datasets, and we found that, in terms of prediction performance, the proposed zero altered Poisson random forest models outperformed the conventional RF and GPR models.


Introduction
Novel methodologies like genomic selection (GS) proposed by Bernardo (1994) and Meuwissen et al. (2001) are gaining popularity in plant breeding because they are revolutionizing the plant breeding paradigm. The basic idea of GS is to perform the process of selection of candidate individuals by only genotyping and phenotyping a reference population and with this information train a statistical model that is then used for predicting genomic breeding values or phenotypic values of a testing (breeding) population that only contains genotypic information. The acceptance and popularity of GS continues to increase since empirical evidence shows that there are no significant differences between the performance of GS and that of phenotypic selection (Roorkiwal et al. 2016; Crossa et al. 2017; Wolfe et al. 2017; Huang et al. 2019). Some of the advantages of GS are: (1) it shortens the generation interval, (2) it requires fewer resources, and (3) it reduces the cost per cycle (Farah et al. 2016; Crossa et al. 2017).
When using GS as a predictive methodology, we need to choose the right model in each circumstance to guarantee optimal performance. For this reason, nowadays many statistical machine learning models are used in GS, since there is no universal model that works for all the data at hand (Wolpert and Macready 1997). The development of prediction models for GS is an active area of research that aims to improve the prediction performance of the existing statistical machine learning algorithms in the context of a large number of independent variables (p) and a small sample size (n), of correlated traits, and of input information from different sources, among others. When the traits are counts, like the number of panicles per plant, number of seeds per plant, number of infected spikelets per plant, days to heading, days to maturity, and days to germination, among others (Montesinos-López et al. 2016, 2020a, 2020b), there are regression models like generalized Poisson regression (Stroup 2012), Bayesian generalized Poisson regression (Montesinos-López et al. 2015, 2016) and even deep neural network models (Montesinos-López et al. 2020a, 2020b). However, all these models use as a loss function the negative of the log likelihood of a Poisson distribution. The Poisson distribution is very popular for count data (values 0, 1, 2, … with an unrestricted upper limit), but has two main disadvantages: (1) it is an intrinsic property of a Poisson distribution that the variance is equal to the mean, and for this reason it is often unable to capture over-dispersion efficiently, and (2) it cannot efficiently model excess zeros in the response variable.
Another problem of the Poisson family of regression models is that they are parametric models that often are not efficient for capturing nonlinear patterns. For this reason, many machine learning algorithms have been successfully implemented in GS (Sarkar et al. 2015; Stephan et al. 2015; Naderi et al. 2016; Waldmann 2016; Li et al. 2018) to capture nonlinear effects. One of the most popular machine-learning methods is Random Forests (RF; Breiman 2001), which is a tree-based ensemble method for continuous (regression), binary and categorical classification using multiple variables as input (Chen and Ishwaran 2012; Alarcon et al. 2015; Li et al. 2016). RF has been applied in genome-wide association studies to identify single nucleotide polymorphisms (SNP) associated with phenotypes, and to map QTL on the genome (Brieuc et al. 2015; Everson et al. 2015; Petralia et al. 2015; Stephan et al. 2015; Li et al. 2018). In addition, RF has been used for cancer identification and treatment, for epistasis detection (Pashaei et al. 2015; Shi and He 2016), for prediction of protein DNA-binding sites from amino acid sequences (Wu et al. 2009) and protein-protein interaction sites in sequences (Sikic et al. 2009), and for gene network pathway analysis (Pang et al. 2006; Wang et al. 2010; Chen and Ishwaran 2012).
There is evidence that RF performs better than other methods for binary traits when the sample size is large and the percentage of missing data is low (García-Magariños et al. 2009). However, Naderi et al. (2016) found that, for binary traits, RF outperformed the GBLUP method only in a scenario combining the highest heritability, the most extensive number of markers (50K SNP chip), and the largest number of QTL. González-Recio and Forni (2011) found that RF performed better than Bayesian regressions in detecting resistant and susceptible animals based on genetic markers. They also reported that RF produced the most consistent results with very good predictive ability and outperformed other methods in terms of correct classification.
The popular RF models were originally developed for continuous, binary and categorical data. The RF for continuous response variables uses the sum of squared errors (least squares) as the splitting criterion, while the random forest for binary and categorical data uses the Gini index or the log-likelihood based on a Bernoulli distribution. There are also RF models for count data (Chaudhuri et al. 1995; Loh 2002) that can be implemented in R using the package rpart (Therneau and Atkinson 2019). However, these RF models for count data are not appropriate for counts with excess zeros. For this reason, Lee and Jin (2006) proposed an RF method for counts with an excess of zeros, building the splitting criterion from the zero-inflated Poisson distribution; however, it models both the excess-zero part and the Poisson part jointly, unlike the basic hurdle and zero-inflated regression models that use two models, thus allowing different covariate effects for each part. A common model is based on the assumption that the excess of zeros is generated by an independent random variable. For this reason, conventional regression models for counts with an excess of zeros use a logistic model for predicting excess zeros and a truncated Poisson model for counts larger than zero.
For this reason, in this paper, we present an application of the zero-truncated Poisson random forest with excess zeros proposed by Mathlouthi et al. (2019); its building process is similar to that of the zero altered (or inflated) Poisson regression, since two models are used in the building process: one to model the excess zeros (zero part) and the other to model counts larger than zero (Poisson part). The proposed method is semi-parametric, since it makes only a few assumptions about a specific parametric form. The zero part was modeled using a conventional binary random forest model, while the truncated Poisson part was modeled using an RF with a new splitting criterion based on the zero-truncated Poisson distribution.

Univariate ridge regression model
Under this model, the relationship between the continuous response variable (y_i) and the input information x_i = (x_i1, …, x_ip)' is y_i = β_0 + Σ_{j=1}^p x_ij β_j + e_i, where e_i is assumed to be normally distributed with mean zero and variance σ². The estimates of the βs under univariate ridge regression (RR) are obtained by minimizing the following penalized residual sum of squares (loss function): L(β) = Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p β_j², where λ is the tuning hyper-parameter that can be chosen by cross-validation. The optimization of this loss function was done using the R package glmnet (Lasso and Elastic-Net Regularized Generalized Linear Models) (Friedman et al. 2010).
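The shrinkage induced by the ridge penalty can be illustrated with a minimal sketch. For a single centred predictor, the penalized least-squares problem above has the closed-form solution β̂ = Σx_i y_i / (Σx_i² + λ). The snippet below is Python and purely illustrative (the paper used the R package glmnet, which also handles the p ≫ n case); the function name is hypothetical:

```python
def ridge_coef(x, y, lam):
    """Closed-form ridge estimate for a single centred predictor:
    beta_hat = sum(x_i * y_i) / (sum(x_i^2) + lambda)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)
```

With λ = 0 the ordinary least-squares slope is recovered; increasing λ shrinks the coefficient toward zero, which is exactly the behavior the penalty term of the loss function induces.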

Univariate generalized Poisson regression model
Since we are in a context where the number of independent variables (p) is larger than the number of observations (n), the penalized loss function for the univariate generalized Poisson regression (GPR) model is equal to: L(β) = −Σ_{i=1}^n [y_i(β_0 + Σ_{j=1}^p x_ij β_j) − exp(β_0 + Σ_{j=1}^p x_ij β_j)] + λ Σ_{j=1}^p β_j², that is, the negative log likelihood based on a Poisson distribution plus a penalty term, where the inverse link function is the exponential function (corresponding to a log link), and λ is a regularization parameter that can be computed using cross-validation. The type of penalization contained in the loss function is called Ridge penalization, since the sum of the squared beta coefficients is taken into account in the penalization term. The loss function was optimized with the R package glmnet, and the λ hyper-parameter was estimated with 10-fold cross-validation for both Ridge regression models (RR and GPR). More details about this model can be found in Montesinos-López et al. (2020b).
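A minimal sketch can clarify the roles of the log link and the ridge penalty in this loss. The Python snippet below is illustrative only (the paper optimized this loss with glmnet in R) and, for clarity, minimizes the ridge-penalized negative Poisson log-likelihood for a single coefficient by plain gradient descent:

```python
import math

def poisson_ridge(x, y, lam, lr=0.01, iters=5000):
    """Minimise sum_i [exp(x_i*b) - y_i*x_i*b] + lam*b^2, the ridge-penalised
    negative Poisson log-likelihood under a log link (mu_i = exp(x_i*b)),
    by gradient descent; single coefficient b for clarity."""
    b = 0.0
    for _ in range(iters):
        grad = sum(xi * (math.exp(xi * b) - yi) for xi, yi in zip(x, y)) + 2 * lam * b
        b -= lr * grad
    return b
```

With λ = 0 the maximum-likelihood estimate is recovered (for an intercept-only design, exp(b̂) equals the sample mean); a positive λ shrinks the coefficient, mirroring the Ridge penalization described above.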

Random forests
Random forest (RF) is a modification of bootstrap aggregating that builds a large collection of trees and then averages their results. Each tree is built using the least-squares splitting criterion (loss function), the usual one when the response variable is continuous. From the training data (Breiman 2001), RF takes B bootstrap samples and randomly selects subsets of features as candidate predictors for splitting tree nodes. Each tree minimizes the average loss function in the bootstrapped data and is constructed using the following algorithm: Step 1. From the training dataset, draw a bootstrap sample of size N_train.
Step 2. With the bootstrapped data, grow a random-forest tree T b with the least-square splitting criterion, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size is reached. 1) Randomly draw mtry out of the m independent variables (IVs). mtry is a user-specified parameter. 2) Pick the best independent variable among the mtry IVs.
3) Split the node into two child nodes. The split ends when a stopping criterion is reached, for instance, when a node has less than a predetermined number of observations. No pruning is performed.
Step 3. Output the ensemble of trees {T_b}, b = 1, …, B. The predicted value for a testing-set individual with input x_i is calculated as ŷ_i = (1/B) Σ_{b=1}^B T_b(x_i). Readers are referred to Breiman (2001) and Waldmann (2016) for details on the theory of RF. Tree hyper-parameters, including the number of trees (ntree), the number of independent variables (features) sampled in each iteration (mtry), and the number of samples in the final nodes (nodesize), must be defined by the user. For dataset 1 we assessed the following combinations of values: ntree = (100, 300, 500), mtry = (30, 50, 100) and nodesize = (2, 5, 15), while for dataset 2, we used the same combinations of values for ntree and nodesize, but a different set of values for the number of features sampled, mtry = (150, 230, 320).
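The three steps above can be sketched in a few lines. The Python code below is a deliberately simplified illustration (the analyses in the paper were run in R): it performs bagging with random feature subsets but uses single-split trees (stumps) under the least-squares criterion in place of fully grown trees, with ntree and mtry playing the same roles as described above; the helper names are hypothetical:

```python
import random

def fit_stump(X, y, feat_ids):
    """Best single least-squares split over the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in feat_ids:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # degenerate bootstrap sample: predict the node mean
        m = sum(y) / len(y)
        best = (0.0, 0, float("inf"), m, m)
    return best

def fit_forest(X, y, ntree=50, mtry=1, seed=0):
    """Step 1-2: bootstrap the data and grow one (stump) tree per sample,
    each restricted to a random subset of mtry features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), mtry)           # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    """Step 3: the ensemble prediction is the average of the B tree outputs."""
    preds = [(lm if x[j] <= t else rm) for _, j, t, lm, rm in forest]
    return sum(preds) / len(preds)
```

Despite the severe simplification (one split per tree, no recursion, no pruning), the sketch preserves the essential mechanics: bootstrap resampling, per-node feature subsampling, and averaging over the ensemble.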

Zero altered Poisson random forest
The two versions of the zero altered Poisson random forest (ZAP_RF and ZAPC_RF), like zero altered Poisson (ZAP) regression models, assume that Y = 0 with probability θ (0 ≤ θ < 1) and that, given Y > 0, Y follows a zero truncated Poisson distribution with parameter μ (μ > 0) (Mathlouthi et al. 2019). That is, they are based on the ZAP random variable with P(Y = 0) = θ and P(Y = y) = (1 − θ)e^(−μ)μ^y/[y!(1 − e^(−μ))] for y = 1, 2, …. The mean and variance of the ZAP distribution are E(Y) = (1 − θ)μ/(1 − e^(−μ)) and Var(Y) = (1 − θ)(μ + μ²)/(1 − e^(−μ)) − [E(Y)]². In general, zero altered models are two-part models, where the first part is a logistic model and the second part is a truncated count model. However, under the ZAP_RF and ZAPC_RF, instead of assuming a linear predictor (as in ZAP regression models), the links between the covariates and the responses (Mathlouthi et al. 2019) through μ and θ are given by nonparametric link functions: μ = f_μ(x) and θ = f_θ(x), (1) where f_μ and f_θ are general unknown link functions. A general nonparametric and flexible procedure can be used to estimate f_μ and f_θ in (1). However, here we used random forests in two steps instead of a parametric model: Step 1. Zero model. Fit a binary RF to the response I(Y = 0), that is, the binary variable that takes a value of 1 if Y = 0 and a value of 0 if Y > 0. This model produces estimates of θ, denoted θ̂.
Step 2. Truncated model. Fit an RF using only the positive (Y > 0) observations; assume there are N⁺ such observations. This model produces estimates of μ, denoted μ̂. However, to exploit the Poisson assumption, the splitting criterion used in the RF for the truncated part was derived from the zero truncated Poisson likelihood, whose log-likelihood is: LL⁺(μ) = Σ_{i=1}^{N⁺} [y_i log(μ) − μ − log(1 − e^(−μ)) − log(y_i!)], (2) where LL⁺ is the log-likelihood function of a sample from a zero truncated Poisson distribution. The estimate of μ is obtained by solving ∂LL⁺/∂μ = 0, which reduces to solving μ̂/(1 − e^(−μ̂)) = ȳ⁺, where ȳ⁺ is the mean of the positive observations. For a given candidate split, the log-likelihood function given in equation (2) is computed separately in the two child nodes and the best split is the one that maximizes LL⁺(left node) + LL⁺(right node), where LL⁺(left node) and LL⁺(right node) are the maximized log-likelihoods of the two nodes.
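The two key computations of this step, solving μ̂/(1 − e^(−μ̂)) = ȳ⁺ and scoring a candidate split through the children's zero truncated Poisson log-likelihoods, can be sketched as follows. This is an illustrative Python translation (the paper's implementation is in R), and it assumes the node mean of the positive counts exceeds 1, which is required for the Newton solve to be well defined:

```python
import math

def trunc_poisson_mu(ybar, iters=50):
    """Solve mu / (1 - exp(-mu)) = ybar (the zero-truncated Poisson mean
    equation) by Newton's method; ybar must be strictly greater than 1."""
    mu = ybar  # reasonable starting point
    for _ in range(iters):
        e = math.exp(-mu)
        f = mu / (1 - e) - ybar
        df = (1 - e - mu * e) / (1 - e) ** 2
        mu -= f / df
    return mu

def trunc_poisson_ll(ys):
    """Zero-truncated Poisson log-likelihood of a node at its MLE,
    dropping the log(y!) terms, which cancel in split comparisons."""
    ybar = sum(ys) / len(ys)
    mu = trunc_poisson_mu(ybar)
    return sum(y * math.log(mu) - mu - math.log(1 - math.exp(-mu)) for y in ys)

def split_gain(left, right):
    """A candidate split is scored by how much the children's combined
    maximized log-likelihood exceeds the parent's; the best split
    maximizes this gain, which is always non-negative."""
    return trunc_poisson_ll(left) + trunc_poisson_ll(right) - trunc_poisson_ll(left + right)
```

Because each child is evaluated at its own maximum-likelihood μ̂, splitting the node can never decrease the total log-likelihood, so the gain is non-negative and is largest when the two children have clearly different count levels.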
Once we have the estimates of μ and θ, the predicted values of Y under the ZAP_RF are obtained as Ŷ = (1 − θ̂)μ̂/(1 − e^(−μ̂)). It is important to point out that in this prediction formula, Ŷ is equal to the mean of the ZAP model. Under the ZAPC_RF, the predictions are instead obtained as Ŷ = 0 if θ̂ > 0.5 and Ŷ = μ̂ if θ̂ ≤ 0.5. The ZAPC_RF behaves like a conventional logistic regression model, where the predicted values are probabilities and those probabilities are converted to a binary outcome if the probability is larger (or smaller) than some probability threshold (most of the time, 0.5). However, under the ZAPC_RF, instead of converting the probabilities to 0 and 1, we convert the prediction to zero if θ̂ > 0.5 and to the estimated expected count value (μ̂) if θ̂ ≤ 0.5. One limitation of the ZAPC_RF (as with logistic regression) is that the probability threshold is not unique, since many other values between zero and one could be used. However, the threshold value of 0.5 is used most of the time since it assumes no prior information, so that both categories have the same probability of occurring.
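Once θ̂ (from the zero-part RF) and μ̂ (from the truncated-part RF) are available, the two prediction rules reduce to a few lines. The sketch below is illustrative Python and assumes, as above, that μ̂ is the parameter of the zero truncated Poisson distribution:

```python
import math

def zap_rf_predict(theta_hat, mu_hat):
    """ZAP_RF rule: predict the ZAP mean,
    E(Y) = (1 - theta) * mu / (1 - exp(-mu))."""
    return (1 - theta_hat) * mu_hat / (1 - math.exp(-mu_hat))

def zapc_rf_predict(theta_hat, mu_hat, threshold=0.5):
    """ZAPC_RF rule: predict 0 when the estimated zero probability exceeds
    the threshold, and the truncated-part estimate mu_hat otherwise."""
    return 0.0 if theta_hat > threshold else mu_hat
```

The two rules differ in spirit: ZAP_RF blends both parts into a single expected count, while ZAPC_RF makes a hard classify-then-predict decision, which is why it inherits the threshold-choice limitation discussed above.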

Phenotypic dataset 1
This dataset is composed of 115 spring wheat lines developed by the International Maize and Wheat Improvement Center (CIMMYT); the trait measured was Fusarium head blight (FHB) severity. The experiments were performed in 2011 and data were collected in three environments (Env1, Env2, and Env3). These datasets are the same ones used by Montesinos-López et al.

Genotypic dataset 1
For each line under study, we used the 1635 SNPs that remained after quality control; lines were genotyped using an Illumina 9K SNP chip with 8632 single nucleotide polymorphisms (SNPs) (Cavanagh et al. 2013). Markers were coded as zero (absence) or one (presence). Specific details of this genotypic information are available in Montesinos-López et al. (2020b).

Genotypic dataset 2
In this dataset, after quality control and imputation, 11,617 SNPs remained available, and these markers were also coded as zero or one. This genotypic information was used to evaluate the prediction performance of the proposed models. More details of these phenotypic datasets can be found in Montesinos-López et al. (2020b).

Metrics used to measure prediction performance
Cross-validation was used to evaluate the prediction performance on unseen data. Since our data contain the same lines in I environments, we used an outer fivefold cross-validation that mimics a situation where lines were evaluated in some environments for all traits but some lines were missing in other environments. We used cross-validation because the resulting test error is very nearly unbiased and because our datasets are not very large (Theodoridis 2020). Four folds were used for training and one fold for testing. We repeated the training 5 times, each time selecting one part (different each time) of the data for testing and the remaining 4 parts for training. This cross-validation strategy gives us the advantage of testing with one part of the data that has not been involved in training, so it can be considered independent, while eventually using all the data for both training and testing (Theodoridis 2020). We reported the average prediction performance combining the 5 estimates of the testing sets in terms of average Spearman correlation (ASC), mean arctangent absolute percentage error (MAAPE) and mean absolute error of prediction (MAE), for each environment and across environments. The ASC was used instead of Pearson's correlation because the response variable is not normally distributed. In terms of ASC, the closer to one, the better the prediction performance, while under MAE and MAAPE, the closer to zero, the better the prediction performance. It is important to point out that the tuning of the hyper-parameter (λ) in the generalized Poisson regression (GPR) was done with 10-fold inner cross-validation, while the tuning process for the random forest models (RF, ZAP_RF, ZAPC_RF) was done with 5-fold inner cross-validation inside each outer fold. This means that in each outer fold, 20% of the data was used for tuning (TUN) and 80% of the information for inner training (ITRN).
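The three metrics can be computed directly from the observed and predicted values. The sketch below (illustrative Python, standard library only) implements MAE, MAAPE, and Spearman's correlation as Pearson's correlation of average ranks; note that MAAPE stays bounded even when an observed count is zero, since the arctangent of an infinite ratio is π/2, which is one reason it is preferred over a plain percentage error for count data:

```python
import math

def mae(y, yhat):
    """Mean absolute error of prediction."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def maape(y, yhat):
    """Mean arctangent absolute percentage error; bounded for y_i = 0."""
    total = 0.0
    for a, b in zip(y, yhat):
        if a == 0:
            total += 0.0 if b == 0 else math.pi / 2
        else:
            total += math.atan(abs((a - b) / a))
    return total / len(y)

def _ranks(v):
    """Average ranks (ties share the mean of their positions)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(y, yhat):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(y), _ranks(yhat)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den
```

Because Spearman's correlation depends only on ranks, it is insensitive to the skewness of the count responses, which motivates its use here in place of Pearson's correlation.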
Each of the 9 combinations of the grid search (for data_set_1 and data_set_2) was trained with the inner training set in each outer fold; its prediction performance was evaluated on the inner tuning (TUN) set, and the average MAE was obtained over the 5 inner tuning sets. For estimating the lambda hyper-parameter (λ) in GPR, we used a 10-fold partition; this is the default for the software and does not require significant computational resources. For the random forest models, we used only five folds, since the random forest must be trained for each combination of hyper-parameters, which considerably increases the computational resources required.
After selecting the best combination of hyper-parameters in terms of MAE, the model was refitted using the whole outer training set (80% of the data) in each fold. Finally, for each outer testing set, we computed each of the three metrics (ASC, MAAPE and MAE) with its corresponding standard error (SE); then the average of the 5 outer folds and its SE were reported as measures of prediction performance and variability for each metric. It is important to point out that the 5-fold cross-validation strategy was implemented with only 1 replication. The cv.zap.rf() function, implemented in the R statistical software to fit the proposed ZAP_RF and ZAPC_RF models, is given in Appendix A.
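The nested (outer/inner) cross-validation procedure described above can be sketched generically. The helper below is an illustrative Python outline, not the paper's R implementation: fit and score stand in for any model-fitting routine and error metric (e.g., MAE), and grid is the list of hyper-parameter combinations:

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(X, y, grid, fit, score, outer_k=5, inner_k=5):
    """Outer k-fold gives a nearly unbiased test error; an inner k-fold on
    each outer training set picks the hyper-parameter combination with the
    lowest average inner-validation score (lower is better, e.g. MAE)."""
    outer_scores = []
    for outer in kfold_indices(len(y), outer_k):
        held = set(outer)
        train = [i for i in range(len(y)) if i not in held]
        best, best_err = None, float("inf")
        for params in grid:  # grid search runs on the inner folds only
            errs = []
            for inner in kfold_indices(len(train), inner_k, seed=1):
                tun = [train[i] for i in inner]                 # tuning (TUN)
                itrn = [train[i] for i in range(len(train))
                        if i not in set(inner)]                 # inner training (ITRN)
                model = fit([X[i] for i in itrn], [y[i] for i in itrn], params)
                errs.append(score(model, [X[i] for i in tun], [y[i] for i in tun]))
            if sum(errs) / len(errs) < best_err:
                best_err, best = sum(errs) / len(errs), params
        # refit on the whole outer training set with the winning combination
        model = fit([X[i] for i in train], [y[i] for i in train], best)
        outer_scores.append(score(model, [X[i] for i in outer], [y[i] for i in outer]))
    return sum(outer_scores) / outer_k
```

The key property of this design is that the outer testing sets never participate in hyper-parameter selection, so the reported error is not optimistically biased by the tuning step.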

Variable importance measures
For the proposed zero altered Poisson methods (ZAP_RF and ZAPC_RF), it was possible to obtain variable importance measures (VIM). There are many measures of variable importance; one common approach for regression trees is to calculate the decrease in prediction accuracy on the testing dataset. For each tree, the testing-set portion of the data was passed through the tree and the prediction error (PE) was recorded. Each predictor variable was then randomly permuted and a new PE was calculated. The differences between the two were then averaged over all trees and normalized by the standard deviation of the differences. The variable showing the largest decrease in prediction accuracy was the most important variable. The results were displayed in a variable importance plot of the top-ranked variables. Since the ZAP_RF and ZAPC_RF models are composed of a zero part and a truncated part, two plots were obtained for each trait, and the final VIM estimates of each independent variable were the average values over the five implemented testing sets.
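The permutation idea behind these VIM can be sketched as follows: shuffle one predictor column at a time and record how much the prediction error grows. This Python sketch is an illustrative simplification of what tree-ensemble implementations do per tree; here the error increase is computed once against a fitted predictor and is not normalized by its standard deviation:

```python
import random

def permutation_vim(predict, X, y, seed=0):
    """Permutation variable importance: the increase in test mean squared
    error after shuffling one predictor column at a time."""
    rng = random.Random(seed)
    base = sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)
    vim = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the link between predictor j and y
        Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        err = sum((predict(row) - t) ** 2 for row, t in zip(Xp, y)) / len(y)
        vim.append(err - base)  # large increase = important predictor
    return vim
```

A predictor the model never uses yields an importance of exactly zero, since shuffling it leaves every prediction unchanged; influential predictors yield large positive values.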

Data availability
The phenotypic and genotypic data for dataset 1 used in this study are contained in the R file Data_Real_Count.RData, available at the following link: http://hdl.handle.net/11529/10575. For dataset 2, the phenotypic and genotypic data are contained in the R file Data_set 2.RData, available at the following link: http://hdl.handle.net/11529/10548438.

Results
The results are given in three subsections. In the first subsection, for each trait in each dataset, we show the percentage of excess zeros. In the second subsection, we give a description of the prediction performance of dataset 1, while in the third subsection, the same description is given, but for dataset 2.
Percentage of excess zeros in each dataset
Figure 1A shows that for trait FHB, which belongs to dataset 1, the percentage of zeros was 34.87%. For trait PTR, which belongs to dataset 2, the percentage of zeros was 5.97%, while for the second trait (SB) in this dataset, the percentage of zeros was 9.86%, and for the last trait (SN) in this second dataset, the percentage of zeros was 3.96%. Table 1 provides a summary of the phenotypic information for each of the four traits under study, where it is evident that the mean and median are quite different, an indicator that the data are asymmetric and non-normally distributed. Table 1 also shows that the minimum count is zero for all four traits, and the maximum is 20 for two traits (SB and SN), 18 for trait FHB and 19 for trait PTR. Table 2 gives the phenotypic correlation between the environments for each trait. For trait FHB (dataset 1), we can observe a perfect correlation between environments Batan2012 and Batan2014, but a poor correlation between environment Chunchi2014 and the other two (Batan2012 and Batan2014). For trait PTR (dataset 2), the largest correlation (0.605) was between Env5 and Env6, while the lowest (0.291) was between Env3 and Env6; most of the correlations between environments are between 0.3 and 0.4. For trait SN (dataset 2), the largest correlation (0.79) was also observed between Env5 and Env6, while the lowest (0.412) was between Env1 and Env5; the remaining correlations were between the minimum and maximum values mentioned before. Finally, for trait SB (dataset 2), the largest (0.456) and smallest (0.316) correlations were between Env1 and Env2 and between Env2 and Env4, respectively.

Dataset 1
In Figure 2, we compare the prediction performance of the five models (GPR, RF, RR, ZAP_RF, ZAPC_RF) in dataset 1 for trait FHB. The prediction performance was evaluated in terms of Spearman's correlation, MAAPE and MAE for each environment. First we provide the results taking into account the genotype by environment (GE) interaction in the predictor. In terms of ASC, Figure 2A shows that the best prediction performance (in the three environments) was observed under the ZAP_RF model in environment Chunchi2014, while the worst was observed under the RR model in environment Batan2014; the best model outperformed the worst by (0.809 − 0.340) × 100/0.340 = 137.94%. Figure 2B shows that in the three environments, the best performance was under the ZAPC_RF model and the worst was under the RR model. In Batan2012, Batan2014, and Chunchi2014, the ZAPC_RF outperformed the RR model by (0.927 − 0.74) × 100/0.74 = 25.270%, (0.932 − 0.748) × 100/0.748 = 24.599% and (0.737 − 0.544) × 100/0.544 = 35.478%, respectively. The second best model was ZAP_RF, which was slightly better than the RF model, but there were no significant differences between them in terms of MAAPE performance (Figure 2B). In MAE terms, the ZAPC_RF model was also the best in environments Batan2012 and Batan2014, while in environment Chunchi2014, the best model was RF (Figure 2C). In environments Batan2012 and Batan2014, ZAPC_RF outperformed the worst model (RR) by (1.072 − 0.863) × 100/0.863 = 24.22%. Without taking into account the GE interaction, the results in terms of ASC are shown in Figure 2A. In terms of MAAPE, the best and worst models were ZAPC_RF (MAAPE = 0.608; Chunchi2014) and RR (MAAPE = 0.934; Batan2014), and the best model outperformed the worst by (0.934 − 0.608) × 100/0.608 = 53.52% (Figure 2B). Finally, in terms of MAE, the best model (ZAPC_RF) outperformed the worst by (2.0412 − 0.904) × 100/0.904 = 125.83% (Figure 2C). Figure 3A indicates that under Spearman's correlation across environments, the best and worst models were ZAP_RF and RR, respectively.
The best model outperformed the worst model by (0.674 − 0.456) × 100/0.674 = 32.22%. Across environments, Figure 3B shows that in MAAPE terms, the ZAPC_RF model was the best and the RR model was the worst. Figure 4 provides the variable importance measures (VIM) for the conventional random forest model (A) and for the ZAP_RF model (B, C); Figure 4B corresponds to the truncated part and Figure 4C to the zero part of the ZAP_RF model. These plots contain only the 30 most important variables. Figure 4A indicates that the three most important predictors for the conventional RF model are Chunchi2014, V4, and V77 (without GE interaction) and Chunchi2014, Chunchi2014-1 and V4 (with GE interaction). Under the ZAP_RF model, the three most important predictors under the truncated part, ignoring the GE interaction, were Chunchi2014, V116 and V3, while, taking into account the GE interaction term, they were Chunchi2014, Chunchi2014-1 and V116. Under the zero part, predictors Chunchi2014, V39 and V16 were the most important, both without and with the GE interaction term.

Dataset 2
Trait PTR
First, we give the results for trait PTR, then for trait SB and finally for SN. The prediction performance of the five models (GPR, RF, RR, ZAP_RF, ZAPC_RF) in dataset 2 for trait PTR was evaluated in terms of Spearman's correlation, MAAPE and MAE for each environment (Figure 5). First, we provide the prediction performance taking into account the GE interaction. In terms of Spearman's correlation, for the PTR trait the best and worst prediction performances were observed under the ZAPC_RF (Spearman = 0.565; Env6) and RF (Spearman = 0.439; Env2) models, respectively; the best model outperformed the worst by (0.565 − 0.439) × 100/0.439 = 28.70% (Figure 5A). In terms of MAE, the best model outperformed the worst by (2.975 − 2.029) × 100/2.029 = 46.60% (Figure 5C). Without taking into account the GE interaction, in terms of Spearman's correlation, the best model was the ZAP_RF (Spearman = 0.554; Env6), while the worst was the ZAPC_RF model (Figure 5A). NO in the plots means that the genotype×environment (GE) interaction was ignored, while YES means that the GE interaction term was taken into account. Figure 6A indicates that across environments, taking into account the GE interaction, the best model was ZAP_RF (Spearman = 0.547) and the worst was the GPR (Spearman = 0.521) model; the best model outperformed the worst by (0.547 − 0.521) × 100/0.547 = 4.67%. Figure 6B shows that in terms of MAAPE, across environments, the ZAP_RF (MAAPE = 0.400) model was the best and the worst was the RR (MAAPE = 0.458) model; however, the best model outperformed the worst model by only (0.458 − 0.400) × 100/0.458 = 12.70%. In terms of MAE, the best model was also ZAP_RF (MAE = 2.312), which outperformed the worst model RR (MAE = 2.706) by (2.706 − 2.312) × 100/2.706 = 14.54% (Figure 6C).
Without taking into account the GE interaction, the best prediction performance across environments under Spearman's correlation was with model RF (Spearman = 0.542) and the worst was with model ZAPC_RF. Figure 7 provides the VIM for the conventional random forest model (A) and for the ZAP_RF model (B, truncated part; C, zero part) for trait PTR of dataset 2. These plots contain only the 30 most important variables. The three most important predictors for the conventional RF model were V8, V12 and V7 (with and without GE interaction). For the ZAP_RF model under the truncated part, the same predictors, V7, V8 and V12 (with and without GE interaction), were the most important. Under the zero part, three dummies out of the six environments (Env6, Env4 and Env3) were the most important predictors ignoring the GE interaction, while with the GE interaction, the most important predictors were Z.G46.Env4, V189 and Z.G202.Env2.
The 30 most important VIM for the conventional random forest model (Figure 13A) and for the ZAP_RF model (Figure 13B, truncated part; Figure 13C, zero part) are given for trait SN in dataset 2. The three most important predictors for the conventional RF model were V10, V14 and V7 (without GE interaction) and V10, V14 and V54 (with GE interaction). For the ZAP_RF model under the truncated part, the most important predictors were V10, V14 and V24 (without the GE term) and V10, V14 and V7 (with the GE interaction term). Under the zero part, V247, V55 and V203 were the most important predictors ignoring the GE interaction, while with the GE interaction, the most important predictors were V44, Z.G277.Env3, and V189.
Figure 8 Prediction performance in terms of average Spearman's correlation (Spearman; A), mean arctangent absolute percentage error (MAAPE; B) and mean absolute error of prediction (MAE; C) of the five models (GPR, RF, RR, ZAP_RF, ZAPC_RF) for each environment in dataset 2 for trait SB. NO in the plots means that the genotype×environment (GE) interaction was ignored, while YES means that the GE interaction term was taken into account.

Discussion
Due to the fact that there is no universal model that works in all circumstances, many statistical machine learning models have been adopted for genomic prediction. Random forest is one of the models adopted for genomic prediction, with many successful applications (Sarkar et al. 2015; Stephan et al. 2015; Naderi et al. 2016; Waldmann 2016; Li et al. 2018).
Some of the reasons for the increased popularity of random forests are: (1) they require very simple input preparation and can handle binary, categorical and numerical independent variables without the need for any preprocessing like scaling; (2) they perform implicit variable selection and provide a ranking of predictor (feature) importance; (3) they are inexpensive in terms of the computational resources needed for training, since there are few hyper-parameters that need to be tuned (number of trees, number of features sampled and number of samples in the final nodes) and since, instead of working directly with all independent variables simultaneously each time, they use only a fraction of the independent variables; (4) some algorithms can beat random forests, but never by much, and other algorithms often take much longer to build and tune than an RF model; (5) contrary to deep neural networks, which are really hard to build, it is really hard to build a bad random forest, since it depends on very few hyper-parameters, some of which are not very sensitive, which means that a lot of tweaking and fiddling is not required to get a decent random forest model; (6) RFs are very versatile, since they can deal with continuous, binary and categorical response variables; (7) they have a very simple learning algorithm; (8) they are easy to implement, since there are many free and open-source implementations; and (9) RF parallelization is possible because each tree is grown independently.
Figure 9 Prediction performance in terms of average Spearman's correlation (Spearman; A), mean arctangent absolute percentage error (MAAPE; B) and mean absolute error of prediction (MAE; C) of the five models (GPR, RF, RR, ZAP_RF, ZAPC_RF) across environments in dataset 2 for trait SB. NO in the plots means that the genotype×environment (GE) interaction was ignored, while YES means that the GE interaction term was taken into account.
The model originally proposed for estimation purposes by Mathlouthi et al. (2019) expanded the versatility of the random forest algorithm, since ZAP_RF and ZAPC_RF are appropriate for count data with excess zeros. The main advantage of these methods is their flexibility: they can adapt to the data at hand without a parametric form having to be specified. We found that the proposed methods outperformed Ridge regression and Poisson Ridge regression and slightly outperformed the conventional random forest. For this reason, the proposed methods help address the lack of efficient algorithms for dealing with count data with excess zeros. The previously mentioned advantages of the conventional RF are inherited by the proposed methods, since the only difference between the conventional RF and the proposed zero altered Poisson random forest models is that, instead of training a single conventional random forest model with the sum of squared errors (least squares) as the splitting criterion, two random forest models are trained: one for the excess zeros (with a conventional splitting criterion for binary outcomes, such as the Gini index or the log-likelihood of the Bernoulli distribution) and another for counts larger than zero, which uses the log-likelihood of the zero truncated Poisson distribution as the splitting criterion. Using two models instead of one allows the conventional random forest to be modified to deal better with count data with excess zeros. The proposed zero altered Poisson random forest methods also allow reporting the important features (predictors), but instead of one graph, two are generated: one for the zero-altered part, which shows which features are the most important for the excess zeros, and the other for the remaining counts (1, 2, . . .). These two graphs of important predictors are very useful for gaining insight into the biological meaning of the most important predictors.

Figure 10 Predictor importance for trait SB in dataset 2 under the conventional random forest (A) and under the zero altered Poisson random forest for trait SB (B and C). The first column contains the results without interaction (NO) and the second column contains the results with interaction (YES).
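To make the two-model structure concrete, here is a minimal Python sketch (not the authors' R implementation; all function names here are illustrative) of how a zero altered (hurdle) Poisson prediction combines the outputs of the two trained models: the probability of a non-zero count from the binary part, and the rate of the zero-truncated Poisson part.

```python
import math

# Illustrative sketch of hurdle-model prediction. In the proposed methods,
# p_nonzero would come from the RF classifier trained on the zero/non-zero
# outcome (Gini criterion), and lam from the RF trained with the
# zero-truncated Poisson log-likelihood as the splitting criterion.

def truncated_poisson_mean(lam):
    """Mean of a zero-truncated Poisson: E[y | y > 0] = lam / (1 - exp(-lam))."""
    return lam / (1.0 - math.exp(-lam))

def hurdle_predict(p_nonzero, lam):
    """Expected count under the hurdle model: P(y > 0) * E[y | y > 0]."""
    return p_nonzero * truncated_poisson_mean(lam)

# Example: a 30% chance of a non-zero count, with rate 2.0 for the positive part
print(round(hurdle_predict(0.3, 2.0), 3))  # → 0.694
```

Note that the truncated mean `lam / (1 - exp(-lam))` always exceeds `lam`, which compensates for the zeros that the binary part has already accounted for.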
The proposed zero altered Poisson random forest methods (ZAP_RF and ZAPC_RF) belong to the category of ensemble regression tree models, whose nature makes it difficult to evaluate the effect of each predictor. This means that these methods differ from parametric models for GWAS (e.g., a linear mixed model), since they do not provide parameter estimates and p-values (for significance) for measuring the degree of importance of each predictor. However, many other non-parametric models allow calculating variable importance measures (denoted as VIM) that indicate the contributions of individual predictors to the prediction error. Figures 4, 7, 10, and 13 show the distribution profiles of the VIM values of the ranked predictors (from the most important to the least important ones) for the RF analyses and for the proposed methods. The larger a predictor's VIM value, the more important the predictor is. Most of the predictors were found to have either a very small positive influence or no effect on the VIM values under RF and the proposed methods. Also, since the proposed zero altered Poisson random forest models (ZAP_RF and ZAPC_RF) were built with two models (a truncated part and a zero part), they provide two plots of VIM values, one for the truncated part and another for the zero part, indicating that different predictors influence each model. These plots are of paramount importance because they allow identifying which predictors play the most important role in the prediction of the response variable of interest.

Figure 11 Prediction performance in terms of average Spearman's correlation (Spearman; A), mean arctangent absolute percentage error (MAAPE; B) and mean absolute error of prediction (MAE; C) of the five models (GPR, RF, RR, ZAP_RF, ZAPC_RF) for each environment in dataset 2 for trait SN. NO in the plots means that the genotype×environment (GE) interaction was ignored, while YES means that the GE interaction term was taken into account.
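One common way to obtain such VIM values for a black-box model is permutation importance: a predictor is important if shuffling its column degrades the model's prediction error. The sketch below illustrates the idea in Python (the RF implementations used in the paper may compute importance differently, e.g., on out-of-bag samples; the toy model and data here are made up).

```python
import random

def mse(model, X, y):
    """Mean squared error of a prediction function over a dataset."""
    return sum((model(x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, seed=0):
    """Increase in MSE after permuting feature j across observations."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    col = [x[j] for x in X]
    rng.shuffle(col)
    X_perm = [list(x) for x in X]
    for i, v in enumerate(col):
        X_perm[i][j] = v
    return mse(model, X_perm, y) - base

# Toy model that only uses feature 0, so feature 0 should rank above feature 1
model = lambda x: 2.0 * x[0]
X = [[float(i), float(i % 3)] for i in range(20)]
y = [model(x) for x in X]
vim0 = permutation_importance(model, X, y, 0)
vim1 = permutation_importance(model, X, y, 1)
print(vim0 > vim1)  # → True: only feature 0 influences the predictions
```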
It is important to point out that, under a univariate Poisson regression, the inverse link function is equal to $\mu_i = \exp\big(\gamma + \sum_{j=1}^{p} x_{ij}\beta_j\big)$. However, if we change the inverse link function to $\mu_i = \gamma + \sum_{j=1}^{p} x_{ij}\beta_j$, that is, an identity inverse link function, and if we assume that $y_i \sim \mathrm{Normal}\big(\mu_i = \gamma + \sum_{j=1}^{p} x_{ij}\beta_j,\ \sigma^2\big)$, we move from a univariate Poisson regression to a univariate Gaussian regression model. However, there is a lot of empirical evidence that, for count response variables, the Poisson regression model should be preferred, since it guarantees that all predictions are non-negative, which is not guaranteed with a normal model (Montesinos-López et al. 2015, 2016). When Gaussian regression is used instead of Poisson regression, negative outputs of the Gaussian regression must be truncated to zero, and it is unclear how this affects the optimality of the predictive distribution (Montesinos-López et al. 2015, 2016). In terms of prediction performance, there is also evidence that many times (for particular datasets) a Gaussian model gives a prediction performance similar to that of Poisson regression; however, when the counts contain an excess of zeros, the normal approximation fails to capture the excess zeros, and for this reason improved versions of Poisson regression, such as the zero-inflated Poisson and zero altered Poisson regressions, are used. For this reason, the goal of our proposed method is to improve the prediction performance for counts in the presence of an excess of zeros.

Figure 12 Prediction performance in terms of average Spearman's correlation (Spearman; A), mean arctangent absolute percentage error (MAAPE; B) and mean absolute error of prediction (MAE; C) of the five models (GPR, RF, RR, ZAP_RF, ZAPC_RF) across environments in dataset 2 for trait SN. NO in the plots means that the genotype×environment (GE) interaction was ignored, while YES means that the GE interaction term was taken into account.
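The contrast between the two inverse link functions can be seen in a few lines of Python (the values of the linear predictor $\eta = \gamma + \sum_j x_{ij}\beta_j$ below are made up for illustration):

```python
import math

def inv_link_log(eta):
    """Poisson regression inverse link: mu = exp(eta), always strictly positive."""
    return math.exp(eta)

def inv_link_identity(eta):
    """Gaussian regression 'inverse link': mu = eta, can be negative."""
    return eta

for eta in (-1.5, 0.0, 1.5):
    mu_pois = inv_link_log(eta)
    mu_gauss = inv_link_identity(eta)
    # The Gaussian mean must be truncated at zero to be a valid count prediction
    print(eta, round(mu_pois, 3), max(mu_gauss, 0.0))
```

For a negative linear predictor, the log link still yields a small positive predicted count, while the identity link produces a negative value that has to be truncated.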
Also, as with the conventional RF, the individual decision trees generated by the proposed methods (ZAP_RF and ZAPC_RF) are prone to overfitting (that is, they have high variance and low bias); however, by resampling the data many times to create a large number of un-pruned decision trees, the accuracy of prediction based on sample data is improved, because the variance component is reduced. In addition, the proposed methods do not differentiate between random effects (lines) and fixed effects (environments), since they are non-parametric models; for this reason, the environmental, genotypic and genotypic × environmental effects used in the inputs are treated as additional predictors in the model that also influence the response variable, as shown in the plots of predictor importance for each trait (Figures 4, 7, 10, and 13).
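The variance-reduction argument can be illustrated numerically with a toy simulation (this uses synthetic numbers, not the paper's data; real RF trees are correlated, so the reduction in practice is smaller than for the independent predictions simulated here):

```python
import random
import statistics

def noisy_tree_prediction(rng, truth=5.0, sd=2.0):
    """Stand-in for one over-fit tree: unbiased but high-variance."""
    return rng.gauss(truth, sd)

def bagged_prediction(rng, n_trees):
    """Average of n_trees independent noisy predictions."""
    return sum(noisy_tree_prediction(rng) for _ in range(n_trees)) / n_trees

rng = random.Random(42)
single = [noisy_tree_prediction(rng) for _ in range(2000)]
bagged = [bagged_prediction(rng, 50) for _ in range(2000)]
# Averaging 50 independent predictions shrinks the standard deviation by
# roughly sqrt(50), while the mean (bias) stays essentially unchanged.
print(statistics.stdev(single) > 5 * statistics.stdev(bagged))  # → True
```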

Figure 13 Predictor importance for trait SN in dataset 2 under the conventional random forest (A) and under the zero altered Poisson random forest for trait SN (B and C). The first column contains the results without interaction (NO) and the second column contains the results with interaction (YES).

Conclusion
In this paper, a zero altered Poisson random forest model was evaluated for genomic prediction. This model is a modified random forest in which, instead of fitting only one RF model, two random forest models are implemented: one for the zero counts (with the Gini index as the splitting criterion) and another for the counts larger than zero (with a splitting criterion based on the log-likelihood of a zero truncated Poisson distribution). The two versions of the proposed model for excess zeros (ZAP_RF and ZAPC_RF) were compared, in terms of prediction performance, with Ridge regression for continuous outcomes, Poisson Ridge regression and the conventional random forest. Our results suggest that the two versions of the proposed zero altered Poisson random forest model were, most of the time, the best in terms of prediction performance: they clearly outperformed Ridge regression and Poisson Ridge regression, but produced only a slight improvement over the conventional random forest model. However, we observed that in dataset 1, which contains a larger percentage of excess zeros, the proposed model was clearly better than all the other models. For this reason, we also provide the cv.zap.rf() function to implement the proposed models in R, so that other scientists can benchmark the prediction performance of the proposed methods on other real data. Finally, we encourage the use of the proposed zero altered random forest models, because their implementation is straightforward using the proposed cv.zap.rf() function in the R statistical software, and they produce very competitive predictions, like the conventional random forest model.