Daniel W. Apley, Jingyu Zhu, Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 82, Issue 4, September 2020, Pages 1059–1086, https://doi.org/10.1111/rssb.12377
Summary
In many supervised learning applications, understanding and visualizing the effects of the predictor variables on the predicted response is of paramount importance. A shortcoming of black box supervised learning models (e.g. complex trees, neural networks, boosted trees, random forests, nearest neighbours, local kernel-weighted methods and support vector regression) in this regard is their lack of interpretability or transparency. Partial dependence plots, which are the most popular approach for visualizing the effects of the predictors with black box supervised learning models, can produce erroneous results if the predictors are strongly correlated, because they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data. As an alternative to partial dependence plots, we present a new visualization approach that we term accumulated local effects plots, which do not require this unreliable extrapolation with correlated predictors. Moreover, accumulated local effects plots are far less computationally expensive than partial dependence plots. We also provide an R package ALEPlot as supplementary material to implement our proposed method.
1 Introduction
With the proliferation of larger and richer data sets in many predictive modelling application domains, black box supervised learning models (e.g. complex trees, neural networks, boosted trees, random forests, nearest neighbours, local kernel-weighted methods and support vector regression) are increasingly commonly used in place of more transparent linear and logistic regression models to capture non-linear phenomena. However, one shortcoming of black box supervised learning models is that they are difficult to interpret in terms of understanding the effects of the predictor variables (which we refer to as ‘predictors’ for short) on the predicted response. For many applications, understanding the effects of the predictors is critically important. This is obviously so if the purpose of the predictive modelling is explanatory, such as identifying new disease risk factors from electronic medical record databases. Even if the purpose is purely predictive, understanding the effects of the predictors may still be quite important. If the effect of a predictor violates intuition (e.g. if it appears from the supervised learning model that the risk of experiencing a cardiac event decreases as patients age), then this is either an indication that the fitted model is unreliable or that a surprising new phenomenon has been discovered. In addition, predictive models must be transparent in many regulatory environments, e.g. to demonstrate to regulators that consumer credit risk models do not penalize credit applicants based on age, race, etc.
To be more concrete, suppose that we have fitted a supervised learning model f(x) for approximating E[Y | X = x], where Y is a scalar response variable, X = (X1, X2, …, Xd) is a vector of d predictors and f(·) is the fitted model that predicts Y (or the probability that Y falls into a particular class, in the classification setting) as a function of X. To simplify the notation, we omit any circumflex symbol over f, with the understanding that it is fitted from data. The training data to which the model is fitted consist of the n (d + 1)-variate observations {(yi, xi = (xi,1, xi,2, …, xi,d)): i = 1, 2, …, n}. Throughout, we use upper case to denote a random variable and lower case to denote specific or observed values of the random variable.
Fig. 1(a) illustrates how the partial dependence (PD) plot function f1,PD(x1) of equation (1) is computed at a specific value of x1 for a toy example with n = 200 observations of (X1, X2) following a uniform distribution along the line segment x2 = x1 but with independent N(0, 0.05²) variables added to both predictors (see Hooker (2007) for a similar example demonstrating the adverse consequences of extrapolation in PD plots). Although we ignore the response variable for now, we return to this example in Section 4 and fit various models f(x) to these data. The salient point in Fig. 1(a), which illustrates the problem with PD plots, is that the integral in equation (1) is the weighted average of f(x1, X2) as X2 varies over its marginal distribution p2(x2). This integral is over the entire vertical line segment in Fig. 1(a) and requires rather severe extrapolation beyond the envelope of the training data. If we were to fit a simple parametric model of the correct form, then this extrapolation might be reliable. However, by nature of its flexibility, a non-parametric supervised learning model like a regression tree cannot be expected to extrapolate reliably. As we demonstrate later (see Figs 5–7 in Section 4), this renders the PD plot an unreliable indicator of the effect of X1.

Illustration of the differences between the computation of (a) the PD plot function f1,PD(x1) and (b) the ALE plot function f1,ALE(x1) at a specific value of x1
The main objective of this paper is to introduce a new method of assessing the main and interaction effects of the predictors in black box supervised learning models that avoids the foregoing problems with PD plots and marginal plots. Regarding the terminology, plots of an estimate of the conditional expectation f1,M(x1) = E[f(X1, X2) | X1 = x1] are often referred to as ‘marginal plots’, because ignoring X2 in this manner is equivalent to working with the joint distribution of (Y, X1) after marginalizing across X2. Unfortunately, plots of the PD function f1,PD(x1) are also sometimes referred to as ‘marginal plots’ (e.g. in the gbm package (Ridgeway, 2015, with contributions from others) for fitting boosted trees in R), presumably because the integral in expression (1) is with respect to the marginal distribution p2(x2). In this paper, marginal plots will refer to how we have defined them above.
The approach might also be useful for certain non-black-box supervised learning models that are parameterized in a non-linear manner that is less easy to interpret. For example, multinomial logit models can be challenging to interpret in terms of the effect of the predictors on the response category probabilities, and their odds ratios are often non-intuitive (Wulff, 2015).
The function f1,ALE(x1) can be interpreted as the accumulated local effects (ALEs) of X1 in the following sense (see Fig. 9 in Section 5 for more formal discussion of the correct and incorrect ways to interpret ALE plots). In equation (5), we calculate the local effect f^1(z1, x2) = ∂f(z1, x2)/∂z1 of f at (z1, x2), then average this local effect across all values of x2 with weight p2|1(x2 | z1), and then finally accumulate or integrate this averaged local effect over all values of z1 up to x1. As illustrated in Fig. 2, when averaging the local effect across x2, the use of the conditional density p2|1(x2 | z1), instead of the marginal density p2(x2), avoids the extrapolation that is required in PD plots. The avoidance of extrapolation is similar to marginal plots, which also use the conditional density p2|1(x2 | x1). However, by averaging (across x2) and accumulating (up to x1) the local effects via equation (5), as opposed to directly averaging f(·) via equation (3), ALE plots avoid the omitted nuisance variable bias that renders marginal plots of little use for assessing the main effects of the predictors. This relates closely to the use of paired differences to block out nuisance factors in more general statistical settings, which we discuss in Section 5.4.
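In symbols, equation (5) can be written as follows; this is a reconstruction that matches the verbal description above (xmin,1 denotes the smallest value in the support of X1, and the additive constant centres the effect so that it averages to 0), so the published typography may differ slightly:

\[
f_{1,\mathrm{ALE}}(x_1)
\;=\; \int_{x_{\min,1}}^{x_1} E\!\left[\, f^{1}(X_1, X_2) \mid X_1 = z_1 \right] \mathrm{d}z_1 \;-\; \text{constant}
\;=\; \int_{x_{\min,1}}^{x_1}\!\int f^{1}(z_1, x_2)\, p_{2|1}(x_2 \mid z_1)\, \mathrm{d}x_2\, \mathrm{d}z_1 \;-\; \text{constant},
\tag{5}
\]
where \(f^{1}(x_1, x_2) = \partial f(x_1, x_2)/\partial x_1\).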

Methods also exist for visualizing the effects of predictors by plotting a collection of curves, rather than a single curve that represents some aggregate effect. Consider the effect of a single predictor Xj, and let X\j denote the vector of the other d − 1 predictors. Conditioning plots (coplots) (Chambers and Hastie, 1992; Cleveland, 1993), conditional response plots (Cook, 1995) and individual conditional expectation plots (Goldstein et al., 2015) plot quantities like f(xj, x\j) against xj for a collection of discrete values of x\j (conditional response and individual conditional expectation plots), or similarly they plot such quantities against xj separately for each set Sk in some partition {Sk:k = 1, 2, …} of the space of X\j. Such a collection of curves has more in common with interaction effect plots (as in Fig. 12 in Section 5.2) than with main effect plots, for which one desires, by definition, a single aggregated curve.
The format of the remainder of the paper is as follows. In Section 2, we define the ALE main effects for individual predictors and the ALE second-order interaction effects for pairs of predictors. In Section 3 we present estimators of the ALE main and second-order interaction effects, which are conceptually straightforward and computationally efficient (much more efficient than PD plots), and we prove their consistency. We focus on main and second-order interaction effects and discuss general higher order effects and their estimation in the on-line supplementary materials. In Section 4 we give examples that illustrate the ALE plots and, in particular, how they can produce correct results when PD plots are corrupted because of their reliance on extrapolation. In Section 5, we discuss interpretation and uncertainty quantification of ALE plots and some of their desirable properties and computational advantages, and we illustrate with a real data example. We also discuss their relation to functional analysis-of-variance (ANOVA) decompositions for dependent variables (e.g. Hooker (2007)) that have been developed to avoid the same extrapolation problem as highlighted in Fig. 1(a). ALE plots are far more computationally efficient and systematic to compute than functional ANOVA decompositions; and they yield a fundamentally different decomposition of f(x) that is better suited for visualization of the effects of the predictors. Section 6 concludes the paper. The on-line supplementary materials include proofs of all the theorems, discussion of third- and higher order ALE effects, implementation details and additional details of the real data example. See Apley (2016), on which the main topic of this paper is based, for further examples. We also provide as on-line supplementary material an R package ALEPlot to implement ALE plots.
The data that are analysed in the paper and the programs that were used to analyse them can be obtained from
https://academic.oup.com/jrsssb/issue/.
2 Definition of main and second-order accumulated local effects
In this section we define the ALE main effect functions fj,ALE(xj) for each predictor (equation (5) is a special case for d = 2 and differentiable f(·)) and the ALE second-order effect functions f{j, l},ALE(xj, xl) for each pair of predictors. ALE plots are plots of estimates of these functions, and the estimators are defined in Section 3. We do not envision ALE plots being commonly used to visualize third- and higher order effects, since higher order effects are difficult to interpret and usually not as predominant as main and second-order effects. For this reason, and to simplify the notation, we focus on main and second-order effects and relegate the definition of higher order ALE effects to the on-line supplementary materials.
Throughout this section, we assume that the joint density p of X has compact support, and that the support of the marginal density pj of Xj is the interval [xmin,j, xmax,j] for each j ∈ {1, 2, …, d}. For each K = 1, 2, …, and j ∈ {1, 2, …, d}, let {Nj(k) = (zk − 1, j, zk, j]:k = 1, 2, …, K} be a partition of (xmin,j, xmax,j] into K intervals with z0, j = xmin,j and zK, j = xmax,j (the intervals and their end points depend on K, but we suppress this in the prose notation). Define the fineness of the partition as its largest interval width. For any x ∈ (xmin,j, xmax,j], define kj(x) to be the index of the interval into which x falls, i.e. x ∈ (zkj(x) − 1, j, zkj(x), j]. Let X\j denote the subset of d − 1 predictors excluding Xj, i.e. X\j = (Xl: l = 1, …, d; l ≠ j). The following definition is a generalization of equation (5) for a function f(·) that is not necessarily differentiable and for any d ⩾ 2. The generalization essentially replaces the derivative and integral in equation (5) with limiting forms of finite differences and summations respectively.
Definition 1 (uncentred ALE main effect)
Consider any j ∈ {1, 2, …, d}, and suppose that the sequence of partitions is such that . When f(·) and p are such that the following limit exists and is independent of the particular sequence of partitions (see theorem A.1 in appendix A of the on-line supplementary materials for sufficient conditions on the existence and uniqueness of the limit), we define the uncentred ALE main (which is also known as the first-order) effect function of Xj as (for )(6)
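In symbols, with the superscript K making the dependence of the partition on K explicit, equation (6) takes the following form; this is a reconstruction based on the verbal description in this definition and in the paragraph preceding it:

\[
\tilde f_{j,\mathrm{ALE}}(x) \;=\; \lim_{K\to\infty} \sum_{k=1}^{k_j^K(x)} E\!\left[\, f\!\left(z_{k,j}^K, X_{\setminus j}\right) - f\!\left(z_{k-1,j}^K, X_{\setminus j}\right) \;\middle|\; X_j \in N_j^K(k) \right],
\qquad x \in \left(x_{\min,j},\, x_{\max,j}\right]. \tag{6}
\]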
The following theorem, the proof of which is in appendix A of the on-line supplementary materials, states that, for differentiable f(·), the uncentred ALE main effect of Xj in expression (6) has an equivalent but more revealing form that is analogous to equation (5).
Theorem 1 (uncentred ALE main effect for differentiable f(·))
Let f^j(x) = ∂f(x)/∂xj denote the partial derivative of f(x) with respect to xj when the derivative exists. In definition 1, suppose that
- (a)
f(x) is differentiable in xj on the support of p,
- (b)
f^j(x) is continuous in x on the support of p and
- (c)
E[f^j(Xj, X\j) | Xj = zj] is continuous in zj on [xmin,j, xmax,j].
Then, for each xj ∈ (xmin,j, xmax,j], the limit in expression (6) equals the integral form in equation (7) below.
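A reconstruction of equation (7), analogous to equation (5) above and using the notation of the theorem, is:

\[
\tilde f_{j,\mathrm{ALE}}(x_j) \;=\; \int_{x_{\min,j}}^{x_j} E\!\left[\, f^{j}(X_j, X_{\setminus j}) \;\middle|\; X_j = z_j \right] \mathrm{d}z_j. \tag{7}
\]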
Remark 1. The ALE plot function fj,ALE(xj) attempts to quantify something that is quite similar to the PD plot function fj,PD(xj) in expression (1) and can be interpreted in the same manner. For example, ALE plots and PD plots both have a desirable additive recovery property, i.e., if f(x) = f1(x1) + f2(x2) + ⋯ + fd(xd) is additive, then both fj,ALE(xj) and fj,PD(xj) are equal to the desired true effect fj(xj), up to an additive constant. Hence, a plot of fj,ALE(xj) versus xj correctly reveals the true effect of Xj on f, no matter how black box the function f is. If second-order interaction effects are present in f, a similar additive recovery property holds for the ALE second-order interaction effects that we define next (see Section 5.4 for a more general additive recovery property that applies to interactions of any order). In spite of the similarities in the characteristics of f that they are designed to extract, the differences in the definitions of fj,ALE(xj) and fj,PD(xj) lead to very different methods of estimation. As will be demonstrated in the later sections, the ALE plot functions are estimated in a far more computationally efficient manner that also avoids the extrapolation problem that renders PD plots unreliable with highly correlated predictors.
We next define the ALE second-order effects. For each pair of indices {j, l}⊆{1, 2, …, d}, let X\{j, l} denote the subset of d − 2 predictors excluding {Xj, Xl}, i.e. X\{j, l} = (Xm: m = 1, …, d; m ∉ {j, l}). For general f(·), the uncentred ALE second-order effect of (Xj, Xl) is defined similarly to equation (6), except that we replace the one-dimensional finite differences by two-dimensional second-order finite differences on the two-dimensional grid that is the Cartesian product of the one-dimensional partitions of (xmin,j, xmax,j] and (xmin,l, xmax,l], and the summation is over this two-dimensional grid.
Definition 2 (uncentred ALE second-order effect)
Consider any pair {j, l}⊆{1, …, d} and corresponding sequences of partitions of (xmin,j, xmax,j] and (xmin,l, xmax,l] whose fineness tends to 0 as K→∞. When f(·) and p are such that the following limit exists and is independent of the particular sequences of partitions, we define the uncentred ALE second-order effect function of (Xj, Xl), for (xj, xl) in the support of (Xj, Xl), as the limit in equation (9) below, where the quantity in equation (10) is the second-order finite difference of f with respect to (xj, xl) across cell (k, m) of the two-dimensional grid that is the Cartesian product of the one-dimensional partitions.
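Reconstructions of equations (9) and (10) consistent with this description (a double sum of conditional expectations of second-order finite differences; the superscript K again indexes the partitions) are:

\[
\tilde f_{\{j,l\},\mathrm{ALE}}(x_j, x_l) \;=\; \lim_{K\to\infty} \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E\!\left[\, \Delta_f^{\{j,l\}}\!\left(K; k, m; X_{\setminus\{j,l\}}\right) \;\middle|\; X_j \in N_j^K(k),\, X_l \in N_l^K(m) \right], \tag{9}
\]
where
\[
\Delta_f^{\{j,l\}}\!\left(K; k, m; x_{\setminus\{j,l\}}\right) \;=\;
\left[ f\!\left(z_{k,j}^K, z_{m,l}^K, x_{\setminus\{j,l\}}\right) - f\!\left(z_{k-1,j}^K, z_{m,l}^K, x_{\setminus\{j,l\}}\right) \right]
\;-\;
\left[ f\!\left(z_{k,j}^K, z_{m-1,l}^K, x_{\setminus\{j,l\}}\right) - f\!\left(z_{k-1,j}^K, z_{m-1,l}^K, x_{\setminus\{j,l\}}\right) \right]. \tag{10}
\]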
Analogous to theorem 1, theorem 2 (which is proved in appendix A of the on-line supplementary materials) states that, for differentiable f(·), the uncentred ALE second-order effect of (Xj, Xl) in expression (9) has an equivalent integral form.
Theorem 2 (uncentred ALE second-order effect for differentiable f(·))
Let f^{j, l}(x) = ∂²f(x)/∂xj∂xl denote the second-order partial derivative of f(x) with respect to xj and xl when the derivative exists. In definition 2, suppose that
- (a)
f(x) is differentiable in (xj, xl) on the support of p,
- (b)
f^{j, l}(x) is continuous in x on the support of p and
- (c)
E[f^{j, l}(Xj, Xl, X\{j, l}) | (Xj, Xl) = (zj, zl)] is continuous in (zj, zl) on the support of (Xj, Xl).
Then, for each (xj, xl) in the support of (Xj, Xl), the limit in expression (9) equals the double-integral form in equation (11) below.
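A reconstruction of equation (11), analogous to equation (7), is:

\[
\tilde f_{\{j,l\},\mathrm{ALE}}(x_j, x_l) \;=\; \int_{x_{\min,l}}^{x_l}\!\int_{x_{\min,j}}^{x_j} E\!\left[\, f^{j,l}(X_j, X_l, X_{\setminus\{j,l\}}) \;\middle|\; (X_j, X_l) = (z_j, z_l) \right] \mathrm{d}z_j\,\mathrm{d}z_l. \tag{11}
\]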
If we define the zero-order effect for any function of X as its expected value with respect to p, then we can view the ALE first-order effect of Xj as being obtained by first calculating its uncentred first-order effect (6) and then, for the resulting function, subtracting its zero-order effect. Likewise, the ALE second-order effect of (Xj, Xl) is obtained by first calculating the uncentred second-order effect (9), then, for the resulting function, subtracting both of its first-order effects of Xj and of Xl, and then, for this resulting function, subtracting its zero-order effect. The ALE higher-order effects are defined analogously in appendix B of the on-line supplementary materials. The uncentred higher order effect is first calculated, and then all lower order effects are sequentially calculated and subtracted one order at a time, until the final result has all lower order effects that are identically 0.
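In symbols, the centred first-order effect (referred to elsewhere in the text as equation (8)) is obtained from the uncentred effect by subtracting its zero-order effect; the reconstruction below follows directly from this description, and the centring of the second-order and higher order effects proceeds analogously:

\[
f_{j,\mathrm{ALE}}(x_j) \;=\; \tilde f_{j,\mathrm{ALE}}(x_j) \;-\; E\!\left[\tilde f_{j,\mathrm{ALE}}(X_j)\right]. \tag{8}
\]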
Remark 2. In appendix B of the on-line supplementary materials we define ALE higher order effect functions fJ,ALE(xJ) for |J|>2, where |J| denotes the cardinality of the set of predictor indices J. Appendix C of the on-line supplementary materials shows that this leads to a functional-ANOVA-like decomposition of f via f(x) = ΣJ⊆{1, 2, …, d} fJ,ALE(xJ). This ALE decomposition has a certain orthogonality-like property, which we contrast with other functional ANOVA decompositions in Section 5.6.
Remark 3. In light of the decomposition theorem in remark 2, we can use the following R2-like measures to assess how well f(·) is approximated by its ALE main effects only, by its ALE main plus second-order interaction effects, etc. Specifically, for m ∈ {1, …, d}, we can calculate the proportion of the variance of f(X) that is accounted for by the sum of its ALE effects of order m or lower, to measure how well f(·) is approximated by its ALE effects up to order m. An estimate of this statistic can be obtained by replacing the ALE functions by their corresponding estimators (which are defined in the next section) and taking the variances to be the sample variances as X varies over its training data values. In practice, one would typically compute estimates of this measure only for m ∈ {1, 2}. Note that the ALE decomposition theorem that is mentioned in remark 2 implies that the measure equals 1 when m = d.
Remark 4. The ALE function definitions in this section apply to predictor distributions pj that are continuous numerical with compact support. For discrete pj, one could consider modifying equations (6) and (9) by using a fixed finite partition whose interval end points coincide with the support of pj. We do not develop this, however, because our focus is on estimation and interpretation of the ALE effects, and the estimators in the following section are meaningful for either continuous or discrete pj. In the case that Xj is a nominal categorical predictor, we must decide how to order its categories before estimating its ALE effect (which requires differencing f across neighbouring categories of Xj). In appendix E of the on-line supplementary materials, we discuss a strategy for this that we have found to work well in practice.
3 Estimation of fj,ALE(xj) and f{j, l},ALE(xj, xl)
In appendix D of the on-line supplementary materials we briefly describe how to estimate the ALE higher order effect for a general subset J⊆{1, 2, …, d} of predictor indices. Our focus in this section is on estimating the first-order (|J|=1) and second-order (|J|=2) effects, since these are the most common and useful (i.e. interpretable).
As an overview, the estimate f̂J,ALE(xJ) is obtained by computing estimates of the quantities in equations (6)–(14) for J = j (a single index) or for J = {j, l} (a pair of indices). For the estimates we
- (a)
replace the sequence of partitions in equation (6) (or (9)) by some appropriate fixed partition of the sample range of {xi, j:i = 1, 2, …, n} (or of {(xi, j, xi, l):i = 1, 2, …, n}) and
- (b)
replace the conditional expectations in equation (6) (or (9)) by sample averages across the training observations, conditioned on xi, j (or (xi, j, xi, l)) falling into the corresponding interval or cell of the partition.
In this, xi, \j and xi, \{j, l} denote the ith observation of the subsets of predictors X\j and X\{j, l} respectively.
More specifically, for each j ∈ {1, 2, …, d}, let {Nj(k) = (zk − 1, j, zk, j]:k = 1, 2, …, K} be a sufficiently fine partition of the sample range of {xi, j:i = 1, 2, …, n} into K intervals. Since the estimator is computed for a fixed K, we have omitted it as a superscript on the partition, with the understanding that the partition implicitly depends on K. In all of our examples later in the paper, we chose zk, j as the (k/K)-quantile of the empirical distribution of {xi, j:i = 1, 2, …, n} with z0, j chosen just below the smallest observation, and zK, j chosen as the largest observation. Fig. 3 illustrates the notation and concepts in computing the ALE main effect estimator f̂j,ALE(xj) for the first predictor j = 1 for the case of d = 2 predictors. For k = 1, 2, …, K, let nj(k) denote the number of training observations that fall into the kth interval Nj(k), so that Σk nj(k) = n. For a particular value x of the predictor Xj, let kj(x) denote the index of the interval into which x falls, i.e. x ∈ Nj(kj(x)) = (zkj(x) − 1, j, zkj(x), j].
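In the notation just introduced, the estimator of the uncentred effect and its centred version (referred to in the text as equations (15) and (16)) can be reconstructed as follows, with a tilde denoting the uncentred function of equation (6) and a circumflex denoting estimation from the training data:

\[
\hat{\tilde f}_{j,\mathrm{ALE}}(x) \;=\; \sum_{k=1}^{k_j(x)} \frac{1}{n_j(k)} \sum_{\{i:\, x_{i,j} \in N_j(k)\}} \left[\, f\!\left(z_{k,j},\, x_{i,\setminus j}\right) - f\!\left(z_{k-1,j},\, x_{i,\setminus j}\right) \right], \tag{15}
\]
\[
\hat f_{j,\mathrm{ALE}}(x) \;=\; \hat{\tilde f}_{j,\mathrm{ALE}}(x) \;-\; \frac{1}{n} \sum_{i=1}^{n} \hat{\tilde f}_{j,\mathrm{ALE}}(x_{i,j}). \tag{16}
\]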
Illustration of the notation and concepts in computing the ALE main effect estimator f̂j,ALE(xj) for j = 1 with d = 2 predictors (scatter plot of {(xi, 1, xi, 2):i = 1, 2, …, n} for n = 30 training observations): the range of {xi, 1:i = 1, 2, …, n} is partitioned into K = 5 intervals {N1(k) = (zk − 1, 1, zk, 1]:k = 1, 2, …, 5} (in practice, K should usually be chosen much larger than 5); the numbers of training observations falling into the five intervals are n1(1) = 4, n1(2) = 6, n1(3) = 6, n1(4) = 5 and n1(5) = 9; the horizontal line segments shown in the N1(4) region are the segments across which the finite differences f(z4, 1, xi, 2) − f(z3, 1, xi, 2) are calculated and then averaged in the inner summand of equation (15) for k = 4 and j = 1
To estimate the ALE second-order effect of a pair of predictors (Xj, Xl), we partition the sample range of {(xi, j, xi, l):i = 1, 2, …, n} into a grid of K² rectangular cells obtained as the Cartesian product of the individual one-dimensional partitions. Fig. 4 illustrates the notation and concepts. Let (k, m) (with k and m integers between 1 and K) denote the indices into the grid of rectangular cells with k corresponding to Xj and m corresponding to Xl. By analogy with Nj(k) and nj(k) defined in the context of estimating fj,ALE(·), let N{j, l}(k, m) = Nj(k) × Nl(m) = (zk − 1, j, zk, j]×(zm − 1, l, zm, l] denote the cell that is associated with indices (k, m), and let n{j, l}(k, m) denote the number of training observations that fall into cell N{j, l}(k, m), so that Σk Σm n{j, l}(k, m) = n.

Illustration of the notation used in computing the ALE second-order effect estimator for K = 5: the ranges of {xi, j:i = 1, 2, …, n} and {xi, l:i = 1, 2, …, n} are each partitioned into five intervals, and their Cartesian product forms the grid of rectangular cells {N{j, l}(k, m) = Nj(k) × Nl(m):k = 1, 2, …, 5;m = 1, 2, …, 5}; the cell with bold borders is the region N{j, l}(4, 3); the second-order finite differences in equation (18) for (k, m) = (4, 3) are calculated across the corners of this cell; in the inner summation of equation (17), these differences are then averaged over the n{j, l}(4, 3) = 2 observations in region N{j, l}(4, 3)
Theorems 3 and 4 in appendix A of the on-line supplementary materials show that, under mild conditions, equations (16) and (20) are consistent estimators of the ALE main effect (8) of Xj and ALE second-order effect (14) of (Xj, Xl) respectively.
ALE plots are plots of the ALE effect estimates f̂j,ALE(xj) and f̂{j, l},ALE(xj, xl) versus the predictors that are involved. ALE plots have substantial computational advantages over PD plots, which we discuss in Section 5.5. In addition, ALE plots can produce reliable estimates of the main and interaction effects in situations where PD plots break down, which we illustrate with examples in the next section, as well as an example on real data in Section 5.2.
4 Toy examples illustrating when accumulated local effects plots are reliable but partial dependence plots break down
4.1 Example 1
This example was introduced in Section 1. For this example, d = 2 and n = 200, and (X1, X2) follows a uniform distribution along a segment of the line x2 = x1 with independent N(0, 0.05²) variables added to both predictors. Fig. 5 shows a scatter plot of X2 versus X1. The true response was generated according to a noiseless model in which the effect of X1 is linear and the effect of X2 is quadratic for the 200 training observations in Fig. 5, to which we fit a tree by using the tree package of R (Ripley, 2015). The tree was overgrown and then pruned back to have 100 leaf nodes, which was approximately the optimal number of leaf nodes according to a cross-validation error sum-of-squares criterion. Note that the optimal size tree is relatively large, because the response here is a deterministic function of the predictors with no response observation error. The first eight splits of the fitted tree f(x) are also depicted in Fig. 5. Fig. 6 shows main effect PD plots, marginal plots and ALE plots for the full 100-node fitted tree f(x), calculated via expressions (2), (4) and (15)–(16). For both j = 1 and j = 2, f̂j,ALE(xj) is much more accurate than either f̂j,PD(xj) or f̂j,M(xj). By inspection of the fitted tree in Fig. 5, it is clear why f̂1,PD(x1) performs so poorly in this example. For small x1-values, the PD plot estimate f̂1,PD(x1) is much higher than it should be, because it is based on the extrapolated values of f(x) in the upper left-hand corner of the scatter plot in Fig. 5, which were substantially overestimated because of the nature of the tree splits and the absence of any data in that region. For similar reasons, f̂2,PD(x2) for small x2-values is substantially underestimated because of the extrapolation in the lower right-hand corner of the scatter plot in Fig. 5. In contrast, by avoiding this extrapolation, f̂1,ALE(x1) and f̂2,ALE(x2) are estimated quite accurately and are quite close to the true linear (for x1) and quadratic (for x2) effects, as seen in Figs 6(a) and 6(b) respectively.
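A sketch of this example in R using our ALEPlot package is given below; the specific noiseless response y = x1 + x2², the random seed and the tree-growing control settings are illustrative assumptions (the text specifies only that the true effect of X1 is linear and that of X2 is quadratic, and that the tree was pruned to about 100 leaves by cross-validation):

```r
# Sketch of example 1 (assumed details noted in the lead-in)
library(tree)     # Ripley (2015)
library(ALEPlot)  # the package provided as supplementary material

set.seed(1)
n  <- 200
t  <- runif(n)                      # position along the line segment x2 = x1
x1 <- t + rnorm(n, sd = 0.05)       # independent N(0, 0.05^2) noise on each predictor
x2 <- t + rnorm(n, sd = 0.05)
y  <- x1 + x2^2                     # assumed noiseless response: linear in x1, quadratic in x2
dat <- data.frame(y, x1, x2)

# Overgrow the tree, then prune back to roughly 100 leaf nodes
# (the paper chose approximately 100 leaves by cross-validation)
fit <- tree(y ~ x1 + x2, data = dat,
            control = tree.control(nobs = n, mincut = 2, minsize = 4, mindev = 0))
fit <- prune.tree(fit, best = 100)

# Prediction wrapper in the two-argument form expected by ALEPlot()/PDPlot()
pred.fun <- function(X.model, newdata) as.numeric(predict(X.model, newdata))

X <- dat[, c("x1", "x2")]
ALEPlot(X, fit, pred.fun, J = 1, K = 50)   # ALE main effect of x1 (cf. Fig. 6(a))
ALEPlot(X, fit, pred.fun, J = 2, K = 50)   # ALE main effect of x2 (cf. Fig. 6(b))
PDPlot(X, fit, pred.fun, J = 1, K = 50)    # PD main effect of x1, for comparison
```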

Depiction of the first eight splits in the tree fitted to the example 1 data: (a) scatter plot of x2 versus x1 showing splits corresponding to the truncated tree in (b)

For the tree model fitted to the example 1 data, plots of f̂j,PD(xj), f̂j,M(xj), f̂j,ALE(xj) and the true main effect of Xj for (a) j = 1, for which the true effect of X1 is linear, and (b) j = 2, for which the true effect of X2 is quadratic: for both j = 1 and j = 2, f̂j,ALE(xj) is much more accurate than either f̂j,PD(xj) or f̂j,M(xj)
Also note that the marginal plots in Fig. 6 perform very poorly. As expected, because of the strong correlation between X1 and X2, f̂1,M(x1) and f̂2,M(x2) are quite close to each other and are each combinations of the true effects of X1 and X2. In the subsequent examples, we do not further consider marginal plots.
4.2 Example 2
This example is a modification of example 1 having the same d = 2 and n = 200, and (X1, X2) following a uniform distribution along a segment of the line x2 = x1 with independent N(0, 0.05²) variables added to both predictors. However, the true response is now generated as noisy observations according to the same underlying model with independent noise ɛ∼N(0, 0.1²) added, and we fit a neural network model instead of a tree. For the neural network, we used the nnet package of R (Venables and Ripley, 2002) with 10 nodes in the single hidden layer, a linear output activation function and a decay or regularization parameter of 0.0001, all of which were chosen as approximately optimal via multiple replicates of tenfold cross-validation (the cross-validation R2 for this model varied between 0.965 and 0.975, depending on the data set that was generated, which is quite close to the theoretical R2-value of 1−var(ɛ)/var(Y) = 0.972). We repeated the procedure in a Monte Carlo simulation with 50 replicates, where on each replicate we generated a new training data set of 200 observations and refit the neural network model with the same tuning parameters mentioned above. The estimated main effect functions f̂j,ALE(xj) and f̂j,PD(xj) (for j = 1, 2) over all 50 replicates are shown in Fig. 7. For this example also, f̂j,ALE(xj) is far superior to f̂j,PD(xj). On every replicate, f̂1,ALE(x1) and f̂2,ALE(x2) are quite close to the true linear and quadratic effects respectively. In sharp contrast, f̂1,PD(x1) and f̂2,PD(x2) are so inaccurate on many replicates that they are of little use in understanding the true effects of X1 and X2.

Comparison of the estimated ALE main effect functions f̂j,ALE(xj) and PD main effect functions f̂j,PD(xj) (j = 1, 2) for neural network models fitted over 50 Monte Carlo replicates of the example 2 data, shown together with the true effect functions (linear for X1 and quadratic for X2)
Fig. 8 illustrates why the much larger variability of the PD plots in Fig. 7 results from their requirement of severe extrapolation of f(·). Fig. 8(a) shows a point that is inside the training data envelope and a second point that falls far outside the envelope. The box plots in Fig. 8(b) show that the variability of the predicted response from the fitted neural networks is very large at the extrapolated location but quite reasonable at the interior location. Since PD plots require evaluating f(·) at many points outside the data envelope (see Fig. 1), the high variability of these extrapolated f(·) values translates to high variability of the PD plots.

Illustration of how extrapolation results in high variability of the PD plots in Fig. 7: (a) training data envelope and two points at which PD plots must evaluate the neural network’s predicted response, one falling inside and the other falling outside the data envelope; (b) box plots of the predicted responses at these two points for the fitted neural network models across 50 Monte Carlo replicates
5 Discussion
5.1 The wrong and right ways to interpret accumulated local effects plots (and partial dependence plots)
This section provides a word of caution on how not to interpret ALE plots when the predictors are highly correlated, which also applies to interpreting PD plots. Reconsider example 1, in which X1 and X2 are highly correlated (see Fig. 5), and the ALE and PD plots are as in Fig. 6. The wrong way to interpret the ALE plot is that it implies that, if we fix x1 at some value (say) and then vary x2 over its entire range, the response (of which f(x) is a predictive model) is expected to vary as in Fig. 6(b). And this interpretation is wrong even if f(x) is truly additive in the predictors. Indeed, varying x2 over its entire range with x1 held fixed would take x far outside the envelope of the training data to which f was fitted, as can be seen in Fig. 5. f(x) is obviously unreliable for this level of extrapolation, which was the main motivation for ALE plots, and we have highly uncertain knowledge of the hypothetical values of the response far outside the data envelope.
However, ALE plots are still very useful if we interpret them correctly, and the correct interpretation is illustrated in Fig. 9. In this toy example, suppose that f(x) = Σj fj(xj) is additive, the effect of Xj is the quadratic function fj(xj) = xj², and Xj is highly correlated with the other predictors. Fig. 9(a) shows the local effects of Xj within each bin, for K = 10 equally spaced bins over the support [0, 1]. Here, the local effect within a bin is defined as the summand in the equation (6) definition of the uncentred ALE effect, which represents the average change in f(X) as Xj changes from the left end point to the right end point of the bin.
Illustration of the right way to interpret ALE plots, for an example in which f(x) = Σj fj(xj) is additive with quadratic fj(xj) = xj², and K = 10 equally spaced bins over the support [0, 1] were used: (a) local effects of Xj within each bin (i.e. the summand in equation (6)) (the local effects are local and require no extrapolation outside the data envelope); (b) the accumulated function gj,ALE(xj), which can be viewed as piecing together (or accumulating) the local effects in a manner that facilitates easier visualization of the underlying global effect, which is quadratic in this case
The local effects are exactly that—local—and require no extrapolation beyond the envelope of the data, since the changes in f(X) are averaged across the conditional distribution of X\j, given that Xj falls in that bin. Consequently, if the bin widths are not too small and f is not too noisy, a local effect plot like that in Fig. 9(a) could be interpreted to reveal the effect of Xj on f.
However, the effect of Xj is much easier to visualize if we accumulate the local effects via equation (6) and plot the accumulated function instead, as in Fig. 9(b). Aside from vertical centring, this is exactly the ALE plot, and it is best viewed as a way of piecing together the local effects in a manner that provides easier visualization of the underlying global effect of a predictor. The additive recovery property that is discussed in Section 5.4 provides even stronger justification for this manner of piecing together the local effects. Namely, if f is additive, the ALE plot manner of piecing together the local effects produces the correct global effect function fj(xj). Of course, we must still keep in mind that the global effect may only hold when the set of predictors x jointly fall within the data envelope.
5.2 Illustration with a bike sharing real data example
We now show an example with a real, larger data set (see appendix F of the on-line supplementary materials for some preliminary exploratory data analysis and model diagnostics). The data are a compilation of the bike sharing rental counts from the Capital Bikeshare system (Washington DC, USA) over the 2-year period 2011–2012, aggregated hourly, together with hourly weather and seasonal information over the same time period. The data file can be found at https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset (Fanaee-T and Gama, 2014). After removing 46 observations with missing entries, there are n = 17333 cases or rows in the training data set corresponding to 17333 h of data. We use the following d = 10 predictors: year (X1, categorical with two categories: 0≡2011; 1≡2012), month (X2, treated as numerical: 1≡ January; 2≡ February, …; 12≡ December), hour (X3, treated as numerical: {0, 1, …, 23}), holiday (X4, categorical: 0≡ non-holiday; 1≡ holiday), weekday (X5, treated as numerical: {0, 1, …, 6} representing day of a week with 0≡ Sunday), workingday (X6, categorical: 1≡ neither weekend nor holiday; 0≡ otherwise), weather situation (X7, treated as numerical: {1, 2, 3, 4}, smaller values correspond to nicer weather situations), atemp (X8, numerical: feeling temperature in degrees Celsius), hum (X9, numerical: humidity) and windspeed (X10, numerical: wind speed). We do not use date and season in the data file as predictors since they are dependent on the other predictors. We also exclude the temp-variable (representing actual temperature) since it has a correlation coefficient of above 0.99 with atemp. The set of retained predictors are still highly correlated. For example, month and temperature are highly correlated, and so are weekday and workingday. Finally, the response variable is the total number of bike rental counts in each hour. We fit the supervised learning models to the logarithm of the rental counts, since the rental counts are highly right skewed, and since using the log-transformation resulted in slightly better cross-validation R2 (in terms of predicting the untransformed rental counts). The function f(x) for all the ALE and PD plots is the predicted hourly rental count, obtained by taking the exponential of the predicted logarithm of the rental count from the fitted supervised learning models.
We fit a neural network model (see appendix F of the on-line supplementary materials for similar results for a random forest, instead of a neural network) using the R nnet package (Venables and Ripley, 2002) with 25 nodes in the single hidden layer (size=25 ), a linear output activation function (linout=TRUE) and a regularization parameter of 0.001 (decay=0.001). These parameters are approximately optimal according to multiple replicates of threefold cross-validation (cross-validation R2≈0.92). (The cross-validation R2 is computed for predicting the rental counts on their original scale, rather than their log-transformed versions.) Fig. 10, Fig. 11(a) and Fig. 12 show ALE main and interaction effects plots for various predictors. We used K = 100 for both the main effect plots and the interaction plots. We note that the exact results vary somewhat each time that the model is fitted, which is likely to be due to the random initial guesses for the neural network parameters in the nnet package. In Section 5.3 we account for this and other sources of uncertainty to show that the interpretations that we discuss in this section are meaningful.
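One possible R implementation of this fit and of the plot calls is sketched below; the file name hour.csv and the column names are those of the UCI repository version of the data, all ten predictors are kept numeric for simplicity (whereas year, holiday and workingday are treated as categorical in the analysis above), atemp is left in the normalized units of the raw file rather than converted to degrees Celsius, and maxit is an arbitrary choice:

```r
# Sketch of the bike sharing fit (assumed file/column names from the UCI data)
library(nnet)
library(ALEPlot)

bike <- na.omit(read.csv("hour.csv"))
X <- bike[, c("yr", "mnth", "hr", "holiday", "weekday",
              "workingday", "weathersit", "atemp", "hum", "windspeed")]
y.log <- log(bike$cnt)              # fit to the log of the hourly rental counts

nn <- nnet(y.log ~ ., data = data.frame(X, y.log = y.log),
           size = 25, linout = TRUE, decay = 0.001, maxit = 1000, trace = FALSE)

# f(x) for the plots: the exponential of the predicted log-count
pred.fun <- function(X.model, newdata) exp(as.numeric(predict(X.model, newdata)))

ALEPlot(X, nn, pred.fun, J = 8, K = 100)        # main effect of atemp (X8), cf. Fig. 11(a)
ALEPlot(X, nn, pred.fun, J = c(3, 7), K = 100)  # hour x weather interaction, cf. Fig. 12(a)
PDPlot(X, nn, pred.fun, J = 8, K = 100)         # PD plot for atemp, cf. Fig. 11(b)
```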
For the bike sharing example with f(x) the exponential of a neural network model for predicting log-transformed hourly bike rental counts, ALE main effect plots for (a) month (X2), (b) hour of the day (X3), (c) weather situation (X7) and (d) wind speed (X10): the zero-order effects have been included, i.e. the plots are of f̂j,ALE(xj) + E[f(X)]
For the bike sharing data example with neural network predicted counts for f(x), (a) ALE main effect plot and (b) PD main effect plot for feeling temperature X8: both plots include the zero-order effect E[f(X)]; the two plots differ substantially, and the ALE plot better agrees with intuition
ALE second-order interaction plots for the predictors hour (X3) and weather situation (X7) (a) without and (b) with the main effects of X3 and X7 included: (a) plots f̂{3, 7},ALE(x3, x7), and (b) plots E[f(X)] + f̂3,ALE(x3) + f̂7,ALE(x7) + f̂{3, 7},ALE(x3, x7); the numbers on the contours are the function values
The ALE main effect plots are shown for month (X2), hour (X3), weather situation (X7) and wind speed (X10) in Fig. 10, and for feeling temperature (X8) in Fig. 11(a). All the ALE main effect plots provide clear interpretations of the (main) effects of the predictors. For the effect of month (X2), the number of rentals is lowest in January and gradually increases month by month until it peaks in September (month 9), after which it declines in the winter months. For the effect of hour of the day (X3), the number of rentals increases until it first peaks at the morning rush hour around 8.00 a.m. (hour 8), after which it decreases to moderate levels over the late morning and early afternoon hours, and then peaks again at the evening rush hour around 5.00–6.00 p.m. (hours 17–18). For the effect of weather situation (X7), the number of rentals monotonically decreases as the weather situation worsens. Recall that a larger value of X7 corresponds to worse weather conditions. For the effect of wind speed (X10), the number of rentals also monotonically decreases as the wind speed increases. For the effect of atemp (X8, in Fig. 11(a)), the number of rentals steadily increases as atemp (i.e. feeling temperature) increases up until about 23–24 °C (73.4–75.2 °F), after which it steadily decreases. This makes perfect sense, since a feeling temperature of 23–24 °C might be considered as nearly optimal for comfortably biking around a city (note that feeling temperature takes into account factors such as humidity and breeze, so the actual temperature typically would be somewhat lower), and feeling temperatures that are either substantially higher or lower than this will make bike rental less appealing for many people.
In comparison, Fig. 11(b) shows the PD main effect plot for feeling temperature, which is substantially different from the ALE main effect plot in Fig. 11(a), even though the two are for the exact same fitted neural network model. The difference is due to the high correlation between feeling temperature and some of the other predictors and the resulting extrapolation that makes PD plots unreliable. In this case, the PD plot indicates that the number of bike rentals increases as feeling temperature increases up until about 25–26 °C (77–78.8 °F) and then only slightly decreases as the feeling temperature increases far beyond this. In particular, the PD plot implies that the numbers of bike rentals at extremely high feeling temperatures such as 40 and 50 °C (104 and 122 °F) are about the same as the numbers of bike rentals at 20 and 14 °C (68 and 57.2 °F) respectively. Most people would consider the latter to be mild and comfortable biking temperatures, and the former to be too hot. The ALE plot for feeling temperature in Fig. 11(a), which implies that bike rentals will decrease drastically as the feeling temperature increases far beyond the comfortable range, is in better agreement with intuition. In addition to providing more reliable interpretations, the ALE plots are much faster to compute than the PD plots (see Section 5.5).
Fig. 12 shows two versions of the ALE second-order interaction effect plot for the hour and weather situation predictors X3 and X7, without and with the main effects of X3 and X7 included. Fig. 12(b) plots E[f(X)] + f̂3,ALE(x3) + f̂7,ALE(x7) + f̂{3, 7},ALE(x3, x7), whereas Fig. 12(a) plots only f̂{3, 7},ALE(x3, x7). Generally speaking, including the main effects as in Fig. 12(b) provides a clearer picture of the joint effects of two predictors, whereas excluding the main effects as in Fig. 12(a) allows the overall magnitude of the interaction effect to be more easily assessed. Our ALEPlot R package enables either to be plotted.
The interaction ALE plot in Fig. 12 reveals an interesting relationship and is an important supplement to the main effects ALE plots for X3 and X7 in Fig. 10. Consider the decrease in bike rental counts that is due to larger weather situation values (i.e. less pleasant weather). If there were no interactions, this decrease would be the same regardless of the hour or the levels of the other predictors. But, from Fig. 12(a), there is clearly a strong interaction between X3 and X7, since the contour values vary over a range of about 120 units (from −60 to 60), which is even larger than the range of the weather situation main effect (from 110 to 190) in Fig. 10. The effect of weather situation, which is a decrease in rentals as weather situation increases, is clearly amplified during the rush hour peaks, and in general at hours when the overall rental counts are expected to be higher. As always, we must be careful about interpreting interaction plots when main effects are not included. From Fig. 12(a), at some hours (e.g. around hour 0, which is midnight) the interaction effect f̂{3, 7},ALE(x3, x7) increases as weather situation increases. However, when the main effects of X3 and X7 are included as in Fig. 12(b), it is clear that increasing weather situation decreases bike rentals at any hour.
It makes sense that the effects of weather situation are amplified by the effects of hour (on which the overall bike rental counts depend heavily), and this example illustrates how visualizations like this can aid the model building process by suggesting modifications of the model that one might consider. For example, Fig. 12(b) indicates that the effects of weather situation and/or hour on rental counts might be better modelled as multiplicative.
Remark 5. For the main effect plot of month (X2), we might expect the values for January and December to match approximately. However, the two values differ substantially in Fig. 10. This is likely to be because we have only 2 years of data, and the operation had grown steadily and substantially over this 2-year period. As a result the fitted neural network (and the fitted random-forest model that is described in appendix F of the on-line supplementary materials) was unable to distinguish the seasonal effect of month from the overall growth trend, even with year included as an additional predictor, which explains why the December value for the main effect of month is higher than the January value. This is a deficiency of the data and fitted model f(·), and not a deficiency of the ALE plot itself. It is likely that, if f(·) was fitted to more years of data (e.g. 5 or 6 years), the fitted model could have distinguished the monthly seasonal effect from the overall growth trend, in which case the ALE plots for f(·) would have reflected this. In general, if the fitted supervised learning model has inaccuracies or deficiencies for any reason (e.g. due to outliers, insufficient data, corrupted data or optimization problems during model fitting), ALE plots will not overcome this. However, when the fitted f(·) does capture the true effects within the training data envelope (as in the simulated examples in Section 4), ALE plots can distinguish the effects of multiple predictors in situations where PD plots fail because of extrapolation.
Remark 6. A related topic in understanding the effects of predictors is assessment of the relative importance of the individual predictors. A crude way to achieve this is by looking at the ranges of the ALE main effect functions. For example, in Fig. 10, the ALE main effect of hour f̂3,ALE(x3) spans a range of approximately 420, whereas the other ALE main effect functions have ranges less than 100. We can therefore conclude that hour is the most important predictor, at least in terms of its main effect. A serious shortcoming of this approach is that it does not take into account interactions between predictors. To assess the total importance of a predictor, we must also look at the ranges of its higher order ALE interaction effects with all the other predictors. We are currently working on a set of ALE-related variable importance measures that capture both the main and the total effects on the basis of this idea, which we expect to implement in a future version of our ALEPlot package.
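A minimal sketch of this crude range-based measure, reusing the X, nn and pred.fun objects from the bike sharing sketch above and assuming that ALEPlot() (invisibly) returns the estimated effect values in a component named f.values, is:

```r
# Crude main-effect importance: range of each estimated ALE main effect (remark 6)
ale.range <- sapply(seq_along(X), function(j) {
  ale <- ALEPlot(X, nn, pred.fun, J = j, K = 100)   # also draws the plot for predictor j
  diff(range(ale$f.values))                         # range of the estimated main effect
})
names(ale.range) <- names(X)
sort(ale.range, decreasing = TRUE)   # hour should dominate (a range of roughly 420 in Fig. 10)
```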
5.3 Uncertainty quantification of accumulated local effects plots
Throughout the paper, for notational simplicity we have omitted any circumflex symbols over the supervised learning model f(·). However, one should keep in mind that it is always estimated from data, and in this section we make the distinction between some ‘true’ f(·) (e.g. E[Y | X = x] for regression problems or P(Y = 1 | X = x) for binary classification) versus its estimate f̂(·) computed from a set of training data. Two general sources of errors in the estimated ALE plot functions are
- (a)
differences between f̂(·) and f(·) (due to the usual errors when fitting supervised learning models to training data) and
- (b)
differences between the ALE plot estimator and the exact ALE plot function (due to sampling variation in the training X-values that are used when computing the ALE plot estimator) when both are applied to f̂(·), and f̂(·) is viewed as given.
Analytical results accounting for the latter might be straightforward to develop, but general results for the former would be intractable because of the complexity and generality of supervised learning models. Because ignoring the former and accounting only for the latter would give a misleading and false sense of security, we use a straightforward bootstrapping approach in this section to account for both sources. Regarding differences between f̂(·) and f(·), the bootstrapping approach below accounts for variability error only and does not account for any bias error that results from the structure of the supervised learning model being unable to represent the true f(·). The bias error is impossible to account for in a general sense.
For bootstrapping ALE main effect functions, we first draw B bootstrap samples of the training data, where the bth bootstrap sample consists of n response and predictor observations drawn with replacement from the original training data. We then fit supervised learning models f̂(b)(·) to each of these bootstrap samples and, for each b = 1, 2, …, B and each j, compute ALE plot function estimates f̂(b)j,ALE(xj). Fig. 13 shows B = 50 bootstrapped ALE main effect plots for hour X3 and feeling temperature X8 in the bike sharing example of Section 5.2, along with their mean functions and their mean ± 1 standard deviation functions, where the means and standard deviations are taken across the 50 bootstrap replicates (pointwise in xj). From Fig. 13, the bootstrapped effect of feeling temperature seems to have more uncertainty than that of hour, which is likely to be due to the higher level of multicollinearity between feeling temperature and the other predictors and/or the relative lack of data at feeling temperatures below around −10 °C or above around 40 °C. Regardless, all bootstrap replicates of both effects display the same features and trends as those plotted in Section 5.2. In particular, all 50 replicates of the feeling temperature effect exhibit the same phenomenon whereby bike rentals decrease substantially when feeling temperature increases beyond about 24 °C. Hence, our interpretations and conclusions about the predictor effects for the bike sharing example appear to be statistically significant.
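A sketch of this bootstrap for the feeling temperature effect, again reusing the bike sharing objects above, is given below; because the interval end points are quantiles of each bootstrap sample, the replicated curves are interpolated onto a common grid before the pointwise averaging, which is an implementation detail assumed here rather than taken from the paper:

```r
# Bootstrap of the ALE main effect of atemp (B = 50), as in Section 5.3
B  <- 50
xg <- seq(min(X$atemp), max(X$atemp), length.out = 200)   # common evaluation grid
fmat <- matrix(NA, nrow = length(xg), ncol = B)

for (b in 1:B) {
  idx  <- sample(nrow(X), replace = TRUE)                  # bootstrap sample of rows
  datb <- data.frame(X[idx, ], y.log = y.log[idx])
  nnb  <- nnet(y.log ~ ., data = datb, size = 25, linout = TRUE,
               decay = 0.001, maxit = 1000, trace = FALSE)
  aleb <- ALEPlot(X[idx, ], nnb, pred.fun, J = 8, K = 100) # also draws a plot per replicate
  fmat[, b] <- approx(aleb$x.values, aleb$f.values, xout = xg, rule = 2)$y
}

matplot(xg, fmat, type = "l", lty = 1, col = "grey",
        xlab = "atemp", ylab = "ALE main effect")
lines(xg, rowMeans(fmat), lwd = 2)                          # pointwise bootstrap mean
lines(xg, rowMeans(fmat) + apply(fmat, 1, sd), lty = 2)     # mean + 1 standard deviation
lines(xg, rowMeans(fmat) - apply(fmat, 1, sd), lty = 2)     # mean - 1 standard deviation
```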

Bootstrap estimates of ALE main effect functions for (a) hour (X3) and (b) feeling temperature (X8) in the bike sharing data example (see Section 5.2): the 50 bootstrap-estimated ALE main effect functions are shown together with their pointwise average and the pointwise average ± 1 standard deviation
5.4 Paired differencing and additive recovery
The ALE functions have an attractive additive recovery property that was mentioned in remark 1. Suppose that f(x) = f1(x1) + f2(x2) + ⋯ + fd(xd) is an additive function of the individual predictors. Then it is straightforward to show that the ALE main effects are fj,ALE(xj) = fj(xj) for j = 1, 2, …, d, up to an additive constant, i.e. the ALE effects recover the correct additive functions. More generally, the following result states that higher order ALE effects fJ,ALE(xJ) have a similar additive recovery property.
5.4.1 Additive recovery for accumulated local effects plots
Suppose that f is of the form f(x) = Σ{J:|J| ⩽ k} fJ(xJ) for some 1 ⩽ k ⩽ d, i.e. f has interactions of order k, but no higher order interactions than that. Then, for every subset J with |J| = k, fJ,ALE(xJ) equals fJ(xJ) plus some functions that are of strictly lower order than k. In other words, for every J with |J| = k, the ALE effect returns the correct kth-order interaction fJ(xJ), since the presence of strictly lower order functions does not alter the interpretation of a kth-order interaction.
The proof of this additive recovery property for ALE plots follows directly from the decomposition theorem in appendix C of the on-line supplementary materials. It also follows that, if the functions fJ(xJ) in the expression for f(x) are adjusted so that each has no lower order ALE effects, then fJ,ALE(xJ) = fJ(xJ) for each J⊆{1, 2, …, d}. Although PD plots have a similar additive recovery property (see below), marginal plots have no such property. For example, if f is additive and the predictors are dependent, then each marginal plot function fj,M(xj) may be a combination of the main effects of many predictors. As discussed previously, this can be viewed as the omitted variable bias in regression, whereby a regression of Y on (say) X1, omitting a correlated nuisance variable X2 on which Y also depends, will bias the effect of X1 on Y.
The mechanism by which ALE plots avoid this omitted nuisance variable bias is illustrated in Fig. 14, for the same example depicted in Figs 1, 2 and 5. First note that the marginal plot functions in this example are severely biased, because the marginal plot function averages f itself (as opposed to its derivative) with respect to the conditional distribution of X2 given X1 = x1 (see equations (3) and (4)). For example, for the marginal plot f1,M(x1) considered in Fig. 6, the averaged function is biased by the functional dependence of f(x) on the correlated nuisance variable X2. In contrast with averaging the function f itself, the ALE effect estimated via expressions (15)–(16) averages only the local effect of f represented by the paired differences f(zk, 1, xi, 2) − f(zk − 1, 1, xi, 2) in equation (15). As illustrated in Fig. 14, this paired differencing is what blocks out the effect of the correlated nuisance variable X2. Continuing the example, the paired differences for the ALE plot completely block out the effect of X2, so the accumulated effect is correct.

Illustration of how, when estimating f̂1,ALE(x1), the differences f(zk, 1, xi, 2) − f(zk − 1, 1, xi, 2) and f(zk′, 1, xi, 2) − f(zk′ − 1, 1, xi, 2) in equation (15) are paired differences that block out the nuisance variable X2: here, k = k1(0.3) and k′ = k1(0.8)
5.4.2 Multiplicative recovery for accumulated local effects plots for independent subsets of predictors
Suppose that the model is of the form f(x) = fJ(xJ) f\J(x\J) for some J⊂{1, 2, …, d} with XJ independent of X\J. In this case it is straightforward to show that the ALE |J|-order interaction effect of XJ is proportional to fJ(xJ) plus some functions of strictly lower order, i.e. the ALE |J|-order interaction effect recovers the correct function fJ(xJ), except for a multiplicative constant and the additive presence of strictly lower order functions.
5.4.3 Comparison with partial dependence plots
PD plots also have the same additive and multiplicative recovery properties that were just discussed. Moreover, for multiplicative f of this form, PD plots have multiplicative recovery (up to a multiplicative constant) even when XJ and X\J are dependent (Hastie et al., 2009). Although it is probably desirable to have multiplicative recovery when XJ and X\J are independent, it is unclear whether multiplicative recovery is even desirable if XJ and X\J are dependent.
For example, suppose that f(x) = x1x2 with X1 (J = {1}) and X2 (∖J = {2}) standard normal random variables with correlation coefficient ρ. It is straightforward to show that f1,ALE(x1) = ρ(x1² − 1)/2, f2,ALE(x2) = ρ(x2² − 1)/2 and f{1, 2},ALE(x1, x2) = x1x2 up to lower order functions, compared with f1,PD(x1) = 0, f{1, 2},PD(x1, x2) = x1x2 and f2,PD(x2) = 0. Because of the strong interaction, it is essential to look at the second-order interaction effects to understand the functional dependence of f(·) on the predictors. Both f{1, 2},ALE(x1, x2) and f{1, 2},PD(x1, x2) correctly recover the interaction, up to lower order functions of the individual predictors. Regarding the main effects, however, the picture is more ambiguous. First, it is important to note that, with a strong interaction and dependent predictors, it is unclear whether the main effects are even meaningful. And it is equally unclear whether the PD main effect f1,PD(x1) is any more or less meaningful than the ALE main effect f1,ALE(x1). If X1 and X2 were independent in this example, then it would probably be desirable to view the main effects of X1 and X2 as 0, but in this case f1,ALE(x1) and f1,PD(x1) would actually be in agreement. The situation is murkier with dependent predictors in this example. The local effect of X1 depends strongly on the value of x2, in that it is amplified for larger |x2| and changes sign if x2 changes sign. Thus, if ρ is large and positive, the local effect of X1 is positive for x1 > 0 and negative for x1 < 0, which is the local effect of a quadratic relationship. In this case one might argue that the quadratic f1,ALE(x1) is more revealing than f1,PD(x1) = 0. However, the debate is largely academic, because when strong interactions between dependent predictors are present the lower order effects should not be interpreted in a vacuum.
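The stated forms follow directly from equation (7); a short derivation for the main effect of X1 under the assumed bivariate normal model (so that E[X2 | X1 = z1] = ρz1, and any lower integration limit is absorbed into the constant) is sketched here:

\[
\frac{\partial f(x_1, x_2)}{\partial x_1} = x_2, \qquad
\tilde f_{1,\mathrm{ALE}}(x_1) = \int^{x_1} E[X_2 \mid X_1 = z_1]\,\mathrm{d}z_1 = \int^{x_1} \rho z_1 \,\mathrm{d}z_1 = \tfrac{1}{2}\rho x_1^2 + \text{constant},
\]
\[
f_{1,\mathrm{ALE}}(x_1) = \tfrac{1}{2}\rho x_1^2 - E\!\left[\tfrac{1}{2}\rho X_1^2\right] = \tfrac{1}{2}\rho\left(x_1^2 - 1\right), \qquad
f_{1,\mathrm{PD}}(x_1) = E[\,x_1 X_2\,] = x_1\,E[X_2] = 0.
\]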
5.5 Computational advantages of accumulated local effects plots over partial dependence plots
For general supervised learning models f(x), ALE plots have an enormous computational advantage over PD plots. Suppose that we want to compute f_{J,ALE}(x_J) and f_{J,PD}(x_J) for one subset J⊆{1, 2, …, d} over a grid in the x_J-space with K discrete locations for each variable. Computation of f_{J,ALE}(x_J) over this grid requires a total of 2^{|J|} n evaluations of the supervised learning model f(x) (see expressions (15)–(20) or (D.1) in the on-line appendix for |J|>2). In comparison, computation of f_{J,PD}(x_J) over this grid requires a total of nK^{|J|} evaluations of f(x). For example, for K = 50, PD main effects and second-order interaction effects require respectively 25 and 625 times more evaluations of f(x) than do the corresponding ALE effects. Moreover, as we discuss in appendix E of the on-line supplementary materials, the evaluations of f(x) can be easily vectorized (in R, for example, by appropriately calling the predict function that is built into most supervised learning packages).
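As an illustration of the vectorization just mentioned, a PD main effect over a K-point grid can be computed with a single vectorized call to predict; the fitted model object fit and the data frame X of training predictors are assumptions of this sketch, not part of the ALEPlot interface.

# Sketch of a vectorized PD main effect computation (illustrative; assumes a
# fitted model `fit` with a predict method and a data frame X of predictors).
K <- 50; j <- 1; n <- nrow(X)
grid  <- seq(min(X[, j]), max(X[, j]), length.out = K)
X_rep <- X[rep(1:n, times = K), ]              # n*K rows in total
X_rep[, j] <- rep(grid, each = n)              # each grid value paired with every row
f_vals <- predict(fit, newdata = X_rep)        # all n*K evaluations in one call
f_PD   <- rowMeans(matrix(f_vals, nrow = K, byrow = TRUE))  # PD effect at each grid value

The corresponding ALE computation needs only the 2n rows formed from each observation's cell boundaries (compare the sketch after Fig. 14).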
Also note that the number of evaluations of f(x) for ALE plots does not depend on K, which is convenient. As n increases, the observations become denser, in which case we may want the fineness of the discretization (i.e. K) to increase as well. If we choose K ∝ n^{1/|J|} (which results in the same average number of observations per cell as n increases), then the number of evaluations of f(x) is O(n) for ALE plots versus O(n²) for PD plots.
For the bike sharing example in Section 5.2, we implemented the ALE and PD plots by using our R package ALEPlot on a Windows™ laptop with an Intel® Core™ i7-7600U 2.80-GHz central processing unit. The ALE main effect plots and the ALE second-order interaction plots each took less than 1 s to compute. In comparison, with the same K = 100, the PD main effect plots took about 5 s each, and the PD interaction plots (which were not shown in Section 5.2) took about 8 min each. The PD plot computational expense is proportional to K for main effects and to K² for second-order interaction effects, whereas the ALE plot computational expense is largely independent of K. In particular, the ALE interaction plots were orders of magnitude faster to compute than the PD interaction plots (less than 1 s versus about 8 min).
5.6 Relation to functional analysis of variance with dependent inputs
In the context of the closely related problem of functional ANOVA with dependent input (i.e. predictor) variables, the extrapolation issue that motivated ALE plots has been considered previously. Hooker (2007) proposed a functional ANOVA decomposition of f(x) into component functions by adopting the approach of Stone (1994) of using weighted integrals in the function approximation optimality criterion and in the component function orthogonality constraints. Hooker (2007) used the joint density p(x) of the predictors as the weighting function, which indirectly avoids extrapolation of f(x) in regions in which there are no training observations, because any such extrapolations are assigned little or no weight. The resulting ANOVA component functions are hierarchically orthogonal under the correlation inner product, in the sense that f_{u,ANOVA}(X_u) and f_{J,ANOVA}(X_J) are uncorrelated when u⊂J. However, f_{u,ANOVA}(X_u) and f_{J,ANOVA}(X_J) are not uncorrelated for general u ≠ J.
In comparison, we show in appendix C of the on-line supplementary materials that the ALE decomposition that was mentioned in remark 2 has the following orthogonality-like property. For each J⊆{1, 2, …, d}, let H_J denote the operator that maps a function f to its ALE effect f_{J,ALE}, i.e. H_J(f) = f_{J,ALE} (see appendix B of the on-line supplementary materials for details). The collection of operators {H_J: J⊆{1, 2, …, d}} behaves like a collection of operators that project onto orthogonal subspaces of an inner product space. Namely, if ‘∘’ denotes the composite function operator, then H_J ∘ H_J = H_J, and H_u ∘ H_J ≡ 0 for each u⊆{1, 2, …, d} with u ≠ J.
In other words, the ALE |u|-order effect of the predictors X_u for the function f_{J,ALE} is identically 0 when u ≠ J; and the ALE |J|-order effect of the predictors X_J for the function f_{J,ALE} is the same function f_{J,ALE}. For example, for any pair of predictors {X_j, X_l}, the ALE main effect of any predictor (including X_j or X_l) for the function f_{{j,l},ALE}(x_j, x_l) is identically 0. Thus each ALE second-order interaction effect function has ALE main effects that are all identically 0. Likewise, each ALE main effect function has ALE second-order interaction effects that are all identically 0. And, for the function f_{{j,l},ALE}(x_j, x_l), the ALE second-order interaction effect of any other pair of predictors is identically 0. Similarly, the ALE first- and second-order effects for any ALE third-order effect function are all identically 0, and vice versa.
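As a small check of the idempotence property on the example of Section 5.4.3 (our own calculation, taking the ALE main effect there to be ρ(x_1² − 1)/2), applying the ALE main effect operator to that function returns the same function:

\[
h(x_1) = \tfrac{\rho}{2}(x_1^2 - 1), \qquad
\frac{\mathrm{d}h}{\mathrm{d}x_1} = \rho x_1, \qquad
E[\rho X_1 \mid X_1 = z_1] = \rho z_1,
\]
\[
\int_{x_{\min,1}}^{x_1} \rho z_1 \,\mathrm{d}z_1
\;-\; E\!\left[\int_{x_{\min,1}}^{X_1} \rho z_1 \,\mathrm{d}z_1\right]
= \tfrac{\rho}{2}x_1^2 - \tfrac{\rho}{2}E[X_1^2]
= \tfrac{\rho}{2}(x_1^2 - 1) = h(x_1),
\]

which is consistent with composing the ALE main effect operator with itself leaving the main effect function unchanged.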
For visualizing the effects of the predictors in black box supervised learning models, the correlation orthogonality in other functional ANOVA decompositions may be less relevant and less useful than the ALE pseudo-orthogonality. As discussed in Roosen (1995), if the predictors are dependent, it may even be preferable to impose artificially a product structure in the functional ANOVA decomposition to avoid conflating direct and indirect effects of a predictor, and this will typically result in ANOVA component functions that are no longer uncorrelated. For example, suppose that f(x) = f_1(x_1) + f_2(x_2) with f_2(x_2) = x_2, and X_1 and X_2 are correlated. Any functional ANOVA decomposition that gives uncorrelated main effects will not give the correct main effects f_1(x_1) and f_2(x_2) = x_2 that are needed to understand the true functional dependence of f(x) on x_1 and x_2. In contrast, the ALE and PD main effect functions are the correct functions f_1(x_1) and f_2(x_2) = x_2, up to an additive constant (see Section 5.4). Functional ANOVA can be coerced into producing the correct main effects f_{1,ANOVA}(x_1) = f_1(x_1) and f_{2,ANOVA}(x_2) = x_2 by artificially imposing a product distribution for (X_1, X_2), but then the ANOVA component functions are no longer uncorrelated. Moreover, artificially imposing a product distribution in functional ANOVA still leaves the extrapolation problem that plagues PD plots and that motivated ALE plots and the work of Hooker (2007).
In addition, practical implementation is far more cumbersome for functional ANOVA decompositions than for ALE (or PD) plots, for multiple reasons. First, the joint density p(x) of the predictors must be estimated in the functional ANOVA approach of Hooker (2007), which is problematic in high dimensions. In contrast, the ALE effect estimators (15)–(20) involve summations over the training data but require no estimate of p(x). Second, each ALE plot effect function can be computed one at a time by using straightforward and computationally efficient calculations that involve only finite differencing, averaging and summing. In contrast, the functional ANOVA component functions must be computed simultaneously, which requires the solution of a complex system of equations. Follow-up work by Li and Rabitz (2012), Chastaing et al. (2012) and Rahman (2014) improved the solution techniques, sometimes restricting the component ANOVA functions to be expansions in basis functions such as polynomials and splines, but these approaches are more restrictive (perhaps negating the benefits of fitting a black box supervised learning model in the first place), as well as more cumbersome and computationally expensive than ALE plots.
A closely related approach for predictor effect visualization would be to fit a generalized additive model (GAM) (see for example Hastie and Tibshirani (1986) and Wood (2017)) to f(·), as a surrogate model, and then to plot the individual GAM component functions. In fact, if we include interaction functions of all orders in the GAM, it can be viewed as a specific type of functional ANOVA decomposition, but without any explicit orthogonality-like constraints. In this regard, the computational and conceptual advantages of ALE plots over GAMs are similar to their advantages over functional ANOVA that were discussed above. In particular, the individual ALE effect functions can be computed one at a time, and the computed functions do not depend on which other ALE main or interaction effect functions are computed. In contrast, the computed GAM functions depend heavily on which other GAM functions (main effects, second-order interaction effects, third-order interaction effects, etc.) are included when the GAM is fitted to f(·). One might be tempted to include second- and higher order interaction functions in the GAM, but this would have serious computational disadvantages when there are more than a few predictors. However, fitting a GAM with only main effects would obviously not allow interactions to be visualized. In addition, because the GAM approach fits a surrogate model to f(·), it requires selecting all the tuning parameters and basis functions of the GAM surrogate model. In contrast, the ALE approach operates directly on f(·) without involving any surrogate model.
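For comparison, the surrogate-GAM alternative just described can be sketched as follows; the mgcv package, the fitted black box model fit, and the predictor names x1, x2 and x3 are assumptions of this illustration, not part of our method.

# Sketch of the surrogate-GAM alternative (illustrative; assumes the mgcv
# package, a fitted black box model `fit` with a predict method, and a data
# frame X of predictors with columns x1, x2, x3).
library(mgcv)
dat      <- X
dat$fhat <- as.numeric(predict(fit, newdata = X))            # black box predictions
surrogate <- gam(fhat ~ s(x1) + s(x2) + s(x3), data = dat)   # main-effects-only surrogate
plot(surrogate, pages = 1)   # plot the estimated smooth component functions

As discussed above, the plotted components depend on which terms are included in the surrogate, whereas each ALE effect function can be computed and interpreted on its own.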
6 Conclusions
For visualizing the effects of the predictor variables in black box supervised learning models, PD plots are the most widely used method. The ALE plots that we have proposed in this paper are an alternative that has two important advantages over PD plots. First, by design, ALE plots avoid the extrapolation problem that can render PD plots unreliable when the predictors are highly correlated (see Figs 6, 7 and 11). Second, ALE plots are substantially less computationally expensive than PD plots, requiring only 2^{|J|} n evaluations of the supervised learning model f(x) to compute each ALE effect f_{J,ALE}(x_J), compared with nK^{|J|} evaluations to compute each PD effect f_{J,PD}(x_J). In light of this, we suggest that ALE plots should be adopted as a standard visualization component in supervised learning software. We have also provided, as supplementary material, an R package ALEPlot that implements the ALE plots. As future research, based on ALE plot concepts, we are developing a set of variable importance measures to quantify the relative importance of individual predictors, in terms of both their main effect importance and their total effect importance. We plan to implement these in a future version of the ALEPlot package.
Supporting information
Additional ‘supporting information’ may be found in the on-line version of this article:
‘Appendix’.
Acknowledgements
This work was supported in part by Air Force Office of Scientific Research grant #FA9550-18-1-0381, which the authors gratefully acknowledge. The authors also thank the reviewers and Joint Editor for many helpful comments that have improved this paper.