Accurately computing the log-sum-exp and softmax functions

Evaluating the log-sum-exp function or the softmax function is a key step in many modern data science algorithms, notably in inference and classification. Because of the exponentials that these functions contain, the evaluation is prone to overflow and underflow, especially in low-precision arithmetic. Software implementations commonly use alternative formulas that avoid overflow and reduce the chance of harmful underflow, employing a shift or another rewriting. Although mathematically equivalent, these variants behave differently in floating-point arithmetic and shifting can introduce subtractive cancellation. We give rounding error analyses of different evaluation algorithms and interpret the error bounds using condition numbers for the functions. We conclude, based on the analysis and numerical experiments, that the shifted formulas are of similar accuracy to the unshifted ones, so can safely be used, but that a division-free variant of softmax can suffer from loss of accuracy.


Introduction
In many applications, especially in a wide range of machine learning classifiers such as multinomial linear regression and naive Bayes classifiers (Williams & Barber, 1998; Murphy, 2012; Calafiore et al., 2019), one needs to compute an expression of the form

    f(x) = log Σ_{i=1}^n e^{x_i},    (1.1)

where x = [x_1, x_2, ..., x_n]^T ∈ R^n and log is the natural logarithm. The function f : R^n → R is often referred to as log-sum-exp or LSE. Its gradient g : R^n → R^n, given by

    g_j(x) = e^{x_j} / Σ_{i=1}^n e^{x_i},  j = 1:n,    (1.2)

is called softmax and is also a key function in classification algorithms (Efron & Hastie, 2016, p. 355; Goodfellow et al., 2016, p. 78). It is often the case that log-sum-exp and softmax are required simultaneously.

The most obvious danger in evaluating (1.1) and (1.2) is overflow. We are interested in IEEE arithmetic in the precisions half (fp16), single (fp32) and double (fp64) (IEEE, 2019), as well as the bfloat16 half-precision format (Intel Corporation, 2018). Table 1 shows the key parameters of interest for these precisions: the unit roundoff u, the largest finite number r_max and the smallest positive normalized and subnormal floating-point numbers, r_min and r_min^(s). If some x_i exceeds the relevant log r_max value in Table 2 then overflow will occur in evaluating e^{x_i}. Clearly, overflow is possible even for quite modestly sized x, especially for half and single precision.
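To see the thresholds concretely, here is a small NumPy illustration (ours, not from the paper) of e^x overflowing well within the representable range of x:

```python
import numpy as np

# exp overflows once x exceeds log(r_max): about 88.7 in fp32
# and about 11.09 in fp16, although x itself is tiny by comparison.
with np.errstate(over="ignore"):
    print(np.exp(np.float32(89.0)))   # inf
    print(np.exp(np.float16(12.0)))   # inf
```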
Underflow is also possible. For example, for n = 1, if x_1 is a finite floating-point number with x_1 < log(r_min^(s)/2) then fl(f(x_1)) = fl(log(fl(e^{x_1}))) = fl(log 0) = −∞ with round to nearest, whereas f(x_1) = x_1. For n > 1, underflow is damaging when all the x_i are less than log r_min^(s). As well as avoiding harmful underflow it is desirable to avoid generating subnormal numbers, which incur a performance penalty if handled in software; see Higham (2002) or Muller et al. (2018) for details of subnormal numbers.
A way to avoid overflow, and to attempt to avoid harmful underflow and subnormal numbers, in evaluating log-sum-exp is to rewrite

    log Σ_{i=1}^n e^{x_i} = log Σ_{i=1}^n e^a e^{x_i − a} = log ( e^a Σ_{i=1}^n e^{x_i − a} ).

Hence

    f(x) = a + log Σ_{i=1}^n e^{x_i − a}.    (1.3)

We note that (1.3) remains valid for complex x_i with log the principal logarithm (the one whose imaginary part lies in (−π, π]), provided that a ∈ R (Aprahamian & Higham, 2014, Lem. 2.5). The softmax function can be expressed in a related form:

    g_j(x) = e^{x_j − a} / Σ_{i=1}^n e^{x_i − a},  j = 1:n.    (1.4)

This shifting, typically with a = max_i x_i, is a well-known way to attempt to avoid overflow and underflow in the evaluation of f and g, described in many places, including on Wikipedia, in blog posts and even in a YouTube video. The functions logsumexp in SciPy 1.3.1 (Jones et al., 2001) and LogSumExp in R (R Core Team) both implement (1.3) with a = max_i x_i. The function softmax in the MATLAB Deep Learning Toolbox (R2019b) (Deep Learning Toolbox) uses (1.4) with a = max_i x_i.

An alternative to (1.4), which removes the denominator of (1.2) by rewriting it as e^y and moving it into the numerator, is

    g_j(x) = e^{x_j − y},  y = f(x),  j = 1:n.    (1.5)

The conciseness of this division-free formula makes it attractive for implementing softmax when a log-sum-exp function is available. This formula is used in the SciPy 1.4.1 function softmax, in a MATLAB toolbox (Matlab Code for Machine Learning Algorithms in Book PRML) associated with the book Bishop (2006), in the internal function softmax in the MATLAB Statistics and Machine Learning Toolbox (R2019b) (Statistics and Machine Learning Toolbox) and in Wang et al. (2018); in each case the log-sum-exp term is computed by (1.3) with a = max_i x_i. Formula (1.5) can also be found in codes posted in online communities such as Stack Exchange. Because of the importance of the log-sum-exp and softmax functions, great efforts are made to optimize their implementations in software (Czaja et al., 2019) and hardware (Wang et al., 2018).
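For concreteness, the shifted formulas (1.3) and (1.4) and the division-free variant (1.5) can be sketched in NumPy as follows (our illustration; the function names are ours):

```python
import numpy as np

def logsumexp_shifted(x):
    # (1.3) with a = max_i x_i: the exponentials cannot overflow.
    a = np.max(x)
    return a + np.log(np.sum(np.exp(x - a)))

def softmax_shifted(x):
    # (1.4): shifted numerator and denominator.
    w = np.exp(x - np.max(x))
    return w / np.sum(w)

def softmax_divfree(x):
    # (1.5): division-free, but analyzed below as potentially less accurate.
    return np.exp(x - logsumexp_shifted(x))

x = np.array([1000.0, 1001.0, 1002.0])   # naive e^{x_i} overflows even in fp64
print(logsumexp_shifted(x))              # about 1002.4076
print(softmax_shifted(x).sum())          # 1.0 to within roundoff
```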
Yet we are not aware of any investigation of the accuracy of the different formulas in floating-point arithmetic, and indeed the accuracy properties are not clear. In particular, when a = max_i x_i < 0, y in (1.3) is computed as the sum of two terms of opposite sign, so there could potentially be damaging subtractive cancellation that appears not to be present in (1.1). We note that Guo et al. (2020, sec. 4.3) limit the size of a in (1.4), stating that the shift causes loss of accuracy by up to a factor e^a.
In this work we carry out a rounding error analysis of the unshifted formulas (1.1) and (1.2), the shifted formulas (1.3) and (1.4), and the division-free formula (1.5), in order to determine which choices of formulas give the best accuracy and reliability. We relate the error bounds to the conditioning of f and g. We show that the shifted formulas have broadly similar error bounds to the unshifted ones and so are entirely appropriate for practical use. We find, however, that the alternative softmax formula (1.5) has a less favorable error bound than the shifted formula (1.4) and tends to produce larger errors in practice.
We begin, in the next section, by investigating the conditioning of the log-sum-exp and softmax functions. In Section 3 we give detailed rounding error analyses of the basic formulas. In Section 4 we analyze the shifted formulas (1.3) and (1.4) together with (1.5) and compare their error bounds with those for the unshifted formulas. Numerical experiments are given in Section 5 to test the accuracy of the evaluations and also to examine how the sum of the computed softmax vector entries compares with the exact value 1. Conclusions are given in Section 6.
From now on we write

    x_max = max_i x_i,  x_min = min_i x_i.    (1.6)

We will use the standard model of floating-point arithmetic (Higham, 2002, sec. 2.2): fl(x op y) = (x op y)(1 + δ), |δ| ≤ u, op ∈ {+, −, ×, /}.

Condition numbers and forward stability
Before considering algorithms for computing log-sum-exp and softmax we investigate the conditioning of these functions, that is, the sensitivity of f (x) and g(x) in (1.1) and (1.2) to small perturbations in x.
We define the condition number of f in the usual way (see, e.g., Higham, 2008, chap. 3), by

    cond(f, x) = lim_{ε→0} sup_{‖Δx‖ ≤ ε‖x‖} |f(x + Δx) − f(x)| / (ε|f(x)|).    (2.1)

This definition implies that

    |f(x + Δx) − f(x)| / |f(x)| ≤ cond(f, x) ‖Δx‖/‖x‖ + o(‖Δx‖),

so that cond(f, x) measures the worst-case relative change in f corresponding to a small relative change in x. Since ∇f(x) = g(x), the softmax vector, whose entries are positive and sum to 1, it is easy to show that for the ∞-norm

    cond_∞(f, x) = ‖x‖_∞ ‖∇f(x)‖_1 / |f(x)| = ‖x‖_∞ / |f(x)|.    (2.2)

We identify two extreme cases. First, the condition number is infinite when x_i = −log n for all i, because then f(x) = 0. Hence when x_i ≈ −log n for all i the condition number must be large. Secondly, f(x) ≥ max_i x_i, so if max_i x_i = max_i |x_i| then cond_∞(f, x) ≤ 1 and the problem is perfectly conditioned.
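Since the gradient of f is the softmax vector, whose entries sum to 1, the ∞-norm condition number reduces to ‖x‖_∞/|f(x)|, which is easy to sanity-check numerically (our sketch, not from the paper):

```python
import numpy as np

def lse(x):
    a = np.max(x)
    return a + np.log(np.sum(np.exp(x - a)))

x = np.array([-2.0, 0.5, 1.0])
f = lse(x)
eps = 1e-7
# Worst-case perturbation for the infinity norm: dx_i = eps*||x||_inf*sign(g_i);
# here every softmax entry g_i > 0, so dx is a constant vector.
dx = eps * np.max(np.abs(x)) * np.ones_like(x)
est = abs(lse(x + dx) - f) / (eps * abs(f))   # finite-difference estimate
cond = np.max(np.abs(x)) / abs(f)             # formula (2.2)
print(cond, est)                              # the two values agree closely
```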
A forward stable algorithm for computing log-sum-exp is one for which the relative error of the computed result is bounded by p(n) cond(f, x) u, for some low-degree polynomial p. Ideally, we would like the algorithm that we use to be forward stable. To see whether it is reasonable to expect forward stability consider the case n = 1. Then y = f(x) = log e^x = x, so cond(f, x) = 1: the problem is perfectly conditioned. When we compute f using standard library functions we can expect to obtain relative errors in the computed exponential and logarithm bounded by u (de Dinechin et al., 2007; Muller, 2016; Muller et al., 2018, Chap. 10), that is,

    ŷ = log( e^x (1 + δ_1) )(1 + δ_2),  |δ_1|, |δ_2| ≤ u.    (2.3)

The term 1 + δ_2 causes just a small relative perturbation of y, so we have

    ŷ ≈ log( e^x (1 + δ_1) ) = x + log(1 + δ_1),

and hence

    |y − ŷ| / |y| ≲ u / |x|.    (2.4)

This relative error bound is much larger than u for |x| ≪ 1, even though the problem is perfectly conditioned. So it is not reasonable to expect an algorithm to be unconditionally forward stable in floating-point arithmetic. For this trivial computation, backward error and forward error are the same, so we also conclude that we cannot expect to obtain an algorithm that is unconditionally backward stable.
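The n = 1 case is easy to reproduce (our NumPy illustration): in fp16, with u ≈ 4.9 × 10^−4, the value e^x rounds to 1 for x = 10^−4 and all information about x is lost:

```python
import numpy as np

x = np.float16(1e-4)
y = np.log(np.exp(x))   # exp(x) rounds to 1 in fp16, so the log returns 0
print(y)                # 0.0, although the exact answer is x itself
```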
The softmax function has condition number

    cond(g, x) = lim_{ε→0} sup_{‖Δx‖ ≤ ε‖x‖} ‖g(x + Δx) − g(x)‖ / (ε‖g(x)‖),

which is given explicitly by

    cond(g, x) = ‖x‖ ‖G(x)‖ / ‖g(x)‖.

Here the n × n matrix G(x) = (∂g_i/∂x_j) is the Jacobian of g and ‖·‖ denotes any vector norm and the corresponding subordinate matrix norm. We note in passing that G is the Hessian of f and can be shown to be symmetric positive semidefinite for all x (Boyd & Vandenberghe, 2004, p. 74). Now cond(g, x) can be arbitrarily large.
We also note that shifting, as in (1.3) and (1.4), does not change the functions so does not change their condition numbers; likewise for (1.5). These reformulations may, of course, affect the accuracy of the floating-point evaluation.

Basic algorithms and error analysis
Algorithm 3.1 gives a naive implementation of (1.1) and (1.2).
Algorithm 3.1 Given x ∈ R^n this algorithm computes f(x) = log Σ_{i=1}^n e^{x_i} and the gradient g(x) = ∇f(x):

    w_i = e^{x_i}, i = 1:n
    s = Σ_{i=1}^n w_i
    y = log s
    g_i = w_i / s, i = 1:n

What can be said about the accuracy of this algorithm when it is implemented in floating-point arithmetic? To answer this question we carry out a rounding error analysis. Throughout this section we assume that there is no overflow or underflow.
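Algorithm 3.1 translates directly into NumPy (our sketch; the function name is ours):

```python
import numpy as np

def logsumexp_naive(x):
    # Algorithm 3.1: unshifted, so vulnerable to overflow and underflow.
    w = np.exp(x)        # w_i = e^{x_i}
    s = np.sum(w)        # s = sum_i w_i
    y = np.log(s)        # y = f(x)
    g = w / s            # g = softmax(x)
    return y, g

y, g = logsumexp_naive(np.array([0.0, 0.0]))
print(y)                 # log 2 = 0.6931...
print(g)                 # [0.5 0.5]
```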

First we consider the error in evaluating the sum of positive terms

    s = Σ_{i=1}^n w_i,  w_i = e^{x_i}.

Evaluating w_i = e^{x_i} yields a computed result satisfying

    ŵ_i = w_i (1 + δ_1),    (3.1)

where, as noted in Section 2, we can expect the relative error from the exponential evaluation to satisfy |δ_1| ≲ u. Therefore |ŵ_i − w_i| ≲ u w_i. Write the (exact) sum of computed quantities as

    s̃ = Σ_{i=1}^n ŵ_i.

The rounding error analysis in Higham (1993) and Higham (2002, sec. 4.2) shows that the computed sum ŝ satisfies

    |ŝ − s̃| ≤ (n − 1) u Σ_{i=1}^n |ŵ_i| + O(u²).    (3.2)

Here and throughout this paper, the O(u²) term is innocuous in that it becomes significant only when the corresponding first-order term is already so large that it provides no useful information; see Blanchard et al. (2020, sec. 2.1.2) for details of these terms for summation. Writing ŝ − s = (ŝ − s̃) + (s̃ − s) we obtain

    |ŝ − s| ≤ n u s + O(u²).    (3.3)

Then the computed log-sum-exp is

    ŷ = fl(log ŝ) = log(ŝ)(1 + δ_2) = ( y + log(ŝ/s) )(1 + δ_2),  |δ_2| ≤ u,    (3.4)

which gives the following result.
Theorem 3.2 (Basic log-sum-exp algorithm). In the absence of overflow and underflow the computed log-sum-exp ŷ from Algorithm 3.1 satisfies

    ŷ = y + ε,  |ε| ≤ (|y| + n + 1) u + O(u²).    (3.5)

This bound is a factor (|y| + n + 1)/‖x‖_∞ larger than cond_∞(f, x) u in (2.2). But |y| ≤ |x_max| + log n from (1.1) (or see (4.2) below), so this factor is bounded by 1 + (n + 1 + log n)/‖x‖_∞. Hence we have forward stability as long as ‖x‖_∞ ≳ 1, but for ‖x‖_∞ ≪ 1 the bound does not guarantee forward stability. This is consistent with the bound (2.4) for the case n = 1.
Turning to the evaluation of the softmax function g from its definition (1.2), by (3.1) we have

    ĝ_j = fl(ŵ_j/ŝ) = (ŵ_j/ŝ)(1 + δ_2),  |δ_2| ≤ u,

where δ_2 accounts for the division, and so by (3.3),

    ĝ_j = ( w_j(1 + δ_1) / (s(1 + θ_n)) )(1 + δ_2),  |θ_n| ≤ nu + O(u²).

Therefore

    ĝ_j = g_j(1 + φ_j),  |φ_j| ≤ (n + 2)u + O(u²).    (3.6)

This bound guarantees a relative error of order at most nu in every component of ĝ. We weaken the bound into a normwise bound for the next theorem.
Theorem 3.3 (Basic softmax algorithm). In the absence of overflow and underflow the computed softmax ĝ from Algorithm 3.1 satisfies

    ‖ĝ − g‖_∞ / ‖g‖_∞ ≤ (n + 2)u + O(u²).

While the error bounds of Theorems 3.2 and 3.3 have a very satisfactory form, they provide no useful information when n ≳ 1/u, and for fp16 this happens for n as small as 2048. We note, however, that the n terms, which come from the summation, are pessimistic. It is shown by Higham & Mary (2019, Thm. 3.1) and Higham & Mary (2020, Thm. 2.5) that, under probabilistic models of rounding errors, n in the error bound for summation can be replaced by a small constant multiple of √n with high probability, and the same holds for the bounds of Theorems 3.2 and 3.3.
Next consider the alternative formula (1.5), which we rewrite here:

    g_j = e^{x_j − y},  y = f(x),  j = 1:n.    (3.7)

With y = f(x) evaluated in floating-point arithmetic by Algorithm 3.1, we obtain

    fl(x_j − ŷ) = (x_j − ŷ)(1 + δ_1),  |δ_1| ≤ u,    (3.8)

and then

    ĝ_j = e^{(x_j − ŷ)(1 + δ_1)}(1 + δ_2),  |δ_2| ≤ u,    (3.9)

so that

    ĝ_j = g_j e^{(y − ŷ) + (x_j − ŷ)δ_1}(1 + δ_2) = g_j( 1 + (y − ŷ) + (x_j − ŷ)δ_1 + δ_2 + O(u²) ),    (3.10)

where, using Theorem 3.2,

    |y − ŷ| ≤ (|y| + n + 1)u + O(u²).

We summarize this result as follows.
Theorem 3.4 (Alternative softmax algorithm). In the absence of overflow and underflow the computed ĝ from (3.7) with the log-sum-exp computed by Algorithm 3.1 satisfies

    ‖ĝ − g‖_∞ / ‖g‖_∞ ≤ ( |y| + max_j |x_j − y| + n + 2 ) u + O(u²).    (3.11)

From (4.2) and (4.3) below, using the notation (1.6), we have |y| + max_j |x_j − y| ≤ |x_max| + |x_max − x_min| + 2 log n. Hence (3.11) is less favorable than (3.6) when x_max − x_min ≫ n or |x_max| ≫ n. The analysis therefore suggests that (1.2) should be preferred to (1.5).
To give an intuitive explanation for the potential inaccuracy in (3.7) we refer to the steps leading to (3.10). A large absolute error in the argument of the exp in (3.9) may lead to a large relative error in the result. This effect can be traced back to the appearance of x j − y in (3.8).
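This mechanism is simple to observe in isolation (our illustration): a small absolute perturbation δ in the argument of exp produces a relative error of about δ in the value, so an absolute error in the computed y reappears directly as a relative error in e^{x_j − y}:

```python
import numpy as np

t = -30.0        # plays the role of x_j - y, which can be large in magnitude
delta = 1e-6     # plays the role of the absolute error in the computed y
rel_err = abs(np.exp(t + delta) - np.exp(t)) / np.exp(t)
print(rel_err)   # about 1e-6: the argument error transfers to a relative error
```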

Algorithms with shifting
Now we consider the use of shifts in the log-sum-exp and softmax evaluations in order to avoid overflow and reduce the chance of harmful underflow. We are particularly interested to see whether the potential cancellation caused by the shift can lead to numerical instability.
Recall definition (1.6) of x_max and x_min. Overflow in the exponential evaluations in (1.3) is certainly avoided if we take a = x_max, as we then have x_i − a ≤ 0 and hence 0 ≤ e^{x_i − a} ≤ 1 for all i. We can rewrite (1.3) as

    y = x_max + log( 1 + Σ_{i=1, i≠k}^n e^{x_i − x_max} ),    (4.1)

where x_k = x_max (if there is more than one such k, we can take any of them). From this expression we see that

    x_max ≤ y ≤ x_max + log n.    (4.2)

It follows that when x_max ≥ 0 the sum 'x_max + log(·)' that produces y in (4.1) cannot suffer cancellation. Note that for n = 1, (4.1) trivially provides the exact result y = x_max, in contrast to the basic formula (1.1).
For later use we note that (4.2) implies that, for any j,

    |x_j − y| ≤ x_max − x_min + log n.    (4.3)

The log term in (4.1) has the form log(1 + s), where s ≥ 0. If s is very small then 1 + s will round to 1 and the logarithm will evaluate as zero, even though log(1 + s) ≈ s ≠ 0. To avoid this loss of information we will use the function log1p(s) = log(1 + s) provided in, for example, C, MATLAB and NumPy. These functions guarantee an accurate result for small s (which can be achieved with a simple formula based on log; Hewlett-Packard, 1982; Higham, 2002, Prob. 1.5). The improved accuracy brought by log1p for small s is likely to benefit ŷ when x_max ≈ −s.
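A quick double precision example of the phenomenon (ours):

```python
import numpy as np

s = 1e-17
print(np.log(1.0 + s))   # 0.0: 1 + s rounds to 1 and the information in s is lost
print(np.log1p(s))       # 1e-17, correct to full double precision
```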
These considerations lead to Algorithm 4.1. Note that while it is important to avoid forming 1 + s for the f -evaluation, for g we can safely form 1 + s because if s is small it has little influence on g.
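In code, Algorithm 4.1 can be sketched as follows (our NumPy rendering of the shifted algorithm; the names are ours):

```python
import numpy as np

def logsumexp_softmax_shifted(x):
    # Algorithm 4.1: shift by a = x_max and use log1p for f.
    a = np.max(x)
    k = np.argmax(x)
    w = np.exp(x - a)             # w_k = e^0 = 1 exactly; no overflow possible
    s = np.sum(np.delete(w, k))   # s = sum_{i != k} e^{x_i - a} >= 0
    y = a + np.log1p(s)           # (4.1): accurate even when s is tiny
    g = w / (1.0 + s)             # (1.4): forming 1 + s is safe for g
    return y, g

y, g = logsumexp_softmax_shifted(np.array([1000.0, 1001.0, 1002.0]))
print(y)                          # about 1002.4076
print(g.sum())                    # 1.0 to within roundoff
```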
Algorithm 4.1 avoids overflow. If underflow occurs in the exponential then it is in a term of the sum added to 1 in (4.1), so that term is negligible and the underflow is harmless. Note, in particular, that if x_i ≈ x̄ < log r_min^(s) for all i then the shifted algorithm returns an accurate result, whereas Algorithm 3.1 suffers total underflow.

The main question is how shifting affects the accuracy of the evaluations. We give a rounding error analysis to assess this question. The analysis is a generalization of that in the previous section for the unshifted algorithm. We first examine the error in evaluating the sum of non-negative terms

    s = Σ_{i=1, i≠k}^n w_i,  w_i = e^{x_i − a},  a = x_max = x_k.    (4.4)

Evaluating w_i = e^{x_i − a} yields a computed result satisfying

    ŵ_i = e^{(x_i − a)(1 + δ_1)}(1 + δ_2),  |δ_1|, |δ_2| ≤ u,

and hence

    ŵ_i = w_i( 1 + (x_i − a)δ_1 + δ_2 + O(u²) ),  so that  |ŵ_i − w_i| ≤ (1 + a − x_i)u w_i + O(u²).

Assuming for notational simplicity that k = n we can write the (exact) sum of computed quantities as

    s̃ = Σ_{i=1}^{n−1} ŵ_i.

The rounding error analysis in Higham (1993) and Higham (2002, sec. 4.2) shows that the computed sum ŝ satisfies

    |ŝ − s̃| ≤ u Σ_{i=1}^{n−2} |t_i| + O(u²),

where t_i = Σ_{j=1}^{i+1} ŵ_j, so that, since w_i ≥ 0,

    |ŝ − s| ≤ u Σ_{i=1}^{n−1} (1 + a − x_i + n − i) w_i + O(u²).    (4.5)

Hence

    |ŝ − s| ≤ (n + a − x_min)u s + O(u²),    (4.6)

which guarantees an accurate computed sum as long as n + x_max − x_min is not too large. The final stage of the computation is to evaluate y = x_max + log(1 + s) using the computed ŝ, for which we have

    ŷ = ( x_max + log(1 + ŝ)(1 + δ_3) )(1 + δ_4),  |δ_3|, |δ_4| ≤ u.

Here we are assuming that the log1p function has the property fl(log1p(s)) = log1p(s)(1 + δ), |δ| ≤ u.
Ignoring the innocuous δ_4 term and writing, by (4.6),

    ŝ = s + η,  |η| ≤ (n + a − x_min)u s + O(u²),    (4.7)

we have

    ŷ = x_max + ( log(1 + s) + η/(1 + s) + O(η²) )(1 + δ_3),

using a Taylor series expansion of the logarithm about 1 + s. Hence

    ŷ = y + η/(1 + s) + log(1 + s)δ_3 + O(u²).

Bounding η using (4.7) gives

    |ŷ − y| ≤ (n + a − x_min)u s/(1 + s) + log(1 + s)u + O(u²),    (4.8)

or, as a relative error bound, since s ≥ 0,

    |ŷ − y| / |y| ≤ ( n + a − x_min + log(1 + s) ) u / |y| + O(u²).    (4.9)

Simplifying the bound gives the next result.
Theorem 4.2 (Shifted log-sum-exp algorithm). The computed log-sum-exp ŷ from Algorithm 4.1 satisfies

    ŷ = y + ε,  |ε| ≤ |y + n − x_min| u + O(u²).    (4.10)

The main question is how this result compares with Theorem 3.2 for the unshifted algorithm. The only difference in the bounds is that |y| + n + 1 in (3.5) is replaced by |y + n − x_min| here. Now |y + n − x_min| ≫ |y| + n is possible only if x_min ≪ 0 and x_min ≪ x_max, so let us assume that these two inequalities hold. The term |y + n − x_min| comes from bounding the terms (1 + a − x_i + n − i)w_i in (4.5), where w_i is defined in (4.4) and x_i = x_min, and if x_min ≪ x_max then w_i = e^{x_i − a} = e^{x_min − x_max} ≪ 1. Hence the potentially large constant is mitigated by the w_i term that it multiplies, something that is lost in the manipulations to achieve a readable bound. We conclude that shifting should have little effect on the accuracy.
We note that (4.10) is weaker than necessary when s ≪ 1 (recall that s ≥ 0), which happens when x_i ≪ x_max for all i ≠ k, since we bounded s/(1 + s) by 1 in going from (4.8) to (4.9). If s ≪ 1 then (4.8) becomes

    |ŷ − y| ≤ (n + 1 + a − x_min)u s + O(u²).

Since s ≪ 1 also implies x_i ≪ x_max for i ≠ k and hence y ≈ x_max, we then have

    |ŷ − y| / |y| ≲ (n + 1 + a − x_min) s u / |y|,

which is a factor s smaller than (4.10).
Turning to the evaluation of the softmax function g from the shifted formula (1.4) we have, using (4.6),

    ĝ_j = ( ŵ_j / fl(1 + ŝ) )(1 + δ_3),  ŵ_j = e^{(x_j − a)(1 + δ_1)}(1 + δ_2),  |δ_1|, |δ_2|, |δ_3| ≤ u,

where δ_2 corresponds to the exponential evaluation and δ_3 to the division, and

    ĝ_j = g_j( 1 + (x_j − a)δ_1 + δ_2 + δ_3 + θ + O(u²) ),  |θ| ≤ (n + 1 + a − x_min)u.

Therefore, since |x_j − a| ≤ a − x_min,

    |ĝ_j − g_j| ≤ ( n + 3 + 2(a − x_min) ) u g_j + O(u²).

Hence we have obtained the following result.
Theorem 4.3 (Shifted softmax algorithm). The computed ĝ from Algorithm 4.1 satisfies

    ‖ĝ − g‖_∞ / ‖g‖_∞ ≤ ( n + 3 + 2(x_max − x_min) ) u + O(u²).

Again, this is broadly commensurate with Theorem 3.3 for the unshifted evaluation, bearing in mind the comments following Theorem 4.2.
Finally we consider (1.5) with the log-sum-exp computed by Algorithm 4.1. In floating-point arithmetic we have the same equation (3.8) as for the unshifted algorithm, but now, using (4.10),

    ĝ_j = g_j(1 + θ),

with θ bounded by

    |θ| ≤ ( |y + n − x_min| + |x_j − y| + 1 ) u + O(u²).

We have obtained the following result.
Theorem 4.4 (Alternative shifted softmax algorithm). The computed ĝ from (1.5) with the log-sum-exp computed by Algorithm 4.1 satisfies

    ‖ĝ − g‖_∞ / ‖g‖_∞ ≤ ( |y + n − x_min| + max_j |x_j − y| + 1 ) u + O(u²).

This is broadly similar to Theorem 3.4 for the unshifted alternative softmax algorithm.

Computational experiments
We now perform some experiments in a realistic setting, using MATLAB R2020a. The codes and data used for the experiments are available online. Our aims are to examine the sharpness of the rounding error bounds and to give a pairwise comparison of the accuracy of the algorithms in floating-point arithmetic. Our data come from a deep learning application. To generate the data we first set up and trained an artificial neural network, using the MATLAB Deep Learning Toolbox (Deep Learning Toolbox). More precisely, we trained a network to classify handwritten digit data from the widely used MNIST data set (LeCun et al.). Each data point is a 28 × 28 image of a handwritten digit. The network uses the default architecture from the Deep Learning Toolbox, where further details may be found.
The network was trained on 7500 images (750 from each of the ten categories), with 2500 further images (250 from each of the ten categories) used for validation.
The network takes as input a 28 × 28 matrix corresponding to the pixels in the image and returns a non-negative 10 × 1 vector whose ith component may be interpreted as the probability that the image came from category i. If we categorize according to the highest probability from the output then the trained network misclassified 27 of the 2500 validation images, corresponding to a 98.9% success rate.
The network uses single precision arithmetic, fp32. In our experiments we are concerned only with floating-point arithmetic issues, and we treat the trained network as a means to produce a realistic data set. To do this we extracted the 2500 single precision vectors from the validation set that were passed into the softmax layer and converted them to fp16 or bfloat16. We then used this data in our implementation of the softmax and log-sum-exp algorithms that we have studied in the previous sections.
To record errors in computed results we applied the basic algorithm, Algorithm 3.1, in single precision to provide a reference solution and used the chop function of Higham & Pranesh (2019) to simulate half-precision arithmetic, in both the fp16 format and the bfloat16 format.
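The chop-based setup can be approximated in NumPy by forcing every intermediate quantity to fp16 (a rough sketch of the experimental harness, not the authors' code):

```python
import numpy as np

def logsumexp_fp16(x):
    # Shifted log-sum-exp (1.3) with every operation rounded to fp16,
    # mimicking a half-precision implementation.
    x = x.astype(np.float16)
    a = x.max()
    w = np.exp(x - a)            # fp16 in, fp16 out
    s = np.float16(0.0)
    for wi in w:                 # recursive summation, rounding every addition
        s = np.float16(s + wi)
    return np.float16(a + np.log(s))

x = np.array([15.0, 14.0, 13.0])  # fine in fp32, but e^15 overflows in fp16
print(logsumexp_fp16(x))          # close to the fp64 value 15.4076...
```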
We first describe experiments in fp16. The components in the 2500 test vectors x ∈ R^10 vary between about −19 and +20. As indicated in Table 2, e^x overflows in fp16 for x ≳ 11. Hence, in these tests, overflow is an issue for the basic log-sum-exp implementation in Algorithm 3.1: it generated an Inf for 475 of the 2500 test vectors. The shifted version of log-sum-exp in Algorithm 4.1 did not overflow. In the plots below we do not include results for the cases where Algorithm 3.1 produced overflow.

First we look at the log-sum-exp algorithms. In the upper-left plot of Fig. 1 we used the basic implementation of log-sum-exp, Algorithm 3.1. We scatter plot over the 2025 vectors where no overflow occurred. For each such vector the horizontal coordinate is the leading term in the error bound of Theorem 3.2, scaled by u, that is, 1 + (n + 1)/|y|. Here, as shown in Table 1, u = 4.88 × 10^−4 for fp16. The vertical coordinate is the actual scaled relative error |ŷ − y|/(u|y|). The plot also gives a reference line of slope 1 from the origin. We see that the bound is always satisfied and is reasonably sharp in many cases.
In the upper-right plot of Fig. 1 we show corresponding results for the shifted log-sum-exp implementation in Algorithm 4.1, using the bound from Theorem 4.2.
In the lower part of Fig. 1 we scatter plot the floating-point errors for the basic and shifted algorithms. Here, for 1863 of the 2025 cases (92%), the two errors were identical to all digits in the half-precision computation. In more detail, over all the data points the ratio of the error in the basic log-sum-exp (horizontal axis) divided by the error in the shifted version (vertical axis) varied between 0.19 and 59, with a mean of 1.07 and a standard error of 0.03. This indicates that the two versions perform similarly, with the shift producing slightly better results.
We now move on to the four softmax implementations. In Fig. 2 we use the shifted softmax implementation from Algorithm 4.1, analyzed in Theorem 4.3, as the basis for comparison. The upper-left plot has the scaled error ‖ĝ − g‖_∞/(u‖g‖_∞) from Algorithm 4.1 on the horizontal axis and the scaled error from the basic softmax in Algorithm 3.1 on the vertical axis. The upper-right plot compares the shifted softmax against the alternative algorithm using (3.7) and the unshifted log-sum-exp analyzed in Theorem 3.4. Similarly, the lower plot compares the alternative shifted softmax algorithm that uses (3.9) with the shifted log-sum-exp, which is analyzed in Theorem 4.4. We see that the softmax values obtained from Algorithm 4.1 are generally slightly more accurate than those from Algorithm 3.1, whereas the alternative softmax versions based on the rewrite in (1.5) are mostly less accurate than those from Algorithm 4.1.
Overall, the results in Figs 1 and 2 are consistent with our floating-point error analysis. A further test is to compute the sum of each softmax vector, which should equal 1. In Fig. 3 we compare the softmax sums for the basic algorithm (Algorithm 3.1, red circles) analyzed in Theorem 3.3 and the alternative version (blue crosses) analyzed in Theorem 3.4. Similarly, Fig. 4 compares the shifted softmax algorithm analyzed in Theorem 4.3 and its alternative analyzed in Theorem 4.4. The order along the x-axis is arbitrary; it corresponds to the order in which the data vectors were generated. These figures provide further evidence that the alternative softmax algorithms are less accurate than the basic or shifted algorithms.
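The sum-to-one test is easy to replicate in double precision (our sketch, with random data standing in for the network outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-19.0, 20.0, size=10)   # same range as the test vectors

a = np.max(x)
w = np.exp(x - a)
g_shifted = w / np.sum(w)               # shifted softmax (1.4)
y = a + np.log(np.sum(w))               # shifted log-sum-exp (1.3)
g_divfree = np.exp(x - y)               # division-free softmax (1.5)

print(abs(g_shifted.sum() - 1.0))
print(abs(g_divfree.sum() - 1.0))
```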
We also conducted the corresponding experiments in simulated bfloat16 arithmetic. Here, as indicated in Tables 1 and 2, the number range is increased at the expense of reduced precision. In this case there was no overflow in any of the algorithms. The results were very similar to those for fp16, so they are not shown here.

Conclusions
The log-sum-exp and softmax functions both feature in many computational pipelines, so it is important to compute them accurately and to avoid generating infs or NaNs because of overflow or underflow. To this end a shift is usually incorporated into the defining formulas, yielding (1.3) and (1.4). It is important to understand the effect of the shift on the accuracy of the computed result, especially when computations are carried out in a low-precision arithmetic such as bfloat16 or fp16, which have the equivalent of only three or four decimal digits of precision.
Our rounding error analysis shows that shifting by the largest element of the input vector does not lessen the accuracy of the computed log-sum-exp and softmax, so the shifted formulas can be safely used. Underlying this pleasing fact is the phenomenon that any large error constants caused by shifting are cancelled by multiplication with small exponentials.
We obtained an explicit formula for the condition number of log-sum-exp and bounds for the condition number of softmax, and we were able to identify situations in which the log-sum-exp algorithms are guaranteed to be forward stable.
For the alternative and widely used softmax formula that avoids division, (1.5), we obtained larger error bounds than for the shifted formula (1.4). Since our numerical experiments confirm that larger errors are typically obtained in practice we recommend using (1.4) instead of (1.5) to evaluate softmax.
In summary, Algorithm 4.1 is our recommendation for computing log-sum-exp and softmax. It avoids overflow, reduces the chance of harmful underflow and generally produces results as accurate as those from the unshifted formulas.