Pruning Neural Networks Using Multi-Armed Bandits

The successful application of deep learning has led to increasing expectations of its use in embedded systems. This, in turn, has created the need to find ways of reducing the size of neural networks. Decreasing the size of a neural network requires deciding which weights should be removed without compromising accuracy, which is analogous to the kind of problem addressed by multi-armed bandits (MABs). Hence, this paper explores the use of MABs for reducing the number of parameters of a neural network. Different MAB algorithms, namely ε-greedy, win-stay lose-shift, UCB1, KL-UCB, BayesUCB, UGapEb, successive rejects and Thompson sampling, are evaluated and their performance compared to existing approaches. The results show that MAB pruning methods, especially those based on UCB, outperform other pruning methods.


INTRODUCTION
The use of deep learning has led to dramatic improvements in performance in many pattern recognition applications [1][2][3][4][5][6][7]. These deep learning models consist of neural networks with a large number of parameters that require a significant amount of memory and computational resources [8]. Hence, there is renewed interest in developing algorithms for reducing the size of neural networks while retaining their predictive power. A survey of the literature reveals four categories of algorithms: direct methods, network pruning (NP) methods, regularization methods and methods that utilize sensitivity analysis, which are summarized below.
Direct methods [9] work by assessing the effect of setting weights to zero and removing weights that have little impact on performance. NP methods [10][11][12][13] are based on the view that very small weights are the least important and can be removed without affecting performance. Regularization methods [14,15] extend the loss function L (such as mean square error) to include an additional term, R(W), that aims to reduce the magnitude of the weights and promote generalization [14,15]:

L_R(W) = L(W) + (λ/N) R(W)

where λ > 0 is a parameter that can be set to a value that reflects the weight given to the regularization term and N is the number of examples in the training set. There are many types of regularization functions R(W), and two that have been widely used are the sum of the squares of the weights (L2-norm) and the sum of the absolute values of the weights (L1-norm). Sensitivity analysis methods [16][17][18][19][20][21][22] aim to assess the effect of perturbing the weights on the loss function. These include a method due to Le Cun et al. [19], known as optimal brain damage (OBD), that approximates the change in loss δL with the Taylor series [19]:

δL = Σ_i g_i δw_i + (1/2) Σ_i h_ii δw_i² + (1/2) Σ_{i≠j} h_ij δw_i δw_j + O(‖δW‖³)    (1)

where the δw_i, δw_j are the weight perturbations, the g_i are the components of the gradient of the loss function L with respect to the weights W and the h_ij are the elements of the Hessian matrix H:

g_i = ∂L/∂w_i,    h_ij = ∂²L/(∂w_i ∂w_j)

Le Cun et al. [19] note that computing the Hessian matrix is computationally expensive. To simplify the computation, they assume that the network has been trained to a (local) minimum, so that the first-order terms can be neglected, and that the change in loss can be approximated by the diagonal elements of the Hessian, reducing equation (1) to

δL ≈ (1/2) Σ_i h_ii δw_i²

S. Ameen and S. Vadera
Given this simplification, the saliency s_k of a weight w_k can then be computed by [19]:

s_k = h_kk w_k² / 2

where the second-order derivatives h_kk are computed in a manner similar to the way the gradient is computed in backpropagation. In a later study, Hassibi et al. [20] also use equation (1) to develop a method known as optimal brain surgeon (OBS) but argue that it is not necessary to make the simplifying assumption that the non-diagonal elements of the Hessian matrix are zero. The above methods for pruning neural networks have different merits. The direct methods are of O(NP³), where P is the number of weights and N is the size of the training set, and hence are considered to be intractable [9]. Regularization can lead to weights decaying towards zero, although, as Collins and Kohli [23] and Gupta et al. [24] show, it may not reduce the weights to zero. One approach to address this might be to apply magnitude-based pruning to remove any small weights after regularization. However, Hassibi et al. [20] show that pruning based on magnitude can lead to removing important weights, and Srinivas and Babu [25] conclude that methods based on sensitivity analysis find it difficult to prune deep neural networks.
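As a concrete illustration, the diagonal saliency score above can be computed in a few lines; the function name and the flat-list representation of the weights are ours, not the paper's:

```python
def obd_saliencies(weights, hessian_diag):
    """OBD saliency under the diagonal-Hessian approximation:
    s_k = h_kk * w_k**2 / 2.  Weights with the lowest saliency
    are the first candidates for pruning."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]
```

Given the saliencies, pruning amounts to zeroing the weights with the smallest s_k.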
This paper explores an alternative approach based on the observation that there is a trade-off in deciding which weights to remove. That is, having too many weights can lead to overfitting, but equally, removing too many weights can result in underfitting the data. Although direct methods are optimal, evaluating the merits of weights by using all the training data to assess their value is not computationally feasible except for small data sets and networks. Equally, utilizing only a small sample of the data and trials may not lead to an accurate assessment of the value of a weight. Hence, this paper develops and evaluates an algorithm based on the use of multi-armed bandits (MABs), which have been successfully applied to problems where trials are carried out to explore the merits of various options [26][27][28][29][30][31][32][33][34][35][36][37][38].
The rest of this paper is organized as follows: Section 2 presents the background on MABs, Section 3 develops a new pruning algorithm that utilizes MABs, Section 4 presents the results of an empirical evaluation and Section 5 concludes the paper.

BACKGROUND ON MABS
The term MAB refers to a framework that models a gambler who faces a collection of slot machines and needs to select which machines to play in order to maximize returns. Prior to each lever pull, the gambler knows the expected return, or payoff, of each arm based on the previous history of rewards and can use this to decide which arm to pull next in order to achieve his or her goal. Typically, the goal is either to maximize the cumulative reward [39] or to find the best arm [40]. When the aim is to maximize the cumulative reward, a key decision for the gambler is whether to exploit the best arm to date or explore other arms in the hope of gaining greater reward. In contrast, when the aim is to find the best arm, a key decision is to select an arm that will lead to high confidence that the arm ultimately selected will indeed be the best one to exploit. Given that the goal of the algorithm developed in this paper is to remove weights that do not impact the performance of a neural network, we utilize MABs for identifying the best arms given a fixed budget defined by the number of lever pulls.
There are several alternative strategies for selecting the next arm to pull which can be grouped into three categories: random exploration [41], optimistic exploration [42] and Bayesian bandits [43,44]. The following subsections summarize algorithms in these categories.

Random exploration
In random exploration methods [41], arms are pulled randomly, expected returns are calculated and a strategy is employed for deciding when the best arm should be exploited and when other arms should be tried. In the simplest algorithm, the next arm is chosen randomly for a fixed number of times, an average reward is computed and the best arm is then selected and exploited repeatedly. A more informed strategy, known as ε-greedy [37,39,45], pulls (i.e. exploits) the current best arm with probability 1 − ε and otherwise pulls another arm randomly (i.e. explores). More formally, given k arms, the ε-greedy algorithm selects the next arm a_{t+1} as follows:

a_{t+1} = argmax_{i ∈ {1,…,k}} μ_t(i)    with probability 1 − ε
a_{t+1} = an arm selected randomly from {1,…,k}    with probability ε

where μ_t(i) denotes the average reward for arm i obtained over t rounds.
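A minimal sketch of this selection rule (the function name and list-based rewards are our own illustration):

```python
import random

def epsilon_greedy_select(avg_rewards, epsilon, rng=random):
    """With probability epsilon explore a uniformly random arm;
    otherwise exploit the arm with the best average reward so far."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rewards))
    return max(range(len(avg_rewards)), key=lambda i: avg_rewards[i])
```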
Selecting a suitable ε for this algorithm can be challenging. If ε is large, the algorithm wastes time pulling random arms without gaining much, but if ε is too small, the learning process will be slow [45]. Hence, some authors have proposed a strategy of decaying ε over time [39]. For example, White [46] proposes decaying ε at a rate governed by a small number φ and the number of rounds t to date. Another technique, known as win-stay, lose-shift (WSLS) [47,48], changes the probability of selecting an arm depending on whether or not it results in a reward in the current round. If a selected arm results in a reward (i.e. wins), then its probability of being selected in the future is increased; otherwise, it is reduced. More formally, let P_t(a) be the probability of choosing arm a at time t; then the WSLS update equations take the form

P_{t+1}(a) = P_t(a) + β(1 − P_t(a))    if arm a wins at round t
P_{t+1}(a) = (1 − β) P_t(a)    if arm a loses at round t

where β is a scaling parameter for rewarding the winner and penalizing the losers.
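Assuming the linear reward-penalty form of the update sketched above (our reading of the update; the renormalization of the unselected arms is also our assumption), one WSLS step can be written as:

```python
def wsls_update(probs, arm, won, beta):
    """Win-stay, lose-shift: move the selected arm's probability
    towards 1 on a win and towards 0 on a loss, rescaling the other
    arms so the distribution still sums to one.  Assumes
    probs[arm] < 1 so the rescaling factor is well defined."""
    p_old = probs[arm]
    p_new = p_old + beta * (1 - p_old) if won else (1 - beta) * p_old
    scale = (1 - p_new) / (1 - p_old)
    return [p_new if i == arm else p * scale
            for i, p in enumerate(probs)]
```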

Optimistic explorations
As mentioned above, prior to pulling the next arm, a player knows the expected reward for each arm based on the history of lever pulls. A simple approach to selecting the next arm is to choose an arm with the largest expected reward. However, this ignores the fact that early estimates of the rewards may be inaccurate, making it ineffective, irrespective of whether we wish to maximize the cumulative reward or, as in our case, solve the best arm problem. The main idea of optimistic exploration is therefore to maintain confidence bounds on the expected rewards and to select an arm with the largest upper bound. This ensures that there is sufficient exploration at the start, knowing that the bounds will tighten as the number of lever pulls increases. MAB methods that adopt this approach are known as upper confidence bound (UCB) algorithms [39,49]. More formally, UCB algorithms aim to select the next arm a_{t+1} as follows:

a_{t+1} = argmax_i ( μ_i + Pf_i )    (2)

where μ_i is the expected reward for arm a_i and Pf_i is a padding function that provides an upper bound for the reward of arm a_i. One of the earliest and most widely cited UCB algorithms, known as UCB1, uses the following padding function [39]:

Pf_i = √( 2 ln t / n_i )

where n_i is the number of times arm i has been chosen and t is the total number of rounds. UCB1 begins by playing each arm once to create an initial estimate. Then, for each iteration t, an arm is selected using equation (2). Initially, when arms have only been pulled a few times, the padding function in equation (2) allows exploration, but as the number of rounds increases, the value of the padding function reduces, leading to greater exploitation of the arms that return the largest rewards.
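UCB1's selection rule is compact enough to sketch directly (helper name is ours):

```python
import math

def ucb1_select(avg_rewards, counts, t):
    """Pick the arm maximizing mu_i + sqrt(2 ln t / n_i); arms that
    have never been pulled get infinite padding and are tried first."""
    def score(i):
        if counts[i] == 0:
            return float('inf')
        return avg_rewards[i] + math.sqrt(2 * math.log(t) / counts[i])
    return max(range(len(counts)), key=score)
```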
KL-UCB [50,51] presents an alternative approach where the padding function is derived from the Kullback-Leibler (KL) divergence measure, leading to a selection function where the next arm to pull is given by

a_{t+1} = argmax_i max{ q ∈ [0, 1] : n_i d(μ_i, q) ≤ ln t + c ln ln t }

where c is a constant and d is the KL divergence between Bernoulli distributions, defined by [52]

d(p, q) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q))

Both UCB1 and KL-UCB focus on the upper bounds of individual bandits. In contrast, some recent MAB algorithms also utilize the information in the lower bounds. One such algorithm is the unified gap-based exploration algorithm (UGapEb) [53]. As with the above algorithms, it maintains the lower and upper bounds (l_i, u_i) for each arm but, in addition, it also maintains a set S_m with the top m arms, which are selected on the basis of the gap g_i between the lower bound of an arm and the best upper bound amongst the other arms:

g_i = max_{j ≠ i} u_j − l_i

Given this set, the algorithm considers whether an arm that is not currently in S_m can make it into the set in the next round. To assess this, it picks two arms: the arm in S_m with the lowest lower bound and the arm not in S_m with the highest upper bound. From these, it selects the arm with the largest uncertainty (i.e. u_i − l_i) and repeats the process a fixed number of times (the fixed budget setting).
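The KL-UCB upper bound has no closed form, but since d(μ, q) is increasing in q for q ≥ μ, it can be found by bisection; this sketch (names ours) computes the bound for a single arm:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q),
    with clipping to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_bound(mu, n, t, c=3.0):
    """Largest q >= mu satisfying n * d(mu, q) <= ln t + c ln ln t,
    found by bisection over [mu, 1].  Assumes t >= 2 so ln ln t is
    defined."""
    budget = (math.log(t) + c * math.log(math.log(t))) / n
    lo, hi = mu, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if bernoulli_kl(mu, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected, the bound tightens towards μ as the arm is pulled more often.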
The bounds (l_i, u_i) for an arm a_i, with T_i lever pulls at round t, can be derived using the Chernoff-Hoeffding bounds [53]:

l_i = μ_i − b √(a / T_i),    u_i = μ_i + b √(a / T_i)    (3)

where a and b are user-provided parameters that need tuning to improve performance. Audibert and Bubeck [40] argue that finding suitable parameters for algorithms such as UGapEb can be challenging and propose a parameter-free algorithm, successive rejects (SR), that involves K − 1 phases in which the available arms are tried a fixed number of times and the least promising arm is removed after each phase. The number of rounds n_k in the k-th phase is set with a view to achieving the theoretical lower bound for the best arm problem [40]:

n_k = ⌈ (n − K) / ( loḡ(K) (K + 1 − k) ) ⌉,    where loḡ(K) = 1/2 + Σ_{i=2}^{K} 1/i

and n is the total budget of lever pulls.
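Audibert and Bubeck's phase schedule can be sketched as follows (function name is ours; `budget` stands for the total number of lever pulls n):

```python
import math

def sr_pulls_per_phase(K, budget):
    """Successive rejects: extra pulls given to each surviving arm in
    phases 1..K-1, following the n_k schedule with
    log-bar(K) = 1/2 + sum_{i=2..K} 1/i."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    n = [0] + [math.ceil((budget - K) / (log_bar * (K + 1 - k)))
               for k in range(1, K)]
    # each surviving arm is pulled n_k - n_{k-1} times in phase k
    return [n[k] - n[k - 1] for k in range(1, K)]
```

In phase k there are K − k + 1 surviving arms, so the total number of pulls stays within the budget.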

Bayesian bandits
In the Bayesian approach, the potential reward from each arm is represented by a probability distribution that is updated in a Bayesian fashion. If P(R) is the prior probability of a reward, then the goal is to compute a posterior distribution P(R|h_t), where h_t is the history of rewards and actions.
One of the first algorithms to adopt a Bayesian approach was Thompson sampling [26,27,52]. Given s_a, the number of times an arm results in a reward, and f_a, the number of times an arm fails to deliver a reward, the probability distribution over an arm's reward rate θ is defined by the beta distribution [43,44]:

P(θ) = θ^(α−1) (1 − θ)^(β−1) / B(α, β)

where α is set to s_a + 1, β is set to f_a + 1 and B(α, β) is a normalizing constant. Thompson sampling then draws a sample from each arm's distribution and plays the arm with the largest draw.
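A sketch of Thompson sampling over Bernoulli arms (function name is ours; `random.betavariate` draws from the Beta posterior):

```python
import random

def thompson_select(successes, failures, rng=random):
    """Draw a sample from Beta(s_a + 1, f_a + 1) for every arm and
    play the arm with the largest draw."""
    draws = [rng.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```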
In a more recent development that uses a Bayesian approach, Kaufmann et al. [54] propose an algorithm, BayesUCB, in which the quantiles of a distribution are estimated to increasingly tight bounds and used to determine the next arm:

a_{t+1} = argmax_i Q(1 − 1/t, λ_i^{t−1})

where Q is the standard quantile function and λ_i^{t−1} denotes the posterior distribution of the mean reward for the i-th arm.
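A sketch of the BayesUCB rule; to stay dependency-free, we estimate the Beta posterior quantile from sorted Monte Carlo draws rather than calling an exact quantile function Q (the names and sample size are our assumptions):

```python
import random

def bayes_ucb_select(successes, failures, t, n_samples=2000, rng=random):
    """Score each arm by the (1 - 1/t) quantile of its Beta posterior
    Beta(s + 1, f + 1), approximated via order statistics of draws."""
    def quantile(s, f):
        draws = sorted(rng.betavariate(s + 1, f + 1)
                       for _ in range(n_samples))
        idx = min(int((1 - 1.0 / t) * n_samples), n_samples - 1)
        return draws[idx]
    scores = [quantile(s, f) for s, f in zip(successes, failures)]
    return max(range(len(scores)), key=lambda i: scores[i])
```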

A MULTI-ARMED PRUNING ALGORITHM
To utilize the MAB algorithms described in Section 2, we need to define the arms and the reward function. The arms, a_k, are defined as the weights w_ij from a weight matrix W connecting the neurons in the layers of the network, and pulling an arm is considered to be equivalent to setting a weight to zero. The reward is defined as the difference between the accuracy of the network before and after removing a weight and is computed by applying the network to a random sample of the data. The weight selected, together with the reward, becomes part of the history, which is then used by a MAB algorithm to select the next weight, and the process is repeated for a fixed number of rounds.
The reward function varies depending on the type of bandit algorithm used. For UCB1, ε-greedy, KL-UCB and WSLS, the reward is computed by first calculating the difference in loss:

δL = L(D|W) − L(D|W′)

where L is the loss when the network is applied to the sample of the data D, W denotes the original weights and W′ the weights after pruning. The reward is then computed using

R = (δL + Threshold) / Constant

where the value of Threshold determines how much loss in performance can be tolerated when pruning. For example, suppose pruning results in a slightly worse performance, giving −0.05 for δL; then a threshold of 0.1 would still result in a reward. The divisor, Constant, is defined in a way that ensures that the reward is bounded between zero and one. The Thompson sampling and BayesUCB algorithms assume Bernoulli rewards, and hence the reward is one if δL is larger than zero and zero otherwise.
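The two reward schemes can be sketched as follows; the clipping into [0, 1] and the default values of Threshold and Constant are our assumptions, since the paper only states that Constant is chosen so the reward is bounded:

```python
def continuous_reward(loss_before, loss_after, threshold=0.1, constant=1.0):
    """Reward for UCB1, epsilon-greedy, KL-UCB and WSLS:
    (delta_L + Threshold) / Constant, clipped to [0, 1]."""
    delta_l = loss_before - loss_after   # positive if pruning helped
    return min(max((delta_l + threshold) / constant, 0.0), 1.0)

def bernoulli_reward(loss_before, loss_after):
    """Reward for Thompson sampling and BayesUCB: 1 if delta_L > 0."""
    return 1 if (loss_before - loss_after) > 0 else 0
```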
In summary, given the number of rounds T and loss function L, the main steps of the MAB algorithm for pruning a neural network are:

1. Initialize the round number: t = 1.
2. Let D be a random sample from the training data.
3. If any weight w_ji has yet to be played, let the selected indices be (j, i); otherwise, use a MAB selection policy to return the indices (j, i) based on the current history.
4. Perform forward propagation with the initial weights W to obtain L(D|W).
5. Save the selected weight w_ji in case it needs to be reinstated and set w_ji to zero to obtain W′.
6. Perform forward propagation with the revised weights W′ to obtain L(D|W′).
7. Compute the change in loss, δL = L(D|W) − L(D|W′), and the reward R_ji,t for round t.
8. Update the average reward for the selected arm: μ_ji = ((n_ji − 1)/n_ji) μ_ji + (1/n_ji) R_ji,t.
9. Reinstate the selected weight w_ji, increment the round number t and repeat from Step 2 for T rounds.
10. Rank the weights based on the rewards μ_ji and select a user-defined proportion to be set to zero to obtain the pruned network.
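The steps above can be sketched end to end; this toy version (names ours) treats the weights as a flat list, uses UCB1 as the selection policy of Step 3 and a deterministic `loss_fn` in place of evaluating the network on a random sample D:

```python
import math

def mab_prune(weights, loss_fn, rounds, prune_frac, threshold=0.1):
    """MAB pruning loop: repeatedly zero one weight, reward the arm by
    the (thresholded) drop in loss, reinstate the weight, and finally
    zero the weights whose removal earned the highest average reward."""
    k = len(weights)
    counts, avg_reward = [0] * k, [0.0] * k
    for t in range(1, rounds + 1):
        untried = [i for i in range(k) if counts[i] == 0]
        if untried:                       # play every arm once first
            j = untried[0]
        else:                             # then follow the UCB1 policy
            j = max(range(k), key=lambda i: avg_reward[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        loss_before = loss_fn(weights)
        saved, weights[j] = weights[j], 0.0   # pull the arm
        loss_after = loss_fn(weights)
        weights[j] = saved                    # reinstate the weight
        r = min(max(loss_before - loss_after + threshold, 0.0), 1.0)
        counts[j] += 1
        avg_reward[j] += (r - avg_reward[j]) / counts[j]
    ranked = sorted(range(k), key=lambda i: avg_reward[i], reverse=True)
    for i in ranked[:int(prune_frac * k)]:
        weights[i] = 0.0
    return weights
```

On a toy loss where one weight matters and the others barely do, the loop prunes an unimportant weight and keeps the important one.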
This algorithm has been implemented with Step 3 invoking the selection policy of a specific MAB algorithm, such as UCB1, KL-UCB, BayesUCB, UGapEb or Thompson sampling, and Step 7 utilizing the relevant reward function. It is worth noting that this algorithm inherits the extensive body of research on the theoretical convergence properties of the different MAB algorithms. Readers interested in these properties are referred to the survey paper by Burtini et al. [45].

EMPIRICAL EVALUATION
The MAB pruning algorithms were evaluated by carrying out two sets of experiments. In the first, the NNSYSID package [55] was used to build neural networks for 12 data sets from the UCI machine learning repository [56] whose characteristics are summarized in Appendix A. The inputs and outputs of the neural networks reflected the features and class of the data sets. For consistency, each network adopted two hidden layers, with each hidden layer utilizing 20 neurons. The data were randomly divided into 60:40 training:testing sets. The models were trained using stochastic gradient descent with a batch size of 10 examples, momentum of 0.9 and weight decay of 0.005. The learning rate was initialized to 0.01, and the Softmax function was used as the activation function. Once trained, the neural networks were pruned using the different methods and their performance analysed on the testing set.
In the second set of experiments, the LeNet deep learning model, with two convolutional layers and two fully connected layers, was adopted and trained on the MNIST data set [57]. The model was then pruned using the different methods and their performance compared on the MNIST testing data. The following subsections present the results from the two sets of experiments.

Results for the UCI data sets
The empirical evaluation of the eight MAB methods on the UCI data was carried out in comparison with the following four algorithms:

1. Random pruning (RP), in which the weights are sampled randomly for a fixed number of times and the weights removed are selected based on their average effect on reducing the error.
2. An NP method [58], which removes weights that are below a user-specified threshold value.
3. OBD [19], which, as described in Section 1, is derived using a Taylor series approximation for the change in error that would occur if the weights were perturbed, but makes some simplifying assumptions about the off-diagonal elements of the resulting Hessian matrix.
4. OBS [20,21], which also adopts a Taylor series approximation for the change in error but does not make the assumptions made by OBD.
The experiments aimed to assess what happens to the accuracy when a network is pruned using the different methods and to compare the computational time of the methods. (Note that RP is equivalent to setting ε to 1 in ε-greedy.) Table 1 presents the percentage error rate for each of the pruning methods on data sets from the UCI repository after pruning 10% of the weights following 1800 rounds of the MAB methods. The ε-greedy and UGapEb algorithms both have hyperparameters. In these experiments, for ε-greedy, we set ε to 0.5 to allow sufficient opportunities for exploring alternative arms. For UGapEb, we set the parameters a and b (in equation 3) to 0.25 and 1, respectively, based on the recommendations in Terayama et al. [59]. The average error rate, presented in the last row, shows that the UCB1 and UGapEb methods perform well relative to the other pruning methods.
To compare the methods more formally, we adopted the methodology recommended by Demsar [60], who advocates use of a non-parametric test introduced by Friedman [61] to determine whether there is a difference amongst the methods and, if so, to follow up with the Nemenyi test [62] to assess whether one method is significantly better than another. Table 3 presents the average rank of the 13 methods over the 12 data sets, with UCB1 ranked the most effective in terms of reducing the error rate and NP ranked the least effective. Applying the Friedman test results in a P-value of 1.2 × 10⁻¹², confirming that there is a significant difference amongst some of the methods, and hence the Nemenyi post hoc test was carried out. The critical difference (CD) for the Nemenyi test was calculated using [62]:

CD = q_{α,K} √( K(K + 1) / (6N) )

where α is the confidence level, which was set to 0.05, K is the number of models (or classifiers) and N is the number of measurements (data sets). To compute q_{α,K}, the studentized range statistic for infinite degrees of freedom divided by √2 was used. Figure 1 displays the results, where the methods are plotted according to their average rank. The best ranked methods are to the right, and the line at the top indicates the CD. The coloured lines group the methods that are not significantly different at the 0.05 level. These results show that:

• The UCB family of methods performed significantly better than NP. Thompson sampling for pruning is also significantly better than NP.
• The performance of BayesUCB and KL-UCB is very similar, which is consistent with the theoretical results due to Kaufmann et al. [63].
• Although the bandit-based methods have a higher average rank, the Nemenyi test does not distinguish these methods significantly from OBD or OBS in terms of minimizing the error.

Table 2 presents the run-time performance of these methods, showing that the UCB family, ε-greedy, WSLS and Thompson sampling have the best run-time performance. OBD and OBS are computationally intensive given the need to compute the Hessian matrix. Thus, UCB methods achieve better performance on average than OBD and OBS but in significantly less time.
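The CD computation is a one-liner; the value of q_{α,K} for α = 0.05 and K = 13 is roughly 3.3 (our reading of standard studentized-range tables, not a figure from the paper):

```python
import math

def nemenyi_cd(q_alpha_k, K, N):
    """Critical difference CD = q * sqrt(K (K + 1) / (6 N)), where q is
    the studentized range statistic divided by sqrt(2)."""
    return q_alpha_k * math.sqrt(K * (K + 1) / (6.0 * N))
```

For K = 13 methods over N = 12 data sets this gives a CD of roughly 3.3 × √(13·14/72) ≈ 5.2 average-rank units.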

Results for the MNIST data set
The MNIST (Modified National Institute of Standards and Technology) data set is a well-known collection of handwritten digits that has been used in evaluating many handwriting recognition algorithms [57]. One of the most widely adopted deep learning architectures for this data set is the LeNet model [57]. In this model, the network has two 5×5 convolutional layers with 20 and 50 filters, respectively, and two fully connected layers, with 500 neurons in the first and 10 neurons in the output layer. This results in the first two layers (the convolutional layers) having 500 and 25,000 weights, respectively, and the third and fourth layers (the fully connected layers) having 2,500,000 and 500 nodes, respectively. The baseline accuracy of this model is 98.06%, so it provides a good example for assessing which methods can best remove weights without adversely affecting accuracy. To assess this, we applied a selection of MAB methods, with the number of rounds set to 150,000, to prune 50% of the weights in the second and third layers, which have the most weights. The methods selected include one method from the UCB family, given that their performance is similar, together with Thompson sampling and NP. Table 4 presents the results, where the first column lists the pruning methods, the second column presents the accuracy obtained after pruning the first fully connected layer of LeNet and the last column presents the accuracy after pruning the second convolutional layer of LeNet. The results show that the use of the bandit algorithms maintains accuracy, but NP results in a significant decline in accuracy.

CONCLUSION AND FUTURE WORK
Pruning neural networks has re-emerged as a significant research topic following the emergence of deep neural networks, which can be memory intensive and computationally demanding. Pruning neural networks can involve various trade-offs, such as the size of a network versus accuracy, and the time spent in assessing the value of a weight versus the accuracy of the assessment. This paper has explored the use of MABs for pruning neural networks. Several MAB methods, namely UCB1, UGapEb, BayesUCB, KL-UCB, ε-greedy, WSLS, SR and Thompson sampling, were utilized for pruning neural networks and compared with existing methods. In terms of retaining accuracy, the results on data sets from the UCI repository show that UCB1 and UGapEb are the most effective methods for pruning when compared to existing methods such as OBD, OBS and NP. Computationally, UCB1 is also significantly less demanding than methods such as OBS and OBD.
The bandit methods were also applied to Le Cun's deep learning model to examine their effectiveness and performed well in terms of maintaining the accuracy of the original model.
There are several directions for future research on the use of MAB algorithms for pruning neural networks. First, there are MAB algorithms, such as the SR algorithm, that are designed to guarantee the theoretical lower bound for the best arm problem. As our evaluation of SR indicates, these algorithms can be computationally demanding for deep neural networks. Hence, the development of best arm MAB algorithms that ensure near-optimal regret bounds, perform well empirically and are also computationally feasible for pruning neural networks remains an open challenge. A second area of interesting work is to evaluate the use of MABs at different levels of granularity, such as for pruning neurons and feature maps. In conclusion, this study shows that MABs based on UCB offer an effective way of pruning neural networks. Given the growth and scale of applications of deep neural networks, pruning them remains an open challenge, and we have therefore shared the implementations on GitHub so that others can build upon the work presented in this paper.