Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields. In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training them has a high environmental cost [Strubell et al., 2019].

Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost in accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective way to compress models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016] and language processing [Gale et al., 2019], and has more recently been leveraged as a core component of the lottery ticket hypothesis [Frankle et al., 2019].

While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."

In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and high values can be pruned if they shrink during training. This strategy moves the selection criterion from the 0th to the 1st order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as $L_0$ regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing a greater ability to adapt to the end-task.

2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to convolutional networks. Differing from our method, these methods keep the weights of the model fixed (either from a randomly initialized network or a pre-trained network) and only the scores are updated to find a good sparse subnetwork.

Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the squared value of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.

Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates with little or no loss in performance. As shown in previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.

3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let $W \in \mathbb{R}^{n \times n}$ refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores $S \in \mathbb{R}^{n \times n}$. Given importance scores, each pruning strategy computes a mask $M \in \{0, 1\}^{n \times n}$. Inference for an input $x$ becomes $a = (W \odot M)x$, where $\odot$ is the Hadamard product. A common strategy is to keep the top-$v$ percent of weights by importance.
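To make this setup concrete, the short PyTorch sketch below builds a top-$v$ mask from importance scores and applies $a = (W \odot M)x$; magnitude pruning corresponds to choosing $S = |W|$. This is our own illustration, not code from the paper or any released implementation, and all names and shapes are illustrative.

```python
import torch

def topv_mask(scores: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    """Binary mask keeping the top-v% entries of `scores` (local to this matrix)."""
    k = max(1, int(keep_fraction * scores.numel()))
    threshold = torch.topk(scores.flatten(), k, largest=True).values.min()
    return (scores >= threshold).to(scores.dtype)

# Magnitude pruning instance: importance scores are |W|; inference uses a = (W ⊙ M) x.
torch.manual_seed(0)
W = torch.randn(8, 8)                     # a generic weight matrix
x = torch.randn(8)                        # an input vector
S = W.abs()                               # zeroth-order importance scores
M = topv_mask(S, keep_fraction=0.10)      # keep the top 10% of weights
a = (W * M) @ x                           # masked forward pass
print(f"kept {int(M.sum())}/{M.numel()} weights")
```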
We define $\text{Top}_v$ as a function which selects the $v\%$ highest values in $S$:

$$\text{Top}_v(S)_{i,j} = \begin{cases} 1, & S_{i,j} \text{ in top } v\% \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores $S = \big(|W_{i,j}|\big)_{1 \le i,j \le n}$ and masks $M = \text{Top}_v(S)$ (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence, and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.

|  | Magnitude pruning | $L_0$ regularization | Movement pruning | Soft movement pruning |
|---|---|---|---|---|
| Pruning Decision | 0th order | 1st order | 1st order | 1st order |
| Masking Function | $\text{Top}_v$ | Continuous Hard-Concrete | $\text{Top}_v$ | Thresholding |
| Pruning Structure | Local or Global | Global | Local or Global | Global |
| Learning Objective | $\mathcal{L}$ | $\mathcal{L} + \lambda_{l0}\,\mathbb{E}(L_0)$ | $\mathcal{L}$ | $\mathcal{L} + \lambda_{mvp}\,R(S)$ |
| Gradient Form | — | Gumbel-Softmax | Straight-Through | Straight-Through |
| Scores $S$ | $\lvert W_{i,j} \rvert$ | $-\sum_t \big(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\big)^{(t)} W_{i,j}^{(t)} f(\bar{S}_{i,j}^{(t)})$ | $-\sum_t \big(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\big)^{(t)} W_{i,j}^{(t)}$ | $-\sum_t \big(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\big)^{(t)} W_{i,j}^{(t)}$ |

Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of $f$ for $L_0$ regularization is detailed in Eq (3).

In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated, such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level $v$ during training using a cubic sparsity scheduler:

$$v^{(t)} = v_f + (v_i - v_f)\left(1 - \frac{t - t_i}{n \Delta t}\right)^3.$$

The sparsity level at time step $t$, $v^{(t)}$, is increased from an initial value $v_i$ (usually 0) to a final value $v_f$ over $n$ pruning steps after $t_i$ steps of warm-up. The model is thus pruned and trained jointly.

4 Movement Pruning

Magnitude pruning can be seen as utilizing zeroth-order information (the absolute value) of the running model. In this work, we focus on movement pruning methods, where importance is derived from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process. We consider two versions of movement pruning: hard and soft.

For (hard) movement pruning, masks are computed using the $\text{Top}_v$ function: $M = \text{Top}_v(S)$. Unlike magnitude pruning, during training we learn both the weights $W$ and their importance scores $S$. During the forward pass, we compute for all $i$: $a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k$.

Since the gradient of $\text{Top}_v$ is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, $\text{Top}_v$ is ignored and the gradient goes "straight through" to $S$. The approximation of the gradient of the loss $\mathcal{L}$ with respect to $S_{i,j}$ is given by

$$\frac{\partial \mathcal{L}}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} W_{i,j} x_j \qquad (2)$$

This implies that the scores of weights are updated even if these weights are masked in the forward pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
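The hard variant is straightforward to emulate with a custom autograd function: the forward pass applies the $\text{Top}_v$ mask to the learned scores, and the backward pass ignores $\text{Top}_v$ so the gradient flows straight through to $S$, reproducing Eq (2). The sketch below is our own minimal PyTorch rendition (not the authors' released code); it also includes a small helper for the cubic sparsity scheduler of Section 3, and the toy loss and all names are illustrative.

```python
import torch

class TopVMask(torch.autograd.Function):
    """Hard Top_v mask over importance scores, with a straight-through backward pass."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, keep_fraction: float) -> torch.Tensor:
        k = max(1, int(keep_fraction * scores.numel()))
        threshold = torch.topk(scores.flatten(), k, largest=True).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: ignore Top_v, pass the gradient to S unchanged.
        return grad_output, None

def cubic_sparsity(step: int, v_i: float, v_f: float, t_i: int, n: int, dt: int = 1) -> float:
    """v(t) = v_f + (v_i - v_f) * (1 - (t - t_i) / (n * dt))^3, clamped to the ramp interval."""
    progress = min(max(step - t_i, 0) / (n * dt), 1.0)
    return v_f + (v_i - v_f) * (1.0 - progress) ** 3

# Toy usage: both W and S receive gradients during fine-tuning.
torch.manual_seed(0)
W = torch.randn(8, 8, requires_grad=True)
S = torch.zeros(8, 8, requires_grad=True)
x = torch.randn(8)

sparsity = cubic_sparsity(step=100, v_i=0.0, v_f=0.9, t_i=50, n=200)
M = TopVMask.apply(S, 1.0 - sparsity)    # keep the (1 - sparsity) fraction of weights
a = (W * M) @ x                          # a_i = sum_k W_ik * M_ik * x_k
loss = a.pow(2).sum()                    # stand-in for the fine-tuning loss L
loss.backward()                          # S.grad matches Eq (2): (dL/da_i) * W_ij * x_j
```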
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter $v$ with a fixed global threshold value $\tau$ that controls the binary mask; the mask is calculated as $M = (S > \tau)$. In order to control the sparsity level, we add a regularization term $R(S) = \lambda_{mvp} \sum_{i,j} \sigma(S_{i,j})$, which encourages the importance scores to decrease over time.¹ The coefficient $\lambda_{mvp}$ controls the penalty intensity and thus the sparsity level.

¹ We also experimented with $\sum_{i,j} |S_{i,j}|$, but it turned out to be harder to tune while giving similar results.

Finally, we note that these approaches yield a similar update to $L_0$ regularization-based pruning, another movement-based pruning approach [Louizos et al., 2017]. Instead of straight-through, $L_0$ uses the hard-concrete distribution, where the mask $M$ is sampled for all $i,j$ with hyperparameters $b > 0$, $l < 0$, and $r > 1$:

$$u \sim \mathcal{U}(0,1), \qquad \bar{S}_{i,j} = \sigma\big((\log(u) - \log(1-u) + S_{i,j})/b\big),$$
$$Z_{i,j} = (r - l)\,\bar{S}_{i,j} + l, \qquad M_{i,j} = \min\big(1, \mathrm{ReLU}(Z_{i,j})\big).$$

The expected $L_0$ norm has a closed form involving the parameters of the hard-concrete: $\mathbb{E}(L_0) = \sum_{i,j} \sigma\big(S_{i,j} - b\log(-l/r)\big)$. Thus, the weights and scores of the model can be optimized in an end-to-end fashion to minimize the sum of the training loss $\mathcal{L}$ and the expected $L_0$ penalty. A coefficient $\lambda_{l0}$ controls the $L_0$ penalty and indirectly the sparsity level. Gradients take a similar form:

$$\frac{\partial \mathcal{L}}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} W_{i,j} x_j\, f(\bar{S}_{i,j}) \quad \text{where} \quad f(\bar{S}_{i,j}) = \frac{r - l}{b}\,\bar{S}_{i,j}\big(1 - \bar{S}_{i,j}\big)\,\mathbf{1}_{\{0 \le Z_{i,j} \le 1\}} \qquad (3)$$

At test time, a non-stochastic estimation of the mask is used: $\hat{M} = \min\big(1, \mathrm{ReLU}\big((r - l)\,\sigma(S) + l\big)\big)$, and weights multiplied by 0 can simply be discarded.

Figure 1: (a) Magnitude pruning. (b) Movement pruning. During fine-tuning (on MNLI), the weights stay close to their pre-trained values, which limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are plotted in grey. Magnitude pruning selects weights that are far from 0, while movement pruning selects weights that are moving away from 0.

Table 1 highlights the characteristics of each pruning method. The main differences are in the masking functions, pruning structure, and the final gradient form.

Method Interpretation

In movement pruning, the gradient of $\mathcal{L}$ with respect to $W_{i,j}$ is given by the standard gradient derivation: $\frac{\partial \mathcal{L}}{\partial W_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} M_{i,j} x_j$. By combining it with Eq (2), we have $\frac{\partial \mathcal{L}}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial W_{i,j}} W_{i,j}$ (we omit the binary mask term $M_{i,j}$ for simplicity). From the gradient update in Eq (2), $S_{i,j}$ is increasing when $\frac{\partial \mathcal{L}}{\partial S_{i,j}} < 0$, which happens in two cases:

(a) $\frac{\partial \mathcal{L}}{\partial W_{i,j}} < 0$ and $W_{i,j} > 0$
(b) $\frac{\partial \mathcal{L}}{\partial W_{i,j}} > 0$ and $W_{i,j} < 0$

It means that during training $W_{i,j}$ is increasing while being positive, or decreasing while being negative. This is equivalent to saying that $S_{i,j}$ is increasing when $W_{i,j}$ is moving away from 0. Inversely, $S_{i,j}$ is decreasing when $\frac{\partial \mathcal{L}}{\partial S_{i,j}} > 0$, which means that $W_{i,j}$ is shrinking towards 0.

While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 ($|W_{i,j}|$), movement pruning selects the weights which are moving the most away from 0 ($S_{i,j}$). For this reason, magnitude pruning can be seen as a 0th-order method, whereas movement pruning is based on a 1st-order signal. In fact, $S$ can be seen as an accumulator of movement: from Eq (2), after $T$ gradient updates, we have

$$S_{i,j}^{(T)} = -\alpha_S \sum_{t < T} \left(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}, \qquad (4)$$

where $\alpha_S$ is the learning rate of $S$.
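For the soft variant described earlier in this section, a thresholded mask with the same straight-through trick plus the $\sigma$-based regularizer can be sketched as follows. Again, this is our own illustrative PyTorch rendition of the description above, not the authors' implementation; the toy loss and all names are ours.

```python
import torch

class ThresholdMask(torch.autograd.Function):
    """Soft movement pruning mask M = (S > tau), straight-through in the backward pass."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, tau: float) -> torch.Tensor:
        return (scores > tau).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # gradient passes straight through to S; none for tau

torch.manual_seed(0)
W = torch.randn(8, 8, requires_grad=True)
S = (0.01 * torch.randn(8, 8)).requires_grad_()
x = torch.randn(8)
tau, lambda_mvp = 0.0, 0.1                          # global threshold and penalty coefficient

M = ThresholdMask.apply(S, tau)
a = (W * M) @ x
task_loss = a.pow(2).sum()                          # stand-in for the fine-tuning loss L
reg = lambda_mvp * torch.sigmoid(S).sum()           # R(S) = lambda_mvp * sum_ij sigma(S_ij)
(task_loss + reg).backward()                        # the penalty pushes all scores downward
```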
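For comparison, here is a sketch of the hard-concrete gate used by the $L_0$-regularization baseline described above. Unlike the straight-through variants, gradients reach $S$ through the reparameterized sample. The hyperparameter values ($b = 2/3$, $l = -0.1$, $r = 1.1$) are commonly used defaults and an assumption on our part, as are all function and variable names.

```python
import math
import torch

def hard_concrete_mask(S, b=2/3, l=-0.1, r=1.1):
    """Stochastic gate: M = min(1, ReLU((r - l) * sigma((log u - log(1-u) + S) / b) + l))."""
    u = torch.rand_like(S).clamp(1e-6, 1 - 1e-6)
    S_bar = torch.sigmoid((torch.log(u) - torch.log(1 - u) + S) / b)
    return torch.clamp((r - l) * S_bar + l, min=0.0, max=1.0)

def expected_l0(S, b=2/3, l=-0.1, r=1.1):
    """Closed-form E[L0] = sum_ij sigma(S_ij - b * log(-l / r))."""
    return torch.sigmoid(S - b * math.log(-l / r)).sum()

torch.manual_seed(0)
W = torch.randn(8, 8, requires_grad=True)
S = torch.zeros(8, 8, requires_grad=True)
x = torch.randn(8)
lambda_l0 = 0.1

M = hard_concrete_mask(S)                 # differentiable in S via the reparameterized sample
a = (W * M) @ x
loss = a.pow(2).sum() + lambda_l0 * expected_l0(S)
loss.backward()

# Test-time deterministic mask: M_hat = min(1, ReLU((r - l) * sigmoid(S) + l))
M_hat = torch.clamp(1.2 * torch.sigmoid(S.detach()) - 0.1, min=0.0, max=1.0)
```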