Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields. In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training them has a high environmental cost [Strubell et al., 2019].

Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost in accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective way to compress models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016] and language processing [Gale et al., 2019], and has more recently been leveraged as a core component of the lottery ticket hypothesis [Frankle et al., 2019].

While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."

In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and high values can be pruned if they shrink during training. This strategy moves the selection criterion from the 0th to the 1st order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as $L_0$ regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing a greater ability to adapt to the end-task.

2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to convolutional networks. Differing from our method, these methods keep the weights of the model fixed (either from a randomly initialized network or a pre-trained network) and only the scores are updated to find a good sparse subnetwork.

Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the squared value of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.

Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates with little or no loss in performance. As shown in previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.

3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let $W \in \mathbb{R}^{n \times n}$ refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores $S \in \mathbb{R}^{n \times n}$. Given importance scores, each pruning strategy computes a mask $M \in \{0, 1\}^{n \times n}$. Inference for an input $x$ becomes $a = (W \odot M)x$, where $\odot$ is the Hadamard product. A common strategy is to keep the top-$v$ percent of weights by importance.
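To make this setup concrete, the short PyTorch sketch below builds a top-$v$ mask from importance scores and applies $a = (W \odot M)x$; magnitude pruning corresponds to choosing $S = |W|$. This is our own illustration, not code from the paper or any released implementation, and all names and shapes are illustrative.

```python
import torch

def topv_mask(scores: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    """Binary mask keeping the top-v% entries of `scores` (local to this matrix)."""
    k = max(1, int(keep_fraction * scores.numel()))
    threshold = torch.topk(scores.flatten(), k, largest=True).values.min()
    return (scores >= threshold).to(scores.dtype)

# Magnitude pruning instance: importance scores are |W|; inference uses a = (W ⊙ M) x.
torch.manual_seed(0)
W = torch.randn(8, 8)                     # a generic weight matrix
x = torch.randn(8)                        # an input vector
S = W.abs()                               # zeroth-order importance scores
M = topv_mask(S, keep_fraction=0.10)      # keep the top 10% of weights
a = (W * M) @ x                           # masked forward pass
print(f"kept {int(M.sum())}/{M.numel()} weights")
```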
We define $\text{Top}_v$ as a function which selects the $v\%$ highest values in $S$:

$$\text{Top}_v(S)_{i,j} = \begin{cases} 1, & S_{i,j} \text{ in top } v\% \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores $S = \big(|W_{i,j}|\big)_{1 \le i,j \le n}$ and masks $M = \text{Top}_v(S)$ (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence, and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.

|  | Magnitude pruning | $L_0$ regularization | Movement pruning | Soft movement pruning |
|---|---|---|---|---|
| Pruning Decision | 0th order | 1st order | 1st order | 1st order |
| Masking Function | $\text{Top}_v$ | Continuous Hard-Concrete | $\text{Top}_v$ | Thresholding |
| Pruning Structure | Local or Global | Global | Local or Global | Global |
| Learning Objective | $\mathcal{L}$ | $\mathcal{L} + \lambda_{l0}\,\mathbb{E}(L_0)$ | $\mathcal{L}$ | $\mathcal{L} + \lambda_{mvp}\,R(S)$ |
| Gradient Form | — | Gumbel-Softmax | Straight-Through | Straight-Through |
| Scores $S$ | $\lvert W_{i,j} \rvert$ | $-\sum_t \big(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\big)^{(t)} W_{i,j}^{(t)} f(\bar{S}_{i,j}^{(t)})$ | $-\sum_t \big(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\big)^{(t)} W_{i,j}^{(t)}$ | $-\sum_t \big(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\big)^{(t)} W_{i,j}^{(t)}$ |

Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of $f$ for $L_0$ regularization is detailed in Eq (3).

In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated, such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level $v$ during training using a cubic sparsity scheduler:

$$v^{(t)} = v_f + (v_i - v_f)\left(1 - \frac{t - t_i}{n \Delta t}\right)^3.$$

The sparsity level at time step $t$, $v^{(t)}$, is increased from an initial value $v_i$ (usually 0) to a final value $v_f$ over $n$ pruning steps after $t_i$ steps of warm-up. The model is thus pruned and trained jointly.

4 Movement Pruning

Magnitude pruning can be seen as utilizing zeroth-order information (the absolute value) of the running model. In this work, we focus on movement pruning methods, where importance is derived from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process. We consider two versions of movement pruning: hard and soft.

For (hard) movement pruning, masks are computed using the $\text{Top}_v$ function: $M = \text{Top}_v(S)$. Unlike magnitude pruning, during training we learn both the weights $W$ and their importance scores $S$. During the forward pass, we compute for all $i$: $a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k$.

Since the gradient of $\text{Top}_v$ is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, $\text{Top}_v$ is ignored and the gradient goes "straight through" to $S$. The approximation of the gradient of the loss $\mathcal{L}$ with respect to $S_{i,j}$ is given by

$$\frac{\partial \mathcal{L}}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} W_{i,j} x_j \qquad (2)$$

This implies that the scores of weights are updated even if these weights are masked in the forward pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
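The hard variant is straightforward to emulate with a custom autograd function: the forward pass applies the $\text{Top}_v$ mask to the learned scores, and the backward pass ignores $\text{Top}_v$ so the gradient flows straight through to $S$, reproducing Eq (2). The sketch below is our own minimal PyTorch rendition (not the authors' released code); it also includes a small helper for the cubic sparsity scheduler of Section 3, and the toy loss and all names are illustrative.

```python
import torch

class TopVMask(torch.autograd.Function):
    """Hard Top_v mask over importance scores, with a straight-through backward pass."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, keep_fraction: float) -> torch.Tensor:
        k = max(1, int(keep_fraction * scores.numel()))
        threshold = torch.topk(scores.flatten(), k, largest=True).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: ignore Top_v, pass the gradient to S unchanged.
        return grad_output, None

def cubic_sparsity(step: int, v_i: float, v_f: float, t_i: int, n: int, dt: int = 1) -> float:
    """v(t) = v_f + (v_i - v_f) * (1 - (t - t_i) / (n * dt))^3, clamped to the ramp interval."""
    progress = min(max(step - t_i, 0) / (n * dt), 1.0)
    return v_f + (v_i - v_f) * (1.0 - progress) ** 3

# Toy usage: both W and S receive gradients during fine-tuning.
torch.manual_seed(0)
W = torch.randn(8, 8, requires_grad=True)
S = torch.zeros(8, 8, requires_grad=True)
x = torch.randn(8)

sparsity = cubic_sparsity(step=100, v_i=0.0, v_f=0.9, t_i=50, n=200)
M = TopVMask.apply(S, 1.0 - sparsity)    # keep the (1 - sparsity) fraction of weights
a = (W * M) @ x                          # a_i = sum_k W_ik * M_ik * x_k
loss = a.pow(2).sum()                    # stand-in for the fine-tuning loss L
loss.backward()                          # S.grad matches Eq (2): (dL/da_i) * W_ij * x_j
```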
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter $v$ with a fixed global threshold value $\tau$ that controls the binary mask; the mask is calculated as $M = (S > \tau)$. In order to control the sparsity level, we add a regularization term $R(S) = \lambda_{mvp} \sum_{i,j} \sigma(S_{i,j})$, which encourages the importance scores to decrease over time.¹ The coefficient $\lambda_{mvp}$ controls the penalty intensity and thus the sparsity level.

¹ We also experimented with $\sum_{i,j} |S_{i,j}|$, but it turned out to be harder to tune while giving similar results.

Finally, we note that these approaches yield a similar update to $L_0$ regularization-based pruning, another movement-based pruning approach [Louizos et al., 2017]. Instead of straight-through, $L_0$ uses the hard-concrete distribution, where the mask $M$ is sampled for all $i,j$ with hyperparameters $b > 0$, $l < 0$, and $r > 1$:

$$u \sim \mathcal{U}(0,1), \qquad \bar{S}_{i,j} = \sigma\big((\log(u) - \log(1-u) + S_{i,j})/b\big),$$
$$Z_{i,j} = (r - l)\,\bar{S}_{i,j} + l, \qquad M_{i,j} = \min\big(1, \mathrm{ReLU}(Z_{i,j})\big).$$

The expected $L_0$ norm has a closed form involving the parameters of the hard-concrete: $\mathbb{E}(L_0) = \sum_{i,j} \sigma\big(S_{i,j} - b\log(-l/r)\big)$. Thus, the weights and scores of the model can be optimized in an end-to-end fashion to minimize the sum of the training loss $\mathcal{L}$ and the expected $L_0$ penalty. A coefficient $\lambda_{l0}$ controls the $L_0$ penalty and indirectly the sparsity level. Gradients take a similar form:

$$\frac{\partial \mathcal{L}}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} W_{i,j} x_j\, f(\bar{S}_{i,j}) \quad \text{where} \quad f(\bar{S}_{i,j}) = \frac{r - l}{b}\,\bar{S}_{i,j}\big(1 - \bar{S}_{i,j}\big)\,\mathbf{1}_{\{0 \le Z_{i,j} \le 1\}} \qquad (3)$$

At test time, a non-stochastic estimation of the mask is used: $\hat{M} = \min\big(1, \mathrm{ReLU}\big((r - l)\,\sigma(S) + l\big)\big)$, and weights multiplied by 0 can simply be discarded.

Figure 1: (a) Magnitude pruning. (b) Movement pruning. During fine-tuning (on MNLI), the weights stay close to their pre-trained values, which limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are plotted in grey. Magnitude pruning selects weights that are far from 0, while movement pruning selects weights that are moving away from 0.

Table 1 highlights the characteristics of each pruning method. The main differences are in the masking functions, pruning structure, and the final gradient form.

Method Interpretation

In movement pruning, the gradient of $\mathcal{L}$ with respect to $W_{i,j}$ is given by the standard gradient derivation: $\frac{\partial \mathcal{L}}{\partial W_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} M_{i,j} x_j$. By combining it with Eq (2), we have $\frac{\partial \mathcal{L}}{\partial S_{i,j}} = \frac{\partial \mathcal{L}}{\partial W_{i,j}} W_{i,j}$ (we omit the binary mask term $M_{i,j}$ for simplicity). From the gradient update in Eq (2), $S_{i,j}$ is increasing when $\frac{\partial \mathcal{L}}{\partial S_{i,j}} < 0$, which happens in two cases:

(a) $\frac{\partial \mathcal{L}}{\partial W_{i,j}} < 0$ and $W_{i,j} > 0$
(b) $\frac{\partial \mathcal{L}}{\partial W_{i,j}} > 0$ and $W_{i,j} < 0$

It means that during training $W_{i,j}$ is increasing while being positive, or decreasing while being negative. This is equivalent to saying that $S_{i,j}$ is increasing when $W_{i,j}$ is moving away from 0. Inversely, $S_{i,j}$ is decreasing when $\frac{\partial \mathcal{L}}{\partial S_{i,j}} > 0$, which means that $W_{i,j}$ is shrinking towards 0.

While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 ($|W_{i,j}|$), movement pruning selects the weights which are moving the most away from 0 ($S_{i,j}$). For this reason, magnitude pruning can be seen as a 0th-order method, whereas movement pruning is based on a 1st-order signal. In fact, $S$ can be seen as an accumulator of movement: from Eq (2), after $T$ gradient updates, we have

$$S_{i,j}^{(T)} = -\alpha_S \sum_{t < T} \left(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}, \qquad (4)$$

where $\alpha_S$ is the learning rate of $S$.
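For the soft variant described earlier in this section, a thresholded mask with the same straight-through trick plus the $\sigma$-based regularizer can be sketched as follows. Again, this is our own illustrative PyTorch rendition of the description above, not the authors' implementation; the toy loss and all names are ours.

```python
import torch

class ThresholdMask(torch.autograd.Function):
    """Soft movement pruning mask M = (S > tau), straight-through in the backward pass."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, tau: float) -> torch.Tensor:
        return (scores > tau).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # gradient passes straight through to S; none for tau

torch.manual_seed(0)
W = torch.randn(8, 8, requires_grad=True)
S = (0.01 * torch.randn(8, 8)).requires_grad_()
x = torch.randn(8)
tau, lambda_mvp = 0.0, 0.1                          # global threshold and penalty coefficient

M = ThresholdMask.apply(S, tau)
a = (W * M) @ x
task_loss = a.pow(2).sum()                          # stand-in for the fine-tuning loss L
reg = lambda_mvp * torch.sigmoid(S).sum()           # R(S) = lambda_mvp * sum_ij sigma(S_ij)
(task_loss + reg).backward()                        # the penalty pushes all scores downward
```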
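For comparison, here is a sketch of the hard-concrete gate used by the $L_0$-regularization baseline described above. Unlike the straight-through variants, gradients reach $S$ through the reparameterized sample. The hyperparameter values ($b = 2/3$, $l = -0.1$, $r = 1.1$) are commonly used defaults and an assumption on our part, as are all function and variable names.

```python
import math
import torch

def hard_concrete_mask(S, b=2/3, l=-0.1, r=1.1):
    """Stochastic gate: M = min(1, ReLU((r - l) * sigma((log u - log(1-u) + S) / b) + l))."""
    u = torch.rand_like(S).clamp(1e-6, 1 - 1e-6)
    S_bar = torch.sigmoid((torch.log(u) - torch.log(1 - u) + S) / b)
    return torch.clamp((r - l) * S_bar + l, min=0.0, max=1.0)

def expected_l0(S, b=2/3, l=-0.1, r=1.1):
    """Closed-form E[L0] = sum_ij sigma(S_ij - b * log(-l / r))."""
    return torch.sigmoid(S - b * math.log(-l / r)).sum()

torch.manual_seed(0)
W = torch.randn(8, 8, requires_grad=True)
S = torch.zeros(8, 8, requires_grad=True)
x = torch.randn(8)
lambda_l0 = 0.1

M = hard_concrete_mask(S)                 # differentiable in S via the reparameterized sample
a = (W * M) @ x
loss = a.pow(2).sum() + lambda_l0 * expected_l0(S)
loss.backward()

# Test-time deterministic mask: M_hat = min(1, ReLU((r - l) * sigmoid(S) + l))
M_hat = torch.clamp(1.2 * torch.sigmoid(S.detach()) - 0.1, min=0.0, max=1.0)
```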