Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu
Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
1 Introduction
Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields. In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training them has high environmental costs [Strubell et al., 2019].
Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost in accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective way to compress models for deployment to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016] and language processing [Gale et al., 2019], and has more recently been leveraged as a core component of the lottery ticket hypothesis [Frankle et al., 2019].
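As a concrete illustration of this zeroth-order criterion, the sketch below selects the weights to keep purely by absolute value. It is a minimal PyTorch-style sketch rather than the exact procedure of Han et al.; the function name, the sparsity level, and the strict threshold are illustrative assumptions.

    import torch

    def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        """Return a {0,1} mask that zeroes out the `sparsity` fraction of
        entries with the smallest absolute value. The kept set depends only
        on the current weight values (0th-order), not on how the weights
        moved during training."""
        k = int(sparsity * weight.numel())  # number of weights to drop
        if k == 0:
            return torch.ones_like(weight)
        threshold = weight.abs().flatten().kthvalue(k).values
        return (weight.abs() > threshold).to(weight.dtype)

    # Example: prune a random 768x768 matrix to 90% sparsity.
    W = torch.randn(768, 768)
    M = magnitude_prune(W, sparsity=0.9)
    print(f"remaining weights: {M.mean().item():.2%}")  # ~10%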
While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."
In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and high values can be pruned if they shrink during training. This strategy moves the selection criterion from the 0th order to the 1st order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
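To make this concrete, the sketch below keeps a score matrix S alongside each weight matrix W, masks W with the top-v entries of S in the forward pass, and lets gradients flow to S as if the selection were the identity (the straight-through estimator). It is a simplified reading of the approach rather than a verbatim implementation; the class names, the score initialization, and the keep_ratio value are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TopV(torch.autograd.Function):
        """Binary top-v mask whose backward pass is the straight-through
        estimator: the incoming gradient is handed to the scores unchanged."""

        @staticmethod
        def forward(ctx, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
            k = max(1, int(keep_ratio * scores.numel()))
            kth_largest = scores.flatten().kthvalue(scores.numel() - k + 1).values
            return (scores >= kth_largest).to(scores.dtype)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None  # straight-through; no gradient for keep_ratio

    class MaskedLinear(nn.Module):
        """Linear layer whose weights are masked by learned importance scores."""

        def __init__(self, in_features: int, out_features: int, keep_ratio: float = 0.1):
            super().__init__()
            self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features))
            # Small random score init so the top-v selection is well defined
            # from the first step (an illustrative choice).
            self.scores = nn.Parameter(1e-2 * torch.randn(out_features, in_features))
            self.keep_ratio = keep_ratio

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            mask = TopV.apply(self.scores, self.keep_ratio)
            # Both weights and scores receive gradients from the task loss,
            # so the kept set can change ("move") during fine-tuning.
            return x @ (self.weight * mask).t()

Training such a layer with a standard optimizer updates both W and S, so the selected subnetwork adapts to the end-task objective rather than being fixed by the pretrained weight magnitudes.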
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing greater ability to adapt to the end-task.
2 Related Work
In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to convolutional networks. Differing from our method, these approaches keep the weights of the model fixed (either from a randomly initialized network or a pretrained network) and update only the scores to find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the square of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.
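To make this contrast concrete, the selection criteria can be written side by side. This is a schematic summary consistent with the descriptions above, not the paper's exact equations; L denotes the fine-tuning loss, t indexes training steps, and alpha_S is a score learning rate.

    % 0th-order (magnitude): keep weights with large absolute value.
    \mathrm{Importance}^{\text{mag}}_{i,j} = |W_{i,j}|

    % Gradient-magnitude criteria (Theis et al., 2018; Ding et al., 2019):
    \mathrm{Importance}^{\text{grad}}_{i,j}
        = \Big|\tfrac{\partial \mathcal{L}}{\partial W_{i,j}}\Big|
        \quad \text{or} \quad
        \Big(\tfrac{\partial \mathcal{L}}{\partial W_{i,j}}\Big)^{2}

    % Sign-preserving 1st-order (movement) score: accumulate the signed
    % product of gradient and weight over fine-tuning steps.
    S_{i,j} = -\alpha_S \sum_{t} \Big(\tfrac{\partial \mathcal{L}}{\partial W_{i,j}}\Big)^{(t)} W_{i,j}^{(t)}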
Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach.
Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates with little or no loss in performance. As shown in previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.
3 Background: Score-Based Pruning
We first establish shared notation for discussing different neural network pruning strategies. Let W ∈ R^{n×n}