More documents for Corpus
Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu
arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in natural language processing and related fields. In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training them has a high environmental cost [Strubell et al., 2019].

Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost in accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective way to compress models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016] and in language processing [Gale et al., 2019], and has more recently been leveraged as a core component of the lottery ticket hypothesis [Frankle et al., 2019].
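To make the magnitude criterion concrete, here is a minimal sketch, assuming a PyTorch-style setup, of a top-k magnitude mask; the function name and the keep_ratio parameter are illustrative choices rather than anything specified in the paper.

import torch

def magnitude_prune_mask(weight: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Zeroth-order (magnitude) pruning: keep the keep_ratio fraction of
    weights with the largest absolute value and zero out the rest."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = torch.topk(weight.abs().flatten(), k).values.min()
    return (weight.abs() >= threshold).float()

# Example: sparsify a weight matrix to 10% density.
# w_pruned = w * magnitude_prune_mask(w, keep_ratio=0.10)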
While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."
In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that weights with both low and high values can be pruned if they shrink during training. This strategy moves the selection criterion from 0th-order to 1st-order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
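As a rough illustration of how such a deterministic, score-based variant can be wired up, the PyTorch-style sketch below keeps the top-scoring weights in the forward pass and lets the gradient flow straight through the non-differentiable masking step to the importance scores; the class names, the keep_ratio parameter, and the initialization choices are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class TopKStraightThrough(torch.autograd.Function):
    """Binary top-k mask in the forward pass; identity gradient in the
    backward pass (straight-through estimator), so the importance scores
    still receive gradients from the fine-tuning loss."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient w.r.t. the mask to the scores unchanged.
        return grad_output, None

class MaskedLinear(nn.Module):
    """Linear layer whose weights are multiplied by a learned binary mask.
    Because the scores are trained with the task gradient, weights that move
    away from zero during fine-tuning tend to be kept (a movement-style
    criterion). Illustrative sketch only, not the paper's exact recipe."""

    def __init__(self, in_features, out_features, keep_ratio=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.scores = nn.Parameter(torch.zeros(out_features, in_features))
        self.keep_ratio = keep_ratio
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x):
        mask = TopKStraightThrough.apply(self.scores, self.keep_ratio)
        return nn.functional.linear(x, self.weight * mask, self.bias)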
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing a greater ability to adapt to the end-task.
2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to convolutional networks. Differing from our method, these approaches keep the weights of the model fixed (either from a randomly initialized network or a pre-trained network) and update only the scores to find a good sparse subnetwork.

Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the square of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.
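To make the contrast between these criteria concrete, the short sketch below computes three candidate importance scores for a single weight matrix; the instantaneous movement-style score -w * dL/dw is an illustrative simplification of a direction-preserving first-order criterion, not the exact score used in this paper.

import torch

def importance_scores(weight: torch.Tensor, grad: torch.Tensor):
    """Compare pruning criteria on one weight matrix (illustrative only):
    - magnitude (0th-order): keep weights with large |w|
    - gradient magnitude (1st-order, direction ignored): keep large |dL/dw|
    - movement-style (1st-order, direction kept): keep weights moving away
      from zero, i.e. where -w * dL/dw is large and positive."""
    magnitude = weight.abs()
    grad_magnitude = grad.abs()
    movement = -weight * grad  # positive when gradient descent increases |w|
    return magnitude, grad_magnitude, movement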
Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch; this differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates at little or no cost in performance. As shown in previous work [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.
3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let W ∈ R^n