Last paper added to corpus
This commit is contained in:
parent
266e371642
commit
b208cacbf4
7535
Corpus/CORPUS.txt
File diff suppressed because it is too large
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -1,535 +0,0 @@
The 4 Research Techniques to Train Deep Neural Network Models More Efficiently

James Le
Oct 29, 2019 · 9 min read
Photo by Victor Freitas on Unsplash
Deep learning and unsupervised feature learning have shown great promise in many practical applications. State-of-the-art performance has been reported in several domains, ranging from speech recognition and image recognition to text processing and beyond.

It's also been observed that increasing the scale of deep learning—with respect to numbers of training examples, model parameters, or both—can drastically improve accuracy. These results have led to a surge of interest in scaling up the training and inference algorithms used for these models and in improving optimization techniques for both.

The use of GPUs is a significant advance in recent years that makes the training of modestly-sized deep networks practical. A known limitation of the GPU approach is that the training speed-up is small when the model doesn't fit in a GPU's memory (typically less than 6 gigabytes).

To use a GPU effectively, researchers often reduce the size of the dataset or parameters so that CPU-to-GPU transfers are not a significant bottleneck. While data and parameter reduction work well for small problems (e.g. acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images).

In the previous post, we talked about 5 different algorithms for efficient deep learning inference. In this article, we'll discuss the upper right part of the quadrant on the left. What are the best research techniques to train deep neural networks more efficiently?
1 — Parallelization Training
Let's start with parallelization. As the figure below shows, the number of transistors keeps increasing over the years. But single-threaded performance and frequency are plateauing in recent years. Interestingly, the number of cores is increasing.

So what we really need to know is how to parallelize the problem to take advantage of parallel processing. There are a lot of opportunities to do that in deep neural networks.

For example, we can do data parallelism: feeding 2 images into the same model and running them at the same time. This does not affect latency for any single input. It doesn't make it shorter, but it makes the batch size larger. It also requires coordinated weight updates during training.

For example, in Jeff Dean's paper "Large Scale Distributed Deep Networks," there's a parameter server (as a master) and a couple of model workers (as slaves) running their own pieces of training data and updating the gradient to the master.
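To make the parameter-server idea concrete, here is a minimal NumPy sketch of synchronous data-parallel SGD on a toy linear model. It is only an illustration: the names (ParameterServer, worker_gradient) are made up for this example, and the DistBelief-style system described in the paper applies worker updates asynchronously rather than in lockstep.

# Minimal sketch of parameter-server style data parallelism (illustrative only).
# A master holds the weights; each worker computes a gradient on its own shard
# of the data and sends it back to be averaged and applied.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))                              # toy inputs
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)    # toy targets

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def apply_gradients(self, grads):
        # Average the gradients contributed by all workers, then update.
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_gradient(w, X_shard, y_shard):
    # Every worker runs the same model on its own slice of the batch.
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

server = ParameterServer(dim=10)
shards = np.array_split(np.arange(len(y)), 4)               # 4 simulated workers
for step in range(100):
    grads = [worker_gradient(server.w, X[idx], y[idx]) for idx in shards]
    server.apply_gradients(grads)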
Another idea is model parallelism — splitting up the model and distributing each part to different processors or different threads. For example, imagine we want to run convolution in the image below by doing a 6-dimension "for" loop. What we can do is cut the input image by 2x2 blocks, so that each thread/processor handles 1/4 of the image. Also, we can parallelize the convolutional layers by the output or input feature map regions, and the fully-connected layers by the output activation.
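As a small illustration of that last point, the sketch below parallelizes a fully-connected layer by its output activations: the weight matrix is split column-wise so that each hypothetical worker computes a disjoint slice of the output for the same input. It is a NumPy toy, not the layout used by any particular framework.

# Minimal sketch of model parallelism for a fully-connected layer (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=128)               # one input activation vector
W = rng.normal(size=(128, 64))         # full layer: 128 inputs -> 64 outputs
b = rng.normal(size=64)

n_workers = 4
W_shards = np.split(W, n_workers, axis=1)   # each worker owns 64/4 = 16 output units
b_shards = np.split(b, n_workers)

# Each worker computes its own slice; concatenating recovers the full output.
partial = [x @ W_k + b_k for W_k, b_k in zip(W_shards, b_shards)]
y_parallel = np.concatenate(partial)

assert np.allclose(y_parallel, x @ W + b)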
2 — Mixed Precision Training
Larger models usually require more compute and memory resources to train. These requirements can be lowered by using reduced precision representation and arithmetic.

Performance (speed) of any program, including neural network training and inference, is limited by one of three factors: arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced precision math. For example, half-precision math throughput in recent GPUs is 2× to 8× higher than for single-precision. In addition to speed improvements, reduced precision formats also reduce the amount of memory required for training.

Modern deep learning training systems use a single-precision (FP32) format. In their paper "Mixed Precision Training," researchers from NVIDIA and Baidu addressed training with reduced precision while maintaining model accuracy.

Specifically, they trained various neural networks using the IEEE half-precision format (FP16). Since FP16 format has a narrower dynamic range than FP32, they introduced three techniques to prevent model accuracy loss: maintaining a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros, and FP16 arithmetic with accumulation in FP32.
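The sketch below imitates those three ingredients on a toy regression problem in NumPy: an FP32 master copy of the weights, an FP16 working copy for the forward and backward pass, and a constant loss scale so that small gradients survive the cast to FP16. It is only conceptual (frameworks implement this inside their optimizers and also accumulate FP16 products in FP32 inside the matrix multiply, which plain NumPy cannot express), and the loss-scale value is an arbitrary choice.

# Conceptual sketch of mixed precision training (illustrative, not a framework API).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16)).astype(np.float32)
y = (X @ rng.normal(size=16)).astype(np.float32)

master_w = np.zeros(16, dtype=np.float32)   # FP32 master copy of the weights
loss_scale = 128.0                          # constant loss scaling factor
lr = 0.01
n = len(y)

for step in range(200):
    w16 = master_w.astype(np.float16)                            # FP16 working copy
    err16 = X.astype(np.float16) @ w16 - y.astype(np.float16)    # forward pass in FP16

    # Scale the error (i.e. the loss) before the backward pass so gradients
    # that are too small for FP16 are not flushed to zero.
    scaled_grad16 = X.astype(np.float16).T @ (err16 * np.float16(loss_scale / n))

    # Unscale in FP32 and apply the update to the FP32 master weights.
    grad = scaled_grad16.astype(np.float32) / loss_scale
    master_w -= lr * grad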
Using these techniques, they demonstrated that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training. Experimental results include convolutional and recurrent network architectures, trained for classification, regression, and generative tasks.

Applications include image classification, image generation, object detection, language modeling, machine translation, and speech recognition. The proposed methodology requires no changes to models or training hyperparameters.
3 — Model Distillation
Model distillation refers to the idea of model compression by teaching a smaller network exactly what to do, step-by-step, using a bigger, already-trained network. The 'soft labels' refer to the output feature maps by the bigger network after every convolution layer. The smaller network is then trained to learn the exact behavior of the bigger network by trying to replicate its outputs at every level (not just the final loss).

The method was first proposed by Bucila et al., 2006 and generalized by Hinton et al., 2015. In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is — the output of a softmax function on the teacher model's logits.
So how exactly do teacher-student networks work?

The highly-complex teacher network is first trained separately using the complete dataset. This step requires high computational performance and thus can only be done offline (on high-performing GPUs).

While designing a student network, correspondence needs to be established between intermediate outputs of the student network and the teacher network. This correspondence can involve directly passing the output of a layer in the teacher network to the student network, or performing some data augmentation before passing it to the student network.

Next, the data are forward-passed through the teacher network to get all intermediate outputs, and then data augmentation (if any) is applied to them.

Finally, the teacher network's outputs are used as targets: the error between them and the student's corresponding outputs is back-propagated through the student network, so that the student network learns to replicate the behavior of the teacher network.
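A minimal sketch of the distillation loss described above, in NumPy: the student is trained to match the teacher's softened class probabilities (the soft labels), optionally blended with the ordinary hard-label cross-entropy. The temperature and mixing weight are illustrative choices, not values prescribed by the article.

# Minimal sketch of a knowledge distillation loss (illustrative values).
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: teacher probabilities at a raised temperature.
    soft_targets = softmax(teacher_logits, temperature)
    soft_preds = softmax(student_logits, temperature)
    soft_loss = -np.mean(np.sum(soft_targets * np.log(soft_preds + 1e-12), axis=-1))
    # (Hinton et al. additionally scale this term by temperature**2 so its
    # gradients stay comparable to the hard-label term.)

    # Ordinary cross-entropy against the ground-truth labels.
    hard_preds = softmax(student_logits)
    idx = np.arange(len(hard_labels))
    hard_loss = -np.mean(np.log(hard_preds[idx, hard_labels] + 1e-12))

    # alpha controls how much the student imitates the teacher.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a batch of 8 examples with 10 classes and random logits.
rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(8, 10)), rng.normal(size=(8, 10)),
                         rng.integers(0, 10, size=8))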
4 — Dense-Sparse-Dense Training
The research paper "Dense-Sparse-Dense Training for Deep Neural Networks" was published back in 2017 by researchers from Stanford, NVIDIA, Baidu, and Facebook. Applying Dense-Sparse-Dense (DSD) takes 3 sequential steps:

Dense: Normal neural net training…business as usual. It's notable that even though DSD acts as a regularizer, the usual regularization methods such as dropout and weight regularization can be applied as well. The authors don't mention batch normalization, but it would work as well.

Sparse: We regularize the network by removing connections with small weights. From each layer in the network, a percentage of the layer's weights that are closest to 0 in absolute value is selected to be pruned. This means that they are set to 0 at each training iteration. It's worth noting that the pruned weights are selected only once, not at each SGD iteration. Eventually, the network recovers the pruned weights' knowledge and condenses it in the remaining ones. We train this sparse net until convergence. (A rough sketch of this pruning step follows the three steps.)

Dense: First, we re-enable the pruned weights from the previous step. The net is again trained normally until convergence. This step increases the capacity of the model. It can use the recovered capacity to store new knowledge. The authors note that the learning rate should be 1/10th of the original. Since the model is already performing well, the lower learning rate helps preserve the knowledge gained in the previous step.
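Here is a rough sketch of the Sparse step referenced above: a per-layer mask zeroes the weights closest to zero in absolute value, is chosen once, and is re-applied after every update while the sparse network trains. The 50% sparsity figure and the random "gradient" are placeholders for illustration.

# Rough sketch of the DSD "Sparse" step (illustrative).
import numpy as np

def magnitude_mask(weights, sparsity):
    # Return a 0/1 mask that zeroes the `sparsity` fraction of weights
    # closest to 0 in absolute value. The mask is chosen once, not per step.
    k = int(sparsity * weights.size)
    threshold = np.sort(np.abs(weights), axis=None)[k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
mask = magnitude_mask(W, sparsity=0.5)   # example: prune 50% of this layer
W *= mask                                # pruned weights are set to zero

for step in range(10):                   # stand-in for the sparse training phase
    grad = rng.normal(size=W.shape)      # placeholder for a real gradient
    W -= 0.01 * grad
    W *= mask                            # keep pruned weights at zero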
Removing pruning in the dense step allows the training to escape saddle points to eventually reach a better minimum. This lower minimum corresponds to improved training and validation metrics.

Saddle points are areas in the multidimensional space of the model that might not be a good solution but are hard to escape from. The authors hypothesize that the lower minimum is achieved because the sparsity in the network moves the optimization problem to a lower-dimensional space. This space is more robust to noise in the training data.

The authors tested DSD on image classification (CNN), caption generation (RNN), and speech recognition (LSTM). The proposed method improved accuracy across all three tasks. It's quite remarkable that DSD works across domains.

DSD improved all CNN models tested — ResNet50, VGG, and GoogLeNet. The improvement in absolute top-1 accuracy was respectively 1.12%, 4.31%, and 1.12%. This corresponds to a relative improvement of 4.66%, 13.7%, and 3.6%. These results are remarkable for such finely-tuned models!
DSD was applied to NeuralTalk, an amazing model that generates a description from an image. To verify that the Dense-Sparse-Dense method works on an LSTM, the CNN part of NeuralTalk is frozen and only the LSTM layers are trained. Very high pruning (80%, determined on the validation set) was applied at the Sparse step. Still, this gives the NeuralTalk BLEU score an average relative improvement of 6.7%. It's fascinating that such a minor adjustment produces this much improvement.

Applying DSD to speech recognition (Deep Speech 1) achieves an average relative improvement in Word Error Rate of 3.95%. On a similar but more advanced Deep Speech 2 model, Dense-Sparse-Dense is applied iteratively two times: on the first iteration 50% of the weights are pruned, and on the second 25%. After these two DSD iterations, the average relative improvement is 6.5%.
Conclusion
I hope that I've managed to explain these research techniques for efficient training of deep neural networks in a transparent way. Work on this post allowed me to grasp how novel and clever these techniques are. A solid understanding of these approaches will allow you to incorporate them into your model training procedure when needed.
@@ -1,678 +0,0 @@
The State of Sparsity in Deep Neural Networks

Trevor Gale *1†   Erich Elsen *2   Sara Hooker 1†
Abstract

We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. Additionally, we repeat the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.

arXiv:1902.09574v1 [cs.LG] 25 Feb 2019

1. Introduction

Deep neural networks achieve state-of-the-art performance in a variety of domains including image classification (He et al., 2016), machine translation (Vaswani et al., 2017), and text-to-speech (van den Oord et al., 2016; Kalchbrenner et al., 2018). While model quality has been shown to scale with model and dataset size (Hestness et al., 2017), the resources required to train and deploy large neural networks can be prohibitive. State-of-the-art models for tasks like image classification and machine translation commonly have tens of millions of parameters, and require billions of floating-point operations to make a prediction for a single input sample.

Sparsity has emerged as a leading approach to address these challenges. By sparsity, we refer to the property that a subset of the model parameters have a value of exactly zero [2]. With zero valued weights, any multiplications (which dominate neural network computation) can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. It has been shown empirically that deep neural networks can tolerate high levels of sparsity (Han et al., 2015; Narang et al., 2017; Ullrich et al., 2017), and this property has been leveraged to significantly reduce the cost associated with the deployment of deep neural networks, and to enable the deployment of state-of-the-art models in severely resource constrained environments (Theis et al., 2018; Kalchbrenner et al., 2018; Valin & Skoglund, 2018).

Over the past few years, numerous techniques for inducing sparsity have been proposed and the set of models and datasets used as benchmarks has grown too large to reasonably expect new approaches to explore them all. In addition to the lack of standardization in modeling tasks, the distribution of benchmarks tends to slant heavily towards convolutional architectures and computer vision tasks, and the tasks used to evaluate new techniques are frequently not representative of the scale and complexity of real-world tasks where model compression is most useful. These characteristics make it difficult to come away from the sparsity literature with a clear understanding of the relative merits of different approaches.

In addition to practical concerns around comparing techniques, multiple independent studies have recently proposed that the value of sparsification in neural networks has been misunderstood (Frankle & Carbin, 2018; Liu et al., 2018). While both papers suggest that sparsification can be viewed as a form of neural architecture search, they disagree on what is necessary to achieve this. Specifically, Liu et al. (2018) re-train learned sparse topologies with a random weight initialization, whereas Frankle & Carbin (2018) posit that the exact random weight initialization used when the sparse architecture was learned is needed to match the test set performance of the model sparsified during optimization.

In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and l0 regularization to models of this scale. While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification [3].

* Equal contribution. † This work was completed as part of the Google AI Residency. 1 Google Brain. 2 DeepMind. Correspondence to: Trevor Gale <tgale@google.com>.
[2] The term sparsity is also commonly used to refer to the proportion of a neural network's weights that are zero valued. Higher sparsity corresponds to fewer weights, and smaller computational and storage requirements. We use the term in this way throughout this paper.
[3] https://bit.ly/2ExE8Yj

2. Sparsity in Neural Networks

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks.

Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Ström, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017).

Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b;a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.

Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).

Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).

A key feature of a sparsity inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3. Evaluating Sparsification Techniques at Scale

As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and l0 regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing the lower bound of compression-accuracy trade-off any method should be expected to achieve.

Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A.
Table 1. Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

    Hyperparameter           Value
    dataset                  translatewmtendepacked
    training iterations      500000
    batch size               2048 tokens
    learning rate schedule   standard transformerbase
    optimizer                Adam
    sparsity range           50% - 98%
    beam search              beam size 4; length penalty 0.6

3.1. Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model pruning library [4]. This technique allows for masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).
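For reference, here is a minimal sketch of a gradual sparsification schedule with sorting-based thresholding of the kind described in Section 3.1; the cubic ramp follows the form proposed by Zhu & Gupta (2017), and the step counts are illustrative only.

# Sketch of gradual magnitude pruning: sparsity ramps from s_i to s_f along a
# cubic curve, and weights below the resulting magnitude threshold are masked.
import numpy as np

def target_sparsity(step, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1000):
    # Cubic ramp from initial sparsity s_i to final sparsity s_f.
    if step < t0:
        return s_i
    progress = min(1.0, (step - t0) / (n * dt))
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    # Sorting-based thresholding: zero out the smallest-magnitude fraction.
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
for step in range(0, 100001, 10000):          # recompute the mask periodically
    mask = magnitude_mask(W, target_sparsity(step))
    # In training, the mask is applied to the weights (and masked weights may
    # reactivate when it is recomputed); here we only report achieved sparsity.
    print(step, round(1.0 - mask.mean(), 3))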
3.2. Variational Dropout

Variational dropout was originally proposed as a re-interpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).

Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model.

3.3. l0 Regularization

l0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.

3.4. Random Pruning Baseline

For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned each step at random rather than based on magnitude, and does not allow pruned weights to reactivate. This technique is intended to represent a lower-bound of the accuracy-sparsity trade-off curve.

3.5. Experimental Framework

For magnitude pruning, we used the TensorFlow model pruning library. We implemented variational dropout and l0 regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our l0 regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and l0 regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model pruning library to randomly select weights as opposed to sorting them based on magnitude.

For each model, we kept the number of training steps constant across all techniques and performed extensive hyperparameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and l0 regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments.

[4] https://bit.ly/2T8hBGn
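To make the reparameterization in Section 3.3 above concrete, here is a rough sketch of sampling a hard-concrete gate and the corresponding expected-l0 penalty, following Louizos et al. (2017b); the constants beta, gamma, and zeta are the defaults commonly used with that method and should be read as assumptions here, not values taken from this paper.

# Rough sketch of the hard-concrete gate used for l0 regularization (illustrative).
import numpy as np

beta, gamma, zeta = 2.0 / 3.0, -0.1, 1.1      # temperature and stretch interval

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_gate(log_alpha, rng):
    # Sample stochastic gates z in [0, 1] via the reparameterization trick.
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma        # stretch to (gamma, zeta)
    return np.clip(s_bar, 0.0, 1.0)           # hard clamp, so exact zeros occur

def expected_l0(log_alpha):
    # Probability that each gate is non-zero (1 minus the CDF at zero);
    # summed over parameters this gives the differentiable l0 penalty.
    return sigmoid(log_alpha - beta * np.log(-gamma / zeta))

rng = np.random.default_rng(0)
log_alpha = rng.normal(size=128)              # learnable gate parameters
theta = rng.normal(size=128)                  # underlying weights
effective_weights = theta * sample_gate(log_alpha, rng)
penalty = expected_l0(log_alpha).sum()        # added to the training loss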
Figure 1. Sparsity-BLEU trade-off curves for the Transformer. Top: Pareto frontiers for each of the four sparsification techniques applied to the Transformer. Bottom: All experimental results with each technique. Despite the diversity of approaches, the relative performance of all three techniques is remarkably consistent. Magnitude pruning notably outperforms more complex techniques for high levels of sparsity.

Figure 2. Average sparsity in Transformer layers. Distributions calculated on the top performing model at 90% sparsity for each technique. l0 regularization and variational dropout are able to learn non-uniform distributions of sparsity, while magnitude pruning induces user-specified sparsity distributions (in this case, uniform).

4. Sparse Neural Machine Translation

We adapted the Transformer (Vaswani et al., 2017) model for neural machine translation to use these four sparsification techniques, and trained the model on the WMT 2014 English-German dataset. We sparsified all fully-connected layers and embeddings, which make up 99.87% of all of the parameters in the model (the other parameters coming from biases and layer normalization). The constant hyperparameters used for all experiments are listed in Table 1. We followed the standard training procedure used by Vaswani et al. (2017), but did not perform checkpoint averaging. This setup yielded a baseline BLEU score of 27.29 averaged across five runs.

We extensively tuned the remaining hyperparameters for each technique. Details on what hyperparameters we explored, and the results of what settings produced the best models can be found in Appendix D.

4.1. Sparse Transformer Results & Analysis

All results for the Transformer are plotted in Figure 1. Despite the vast differences in these approaches, the relative performance of all three techniques is remarkably consistent. While l0 regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models. While all techniques were able to outperform the random pruning technique, randomly removing weights produces surprisingly reasonable results, which is perhaps indicative of the model's ability to recover from damage during optimization.

What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights for each layer. This is in stark contrast to variational dropout and l0 regularization, where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a non-uniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and l0 regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.

Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both l0 regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules, and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, l0 regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.

It is also important to note that these results maintain a constant number of training steps across all techniques and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than l0 regularization and variational dropout respectively. While the standard Transformer training scheme produces excellent results for machine translation, it has been shown that training the model for longer can improve its performance by as much as 2 BLEU (Ott et al., 2018). Thus, when compared for a fixed training cost, magnitude pruning has a distinct advantage over these more complicated techniques.
Table 2. Constant hyperparameters for all RN50 experiments.

    Hyperparameter           Value
    dataset                  ImageNet
    training iterations      128000
    batch size               1024 images
    learning rate schedule   standard
    optimizer                SGD with Momentum
    sparsity range           50% - 98%

Figure 3. Sparsity-accuracy trade-off curves for ResNet-50. Top: Pareto frontiers for variational dropout, magnitude pruning, and random pruning applied to ResNet-50. Bottom: All experimental results with each technique. We observe large variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50. Magnitude pruning and variational dropout achieve comparable performance for most sparsity levels, with variational dropout achieving the best results for high sparsity levels.

5. Sparse Image Classification

To benchmark these four sparsity techniques on a large-scale computer vision task, we integrated each method into ResNet-50 and trained the model on the ImageNet large-scale image classification dataset. We sparsified all convolutional and fully-connected layers, which make up 99.79% of all of the parameters in the model (the other parameters coming from biases and batch normalization).

The hyperparameters we used for all experiments are listed in Table 2. Each model was trained for 128000 iterations with a batch size of 1024 images, stochastic gradient descent with momentum, and the standard learning rate schedule (see Appendix E.1). This setup yielded a baseline top-1 accuracy of 76.69% averaged across three runs. We trained each model with 8-way data parallelism across 8 accelerators. Due to the extra parameters and operations required for variational dropout, the model was unable to fit into device memory in this configuration. For all variational dropout experiments, we used a per-device batch size of 32 images and scaled the model over 32 accelerators.

5.1. ResNet-50 Results & Analysis

Figure 3 shows results for magnitude pruning, variational dropout, and random pruning applied to ResNet-50. Surprisingly, we were unable to produce sparse ResNet-50 models with l0 regularization that did not significantly damage model quality. Across hundreds of experiments, our models were either able to achieve full test set performance with no sparsification, or sparsification with test set performance akin to random guessing. Details on all hyperparameter settings explored are included in Appendix E.

This result is particularly surprising given the success of l0 regularization on Transformer. One nuance of the l0 regularization technique of Louizos et al. (2017b) is that the model can have varying sparsity levels between the training and test-time versions of the model. At training time, a parame- [...] will be non-zero [5]. Louizos et al. (2017b) reported results applying l0 regularization to a wide residual network (WRN) (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset, and noted that they observed small accuracy loss at as low as 8% reduction in the number of parameters during training. Applying our weight-level l0 regularization implementation to WRN produces a model with comparable training time sparsity, but with no sparsity in the test-time parameters. For models that achieve test-time sparsity, we observe significant accuracy degradation on CIFAR-10. This result is consistent with our observation for l0 regularization applied to ResNet-50 on ImageNet.

The variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50 is striking. While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind l0 regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces [...]

[5] The fraction of time a parameter is set to zero during training depends on other factors, e.g. the [...]