Published as a conference paper at ICLR 2019
THE LOTTERY TICKET HYPOTHESIS:
FINDING SPARSE, TRAINABLE NEURAL NETWORKS

Jonathan Frankle                          Michael Carbin
MIT CSAIL                                 MIT CSAIL
jfrankle@csail.mit.edu                    mcarbin@csail.mit.edu

ABSTRACT
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.

We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
1 INTRODUCTION
Techniques for eliminating unnecessary weights from neural networks (pruning) (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; Li et al., 2016) can reduce parameter counts by more than 90% without harming accuracy. Doing so decreases the size (Han et al., 2015; Hinton et al., 2015) or energy consumption (Yang et al., 2017; Molchanov et al., 2016; Luo et al., 2017) of the trained networks, making inference more efficient. However, if a network can be reduced in size, why do we not train this smaller architecture instead, in the interest of making training more efficient as well? Contemporary experience is that the architectures uncovered by pruning are harder to train from the start, reaching lower accuracy than the original networks.¹
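As a concrete illustration of the unstructured pruning these works apply, the following is a minimal sketch (not the exact procedure of any cited paper) that masks out the smallest-magnitude weights of one layer; the layer shape and the 90% pruning rate are assumptions chosen for illustration.

    import numpy as np

    def magnitude_prune(weights: np.ndarray, fraction: float) -> np.ndarray:
        """Binary mask that zeroes out the `fraction` of weights with smallest magnitude."""
        threshold = np.quantile(np.abs(weights), fraction)
        return (np.abs(weights) > threshold).astype(weights.dtype)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(300, 100))           # weights of one dense layer (illustrative shape)
    mask = magnitude_prune(w, fraction=0.90)  # prune roughly 90% of the layer's weights
    print(f"{mask.mean():.1%} of weights remain")  # roughly 10%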
Consider an example. In Figure 1, we randomly sample and train subnetworks from a fully-connected network for MNIST and convolutional networks for CIFAR10. Random sampling models the effect of the unstructured pruning used by LeCun et al. (1990) and Han et al. (2015). Across various levels of sparsity, dashed lines trace the iteration of minimum validation loss² and the test accuracy at that iteration. The sparser the network, the slower the learning and the lower the eventual test accuracy.
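A minimal sketch of how such randomly sampled subnetworks can be produced is shown below; the Lenet-style layer shapes and the sparsity levels are assumptions chosen to mirror Figure 1, and training the masked networks is left out.

    import numpy as np

    def random_mask(shape, percent_remaining, rng):
        """Binary mask keeping roughly `percent_remaining`% of weights, chosen uniformly at random."""
        return (rng.random(shape) < percent_remaining / 100.0).astype(np.float32)

    rng = np.random.default_rng(0)
    layer_shapes = {"fc1": (784, 300), "fc2": (300, 100), "out": (100, 10)}  # Lenet-like (assumed)
    for percent in [100, 41.1, 16.9, 7.0, 2.9, 1.2]:  # sparsity sweep as in Figure 1
        masks = {name: random_mask(shape, percent, rng) for name, shape in layer_shapes.items()}
        kept = sum(m.sum() for m in masks.values()) / sum(np.prod(s) for s in layer_shapes.values())
        print(f"target {percent:5.1f}% remaining -> sampled {kept:.1%}")
        # Each masked network would then be trained from scratch, recording its iteration
        # of minimum validation loss and its test accuracy at that iteration.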
¹ “Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training a network with a small capacity.” (Li et al., 2016) “During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers...gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them.” (Han et al., 2015)

² As a proxy for the speed at which a network learns, we use the iteration at which an early-stopping criterion would end training. The particular early-stopping criterion we employ throughout this paper is the iteration of minimum validation loss during training. See Appendix C for more details on this choice.
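A minimal sketch of this early-stopping proxy, assuming validation loss is evaluated at regular intervals (the loss values below are illustrative):

    def iteration_of_min_val_loss(val_losses):
        """Index of the evaluation at which validation loss is lowest: the early-stopping iteration."""
        return min(range(len(val_losses)), key=lambda i: val_losses[i])

    val_losses = [2.30, 1.10, 0.60, 0.45, 0.44, 0.47, 0.55]  # one value per validation interval
    print(iteration_of_min_val_loss(val_losses))             # -> 4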
[Figure 1 plots omitted; axes: Percent of Weights Remaining vs. Early-Stop Iteration (Val.) and Accuracy at Early-Stop (Test); legend: Lenet random, Conv-2 random, Conv-4 random, Conv-6 random.]

Figure 1: The iteration at which early-stopping would occur (left) and the test accuracy at that iteration (right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled sparse networks (average of ten trials). Solid lines are winning tickets (average of five trials).
In this paper, we show that there consistently exist smaller subnetworks that train from the start and learn at least as fast as their larger counterparts while reaching similar test accuracy. Solid lines in Figure 1 show networks that we find. Based on these results, we state the lottery ticket hypothesis.

The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.
More formally, consider a dense feed-forward neural network f(x; θ)
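The excerpt breaks off mid-sentence here. As a hedged sketch of how the hypothesis above can be made precise (the mask notation m and the symbols j, a, j', a' are introduced for illustration rather than quoted from the text), in LaTeX:

    % Train the dense network f(x; \theta), \theta = \theta_0 \sim \mathcal{D}_\theta, and let it
    % reach minimum validation loss at iteration j with test accuracy a. For a binary mask
    % m \in \{0,1\}^{|\theta|}, train the subnetwork f(x; m \odot \theta_0) in isolation, reaching
    % minimum validation loss at iteration j' with test accuracy a'. The hypothesis asserts:
    \exists\, m \in \{0,1\}^{|\theta|} \;\text{such that}\;
        j' \le j, \qquad a' \ge a, \qquad \|m\|_0 \ll |\theta| .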