Published as a conference paper at ICLR 2019
Jonathan Frankle Michael Carbin
jfrankle@csail.mit.edu mcarbin@csail.mit.edu
arXiv:1803.03635v5 [cs.LG] 4 Mar 2019
Neural network pruning techniques can reduce the parameter counts of trained networks
by over 90%, decreasing storage requirements and improving computational
performance of inference without compromising accuracy. However, contemporary
experience is that the sparse architectures produced by pruning are difficult to train
from the start, which would similarly improve training performance.
We find that a standard pruning technique naturally uncovers subnetworks whose
initializations made them capable of training effectively. Based on these results, we
articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward
networks contain subnetworks (winning tickets) that—when trained in isolation—
reach test accuracy comparable to the original network in a similar number of
iterations. The winning tickets we find have won the initialization lottery: their
connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments
that support the lottery ticket hypothesis and the importance of these fortuitous
initializations. We consistently find winning tickets that are less than 10-20% of
the size of several fully-connected and convolutional feed-forward architectures
for MNIST and CIFAR10. Above this size, the winning tickets that we find learn
faster than the original network and reach higher test accuracy.
Techniques for eliminating unnecessary weights from neural networks (pruning) (LeCun et al., 1990;
Hassibi & Stork, 1993; Han et al., 2015; Li et al., 2016) can reduce parameter-counts by more than
90% without harming accuracy. Doing so decreases the size (Han et al., 2015; Hinton et al., 2015)
or energy consumption (Yang et al., 2017; Molchanov et al., 2016; Luo et al., 2017) of the trained
networks, making inference more efficient. However, if a network can be reduced in size, why do we
not train this smaller architecture instead in the interest of making training more efficient as well?
Contemporary experience is that the architectures uncovered by pruning are harder to train from the
start, reaching lower accuracy than the original networks. 1
Consider an example. In Figure 1, we randomly sample and train subnetworks from a fully-connected
network for MNIST and convolutional networks for CIFAR10. Random sampling models the effect
of the unstructured pruning used by LeCun et al. (1990) and Han et al. (2015). Across various levels
of sparsity, dashed lines trace the iteration of minimum validation loss 2 and the test accuracy at that
iteration. The sparser the network, the slower the learning and the lower the eventual test accuracy.
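The random-sampling baseline described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `sample_random_mask` and `apply_mask`, the flat-list representation of weights, and the choice of 1000 weights at 20% remaining are all assumptions made for the example.

```python
import random


def sample_random_mask(num_weights, fraction_remaining, seed=None):
    """Return a 0/1 mask that keeps a uniformly random subset of weights.

    Hypothetical helper: it keeps round(fraction_remaining * num_weights)
    connections chosen uniformly at random, which models unstructured pruning
    that ignores weight values.
    """
    rng = random.Random(seed)
    num_kept = round(fraction_remaining * num_weights)
    kept = set(rng.sample(range(num_weights), num_kept))
    return [1 if i in kept else 0 for i in range(num_weights)]


def apply_mask(weights, mask):
    """Zero out the weights whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


# Example: a layer with 1000 weights, of which 20% remain (assumed values).
mask = sample_random_mask(num_weights=1000, fraction_remaining=0.2, seed=0)
```

In a real experiment the mask would be held fixed during training, so that removed connections stay at zero while the surviving ones are trained from their initial values.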
1 “Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate
the difficulty of training a network with a small capacity.” (Li et al., 2016) “During retraining, it is better to retain
the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the
pruned layers...gradient descent is able to find a good solution when the network is initially trained, but not after
re-initializing some layers and retraining them.” (Han et al., 2015)
2 As a proxy for the speed at which a network learns, we use the iteration at which an early-stopping criterion
would end training. The particular early-stopping criterion we employ throughout this paper is the iteration of
minimum validation loss during training. See Appendix C for more details on this choice.
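The early-stopping proxy in footnote 2 can be illustrated with a short sketch. The function name `early_stop_iteration`, the `eval_every` parameter, and the sample loss and accuracy values below are assumptions for illustration only; the footnote does not state how often validation loss is measured.

```python
def early_stop_iteration(val_losses, test_accuracies, eval_every=1):
    """Return (iteration, test accuracy) at the point of minimum validation loss.

    Assumes val_losses[k] and test_accuracies[k] were both measured at
    training iteration (k + 1) * eval_every. Ties go to the earliest index.
    """
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (best_index + 1) * eval_every, test_accuracies[best_index]


# Made-up measurements taken every 100 iterations.
val_losses = [0.90, 0.50, 0.40, 0.45, 0.60]
test_accuracies = [0.80, 0.91, 0.95, 0.94, 0.93]
iteration, accuracy = early_stop_iteration(val_losses, test_accuracies, eval_every=100)
```

Under these assumptions, a smaller returned iteration indicates a network that learns faster, and the paired accuracy is the value that Figure 1 reports as accuracy at early stop.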
[Figure 1 plot panels: Early-Stop Iteration (Val.) and Accuracy at Early-Stop (Test) versus Percent of Weights Remaining. Legend: Lenet random, Conv-6 random, Conv-4 random, Conv-2 random.]
Figure 1: The iteration at which early-stopping would occur (left) and the test accuracy at that iteration
(right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for
CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled
sparse networks (average of ten trials). Solid lines are winning tickets (average of five trials).
In this paper, we show that there consistently exist smaller subnetworks that train from the start and
learn at least as fast as their larger counterparts while reaching similar test accuracy. Solid lines in
Figure 1 show networks that we find. Based on these results, we state the lottery ticket hypothesis.
The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork
that is initialized such that, when trained in isolation, it can match the test accuracy of the
original network after training for at most the same number of iterations.
More formally, consider a dense feed-forward neural network f(x; |