Published as a conference paper at ICLR 2019
THE LOTTERY TICKET HYPOTHESIS:
FINDING SPARSE, TRAINABLE NEURAL NETWORKS

Jonathan Frankle                        Michael Carbin
MIT CSAIL                               MIT CSAIL
jfrankle@csail.mit.edu                  mcarbin@csail.mit.edu

ABSTRACT
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.

We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
1 INTRODUCTION
Techniques for eliminating unnecessary weights from neural networks (pruning) (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; Li et al., 2016) can reduce parameter-counts by more than 90% without harming accuracy. Doing so decreases the size (Han et al., 2015; Hinton et al., 2015) or energy consumption (Yang et al., 2017; Molchanov et al., 2016; Luo et al., 2017) of the trained networks, making inference more efficient. However, if a network can be reduced in size, why do we not train this smaller architecture instead in the interest of making training more efficient as well? Contemporary experience is that the architectures uncovered by pruning are harder to train from the start, reaching lower accuracy than the original networks.¹
Consider an example. In Figure 1, we randomly sample and train subnetworks from a fully-connected network for MNIST and convolutional networks for CIFAR10. Random sampling models the effect of the unstructured pruning used by LeCun et al. (1990) and Han et al. (2015). Across various levels of sparsity, dashed lines trace the iteration of minimum validation loss² and the test accuracy at that iteration. The sparser the network, the slower the learning and the lower the eventual test accuracy.
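To make the random-sampling baseline concrete, the sketch below draws a random unstructured mask at a given fraction of weights remaining and applies it to a single layer. This is an illustration only, assuming NumPy and a hypothetical 784x300 fully-connected layer; the solid-line winning tickets in Figure 1 are instead found with the magnitude-pruning procedure described later in the paper.

```python
import numpy as np

def random_unstructured_mask(weights, fraction_remaining, rng):
    """Sample a binary mask keeping a random subset of individual weights.

    Mimics unstructured pruning in that single connections (not whole neurons
    or filters) are removed, but the surviving set is chosen at random.
    """
    n = weights.size
    n_keep = int(round(fraction_remaining * n))
    mask = np.zeros(n, dtype=weights.dtype)
    mask[rng.choice(n, size=n_keep, replace=False)] = 1.0
    return mask.reshape(weights.shape)

# Example: a hypothetical 784x300 fully-connected layer, keeping 7% of its
# weights (one of the sparsity levels shown in Figure 1).
rng = np.random.default_rng(0)
w = rng.standard_normal((784, 300)).astype(np.float32)
mask = random_unstructured_mask(w, fraction_remaining=0.07, rng=rng)
sparse_w = w * mask  # weights of the randomly sampled subnetwork
print(f"weights remaining: {mask.mean():.1%}")
```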
¹ “Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training a network with a small capacity.” (Li et al., 2016) “During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers...gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them.” (Han et al., 2015)

² As a proxy for the speed at which a network learns, we use the iteration at which an early-stopping criterion would end training. The particular early-stopping criterion we employ throughout this paper is the iteration of minimum validation loss during training. See Appendix C for more details on this choice.
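The criterion in footnote 2 only requires recording where the minimum of the validation loss occurs during training. Below is a minimal sketch of that bookkeeping; the `train_step` and `validation_loss` callables and the evaluation cadence are assumptions for illustration, not the paper's training setup.

```python
def iteration_of_min_validation_loss(train_step, validation_loss,
                                     total_iterations, eval_every=100):
    """Return (best_iteration, best_loss): the early-stopping proxy.

    `train_step(it)` performs one training iteration; `validation_loss()`
    evaluates the current model on the validation set. Training is not
    actually stopped; we only record where the minimum occurs.
    """
    best_it, best_loss = 0, float("inf")
    for it in range(total_iterations):
        train_step(it)
        if it % eval_every == 0:
            loss = validation_loss()
            if loss < best_loss:
                best_it, best_loss = it, loss
    return best_it, best_loss
```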
[Figure 1 plots omitted: panels of Early-Stop Iteration (Val.) and Accuracy at Early-Stop (Test) versus Percent of Weights Remaining; legend: Lenet random, Conv-2 random, Conv-4 random, Conv-6 random.]
Figure 1: The iteration at which early-stopping would occur (left) and the test accuracy at that iteration (right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled sparse networks (average of ten trials). Solid lines are winning tickets (average of five trials).
In this paper, we show that there consistently exist smaller subnetworks that train from the start and learn at least as fast as their larger counterparts while reaching similar test accuracy. Solid lines in Figure 1 show networks that we find. Based on these results, we state the lottery ticket hypothesis.
The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
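To make the hypothesis concrete, the following is a minimal one-shot sketch of how such a subnetwork can be identified, in the spirit of the algorithm presented later in the paper: train the dense network, prune the smallest-magnitude weights, and reset the surviving connections to their original initial values. The dictionary-of-NumPy-arrays representation, the `train` callable, and the pruning fraction are illustrative assumptions, not the paper's implementation.

```python
import copy
import numpy as np

def find_winning_ticket(init_weights, train, prune_fraction=0.8):
    """One-shot sketch: train, prune by magnitude, rewind to initialization.

    `init_weights` maps layer name -> np.ndarray at initialization;
    `train(weights, masks)` trains the masked network and returns the final
    weights. Returns (rewound_weights, masks) defining the candidate ticket.
    """
    masks = {k: np.ones_like(w) for k, w in init_weights.items()}
    trained = train(copy.deepcopy(init_weights), masks)

    for name, w in trained.items():
        # Prune the prune_fraction of weights with the smallest magnitude.
        magnitudes = np.abs(w[masks[name] == 1.0])
        threshold = np.quantile(magnitudes, prune_fraction)
        masks[name] = np.where(np.abs(w) > threshold, masks[name], 0.0)

    # Reset the surviving connections to their original initial values.
    rewound = {k: w0 * masks[k] for k, w0 in init_weights.items()}
    return rewound, masks
```

The iterative variant described later in the paper repeats this train-prune-reset loop, removing a smaller fraction of the surviving weights on each round.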
More formally, consider a dense feed-forward neural network f(x;