Published as a conference paper at ICLR 2019
THE LOTTERY TICKET HYPOTHESIS:
FINDING SPARSE, TRAINABLE NEURAL NETWORKS

Jonathan Frankle                        Michael Carbin
MIT CSAIL                               MIT CSAIL
jfrankle@csail.mit.edu                  mcarbin@csail.mit.edu

ABSTRACT
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.

We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
1 INTRODUCTION
Techniques for eliminating unnecessary weights from neural networks (pruning) (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; Li et al., 2016) can reduce parameter-counts by more than 90% without harming accuracy. Doing so decreases the size (Han et al., 2015; Hinton et al., 2015) or energy consumption (Yang et al., 2017; Molchanov et al., 2016; Luo et al., 2017) of the trained networks, making inference more efficient. However, if a network can be reduced in size, why do we not train this smaller architecture instead in the interest of making training more efficient as well? Contemporary experience is that the architectures uncovered by pruning are harder to train from the start, reaching lower accuracy than the original networks.¹
Consider an example. In Figure 1, we randomly sample and train subnetworks from a fully-connected network for MNIST and convolutional networks for CIFAR10. Random sampling models the effect of the unstructured pruning used by LeCun et al. (1990) and Han et al. (2015). Across various levels of sparsity, dashed lines trace the iteration of minimum validation loss² and the test accuracy at that iteration. The sparser the network, the slower the learning and the lower the eventual test accuracy.
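To make the random-sampling baseline concrete, the sketch below draws a random unstructured mask at a given fraction of weights remaining and applies it to a single layer. This is an illustration only, assuming NumPy and a hypothetical 784x300 fully-connected layer; the solid-line winning tickets in Figure 1 are instead found with the magnitude-pruning procedure described later in the paper.

```python
import numpy as np

def random_unstructured_mask(weights, fraction_remaining, rng):
    """Sample a binary mask keeping a random subset of individual weights.

    Mimics unstructured pruning in that single connections (not whole neurons
    or filters) are removed, but the surviving set is chosen at random.
    """
    n = weights.size
    n_keep = int(round(fraction_remaining * n))
    mask = np.zeros(n, dtype=weights.dtype)
    mask[rng.choice(n, size=n_keep, replace=False)] = 1.0
    return mask.reshape(weights.shape)

# Example: a hypothetical 784x300 fully-connected layer, keeping 7% of its
# weights (one of the sparsity levels shown in Figure 1).
rng = np.random.default_rng(0)
w = rng.standard_normal((784, 300)).astype(np.float32)
mask = random_unstructured_mask(w, fraction_remaining=0.07, rng=rng)
sparse_w = w * mask  # weights of the randomly sampled subnetwork
print(f"weights remaining: {mask.mean():.1%}")
```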
¹ “Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training a network with a small capacity.” (Li et al., 2016) “During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers...gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them.” (Han et al., 2015)

² As a proxy for the speed at which a network learns, we use the iteration at which an early-stopping criterion would end training. The particular early-stopping criterion we employ throughout this paper is the iteration of minimum validation loss during training. See Appendix C for more details on this choice.
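The criterion in footnote 2 only requires recording where the minimum of the validation loss occurs during training. Below is a minimal sketch of that bookkeeping; the `train_step` and `validation_loss` callables and the evaluation cadence are assumptions for illustration, not the paper's training setup.

```python
def iteration_of_min_validation_loss(train_step, validation_loss,
                                     total_iterations, eval_every=100):
    """Return (best_iteration, best_loss): the early-stopping proxy.

    `train_step(it)` performs one training iteration; `validation_loss()`
    evaluates the current model on the validation set. Training is not
    actually stopped; we only record where the minimum occurs.
    """
    best_it, best_loss = 0, float("inf")
    for it in range(total_iterations):
        train_step(it)
        if it % eval_every == 0:
            loss = validation_loss()
            if loss < best_loss:
                best_it, best_loss = it, loss
    return best_it, best_loss
```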
[Figure 1 plots omitted: panels of Early-Stop Iteration (Val.) and Accuracy at Early-Stop (Test) versus Percent of Weights Remaining; legend: Lenet random, Conv-2 random, Conv-4 random, Conv-6 random.]
Figure 1: The iteration at which early-stopping would occur (left) and the test accuracy at that iteration (right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled sparse networks (average of ten trials). Solid lines are winning tickets (average of five trials).
In this paper, we show that there consistently exist smaller subnetworks that train from the start and learn at least as fast as their larger counterparts while reaching similar test accuracy. Solid lines in Figure 1 show networks that we find. Based on these results, we state the lottery ticket hypothesis.
The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
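To make the hypothesis concrete, the following is a minimal one-shot sketch of how such a subnetwork can be identified, in the spirit of the algorithm presented later in the paper: train the dense network, prune the smallest-magnitude weights, and reset the surviving connections to their original initial values. The dictionary-of-NumPy-arrays representation, the `train` callable, and the pruning fraction are illustrative assumptions, not the paper's implementation.

```python
import copy
import numpy as np

def find_winning_ticket(init_weights, train, prune_fraction=0.8):
    """One-shot sketch: train, prune by magnitude, rewind to initialization.

    `init_weights` maps layer name -> np.ndarray at initialization;
    `train(weights, masks)` trains the masked network and returns the final
    weights. Returns (rewound_weights, masks) defining the candidate ticket.
    """
    masks = {k: np.ones_like(w) for k, w in init_weights.items()}
    trained = train(copy.deepcopy(init_weights), masks)

    for name, w in trained.items():
        # Prune the prune_fraction of weights with the smallest magnitude.
        magnitudes = np.abs(w[masks[name] == 1.0])
        threshold = np.quantile(magnitudes, prune_fraction)
        masks[name] = np.where(np.abs(w) > threshold, masks[name], 0.0)

    # Reset the surviving connections to their original initial values.
    rewound = {k: w0 * masks[k] for k, w0 in init_weights.items()}
    return rewound, masks
```

The iterative variant described later in the paper repeats this train-prune-reset loop, removing a smaller fraction of the surviving weights on each round.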
More formally, consider a dense feed-forward neural network f(x;