diff --git a/Corpus/CORPUS.txt b/Corpus/CORPUS.txt index 851f739..70d7dd3 100644 --- a/Corpus/CORPUS.txt +++ b/Corpus/CORPUS.txt @@ -21521,4 +21521,7539 @@ In this section, we show the plots of feature importance for all the tasks. <> +<> <> <> + + +<> <> <> +Scalable Gradients for Stochastic Differential Equations + +Xuechen Li. Ting-Kam Leonard Wong + +Google Research University of Toronto + +Abstract + +The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic Differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset. + +1 Introduction + +Deterministic dynamical systems can often be modeled by ordinary Differential equations (ODEs). The adjoint sensitivity method can efficiently compute gradients of ODE solutions with constant memory cost. This method was well-known in the physics, numerical analysis, and control communities for decades [3, 4, 60, 65]. Recently, it was combined with modern reverse-mode automatic differentiation packages, enabling ODEs with millions of parameters to be fit to data [12] and allow. +ing more flexible density estimation and time-series models [23, 32, 72]. +Stochastic Differential equations (SDEs) generalize ODEs, adding instantaneous noise to their dynamics [55, 77, 78]. They are a natural model for phenomena governed by many small and unobserved interactions, such as motion of molecules in a liquid [8], + +allele frequencies in a gene pool [15], or prices in a market [79]. Previous attempts on fitting SDEs mostly relied on methods with poor scaling properties. The pathwise approach [22, 89], a form of forward-mode automatic differentiation, scales poorly in time with the number of parameters and states in the model. On the other hand, simply differentiating through the operations of an SDE solver [19] scales poorly in memory. +In this work, we generalize the adjoint method to stochastic dynamics defined by SDEs. We give a sim.ple and practical algorithm for fitting SDEs with tens of thousands of parameters, while allowing the use of high-order adaptive time-stepping SDE solvers. We call this approach the stochastic adjoint sensitivity method. + +<
> + +Table 1: Asymptotic complexity comparison. L is the number of steps used in a fixed-step solve, and D is the number of state and parameters. Both memory and time are expressed in units of the cost of evaluating the drift and diffusion functions once each. +There are two main difficulties in generalizing the ad.joint formulation for ODEs to SDEs. The first is mathematical: SDEs are defined using nonstandard integrals that usually rely on Ito calculus. The adjoint method requires solving the dynamics backwards in time from the end state. However, it is not clear exactly what running the SDE backwards means in the context of stochastic calculus, and when it correctly reconstructs the forward trajectory. We address this problem in Section 3, deriving a backward Stratonovich SDE whose dynamics compute the necessary gradient. +The second difficulty is computational: To retrace the steps, one needs to reconstruct the noise sampled on the forward pass, ideally without storing it. In Section 4, we give an algorithm that allows querying a Brownian motion sample at any time point arbitrarily-precisely, while only storing a single random seed. + +We combine our adjoint approach with a gradient-based stochastic variational inference scheme for efficiently marginalizing over latent SDE models with arbitrary differentiable likelihoods. This model fam.ily generalizes several existing families such as latent ODEs [12, 72], Gaussian state-space models [36, 81], and deep Kalman filters [40], and can naturally handle irregularly-sampled times series and missing observations. We train latent SDEs on toy and real datasets, demonstrating competitive performance compared to existing approaches for dynamics modeling. + +2 Background: Stochastic Flows + +2.1 Adjoint Sensitivity Method +The adjoint sensitivity method is an efficient approach to solve control problems relying on the adjoint (co-state) system [65]. Chen et al. [12] used this method to compute the gradient with respect to parameters of a neural ODE, which is a particular model among many others inspired by the theory of dynamical systems [10, 11, 26, 44, 46, 74, 86]. The method, shown in Algorithm 1, is scalable, since the most costly computation is a vector-Jacobian product defining its backwards dynamics. In addition, since the gradient is obtained by solving another ODE, no intermediate computation is stored as in the case of regular backpropagation [73]. + +2.2 Stochastic Differential Equations +We briefly define SDEs: Consider a filtered probability space <> on which an m-dimensional adapted Wiener process (aka Brownian motion) <> is defined. For a fixed terminal time t +<>, we denote by <> the time horizon. We denote the ith component of Wt by <>. A stochastic process <> can be defined by an Ito SDE + +<>, (1) + +where z0 . Rd is the starting state, and <> and <> are the drift and diffusion functions, respectively. For ease of presentation, we let m =1 in the following unless otherwise stated. Our contributions can be easily generalized to cases where +m> 1. Here, the second integral on the right hand side of (1) is the Ito stochastic integral [55]. When the coefficients are globally Lipschitz in both the state and time, there exists a unique strong solution to the SDE [55]. +2.3 Neural Stochastic Differential Equations +Similar to neural ODEs, one can consider drift and diffusion functions defined by neural networks, a model known as the neural SDE [32, 45, 82, 83]. 
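To make the object concrete, the sketch below defines a neural SDE with diagonal noise and simulates it with a fixed-step Euler-Maruyama loop in PyTorch. The class and function names, the network sizes, and the sigmoid-bounded diffusion are illustrative assumptions rather than the paper's implementation; the released torchsde package provides proper solvers.

import torch
import torch.nn as nn

class NeuralSDE(nn.Module):
    # Drift f(t, z) and diagonal diffusion g(t, z) parameterized by small MLPs.
    # Hidden sizes and the sigmoid-bounded diffusion are illustrative choices.
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.f_net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Softplus(), nn.Linear(hidden, dim))
        self.g_net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Softplus(), nn.Linear(hidden, dim), nn.Sigmoid())

    def f(self, t, z):  # drift: returns a tensor with the same shape as z
        tz = torch.cat([torch.full((z.shape[0], 1), float(t)), z], dim=-1)
        return self.f_net(tz)

    def g(self, t, z):  # diagonal diffusion: one entry per state dimension
        tz = torch.cat([torch.full((z.shape[0], 1), float(t)), z], dim=-1)
        return self.g_net(tz)

def euler_maruyama(sde, z0, t0=0.0, t1=1.0, steps=100):
    # Fixed-step Euler-Maruyama simulation of the Ito SDE dZ = f dt + g dW.
    z, dt = z0, (t1 - t0) / steps
    for n in range(steps):
        t = t0 + n * dt
        dw = torch.randn_like(z) * dt ** 0.5  # Brownian increments ~ N(0, dt)
        z = z + sde.f(t, z) * dt + sde.g(t, z) * dw
    return z

sde = NeuralSDE(dim=3)
z_T = euler_maruyama(sde, torch.randn(16, 3))  # terminal states for a batch of 16 paths

Differentiating through this simulation loop to fit f_net and g_net is exactly the "backpropagate through the solver" strategy whose memory cost the stochastic adjoint of Section 3 avoids.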
+Amongst work on neural SDEs, none has enabled an efficient training framework. In particular, Tzen and Raginsky [82] and Liu et al. [45] considered computing the gradient by simulating the forward dynamics of an explicit Jacobian matrix. This Jacobian has size of either the square of the number of parameters, or the number of parameters times the number of states, building on the pathwise approach [22, 89]. In contrast, our approach only requires a small number of cheap vector-Jacobian products, independent of the dimension of the parameter and state vectors. These vector-Jacobian products have the same asymptotic time cost as evaluating the drift and diffusion functions, and can be easily computed by modern automatic differentiation libraries [1, 16, 49, 59]. +2.4 Backward Stratonovich Integral +Our stochastic adjoint sensitivity method involves stochastic processes running both forward and back.ward in time. The Stratonovich stochastic integral, due to its symmetry, gives nice expressions for the backward dynamics and is more convenient for our purpose. Our results can be straightforwardly applied to ItSDEs as well, using a simple conversion (see e.g. [64, Sec. 2]). +Following the treatment of Kunita [41], we introduce the forward and backward Stratonovich integrals. Let <> be a two-sided filtration, where <> is the \sigma-algebra generated by <> for <> such that <>. For a continuous semi-martingale <> adapted to the forward filtration <>, the Stratonovich stochastic integral is + +<> + +where <> is a partition of the interval <> denotes the size of largest segment of the partition, and the limit is to be interpreted in the L2 sense. The Ito integral uses instead the left endpoint <> rather than the average. In general, the Ito and Stratonovich integrals differ by a term of finite variation. + +To define the backward Stratonovich integral, we consider c the backward Wiener process <> defined as <> for all t that is adapted to the backward filtration <>. For a continuous semimartingale <> adapted to the backward filtration, + +Algorithm 1 ODE Adjoint Sensitivity + +<> + +Algorithm 2 SDE Adjoint Sensitivity (Ours) + +<> + +Figure 1: Pseudocode of the (ODE) adjoint sensitivity method (left), and our generalization to Stratonovich SDEs (right). differences are highlighted in blue. Square brackets denote vector concatenation. +the backward Stratonovich integral is Moreover, each .s,t is a smooth diffeomorphism N flow of diffeomorphisms generated by the SDE (2). + +<
> + +from Rd to itself. We thus call S the stochastic (b) The backward flow <> satisfies the backward SDE: + +<> + +where <> is the partition. + +2.5 Stochastic Flow of diffeomorphisms +<> + +It is well known that an ODE defines a flow of diffeomorphisms [6]. Here we consider the stochastic analog <>, (3) for the Stratonovich SDE s + +<> + +for all <> and <> such that <>. + +<> (2) + +The coefficients in (2) and (3) differ by only a negative sign. This symmetry is due to our use of the Stratonovich integral (see Figure 2). + +<> + +Throughout the paper, we assume that both b and <> have infinitely many bounded derivatives w.r.t. the state, and bounded first derivatives w.r.t. time, i.e. <>, so that the SDE has a unique strong solution. Let <> be the solution at time t +when the process is started at z at time s. Given a realization of the Wiener process, this defines a collection of continuous maps <> from Rd to itself. +The following theorem shows that these maps are diffeomorphisms (after choosing a suitable modification) and that they satisfy backward SDEs. +Theorem 2.1 ([41, Theorem 3.7.1]). (a) With probability 1, the collection <> satisfies the flow property + +<>. + +3 Sensitivity via Stochastic Adjoint + +We present our main contribution: a stochastic analog of the adjoint sensitivity method for SDEs. We use (3) to derive another backward Stratonovich SDE, which we call the stochastic adjoint process. The direct implication is a gradient computation algorithm that works by solving a set of dynamics in reverse time, and relies on cheap vector-Jacobian products without storing any intermediate quantities. +The proof included in Appendix 9.1 relies on Its lemma in the Stratonovich form [41, Theorem 2.4.1]. We stress that this lemma considers only the case where the endpoint z is fixed and deterministic. +Now, we extend to the case where the endpoint is not deterministic, but rather computed from the forward flow. To achieve this, we compose the state process and the loss function. Consider As, <>. The chain rule gives As, <>. Let + +<> + +3.1 Stochastic Adjoint Process +The goal is to derive a stochastic adjoint process <> that can be simulated by evaluating only vector-Jacobian products, where <> is a + +<> (6) + +Note that As, <>. + +Since <> is scalar loss of the terminal state from the forward flow a constant, <> satisfies the augmented <> backward SDE system + +backward SDE for the process + +<> + +We first derive <>, assuming that <> follows the inverse flow from a deterministic end state ZT + +<> + +that does not depend on the realized Wiener process (Lemma 3.1). We then extend to the case where <> is obtained by the forward flow starting from a deterministic initial state z0 (Theorem 3.2). This latter part is unconventional, and the resulting value cannot be interpreted as the solution to a backward SDE anymore due to loss of adaptedness. Instead, we will formulate the result with the Ito map [69]. Finally, it is straightforward to extend the state Zt to include parameters of the drift and diffusion functions such that the desired gradient can be obtained for stochastic optimization; we comment on this step in Section 3.3. + +<> + +Since the drift and diffusion functions of this augmented system are <>, the system has a unique strong solution. Let s=0 and t = T . Since (7) admits a strong solution, we may write + +<>, (8) + +We first present the SDE for the Jacobian matrix of where <> denotes the path of the Wiener the backward flow. +process and + +Lemma 3.1 (Dynamics of <>). 
Consider the stochastic flow generated by the backward SDE (3) as in <>

Theorem 2.1(b). Letting Js,t(z) := r.s,t(z), we have
is a deterministic measurable function (the Itô map) [69, Chapter V, Definition 10.9]. Intuitively, F can be thought of as a black box that computes the solution

<>

to the backward SDE system (7) given the position at time T and the realized Wiener process samples. Similarly, we let G be the solution map for the forward flow (2). The next theorem follows immediately from (6) and the definition of <>, we have
for all <> and <>. Furthermore, letting

<>, (4)

we have

Theorem 3.2. For <>-almost all <>,

<>

where <>

<>, (5)

for all <> and <> and (8).

Proof. This is a consequence of composing <>

This shows that one can obtain the gradient by "composing" the backward SDE system (7) with the original forward SDE (2), and ends our continuous-time analysis.

3.2 Numerical Approximation
In practice, we compute solutions to SDEs with numerical solvers Fh and Gh, where <> denotes the mesh size of a fixed grid. The approximate algorithm thus outputs <>. The following theorem provides sufficient conditions for convergence.
Theorem 3.3. Suppose the schemes Fh and Gh satisfy the following conditions: (i) <> in probability as <>, and (ii) for any <>, we have <> in probability as <>. Then, for any starting point z of the forward flow, we have

<>

in probability as <>.

See Appendix 9.2 for the proof. Usual schemes such as the Euler-Maruyama scheme (more generally, Itô-Taylor schemes) converge pathwise (i.e. almost surely) from any fixed starting point [38] and hence satisfy (i). While (ii) is strong, we note that the SDEs considered here have smooth coefficients, and thus their solutions enjoy nice regularity properties in the starting position. Therefore, it is reasonable to expect the corresponding numerical schemes to also behave nicely as a function of both the mesh size and the starting position. To the best of our knowledge, this property is not considered at all in the literature on numerical methods for SDEs (where the initial position is fixed), but is crucial in the proof of Theorem 3.3. In Appendix 9.3, we prove that condition (ii) holds for the Euler-Maruyama scheme. Detailed analysis for other schemes is beyond the scope of this paper.

3.3 The Algorithm
So far we have derived the gradient of the loss with respect to the initial state. We can extend these results to give gradients with respect to parameters of the drift and diffusion functions by treating them as an additional part of the state whose dynamics has zero drift and diffusion. We summarize this in Algorithm 2, assuming access only to a black-box solver sdeint. All terms in the augmented dynamics, such as <>, can be cheaply evaluated by calling <> and <>, respectively.

Difficulties with non-diagonal diffusion. In principle, we can simulate the forward and backward adjoint dynamics with any high-order solver of choice. However, for general matrix-valued diffusion functions σ, to obtain a numerical solution with strong order beyond 1/2, we need to simulate multiple integrals of the Wiener process such as <>. These random variables are difficult to simulate and costly to approximate [87]. (A numerical scheme is of strong order p if <> for all <>, where X_t and X_{NΔ} are respectively the coupled true solution and numerical solution, N and Δ are respectively the iteration index and step size such that NΔ = T, and C is independent of Δ.)
Fortunately, if we restrict our SDE to have diagonal noise, then even though the backward SDE for the stochastic adjoint will not in general have diagonal noise, it will satisfy a commutativity property [70]. In that case, we can safely adopt certain numerical schemes of strong order 1.0 (e.g. Milstein [52] and stochastic Runge-Kutta [71]) without approximating multiple integrals or the Lévy area during simulation. We formally show this in Appendix 9.4.
One may also consider numerical schemes with high weak order [39]. However, analysis of this scenario is beyond the current scope.
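To make the role of these vector-Jacobian products concrete, the fragment below shows how the derivative terms appearing in the backward dynamics can be formed with torch.autograd.grad, assuming f and g are differentiable PyTorch callables and params is a tuple of their parameters. It is a schematic sketch, not the torchsde implementation, which wraps this logic inside a torch.autograd.Function.

import torch

def adjoint_vjps(f, g, t, z, adj_z, params):
    # Given the adjoint vector a = dL/dz(t) (same shape as z), form the
    # vector-Jacobian products a^T df/dz, a^T df/dtheta, a^T dg/dz, a^T dg/dtheta
    # that the backward (adjoint) dynamics require.  Each call costs roughly one
    # extra evaluation of the corresponding function; no full Jacobian is built.
    with torch.enable_grad():
        z = z.detach().requires_grad_(True)
        f_val = f(t, z)
        g_val = g(t, z)
        a_df_dz, *a_df_dp = torch.autograd.grad(
            f_val, (z,) + tuple(params), grad_outputs=adj_z,
            retain_graph=True, allow_unused=True)
        a_dg_dz, *a_dg_dp = torch.autograd.grad(
            g_val, (z,) + tuple(params), grad_outputs=adj_z, allow_unused=True)
    return a_df_dz, a_df_dp, a_dg_dz, a_dg_dp

In Algorithm 2, quantities of this form supply the drift and diffusion of the augmented backward SDE for the state, the state adjoint, and the parameter adjoint.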
3.4 Software and Implementation
We have implemented several common SDE solvers in PyTorch [59] with adaptive time-stepping using a PI controller [9, 30]. Following torchdiffeq [12], we have created a user-friendly subclass of torch.autograd.Function that facilitates gradient computation using our stochastic adjoint framework for SDEs that are subclasses of torch.nn.Module. We include a short code snippet covering the main idea of the stochastic adjoint in Appendix 9.12. The complete codebase can be found at https://github.com/google-research/torchsde.

4 Virtual Brownian Tree

Our formulation of the adjoint can be numerically integrated efficiently, since simulating its dynamics only requires evaluating cheap vector-Jacobian products, as opposed to whole Jacobians. However, the backward-in-time nature introduces a new difficulty: the same Wiener process sample path used in the forward pass must be queried again during the backward pass. Naively storing Brownian motion increments implies a large memory consumption and complicates the usage of adaptive time-stepping integrators, where the evaluation times in the backward pass may differ from those in the forward pass.
To overcome this issue, we combine Brownian trees with splittable pseudorandom number generators (PRNGs) to give an algorithm that can query values of a Wiener process sample path at arbitrary times. This algorithm, which we call the virtual Brownian tree, has O(1) memory cost, and time cost logarithmic with respect to the inverse error tolerance.

<
> 

Figure 3: Evaluating a Brownian motion sample at time tq using a virtual Brownian tree. Our algorithm repeatedly bisects the interval, sampling from a Brownian bridge at each halving to determine intermediate values. Each call to the random number generator uses a unique key whose value depends on the path taken to reach it.

4.1 Brownian Bridges and Brownian Trees
Lévy's Brownian bridge [67] states that given a start time ts and end time te along with their respective Wiener process values ws and we, the marginal of the process at time <> is a normal distribution:

<>. (9)

We can recursively apply this formula to evaluate the process at the midpoint of any two distinct timestamps where the values are already known. Constructing the whole sample path of a Wiener process in this manner results in what is known as the Brownian tree [17]. Storing this tree would be memory-intensive, but we show how to reconstruct any node in this tree as desired.

4.2 Brownian Trees using Splittable Seeds
We assume access to a splittable PRNG [14], which has an operation split that deterministically generates two keys from an existing key. Given a key, the function BrownianBridge samples deterministically from (9). To obtain the Wiener process value at a specific time, we must first know or sample the values at the initial and terminal times. Then, the virtual Brownian tree recursively samples from the midpoint of Brownian bridges, each sample using a key split from that of its parent node. The algorithm terminates when the most recently sampled time is close enough to the desired time. We outline the full procedure in Algorithm 3.

Algorithm 3 Virtual Brownian Tree

<>

This algorithm has constant memory cost. For a fixed-step-size solver taking L steps, the tolerance that the tree will need to be queried at scales as 1/L. Thus the per-step time complexity scales as log L. Our implementation uses an efficient counter-based PRNG [76] which avoids passing large random states, and instead simply passes integers. Table 1 compares the asymptotic time complexity of this approach against existing alternatives.
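The following is a simplified sketch of the bisection search summarized in Algorithm 3. The split function here is a toy integer-mixing stand-in for a proper splittable, counter-based PRNG [14, 76], and returning an interpolated value once the interval is within tolerance is a simplification; treat the names and details as assumptions rather than the released implementation.

import torch

def split(key):
    # Toy stand-in for a splittable PRNG: derive two child keys deterministically.
    return (key * 2654435761 + 1) % (2 ** 31), (key * 2246822519 + 7) % (2 ** 31)

def brownian_bridge(key, t, ts, ws, te, we):
    # Deterministically sample W(t) given W(ts) = ws and W(te) = we, per Eq. (9).
    gen = torch.Generator().manual_seed(int(key))
    mean = ws + (we - ws) * (t - ts) / (te - ts)
    std = ((te - t) * (t - ts) / (te - ts)) ** 0.5
    return mean + std * torch.randn(ws.shape, generator=gen)

def virtual_brownian_tree(t_query, key, t0, w0, t1, w1, tol=1e-3):
    # Query W(t_query) by repeatedly bisecting [t0, t1]; only the seed is stored.
    ts, ws, te, we = t0, w0, t1, w1
    while te - ts > tol:
        key_left, key_right = split(key)
        tm = 0.5 * (ts + te)
        wm = brownian_bridge(key, tm, ts, ws, te, we)
        if t_query <= tm:
            te, we, key = tm, wm, key_left
        else:
            ts, ws, key = tm, wm, key_right
    # Within tolerance: return an interpolated value on the final sub-interval.
    return ws + (we - ws) * (t_query - ts) / (te - ts)

w_t = virtual_brownian_tree(0.3, key=2020, t0=0.0, w0=torch.zeros(3),
                            t1=1.0, w1=torch.randn(3))

Because every intermediate value is a deterministic function of the seed and the query time, the backward pass can reconstruct exactly the noise used in the forward pass without storing it.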
5 Latent Stochastic Differential Equations

The algorithms presented in Sections 3 and 4 allow us to efficiently compute gradients of scalar objectives with respect to SDE parameters, letting us fit SDEs to data. This raises the question: which loss to optimize?
Simply fitting SDE parameters to maximize likelihood will in general cause overfitting, and will result in the diffusion function going to zero. In this section, we show how to do efficient variational inference in SDE models, and optimize the marginal log-likelihood to fit both prior (hyper-)parameters and the parameters of a tractable approximate posterior over functions.
In particular, we can parameterize both a prior over functions and an approximate posterior using SDEs:

<>, (prior)

<>, (approx. post.)

where <> and <> are Lipschitz in both arguments, and both processes have the same starting value: <>.
If both processes share the same diffusion function <>, then the KL divergence between them is finite (under additional mild regularity conditions; see Appendix 9.6), and can be estimated by sampling paths from the posterior process.

<
> 

Figure 4: Graphical models for the generative process (decoder) and recognition network (encoder) of the latent stochastic differential equation model. This model can be viewed as a variational autoencoder with infinite-dimensional noise. Red circles represent entire function draws from Brownian motion. Given the initial state z0 and a Brownian motion sample path <>, the intermediate states <> are deterministically approximated by a numerical SDE solver.

Then, the evidence lower bound (ELBO) can be written as:

<>, (10)

where <> satisfies <>, and the expectation is taken over the approximate posterior process defined by (approx. post.). The likelihoods of the observations x1,...,xN at times t1,...,tN depend only on the latent states zt at corresponding times.
To compute the gradient with respect to prior parameters <> and variational parameters <>, we need only augment the forward SDE with an extra scalar variable whose drift function is <> and whose diffusion function is zero. The backward dynamics can be derived analogously using (7). We include a detailed derivation in Appendix 9.6. Thus, a stochastic estimate of the gradients of the loss w.r.t. all parameters can be computed in a single pair of forward and backward SDE solves.
The variational parameters <> can either be optimized individually for each sequence, or, if multiple time series share parameters, an encoder network can be trained to input the observations and output <>. This architecture, shown in Figure 4, can be viewed as an infinite-dimensional variational autoencoder [35, 68].

6 Related Work

Sensitivity Analysis for SDEs. Gradient computation is closely related to sensitivity analysis. Computing gradients with respect to parameters of vector fields of an SDE has been extensively studied in the stochastic control literature [42]. In particular, for low-dimensional problems, this is done effectively using dynamic programming [7] and finite differences [20, 43]. However, both approaches scale poorly with the dimensionality of the parameter vector.
Analogous to REINFORCE (or the score-function estimator) [21, 37, 88], Yang and Kushner [89] considered deriving the gradient as ∇E[L(ZT)] = E[L(ZT)H] for some random variable H. However, H usually depends on the density of ZT with respect to the Lebesgue measure, which can be difficult to compute. Gobet and Munos [22] extended this approach by weakening a non-degeneracy condition using Malliavin calculus [53].
Closely related to the current approach is the pathwise method [89], which is also a continuous-time analog of the reparameterization trick [35, 68]. Existing methods in this regime [22, 45, 82] all require simulating a (forward) SDE where each step requires computing entire Jacobian matrices. This computational cost is prohibitive for high-dimensional systems with a large number of parameters.
Based on the Euler discretization, Giles and Glasserman [19] considered simply performing reverse-mode automatic differentiation through all intermediate steps. They named this method the adjoint approach, which, by modern standards, is a form of "backpropagation through the operations of a numerical solver". This approach, widely adopted in the field of finance for calibrating market models [19], has high memory cost, and relies on a fixed Euler-Maruyama discretization. Recently, this approach was also used by Hegde et al. [27] to learn parameterized drift and diffusion functions of an SDE. In scientific computing, Innes et al. [31] considered backpropagating through high-order implicit SDE solvers.
Figure 5: (a) Same fixed step size used in both forward and reverse simulation. Boxplot generated by repeating the experiment with different Brownian motion sample paths 64 times. (b) Colors of dots represent tolerance levels and correspond to the colorbar on the right. Only atol was varied and rtol was set to 0.

In the machine learning literature, Ryder et al. [75] perform variational inference over the state and parameters for Euler-discretized latent SDEs and optimize the model with regular backpropagation. This approach should not be confused with the formulation of variational inference for non-discretized SDEs presented in previous works [25, 57, 82] and our work, as it is unclear whether the limit of their discretization corresponds to that obtained by operating with continuous-time SDEs using Girsanov's theorem.
Backward SDEs. Our stochastic adjoint process relies on the notion of backward SDEs devised by Kunita [41], which is based on two-sided filtrations. This is different from the more traditional notion of backward SDEs where only a single filtration is defined [58, 62].
Based on the latter notion, forward-backward SDEs (FBSDEs) have been proposed to solve stochastic optimal control problems [63]. However, simulating FBSDEs is costly due to the need to estimate conditional expectations in the backward pass [58].
Bayesian Learning of SDEs. Recent works considered the problem of inferring an approximate posterior SDE given observed data under a prior SDE with the same diffusion coefficient [25, 57, 82]. The special case with constant diffusion coefficients was considered more than a decade ago [5]. Notably, computing the KL divergence between two SDEs over a finite time horizon was well-explored in the control literature [33, 80]. We include background on this topic in Appendix 9.5.
Bayesian learning and parameter estimation for SDEs have a long history [24]. Techniques which do not require positing a variational family, such as the extended Kalman filter and Markov chain Monte Carlo, have been considered in the literature [50].

7 Experiments

The aim of this section is threefold. We first empirically verify our theory by comparing the gradients obtained by our stochastic adjoint framework against analytically derived gradients for problems having closed-form solutions. We then fit latent SDE models with our framework on two synthetic datasets, verifying that the variational inference framework allows learning a generative model of time series. Finally, we learn dynamics parameterized by neural networks with a latent SDE from a motion capture dataset, demonstrating competitive performance compared to existing approaches.
We report results based on an implementation of Brownian motion that stores all intermediate queries. The virtual Brownian tree allowed training with much larger batch sizes on GPUs, but was not necessary for our small-scale experiments. Notably, our adjoint approach, even when combined with the Brownian motion implementation that stores noise, was able to reduce the memory usage by 1/2-1/3 compared to directly backpropagating through solver operations on the tasks we considered.

7.1 Numerical Studies

We consider three test problems (examples 1-3 from [66]; details in Appendix 9.7), all of which have closed-form solutions. We compare the gradient computed from simulating our stochastic adjoint process using the Milstein scheme against the exact gradient.
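For reference, the Ito-form Milstein update for a diagonal-noise SDE adds the correction 0.5 * g * (dg/dz) * (dW^2 - dt) to an Euler-Maruyama step. The minimal sketch below assumes each diffusion entry g_i depends only on z_i, so a single autograd call recovers the needed diagonal derivatives; it is meant only to fix notation, not to reproduce the solvers used in the experiments.

import torch

def milstein_step(f, g, t, z, dt, dw):
    # One Milstein step for the Ito SDE dZ = f dt + g dW with diagonal noise.
    # Assumes g_i depends only on z_i, so grad(sum(g), z) equals the diagonal
    # of dg/dz; the Stratonovich variant drops the "- dt" in the correction.
    z = z.detach().requires_grad_(True)
    f_val, g_val = f(t, z), g(t, z)
    (dg_diag,) = torch.autograd.grad(g_val.sum(), z)
    step = z + f_val * dt + g_val * dw + 0.5 * g_val * dg_diag * (dw ** 2 - dt)
    return step.detach()  # forward simulation only; gradients come from the adjoint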
Figure 5(a) shows that for test example 2, the error between the adjoint gradient and analytical gradient decreases with step size.
For all three test problems, the mean squared error across dimensions tends to be smaller as the absolute tolerance of the adaptive solver is reduced (e.g. see Fig. 5(b)). However, the number of function evaluations (NFEs) tends to be much larger than that in the ODE case [12].

Additionally, for two out of three test problems, we found that our adjoint approach with the Milstein scheme and fixed step size can be much more time-efficient than regular backpropagation through operations of the Milstein and Euler schemes (see e.g. Fig. 5(c)). Backpropagating through the Euler scheme gives gradients of higher error compared to the Milstein method. On the other hand, directly backpropagating through the Milstein solve requires evaluating high-order derivatives and can be costly.
Results for examples 1 and 3 are in Appendix 9.8.

Figure 6: Learned posterior and prior dynamics on data from a stochastic Lorenz attractor. All samples from our model are continuous-time paths, and form a multi-modal, non-Gaussian distribution.

7.2 Synthetic Datasets
We trained latent SDEs with our adjoint framework to recover (1) a 1D geometric Brownian motion, and (2) a 3D stochastic Lorenz attractor process. The main objective is to verify that the learned posterior can reconstruct the training data, and that the learned priors are not deterministic. We jointly optimize the evidence lower bound (10) with respect to parameters of the prior and posterior distributions at the initial latent state z0, the prior and posterior drift, the diffusion function, the encoder, and the decoder. We include the details of datasets and architectures in Appendix 9.9.
For the stochastic Lorenz attractor, not only is the model able to reconstruct the data well, but also the learned prior process can produce bimodal samples in both data and latent space. This is showcased in the last row of Figure 6, where the latent and data space samples cluster around two modes. This is hard to achieve using a latent ODE with a unimodal Gaussian initial approximate posterior. We include additional visualizations in Appendix 9.10.
7.3 Motion Capture Dataset
To demonstrate that latent SDEs can learn complex dynamics from real-world datasets, we evaluated their predictive performance on a 50-dimensional motion capture dataset. The dataset, from Gan et al. [18], consists of 23 walking sequences of subject 35 partitioned into 16 training, 3 validation, and 4 test sequences. We follow the preprocessing of Wang et al. [85].
In designing the recognition network, we follow Yıldız et al. [90] and use a fully connected network to encode the first three observations of each sequence, and thereafter predict the remaining sequence. This encoder is chosen for fair comparison to existing models, and could be extended to a recurrent or attention model [84]. The overall architecture is described in Appendix 9.11 and is similar to that of ODE2VAE [90], with a similar number of parameters. We also use a fixed step size that is 1/5 of the smallest interval between any two observations [90].
We train latent ODE and latent SDE models with the Adam optimizer [34] and its default hyperparameter settings, with an initial learning rate of 0.01 that is exponentially decayed with rate 0.999 during each iteration. We perform validation over the number of training iterations, KL penalty [29], and KL annealing schedule.
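A schematic of this optimization setup is sketched below. The model and data-loader objects, the ELBO decomposition returned by the model, and the 100-iteration linear warm-up are assumptions for illustration; only the learning rate, decay factor, and iteration budget follow the values reported here.

import torch

def train_latent_sde(model, train_loader, max_iters=400, kl_warmup_iters=100):
    # model: hypothetical latent SDE whose forward pass returns (log_likelihood, kl)
    #        for a batch, matching the ELBO decomposition in Eq. (10).
    # train_loader: assumed iterator yielding mini-batches of sequences.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
    for it in range(max_iters):
        log_like, kl = model(next(train_loader))
        beta = min(1.0, it / kl_warmup_iters)  # linear KL annealing schedule
        loss = -(log_like - beta * kl)         # negative (annealed) ELBO
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                       # learning rate decayed by 0.999 each iteration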
All models were trained for at most 400 iterations, where we start to observe severe overfitting for most model instances. We report the test MSE on future observations following Yıldız et al. [90]. We believe that the improved performance is due to the strong regularization in path space, as removing the KL penalty improved training error but caused validation error to deteriorate.

Table 2: Test MSE on 297 future frames averaged over 50 samples. 95% confidence interval reported based on t-statistic; results from [90].

<
> + +8 Discussion + +We presented a generalization of the adjoint sensitivity method to compute gradients through solutions of SDEs. In contrast to existing approaches, this method has nearly the same time and memory complexity as simply solving the SDE. We showed how our stochastic adjoint framework can be combined with a gradient-based stochastic variational inference scheme for train.ing latent SDEs. +It is worthwhile to mention that SDEs and the commonly used GP models define two distinct classes of stochastic processes, albeit having a nonempty inter.section (e.g. Ornstein-Uhlenbeck processes fall under both). Computationally, the cost of fitting GPs lies in the matrix inversion, whereas the computational bottle.neck of training SDEs is the sequential numerical solve. Empirically, another avenue of research is to reduce the variance of gradient estimates. In the future, we may adopt techniques such as control variates or antithetic paths. +On the application side, our method opens up a broad set of opportunities for fitting any differentiable SDE model, such as Wright-Fisher models with selection and mutation parameters [15], derivative pricing models in finance, or infinitely-deep Bayesian neural networks [61]. In addition, the latent SDE model enabled by our frame.work can be extended to include domain knowledge and structural or stationarity constraints [48] in the prior process for specific applications. +On the theory side, there remain fundamental questions to be answered. Convergence rates of numerical gradients estimated with general schemes are unknown. Additionally, since our analyses are based on strong orders of schemes, it is natural to question whether convergence results still hold when we consider weak errors, and moreover if the method could be reformulated more coherently with rough paths theory [47]. + +Acknowledgements +We thank Yulia Rubanova, Danijar Hafner, Mufan Li, Shengyang Sun, Kenneth R. Jackson, Simo S�rkk�, Daniel Lacker, and Philippe Casgrain for helpful discus.sions. We thank �a�atay Y�ld�z for helpful discussions regarding evaluation settings of the mocap task. We also thank Guodong Zhang, Kevin Swersky, Chris Rackauckas, and members of the Vector Institute for helpful comments on an early draft of this paper. + +References +[1] Mart�n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Je�rey Dean, Matthieu Devin, Sanjay Ghemawat, Geo�rey Irving, Michael Isard, et al. Tensorflow: A system for large-scale +machine learning. In 12th Symposium on Oper. +ating Systems Design and Implementation, pages +265�283, 2016. +[2] R Adams. Sobolev Spaces. Academic Press, 1975. +[3] Joel Andersson. A general-purpose software frame.work for dynamic optimization. PhD thesis, Aren.berg Doctoral School, KU Leuven, 2013. +[4] Joel Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1�36, 2019. +[5] C�dric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John S Shawe-Taylor. variational inference for diffusion processes. In Advances in Neural Information Processing Systems, pages 17�24, 2008. +[6] VI Arnold. Ordinary Differential Equations. The MIT Press, 1978. +[7] Jonathan Baxter and Peter L Bartlett. Infinite-horizon gradient-based policy search. 2001. +[8] Robert Brown. ... microscopical observations ... on the particles contained in the pollen of plants. The Philosophical Magazine, 4(21):161�173, 1828. 
+[9] Pamela M Burrage, R Herdiana, and Kevin Bur-rage. Adaptive stepsize based on control theory for stochastic Differential equations. Journal of Computational and Applied Mathematics, 170(2): 317�336, 2004. +[10] Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017. +[11] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. +[12] Ricky Tian Qi Chen, Yulia Rubanova, Jesse Bet.tencourt, and David K Duvenaud. Neural ordinary Differential equations. In Advances in neural in.formation processing systems, pages 6571�6583, 2018. +[13] Kyunghyun Cho, Bart Van Merri�nboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. +[14] Koen Claessen and Micha. H Pa.ka. Splittable pseudorandom number generators using crypto.graphic hashing. In ACM SIGPLAN Notices, vol.ume 48, pages 47�58. ACM, 2013. +[15] Warren J Ewens. Mathematical population genetics 1: theoretical introduction, volume 27. Springer Science & Business Media, 2012. +[16] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing, 2018. +[17] Jessica G Gaines and Terry J Lyons. Variable step size control in the numerical solution of stochastic Differential equations. SIAM Journal on Applied Mathematics, 57(5):1455�1484, 1997. +[18] Zhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal sig.moid belief networks for sequence modeling. In Advances in Neural Information Processing systems, pages 2467�2475, 2015. +[19] Mike Giles and Paul Glasserman. Smoking ad-joints: Fast Monte Carlo greeks. Risk, 19(1):88�92, 2006. +[20] Paul Glasserman and David D Yao. Some guide.lines and guarantees for common random numbers. Management Science, 38(6):884�908, 1992. +[21] Peter W Glynn. Likelihood ratio gradient estima.tion for stochastic systems. Communications of the ACM, 33(10):75�84, 1990. +[22] Emmanuel Gobet and R�mi Munos. Sensitivity analysis using ItMalliavin calculus and martin.gales, and application to stochastic optimal control. SIAM Journal on control and optimization, 43(5): 1676�1713, 2005. +[23] Will Grathwohl, Ricky T. Q. Chen, Jesse Bet.tencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scal.able reversible generative models. International Conference on Learning Representations, 2019. +[24] Narendra Gupta and Raman Mehra. computational aspects of maximum likelihood estimation and reduction in sensitivity function calculations. IEEE transactions on automatic control, 19(6): 774�783, 1974. +[25] Jung-Su Ha, Young-Jin Park, Hyeok-Joo Chae, Soon-Seo Park, and Han-Lim Choi. Adaptive path-integral autoencoders: Representation learning and planning for dynamical systems. In Advances in Neural Information Processing Systems, pages 8927�8938, 2018. +[26] Eldad Haber and Lars Ruthotto. Stable architec.tures for deep neural networks. Inverse Problems, 34(1):014004, 2017. +[27] Pashupati Hegde, Markus Heinonen, Harri L�hdesm�ki, and Samuel Kaski. Deep learning with Differential gaussian process flows. 
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1812�1821, 2019. +[28] Markus Heinonen, Cagatay Yildiz, Henrik Man.nerstr, Jukka Intosalmi, and Harri L�hdesm�ki. Learning unknown ode models with gaussian pro.cesses. arXiv preprint arXiv:1803.04303, 2018. +[29] Irina Higgins, Loic Matthey, Arka Pal, Christo.pher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta.vae: Learning basic visual concepts with a con.strained variational framework. ICLR, 2(5):6, 2017. +[30] Silvana Ilie, Kenneth R Jackson, and Wayne H Enright. Adaptive time-stepping for the strong numerical solution of stochastic Differential equations. Numerical Algorithms, 68(4):791�812, 2015. +[31] Mike Innes, Alan Edelman, Keno Fischer, Chris Rackauckus, Elliot Saba, Viral B Shah, and Will Tebbutt. Zygote: A differentiable programming system to bridge machine learning and scien.ti�c computing. arXiv preprint arXiv:1907.07587, 2019. +[32] Junteng Jia and Austin R. Benson. Neural Jump Stochastic Differential Equations. arXiv e-prints, art. arXiv:1905.10403, May 2019. +[33] Hilbert Johan Kappen and Hans Christian Ruiz. Adaptive importance sampling for control and in.ference. Journal of Statistical Physics, 162(5): 1244�1266, 2016. +[34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. +[35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. +[36] Genshiro Kitagawa and Will Gersch. Linear gaus.sian state space modeling. In Smoothness Priors Analysis of Time Series, pages 55�65. Springer, 1996. +[37] Jack PC Kleijnen and Reuven Y Rubinstein. Op.timization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3): 413�427, 1996. +[38] Peter E Kloeden and Andreas Neuenkirch. The pathwise convergence of approximation schemes for stochastic Differential equations. LMS jour.nal of Computation and Mathematics, 10:235�253, 2007. +[39] Peter E Kloeden and Eckhard Platen. Numer.ical solution of stochastic Differential equations, volume 23. Springer Science & Business Media, 2013. +[40] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. +[41] Hiroshi Kunita. Stochastic Flows and Jump.diffusions. Springer, 2019. +[42] Harold Kushner and Paul G Dupuis. Numerical methods for stochastic control problems in continu.ous time, volume 24. Springer Science & Business Media, 2013. +[43] Pierre L�Ecuyer and Gafitan Perron. On the con.vergence rates of ipa and fdc derivative estimators. Operations Research, 42(4):643�656, 1994. +[44] Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Re.search, 18(1):5998�6026, 2017. +[45] Xuanqing Liu, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural sde: Stabilizing neural ode networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019. +[46] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridg.ing deep architectures and numerical Differential equations. arXiv preprint arXiv:1710.10121, 2017. +[47] Terry J Lyons. Differential equations driven by rough signals. Revista Matemfitica Iberoamericana, 14(2):215�310, 1998. +[48] Yi-An Ma, Tianqi Chen, and Emily Fox. 
A com.plete recipe for stochastic gradient mcmc. In Ad.vances in Neural Information Processing Systems, pages 2917�2925, 2015. +[49] Dougal Maclaurin, David Duvenaud, M Johnson, and RP Adams. Autograd: Reverse-mode differ.entiation of native python. In ICML workshop on Automatic Machine Learning, 2015. +[50] Isambi S Mbalawata, Simo S�rkk�, and Heikki Haario. Parameter estimation in stochastic differential equations with markov chain monte carlo and non-linear kalman filtering. Computational Statistics, 28(3):1195�1223, 2013. +[51] Grigori Noah Milstein and Michael V Tretyakov. Stochastic Numerics for Mathematical Physics. Springer Science & Business Media, 2013. +[52] Grigorii Noikhovich Milstein. Numerical integra.tion of stochastic Differential equations, volume 313. Springer Science & Business Media, 1994. +[53] Ivan Nourdin and Giovanni Peccati. Normal ap.proximations with Malliavin calculus: from Stein�s method to universality, volume 192. Cambridge University Press, 2012. +[54] Daniel Ocone and fitienne Pardoux. A general.ized itventzell formula. application to a class of anticipating stochastic Differential equations. 25 (1):39�71, 1989. +[55] Bernt �ksendal. Stochastic Differential Equations. Springer, 2003. +[56] Bernt Oksendal. Stochastic Differential equations: an introduction with applications. Springer Science & Business Media, 2013. +[57] Manfred Opper. Variational inference for stochas.tic Differential equations. Annalen der Physik, 531 (3):1800233, 2019. +[58] Etienne Pardoux and Shige Peng. Backward stochastic Differential equations and quasilinear parabolic partial Differential equations. In Stochas.tic Partial Differential Equations and Their Ap.plications, pages 200�217. Springer, 1992. +[59] Adam Paszke, Sam Gross, Soumith Chintala, Gre.gory Chanan, Edward Yang, Zachary DeVito, Zem.ing Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. +[60] Barak A Pearlmutter. Gradient calculations for dy.namic recurrent neural networks: A survey. IEEE Transactions on Neural networks, 6(5):1212�1228, 1995. +[61] Stefano Peluchetti and Stefano Favaro. Neural stochastic Differential equations. arXiv preprint arXiv:1904.01681, 2019. +[62] Shige Peng. A general stochastic maximum principle for optimal control problems. SIAM Journal on Control and Optimization, 28(4):966�979, 1990. +[63] Shige Peng and Zhen Wu. Fully coupled forward-backward stochastic Differential equations and ap.plications to optimal control. SIAM Journal on Control and Optimization, 37(3):825�843, 1999. +[64] Eckhard Platen. An introduction to numerical methods for stochastic Differential equations. Acta numerica, 8:197�246, 1999. +[65] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018. +[66] Christopher Rackauckas and Qing Nie. Adaptive methods for stochastic Differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical systems. Series B, 22(7):2731, 2017. +[67] Daniel Revuz and Marc Yor. Continuous martin.gales and Brownian motion, volume 293. Springer Science & Business Media, 2013. +[68] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014. +[69] L Chris G Rogers and David Williams. diffusions, Markov Processes and Martingales: Volume 2, ItCalculus, volume 2. Cambridge University Press, 2000. +[70] Andreas Rler. 
Runge�Kutta methods for stratonovich stochastic Differential equation systems with commutative noise. Journal of Com.putational and Applied mathematics, 164:613�627, 2004. +[71] Andreas Rler. Runge�Kutta methods for the strong approximation of solutions of stochastic Differential equations. SIAM Journal on Numerical Analysis, 48(3):922�952, 2010. +[72] Yulia Rubanova, Ricky TQ Chen, and David Du.venaud. Latent odes for irregularly-sampled time series. Neural Information Processing Systems, 2019. +[73] David E Rumelhart, Geo�rey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988. +[74] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial Differential equations. arXiv preprint arXiv:1804.04272, 2018. +[75] Thomas Ryder, Andrew Golightly, A Stephen Mc-Gough, and Dennis Prangle. Black-box variational inference for stochastic Differential equa.tions. arXiv preprint arXiv:1802.03335, 2018. +[76] John K Salmon, Mark A Moraes, Ron O Dror, and David E Shaw. Parallel random numbers: as easyas1,2, 3. In Proceedings of 2011 Interna.tional Conference for High Performance Comput.ing, Networking, Storage and Analysis, page 16. ACM, 2011. +[77] Simo S�rkk�. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013. +[78] Simo S�rkk� and Arno Solin. Applied stochas.tic Differential equations, volume 10. Cambridge University Press, 2019. +[79] Steven E Shreve. Stochastic calculus for finance II: Continuous-time models, volume 11. Springer Science & Business Media, 2004. +[80] Evangelos Theodorou. Nonlinear stochastic con.trol and information theoretic dualities: Connec.tions, interdependencies and thermodynamic in.terpretations. Entropy, 17(5):3352�3375, 2015. +[81] Ryan Turner, Marc Deisenroth, and Carl Ras.mussen. State-space inference and learning with gaussian processes. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 868�875, 2010. +[82] Belinda Tzen and Maxim Raginsky. Neural stochastic Differential equations: Deep latent gaus.sian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019. +[83] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. Proceeings of the Conference on Learning Theory, 2019. +[84] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, .ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998�6008, 2017. +[85] Jack M Wang, David J Fleet, and Aaron Hertz.mann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283�298, 2007. +[86] E Weinan. A proposal on machine learning via dy.namical systems. Communications in Mathematics and Statistics, 5(1):1�11, 2017. +[87] Magnus Wiktorsson et al. Joint characteristic function and simultaneous simulation of iterated itintegrals for multiple independent brownian motions. The Annals of Applied Probability, 11(2): 470�487, 2001. +[88] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce.ment learning. Machine Learning, 8(3-4):229�256, 1992. +[89] Jichuan Yang and Harold J Kushner. A monte carlo method for sensitivity analysis and paramet.ric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29 (5):1216�1249, 1991. 
+[90] �a�atay Y�ld�z, Markus Heinonen, and Harri L�hdesm�ki. Ode2vae: Deep generative second order odes with bayesian neural networks. arXiv preprint arXiv:1905.10994, 2019. + +9 Appendix + +Notation. For a fixed terminal time <>, we denote by <> the time horizon. Let <> be the class +of infinitely differentiable functions from Rd to itself. Let Cp,q be the class of functions from <> to <> that <> be +are p and q times continuously differentiable in the first and second component, respectively. Let <> the subclass with bounded derivatives of all possible orders. For a positive integer m, we adopt the short hand +[m]= {1, 2,...,m}. We denote the Euclidean norm of a vector v by |v|. For f . Cp,q, we denote its Jacobian with respect to the first component by rf. + +9.1 Proof of Theorem 3.1 +Proof of Theorem 3.1. We have <>, where <> is defined in (3). Now we take the gradient with respect to z on both sides. The solution is differentiable with respect to z and we may differentiate under the stochastic integral [41, Proposition 2.4.3]. Theorem 3.4.3 [41] is sufficient for the regularity conditions required. Since <>, applying the Stratonovich version of Its formula to (4), we have (5). + +9.2 Proof of Theorem 3.3 +Proof of Theorem 3.3. By the triangle inequality, + +<> + +We show that both I and I converge to 0 in probability as <>. For simplicity, we suppress z and W�. +Bounding I(1) . Let > 0 be given. Since Gh . G in probability, there exist M1 > 0 and h0 > 0 such that <>, <>, for all <>. +By Lemma 2.1 (iv) of Ocone and Pardoux [54], which can be easily adapted to our context, there exists a positive random variable C1, finite almost surely, such that <>, and there exists M2 > 0 such that <>. Given M2, there exists h1 > 0 such that + +<> + +Now, suppose <>. Then, by the union bound, with probability at least 1, we have  + +<> + +On this event, we have + +<> (1) + +Thus, we have shown that (1) converges to 0 in probability as <>. Bounding <>. The idea is similar. By condition (ii), we have + +<> + +in probability. Using this and condition (i), for given <>, there exist <> and <> such that for all <>, we have + +<> + +with probability at least 1. On this event, we have + +<> + +Thus <> also converges to 0 in probability as <>. + +9.3 Euler-Maruyama Scheme satisfies Local Uniform Convergence +Here we verify that the Euler-Maruyama scheme satisfies condition (ii) when d =1. Our proof can be extended to +the case where d> 1 assuming an Lp estimate of the error; see the discussion after the proof of Proposition 9.1. Proposition 9.1. Let Fh(z) be the Euler-Maruyama discretization of a 1-dimensional SDE with mesh size h of F(z). Then, for any compact <>, we have + +<> + +Usual convergence results in stochastic numerics only control the error for a single fixed starting point. Here, we strengthen the result to local uniform convergence. Our main idea is to apply a Sobolev inequality argument [54, Part II]. To do so, we need some preliminary results about the Euler-Maruyama discretization of the original SDE and its derivative. We first recall a theorem characterizing the expected squared error for general schemes. +Theorem 9.2 (Mean-square order of convergence [51, Theorem 1.1]). Let <> be the solution to an Ito SDE, and <> be a numerical discretization with fixed step size h, both of which are started at <> and defined on the same probability space. Let the coefficients of the SDE be <>. 
Furthermore, suppose that the numerical scheme has order of accuracy p1 for the expectation of deviation and order of accuracy p2 for the mean-square deviation. If <> and <>, then, for any <>, and <> +for a constant C that does not depend on h or z. + +We refer the reader to [51] for the precise definitions of orders of accuracy and the proof. Given this theorem, we establish an estimate regarding errors of the discretization and its derivative with respect to the initial position. + +Lemma 9.3. We have + + <>, + +where C1 is a constant independent of z and h. + +Proof of Lemma 9.3. Since the coefficients of the SDE are of class <>, we may differentiate the SDE in z to +b get the SDE for the derivative rzZz [41]. Specifically, letting <>, we have + +<> + +Note that the augmented process (F(z), rzF(z)) satisfies an SDE with <> coefficients. By the chain rule, +one can easily show that the derivative of the Euler-Maruyama discretization Fh(z) is the discretization of the derivative process Y z . Thus, (Fh(z), rzFh(z)) is simply the discretization of (F(z), rzF(z)). +Since the Euler-Maruyama scheme has orders of accuracy (p1,p2) = (1.5, 1.0) [51, Section 1.1.5], by Theorem 9.2, we have + +<> + +for some constant C1 that does not depend on z or h. + +We also recall a variant of the Sobolev inequality which we will apply for d =1. Theorem 9.4 (Sobolev inequality [2, Theorem 5.4.1.c]). For any p>d, there exists a universal constant cp such that + +<> + +where + +<> + +for all continuously differentiable <>. + +Proof of Proposition 9.1. define H. :� . R . R, regarded as a random function <>, by + +<> + +where <> is a fixed constant. Since H. is continuously differentiable a.s., by Theorem 9.4, + +<>, + +Without loss of generality, we may let the compact set be <> where <>. Then, + +<> (11) + +It remains to estimate <>. Starting from the definition of <>, a standard estimation yields + +<> + +where C2 is a deterministic constant depending only on . (but not z and h). +Now we take expectation on both sides. By Lemma 9.3, we have + +<> + +where the last integral is finite since <>. + +We have shown that <>. Thus kH.k. 0 in L2 , and hence also in probability, as <>. From equation 11, we have that <> converges to 0 in probability as <>. +It is clear from the above proof that we may generalize to the case where d> 1 and other numerical schemes if we can bound the expected <>, p-norm of <> in terms of z and h, for p>d, where W 1,p here denotes the Sobolev space consisting of all real-valued functions on Rd whose weak derivatives are functions in Lp. For the Euler scheme and <>, we need only bound the Lp norm of the discretization error in term of z and h for general p. To achieve this, we would need to make explicit the dependence on z for existing estimates (see e.g. [39, Chapter 10]). +Generically extending the argument to other numerical schemes, however, is technically non-trivial. We plan to address this question in future research. + +9.4 Stochastic Adjoint has Commutative Noise when Original SDE has Diagonal Noise +Recall the Stratonovich SDE (2) with drift and diffusion functions <> governed by a set of parameters <>. Consider the augmented state composed of the original state and parameters Yt =(Zt,.). The augmented state satisfies a Stratonovich SDE with the drift function <> and diffusion functions <> for <>. 
By (5) and (6), the dynamics for the adjoint process of the augmented state is characterized by the backward SDE: + +<> + +By definitions of f and gi, the Jacobian matrices rf(x, s) and rgi(x, s) can be written as: +  +<> + +Thus, we can write out the backward SDEs for the adjoint processes of the state and parameters separately: + +<> + +Now assume the original SDE has diagonal noise. Then, m = d and Jacobian matrix r.i(z) can be written as: + +<> + +Consider the adjoint process for the augmented state along with the backward flow of the backward SDE (3). We write the overall state as <>, where we abuse notation slightly to let <> denote the backward +flow process. Then, by (12) and (13), {Xt}t.T satisfies a backward SDE with a diffusion function that can be written as: + +<> + +Recall, for an SDE with diffusion function <>, it is said to satisfy the commutativity property [70] if + +<> + +for all j1,j2 . [m] and k . [d]. When an SDE has commutative noise, the computationally intensive double Itintegrals (and the Levy areas) need not be simulated by having the numerical scheme take advantage of the following property of iterated integrals [30]: + +<> + +where the Brownian motion increment <> for <> can be easily sampled. To see that the diffusion function (14) indeed satisfies the commutativity condition (15), we consider several cases: +<> Both LHS and RHS are zero unless j1 == k, since for .i,j2 (x) to be non-zero, <> Similar to the case above. Write <>, where <>. Both LHS and RHS are zero unless <>, since + +<> + +for <> to be non-zero <> or <> and <>. + +Since in all scenarios, LHS = RHS, we conclude that the commutativity condition holds. +Finally, we comment that the Milstein scheme for the stochastic adjoint of diagonal noise SDEs can be implemented such that during each iteration of the backward solve, vjp is only called a number of times independent respect to the dimensionality of the original SDE. + +9.5 Background on Latent SDE + +Consider a filtered probability space <>, where <> is a finite time horizon. +Recall the approximate posterior process that we intend to learn is governed by the SDE: + +<>, (16) + +Suppose there exists a measurable function u(z, t) such that <>, and <> satisfies Novikov's condition, i.e. <>. Novikov's condition ensures that the process + +<> + +is a P-martingale. By Girsanov Theorem II [56, Theorem 8.6.4], the process <> is a Wiener process under the probability measure Q defined by + +<>, + +Moreover, since a simple rewrite shows that + +<>, (17) + +we conclude that the Q-law of (17) (or equivalently (16)) is the same as the P -law of the prior process. + +9.5.1 Deriving the Variational Bound + +Let xt1,...,xtN be observation data at times t1,...,tN , whose conditionals only depend on the respective latent states zt1,...,ztN . Since the Q-law of the approximate posterior is the same as the P-law of the prior, + +<> + +where the second line follows from the definition of Q and third line follows from Jensen's inequality. In the last equality we used the fact that the Ito integral <> is a martingale. + +9.6 Stochastic Adjoint for Latent SDE + +Note that the variational free energy (10) can be derived from Girsanov's change of measure theorem [57]. 
To efficiently Monte Carlo estimate this quantity and its gradient, we simplify the equation by noting that for a one-dimensional process <> adapted to the filtration generated by a one-dimensional Wiener process <>, +if Novikov's condition [55] is satisfied, then the process defined by the Ito integral Vs dWs is a Martingale [55]. Hence, <>, and + +<> + +To Monte Carlo simulate the quantity in the forward pass along with the original dynamics, we need only extend the original augmented state with an extra variable Lt such that the new drift and diffusion functions for the new augmented state <> are + +<> + +By (7), the backward SDEs of the adjoint processes become + +<> (18) + +In this case, neither do we need to actually simulate the backward SDE of the extra variable nor do we need to simulate its adjoint. Moreover, when considered as a single system for the augmented adjoint state, the diffusion function of the backward SDE (18) satisfies the commutativity property (15). + +9.7 Test Problems + +In the following, <> and p are parameters of SDEs, and x0 is a fixed initial value. + +Example 1. + +<> + +Analytical solution: + +<> + +Example 2. + +<> + +Analytical solution: + +<> + +Example 3. + +<> + +Analytical solution: + +<> + +In each numerical experiment, we duplicate the equation 10 times to obtain a system of SDEs where each dimension had their own parameter values sampled from the standard Gaussian distribution and then passed through a sigmoid to ensure positivity. Moreover, we also sample the initial value for each dimension from a Gaussian distribution. + +<
> + +Figure 7: (a-c) Example 1. (d-f) Example 3. + +9.8 Results for Example 1 and 3 + +9.9 Toy Datasets Configuration + +9.9.1 Geometric Brownian Motion +Consider a geometric Brownian motion SDE: + +<>. + +We use <>, and <> as the ground-truth model, where <>. We sample 1024 time series, each of which is observed at intervals of 0.02 from time 0 to time 1. We corrupt this data using Gaussian noise with mean zero and standard deviation 0.01. +To recover the dynamics, we use a GRU-based [13] latent SDE model where the GRU has 1 layer and 100 hidden units, the prior and posterior drift functions are MLPs with 1 hidden layer of 100 units, and the diffusion function is an MLP with 1 hidden layer of 100 hidden units and the sigmoid activation applied at the end. The drift function in the posterior is time-inhomogenous in the sense that it takes in a context vector of size 1 at each observation that is output by the GRU from running backwards after processing all future observations. The decoder is a linear mapping from a 4 dimensional latent space to observation space. For all nonlinearities, we use the softplus function. We <> the observation model to be Gaussian with noise standard deviation 0.01. +We optimize the model jointly with respect to the parameters of a Gaussian distribution for initial latent state distribution, the prior and posterior drift functions, the diffusion function, the GRU encoder, and the decoder. We use a fixed discretization with step size of 0.01 in both the forward and backward pass. We use the Adam optimizer [34] with an initial learning rate of 0.01 that is decay by a factor of 0.999 after each iteration. We use a linear KL annealing schedule over the first 50 iterations. +9.9.2 Stochastic Lorenz Attractor + +Consider a stochastic Lorenz attractor SDE with diagonal noise: + +<>, + +<>, + +<>. + +We use <>, and (x0,y0,z0) sampled from the standard Gaussian distribution as the ground-truth model. We sample 1024 time series, each of which is observed at intervals of 0.025 from time 0 to time 1. We normalize these samples by their mean and standard deviation across each dimension and corrupt this data by Gaussian noise with mean zero and standard deviation 0.01. +We use the same architecture and training procedure for the latent SDE model as in the geometric Brownian motion section, except that the diffusion function consists of four small neural networks, each for a single dimension of the latent SDE. + +9.10 Additional Visualization + +<
>

Figure 8: Additional visualizations of learned posterior and prior dynamics on the synthetic stochastic Lorenz attractor dataset. The first row displays the true data and posterior reconstructions. The second row displays samples in which the initial latent state of each trajectory is sampled independently. The third row displays samples in which the initial latent state is sampled once and fixed to be the same across trajectories.

See Figure 8 for additional visualizations on the synthetic Lorenz attractor dataset and Figure 9 for visualizations on the synthetic geometric Brownian motion dataset. For the second example, the posterior reconstructs the data well, and the prior process exhibits behavior similar to that of the data. However, from the third row, we observe that the prior process is learned such that most of the uncertainty is accounted for by the initial latent state. We leave the investigation of more interpretable prior processes for future work.

9.11 Model Architecture for Learning from Motion Capture Dataset

We use a latent SDE model with an MLP encoder that takes in the first three frames and outputs the mean and log-variance of the variational distribution of the initial latent state along with a context vector. The decoder has a similar architecture to that of the ODE2VAE model [90] and projects the 6-dimensional latent state into the 50-dimensional observation space. The posterior drift function takes in a 3-dimensional context vector output by the encoder, the current state, and the time, whereas the prior drift takes in only the current state and time. The diffusion function is composed of multiple small neural nets, each producing a scalar for the corresponding

<

> + +Figure 9: Visualizations of learned posterior and prior dynamics on the synthetic geometric Brownian motion dataset. First row displays the true data and posterior reconstructions. Orange contour covers 95% of 512 samples. Second row displays samples with initial latent state for each trajectory is sampled independently. Third row displays samples with initial latent state sampled and fixed to be the same for different trajectories. + +dimension such that the posterior SDE has diagonal noise. We use the same observation likelihood as that of the ODE2VAE model [90]. We comment that the overall parameter count of our model (11605) is smaller than that of ODE2VAE for the same task (12157). +The latent ODE baseline was implemented with a similar architecture, except is does not have the diffusion and prior drift components, and its vector field defining the ODE does not take in a context vector. Therefore, the model has slightly fewer parameters (10573) than the latent SDE model. See Figure 10 for overall details of the architecture. +The main hyperparameter we tuned was the coefficient for reweighting the KL. For both the latent ODE and SDE, we considered training the model with a reweighting coefficient in {1, 0.1, 0.01, 0.001}, either with or without a linear KL annealing schedule that increased from 0 to the prescribed value over the first 200 iterations of training. + +9.12 Stochastic Adjoint Implementation + +We include the core implementation of the stochastic adjoint, assuming access to a callable Brownian motion bm, an Euler-Maruyama integrator ito_int_diag for diagonal noise SDEs, and several helper functions whose purposes can be inferred from their names. +<> +<> <> <> + + +<> <> <> + Scaling Laws for Neural Language Models + + + Jared Kaplan Sam McCandlish + + Johns Hopkins University, OpenAI OpenAI + jaredk@jhu.edu sam@openai.com + + + + Tom Henighan Tom B. Brown Benjamin Chess Rewon Child + OpenAI OpenAI OpenAI OpenAI + henighan@openai.com tom@openai.com bchess@openai.com rewon@openai.com + + Scott Gray Alec Radford Jeffrey Wu Dario Amodei + OpenAI OpenAI OpenAI OpenAI + scott@openai.com alec@openai.com jeffwu@openai.com damodei@openai.com + + + + Abstract + + We study empirical scaling laws for language model performance on the cross-entropy loss. + The loss scales as a power-law with model size, dataset size, and the amount of compute + used for training, with some trends spanning more than seven orders of magnitude. Other + architectural details such as network width or depth have minimal effects within a wide + range. Simple equations govern the dependence of overfitting on model/dataset size and the + dependence of training speed on model size. These relationships allow us to determine the + optimal allocation of a fixed compute budget. Larger models are significantly more sample- + efficient, such that optimally compute-efficient training involves training very large models + on a relatively modest amount of data and stopping significantly before convergence. + + + Equal contribution. + + Contributions: Jared Kaplan and Sam McCandlish led the research. Tom Henighan contributed the LSTM ex- + periments. Tom Brown, Rewon Child, and Scott Gray, and Alec Radford developed the optimized Transformer + implementation. Jeff Wu, Benjamin Chess, and Alec Radford developed the text datasets. Dario Amodei provided + guidance throughout the project. 
Contents

1 Introduction
2 Background and Methods
3 Empirical Results and Basic Power Laws
4 Charting the Infinite Data Limit and Overfitting
5 Scaling Laws with Model Size and Training Time
6 Optimal Allocation of the Compute Budget
7 Related Work
8 Discussion
Appendices
A Summary of Power Laws
B Empirical Model of Compute-Efficient Frontier
C Caveats
D Supplemental Figures

1 Introduction

Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world's text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models [RNSS18, DCLT18, YDY+19, LOG+19, RSR+19] approaching human-level performance on many specific tasks [WPN+19], including the composition of coherent multi-paragraph prompted text samples [RWC+19].

One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [VSP+17, LSP+18]. The high ceiling and low floor for performance on language tasks allow us to study trends over more than seven orders of magnitude in scale.

Throughout we will observe precise power-law scaling for performance as a function of training time, context length, dataset size, model size, and compute budget.

1.1 Summary

Our key findings for Transformer language models are as follows:

2 Here we display predicted compute when using a sufficiently small batch size. See Figure 13 for comparison to the purely empirical data.

<

>

Figure 1 Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute 2 used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.

Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)

Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)

Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. The performance penalty depends predictably on the ratio N^0.74/D, meaning that every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty. (Section 4)

Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)

Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss – in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)

Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).

Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample-efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as <> with training compute. (Section 6)

Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)

Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute.
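As a toy illustration of how such a power-law fit is obtained in practice, the sketch below fits L(N) = (Nc/N)^alpha_N to a handful of (parameter count, loss) pairs by linear regression in log-log space; the data points and the resulting coefficients are invented for illustration and are not measurements from this paper.

    import numpy as np

    # Invented (non-embedding parameter count, test loss) pairs; illustration only.
    N = np.array([1e6, 1e7, 1e8, 1e9])
    L = np.array([5.0, 4.2, 3.5, 2.9])

    # L(N) = (Nc / N)**alpha  <=>  log L = alpha*log(Nc) - alpha*log(N).
    slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
    alpha_N = -slope
    Nc = np.exp(intercept / alpha_N)

    print(f"alpha_N ~ {alpha_N:.3f}, Nc ~ {Nc:.2e}")
    print(f"extrapolated L at N = 1e10: {(Nc / 1e10) ** alpha_N:.2f}")

The same log-log regression recipe applies to the corresponding fits in dataset size D and compute C_min.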
We expect that larger language models will + perform better and be more sample efficient than current models. + + <
>

Figure 2 We show a series of language model training runs, with models ranging in size from 10^3 to 10^9 parameters (excluding embeddings).

<

> + + Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger + models, using larger batches, and training for more steps. We illustrate this for a billion-fold increase in + compute. For optimally compute-efficient training, most of the increase should go towards increased model + size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to + increase parallelism through larger batch sizes, with only a very small increase in serial training time required. + + + + 1.2 Summary of Scaling Laws + + The test loss of a Transformer trained to auto regressively model language can be predicted using a power-law + when performance is limited by only either the number of non-embedding parametersN, the dataset sizeD, + or the optimally allocated compute budget C_min (see Figure 1): + 1.For models with a limited number of parameters, trained to convergence on sufficiently large + datasets: + <> (non-embedding parameters) (1.1) + 2.For large models trained with a limited dataset with early stopping: + <> (tokens) (1.2) + 3.When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized + model, and a sufficiently small batch size (making optimal 3 use of compute): + <> + 3 We also observe an empirical power-law trend with the training computeC(Figure 1) while training at fixed batch + size, but it is the trend withCmin that should be used to make predictions. They are related by equation (5.5). + + <
>

Figure 4 Left: The early-stopped test loss L(N, D) varies predictably with the dataset size D and model size N according to Equation (1.5). Right: After an initial transient period, learning curves for all model sizes N can be fit with Equation (1.6), which is parameterized in terms of S_min, the number of steps when training at large batch size (details in Section 5.1).

These relations hold across eight orders of magnitude in C_min, six orders of magnitude in N, and over two orders of magnitude in D. They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads), with specific numerical values associated with the WebText2 training set [RWC+19]. The power laws αN, αD, and αC_min specify the degree of performance improvement expected as we scale up N, D, or C_min; for example, doubling the number of parameters yields a loss that is smaller by a factor <>. The precise numerical values of Nc, Cc_min, and Dc depend on the vocabulary size and tokenization and hence do not have a fundamental meaning.

The critical batch size, which determines the speed/efficiency tradeoff for data parallelism [MKAT18], also roughly obeys a power law in L:

<>

Equations (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to <>. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on N and D and governs the degree of overfitting:

<> (1.5)

with fits pictured on the left in Figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.

When training a given model for a finite number of parameter update steps S in the infinite data limit, after an initial transient period, the learning curves can be accurately fit by (see the right of Figure 4)

<> (1.6)

where <> and <>, and S_min(S) is the minimum possible number of optimization steps (parameter updates) estimated using Equation (5.4).

When training within a fixed compute budget C, but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size N, optimal batch size B, optimal number of steps S, and dataset size D should grow as

<> (1.7)

with

<> (1.8)

which closely matches the empirically optimal results N ∝ C_min^0.73, B ∝ C_min^0.24, and S ∝ C_min^0.03. As the computational budget C increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size (see Figure 3). This also implies that as models grow larger, they become increasingly sample-efficient. In practice, researchers typically train smaller models for longer than would be maximally compute-efficient because of hardware constraints. Optimal performance depends on total compute as a power law (see Equation (1.3)).

We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers [DGV+18].

1.3 Notation

We use the following notation:

L – the cross-entropy loss in nats. Typically it will be averaged over the tokens in a context, but in some cases we report the loss for specific tokens within the context.
N – the number of model parameters, excluding all vocabulary and positional embeddings

C ≈ 6NBS – an estimate of the total non-embedding training compute, where B is the batch size and S is the number of training steps (i.e. parameter updates). We quote numerical values in PF-days, where one PF-day = 10^15 × 24 × 3600 = 8.64 × 10^19 floating point operations.

D – the dataset size in tokens

B_crit – the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.

C_min – an estimate of the minimum amount of non-embedding compute to reach a given value of the loss. This is the training compute that would be used if the model were trained at a batch size much less than the critical batch size.

S_min – an estimate of the minimal number of training steps needed to reach a given value of the loss. This is also the number of training steps that would be used if the model were trained at a batch size much greater than the critical batch size.

αX – power-law exponents for the scaling of the loss as <>, where X can be any of <>.

2 Background and Methods

We train language models on WebText2, an extended version of the WebText [RWC+19] dataset, tokenized using byte-pair encoding [SHB15] with a vocabulary size n_vocab = 50257. We optimize the autoregressive log-likelihood (i.e. cross-entropy loss) averaged over a 1024-token context, which is also our principal performance metric. We record the loss on the WebText2 test distribution and on a selection of other text distributions. We primarily train decoder-only [LSP+18, RNSS18] Transformer [VSP+17] models, though we also train LSTM models and Universal Transformers [DGV+18] for comparison.

2.1 Parameter and Compute Scaling of Transformers

We parameterize the Transformer architecture using hyperparameters n_layer (number of layers), d_model (dimension of the residual stream), d_ff (dimension of the intermediate feed-forward layer), d_attn (dimension of the attention output), and n_heads (number of attention heads per layer). We include n_ctx tokens in the input context, with n_ctx = 1024 except where otherwise noted.

We use N to denote the model size, which we define as the number of non-embedding parameters

<> (2.1)

where we have excluded biases and other sub-leading terms. Our models also have n_vocab · d_model parameters in an embedding matrix, and use n_ctx · d_model parameters for positional embeddings, but we do not include these when discussing the 'model size' N; we will see that this produces significantly cleaner scaling laws.

Evaluating a forward pass of the Transformer involves roughly

<> (2.2)

add-multiply operations, where the factor of two comes from the multiply-accumulate operation used in matrix multiplication. A more detailed per-operation parameter and compute count is included in Table 1.

<

> + + Table 1 Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading + terms such as nonlinearities, biases, and layer normalization are omitted. + + + + For contexts and models with d model > n ctx =12, the context-dependent computational cost per token is a + relatively small fraction of the total compute. Since we primarily study models where d model n ctx=12, + we do not include context-dependent terms in our training compute estimate. Accounting for the backwards + pass (approximately twice the compute as the forwards pass), we then define the estimated non-embedding + compute as <> floating point operators per training token. + + 2.2 Training Procedures + + Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed <> steps with + a batch size of512sequences of1024tokens. Due to memory constraints, our largest models (more than + 1B parameters) were trained with Adafactor [SS18]. We experimented with a variety of learning rates and + schedules, as discussed in Appendix D.6. We found that results at convergence were largely independent of + learning rate schedule. Unless otherwise noted, all training runs included in our data used a learning rate + schedule with a 3000 step linear warmup followed by a cosine decay to zero. + + 2.3 Datasets + + We train our models on an extended version of the WebText dataset described in [RWC + 19]. The original + WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at + least 3 karma. In the second version, WebText2, we added outbound Reddit links from the period of January + to October 2018, also with a minimum of 3 karma. The karma threshold served as a heuristic for whether + people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k + python library. In total, the dataset consists of 20.3M documents containing 96 GB of text and <> + words (as defined bywc). We then apply the reversible tokenizer described in [RWC + 19], which yields + <> tokens. We reserve <> of these tokens for use as a test set, and we also test on similarly- + prepared samples of Books Corpus [ZKZ + 15], Common Crawl [Fou], English Wikipedia, and a collection + of publicly-available Internet Books. + + 3 Empirical Results and Basic Power Laws + + To characterize language model scaling we train a wide variety of models, varying a number of factors + including: + + Model size (ranging in size from 768 to 1.5 billion non-embedding parameters) + Dataset size (ranging from 22 million to 23 billion tokens) + Shape (including depth, width, attention heads, and feed-forward dimension) + Context length (1024 for most runs, though we also experiment with shorter contexts) + Batch size (219 for most runs, but we also vary it to measure the critical batch size) + + <
>

Figure 5 Performance depends very mildly on model shape when the total number of non-embedding parameters N is held fixed. The loss varies only a few percent over a wide range of shapes. Small differences in parameter counts are compensated for by using the fit to L(N) as a baseline. Aspect ratio in particular can vary by a factor of 40 while only slightly impacting performance; an (n_layer, d_model) = (6, 4288) model reaches a loss within 3% of the (48, 1600) model used in [RWC+19].

<

> + + Figure 6 Left:When we include embedding parameters, performance appears to depend strongly on the + number of layers in addition to the number of parameters.Right:When we exclude embedding parameters, + the performance of models with different depths converge to a single trend. Only models with fewer than 2 + layers or with extreme depth-to-width ratios deviate significantly from the trend. + + + In this section we will display data along with empirically-motivated fits, deferring theoretical analysis to + later sections. + + 3.1 Approximate Transformer Shape and Hyperparameter Independence + + Transformer performance depends very weakly on the shape parameters n layer; n heads , and d when we hold + the total non-embedding parameter count N fixed. To establish these results we trained models with fixed + size while varying a single hyperparameter. This was simplest for the case of n heads . When varying n layer, + we simultaneously varied d model while keeping <> layer d2 fixed. Similarly, to vary d model at fixed + model size we also simultaneously varied the d model parameter, as required by the parameter counts in Table + 1. Independence of n layers would follow if deeper Transformers effectively behave as ensembles of shallower + models, as has been suggested for ResNets [VWB16]. The results are shown in Figure 5. + + 3.2 Performance with Non-Embedding Parameter CountN + + In Figure 6 we display the performance of a wide variety of models, ranging from small models with shape + (n layer, d model) = (2,128)through billion-parameter models, ranging in shape from(6;4288)through + (207;768). Here we have trained to near convergence on the full WebText2 dataset and observe no over- + fitting (except possibly for the very largest models). + As shown in Figure 1, we find a steady trend with non-embedding parameter countN, which can be fit to the + first term of Equation (1.5), so that + + <> (3.1) + + <
> + + Figure 7 + + + To observe these trends it is crucial to study performance as a function ofN; if we instead use the total + parameter count (including the embedding parameters) the trend is somewhat obscured (see Figure 6). This + suggests that the embedding matrix can be made smaller without impacting performance, as has been seen in + recent work [LCG + 19]. + Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets + is also a power-law in N with nearly identical power, as shown in Figure 8. + + 3.2.1 Comparing to LSTMs and Universal Transformers + In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter + countN. The LSTMs were trained with the same dataset and context length. We see from these figures + that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match + the Transformer performance for later tokens. We present power-law relationships between performance and + context position Appendix D.5, where increasingly large powers for larger models suggest improved ability + to quickly recognize patterns. + We also compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure + 17 in the appendix. These models re-use parameters, and so perform slightly better as a function ofN, at the + cost of additional compute per-parameter. + + 3.2.2 Generalization Among Data Distributions + We have also tested our models on a set of additional text data distributions. The test loss on these datasets + as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 + dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct + parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the + in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. + We also observe no dependence on model depth (see Appendix D.8). + + 3.3 Performance with Dataset Size and Compute + + We display empirical trends for the test loss as a function of dataset sizeD(in tokens) and training compute + Cin Figure 1. + For the trend withDwe trained a model with <> on fixed subsets of the WebText2 + dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be + fit with simple power-law + + <> (3.2) + + in the dataset size. The data and fit appear in Figure 1. + The total amount of non-embedding compute used during training can be estimated asC= 6NBS, where + Bis the batch size,Sis the number of parameter updates, and the factor of6accounts for the forward and + backward passes. Thus for a given value ofCwe can scan over all models with variousNto find the model + + <
> + + Figure 8 Left:Generalization performance to other data distributions improves smoothly with model size, + with only a small and very slowly growing offset from the WebText2 training distribution.Right: + Generalization performance depends only on training distribution performance, and not on the phase of training. + We compare generalization of converged models (points) to that of a single large model (dashed curves) as it + trains. + + + with the best performance on stepS= C . Note that in these results the batch size B remains fixed for + all models, which means that these empirical results are not truly optimal. We will account for this in later 6BS + sections using an adjusted C_min to produce cleaner trends. + The result appears as the heavy black line on the left-hand plot in Figure 1. It can be fit with + + <> (3.3) + + The figure also includes images of individual learning curves to clarify when individual models are optimal. + We will study the optimal allocation of compute more closely later on. The data strongly suggests that sample + efficiency improves with model size, and we also illustrate this directly in Figure 19 in the appendix. + + 4 Charting the Infinite Data Limit and Overfitting + + In Section 3 we found a number of basic scaling laws for language modeling performance. Here we will + study the performance of a model of size N trained on a dataset with D tokens while varying N and D + simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling + law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing + size while keeping overfitting under control. + + 4.1 Proposed L(N;D) Equation + + We have chosen the parameterization (1.5) (repeated here for convenience): + + <> (4.1) + + using three principles: + + 1.Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The + parameterization of L(N;D) (and all models of the loss) must naturally allow for such a rescaling. + 2.Fixing D and sending N!1, the overall loss should approachL(D). Conversely, fixing N and + sending D!1 the loss must approach L(N). + 3.L(N;D) should be analytic atD=1, so that it has a series expansion in 1=D with integer powers. + Theoretical support for this principle is significantly weaker than for the first two. + + Our choice of L(N;D) satisfies the first requirement because we can rescaleNc ;D c with changes in the + vocabulary. This also implies that the values ofNc ;D c have no fundamental meaning. + + <
> + + Figure 9 The early-stopped test lossL(N;D)depends predictably on the dataset size D and model sizeN + according to Equation (1.5).Left: For largeD, performance is a straight power law inN. For a smaller fixed + D, performance stops improving as N increases and the model begins to overfit. (The reverse is also true, + see Figure 4.)Right: The extent of overfitting depends predominantly on the ratio <>, as predicted in + equation (4.3). The line is our fit to that equation. + + + Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we + expect that larger models should always perform better than smaller models. But with fixed finiteD, we also + do not expect any model to be capable of approaching the best possible loss (ie the entropy of text). Similarly, + a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note + that knowledge ofL(N)at infinite D and L(D) at infinite N fully determines all the parameters inL(N;D). + The third principle is more speculative. There is a simple and general reason one might expect overfitting + to scale/1=Dat very largeD. Overfitting should be related to the variance or the signal-to-noise ratio + of the dataset [AS17], and this scales as1=D. This expectation should hold for any smooth loss function, + since we expect to be able to expand the loss about theD! 1limit. However, this argument assumes that + 1=D corrections dominate over other sources of variance, such as the finite batch size and other limits on the + efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability. + Our third principle explains the asymmetry between the roles of N and D in Equation (1.5). Very similar + symmetric expressions 4 are possible, but they would not have a 1=D expansion with integer powers, and + would require the introduction of an additional parameter. + In any case, we will see that our equation forL(N;D)fits the data well, which is the most important justification + for our L(N;D). + + 4.2 Results + + We regularize all our models with 10% dropout, and by tracking test loss and stopping once it is no longer + decreasing. The results are displayed in Figure 9, including a fit to the four parameters <> in + Equation (1.5): + + <
>

Table 2 Fits to L(N, D)

We obtain an excellent fit, with the exception of the runs where the dataset has been reduced by a factor of 1024, to about <> tokens. With such a small dataset, an epoch consists of only 40 parameter updates. Perhaps such a tiny dataset represents a different regime for language modeling, as overfitting happens very early in training (see Figure 16). Also note that the parameters differ very slightly from those obtained in Section 3, as here we are fitting the full L(N, D) rather than just L(N, ∞) or L(∞, D).

To chart the borderlands of the infinite data limit, we can directly study the extent of overfitting. For all but the largest models, we see no sign of overfitting when training with the full 22B token WebText2 dataset, so we can take it as representative of D = ∞. Thus we can compare finite D to the infinite data limit by

4 For example, one might have used <>, but this does not have a 1/D expansion.

<

> + + Figure 10 The critical batch size B crit follows a power law in the loss as performance increase, and does + not depend directly on the model size. We find that the critical batch size approximately doubles for every + 13%decrease in loss B crit is measured empirically from the data shown in Figure 18, but it is also roughly + predicted by the gradient noise scale, as in [MKAT18]. + + + defining + <> (4.2) + + and studying it as a function ofN;D. In fact, we see empirically that L depends only a specific combination + of N and D, as shown in Figure 16. This follows from the scaling law of Equation (1.5), which implies + + <> (4.3) + + Note that at large D this formula also has a series expansion in powers of 1=D. + We estimate that the variation in the loss with different random seeds is roughly <>, which means that to + avoid overfitting when training to within that threshold of convergence we require + + <> (4.4) + + With this relation, models smaller than10 9 parameters can be trained with minimal overfitting on the 22B + token WebText2 dataset, but our largest models will encounter some mild overfitting. More generally, this + relation shows that dataset size may grow sub-linearly in model size while avoiding overfitting. Note however + that this does not typically represent maximally compute-efficient training. We should also emphasize that + we have not optimized regularization (eg the dropout probability) while varying dataset and model size. + + 5 Scaling Laws with Model Size and Training Time + + In this section we will demonstrate that a simple scaling law provides a good description for the loss as a + function of model size N and training time. First we will explain how to use the results of [MKAT18] to + define a universal training step S_min , which accounts for the fact that most of our models have not been + trained at an optimal batch size. Then we will demonstrate that we can fit the model size and training time + dependence of the loss using Equation (1.6). Later we will use these results to predict the optimal allocation + of training compute between model size and training time, and then confirm that prediction. + + 5.1 Adjustment for Training at B_crit (L) + + A simple empirical theory for the batch size dependence of training was developed in [MKAT18] (see also + [SLA + 18, ZLN + 19]). It was argued that there is a critical batch size B_crit for training; forBup to B_crit + the batch size can be increased with very minimal degradation in compute-efficiency, whereas for <> increases in + B result in diminishing returns. It was also argued that the gradient noise scale provides a simple + prediction for B_crit , and that neither depends directly on model size except through the value of the loss that + has been attained. These results can be used to predict how training time and compute will vary with the + batch size. To utilize both training time and compute as effectively as possible, it is best to train with a batch + size <>. Training at <> minimizes the number of training steps, while <> minimizes + the use of compute. + More specifically, it was demonstrated that for a wide variety of neural network tasks, the number of training + stepsSand the number of data examples processed E=BS satisfy the simple relation + + <> (5.1) + + when training to any fixed value of the lossL. Here S_min is the minimum number of steps necessary to reach + L, while E_min is the minimum number of data examples that must be processed. 
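To unpack relation (5.1), the sketch below solves it for the number of steps S implied by a given batch size B (substituting E = B·S) and evaluates the time/compute tradeoff at a few batch sizes; the values of S_min and E_min are made-up placeholders rather than measurements from the paper.

    def steps_for_batch(B, S_min, E_min):
        # Substituting E = B*S into (S/S_min - 1) * (E/E_min - 1) = 1 and
        # solving for S gives S = S_min + E_min / B.
        return S_min + E_min / B

    S_min, E_min = 1.0e4, 2.0e9          # placeholder values for one target loss
    B_crit = E_min / S_min               # batch size that balances time and compute

    for B in (B_crit / 10, B_crit, 10 * B_crit):
        S = steps_for_batch(B, S_min, E_min)
        E = B * S
        print(f"B = {B:>12,.0f}   steps = {S / S_min:.2f} x S_min   examples = {E / E_min:.2f} x E_min")

At B = E_min/S_min the run takes about twice the minimum number of steps and processes about twice the minimum number of examples, which is the compromise singled out as the critical batch size in what follows.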
+ We demonstrate the relation (5.1) for Transformers in Figure 18 in the appendix. This relation defines the + critical batch size + + <> (5.2) + + which is a function of the target value of the loss. Training at the critical batch size makes a roughly optimal + time/compute tradeoff, requiring 2S_min training steps and processing <> data examples. + In Figure 10 we have plotted the critical batch size and gradient noise scale 5 as a function of training loss for + two different models. We see that B_crit(L) is independent of model size, and only depends on the lossL. So + the predictions of [MKAT18] continue to hold for Transformer language models. The critical batch size can + be fit with a power-law in the loss + + <> (5.3) + + where <> and <>. + + We have chosen this parameterization for B_crit(L) because as the loss approaches its minimum value L_min, + the gradient noise scale is expected to diverge, and we expect B_crit to track this noise scale. We do not + know L_min, as we see no sign that our models are approaching it, but L_min>0 since the entropy of natural + language is non-zero. Since apparently L_min is much smaller than the values ofLwe have achieved, we used + a parameterization where B_crit diverges asL!0. + We will use B_crit (L)to estimate the relation between the number of training steps S while training at batch + sizeB= 2 19 tokens and the number of training steps while training at <>. This is simply + + <> (5.4) + + for any given target value L for the loss. This also defines a critical value of the compute needed to train toL + with a model of sizeNif we were to train at <>. This is + + <> (5.5) + + where <> estimates the (non-embedding) compute used at batch size B. + + 5.2 Results for <> and Performance with Model Size and Compute + + Now we will use S_min defined in Equation (5.4) to obtain a simple and universal fit for the dependence of the + loss on model size and training time in the infinite data limit. We will fit the stable, Adam-optimized training + runs using Equation (1.6), repeated here for convenience: + + <> (5.6) + + for the loss. We include all training steps after the warmup period of the learning rate schedule, and find a fit + to the data with the parameters: + 5 Although the critical batch size roughly matches the gradient noise scale, we are using a direct measurements of + B_crit from Figures 18 and 10 for all our later analyses. + + <
>

Figure 11 When we hold either the total compute or the number of training steps fixed, performance follows L(N, S) from Equation (5.6). Each value of the compute budget has an associated optimal model size that maximizes performance. Mediocre fits at small S are unsurprising, as the power-law equation for the learning curves breaks down very early in training.

<

> + + Table 3 Fits toL(N;S) + + + With these parameters, we obtain the learning curve fits in Figure 4. Though the fits are imperfect, we believe + they are quite compelling given the simplicity of Equation (5.6). + The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we + study the test loss as a function of model size while fixing either the total non-embedding compute C used + in training, or the number of stepsS. For the fits we use Equation (5.5) and (5.4) along with the parameters + above and Equation (5.6). + The power-law dependence of the loss on S_min reflects the interplay of optimizer dynamics and the loss + landscape. Since the fits are best late in training, when the loss may be approximately quadratic, the power- + law should provide information about the spectrum of the Hessian of the loss. Its universality suggests that + the Hessian eigenvalue density is roughly independent of model size. + + 5.3 Lower Bound on Early Stopping Step + + The results for<>can be used to derive a lower-bound (and rough estimate) of the step at which + early stopping should occur when training is data limited. It is motivated by the idea that finite and infiniteD + learning curves for a given model will be very similar until we reach <>. Thus overfitting should + be proportional to the correction from simply ending training at S stop . This will underestimate S_stop, because + in reality the test loss will decrease more slowly when we have a finiteD, and therefore we will require more + training steps to reach the optimal test loss at finiteD. This line of reasoning leads to the inequality + + <> (5.7) + + whereL(N;1)is the converged loss, evaluated with infinite available data. This inequality and its + comparison to the empirical data is displayed in Figure 16 in the appendix. In that figure, the values of S stop and L(N;D) are empirical (though S stop is adjusted to mimic training at <>), while L(N;1) is + computed from the fit to L(N;D) evaluated at D=1. + + + 6 Optimal Allocation of the Compute Budget + + We displayed the empirical trend of performance as a function of the computation used during training in + the top-right of Figure 1. However, this result involved training at a fixed batch sizeB, whereas we know + + <
>

Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than the compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.

<

> + + Figure 13 When adjusting performance to simulate training far below the critical batch size, we find a + somewhat altered power law for L(C_min) when compared with the fully empirical results. The conspicuous + lump at <> PF-days marks the transition from 1-layer to 2-layer networks; we exclude 1-layer networks + in the power-law fits. It is the L(C_min) trend that we expect to provide a reliable extrapolation for larger + compute. + + + that in fact we could train more efficiently 6 by training at the batch size B_crit discussed in Section 5.1. + Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively, + and correcting for this inefficiency by standardizing to the critical batch size results in cleaner and more + predictable trends. + In this section we will adjust for this oversight. More importantly, we will use the results of Section 5 + to determine the optimal allocation of compute between model size N and the quantity of data processed + during training, namely <>. We will determine this allocation both empirically and theoretically, by + using the equation for <>, and we will demonstrate that these methods agree. + + 6.1 Optimal Performance and Allocations + + Let us first study the loss as a function of the optimally allocated compute from Equation (5.5). The result is + plotted in Figure 13, along with a power-law fit. We see that as compared to the compute plot of Figure 1, the + new fit with C_min is somewhat improved. + Given L(C_min), it is natural to ask for the optimal model size N(C_min) that provides the minimal loss with a + given quantity of training compute. The optimal model size is shown in Figure 14. We observe that N(C_min) + + 6 One might ask why we did not simply train at B_crit in the first place. The reason is that it depends not only on the + model but also on the target value of the loss we wish to achieve, and so is a moving target. + + <> + + Figure 14 Left:Each value of the compute budget C_min has an associated optimal model sizeN. Optimal + model size grows very rapidly with C_min, increasing by 5x for each 10x increase in compute. The number + of data examples processed makes up the remainder of the increase, growing relatively modestly by only 2x. + Right:The batch-adjusted number of optimization steps also grows very slowly, if at all, meaning that most + of the growth in data examples processed can be used for increased batch sizes. + + + can be fit very well with a power-law + + <> (6.1) + + In Figure 12, we show the effect of training models of sub-optimal sizes (see Appendix B.4). + By definition <>, and so we can use <> to extract further results. In particular, since + prior fits show <> and <>, we can conclude that <>. This leads us to conclude min that + the optimal number of steps will only grow very slowly with compute, as + + <>; (6.2) + + matching the empirical results in Figure 14. In fact the measured exponent is sufficiently small that our results + may even be consistent with an exponent of zero. + + Thus we conclude that as we scale up language modeling with an optimal allocation of computation, we + should predominantly increase the model sizeN, while simultaneously scaling up the batch size via <> + with negligible increase in the number of serial steps. Since compute-efficient training uses relatively + few optimization steps, additional work on speeding up early training dynamics may be warranted. 
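As a back-of-the-envelope illustration of this allocation rule, the sketch below applies the empirical exponents quoted in the summary (N ∝ C_min^0.73, B ∝ C_min^0.24, S ∝ C_min^0.03) to a billion-fold increase in compute, in the spirit of Figure 3; it is only an arithmetic aid, not a reproduction of the paper's fits.

    # Empirical allocation exponents quoted in Section 1.2.
    ALPHA_N, ALPHA_B, ALPHA_S = 0.73, 0.24, 0.03

    def allocate(compute_multiplier):
        """Multiplicative growth of model size, batch size, and serial steps
        when the compute budget grows by `compute_multiplier`."""
        return {
            "model size":   compute_multiplier ** ALPHA_N,
            "batch size":   compute_multiplier ** ALPHA_B,
            "serial steps": compute_multiplier ** ALPHA_S,
        }

    for name, growth in allocate(1e9).items():
        print(f"{name:12s} grows by roughly {growth:,.1f}x")

Since the three exponents sum to roughly one, the growth factors multiply back to the assumed compute increase, consistent with C ≈ 6NBS.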
6.2 Predictions from <>

The results for <> and the allocations can be predicted from the <> equation obtained in Section 5. Given our equation for <>, we can substitute <> and then find the minimum of the loss as a function of N, while fixing the training compute. We carry out this procedure in detail in Appendix B, where we also provide some additional predictions.

For the loss as a function of training compute, we predict that

<> (6.3)

in excellent agreement with the exponent of Figure 13. We also predict that

<> (6.5)

which also matches the scaling of Figure 14 to within a few percent. Our scaling laws provide a predictive framework for the performance of language modeling.

<

> + + Figure 15 Far beyond the model sizes we study empirically, we find a contradiction between our equations + for<>andL(D)due to the slow growth of data needed for compute-efficient training. The intersection + marks the point before which we expect our predictions to break down. The location of this point is highly + sensitive to the precise exponents from our power-law fits. + + + 6.3 Contradictions and a Conjecture + + We observe no signs of deviation from straight power-law trends at large values of compute, data, or model + size. Our trends must eventually level off, though, since natural language has non-zero entropy. + Indeed, the trends for compute-efficient training described in this section already contain an apparent contra- + diction. At scales several orders of magnitude above those documented here, the performance predicted by + the<>scaling law decreases below what should be possible given the slow growth in training data with + compute. This implies that our scaling laws must break down before this point, but we conjecture that the + intersection point has a deeper meaning: it provides an estimate of the point at which Transformer language + models reach maximal performance. + Since the amount of data used by compute-efficient training grows slowly with the compute budget, the + performance predicted by<>eventually hits a lower bound set by theL(D)power law (see Figure 15). + Let us work this out in more detail. + To keep overfitting under control, the results of Section 4 imply that we should scale the dataset size as + + <> (6.6) + + where we have used the compute-efficient <> from Figure 14. + Let us compare this to the data requirements of compute-efficient training. If we train at the critical batch + size (i.e. <>) and never re-use data during training, we find that data usage grows with compute as + + <> (6.7) + + This is the maximum rate at which the dataset size can productively grow with compute, since it means that + we are only training for a single epoch. But it grows the dataset much more slowly than in Equation (6.6). + It appears to imply that compute-efficient training will eventually run into a problem with overfitting, even if + the training process never re-uses any data! + According to Figure 1, we expect that when we are bottlenecked by the dataset size (ie by overfitting), the + loss should scale as <>. This implies that the loss would scale with compute as <> + once we are data-limited. Once again, we have a contradiction, as this will eventually intersect with min + our prediction for <> from Figure 13, where we found a scaling <> + The intersection point of <> and <> occurs at + + <> (6.8) + + though the numerical values are highly uncertain, varying by an order or magnitude in either direction de- + pending on the precise values of the exponents from the power-law fits. The most obvious interpretation is + that our scaling laws break down at or before we reach this point, which is still many orders of magnitude + away in both compute and model size. + One might also conjecture that this intersection point has a deeper meaning. If we cannot increase the model + size beyond N without qualitatively different data requirements, perhaps this means that once we reach + C and N, we have extracted all of the reliable information available in natural language data. In this min + interpretation, L would provide a rough estimate for the entropy-per-token 7 of natural language. In this + scenario, we would expect the loss trend to level off at or before L. 
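To see how such an intersection point is located, the sketch below equates two illustrative decreasing power laws for the loss, one standing in for compute-efficient training and one for the data-limited regime, and solves for the crossover compute; the coefficients and exponents are placeholders chosen only so the code runs, since the paper's own fitted values (and hence the actual C* and L*) are highly uncertain.

    def power_law_intersection(a1, p1, a2, p2):
        # Solve a1 * C**(-p1) == a2 * C**(-p2) for C, assuming p1 != p2.
        return (a2 / a1) ** (1.0 / (p2 - p1))

    # Placeholder coefficients; not the paper's fitted values.
    a_compute, p_compute = 2.5, 0.050    # loss vs. compute for compute-efficient training
    a_data,    p_data    = 1.8, 0.020    # loss vs. compute once data-limited

    C_star = power_law_intersection(a_compute, p_compute, a_data, p_data)
    L_star = a_compute * C_star ** (-p_compute)
    print(f"crossover at C* ~ {C_star:.2e} (arbitrary units), L* ~ {L_star:.2f}")

Because the two exponents are close, small changes in either one move the crossover by orders of magnitude, which is why the location of this point is so uncertain.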
+ We can guess at the functional form of<>as it levels off by considering a version of our training + dataset with added noise. For example, we could append a random string of tokens to each context shown + to the model to artificially boost the loss by a constant additive factor. Then, the distance from the noise + floor LxL noise would be a more meaningful performance metric, with even a small decrease in this distance + potentially representing a significant boost in qualitative performance. Since the artificial noise would affect + all of our trends equally, the critical point of 6.8 would not change (aside from the absolute value of L, and + may be meaningful even if it occurs after the leveling off. + + 7 Related Work + + Power laws can arise from a wide variety of sources [THK18]. Power-law scalings with model and dataset + size in density estimation [Was06] and in random forest models [Bia12] may be connected with our results. + These models suggest that power-law exponents may have a very rough interpretation as the inverse of the + number of relevant features in the data. + Some early [BB01, Goo01] work found power-law scalings between performance and dataset size. More + recent work [HNA + 17, HAD19] also investigated scaling between model size and data size; their work is + perhaps the closest to ours in the literature 8 . Note, however, that [HNA + 17] found super-linear scaling of + dataset size with model size, whereas we find a sub-linear scaling. There are some parallels between our + findings on optimal allocation of compute and [Kom19], including power-law learning curves. EfficientNets + [TL19] also appear to obey an approximate power-law relation between accuracy and model size. Very recent + work [RRBS19b] studies scaling with both dataset size and model size for a variety of datasets, and fits an + ansatz similar to ours. + EfficientNet [TL19] advocates scaling depth and width exponentially (with different coefficients) for optimal + performance of image models, resulting in a power-law scaling of width as a function of depth. We find that + for language models this power should be roughly one when scaling up (as width/depth should remain fixed). + But more importantly, we find that the precise architectural hyperparameters are unimportant compared to the + overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles + of shallower models, which could potentially explain this finding. Earlier work [ZK16] has compared width + and depth, and found that wide ResNets can outperform deep ResNets on image classification. Some studies + fix computation per data example, which tends to scale in proportion to the number of model parameters, + whereas we investigate scaling with both model size and the quantity of training computation. + Various works [AS17, BHMM18] have investigated generalization in highly overparameterized models, find- + ing a “jamming transition” [GJS + 19] when the model size reaches the dataset size (this may require training + many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do + not observe such a transition, and find that the necessary training data scales sublinearly in the model size. + Expansions in the model size, particularly at large width [JGH18, LXS + 19], may provide a useful framework + for thinking about some of our scaling relations. 
Our results on optimization, such as the shape of learning curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions [ZLN + 19] in realistic settings. Making this connection quantitative will require a characterization of the Hessian spectrum [Pap18, GKX19, GARD18].

8 Discussion

We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with <> are power-laws, there are diminishing returns with increasing scale.

7 Defining words using the wc utility, the WebText2 dataset has 1.4 tokens per word and <> characters per token.
8 After this work was completed, [RRBS19a] also appeared, which makes similar predictions for the dependence of loss on both model and dataset size.

We were able to precisely model the dependence of the loss on N and D, and alternatively on N and S, when these parameters are varied simultaneously. We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework. One might interpret these relations as analogues of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.
It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a ‘statistical mechanics’ underlying the ‘thermodynamics’ we have observed. Such a theory might make it possible to derive other more precise predictions, and provide a systematic understanding of the limitations of the scaling laws.
In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: “more is different”. For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.
Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining [HCC + 18], which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used.
Wide networks on the other hand are more amenable to parallelization + [SCP + 18], since large layers can be split between multiple workers with less serial dependency. Sparsity + [CGRS19, GRK17] or branching (e.g. [KSH12]) may allow for even faster training of large networks through + increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, + it might be possible to remain on the compute-efficient frontier for an entire training run. + + Acknowledgements + + We would like to thank Shan Carter, Paul Christiano, Jack Clark, Ajeya Cotra, Ethan Dyer, Jason Eisner, + Danny Hernandez, Jacob Hilton, Brice Menard, Chris Olah, and Ilya Sutskever for discussions and for feed- + back on drafts of this work. + + + Appendices + + + A Summary of Power Laws + + For easier reference, we provide a summary below of the key trends described throughout the paper. + + <
> + + Table 4 + + The empirical fitted values for these trends are: + + <
> + + Table 5 + + The optimal parameters for compute efficient training are given by: + + <
>

Table 6


B Empirical Model of Compute-Efficient Frontier

Throughout this appendix all values of C, S, and αC are adjusted for training at the critical batch size B_crit. We have left off the ‘adj’ label to avoid cluttering the notation.

B.1 Defining Equations

The power-law fit to the learning curves implies a simple prescription for compute-efficient training. In this appendix, we will derive the optimal performance, model size, and number of training steps as a function of the compute budget. We start with Equation (1.6), repeated here for convenience:

<> (B.1)

Here, S represents the number of parameter updates when training at the critical batch size [MKAT18], which was defined in Equation (5.2) 9:

<> (B.2)

We would like to determine optimal training parameters for a fixed compute budget, so we replace S = <>, where C is the number of FLOPs used in the training run:

<> (B.3)

Now, we set ∂L/∂N |_C = 0 to find the condition for optimality:

<> (B.4)

Equations (B.3) and (B.4) together determine the compute-efficient frontier.

B.2 Efficient Training

Now we assemble the implications of (B.3) and (B.4). First, note that inserting (B.4) into (B.3) yields

<> (B.5)

which implies that for compute-efficient training, we should train to a fixed percentage (roughly 10%) above the converged loss.
Next, let’s determine how the optimal loss depends on the compute budget. Eliminating S and N yields a power-law dependence of performance on compute:

<> (B.6)

where we defined

<> (B.7)

<> (B.8)

Similarly, we can eliminate L to find N(C):

<> (B.9)

and

<> (B.10)

9 There is a slight ambiguity here: we can imagine training either at a constant batch size <>, or we could instead train at a variable batch size B~(L), where B~ is the instantaneous critical batch size (as opposed to B, which is the averaged version). These two prescriptions result in the same number of steps, so we can ignore this subtlety (see [MKAT18]).

B.3 Comparison to Inefficient

Typically, researchers train models until they appear to be close to convergence. In this section, we compare the efficient training procedure described above to this more typical setup. We define the convergence factor f as the percent deviation from the converged loss:

<> (B.11)

For compute-efficient training we have <> from the previous section, but researchers typically use a much smaller value. Here, we choose f = 2% as an estimate. For a fixed value of the loss, we predict:

<> (B.12)

<> (B.13)

<> (B.14)

So compute-efficient training uses 7.7x fewer parameter updates, 2.7x more parameters, and 65% less compute to reach the same loss.

B.4 Suboptimal Model Sizes

We can solve (B.1) to find an expression for the amount of compute needed to reach a given value of the loss L with a model of size N:

<> (B.15)

Using (B.6) and (B.9), we can eliminate L in favor of N_e(L), the model size which reaches L most efficiently. From there, we find an expression for the excess compute needed as a consequence of using a suboptimal model size:

<> (B.16)

The result is shown in Figure X. Models between 0.6x and 2.2x the optimal size can be used with only a 20% increase in compute budget. Using a smaller model is useful when accounting for the cost of inference.
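As a rough numerical illustration of the suboptimal-model-size trade-off in Equations (B.15)-(B.16), the sketch below assumes a two-term ansatz L(N, S) = (Nc/N)^aN + (Sc/S)^aS with C proportional to N·S, and scans over model sizes to find the cheapest one that reaches a target loss. The exponents, constants, and helper names are placeholders of our own, not this paper's fitted values.

```python
# A hedged numerical sketch of the logic behind Equations (B.15)-(B.16),
# assuming L(N, S) = (Nc/N)**aN + (Sc/S)**aS and C proportional to N*S.
# All constants are placeholders, not the paper's fits.
import numpy as np

aN, aS = 0.076, 0.76        # placeholder exponents
Nc, Sc = 8.8e13, 2.1e3      # placeholder constants

def compute_to_reach(L_target, N):
    """Compute (up to a constant factor) needed to reach L_target at size N."""
    gap = L_target - (Nc / N) ** aN
    if gap <= 0:
        return np.inf               # this model size can never reach L_target
    S = Sc / gap ** (1.0 / aS)      # steps required at this model size
    return N * S                    # C is proportional to N * B * S; constants dropped

L_target = 3.0
sizes = np.logspace(6, 10, 2000)
costs = np.array([compute_to_reach(L_target, n) for n in sizes])
i_best = int(np.argmin(costs))
N_eff, C_min = sizes[i_best], costs[i_best]

for ratio in (0.6, 1.0, 2.2):
    C = compute_to_reach(L_target, ratio * N_eff)
    print(f"N/N_eff = {ratio:>3}: C/C_min = {C / C_min:.2f}")
```

Near its minimum the cost curve is flat, which is why a fairly wide range of model sizes around the optimum costs only modestly more compute.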
A larger model can be trained to the same level of performance in fewer steps, allowing for more parallelism and faster training if sufficient hardware is available (see Figure Y):

<> (B.17)

A 2.2x larger model requires 45% fewer steps at a cost of 20% more training compute. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve after initial transient effects.

C Caveats

In this section we list some potential caveats to our analysis.

At present we do not have a solid theoretical understanding for any of our proposed scaling laws. The scaling relations with model size and compute are especially mysterious. It may be possible to understand scaling at very large D, holding model size fixed [AS17], and also the shape of learning curves late in training, by modeling the loss with a noisy quadratic. But the scaling with D at very large model size still remains mysterious. Without a theory or a systematic understanding of the corrections to our scaling laws, it’s difficult to determine in what circumstances they can be trusted.

<
>

Figure 16 Left: We characterize the step on which early stopping occurs, as a function of the extent of overfitting. The red line indicates a lower bound for early stopping that is derived in Section 5.3. Right: We display train and test loss for a series of 300M parameter models trained on different sized dataset subsamples. The test loss typically follows that of a run done with unrestricted data until diverging. Note that the degree of overfitting (as compared to the infinite data limit) is significantly overestimated by L_test − L_train (denoted by a black bar for each run).


We are not especially confident in the prediction of B_crit(L) for values of the loss far outside the range we have explored. Changes in B_crit could have a significant impact on trade-offs between data parallelism and the number of serial training steps required, which would have a major impact on training time.
We did not thoroughly investigate the small data regime, and our fits for L(N;D) were poor for the smallest values of D (where an epoch corresponded to only 40 steps). Furthermore, we did not experiment with regularization and data augmentation. Improvements in these could alter our results, quantitatively or qualitatively.
We used the estimated training compute <>, which did not include contributions proportional to nctx (see Section 2.1). So our scalings with compute may be confounded in practice in the regime of very large nctx, specifically where nctx ≳ 12 d_model.
We tuned learning rates, and we experimented with learning rate schedules. But we may have neglected to tune some hyperparameter (e.g. initialization scale or momentum) that has an important effect on scaling.
The optimal choice of learning rate is sensitive to the target loss. When training close to convergence, it may be necessary to use a smaller learning rate to avoid divergences. But when conducting a short training run (e.g. due to compute limitations), it may be possible to use a larger learning rate. We did not experiment with higher learning rates for training runs that did not proceed to convergence.

D Supplemental Figures

D.1 Early Stopping and Test vs Train

In Section 5.3 we described the result shown in Figure 16, which provides a prediction for a lower bound on the early stopping step. We also show the train and test loss for a given model size when training on different sized datasets.

D.2 Universal Transformers

We compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure 17. These models re-use parameters, and so perform slightly better as a function of N, but slightly worse as a function of compute C. We include several different possibilities for parameter re-use.

D.3 Batch Size

We measure the critical batch size using the data displayed in Figure 18. This made it possible to estimate B_crit(L) in Figure 10.

<
>

Figure 17 We compare recurrent Transformers [DGV + 18], which re-use parameters, to standard Transformers. Recurrent Transformers perform slightly better when comparing models with equal parameter count, but slightly worse when accounting for reuse and comparing per FLOP.

<
>

Figure 18 These figures demonstrate fits to Equation (5.1) for a large number of values of the loss L, and for two different Transformer model sizes. These fits were used to measure B_crit(L) for Figure 10.


D.4 Sample Efficiency vs Model Size

It is easy to see from Figure 2 that larger models train faster, and are therefore more sample efficient. We provide another way of looking at this phenomenon in Figure 19, which shows when different models reach various fixed values of the loss.

<
>

Figure 19 The minimum number of serial steps needed to reach any fixed value of the test loss decreases precipitously with model size. Sample efficiency (shown here for training far below the critical batch size) improves greatly as well, improving by a factor of almost 100 when comparing the smallest possible model to a very large one.

<
>

Figure 20 This figure provides information about the performance per token as a function of model size and training time. Left: Loss per token as a function of its position T in the 1024-token context. Loss scales predictably as a power-law in T. Right: Test loss per token as a function of training step.

<
>

Figure 21 In addition to the averaged loss, individual tokens within the 1024-token context also improve smoothly as model size increases. Training runs with shorter context nctx = 8 (dashed lines) perform better on early tokens, since they can allocate all of their capacity to them.


D.5 Context Dependence

The trends for loss as a function of model size are displayed for different tokens in the context in Figure 21. We see that models trained on nctx = 1024 show steady improvement with model size on all but the first token.
Fixing model size, it appears that the loss scales as a power-law as a function of position T in the context, see Figure 20. This may be a consequence of underlying power-law correlations in language [EP94, ACDE12, LT16], or a more general feature of the model architecture and optimization. It provides some suggestion for the potential benefits (or lack thereof) from training on larger contexts. Not only do larger models converge to better performance at T = 1024, but they also improve more quickly at early tokens, suggesting that larger models are more efficient at detecting patterns with less contextual information. In the right-hand plot we show how per-token performance varies for a fixed model as a function of the training step. The model begins by learning short-range information, and only learns longer-range correlations later in training.
We have also included models trained with a tiny context nctx = 8 in order to compare with our longer context models. Even modestly sized models trained on nctx = 8 can dominate our largest nctx = 1024 models on very early tokens. This also suggests that further improvements should be possible with much larger models trained on large contexts.

D.6 Learning Rate Schedules and Error Analysis

We experimented with a variety of learning rates and schedules. A host of schedules and resulting test performances for a small language model are plotted in Figure 22. We conclude that the choice of learning rate schedule is mostly irrelevant, as long as the total summed learning rate is sufficiently large, and the schedule includes a warmup period and a final decay to near-vanishing learning rate. Variations among

<
>

Figure 22 We test a variety of learning rate schedules, including cosine decay, linear decay, and other faster/slower decay schedules, on a 3 million parameter model, shown on the left. For these experiments we do not decay to zero, since we find that this tends to give a fixed improvement close to the end of training. We find that, as long as the learning rate is not too small and does not decay too quickly, performance does not depend strongly on learning rate. Run-to-run variation is at the level of 0.05 in the loss, so averaging multiple runs is necessary to validate performance changes smaller than this level.

<
> + + Figure 23 The trend for performance as a function of parameter count,L(N), is fit better by a power law + than by other functions such as a logarithm at a qualitative level. + + + schedules appear to be statistical noise, and provide a rough gauge for the scale of variation between different + training runs. Experiments on larger models suggest that the variation in the final test loss between different + random seeds is roughly constant in magnitude for different model sizes. + We found that larger models require a smaller learning rate to prevent divergence, while smaller models can + tolerate a larger learning rate. To implement this, the following rule of thumb was used for most runs: + + <
> (D.1)

We expect that this formula could be improved. There may be a dependence on network width, likely set by the initialization scale. The formula also breaks down for N > 10^10 parameters. Nevertheless, we found that it works sufficiently well for the models we considered.

D.7 Fit Details and Power Law Quality

We experimented with a number of functional forms for the fits to <> and <>; the power-law fits were qualitatively much more accurate than other functions such as logarithms (see Figure 23).
For L(C), we do not include small models with only 1 layer in the fit, as the transition from 1 to 2 layers causes a noticeable lump in the data. For L(N) we also do not include very small models with only 1 layer in the fit, and we exclude the largest models that have not trained fully to convergence. Fit parameters change marginally if we do include them, and the trend extrapolates well in both directions regardless.

D.8 Generalization and Architecture

In Figure 24 we show that generalization to other data distributions does not depend on network depth when we hold the total parameter count fixed. It seems to depend only on the performance on the training distribution.

<>

Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 billion parameters. We observe no effect of depth on generalization; generalization performance depends primarily on training distribution performance. The 12-layer model overfit the Internet Books dataset and we show the early-stopped performance; we have not seen this surprising result in other experiments.


List of Figures

1 Summary of simple power laws. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3
2 Illustration of sample efficiency and compute efficiency. . . . . . . . . . . . . . . . . . . . .4
3 How to scale up model size, batch size, and serial steps . . . . . . . . . . . . . . . . . . . .4
4 Performance when varying model and data size, or model and training steps, simultaneously5
5 Weak dependence of performance on hyperparameter tuning . . . . . . . . . . . . . . . . .8
6 Comparison of performance trend when including or excluding embeddings . . . . . . . . .8
7 LSTM and Transformer performance comparison . . . . . . . . . . . . . . . . . . . . . . .9
8 Generalization to other test datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10
9 Universality of overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11
10 Critical batch size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12
11 Performance versus compute budget or number of parameter updates . . . . . . . . . . . . .14
12 Training on suboptimal models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15
13 Comparison between empirical and adjusted compute trends . . . . . . . . . . . . . . . . .15
14 Optimal model size and serial number of steps versus compute budget . . . . . . . . . . . .16
15 Contradiction between compute and data trends . . . . . . . . . . . . . . . . . . . . . . . .17
16 Early stopping lower bound and training curves for overfit models . . . . . . . . . . . . . .23
17 Universal transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24
18 Batch size scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24
19 Another look at sample efficiency . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .24 + 20 Power-law dependence of performance on position in context . . . . . . . . . . . . . . . . .25 + 21 Performance at different context positions versus model size . . . . . . . . . . . . . . . . .25 + 22 Learning rate schedule scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26 + 23 Comparison of Power-Law and Logarithmic Fits . . . . . . . . . . . . . . . . . . . . . . .26 + 24 Generalization versus depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27 + + List of Tables + + 1 Parameter and compute counts for Transformer . . . . . . . . . . . . . . . . . . . . . . . .7 + 2 Fits toL(N;D). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11 + 3 Fits toL(N;S) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14 + 4 Key trend equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20 + 5 Key parameters to trend fits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20 + 6 Trends for compute-efficient training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20 + + References + + [ACDE12]Eduardo G Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti. On the origin of long- + range correlations in texts.Proceedings of the National Academy of Sciences, 109(29):11582– + 11587, 2012. 25 + [AS17]Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in + neural networks.arXiv, 2017, 1710.03667. 11, 18, 22 + [BB01]Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disam- + biguation. InProceedings of the 39th annual meeting on association for computational linguis- + tics, pages 26–33. Association for Computational Linguistics, 2001. 18 + [BHMM18]Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine + learning and the bias-variance trade-off.arXiv, 2018, 1812.11118. 18 + [Bia12]GÊrard Biau. Analysis of a random forests model.Journal of Machine Learning Research, + 13(Apr):1063–1095, 2012. 18 + [CGRS19]Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with + sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509. URLhttp://arxiv.org/ + abs/1904.10509. 19 + [DCLT18]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep + bidirectional transformers for language understanding, 2018, arXiv:1810.04805. 2 + [DGV + 18]Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Uni- + versal transformers. CoRR, abs/1807.03819, 2018, 1807.03819. URLhttp://arxiv.org/ + abs/1807.03819. 6, 9, 23, 24 + [EP94]Werner Ebeling and Thorsten Pöschel. Entropy and long-range correlations in literary english. + EPL (Europhysics Letters), 26(4):241, 1994. 25 + [Fou]The Common Crawl Foundation. Common crawl. URLhttp://commoncrawl.org. 7 + [GARD18]Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. + 2018, arXiv:1812.04754. 18 + [GJS + 19]Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, + Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with + number of parameters in deep learning.arXiv, 2019, 1901.01608. 18 + [GKX19]Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net op- + timization via hessian eigenvalue density. CoRR, abs/1901.10159, 2019, 1901.10159. 
URL + http://arxiv.org/abs/1901.10159. 18 + [Goo01]Joshua Goodman. A bit of progress in language modeling.CoRR, cs.CL/0108005, 2001. URL + http://arxiv.org/abs/cs.CL/0108005. 18 + [GRK17]Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights.ope- + nai.com, 2017. 19 + [HAD19]Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy: Compu- + tational challenges in deep learning. InProceedings of the 24th Symposium on Principles and + Practice of Parallel Programming, PPoPP ’19, pages 1–14, New York, NY, USA, 2019. ACM. + doi:10.1145/3293883.3295710. 18 + [HCC + 18]Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, + and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. + CoRR, abs/1811.06965, 2018, 1811.06965. URLhttp://arxiv.org/abs/1811.06965. 19 + [HNA + 17]Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kia- + ninejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is pre- + dictable, empirically, 2017, 1712.00409. 18 + [JGH18]Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and + generalization in neural networks. InAdvances in neural information processing systems, pages + 8571–8580, 2018. 18 + [KB14]Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014, + 1412.6980. 7 + [Kom19]Aran Komatsuzaki. One epoch is all you need, 2019, arXiv:1906.06669. 18 + [KSH12]Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep + convolutional neural networks. InProceedings of the 25th International Conference on Neural + Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran + Associates Inc. URLhttp://dl.acm.org/citation.cfm?id=2999134.2999257. 19 + [LCG + 19]Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu + Soricut. Albert: A lite bert for self-supervised learning of language representations, 2019, + 1909.11942. 9 + [LOG + 19]Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike + Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretrain- + ing approach. CoRR, abs/1907.11692, 2019, 1907.11692. URLhttp://arxiv.org/abs/ + 1907.11692. 2 + [LSP + 18]Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and + Noam Shazeer. Generating wikipedia by summarizing long sequences.arXiv:1801.10198 [cs], + 2018, 1801.10198. URLhttp://arxiv.org/abs/1801.10198. 2, 6 + [LT16]Henry W Lin and Max Tegmark. Criticality in formal languages and statistical physics.arXiv + preprint arXiv:1606.06737, 2016. 25 + [LXS + 19]Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl- + Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models + under gradient descent, 2019, arXiv:1902.06720. 18 + [MKAT18]Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model + of large-batch training, 2018, arXiv:1812.06162. 3, 5, 6, 12, 13, 21 + [Pap18]Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. + CoRR, abs/1811.07062, 2018, 1811.07062. URLhttp://arxiv.org/abs/1811.07062. 18 + [RNSS18]Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language + understanding by generative pre-training.URL https://s3-us-west-2. amazonaws. 
com/openai- + assets/research-covers/languageunsupervised/language understanding paper. pdf, 2018. 2, 6 + [RRBS19a]Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive + prediction of the generalization error across scales, 2019, 1909.12673. 18 + [RRBS19b]Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive + prediction of the generalization error across scales, 2019, arXiv:1909.12673. 18 + [RSR + 19]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, + Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified + text-to-text transformer, 2019, arXiv:1910.10683. 2 + [RWC + 19]Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language + models are unsupervised multitask learners.openai.com, 2019. 2, 5, 6, 7, 8 + [SCP + 18]Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanan- + takool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and + Blake Hechtman. Mesh-tensorflow: Deep learning for supercomputers, 2018, 1811.02084. 19 + [SHB15]Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words + with subword units.CoRR, 2015, 1508.07909. 6 + [SLA + 18]Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and + George E. Dahl. Measuring the effects of data parallelism on neural network training, 2018, + arXiv:1811.03600. 12 + [SS18]Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory + cost.CoRR, abs/1804.04235, 2018, 1804.04235. URLhttp://arxiv.org/abs/1804.04235. + 7 + [THK18]Stefan Thurner, Rudolf Hanel, and Peter Klimek.Introduction to the theory of complex systems. + Oxford University Press, 2018. 18 + [TL19]Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural + networks.CoRR, abs/1905.11946, 2019, 1905.11946. URLhttp://arxiv.org/abs/1905. + 11946. 18 + [VSP + 17]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, + S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural + Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL + http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. 2, 6 + [VWB16]Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles + of relatively shallow networks, 2016, arXiv:1605.06431. 8, 18 + [Was06]Larry Wasserman.All of nonparametric statistics. Springer Science & Business Media, 2006. + [WPN + 19]Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, + Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose + language understanding systems, 2019, 1905.00537. 2 + [WRH17]Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by in- + creasing model capacity.2017 IEEE Conference on Computer Vision and Pattern Recognition + (CVPR), Jul 2017. doi:10.1109/cvpr.2017.323. 19 + [WYL19]Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional + networks, 2019, 1906.02909. 19 + [YDY + 19]Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. + Le. 
Xlnet: Generalized autoregressive pretraining for language understanding, 2019, + arXiv:1906.08237. 2 + [ZK16]Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.Procedings of the British + Machine Vision Conference 2016, 2016. doi:10.5244/c.30.87. 18 + [ZKZ + 15]Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Tor- + ralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by + watching movies and reading books.2015 IEEE International Conference on Computer Vision + (ICCV), Dec 2015. doi:10.1109/iccv.2015.11. 7 + [ZLN + 19]Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, + Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch + sizes? insights from a noisy quadratic model.CoRR, abs/1907.04164, 2019, 1907.04164. URL + http://arxiv.org/abs/1907.04164. 12, 18 +<> <> <> + + +<> <> <> +Structured Pruning of Convolutional Neural Networks via L1 Regularization + +CHEN YANG1,2, ZHENGHONG YANG1,2, ABDUL MATEEN KHATTAK2,3 , LIU YANG1,2, WENXIN ZHANG1,2, WANLIN GAO1,2 , AND MINJUAN WANG1,2 +1Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China 2College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China 3Department of Horticulture, The University of Agriculture, Peshawar 25120, Pakistan +Corresponding authors: Wanlin Gao (wanlin_cau@163.com) and Minjuan Wang (minjuan@cau.edu.cn) +This work was supported by the Project of Scientific Operating Expenses from Ministry of Education of China under Grant 2017PT19. + +ABSTRACT +Deep learning architecture has achieved amazing success in many areas with the recent advancements in convolutional neural networks (CNNs). However, real-time applications of CNNs are seriously hindered by the significant storage and computational costs. Structured pruning is a promising method to compress and accelerate CNNs and does not need special hardware or software for an auxiliary calculation. Here a simple strategy of structured pruning approach is proposed to crop unimportant filters or neurons automatically during the training stage. The proposed method introduces a mask for all filters or neurons to evaluate their importance. Thus the filters or neurons with zero mask are removed. To achieve this, the proposed method adopted L1 regularization to zero filters or neurons of CNNs. Experiments were conducted to assess the validity of this technique. The experiments showed that the proposed approach could crop 90.4%, 95.6% and 34.04% parameters on LeNet-5, VGG-16, and ResNet-32respectively, with a negligible loss of accuracy. + + +INDEX +TERMS Convolutional neural networks, regularization, structured pruning. + + +I. INTRODUCTION + +During the recent years, convolutional neural networks (CNNs) [1] have accomplished successful applications in many areas such as image classification [2], object detection [3], neural style transfer [4], identity authentication [5], information security [6], speech recognition and natural language processing. However, these achievements were made through leveraging large-scale networks, which possessed millions or even billions of parameters. Those large-scale networks heavily relied on GPUs to accelerate computation. Moreover, devices with limited resources, such as mobile, FPGA or embedded devices, etc. have difficulties to deploy CNNs in actual applications. 
Thus, it is critical to accelerate the inference of CNNs and reduce storage for a wide range of applications [7].
According to the studies done so far, the major approaches for compressing deep neural networks can be categorized into four groups, i.e. low-rank decomposition [8], parameter quantization [9], knowledge distillation [10]-[13], and network pruning [14]. For the deep neural networks (DNN) that have been trained, the low-rank decomposition technology decomposes a tensor into smaller low-rank approximations to achieve compression. The low-rank decomposition achieves efficient speedup because it reduces the elements of the matrix. However, it can only decompose or approximate tensors one by one within every layer, and cannot discover the redundant parameters of DNN. Besides, more research has been focused on network module designs, which are smaller, more efficient and more sophisticated. These models, such as SqueezeNet [15], MobileNet [16] and Shufflenet [17], are basically built from low-resolution convolutions with fewer parameters and better performance.
At present, network pruning is a major focus of research, which not only accelerates DNN, but also reduces redundant parameters. Using a large-scale network directly may provide state-of-the-art performance, so a large-scale network is usually learned; however, the optimal network architecture may not be known. Thus, a massive redundancy exists in large neural networks. To combat this problem, network pruning is useful to remove redundant parameters, filters, channels or neurons, and address the over-fitting issue.

<
>

FIGURE 1. The architecture of the layer with the mask. (a) The architecture of a convolutional layer with the mask. (b) The architecture of a fully-connected layer with the mask. The proposed approach chooses the unimportant filters and neurons (highlighted in yellow) by the order of magnitude of the mask value.

Network pruning techniques can also be broadly categorized as structured pruning and non-structured pruning. Non-structured pruning aims to remove single parameters that have little influence on the accuracy of networks, and it is efficient and effective for compacting networks. Nonetheless, non-structured pruning is difficult to use widely in practical applications. In many prevalent deep learning frameworks, the operation of convolution is reformulated as a matrix-by-matrix multiplication, so non-structured pruning methods require additional information to represent the pruned locations. Therefore, special hardware or software is needed to assist with the calculation, which may increase computation time. Instead, structured pruning directly removes entire filters, channels or neurons. Thus, the remaining network architecture can be used directly by the existing hardware. For example, Anwar et al. [18] employed particle filtering to impose structured sparsity on convolutional neural networks at the channel-wise, kernel-wise, and intra-kernel stride levels. At present, several structured pruning methods [24], [25], [27] are mainly based on the statistical information of parameters or activation outputs. These methods do not consider the loss and are unable to remove parameters during training. In addition, some methods, such as those mentioned by [19], [20], require layer-by-layer iterative pruning and accuracy recovery, which involves enormous calculations. On the contrary, the proposed approach links pruning with minimization of loss and can be implemented during training.
It is inspiring that filters whose weights are all zero can be safely removed, because, whatever the input, they would not extract any features. This study presents a scheme to prune filters and neurons of fully-connected layers based on L1 regularization [21], which zeroes out the weights of some filters or neurons. Similar to this method, Wen et al. [31] adopted group LASSO regularization [40] to zero out filters. However, all the weights are required to compute an extra gradient, which is computationally expensive for a large-scale network.
Contrarily, in the proposed method, a mask is introduced to address this issue and the regularization term is only the l1-norm of the mask, so the gradients of the mask are easy to compute. In this method, the parameters of filters or neurons are multiplied by a mask to pick unimportant filters or neurons, and once the mask is zero the corresponding filter or neuron will be removed. Here, though a mask is introduced for filters or neurons, the method does not change the architecture of the network. This allows other compression methods to be used with the proposed technique. Similar to the proposed method, Lin et al. [32] also adopted a mask to identify unimportant filters or neurons, but the value of the mask could not be changed by training. In addition, removing unimportant filters or neurons may temporarily degrade accuracy, but the network can be retrained to recover performance. FIGURE 1 shows the framework of the proposed method.
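As a concrete illustration of the mask idea just described, the sketch below adds a trainable per-filter mask after a convolution and places an l1 penalty on the mask alone, so only the mask (not the full weight tensor) contributes an extra regularization gradient. It is a minimal sketch assuming TensorFlow/Keras 2.x; the layer name ChannelMask, the penalty coefficient, and the toy architecture are our own illustrative choices, not the authors' exact implementation.

```python
# A minimal sketch of per-filter masking with an L1 penalty on the mask only.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

class ChannelMask(layers.Layer):
    """Multiplies each output channel (filter) by a trainable scalar mask."""
    def __init__(self, l1_coeff=1e-3, **kwargs):
        super().__init__(**kwargs)
        self.l1_coeff = l1_coeff

    def build(self, input_shape):
        channels = int(input_shape[-1])
        # Masks start at 1 so the network is initially unchanged; the L1
        # penalty pushes unimportant masks toward zero during training.
        self.mask = self.add_weight(
            name="mask", shape=(channels,),
            initializer="ones",
            regularizer=regularizers.l1(self.l1_coeff),
            trainable=True)
        super().build(input_shape)

    def call(self, x):
        return x * self.mask  # broadcast over the channel axis

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return ChannelMask()(x)

inputs = tf.keras.Input(shape=(32, 32, 3))
x = conv_block(inputs, 64)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```

After training, filters whose masks have collapsed toward zero can be dropped and the surviving masks folded into the convolution weights, so the pruned network keeps a standard architecture.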
+In this article, a structured pruning technology is presented, which allows for simultaneously learning and removing unimportant filters or neurons of CNNs. The main contributions are as follows: + +� A simple yet effective method based L1 regularization is presented to compress CNNs model during the training stage. + +� A threshold is adopted to solve the optimization problem of l1-norm. In this approach, only some mask values are required to be near zero, though not completely zero. The detail is provided in the following section. + + + +II. PREVIOUS WORK + +The importance of compressing deep learning models before the application is self-evident, especially for expanding the application scenarios of deep learning [11]. For example, a compressed deep learning model can be combined with edge computing [12] to enable Internet of things devices under.stand data. In this section, we will review the contributions of others. +Le Cun et al. [14] first proposed a saliency measurement method called Optimal Brain Damage (OBD) to selectively delete weights by second-derivative information of error function. Later, Hassibi and Strok [22] proposed the Optimal Brain Surgeon (OBS) algorithm based on OBD. The OBS not only removed unimportant weights but also automatically adjusted the remaining weights, which improved accuracy and generalization ability. All these methods are based on Taylor expansion (even OBD and OBS are required to compute Hessian matrix), which may be computationally intensive especially for large networks. In addition, they use a criterion of minimal increase in error on the training data. Guo et al. [23] introduced a binary matrix to dynamically choose important weights. Han et al. [24], [25] directly removed weights with values lower than a predefined threshold to compress networks, then followed by retraining to recover accuracy. Considering most filters in CNNs that tended to be smooth in the spatial domain, Liu et al. [26] extended Guo's work to the frequency domain by implementing Discrete Cosine Transform (DCT) to filters in the spatial domain. However, these non-structured pruning technologies were hard to use in real applications, because extra software or hardware was required for the calculation. +Directly cropping a trained model by the value of weight is a wide method. Normally it is used to find an effective evaluation to judge the importance of weights and to cut the unimportant connection or filter to reduce the redundancy of a model. Hu et al. [27] thought the activation outputs of a significant portion of neurons were zero in a large network, whatever inputs the network received. These zero activation neurons were unimportant, so they defined the Average Percentage of Zeros (ApoZ) to observe the percentage of activations of a neuron and cropped the neurons with fewer activations. Li et al. [28] introduced a structured pruning method by measuring the norm of filters to remove unimportant filters. Luo et al. [29] took advantage of a subset of input channels to approximate output for compressing convolutional layers. Changpinyo et al. [30] proposed a random method to compress CNNs. They randomly connected the output channel to a small subset of input channels to compress CNNs. Though successful to an extent, their method did not directly relate to the loss, hence it was necessary to retrain the network for the recovery of accuracy. On the other hand, such a scheme could only be used layer-by-layer. 
Thus, it was essential to iterate over and over to prune, which would result in massive computation costs.
Ding et al. [37] applied a customized L2 regularization to remove unimportant filters and simultaneously stimulate important filters to grow stronger. Lin et al. [32] proposed a Global & Dynamic Filter Pruning (GDP) method, which could dynamically recover the previously removed filters. Liu et al. [33] enforced channel-level sparsity in the network to compress DNNs in the training phase. In addition, Gordon et al. [39] iteratively shrank and expanded a network targeting reduction of particular resources (e.g. FLOPS, or the number of parameters).

III. THE APPROACH OF STRUCTURED PRUNING FOR CNNs

A. NOTATIONS
First of all, notations are clarified in this section. A CNN is a multi-layer deep feed-forward neural network, which is composed of a stack of convolutional layers, pooling layers, and fully-connected layers. In an l-layer CNN model, <> represents the k-th filter of the l-th layer, <> denotes the number of feature maps in the (l-1)-th layer and d indicates the kernel size. Let us denote the feature maps in the l-th layer by <>, where <> is the size, Cl is the number of channels, and Zl is the output of the (l-1)-th layer. In addition, Zk represents the k-th feature map of the l-th layer. The output feature map Zk can be computed as:

<>, (1)

where f is a non-linear activation function, ∗ is the convolution operation and bk is the bias. <> represents the training set, where xi and yi represent the training sample and label respectively, and N indicates the number of samples.

B. THE PROPOSED SCHEME

The goal of structured pruning is to remove those redundant filters or neurons, which are unimportant or useless for the performance of the networks. Essentially, the main role of the convolutional layer filters is to extract local features. However, once all the parameters of a filter are zeroed, the filter is confirmed unimportant: whatever the inputs for the filter, the outputs are always zero. Under this circumstance, the filters are unable to extract any information. When the filters are multiplied by zero, all the parameters of the filters become zero. Based on this observation, a mask is introduced for every filter to estimate its importance. This can be formulated as:

<>, (2)

where mlk represents the k-th mask of the l-th layer.

Therefore, the problem of zeroing out the values of some filters can be transformed into zeroing some masks. For this purpose, the following optimization problem is proposed:

<>, (3)

where <> is a loss function, such as cross-entropy loss, <> is the output of the CNN and C is a hyper-parameter that controls the number of pruned filters. Equation (3) is the core of the proposed method. Once the optimal solution of the equation is obtained, the pruning is achieved.
In addition, this method can also remove redundant neurons in a fully-connected layer. The inference of a fully-connected layer can be represented by:

<>, (4)

where <> is a weight matrix and Zl-1 ∈ Rn×1 is the input of the l-th layer. Here, when fully-connected layers introduce a mask, the inference of these layers can be reformulated as:

<>, (5)

where <> is a mask vector and <> is the Hadamard product operator.
Equation (3) can be transformed into the following form based on the Lagrange multiplier:

<>, (6)

where <> is a coefficient associated with C.
Equation (6) is an NP-hard problem because of the l0-norm. Thus, it is quite difficult to obtain an optimal solution with equation (6).
+Therefore, l1-norm is adopted to replace l0-norm, as: + +<>. (7) + +Equation (7) can be solved by SGD in practical application, so the proposed method is simple and easy to implement. We just need to introduce a mask for each layer and train the network. Though the proposed method introduces mask, the network topology will be preserved because the mask can be absorbed into weight. + +C. THRESHOLD + +L1 regularization is a widely used sparse technology, which pushes the coefficients of uninformative features to zero. So a sparse +network is achieved by solving equation (7). However, there is a problem in solving equation (7). +Here the mask value cannot be completely zeroed in practical application, because the objective function (7) +is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of the mask value is small enough, it can be considered almost as zero. Thus, to decide whether the mask is zero, a threshold is introduced. However, considering only the value of the mask is meaningless if the mask is not completely zero. Because there is a linear transformation between mask and convolution. One can shrink the masks while expanding the weights to keep the product of them the same. Hence, considering the mask and weight simultaneously is necessary. The average value of the product of the mask and the weight is used to determine whether the mask is exactly zero or not? The specific definition can be presented as: + +<> (8) + +where <> is a pre-defined threshold and <> is the average operation. This strategy is efficient and reasonable, which can be proved by the results of the experiment. + +Algorithm 1 The Proposed Pruning Approach + +<> + +Merging weights and masks and then removing the mask layer. Return the pruned network architecture and preserved weights. + +D. FINE-TUNING AND OTHER REGULARIZATION STRATEGIES + +Pruning may temporarily lead to degradation in accuracy, so fine-tuning is necessary to improve accuracy. Furthermore, the proposed method can be employed iteratively to obtain a narrower architecture. Actually, a single iteration of proposed method is enough to yield noticeable compaction. The method is elaborated in Algorithm 1. +Essentially, the purpose of this approach is to adjust some masks to adequately small order of magnitude. Therefore, L2 regularization can also serve as a regularization strategy in this approach. + +IV. EXPERIMENTS + +The approach was primarily evaluated through three net.works: LeNet-5 on MNIST dataset, VGG-16 on CIFAR-10 dataset and ResNet-32 on CIFAR-10 dataset. The implementation of this approach was accomplished through the standard Keras library. All experiments were conducted through Intel E5-2630 V4 CPU and NVIDIA 1080Ti GPU. + +A. DATASETS + +1) MNIST +MNIST dataset of handwritten digits from 0 to 9 is widely applied to evaluate machine learning models. This dataset owns 60000 train samples and 10000 test samples. + +2) CIFAR-10 +The CIFAR-10 dataset [41] has a total of 60000 images consisting of 10 classes, each having 6000 images with 32x32 resolution. There are 50000 training images and 10000 test images. During training, a data augmentation scheme was adopted, which contained random horizontal flip, rotation, and translation. The input data was normalized using the means and standard deviations. + +B. NETWORK MODELS + +1) LENET-5 +LeNet-5 is a convolutional neural network designed by LeCun et al. [34]. 
It has two convolutional and two fully-connected layers. This network has 44.2K learnable parameters. In this network, dropout is used in the fully-connected layer.

<
>

TABLE 1. The result of LeNet-5 on MNIST.

2) VGG-16
The original VGG-16 [35] has thirteen convolutional and two fully-connected layers and has 130M learnable parameters. However, VGG-16 is overly complex for the CIFAR-10 dataset, so the fully-connected layers were removed. Moreover, Batch Normalization was used after each convolution operation. The modified model has 14.7M learnable parameters.

3) RESNET-32

Deep residual network (ResNet) [42] is a state-of-the-art CNN architecture. In this paper, ResNet-32 was implemented to evaluate the proposed method. The used ResNet-32 had the same architecture as described in [42], which contained three stages of convolutional layers, one global average pooling layer after the last convolutional layer, and one fully-connected layer. In addition, when the dimensions increased, a 1x1 convolution was adopted as the identity mapping to match the dimensions. This network has 0.47M learnable parameters.

C. THE DETAIL OF TRAINING, PRUNING, AND FINE-TUNING

To obtain the baseline of accuracy in the experiments, we trained LeNet-5 on MNIST, VGG-16 on CIFAR-10, and ResNet-32 on CIFAR-10 from scratch. Then, the pruning was performed on the basis of the trained network and the strategy of regularization was chosen as L1 regularization, with the mask initialized to 1. Later, the pruned network was retrained to recover accuracy.

1) LENET-5 ON MNIST
The original network was normally trained from scratch, for a total of 30 epochs, by Adam [43] with a batch size of 128. The learning rate was initialized to 0.001, the weight decay was set to 0.0005, the momentum was set to 0.9, and the dropout rate was set to 0.5 for the fully-connected layer. While implementing the pruning training, only the number of epochs was modified: it was set to 10, and the threshold mentioned above to select pruned filters was set at 0.01. The pruned network was then retrained to compensate for the loss of accuracy. We adopted the same hyper-parameter setting as in normal training.

2) VGG-16 ON CIFAR-10
To get the baseline accuracy, the network was normally trained from scratch by SGD with a batch size of 128. The total number of epochs was set to 60. The initial learning rate was set to 0.01 and then scaled by 0.1 every 20 epochs. The weight decay was set at 0.0005 and the momentum at 0.9. While implementing the pruning training, the number of epochs was set to 30, the learning rate was scaled by 0.1 every 10 epochs and other settings remained the same, while the threshold was set at 0.01. Finally, the pruned model was retrained following the same pre-processing and hyper-parameter settings as the normal training.

3) RESNET-32 ON CIFAR-10
Generally, the network was trained from scratch by SGD as the baseline with a batch size of 128. The weight decay was set at 0.0001, the number of epochs was set to 120, and the momentum was set at 0.9. The initial learning rate was set at 0.1 and then scaled by 0.1 at 60 and 100 epochs. Here, for pruning training, the number of epochs was set to 30, the learning rate was scaled by 0.1 every 10 epochs and the other settings remained the same. After pruning, the network was retrained from scratch. The number of epochs was changed to 60 and the learning rate was scaled by 0.1 every 20 epochs.

D. RESULTS OF THE EXPERIMENTS

1) LENET-5 ON MNIST
As per the results in TABLE 1, 88.84% of the parameters were removed without any impact on performance.
Based on the proposed method, 95.46% of the parameters were discarded as well with an accuracy loss of 0.57%. + +<
> + +TABLE 2. Result of VGG-16 on CIFAR-10 datasets. + +<
>

TABLE 1 also reveals that there was enormous redundancy in fully-connected layers, because at least 90% of the parameters of the fully-connected layers could easily be dropped. According to the table, the proposed method may indeed seek important connections. The reasons can be summarized in two points. First, when 83.83% of the parameters are removed, the accuracy does not change. This indicates that the pruned parameters are unimportant for maintaining the accuracy of the network. Second, it is difficult to remove some filters or neurons, especially the neurons of fully-connected layers, when the pruning rate gradually increases. So the remaining connections are crucial.
In addition, the convolutional layer, especially the first one, is hard to prune in comparison with the next layer. The possible explanation could be that the proposed method automatically selects the unimportant filters through the backpropagation algorithm. However, the backpropagation algorithm will cause the previous layer to suffer from the vanishing gradient problem. That is why the former layers are hard to prune compared to the later ones.

2) VGG-16 ON CIFAR-10
As depicted in TABLE 2, over 94.4% of parameters could be removed with a negligible accuracy loss of 0.51%. It can also be observed that the loss of accuracy was only 2.04% when pruning 97.76% of the parameters. The proposed method proved to be effective again in reducing redundancy.
In fact, preserving the remaining architecture without retaining the parameters (training the pruned network from scratch) is also a strategy to fine-tune the network. This strategy was adopted here to retrain the network and the results were promising, as shown in TABLE 2. The results reveal that a better effect can be achieved through directly retraining the pruned network from scratch. Perhaps the significance of the proposed method is that it furnishes the facility to discover excellent architectures, as mentioned by Liu et al. [36] as well. Nevertheless, training a pruned network from scratch is expensive in terms of computation cost, especially in the case of large-scale datasets and networks.

FIGURE 2. Comparison of L1 regularization and L2 regularization. "accuracy loss" represents the difference in accuracy between the pruned CNN and the original CNN. A positive value indicates an improvement of network accuracy after pruning, while a negative value indicates a decrease of accuracy.

3) RESNET-32 ON CIFAR-10
Pruning ResNet-32 based on the order of magnitude of the mask may result in different output map dimensions in the residual module, so a 1x1 convolution is needed as the identity mapping to match dimensions. However, this operation brings extra parameters and computation. To avoid this problem, a percentile was defined to remove filters of the same proportion in every convolutional layer. TABLE 3 shows that the proposed method removed 34% of the parameters with an accuracy loss of 0.65%. Moreover, over 62.3% of the parameters could also be discarded with an accuracy loss of 1.76%. Thus, it was confirmed that the proposed method could reduce the redundancy of a complex network, i.e. ResNet.

<
> + +FIGURE 3. The comparison of pruned and reserved filters. (a) The comparison of the order of magnitude of the parameters of pruned and reserved filters. The x-axis represents the distribution interval and the y-axis represents the percentage of parameters in the interval. (b) The comparison of non-zero activations. The left bar represents the average non-zero activation percentage, and the right bar represents the average non-zero activation value. + +<
> + +TABLE 3. The result of ResNet-32 on the CIFAR-10 dataset. + +V. ANALYSIS + +A. L2 REGULARIZATION + +L2 regularization was also explored as a regularization strategy in this study. As shown in FIGURE 2, LeNet-5 can also be compressed without degrading accuracy based on L2 regularization. Nevertheless, there are some differences between L1 regularization and L2 regularization. Both L1 and L2 regularization can improve accuracy when the pruning rate is less than 84%, but the effect of L2 regularization is better. The main reason is that regularization techniques can prevent overfitting and improve the generalization ability. Moreover, as the pruning rate increases, L1 regularization achieves a greater compression effect at the same accuracy. As per Han et al. [24], L1 regularization pushes more parameters closer to zero, so it can prune more parameters. Having studied the difference between L1 regularization and L2 regularization, the inclination is towards L1 regularization from the perspective of the compression-accuracy trade-off. + +B. THE EFFECT OF PRUNING + +To better describe the effect of the proposed method, a comparison was made between the pruned filters and the reserved filters. The CONV3-1 layer of VGG-16, which contains 256 filters, was chosen, with the coefficient λ set at 0.008. Based on this setting, 125 filters of the CONV3-1 layer could be removed. Empirically, a weak filter or neuron always has lower activation outputs, lower activation frequency, and lower weight values. Hence, weight values and activation outputs were chosen here to evaluate the difference between pruned and preserved filters. +As shown in FIGURE 3(a), in terms of absolute weight values, the bulk of the pruned parameters (96.9%) are less than 10^-6. However, most of the reserved parameters (94.5%) are greater than 0.001. The results indicate an enormous distribution difference between the values of the pruned and the reserved parameters. Therefore, the present approach can effectively reduce the order of magnitude of the pruned parameters. +In addition, the test set was used as a sample to calculate the average non-zero activation values and percentages of CONV3-1. As is obvious from FIGURE 3(b), both the average percentage of non-zero activations and the average value of non-zero activations of the pruned filters were much lower than those of the reserved filters. From the activation perspective, the pruned filters were weak, because the outputs and weight values of the pruned filters were negligible compared with the reserved filters and could be completely ignored. Thus, using the order of magnitude of the mask to determine pruned filters or neurons is reasonable. + +C. COMPARISON WITH OTHER METHODS +In this section, two classical structured pruning methods were compared with the proposed method. First, for LeNet-5 on the MNIST dataset, the proposed method was compared with that of Wen et al. [31]. In this experiment, both the proposed and Wen et al. [31] methods adopted the same coefficient of sparsity regularization (λ = 0.03). The results (TABLE 5) show that the two methods were comparable in terms of accuracy and compression effect. However, the proposed method is simpler and costs less computation in practice. + +<
> + +TABLE 4. Comparison of VGG-16 on CIFAR-10. + +<
> + +TABLE 5. Comparison of LeNet-5 on MNIST. + +Further, the proposed method was also compared with that of Liu et al. [33] for VGG-16 on CIFAR-10. Again, the same sparsity regularization coefficient (λ = 0.005) was adopted for both methods. However, Liu et al. [33] adopted a fixed-percentage threshold setting, whereas the proposed method uses a different threshold-setting scheme. The results (in TABLE 4) reveal that the proposed method was superior in terms of compression efficiency, although there was a slight loss of accuracy. In general, the proposed method can not only generate sparsity but also achieve a better pruning effect with its improved threshold. +Nevertheless, some shortcomings were also observed with this approach. One is that, although this approach does not change the existing CNN architecture, the added mask layer essentially increases the number of layers in the network. This may increase optimization difficulty. However, this problem can be alleviated by Batch Normalization (BN [38]). The other is that, as this method introduces a threshold, the pruning effect may not be smooth. The pruning rate may change drastically with small changes in the <>, which is not conducive to finding the best <>. + +VI. CONCLUSION +In this article, a structured pruning technique is proposed to automatically remove redundant filters or neurons based on regularization. A mask is introduced to remove unimportant filters or neurons by zeroing the values of some masks during training. In addition, to deal with the problem that the mask cannot be completely zeroed in practice, a threshold is designed to zero the mask. Experimentation with multiple datasets has shown that the proposed method can effectively remove parameters with a negligible loss of accuracy. In the future, establishing a relation between the hyper-parameter <> and the pruning rate will be considered, to facilitate the adjustment of this hyper-parameter. + +ACKNOWLEDGMENT +All the mentioned support is gratefully acknowledged. + +REFERENCES +[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015. +[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097-1105. +[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587. +[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680. +[5] C. Shen, Y. Li, Y. Chen, X. Guan, and R. Maxion, "Performance analysis of multi-motion sensor behavior for active smartphone authentication," IEEE Trans. Inf. Forensics Security, vol. 13, no. 1, pp. 48-62, Jan. 2018. +[6] C. Shen, Y. Chen, X. Guan, and R. Maxion, "Pattern-growth based mining mouse-interaction behavior for an active user authentication system," IEEE Trans. Dependable Secure Comput., to be published. +[7] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," 2017, arXiv:1710.09282. [Online]. Available: https://arxiv.org/abs/1710.09282 +[8] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," 2015, arXiv:1511.06067. [Online]. Available: https://arxiv.org/abs/1511.06067 +[9] W. Chen, J.
Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2285-2294. +[10] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," 2014, arXiv:1412.6115. [Online]. Available: https://arxiv.org/abs/1412.6115 +[11] Z. Tian, S. Su, W. Shi, X. Du, M. Guizani, and X. Yu, "A data-driven method for future Internet route decision modeling," Future Gener. Comput. Syst., vol. 95, pp. 212-220, Jun. 2018. +[12] Z. Tian, W. Shi, Y. Wang, C. Zhu, X. Du, S. Su, Y. Sun, and N. Guizani, "Real-time lateral movement detection based on evidence reasoning network for edge computing environment," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4285-4294, Jul. 2019. +[13] R. Liu, N. Fusi, and L. Mackey, "Teacher-student compression with generative adversarial networks," 2018, arXiv:1812.02271. [Online]. Available: https://arxiv.org/abs/1812.02271 +[14] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598-605. +[15] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360. [Online]. Available: https://arxiv.org/abs/1602.07360 +[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861 +[17] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848-6856. +[18] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, p. 32, 2017. +[19] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 1389-1397. +[20] J.-H. Luo and J. Wu, "An entropy-based pruning method for CNN compression," 2017, arXiv:1706.05791. [Online]. Available: https://arxiv.org/abs/1706.05791 +[21] R. Tibshirani, "Regression selection and shrinkage via the lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267-288, 1996. +[22] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164-171. +[23] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379-1387. +[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135-1143. +[25] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: https://arxiv.org/abs/1510.00149 +[26] Z. Liu, J. Xu, X. Peng, and R. Xiong, "Frequency-domain dynamic pruning for convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1043-1053. +[27] H. Hu, R. Peng, Y.-W. Tai, and C.-K.
Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," 2016, arXiv:1607.03250. [Online]. Available: https://arxiv.org/abs/1607.03250 +[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," 2016, arXiv:1608.08710. [Online]. Available: https://arxiv.org/abs/1608.08710 +[29] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 5058-5066. +[30] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The power of sparsity in convolutional neural networks," 2017, arXiv:1702.06257. [Online]. Available: https://arxiv.org/abs/1702.06257 +[31] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074-2082. +[32] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in Proc. IJCAI, 2018, pp. 2425-2432. +[33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 2736-2744. +[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. +[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556 +[36] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," 2018, arXiv:1810.05270. [Online]. Available: https://arxiv.org/abs/1810.05270 +[37] X. Ding, G. Ding, J. Han, and S. Tang, "Auto-balanced filter pruning for efficient convolutional neural networks," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 6797-6804. +[38] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: https://arxiv.org/abs/1502.03167 +[39] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1586-1595. +[40] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc., B (Statist. Methodol.), vol. 68, no. 1, pp. 49-67, 2006. +[41] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009. +[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770-778. +[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980 + +CHEN YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. His research covers general deep learning and machine learning, and his main research interest is deep model compression. + +ZHENGHONG YANG received the master's and Ph.D. degrees from Beijing Normal University, in 1990 and 2001, respectively.
He is currently a Professor with the College of Science, China Agricultural University. He has presided over two projects of the National Natural Science Foundation. He has written two teaching and research books and has published more than 40 academic papers in domestic and foreign journals, of which about 30 are indexed by SCI/EI/ISTP. His major research interests include matrix theory, numerical algebra, and image processing. He is a member of the Beijing and Chinese Societies of Computational Mathematics. + +ABDUL MATEEN KHATTAK received the Ph.D. degree in horticulture and landscape from the University of Reading, U.K., in 1999. He was a Research Scientist in different agriculture research organizations before joining the University of Agriculture, Peshawar, Pakistan, where he is currently a Professor with the Department of Horticulture. He has conducted academic and applied research on different aspects of tropical fruits, vegetables, and ornamental plants. He has also worked for Alberta Agriculture and Forestry, Canada, as a Research Associate, and for the Organic Agriculture Centre of Canada as a Research and Extension Coordinator for Alberta province. There he helped in developing organic standards for greenhouse production and energy-saving technologies for Alberta greenhouses. He is a Professor with considerable experience in teaching and research. He is currently a Visiting Professor with the College of Information and Electrical Engineering, China Agricultural University, Beijing. He has published 59 research articles in scientific journals of international repute. He has also attended and presented at several international scientific conferences. His research interests include greenhouse production, medicinal, aromatic and ornamental plants, light quality, supplemental lighting, temperature effects on greenhouse crops, aquaponics, and organic production. + +LIU YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interests include the application of image recognition and intelligent robots in the field of agriculture. + +WENXIN ZHANG is currently pursuing the master's degree with the School of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interest includes deep learning-based pose estimation methods for pigs, for timely access to pig information. + +WANLIN GAO received the B.S., M.S., and Ph.D. degrees from China Agricultural University, in 1990, 2000, and 2010, respectively. He is currently the Dean of the College of Information and Electrical Engineering, China Agricultural University. He has been the principal investigator (PI) of over 20 national plans and projects. He has published 90 academic papers in domestic and foreign journals, of which over 40 are indexed by SCI/EI/ISTP. He has written two teaching materials, which are supported by the National Key Technology Research and Development Program of China during the 11th Five-Year Plan Period, and five monographs. He holds 101 software copyrights, 11 patents for inventions, and eight patents for new practical inventions. His major research interests include the informationization of new rural areas, intelligent agriculture, and the service for rural comprehensive information.
He is a member of the Science and Technology Committee of the Ministry of Agriculture, a member of the Agriculture and Forestry Committee of Computer Basic Education in colleges and universities, and a Senior Member of the Society of Chinese Agricultural Engineering, etc. + +MINJUAN WANG received the Ph.D. degree from the School of Biological Science and Medical Engineering, Beihang University, under the supervision of Prof. Hong Liu, in June 2017. She was a Visiting Scholar with the School of Environmental Sciences, Ontario Agricultural College, University of Guelph, from October 2015 to May 2017. She is currently a Postdoctoral Fellow with the College of Information and Electrical Engineering, China Agricultural University. Her research interests mainly include bioinformatics and the Internet of Things key technologies. +<> <> <> + + +<> <> <> + The 4 Research Techniques to Train Deep Neural Network Models More Efficiently + + James Le + + Deep learning and unsupervised feature learning have shown great promise in many practical applications. State-of-the-art performance has been reported in several domains, ranging from speech recognition and image recognition to text processing and beyond. + + It's also been observed that increasing the scale of deep learning—with respect to numbers of training examples, model parameters, or both—can drastically improve accuracy. These results have led to a surge of interest in scaling up the training and inference algorithms used for these models and in improving optimization techniques for both. + + The use of GPUs is a significant advance in recent years that makes the training of modestly-sized deep networks practical. A known limitation of the GPU approach is that the training speed-up is small when the model doesn't fit in a GPU's memory (typically less than 6 gigabytes). + + To use a GPU effectively, researchers often reduce the size of the dataset or parameters so that CPU-to-GPU transfers are not a significant bottleneck. While data and parameter reduction work well for small problems (e.g. acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images). + + In the previous post, we talked about 5 different algorithms for efficient deep learning inference. In this article, we'll discuss the upper right part of the quadrant on the left. What are the best research techniques to train deep neural networks more efficiently? + + 1 — Parallelization Training + Let's start with parallelization. As the figure below shows, the number of transistors keeps increasing over the years. But single-threaded performance and frequency are plateauing in recent years. Interestingly, the number of cores is increasing. + So what we really need to know is how to parallelize the problem to take advantage of parallel processing. There are a lot of opportunities to do that in deep neural networks. + + For example, we can do data parallelism: feeding 2 images into the same model and running them at the same time. This does not affect latency for any single input. It doesn't make it shorter, but it makes the batch size larger. It also requires coordinated weight updates during training.
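As a minimal sketch of the data-parallel idea (assuming PyTorch; the toy model, batch size, and device checks are illustrative and not from the article), the model is replicated across the visible GPUs and each replica processes a slice of the batch:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Data parallelism: replicate the model on every visible GPU. Each replica processes
# a slice of the batch, and the gradients are combined before the weight update.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # nn.parallel.DistributedDataParallel scales this across machines
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(256, 784, device=device)  # one large batch, split across the replicas
logits = model(x)                          # forward passes run on all GPUs at the same time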
+ + + For example, in Jeff Dean's paper "Large Scale Distributed Deep Networks," there's a parameter server (as a master) and a couple of model workers (as slaves), each running on its own shard of the training data and sending gradient updates to the master. + + Another idea is model parallelism — splitting up the model and distributing each part to different processors or different threads. For example, imagine we want to run convolution in the image below by doing a 6-dimensional "for" loop. What we can do is cut the input image into 2x2 blocks, so that each thread/processor handles 1/4 of the image. Also, we can parallelize the convolutional layers by the output or input feature map regions, and the fully-connected layers by the output activation. + + 2 — Mixed Precision Training + Larger models usually require more compute and memory resources to train. These requirements can be lowered by using reduced precision representation and arithmetic. + + Performance (speed) of any program, including neural network training and inference, is limited by one of three factors: arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced precision math. For example, half-precision math throughput in recent GPUs is 2× to 8× higher than for single-precision. In addition to speed improvements, reduced precision formats also reduce the amount of memory required for training. + + Modern deep learning training systems use a single-precision (FP32) format. In their paper "Mixed Precision Training," researchers from NVIDIA and Baidu addressed training with reduced precision while maintaining model accuracy. + + Specifically, they trained various neural networks using the IEEE half-precision format (FP16). Since the FP16 format has a narrower dynamic range than FP32, they introduced three techniques to prevent model accuracy loss: maintaining a master copy of weights in FP32, loss-scaling that prevents small gradient values from becoming zeros, and FP16 arithmetic with accumulation in FP32. + + Using these techniques, they demonstrated that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training. Experimental results include convolutional and recurrent network architectures, trained for classification, regression, and generative tasks. + + Applications include image classification, image generation, object detection, language modeling, machine translation, and speech recognition. The proposed methodology requires no changes to models or training hyperparameters.
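A minimal sketch of these ideas using PyTorch's automatic mixed precision utilities (one possible realization, not the paper's exact recipe; the model, data, and learning rate below are placeholders): autocast runs the forward pass in FP16 where it is safe, GradScaler implements the loss scaling that keeps small gradients from flushing to zero, and the optimizer still updates FP32 master weights.

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()  # dynamic loss scaling

for step in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with autocast():                       # forward pass and loss in reduced precision where safe
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()          # backpropagate the scaled loss
    scaler.step(optimizer)                 # gradients are unscaled before the FP32 weight update
    scaler.update()                        # adjust the scale factor for the next iteration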
3 — Model Distillation + Model distillation refers to the idea of model compression by teaching a smaller network exactly what to do, step-by-step, using a bigger, already-trained network. The 'soft labels' refer to the output feature maps of the bigger network after every convolution layer. The smaller network is then trained to learn the exact behavior of the bigger network by trying to replicate its outputs at every level (not just the final loss). + + The method was first proposed by Bucila et al., 2006 and generalized by Hinton et al., 2015. In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is — the output of a softmax function on the teacher model's logits. + + So how exactly do teacher-student networks work? + + The highly-complex teacher network is first trained separately using the complete dataset. This step requires high computational performance and thus can only be done offline (on high-performing GPUs). + + While designing a student network, correspondence needs to be established between intermediate outputs of the student network and the teacher network. This correspondence can involve directly passing the output of a layer in the teacher network to the student network, or performing some data augmentation before passing it to the student network. + + Next, the data are forward-passed through the teacher network to get all intermediate outputs, and then data augmentation (if any) is applied to them. + + Finally, the loss between the teacher network's outputs and the student network's outputs is backpropagated through the student network, so that the student network learns to replicate the behavior of the teacher network. + + 4 — Dense-Sparse-Dense Training + The research paper "Dense-Sparse-Dense Training for Deep Neural Networks" was published back in 2017 by researchers from Stanford, NVIDIA, Baidu, and Facebook. Applying Dense-Sparse-Dense (DSD) takes 3 sequential steps: + + Dense: Normal neural net training…business as usual. It's notable that even though DSD acts as a regularizer, the usual regularization methods such as dropout and weight regularization can be applied as well. The authors don't mention batch normalization, but it would work as well. + + Sparse: We regularize the network by removing connections with small weights. From each layer in the network, a percentage of the layer's weights that are closest to 0 in absolute value is selected to be pruned. This means that they are set to 0 at each training iteration. It's worth noting that the pruned weights are selected only once, not at each SGD iteration (see the sketch after this list). Eventually, the network recovers the pruned weights' knowledge and condenses it in the remaining ones. We train this sparse net until convergence. + + Dense: First, we re-enable the pruned weights from the previous step. The net is again trained normally until convergence. This step increases the capacity of the model. It can use the recovered capacity to store new knowledge. The authors note that the learning rate should be 1/10th of the original. Since the model is already performing well, the lower learning rate helps preserve the knowledge gained in the previous step.
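A minimal sketch of the sparse phase described above (assuming PyTorch; the sparsity level and helper names are illustrative): the smallest-magnitude weights in each layer are selected once, and the resulting masks are re-applied after every update so the pruned weights stay at zero until the final dense phase re-enables them.

import torch

def make_dsd_masks(model, sparsity=0.3):
    # Select, once, the fraction of each layer's weights closest to zero in absolute value.
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                                   # weight matrices / conv filters only
            k = max(1, int(sparsity * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
    return masks

@torch.no_grad()
def enforce_masks(model, masks):
    # Re-apply the masks after each optimizer step so the pruned weights remain zero.
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])

# Dense -> Sparse: train normally, call make_dsd_masks once, then enforce_masks after every step.
# Sparse -> Dense: stop enforcing the masks and keep training at roughly 1/10th of the learning rate.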
<> <> <> + + +<> <> <> + THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS + + Jonathan Frankle Michael Carbin + MIT CSAIL MIT CSAIL + jfrankle@csail.mit.edu mcarbin@csail.mit.edu + + ABSTRACT + + Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. + We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. + We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy. + + 1 INTRODUCTION + + Techniques for eliminating unnecessary weights from neural networks (pruning) (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; Li et al., 2016) can reduce parameter-counts by more than 90% without harming accuracy. Doing so decreases the size (Han et al., 2015; Hinton et al., 2015) or energy consumption (Yang et al., 2017; Molchanov et al., 2016; Luo et al., 2017) of the trained networks, making inference more efficient. However, if a network can be reduced in size, why do we not train this smaller architecture instead in the interest of making training more efficient as well? Contemporary experience is that the architectures uncovered by pruning are harder to train from the start, reaching lower accuracy than the original networks. 1 + Consider an example. In Figure 1, we randomly sample and train subnetworks from a fully-connected network for MNIST and convolutional networks for CIFAR10. Random sampling models the effect of the unstructured pruning used by LeCun et al. (1990) and Han et al. (2015). Across various levels of sparsity, dashed lines trace the iteration of minimum validation loss 2 and the test accuracy at that iteration. The sparser the network, the slower the learning and the lower the eventual test accuracy. + + 1 "Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training a network with a small capacity." (Li et al., 2016) "During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers...gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them." (Han et al., 2015) + 2 As a proxy for the speed at which a network learns, we use the iteration at which an early-stopping criterion would end training. The particular early-stopping criterion we employ throughout this paper is the iteration of minimum validation loss during training. See Appendix C for more details on this choice. + +<
> + + Figure 1. The iteration at which early-stopping would occur (left) and the test accuracy at that iteration (right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled sparse networks (average of ten trials). Solid lines are winning tickets (average of five trials). + + In this paper, we show that there consistently exist smaller subnetworks that train from the start and learn at least as fast as their larger counterparts while reaching similar test accuracy. Solid lines in Figure 1 show networks that we find. Based on these results, we state the lottery ticket hypothesis. + The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. + More formally, consider a dense feed-forward neural network <> with initial parameters <>. When optimizing with stochastic gradient descent (SGD) on a training set, f reaches minimum validation loss l at iteration j with test accuracy a. In addition, consider training <> with a mask <> on its parameters such that its initialization is <>. When optimizing with SGD on the same training set (with m fixed), f reaches minimum validation loss l' at iteration j' with test accuracy a'. The lottery ticket hypothesis predicts that there exists a mask m for which <> (commensurate training time), <> (commensurate accuracy), and <> (fewer parameters). + We find that a standard pruning technique automatically uncovers such trainable subnetworks from fully-connected and convolutional feed-forward networks. We designate these trainable subnetworks, <>, winning tickets, since those that we find have won the initialization lottery with a combination of weights and connections capable of learning. When their parameters are randomly reinitialized (<> where <>), our winning tickets no longer match the performance of the original network, offering evidence that these smaller networks do not train effectively unless they are appropriately initialized. + Identifying winning tickets. We identify a winning ticket by training a network and pruning its smallest-magnitude weights. The remaining, unpruned connections constitute the architecture of the winning ticket. Unique to our work, each unpruned connection's value is then reset to its initialization from the original network before it was trained. This forms our central experiment. + 1. Randomly initialize a neural network <> (where <>). + 2. Train the network for j iterations, arriving at parameters <>. + 3. Prune p% of the parameters arrived at in step 2, creating a mask m. + 4. Reset the remaining parameters to their values in <>, creating the winning ticket <>. + As described, this pruning approach is one-shot: the network is trained once, p% of weights are pruned, and the surviving weights are reset. However, in this paper, we focus on iterative pruning, which repeatedly trains, prunes, and resets the network over n rounds; each round prunes <> of the weights that survive the previous round. Our results show that iterative pruning finds winning tickets that match the accuracy of the original network at smaller sizes than does one-shot pruning.
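A compact sketch of one round of this experiment, under the stated assumptions (layer-wise magnitude pruning by p% and a reset of survivors to their original initialization; PyTorch is assumed, and the train function and masks dictionary are supplied by the surrounding training code):

import copy
import torch

def prune_and_reset(model, init_state, masks, p=0.2):
    # One round: prune p% of each layer's surviving weights by magnitude,
    # then reset the remaining weights to their original initialization (the winning ticket).
    for name, param in model.named_parameters():
        mask = masks.setdefault(name, torch.ones_like(param))
        alive = param[mask == 1].abs()
        k = int(p * alive.numel())
        if k > 0:
            threshold = alive.kthvalue(k).values           # p-th percentile of surviving magnitudes
            mask[param.abs() < threshold] = 0.0            # prune the smallest-magnitude survivors
        with torch.no_grad():
            param.copy_(init_state[name] * mask)           # reset survivors to their initial values
    return masks

# init_state = copy.deepcopy(model.state_dict())   # save the initialization before any training
# masks = {}
# for _ in range(n_rounds):                         # iterative pruning
#     train(model)                                  # assumed training loop
#     masks = prune_and_reset(model, init_state, masks, p=0.2)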
Results. We identify winning tickets in a fully-connected architecture for MNIST and convolutional architectures for CIFAR10 across several optimization strategies (SGD, momentum, and Adam) with techniques like dropout, weight decay, batchnorm, and residual connections. We use an unstructured pruning technique, so these winning tickets are sparse. In deeper networks, our pruning-based strategy for finding winning tickets is sensitive to the learning rate: it requires warmup to find winning tickets at higher learning rates. + +<
> + + Figure 2. Architectures tested in this paper. Convolutions are 3x3. Lenet is from LeCun et al. (1998). Conv-2/4/6 are variants of VGG (Simonyan & Zisserman, 2014). Resnet-18 is from He et al. (2016). VGG-19 for CIFAR10 is adapted from Liu et al. (2019). Initializations are Gaussian Glorot (Glorot & Bengio, 2010). Brackets denote residual connections around layers. + + The winning tickets we find are 10-20% (or less) of the size of the original network (smaller size). Down to that size, they meet or exceed the original network's test accuracy (commensurate accuracy) in at most the same number of iterations (commensurate training time). When randomly reinitialized, winning tickets perform far worse, meaning structure alone cannot explain a winning ticket's success. + The Lottery Ticket Conjecture. Returning to our motivating question, we extend our hypothesis into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket. + Contributions. + We demonstrate that pruning uncovers trainable subnetworks that reach test accuracy comparable to the original networks from which they derived in a comparable number of iterations. + We show that pruning finds winning tickets that learn faster than the original network while reaching higher test accuracy and generalizing better. + We propose the lottery ticket hypothesis as a new perspective on the composition of neural networks to explain these findings. + Implications. In this paper, we empirically study the lottery ticket hypothesis. Now that we have demonstrated the existence of winning tickets, we hope to exploit this knowledge to: + Improve training performance. Since winning tickets can be trained from the start in isolation, a hope is that we can design training schemes that search for winning tickets and prune as early as possible. + Design better networks. Winning tickets reveal combinations of sparse architectures and initializations that are particularly adept at learning. We can take inspiration from winning tickets to design new architectures and initialization schemes with the same properties that are conducive to learning. We may even be able to transfer winning tickets discovered for one task to many others. + Improve our theoretical understanding of neural networks. We can study why randomly-initialized feed-forward networks seem to contain winning tickets and potential implications for theoretical study of optimization (Du et al., 2019) and generalization (Zhou et al., 2018; Arora et al., 2018). + + 2 WINNING TICKETS IN FULLY-CONNECTED NETWORKS + + In this section, we assess the lottery ticket hypothesis as applied to fully-connected networks trained on MNIST. We use the Lenet-300-100 architecture (LeCun et al., 1998) as described in Figure 2. We follow the outline from Section 1: after randomly initializing and training a network, we prune the network and reset the remaining connections to their original initializations. We use a simple layer-wise pruning heuristic: remove a percentage of the weights with the lowest magnitudes within each layer (as in Han et al. (2015)). Connections to outputs are pruned at half of the rate of the rest of the network. We explore other hyperparameters in Appendix G, including learning rates, optimization strategies (SGD, momentum), initialization schemes, and network sizes. + +<
> + + Figure 3. Test accuracy on Lenet (iterative pruning) as training proceeds. Each curve is the average of five trials. Labels are Pm, the fraction of weights remaining in the network after pruning. Error bars are the minimum and maximum of any trial. + + Notation. Pm = ||m||0 / |θ| is the sparsity of mask m; e.g., Pm = 25% when 75% of weights are pruned. + Iterative pruning. The winning tickets we find learn faster than the original network. Figure 3 plots the average test accuracy when training winning tickets iteratively pruned to various extents. Error bars are the minimum and maximum of five runs. For the first pruning rounds, networks learn faster and reach higher test accuracy the more they are pruned (left graph in Figure 3). A winning ticket comprising 51.3% of the weights from the original network (i.e., Pm = 51.3%) reaches higher test accuracy faster than the original network but slower than when Pm = 21.1%. When Pm < 21.1%, learning slows (middle graph). When Pm = 3.6%, a winning ticket regresses to the performance of the original network. A similar pattern repeats throughout this paper. + Figure 4a summarizes this behavior for all pruning levels when iteratively pruning by 20% per iteration (blue). On the left is the iteration at which each network reaches minimum validation loss (i.e., when the early-stopping criterion would halt training) in relation to the percent of weights remaining after pruning; in the middle is test accuracy at that iteration. We use the iteration at which the early-stopping criterion is met as a proxy for how quickly the network learns. + The winning tickets learn faster as Pm decreases from 100% to 21%, at which point early-stopping occurs 38% earlier than for the original network. Further pruning causes learning to slow, returning to the early-stopping performance of the original network when Pm = 3.6%. Test accuracy increases with pruning, improving by more than 0.3 percentage points when Pm = 13.5%; after this point, accuracy decreases, returning to the level of the original network when Pm = 3.6%. + At early stopping, training accuracy (Figure 4a, right) increases with pruning in a similar pattern to test accuracy, seemingly implying that winning tickets optimize more effectively but do not generalize better. However, at iteration 50,000 (Figure 4b), iteratively-pruned winning tickets still see a test accuracy improvement of up to 0.35 percentage points in spite of the fact that training accuracy reaches 100% for nearly all networks (Appendix D, Figure 12). This means that the gap between training accuracy and test accuracy is smaller for winning tickets, pointing to improved generalization. + Random reinitialization. To measure the importance of a winning ticket's initialization, we retain the structure of a winning ticket (i.e., the mask m) but randomly sample a new initialization <>. We randomly reinitialize each winning ticket three times, making 15 total per point in Figure 4. We find that initialization is crucial for the efficacy of a winning ticket. The right graph in Figure 3 shows this experiment for iterative pruning. In addition to the original network and winning tickets at Pm = 51% and 21% are the random reinitialization experiments. Where the winning tickets learn faster as they are pruned, they learn progressively slower when randomly reinitialized.
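The reinitialization control can be sketched as follows (assuming PyTorch and a masks dictionary produced by the pruning procedure; the Gaussian Glorot initializer mirrors Figure 2, but the exact sampling is an assumption): the winning ticket's mask is kept while its surviving weights are re-drawn.

import torch

@torch.no_grad()
def random_reinit(model, masks):
    # Keep the winning ticket's sparse structure m, but sample a fresh initialization.
    for name, param in model.named_parameters():
        if name in masks and param.dim() > 1:
            torch.nn.init.xavier_normal_(param)   # new Gaussian Glorot draw
            param.mul_(masks[name])               # re-apply the same mask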
The broader results of this experiment are shown by the orange line in Figure 4a. Unlike winning tickets, the reinitialized networks learn increasingly slower than the original network and lose test accuracy after little pruning. The average reinitialized iterative winning ticket's test accuracy drops off from the original accuracy when Pm = 21.1%, compared to 2.9% for the winning ticket. When Pm = 21%, the winning ticket reaches minimum validation loss 2.51x faster than when reinitialized and is half a percentage point more accurate. + +<
> + + Figure 4. Early-stopping iteration and accuracy of Lenet under one-shot and iterative pruning. Average of five trials; error bars for the minimum and maximum values. At iteration 50,000, training accuracy is 100% for Pm ≥ 2% for iterative winning tickets (see Appendix D, Figure 12). + + All networks reach 100% training accuracy for Pm ≥ 5%; Figure 4b therefore shows that the winning tickets generalize substantially better than when randomly reinitialized. This experiment supports the lottery ticket hypothesis' emphasis on initialization: the original initialization withstands and benefits from pruning, while the random reinitialization's performance immediately suffers and diminishes steadily. + One-shot pruning. Although iterative pruning extracts smaller winning tickets, repeated training means they are costly to find. One-shot pruning makes it possible to identify winning tickets without this repeated training. Figure 4c shows the results of one-shot pruning (green) and randomly reinitializing (red); one-shot pruning does indeed find winning tickets. When 67.5% ≥ Pm ≥ 17.6%, the average winning tickets reach minimum validation loss earlier than the original network. When 95.0% ≥ Pm ≥ 5.17%, test accuracy is higher than the original network. However, iteratively-pruned winning tickets learn faster and reach higher test accuracy at smaller network sizes. The green and red lines in Figure 4c are reproduced on the logarithmic axes of Figure 4a, making this performance gap clear. Since our goal is to identify the smallest possible winning tickets, we focus on iterative pruning throughout the rest of the paper. + + 3 WINNING TICKETS IN CONVOLUTIONAL NETWORKS + + Here, we apply the lottery ticket hypothesis to convolutional networks on CIFAR10, increasing both the complexity of the learning problem and the size of the networks. We consider the Conv-2, Conv-4, and Conv-6 architectures in Figure 2, which are scaled-down variants of the VGG (Simonyan & Zisserman, 2014) family. The networks have two, four, or six convolutional layers followed by two fully-connected layers; max-pooling occurs after every two convolutional layers. The networks cover a range from near-fully-connected to traditional convolutional networks, with less than 1% of parameters in convolutional layers in Conv-2 to nearly two thirds in Conv-6. 3 + Finding winning tickets. The solid lines in Figure 5 (top) show the iterative lottery ticket experiment on Conv-2 (blue), Conv-4 (orange), and Conv-6 (green) at the per-layer pruning rates from Figure 2. The pattern from Lenet in Section 2 repeats: as the network is pruned, it learns faster and test accuracy rises as compared to the original network. In this case, the results are more pronounced. + 3 Appendix H explores other hyperparameters, including learning rates, optimization strategies (SGD, momentum), and the relative rates at which to prune convolutional and fully-connected layers. + +<
> + + Figure 5. Early-stopping iteration and test and training accuracy of the Conv-2/4/6 architectures when iteratively pruned and when randomly reinitialized. Each solid line is the average of five trials; each dashed line is the average of fifteen reinitializations (three per trial). The bottom right graph plots test accuracy of winning tickets at iterations corresponding to the last iteration of training for the original network (20,000 for Conv-2, 25,000 for Conv-4, and 30,000 for Conv-6); at this iteration, training accuracy is 100% for Pm ≥ 2% for winning tickets (see Appendix D). + + Winning tickets reach minimum validation loss at best 3.5x faster for Conv-2 (Pm = 8.8%), 3.5x for Conv-4 (Pm = 9.2%), and 2.5x for Conv-6 (Pm = 15.1%). Test accuracy improves at best 3.4 percentage points for Conv-2 (Pm = 4.6%), 3.5 for Conv-4 (Pm = 11.1%), and 3.3 for Conv-6 (Pm = 26.4%). All three networks remain above their original average test accuracy when Pm > 2%. + As in Section 2, training accuracy at the early-stopping iteration rises with test accuracy. However, at iteration 20,000 for Conv-2, 25,000 for Conv-4, and 30,000 for Conv-6 (the iterations corresponding to the final training iteration for the original network), training accuracy reaches 100% for all networks when Pm ≥ 2% (Appendix D, Figure 13) and winning tickets still maintain higher test accuracy (Figure 5 bottom right). This means that the gap between test and training accuracy is smaller for winning tickets, indicating they generalize better. + Random reinitialization. We repeat the random reinitialization experiment from Section 2, which appears as the dashed lines in Figure 5. These networks again take increasingly longer to learn upon continued pruning. Just as with Lenet on MNIST (Section 2), test accuracy drops off more quickly for the random reinitialization experiments. However, unlike Lenet, test accuracy at early-stopping time initially remains steady and even improves for Conv-2 and Conv-4, indicating that—at moderate levels of pruning—the structure of the winning tickets alone may lead to better accuracy. + Dropout. Dropout (Srivastava et al., 2014; Hinton et al., 2012) improves accuracy by randomly disabling a fraction of the units (i.e., randomly sampling a subnetwork) on each training iteration. Baldi & Sadowski (2013) characterize dropout as simultaneously training the ensemble of all subnetworks. Since the lottery ticket hypothesis suggests that one of these subnetworks comprises a winning ticket, it is natural to ask whether dropout and our strategy for finding winning tickets interact. + Figure 6 shows the results of training Conv-2, Conv-4, and Conv-6 with a dropout rate of 0.5. Dashed lines are the network performance without dropout (the solid lines in Figure 5). 4 We continue to find winning tickets when training with dropout. Dropout increases initial test accuracy (2.1, 3.0, and 2.4 percentage points on average for Conv-2, Conv-4, and Conv-6, respectively), and iterative pruning increases it further (up to an additional 2.3, 4.6, and 4.7 percentage points, respectively, on average). Learning becomes faster with iterative pruning as before, but less dramatically in the case of Conv-2. + 4 We choose new learning rates for the networks as trained with dropout—see Appendix H.5. + +<
> + + Figure 6. Early-stopping iteration and test accuracy at early-stopping of Conv-2/4/6 when iteratively + pruned and trained with dropout. The dashed lines are the same networks trained without dropout + (the solid lines in Figure 5). Learning rates are 0.0003 for Conv-2 and 0.0002 for Conv-4 and Conv-6. + + <
> + + Figure 7. Test accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively pruned. + + These improvements suggest that our iterative pruning strategy interacts with dropout in a complementary way. Srivastava et al. (2014) observe that dropout induces sparse activations in the final network; it is possible that dropout-induced sparsity primes a network to be pruned. If so, dropout techniques that target weights (Wan et al., 2013) or learn per-weight dropout probabilities (Molchanov et al., 2017; Louizos et al., 2018) could make winning tickets even easier to find. + + 4 VGG AND RESNET FOR CIFAR10 + + Here, we study the lottery ticket hypothesis on networks evocative of the architectures and techniques used in practice. Specifically, we consider VGG-style deep convolutional networks (VGG-19 on CIFAR10—Simonyan & Zisserman (2014)) and residual networks (Resnet-18 on CIFAR10—He et al. (2016)). 5 These networks are trained with batchnorm, weight decay, decreasing learning rate schedules, and augmented training data. We continue to find winning tickets for all of these architectures; however, our method for finding them, iterative pruning, is sensitive to the particular learning rate used. In these experiments, rather than measure early-stopping time (which, for these larger networks, is entangled with learning rate schedules), we plot accuracy at several moments during training to illustrate the relative rates at which accuracy improves. + Global pruning. On Lenet and Conv-2/4/6, we prune each layer separately at the same rate. For Resnet-18 and VGG-19, we modify this strategy slightly: we prune these deeper networks globally, removing the lowest-magnitude weights collectively across all convolutional layers. In Appendix I.1, we find that global pruning identifies smaller winning tickets for Resnet-18 and VGG-19. Our conjectured explanation for this behavior is as follows. For these deeper networks, some layers have far more parameters than others. For example, the first two convolutional layers of VGG-19 have 1728 and 36864 parameters, while the last has 2.35 million. When all layers are pruned at the same rate, these smaller layers become bottlenecks, preventing us from identifying the smallest possible winning tickets. Global pruning makes it possible to avoid this pitfall. + VGG-19. We study the variant of VGG-19 adapted for CIFAR10 by Liu et al. (2019); we use the same training regime and hyperparameters. + 5 See Figure 2 and Appendix I for details on the networks, hyperparameters, and training regimes. + +<
> + + Figure 8. Test accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively pruned. + + Training lasts 160 epochs (112,480 iterations) with SGD with momentum (0.9), decreasing the learning rate by a factor of 10 at 80 and 120 epochs. This network has 20 million parameters. Figure 7 shows the results of iterative pruning and random reinitialization on VGG-19 at two initial learning rates: 0.1 (used in Liu et al. (2019)) and 0.01. At the higher learning rate, iterative pruning does not find winning tickets, and performance is no better than when the pruned networks are randomly reinitialized. However, at the lower learning rate, the usual pattern reemerges, with subnetworks that remain within 1 percentage point of the original accuracy while Pm ≥ 3.5%. (They are not winning tickets, since they do not match the original accuracy.) When randomly reinitialized, the subnetworks lose accuracy as they are pruned in the same manner as other experiments throughout this paper. Although these subnetworks learn faster than the unpruned network early in training (Figure 7 left), this accuracy advantage erodes later in training due to the lower initial learning rate. However, these subnetworks still learn faster than when reinitialized. + To bridge the gap between the lottery ticket behavior of the lower learning rate and the accuracy advantage of the higher learning rate, we explore the effect of linear learning rate warmup from 0 to the initial learning rate over k iterations. Training VGG-19 with warmup (k = 10000, green line) at learning rate 0.1 improves the test accuracy of the unpruned network by about one percentage point. Warmup makes it possible to find winning tickets, exceeding this initial accuracy when Pm ≥ 1.5%. + Resnet-18. Resnet-18 (He et al., 2016) is a 20-layer convolutional network with residual connections designed for CIFAR10. It has 271,000 parameters. We train the network for 30,000 iterations with SGD with momentum (0.9), decreasing the learning rate by a factor of 10 at 20,000 and 25,000 iterations. Figure 8 shows the results of iterative pruning and random reinitialization at learning rates 0.1 (used in He et al. (2016)) and 0.01. These results largely mirror those of VGG: iterative pruning finds winning tickets at the lower learning rate but not the higher learning rate. The accuracy of the best winning tickets at the lower learning rate (89.5% when 41.7% ≥ Pm ≥ 21.9%) falls short of the original network's accuracy at the higher learning rate (90.5%). At the lower learning rate, the winning ticket again initially learns faster (left plots of Figure 8), but falls behind the unpruned network at the higher learning rate later in training (right plot). Winning tickets trained with warmup close the accuracy gap with the unpruned network at the higher learning rate, reaching 90.5% test accuracy with learning rate 0.03 (warmup, k = 20000) at Pm = 27.1%. For these hyperparameters, we still find winning tickets when Pm ≥ 11.8%. Even with warmup, however, we could not find hyperparameters for which we could identify winning tickets at the original learning rate, 0.1.
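The warmup used here can be sketched as a linear ramp of the learning rate over k iterations (a minimal sketch assuming PyTorch's LambdaLR scheduler; the model and k below are placeholders):

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

k = 10000  # warmup iterations, as in the VGG-19 experiments above
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / k))  # ramp 0 -> 0.1 linearly over k steps

# for it in range(total_iterations):
#     ...                      # forward/backward pass (assumed)
#     optimizer.step()
#     warmup.step()            # after k iterations the multiplier stays at 1.0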
5 DISCUSSION + + Existing work on neural network pruning (e.g., Han et al. (2015)) demonstrates that the function learned by a neural network can often be represented with fewer parameters. Pruning typically proceeds by training the original network, removing connections, and further fine-tuning. In effect, the initial training initializes the weights of the pruned network so that it can learn in isolation during fine-tuning. We seek to determine if similarly sparse networks can learn from the start. We find that the architectures studied in this paper reliably contain such trainable subnetworks, and the lottery ticket hypothesis proposes that this property applies in general. Our empirical study of the existence and nature of winning tickets invites a number of follow-up questions. + The importance of winning ticket initialization. When randomly reinitialized, a winning ticket learns more slowly and achieves lower test accuracy, suggesting that initialization is important to its success. One possible explanation for this behavior is that these initial weights are close to their final values after training—that in the most extreme case, they are already trained. However, experiments in Appendix F show the opposite—that the winning ticket weights move further than other weights. This suggests that the benefit of the initialization is connected to the optimization algorithm, dataset, and model. For example, the winning ticket initialization might land in a region of the loss landscape that is particularly amenable to optimization by the chosen optimization algorithm. + Liu et al. (2019) find that pruned networks are indeed trainable when randomly reinitialized, seemingly contradicting conventional wisdom and our random reinitialization experiments. For example, on VGG-19 (for which we share the same setup), they find that networks pruned by up to 80% and randomly reinitialized match the accuracy of the original network. Our experiments in Figure 7 confirm these findings at this level of sparsity (below which Liu et al. do not present data). However, after further pruning, initialization matters: we find winning tickets when VGG-19 is pruned by up to 98.5%; when reinitialized, these tickets reach much lower accuracy. We hypothesize that—up to a certain level of sparsity—highly overparameterized networks can be pruned, reinitialized, and retrained successfully; however, beyond this point, extremely pruned, less severely overparameterized networks only maintain accuracy with fortuitous initialization. + The importance of winning ticket structure. The initialization that gives rise to a winning ticket is arranged in a particular sparse architecture. Since we uncover winning tickets through heavy use of training data, we hypothesize that the structure of our winning tickets encodes an inductive bias customized to the learning task at hand. Cohen & Shashua (2016) show that the inductive bias embedded in the structure of a deep network determines the kinds of data that it can separate more parameter-efficiently than can a shallow network; although Cohen & Shashua (2016) focus on the pooling geometry of convolutional networks, a similar effect may be at play with the structure of winning tickets, allowing them to learn even when heavily pruned. + The improved generalization of winning tickets. We reliably find winning tickets that generalize better, exceeding the test accuracy of the original network while matching its training accuracy. Test accuracy increases and then decreases as we prune, forming an Occam's Hill (Rasmussen & Ghahramani, 2001) where the original, overparameterized model has too much complexity (perhaps overfitting) and the extremely pruned model has too little.
The conventional view of the relationship between compression and generalization is that compact hypotheses can better generalize (Rissanen, 1986). Recent theoretical work shows a similar link for neural networks, proving tighter generalization bounds for networks that can be compressed further (Zhou et al. (2018) for pruning/quantization and Arora et al. (2018) for noise robustness). The lottery ticket hypothesis offers a complementary perspective on this relationship—that larger networks might explicitly contain simpler representations.

Implications for neural network optimization. Winning tickets can reach accuracy equivalent to that of the original, unpruned network, but with significantly fewer parameters. This observation connects to recent work on the role of overparameterization in neural network training. For example, Du et al. (2019) prove that sufficiently overparameterized two-layer ReLU networks (with fixed-size second layers) trained with SGD converge to global optima. A key question, then, is whether the presence of a winning ticket is necessary or sufficient for SGD to optimize a neural network to a particular test accuracy. We conjecture (but do not empirically show) that SGD seeks out and trains a well-initialized subnetwork. By this logic, overparameterized networks are easier to train because they have more combinations of subnetworks that are potential winning tickets.

6 LIMITATIONS AND FUTURE WORK

We only consider vision-centric classification tasks on smaller datasets (MNIST, CIFAR10). We do not investigate larger datasets (namely ImageNet (Russakovsky et al., 2015)). Iterative pruning is computationally intensive, requiring training a network 15 or more times consecutively for multiple trials. In future work, we intend to explore more efficient methods for finding winning tickets that will make it possible to study the lottery ticket hypothesis in more resource-intensive settings.

Sparse pruning is our only method for finding winning tickets. Although we reduce parameter-counts, the resulting architectures are not optimized for modern libraries or hardware. In future work, we intend to study other pruning methods from the extensive contemporary literature, such as structured pruning (which would produce networks optimized for contemporary hardware) and non-magnitude pruning methods (which could produce smaller winning tickets or find them earlier).

The winning tickets we find have initializations that allow them to match the performance of the unpruned networks at sizes too small for randomly-initialized networks to do the same. In future work, we intend to study the properties of these initializations that, in concert with the inductive biases of the pruned network architectures, make these networks particularly adept at learning.

On deeper networks (Resnet-18 and VGG-19), iterative pruning is unable to find winning tickets unless we train the networks with learning rate warmup. In future work, we plan to explore why warmup is necessary and whether other improvements to our scheme for identifying winning tickets could obviate the need for these hyperparameter modifications.

7 RELATED WORK

In practice, neural networks tend to be dramatically overparameterized. Distillation (Ba & Caruana, 2014; Hinton et al., 2015) and pruning (LeCun et al., 1990; Han et al., 2015) rely on the fact that parameters can be reduced while preserving accuracy.
Even with sufficient capacity to memorize training data, networks naturally learn simpler functions (Zhang et al., 2016; Neyshabur et al., 2014; Arpit et al., 2017). Contemporary experience (Bengio et al., 2006; Hinton et al., 2015; Zhang et al., 2016) and Figure 1 suggest that overparameterized networks are easier to train. We show that dense networks contain sparse subnetworks capable of learning on their own starting from their original initializations. Several other research directions aim to train small or sparse networks.

Prior to training. Squeezenet (Iandola et al., 2016) and MobileNets (Howard et al., 2017) are specifically engineered image-recognition networks that are an order of magnitude smaller than standard architectures. Denil et al. (2013) represent weight matrices as products of lower-rank factors. Li et al. (2018) restrict optimization to a small, randomly-sampled subspace of the parameter space (meaning all parameters can still be updated); they successfully train networks under this restriction. We show that one need not even update all parameters to optimize a network, and we find winning tickets through a principled search process involving pruning. Our contribution to this class of approaches is to demonstrate that sparse, trainable networks exist within larger networks.

After training. Distillation (Ba & Caruana, 2014; Hinton et al., 2015) trains small networks to mimic the behavior of large networks; small networks are easier to train in this paradigm. Recent pruning work compresses large models to run with limited resources (e.g., on mobile devices). Although pruning is central to our experiments, we study why training needs the overparameterized networks that make pruning possible. LeCun et al. (1990) and Hassibi & Stork (1993) first explored pruning based on second derivatives. More recently, Han et al. (2015) showed per-weight magnitude-based pruning substantially reduces the size of image-recognition networks. Guo et al. (2016) restore pruned connections as they become relevant again. Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned. Other proposed pruning heuristics include pruning based on activations (Hu et al., 2016), redundancy (Mariet & Sra, 2016; Srinivas & Babu, 2015a), per-layer second derivatives (Dong et al., 2017), and energy/computation efficiency (Yang et al., 2017) (e.g., pruning convolutional filters (Li et al., 2016; Molchanov et al., 2016; Luo et al., 2017) or channels (He et al., 2017)). Cohen et al. (2016) observe that convolutional filters are sensitive to initialization ("The Filter Lottery"); throughout training, they randomly reinitialize unimportant filters.

During training. Bellec et al. (2018) train with sparse networks and replace weights that reach zero with new random connections. Srinivas et al. (2017) and Louizos et al. (2018) learn gating variables that minimize the number of nonzero parameters. Narang et al. (2017) integrate magnitude-based pruning into training. Gal & Ghahramani (2016) show that dropout approximates Bayesian inference in Gaussian processes. Bayesian perspectives on dropout learn dropout probabilities during training (Gal et al., 2017; Kingma et al., 2015; Srinivas & Babu, 2016).
Techniques that learn per-weight, per-unit (Srinivas & Babu, 2016), or structured dropout probabilities naturally (Molchanov et al., 2017; Neklyudov et al., 2017) or explicitly (Louizos et al., 2017; Srinivas & Babu, 2015b) prune and sparsify networks during training as dropout probabilities for some weights reach 1. In contrast, we train networks at least once to find winning tickets. These techniques might also find winning tickets, or, by inducing sparsity, might beneficially interact with our methods.

REFERENCES

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. ICML, 2018.

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242, 2017.

Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.

Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pp. 2814–2822, 2013.

Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. Proceedings of ICLR, 2018.

Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems, pp. 123–130, 2006.

Joseph Paul Cohen, Henry Z Lo, and Wei Ding. RandomOut: Using a convolutional gradient norm to win the filter lottery. ICLR Workshop, 2016.

Nadav Cohen and Amnon Shashua. Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743, 2016.

Misha Denil, Babak Shakibi, Laurent Dinh, Nando De Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.

Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4860–4874, 2017.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1eK3i09YQ.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.

Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3584–3593, 2017.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pp. 1379–1387, 2016.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John Tran, and William J Dally. DSD: Regularizing deep neural networks with dense-sparse-dense training flow. Proceedings of ICLR, 2017.

Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171, 1993.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, pp. 6, 2017.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1990.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. Proceedings of ICLR, 2018.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJlnB3C5Ym.

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3290–3300, 2017.

Christos Louizos, Max Welling, and Diederik P Kingma.
Learning sparse neural networks through L0 regularization. Proceedings of ICLR, 2018.

Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.

Zelda Mariet and Suvrit Sra. Diversity networks. Proceedings of ICLR, 2016.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 2016.

Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent neural networks. Proceedings of ICLR, 2017.

Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, pp. 6778–6787, 2017.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In T. K. Leen, T. G. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems 13, pp. 294–300. MIT Press, 2001. URL http://papers.nips.cc/paper/1925-occams-razor.pdf.

Jorma Rissanen. Stochastic complexity and modeling. The Annals of Statistics, pp. 1080–1100, 1986.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015a.

Suraj Srinivas and R Venkatesh Babu. Learning neural network architectures using backpropagation. arXiv preprint arXiv:1511.05497, 2015b.

Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.

Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 138–145, 2017.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066, 2013.

Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Compressibility and generalization in large-scale deep learning. arXiv preprint arXiv:1804.05862, 2018.
A ACKNOWLEDGMENTS

We gratefully acknowledge IBM, which—through the MIT-IBM Watson AI Lab—contributed the computational resources necessary to conduct the experiments in this paper. We particularly thank IBM researchers German Goldszmidt, David Cox, Ian Molloy, and Benjamin Edwards for their generous contributions of infrastructure, technical support, and feedback. We also wish to thank Aleksander Madry, Shafi Goldwasser, Ed Felten, David Bieber, Karolina Dziugaite, Daniel Weitzner, and R. David Edelman for support, feedback, and helpful discussions over the course of this project. This work was supported in part by the Office of Naval Research (ONR N00014-17-1-2699).

B ITERATIVE PRUNING STRATEGIES

In this Appendix, we examine two different ways of structuring the iterative pruning strategy that we use throughout the main body of the paper to find winning tickets.

Strategy 1: Iterative pruning with resetting.

1. Randomly initialize a neural network <> where <> and <> is a mask.
2. Train the network for j iterations, reaching parameters <>.
3. Prune s% of the parameters, creating an updated mask m′ where <>.
4. Reset the weights of the remaining portion of the network to their values in <>. That is, let <>.
5. Let <> and repeat steps 2 through 4 until a sufficiently pruned network has been obtained.

Strategy 2: Iterative pruning with continued training.

1. Randomly initialize a neural network <> where <> and <> is a mask.
2. Train the network for j iterations.
3. Prune s% of the parameters, creating an updated mask m′ where <>.
4. Let <> and repeat steps 2 and 3 until a sufficiently pruned network has been obtained.
5. Reset the weights of the remaining portion of the network to their values in <>. That is, let <>.

The difference between these two strategies is that, after each round of pruning, Strategy 2 retrains using the already-trained weights, whereas Strategy 1 resets the network weights back to their initial values before retraining. In both cases, after the network has been sufficiently pruned, its weights are reset back to the original initializations.

Figures 9 and 10 compare the two strategies on the Lenet and Conv-2/4/6 architectures using the hyperparameters we select in Appendices G and H. In all cases, Strategy 1 maintains higher validation accuracy and faster early-stopping times to smaller network sizes.

C EARLY STOPPING CRITERION

Throughout this paper, we are interested in measuring the speed at which networks learn. As a proxy for this quantity, we measure the iteration at which an early-stopping criterion would end training. The specific criterion we employ is the iteration of minimum validation loss. In this Subsection, we further explain that criterion.

Validation and test loss follow a pattern where they decrease early in the training process, reach a minimum, and then begin to increase as the model overfits to the training data. Figure 11 shows an example of the validation loss as training progresses; these graphs use Lenet, iterative pruning, and Adam with a learning rate of 0.0012 (the learning rate we will select in the following subsection). This Figure shows the validation loss corresponding to the test accuracies in Figure 3.

<
> + + Figure 9. The early-stopping iteration and accuracy at early-stopping of the iterative lottery ticket + experiment on the Lenet architecture when iteratively pruned using the resetting and continued + training strategies. + + <
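For readers who prefer code, a minimal sketch of Strategy 1 (iterative pruning with resetting), the strategy that Figure 9 compares against continued training, is given below. The train function and the per-layer parameter layout are assumptions for illustration, not the exact training code; magnitude pruning marks the lowest-magnitude surviving weights in each layer as pruned.

    import numpy as np

    def prune_by_magnitude(weights, mask, rate):
        """Mark the lowest-magnitude surviving weights in one layer as pruned (Strategy 1, step 3)."""
        surviving = np.abs(weights[mask == 1])
        k = int(rate * surviving.size)          # number of additional weights to prune
        if k == 0:
            return mask
        threshold = np.sort(surviving)[k]       # the k smallest surviving magnitudes fall below this
        new_mask = mask.copy()
        new_mask[np.abs(weights) < threshold] = 0
        return new_mask

    def iterative_prune_with_resetting(init_weights, train, rounds, rate=0.2):
        """init_weights: dict of layer name -> initial weight array; train: assumed function that
        trains the masked weights for j iterations and returns the trained weights."""
        masks = {name: np.ones_like(w) for name, w in init_weights.items()}
        for _ in range(rounds):
            trained = train({n: w * masks[n] for n, w in init_weights.items()}, masks)
            masks = {n: prune_by_magnitude(trained[n], masks[n], rate) for n in masks}
            # Step 4: surviving weights are reset to their original initializations on the next round.
        return {n: init_weights[n] * masks[n] for n in masks}, masks
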
> + + Figure 10. The early-stopping iteration and accuracy at early-stopping of the iterative lottery ticket + experiment on the Conv-2, Conv-4, and Conv-6 architectures when iteratively pruned using the + resetting and continued training strategies. + + <
> + + Figure 11. The validation loss data corresponding to Figure 3, i.e., the validation loss as training + progresses for several different levels of pruning in the iterative pruning experiment. Each line is + the average of five training runs at the same level of iterative pruning; the labels are the percentage + of weights from the original network that remain after pruning. Each network was trained with + Adam at a learning rate of 0.0012. The left graph shows winning tickets that learn increasingly faster + than the original network and reach lower loss. The middle graph shows winning tickets that learn + increasingly slower after the fastest early-stopping time has been reached. The right graph contrasts + the loss of winning tickets to the loss of randomly reinitialized networks. + + <
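As a minimal illustration of the early-stopping criterion described in this appendix (the iteration of minimum validation loss), assuming the validation loss has been recorded at a fixed evaluation interval as above:

    def early_stopping_iteration(val_losses, eval_interval=100):
        """Return the training iteration at which the recorded validation loss is lowest.

        val_losses: list of validation losses recorded every eval_interval iterations.
        """
        best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
        return best_index * eval_interval

    # Example: losses recorded at iterations 0, 100, 200, 300 -> minimum at iteration 200.
    assert early_stopping_iteration([0.9, 0.5, 0.4, 0.6]) == 200
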
>

Figure 12. Figure 4 augmented with a graph of the training accuracy at the end of 50,000 iterations.

In all cases, validation loss initially drops, after which it forms a clear bottom and then begins increasing again. Our early-stopping criterion identifies this bottom. We consider networks that reach this moment sooner to have learned "faster." In support of this notion, the ordering in which each experiment meets our early-stopping criterion in Figure 3 is the same order in which each experiment reaches a particular test accuracy threshold in Figure 3.

Throughout this paper, in order to contextualize this learning speed, we also present the test accuracy of the network at the iteration of minimum validation loss. In the main body of the paper, we find that winning tickets both arrive at early-stopping sooner and reach higher test accuracy at this point.

D TRAINING ACCURACY FOR LOTTERY TICKET EXPERIMENTS

This Appendix accompanies Figure 4 (the accuracy and early-stopping iterations of Lenet on MNIST from Section 2) and Figure 5 (the accuracy and early-stopping iterations of Conv-2, Conv-4, and Conv-6 in Section 3) in the main body of the paper. Those figures show the iteration of early-stopping, the test accuracy at early-stopping, the training accuracy at early-stopping, and the test accuracy at the end of the training process. However, we did not have space to include a graph of the training accuracy at the end of the training process, which we assert in the main body of the paper to be 100% for all but the most heavily pruned networks. In this Appendix, we include those additional graphs in Figure 12 (corresponding to Figure 4) and Figure 13 (corresponding to Figure 5). As we describe in the main body of the paper, training accuracy reaches 100% in all cases for all but the most heavily pruned networks. However, training accuracy remains at 100% longer for winning tickets than for randomly reinitialized networks.

E COMPARING RANDOM REINITIALIZATION AND RANDOM SPARSITY

In this Appendix, we aim to understand the relative performance of randomly reinitialized winning tickets and randomly sparse networks. Specifically, we compare three types of networks (a construction sketch follows the list):

1. Networks found via iterative pruning with the original initializations (blue in Figure 14).
2. Networks found via iterative pruning that are randomly reinitialized (orange in Figure 14).
3. Random sparse subnetworks with the same number of parameters as those found via iterative pruning (green in Figure 14).

<
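A minimal sketch of how the three variants compared in Figure 14 can be constructed from a winning-ticket mask and the original initialization; the helper names and the Gaussian reinitialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def winning_ticket(init, mask):
        """(1) Original initialization restricted to the pruned structure."""
        return init * mask

    def random_reinit(init, mask):
        """(2) Same sparse structure, weights redrawn (assumed Gaussian with the layer's init scale)."""
        return rng.normal(0.0, init.std(), size=init.shape) * mask

    def random_sparse(init, mask):
        """(3) Random structure with the same number of surviving weights, randomly reinitialized."""
        new_mask = np.zeros(mask.size)
        keep = rng.choice(mask.size, size=int(mask.sum()), replace=False)
        new_mask[keep] = 1.0
        new_mask = new_mask.reshape(mask.shape)
        return rng.normal(0.0, init.std(), size=init.shape) * new_mask
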
>

Figure 13. Figure 5 augmented with a graph of the training accuracy at the end of the training process.

Figure 14 shows this comparison for all of the major experiments in this paper. For the fully-connected Lenet architecture for MNIST, we find that the randomly reinitialized networks outperform random sparsity. However, for all of the other, convolutional networks studied in this paper, there is no significant difference in performance between the two. We hypothesize that the fully-connected network for MNIST sees these benefits because only certain parts of the MNIST images contain useful information for classification, meaning connections in some parts of the network will be more valuable than others. This is less true with convolutions, which are not constrained to any one part of the input image.

F EXAMINING WINNING TICKETS

In this Appendix, we examine the structure of winning tickets to gain insight into why winning tickets are able to learn effectively even when so heavily pruned. Throughout this Appendix, we study the winning tickets from the Lenet architecture trained on MNIST. Unless otherwise stated, we use the same hyperparameters as in Section 2: Glorot initialization and Adam optimization.

F.1 WINNING TICKET INITIALIZATION (ADAM)

Figure 15 shows the distributions of winning ticket initializations for four different levels of Pm. To clarify, these are the distributions of the initial weights of the connections that have survived the pruning process. The blue, orange, and green lines show the distribution of weights for the first hidden layer, second hidden layer, and output layer, respectively. The weights are collected from five different trials of the lottery ticket experiment, but the distributions for each individual trial closely mirror those aggregated from across all of the trials. The histograms have been normalized so that the area under each curve is 1.

The left-most graph in Figure 15 shows the initialization distributions for the unpruned networks. We use Glorot initialization, so each of the layers has a different standard deviation. As the network is pruned, the first hidden layer maintains its distribution. However, the second hidden layer and the output layer become increasingly bimodal, with peaks on either side of 0. Interestingly, the peaks are asymmetric: the second hidden layer has more positive initializations remaining than negative initializations, and the reverse is true for the output layer.

The connections in the second hidden layer and output layer that survive the pruning process tend to have higher-magnitude initializations. Since we find winning tickets by pruning the connections with the lowest magnitudes in each layer at the end, the connections with the lowest-magnitude initializations must still have the lowest-magnitude weights at the end of training. A different trend holds for the input layer: it maintains its distribution, meaning a connection's initialization has less relation to its final weight.

F.2 WINNING TICKET INITIALIZATIONS (SGD)

We also consider the winning tickets obtained when training the network with SGD at learning rate 0.8 (selected as described in Appendix G). The bimodal distributions from Figure 15 are present across all layers (see Figure 16).
The connections with the highest-magnitude initializations are more likely to survive the pruning process, meaning winning ticket initializations have a bimodal distribution with peaks on opposite sides of 0. Just as with the adam-optimized winning tickets, these peaks are of different sizes, with the first hidden layer favoring negative initializations and the second hidden layer and output layer favoring positive initializations. Just as with the adam results, we confirm that each individual trial evidences the same asymmetry as the aggregate graphs in Figure 16.

F.3 REINITIALIZING FROM WINNING TICKET INITIALIZATIONS

Considering that the initialization distributions of winning tickets Dm are so different from the Gaussian distribution D used to initialize the unpruned network, it is natural to ask whether randomly reinitializing winning tickets from Dm rather than D will improve winning ticket performance. We do not find this to be the case. Figure 17 shows the performance of winning tickets whose initializations are randomly sampled from the distribution of initializations contained in the winning tickets for

<
> + + Figure 14. The test accuracy at the final iteration for each of the networks studied in this paper. + + <
> + + Figure 15. The distribution of initializations in winning tickets pruned to the levels specified in the + titles of each plot. The blue, orange, and green lines show the distributions for the first hidden layer, + second hidden layer, and output layer of the Lenet architecture for MNIST when trained with the + adam optimizer and the hyperparameters used in 2. The distributions have been normalized so that + the area under each curve is 1. + + <
>

Figure 16. Same as Figure 15 where the network is trained with SGD at rate 0.8.

adam. More concretely, let <> be the set of initializations found in the winning ticket with mask m. We sample a new set of parameters <> and train the network <>. We perform this sampling on a per-layer basis. The results of this experiment are in Figure 17. Winning tickets reinitialized from Dm perform little better than when randomly reinitialized from D. We attempted the same experiment with the SGD-trained winning tickets and found similar results.

F.4 PRUNING AT ITERATION 0

One other way of interpreting the graphs of winning ticket initialization distributions is as follows: weights that begin small stay small, get pruned, and never become part of the winning ticket. (The only exception to this characterization is the first hidden layer for the adam-trained winning tickets.) If this is the case, then perhaps low-magnitude weights were never important to the network and can be pruned from the very beginning. Figure 18 shows the result of attempting this pruning strategy. Winning tickets selected in this fashion perform even worse than when they are found by iterative

<
> + + Figure 17. The performance of the winning tickets of the Lenet architecture for MNIST when the + layers are randomly reinitialized from the distribution of initializations contained in the winning + ticket of the corresponding size. + + <
> + + Figure 18. The performance of the winning tickets of the Lenet architecture for MNIST when + magnitude pruning is performed before the network is ever trained. The network is subsequently + trained with adam. + + <
>

Figure 19. Between the first and last training iteration of the unpruned network, the magnitude by which weights in the network change. The blue line shows the distribution of magnitudes for weights that are not in the eventual winning ticket; the orange line shows the distribution of magnitudes for weights that are in the eventual winning ticket.

pruning and randomly reinitialized. We attempted the same experiment with the SGD-trained winning tickets and found similar results.

F.5 COMPARING INITIAL AND FINAL WEIGHTS IN WINNING TICKETS

In this subsection, we consider winning tickets in the context of the larger optimization process. To do so, we examine the initial and final weights of the unpruned network from which a winning ticket derives to determine whether weights that will eventually comprise a winning ticket exhibit properties that distinguish them from the rest of the network.

We consider the magnitude of the difference between initial and final weights. One possible rationale for the success of winning tickets is that they already happen to be close to the optimum that gradient descent eventually finds, meaning that winning ticket weights should change by a smaller amount than the rest of the network. Another possible rationale is that winning tickets are well placed in the optimization landscape for gradient descent to optimize productively, meaning that winning ticket weights should change by a larger amount than the rest of the network. Figure 19 shows that winning ticket weights tend to change by a larger amount than weights in the rest of the network, evidence that does not support the rationale that winning tickets are already close to the optimum.

It is notable that such a distinction exists between the two distributions. One possible explanation for this distinction is that the notion of a winning ticket may indeed be a natural part of neural network optimization. Another is that magnitude-pruning biases the winning tickets we find toward those containing weights that change in the direction of higher magnitude. Regardless, it offers hope that winning tickets may be discernible earlier in the training process (or after a single training run), meaning that there may be more efficient methods for finding winning tickets than iterative pruning.

Figure 20 shows the directions of these changes. It plots the difference between the magnitude of the final weight and the magnitude of the initial weight, i.e., whether the weight moved toward or away

<
> + + Figure 20. Between the first and last training iteration of the unpruned network, the magnitude by + which weights move away from 0. The blue line shows the distribution of magnitudes for weights + that are not in the eventual winning ticket; the orange line shows the distribution of magnitudes for + weights that are in the eventual winning ticket. + + + <
>

Figure 21. The fraction of incoming connections that survive the pruning process for each node in each layer of the Lenet architecture for MNIST as trained with adam.

from 0. In general, winning ticket weights are more likely to increase in magnitude (that is, move away from 0) than are weights that do not participate in the eventual winning ticket.

F.6 WINNING TICKET CONNECTIVITY

In this Subsection, we study the connectivity of winning tickets. Do some hidden units retain a large number of incoming connections while others fade away, or does the network retain relatively even sparsity among all units as it is pruned? We find the latter to be the case when examining the incoming connectivity of network units: for both adam and SGD, each unit retains a number of incoming connections approximately in proportion to the amount by which the overall layer has been pruned. Figures 21 and 22 show the fraction of incoming connections that survive the pruning process for each node in each layer. Recall that we prune the output layer at half the rate of the rest of the network, which explains why it has more connectivity than the other layers of the network.

<
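A sketch of how the per-unit connectivity in Figures 21 through 24 can be computed from a layer's pruning mask, assuming the mask is stored as a 0/1 array of shape (inputs x units):

    import numpy as np

    def incoming_survival_fraction(mask):
        """Fraction of incoming connections that survive pruning for each unit in a layer.

        mask: 0/1 array of shape (num_inputs, num_units), one column per unit.
        """
        return mask.sum(axis=0) / mask.shape[0]

    def outgoing_survival_fraction(mask):
        """Fraction of outgoing connections that survive pruning for each unit feeding this layer."""
        return mask.sum(axis=1) / mask.shape[1]
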
> + + Figure 22. Same as Figure 21 where the network is trained with SGD at rate 0.8. + + <
> + + Figure 23. The fraction of outgoing connections that survive the pruning process for each node in + each layer of the Lenet architecture for MNIST as trained with adam. The blue, orange, and green + lines are the outgoing connections from the input layer, first hidden layer, and second hidden layer, + respectively. + + <
>

Figure 24. Same as Figure 23 where the network is trained with SGD at rate 0.8.

However, this is not the case for the outgoing connections. To the contrary, for the adam-trained networks, certain units retain far more outgoing connections than others (Figure 23). The distributions are far less smooth than those for the incoming connections, suggesting that certain features are far more useful to the network than others. This is not unexpected for a fully-connected network on a task like MNIST, particularly for the input layer. MNIST images contain centered digits, so the pixels around the edges are not likely to be informative for the network. Indeed, the input layer has two peaks, one larger peak for input units with a high number of outgoing connections and one smaller peak for input units with a low number of outgoing connections. Interestingly, the adam-trained winning tickets develop a much more uneven distribution of outgoing connectivity for the input layer than does the SGD-trained network (Figure 24).

F.7 ADDING NOISE TO WINNING TICKETS

In this Subsection, we explore the extent to which winning tickets are robust to Gaussian noise added to their initializations. In the main body of the paper, we find that randomly reinitializing a winning ticket substantially slows its learning and reduces its eventual test accuracy. In this Subsection, we study a less extreme way of perturbing a winning ticket. Figure 25 shows the effect of adding Gaussian noise to the winning ticket initializations. The standard deviation of the noise distribution of each layer is a multiple of the standard deviation of the layer's initialization. Figure 25 shows noise distributions with standard deviations of 0.5, 1, 2, and 3 times that of the initialization. Adding Gaussian noise reduces the test accuracy of a winning ticket and slows its ability to learn, again demonstrating the importance of the original initialization. As more noise is added, accuracy decreases. However, winning tickets are surprisingly robust to noise. Adding noise at the 0.5 multiple barely changes winning ticket accuracy. Even after adding noise at the 3 multiple, the winning tickets continue to outperform the random reinitialization experiment.

<
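A minimal sketch of the perturbation studied in this subsection: Gaussian noise, with a per-layer standard deviation equal to a chosen multiple of that layer's initialization standard deviation, is added to the surviving weights. The dictionary layout is an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def add_init_noise(init_weights, masks, multiple):
        """Perturb winning-ticket initializations with layer-scaled Gaussian noise.

        init_weights, masks: dicts mapping layer name -> array; multiple: e.g. 0.5, 1, 2, or 3.
        """
        noisy = {}
        for name, w in init_weights.items():
            sigma = multiple * w.std()                  # scale noise to this layer's initialization std
            noise = rng.normal(0.0, sigma, size=w.shape)
            noisy[name] = (w + noise) * masks[name]     # only surviving weights are kept and trained
        return noisy
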
> + + Figure 25. The performance of the winning tickets of the Lenet architecture for MNIST when + Gaussian noise is added to the initializations. The standard deviations of the noise distributions for + each layer are a multiple of the standard deviations of the initialization distributions; in this Figure, + we consider multiples 0.5, 1, 2, and 3. + + + G HYPERPARAMETER EXPLORATION FOR FULLY-CONNECTED NETWORKS + + This Appendix accompanies Section 2 of the main paper. It explores the space of hyperparameters + for the Lenet architecture evaluated in Section 2 with two purposes in mind. + + 1.To explain the hyperparameters selected in the main body of the paper. + 2.To evaluate the extent to which the lottery ticket experiment patterns extend to other choices + of hyperparameters. + + G.1 EXPERIMENTAL METHODOLOGY + + This Section considers the fully-connected Lenet architecture (LeCun et al., 1998), which comprises + two fully-connected hidden layers and a ten unit output layer, on the MNIST dataset. Unless otherwise + stated, the hidden layers have 300 and 100 units each. + The MNIST dataset consists of 60,000 training examples and 10,000 test examples. We randomly + sampled a 5,000-example validation set from the training set and used the remaining 55,000 training + examples as our training set for the rest of the paper (including Section 2). The hyperparameter + selection experiments throughout this Appendix are evaluated using the validation set for determining + both the iteration of early-stopping and the accuracy at early-stopping; the networks in the main body + of this paper (which make use of these hyperparameters) have their accuracy evaluated on the test set. + The training set is presented to the network in mini-batches of 60 examples; at each epoch, the entire + training set is shuffled. + Unless otherwise noted, each line in each graph comprises data from three separate experiments. The + line itself traces the average performance of the experiments and the error bars indicate the minimum + and maximum performance of any one experiment. + Throughout this Appendix, we perform the lottery ticket experiment iteratively with a pruning rate of + 20% per iteration (10% for the output layer); we justify the choice of this pruning rate later in this + Appendix. Each layer of the network is pruned independently. On each iteration of the lottery ticket + experiment, the network is trained for 50,000 training iterations regardless of when early-stopping + occurs; in other words, no validation or test data is taken into account during the training process, and + early-stopping times are determined retroactively by examining validation performance. We evaluate + validation and test performance every 100 iterations. + For the main body of the paper, we opt to use the Adam optimizer (Kingma & Ba, 2014) and Gaussian + Glorot initialization (Glorot & Bengio, 2010). Although we can achieve more impressive results on + the lottery ticket experiment with other hyperparameters, we intend these choices to be as generic + as possible in an effort to minimize the extent to which our main results depend on hand-chosen + hyperparameters. In this Appendix, we select the learning rate for Adam that we use in the main body + of the paper. + In addition, we consider a wide range of other hyperparameters, including other optimization + algorithms (SGD with and without momentum), initialization strategies (Gaussian distributions + + <
> + + Figure 26. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment on the Lenet architecture trained with MNIST using the Adam optimizer at various + learning rates. Each line represents a different learning rate. + + <
> + + Figure 27. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment on the Lenet architecture trained with MNIST using stochastic gradient descent at + various learning rates. + + + with various standard deviations), network sizes (larger and smaller hidden layers), and pruning + strategies (faster and slower pruning rates). In each experiment, we vary the chosen hyperparameter + while keeping all others at their default values (Adam with the chosen learning rate, Gaussian Glorot + initialization, hidden layers with 300 and 100 units). The data presented in this appendix was collected + by training variations of the Lenet architecture more than 3,000 times. + + G.2 LEARNING RATE + + In this Subsection, we perform the lottery ticket experiment on the Lenet architecture as optimized + with Adam, SGD, and SGD with momentum at various learning rates. + Here, we select the learning rate that we use for Adam in the main body of the paper. Our criteria for + selecting the learning rate are as follows. + + 1.On the unpruned network, it should minimize training iterations necessary to reach early- + stopping and maximize validation accuracy at that iteration. That is, it should be a reasonable + hyperparameter for optimizing the unpruned network even if we are not running the lottery + ticket experiment. + 2. When running the iterative lottery ticket experiment, it should make it possible to match + the early-stopping iteration and accuracy of the original network with as few parameters as + possible. + 3.Of those options that meet (1) and (2), it should be on the conservative (slow) side so that it is + more likely to productively optimize heavily pruned networks under a variety of conditions + with a variety of hyperparameters. + + <
>

Figure 28. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery ticket experiment on the Lenet architecture trained with MNIST using stochastic gradient descent with momentum (0.9) at various learning rates.

Figure 26 shows the early-stopping iteration and validation accuracy at that iteration of performing the iterative lottery ticket experiment with the Lenet architecture optimized with Adam at various learning rates. According to the graph on the right of Figure 26, several learning rates between 0.0002 and 0.002 achieve similar levels of validation accuracy on the original network and maintain that performance to similar levels as the network is pruned. Of those learning rates, 0.0012 and 0.002 produce the fastest early-stopping times and maintain them to the smallest network sizes. We choose 0.0012 due to its higher validation accuracy on the unpruned network and in consideration of criterion (3) above.

We note that, across all of these learning rates, the lottery ticket pattern (in which learning becomes faster and validation accuracy increases with iterative pruning) remains present. Even those learning rates that did not satisfy the early-stopping criterion within 50,000 iterations (2.5e-05 and 0.0064) still showed accuracy improvements with pruning.

G.3 OTHER OPTIMIZATION ALGORITHMS

G.3.1 SGD

Here, we explore the behavior of the lottery ticket experiment when the network is optimized with stochastic gradient descent (SGD) at various learning rates. The results of doing so appear in Figure 27. The lottery ticket pattern appears across all learning rates, including those that fail to satisfy the early-stopping criterion within 50,000 iterations. SGD learning rates 0.4 and 0.8 reach early-stopping in a similar number of iterations as the best Adam learning rates (0.0012 and 0.002) but maintain this performance when the network has been pruned further (to less than 1% of its original size for SGD vs. about 3.6% of the original size for Adam). Likewise, on pruned networks, these SGD learning rates achieve equivalent accuracy to the best Adam learning rates, and they maintain that high accuracy when the network is pruned as much as the Adam learning rates.

G.3.2 MOMENTUM

Here, we explore the behavior of the lottery ticket experiment when the network is optimized with SGD with momentum (0.9) at various learning rates. The results of doing so appear in Figure 28. Once again, the lottery ticket pattern appears across all learning rates, with learning rates between 0.025 and 0.1 maintaining high validation accuracy and faster learning for the longest number of pruning iterations. Learning rate 0.025 achieves the highest validation accuracy on the unpruned network; however, its validation accuracy never increases as it is pruned, instead decreasing gradually, and higher learning rates reach early-stopping faster.

G.4 ITERATIVE PRUNING RATE

When running the iterative lottery ticket experiment on Lenet, we prune each layer of the network separately at a particular rate. That is, after training the network, we prune k% of the weights in

<
> + + Figure 29. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment when pruned at different rates. Each line represents a differentpruning rate—the + percentage of lowest-magnitude weights that are pruned from each layer after each training iteration. + + <
>

Figure 30. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery ticket experiment initialized with Gaussian distributions with various standard deviations. Each line is a different standard deviation for a Gaussian distribution centered at 0.

each layer (k/2% of the weights in the output layer) before resetting the weights to their original initializations and training again. In the main body of the paper, we find that iterative pruning finds smaller winning tickets than one-shot pruning, indicating that pruning too much of the network at once diminishes performance. Here, we explore different values of k.

Figure 29 shows the effect of the amount of the network pruned on each pruning iteration on early-stopping time and validation accuracy. There is a tangible difference in learning speed and validation accuracy at early-stopping between the lowest pruning rates (0.1 and 0.2) and higher pruning rates (0.4 and above). The lowest pruning rates reach higher validation accuracy and maintain that validation accuracy to smaller network sizes; they also maintain fast early-stopping times to smaller network sizes. For the experiments throughout the main body of the paper and this Appendix, we use a pruning rate of 0.2, which maintains much of the accuracy and learning speed of 0.1 while reducing the number of training iterations necessary to get to smaller network sizes.

In all of the Lenet experiments, we prune the output layer at half the rate of the rest of the network. Since the output layer is so small (1,000 weights out of 266,000 for the overall Lenet architecture), we found that pruning it reaches a point of diminishing returns much earlier than the other layers.

G.5 INITIALIZATION DISTRIBUTION

To this point, we have considered only a Gaussian Glorot (Glorot & Bengio, 2010) initialization scheme for the network. Figure 30 performs the lottery ticket experiment while initializing the Lenet architecture from Gaussian distributions with a variety of standard deviations. The networks were optimized with Adam at the learning rate chosen earlier. The lottery ticket pattern continues to appear across all standard deviations. When initialized from a Gaussian distribution with standard deviation 0.1, the Lenet architecture maintained high validation accuracy and low early-stopping times for the longest, approximately matching the performance of the Glorot-initialized network.

G.6 NETWORK SIZE

<
>

Figure 31. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery ticket experiment on the Lenet architecture with various layer sizes. The label for each line is the size of the first and second hidden layers of the network. All networks had Gaussian Glorot initialization and were optimized with Adam (learning rate 0.0012). Note that the x-axis of this plot charts the number of weights remaining, while all other graphs in this section have charted the percent of weights remaining.

Throughout this section, we have considered the Lenet architecture with 300 units in the first hidden layer and 100 units in the second hidden layer. Figure 31 shows the early-stopping iterations and validation accuracy at that iteration of the Lenet architecture with several other layer sizes. All networks we tested maintain the 3:1 ratio between units in the first hidden layer and units in the second hidden layer.

The lottery ticket hypothesis naturally invites a collection of questions related to network size. Generalizing, those questions tend to take the following form: according to the lottery ticket hypothesis, do larger networks, which contain more subnetworks, find "better" winning tickets? In line with the generality of this question, there are several different answers.

If we evaluate a winning ticket by the accuracy it achieves, then larger networks do find better winning tickets. The right graph in Figure 31 shows that, for any particular number of weights (that is, any particular point on the x-axis), winning tickets derived from initially larger networks reach higher accuracy. Put another way, in terms of accuracy, the lines are approximately arranged from bottom to top in increasing order of network size. It is possible that, since larger networks have more subnetworks, gradient descent found a better winning ticket. Alternatively, the initially larger networks have more units even when pruned to the same number of weights as smaller networks, meaning they are able to contain sparse subnetwork configurations that cannot be expressed by initially smaller networks.

If we evaluate a winning ticket by the time necessary for it to reach early-stopping, then larger networks have less of an advantage. The left graph in Figure 31 shows that, in general, early-stopping iterations do not vary greatly between networks of different initial sizes that have been pruned to the same number of weights. Upon exceedingly close inspection, winning tickets derived from initially larger networks tend to learn marginally faster than winning tickets derived from initially smaller networks, but these differences are slight.

If we evaluate a winning ticket by the size at which it returns to the same accuracy as the original network, the large networks do not have an advantage. Regardless of the initial network size, the right graph in Figure 31 shows that winning tickets return to the accuracy of the original network when they are pruned to between about 9,000 and 15,000 weights.

H HYPERPARAMETER EXPLORATION FOR CONVOLUTIONAL NETWORKS

This Appendix accompanies Section 3 of the main paper. It explores the space of optimization algorithms and hyperparameters for the Conv-2, Conv-4, and Conv-6 architectures evaluated in Section 3 with the same two purposes as Appendix G:
explaining the hyperparameters used in the main body of the paper and evaluating the lottery ticket experiment on other choices of hyperparameters.

H.1 EXPERIMENTAL METHODOLOGY

The Conv-2, Conv-4, and Conv-6 architectures are variants of the VGG (Simonyan & Zisserman, 2014) network architecture scaled down for the CIFAR10 (Krizhevsky & Hinton, 2009) dataset. Like VGG, the networks consist of a series of modules. Each module has two layers of 3x3 convolutional filters followed by a maxpool layer with stride 2. After all of the modules are two fully-connected layers of size 256 followed by an output layer of size 10; in VGG, the fully-connected layers are of size 4096 and the output layer is of size 1000. Like VGG, the first module has 64 convolutions in each layer, the second has 128, the third has 256, etc. The Conv-2, Conv-4, and Conv-6 architectures have 1, 2, and 3 modules, respectively.

The CIFAR10 dataset consists of 50,000 32x32 color (three-channel) training examples and 10,000 test examples. We randomly sampled a 5,000-example validation set from the training set and used the remaining 45,000 training examples as our training set for the rest of the paper. The hyperparameter selection experiments throughout this Appendix are evaluated on the validation set, and the examples in the main body of this paper (which make use of these hyperparameters) are evaluated on the test set. The training set is presented to the network in mini-batches of 60 examples; at each epoch, the entire training set is shuffled.

The Conv-2, Conv-4, and Conv-6 networks are initialized with Gaussian Glorot initialization (Glorot & Bengio, 2010) and are trained for the number of iterations specified in Figure 2. The number of training iterations was selected such that heavily-pruned networks could still train in the time provided. On dropout experiments, the number of training iterations is tripled to provide enough time for the dropout-regularized networks to train. We optimize these networks with Adam, and select the learning rate for each network in this Appendix.

As with the MNIST experiments, validation and test performance is only considered retroactively and has no effect on the progression of the lottery ticket experiments. We measure validation and test loss and accuracy every 100 training iterations.

Each line in each graph of this section represents the average of three separate experiments, with error bars indicating the minimum and maximum value that any experiment took on at that point. (Experiments in the main body of the paper are conducted five times.)

We allow convolutional layers and fully-connected layers to be pruned at different rates; we select those rates for each network in this Appendix. The output layer is pruned at half of the rate of the fully-connected layers for the reasons described in Appendix G.

H.2 LEARNING RATE

In this Subsection, we perform the lottery ticket experiment on the Conv-2, Conv-4, and Conv-6 architectures as optimized with Adam at various learning rates.

<
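A sketch of the Conv-2/4/6 family described above, written with torch.nn for concreteness; the layer sizes follow the description in this appendix, and the flattened dimension assumes 32x32 CIFAR10 inputs. This is an illustrative reconstruction, not the authors' exact implementation.

    import torch.nn as nn

    def make_conv_net(num_modules):
        """Conv-2/4/6: num_modules VGG-style modules, then FC 256, FC 256, and a 10-unit output."""
        layers, in_channels, size = [], 3, 32
        for i in range(num_modules):
            out_channels = 64 * (2 ** i)                # 64, 128, 256 filters per module
            layers += [nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_channels, size = out_channels, size // 2  # maxpool halves the spatial size
        layers += [nn.Flatten(),
                   nn.Linear(in_channels * size * size, 256), nn.ReLU(),
                   nn.Linear(256, 256), nn.ReLU(),
                   nn.Linear(256, 10)]
        return nn.Sequential(*layers)

    conv2, conv4, conv6 = make_conv_net(1), make_conv_net(2), make_conv_net(3)
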
>

Figure 32. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained using the Adam optimizer at various learning rates. Each line represents a different learning rate.

Here, we select the learning rate that we use for Adam in the main body of the paper. Our criteria for selecting the learning rate are the same as in Appendix G: minimizing training iterations and maximizing accuracy at early-stopping, finding winning tickets containing as few parameters as possible, and remaining conservative enough to apply to a range of other experiments.

Figure 32 shows the results of performing the iterative lottery ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures. Since we have not yet selected the pruning rates for each network, we temporarily pruned fully-connected layers at 20% per iteration, convolutional layers at 10% per iteration, and the output layer at 10% per iteration; we explore this part of the hyperparameter space in a later subsection.

For Conv-2, we select a learning rate of 0.0002, which has the highest initial validation accuracy, maintains both high validation accuracy and low early-stopping times for among the longest, and reaches the fastest early-stopping times. This learning rate also leads to a 3.3 percentage point improvement in validation accuracy when the network is pruned to 3% of its original size. Other learning rates, such as 0.0004, have lower initial validation accuracy (65.2% vs 67.6%) but eventually reach higher absolute levels of validation accuracy (71.7%, a 6.5 percentage point increase, vs. 70.9%, a 3.3 percentage point increase). However, learning rate 0.0002 shows the highest proportional decrease in early-stopping times: 4.8x (when pruned to 8.8% of the original network size).

For Conv-4, we select learning rate 0.0003, which has among the highest initial validation accuracy, maintains high validation accuracy and fast early-stopping times when pruned by among the most, and balances improvements in validation accuracy (3.7 percentage point improvement to 78.6% when 5.4% of weights remain) and improvements in early-stopping time (4.27x when 11.1% of weights remain). Other learning rates reach higher validation accuracy (0.0004—3.6 percentage point improvement to 79.1% accuracy when 5.4% of weights remain) or show better improvements in early-stopping times (0.0002—5.1x faster when 9.2% of weights remain) but not both.

For Conv-6, we also select learning rate 0.0003 for similar reasons to those provided for Conv-4. Validation accuracy improves by 2.4 percentage points to 81.5% when 9.31% of weights remain and early-stopping times improve by 2.61x when pruned to 11.9%. Learning rate 0.0004 reaches high final validation accuracy (81.9%, an increase of 2.7 percentage points, when 15.2% of weights remain) but with smaller improvements in early-stopping times, and learning rate 0.0002 shows greater improvements in early-stopping times (6.26x when 19.7% of weights remain) but reaches lower overall validation accuracy.

We note that, across nearly all combinations of learning rates, the lottery ticket pattern—where early-stopping times were maintained or decreased and validation accuracy was maintained or increased during the course of the lottery ticket experiment—continues to hold.
This pattern fails to hold at + the very highest learning rates. early-stopping times decreased only briefly (in the case of Conv-2 or + Conv-4) or not at all (in the case of Conv-6), and accuracy increased only briefly (in the case of all + three networks). This pattern is similar to that which we observe in Section 4. at the highest learning + rates, our iterative pruning algorithm fails to find winning tickets. + + H.3 OTHER OPTIMIZATION ALGORITHMS + + H.3.1 SGD + Here, we explore the behavior of the lottery ticket experiment when the Conv-2, Conv-4, and Conv-6 + networks are optimized with stochastic gradient descent (SGD) at various learning rates. The results + of doing so appear in Figure 33. In general, these networks—particularly Conv-2 and Conv-4— + proved challenging to train with SGD and Glorot initialization. As Figure 33 reflects, we could not + find SGD learning rates for which the unpruned networks matched the validation accuracy of the + same networks when trained with Adam; at best, the SGD-trained unpruned networks were typically + 2-3 percentage points less accurate. At higher learning rates than those in Figure 32, gradients tended + to explode when training the unpruned network; at lower learning rates, the networks often failed to + learn at all. + At all of the learning rates depicted, we found winning tickets. In all cases, early-stopping times + initially decreased with pruning before eventually increasing again, just as in other lottery ticket + experiments. The Conv-6 network also exhibited the same accuracy patterns as other experiments, + with validation accuracy initially increasing with pruning before eventually decreasing again. + + <
> + + Figure 33. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained + using SGD at various learning rates. Each line represents a different learning rate. The legend for + each pair of graphs is above the graphs. + + However, the Conv-2 and Conv-4 architectures exhibited a different validation accuracy pattern + from other experiments in this paper. Accuracy initially declined with pruning before rising as + the network was further pruned; it eventually matched or surpassed the accuracy of the unpruned + network. When they eventually did surpass the accuracy of the original network, the pruned networks + reached early-stopping in about the same or fewer iterations than the original network, constituting + a winning ticket by our definition. Interestingly, this pattern also appeared for Conv-6 networks at + slower SGD learning rates, suggesting that faster learning rates for Conv-2 and Conv-4 than those in + Figure 32 might cause the usual lottery ticket accuracy pattern to reemerge. Unfortunately, at these + higher learning rates, gradients exploded on the unpruned networks, preventing us from running these + experiments. + + H.3.2 MOMENTUM + + Here, we explore the behavior of the lottery ticket experiment when the network is optimized with + SGD with momentum (0.9) at various learning rates. The results of doing so appear in Figure 34. + In general, the lottery ticket pattern continues to apply, with early-stopping times decreasing and + accuracy increasing as the networks are pruned. However, there were two exceptions to this pattern. + + 1.At the very lowest learning rates (e.g., learning rate 0.001 for Conv-4 and all but the highest + learning rate for Conv-2), accuracy initially decreased before increasing to higher levels + than reached by the unpruned network; this is the same pattern we observed when training + these networks with SGD. + 2.At the very highest learning rates (e.g., learning rates 0.005 and 0.008 for Conv-2 and Conv- + 4), early-stopping times never decreased and instead remained stable before increasing; this + is the same pattern we observed for the highest learning rates when training with Adam. + + + H.4 ITERATIVE PRUNING RATE + + For the convolutional network architectures, we select different pruning rates for convolutional and + fully-connected layers. In the Conv-2 and Conv-4 architectures, convolutional parameters make up a + relatively small portion of the overall number of parameters in the models. By pruning convolutions + more slowly, we are likely to be able to prune the model further while maintaining performance. + In other words, we hypothesize that, if all layers were pruned evenly, convolutional layers would + become a bottleneck that would make it more difficult to find lower parameter-count models that are + still able to learn. For Conv-6, the opposite may be true. since nearly two thirds of its parameters are + in convolutional layers, pruning fully-connected layers could become the bottleneck. + Our criterion for selecting hyperparameters in this section is to find a combination of pruning rates + that allows networks to reach the lowest possible parameter-counts while maintaining validation + accuracy at or above the original accuracy and early-stopping times at or below that for the original + network. 
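 As a concrete illustration of the procedure these pruning rates feed into, the sketch below implements
 the prune-and-rewind step of one iteration of the lottery ticket experiment, with separate rates for
 convolutional, fully-connected, and output layers (training between rounds is omitted). This is a
 minimal NumPy sketch of our reading of the procedure, not the code used for these experiments; the
 layer-name conventions, the dictionary-of-arrays representation, and the default rates are illustrative
 assumptions.

    import numpy as np

    def prune_layer(weights, mask, rate):
        # Zero out the smallest-magnitude weights that are still unpruned.
        alive = np.abs(weights[mask == 1])
        num_to_prune = int(round(rate * alive.size))
        if num_to_prune == 0:
            return mask
        threshold = np.sort(alive)[num_to_prune - 1]
        return mask * (np.abs(weights) > threshold).astype(mask.dtype)

    def lottery_ticket_round(trained, init, masks, conv_rate=0.15, fc_rate=0.20):
        """One pruning iteration: prune each layer at its type-specific rate, then rewind."""
        for name in trained:
            if name.startswith("conv"):
                rate = conv_rate
            elif name == "output":
                rate = fc_rate / 2.0  # the output layer is pruned at half the FC rate
            else:
                rate = fc_rate
            masks[name] = prune_layer(trained[name], masks[name], rate)
            # Rewind: surviving weights return to their values at initialization.
            trained[name] = init[name] * masks[name]
        return trained, masks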
 Figure 35 shows the results of performing the iterative lottery ticket experiment on Conv-2 (top),
 Conv-4 (middle), and Conv-6 (bottom) with different combinations of pruning rates.
 According to our criteria, we select an iterative convolutional pruning rate of 10% for Conv-2, 10% for
 Conv-4, and 15% for Conv-6. For each network, any rate between 10% and 20% seemed reasonable.
 Across all convolutional pruning rates, the lottery ticket pattern continued to appear.

 H.5 LEARNING RATES (DROPOUT)

 In order to train the Conv-2, Conv-4, and Conv-6 architectures with dropout, we repeated the exercise
 from Section H.2 to select appropriate learning rates. Figure 36 shows the results of performing
 the iterative lottery ticket experiment on Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) with
 dropout and Adam at various learning rates. A network trained with dropout takes longer to learn, so
 we trained each architecture for three times as many iterations as in the experiments without dropout:
 60,000 iterations for Conv-2, 75,000 iterations for Conv-4, and 90,000 iterations for Conv-6. We
 iteratively pruned these networks at the rates determined in Section H.4.

 <
> + + Figure 34. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained + using SGD with momentum (0.9) at various learning rates. Each line represents a different learning + rate. The legend for each pair of graphs is above the graphs. Lines that are unstable and contain large + error bars (large vertical lines) indicate that some experiments failed to learn effectively, leading to + very low accuracy and very high early-stopping times; these experiments reduce the averages that the + lines trace and lead to much wider error bars. + + <
> + + Figure 35. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures with an + iterative pruning rate of 20% for fully-connected layers. Each line represents a different iterative + pruning rate for convolutional layers. + + + + The Conv-2 network proved to be difficult to consistently train with dropout. The top right graph + in Figure 36 contains wide error bars and low average accuracy for many learning rates, especially + early in the lottery ticket experiments. This indicates that some or all of the training runs failed to + learn; when they were averaged into the other results, they produced the aforementioned pattern + in the graphs. At learning rate 0.0001, none of the three trials learned productively until pruned to + more than 26.5%, at which point all three trials started learning. At learning rate 0.0002, some of the + trials failed to learn productively until several rounds of iterative pruning had passed. At learning + rate 0.0003, all three networks learned productively at every pruning level. At learning rate 0.0004, + one network occasionally failed to learn. We selected learning rate 0.0003, which seemed to allow + networks to learn productively most often while achieving among the highest initial accuracy. + It is interesting to note that networks that were unable to learn at a particular learning rate (for + example, 0.0001) eventually began learning after several rounds of the lottery ticket experiment (that + is, training, pruning, and resetting repeatedly). It is worth investigating whether this phenomenon + was entirely due to pruning (that is, removing any random collection of weights would put the + network in a configuration more amenable to learning) or whether training the network provided + useful information for pruning, even if the network did not show improved accuracy. + For both the Conv-4 and Conv-6 architectures, a slightly slower learning rate (0.0002 as opposed to + 0.0003) leads to the highest accuracy on the unpruned networks in addition to the highest sustained + accuracy and fastest sustained learning as the networks are pruned during the lottery ticket experiment. + With dropout, the unpruned Conv-4 architecture reaches an average validation accuracy of 77.6%, a + 2.7 percentage point improvement over the unpruned Conv-4 network trained without dropout and + one percentage point lower than the highest average validation accuracy attained by a winning ticket. + The dropout-trained winning tickets reach 82.6% average validation accuracy when pruned to 7.6%. + Early-stopping times improve by up to 1.58x (when pruned to 7.6%), a smaller improvement than + then 4.27x achieved by a winning ticket obtained without dropout. + With dropout, the unpruned Conv-6 architecture reaches an average validation accuracy of 81.3%, + an improvement of 2.2 percentage points over the accuracy without dropout; this nearly matches + the 81.5% average accuracy obtained by Conv-6 trained without dropout and pruned to 9.31%. + The dropout-trained winning tickets further improve upon these numbers, reaching 84.8% average + validation accuracy when pruned to 10.5%. Improvements in early-stopping times are less dramatic + than without dropout. a 1.5x average improvement when the network is pruned to 15.1%. 
+ At all learning rates we tested, the lottery ticket pattern generally holds for accuracy, with improve- + ments as the networks are pruned. However, not all learning rates show the decreases in early-stopping + times. To the contrary, none of the learning rates for Conv-2 show clear improvements in early- + stopping times as seen in the other lottery ticket experiments. Likewise, the faster learning rates for + Conv-4 and Conv-6 maintain the original early-stopping times until pruned to about 40%, at which + point early-stopping times steadily increase. + + H.6 PRUNING CONVOLUTIONS VS PRUNING FULLY-CONNECTED LAYERS + + Figure 37 shows the effect of pruning convolutions alone (green), fully-connected layers alone + (orange) and pruning both (blue). The x-axis measures the number of parameters remaining to + emphasize the relative contributions made by pruning convolutions and fully-connected layers to + the overall network. In all three cases, pruning convolutions alone leads to higher test accuracy + and faster learning; pruning fully-connected layers alone generally causes test accuracy to worsen + and learning to slow. However, pruning convolutions alone has limited ability to reduce the overall + parameter-count of the network, since fully-connected layers comprise 99%, 89%, and 35% of the + parameters in Conv-2, Conv-4, and Conv-6. + + <
> + + Figure 36. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery + ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained + using dropout and the Adam optimizer at various learning rates. Each line represents a different + learning rate. + + <
> 

 Figure 37. Early-stopping iteration and accuracy of the Conv-2 (top), Conv-4 (middle), and Conv-6
 (bottom) networks when only convolutions are pruned, only fully-connected layers are pruned, and
 both are pruned. The x-axis measures the number of parameters remaining, making it possible to
 see the relative contributions to the overall network made by pruning FC layers and convolutions
 individually.



 I HYPERPARAMETER EXPLORATION FOR VGG-19 AND RESNET-18 ON CIFAR10

 This Appendix accompanies the VGG-19 and Resnet-18 experiments in Section 4. It details the
 pruning scheme, training regimes, and hyperparameters that we use for these networks.

 I.1 GLOBAL PRUNING

 In our experiments with the Lenet and Conv-2/4/6 architectures, we separately prune a fraction of
 the parameters in each layer (layer-wise pruning). In our experiments with VGG-19 and Resnet-18,
 we instead prune globally; that is, we prune all of the weights in convolutional layers collectively
 without regard for the specific layer from which any weight originated.
 Figures 38 (VGG-19) and 39 (Resnet-18) compare the winning tickets found by global pruning
 (solid lines) and layer-wise pruning (dashed lines) for the hyperparameters from Section 4. When
 training VGG-19 with learning rate 0.1 and warmup to iteration 10,000, we find winning tickets when
 Pm ≥ 6.9% for layer-wise pruning vs. Pm ≥ 1.5% for global pruning. For other hyperparameters,
 accuracy similarly drops off sooner for layer-wise pruning than for global pruning. Global pruning
 also finds smaller winning tickets than layer-wise pruning for Resnet-18, but the difference is less
 extreme than for VGG-19.
 In Section 4, we discuss the rationale for the efficacy of global pruning on deeper networks. In
 summary, the layers in these deep networks have vastly different numbers of parameters (particularly
 severely so for VGG-19); if we prune layer-wise, we conjecture that layers with fewer parameters
 become bottlenecks on our ability to find smaller winning tickets.
 Regardless of whether we use layer-wise or global pruning, the patterns from Section 4 hold: at
 learning rate 0.1, iterative pruning finds winning tickets for neither network; at learning rate 0.01, the
 lottery ticket pattern reemerges; and when training with warmup to a higher learning rate, iterative
 pruning finds winning tickets. Figures 40 (VGG-19) and 41 (Resnet-18) present the same data as
 Figures 7 (VGG-19) and 8 (Resnet-18) from Section 4 with layer-wise pruning rather than global
 pruning. The graphs follow the same trends as in Section 4, but the smallest winning tickets are larger
 than those found by global pruning.

 I.2 VGG-19 DETAILS

 The VGG-19 architecture was first designed by Simonyan & Zisserman (2014) for ImageNet. The
 version that we use here was adapted by Liu et al. (2019) for CIFAR10. The network is structured
 as described in Figure 2: it has five groups of 3x3 convolutional layers, the first four of which are
 followed by max-pooling (stride 2) and the last of which is followed by average pooling. The network
 has one final dense layer connecting the result of the average-pooling to the output.
 We largely follow the training procedure for Resnet-18 described in Section I.3:

 We use the same train/test/validation split.
 We use the same data augmentation procedure.
 We use a batch size of 64.
 We use batch normalization.
 We use a weight decay of 0.0001.
 We use three stages of training at decreasing learning rates. We train for 160 epochs (112,480
 iterations), decreasing the learning rate by a factor of ten after 80 and 120 epochs.
 We use Gaussian Glorot initialization.

 We globally prune the convolutional layers of the network at a rate of 20% per iteration, and we do
 not prune the 5120 parameters in the output layer.
 Liu et al. (2019) use an initial learning rate of 0.1. We train VGG-19 with both this learning rate and
 a learning rate of 0.01.


 I.3 RESNET-18 DETAILS

 The Resnet-18 architecture was first introduced by He et al. (2016). The architecture comprises 20
 total layers as described in Figure 2: a convolutional layer followed by nine pairs of convolutional
 layers (with residual connections around the pairs), average pooling, and a fully-connected output
 layer.
 We follow the experimental design of He et al. (2016):

 We divide the training set into 45,000 training examples and 5,000 validation examples. We
 use the validation set to select hyperparameters in this appendix and the test set to evaluate
 in Section 4.
 We augment training data using random flips and random four pixel pads and crops.
 We use a batch size of 128.
 We use batch normalization.
 We use a weight decay of 0.0001.
 We train using SGD with momentum (0.9).
 We use three stages of training at decreasing learning rates. Our stages last for 20,000,
 5,000, and 5,000 iterations each, shorter than the 32,000, 16,000, and 16,000 used in He
 et al. (2016). Since each of our iterative pruning experiments requires training the network
 15-30 times consecutively, we select this abbreviated training schedule to make it possible
 to explore a wider range of hyperparameters.
 We use Gaussian Glorot initialization.

 We globally prune convolutions at a rate of 20% per iteration. We do not prune the 2560 parameters
 used to downsample residual connections or the 640 parameters in the fully-connected output layer,
 as they comprise such a small portion of the overall network.

 I.4 LEARNING RATE

 In Section 4, we observe that iterative pruning is unable to find winning tickets for VGG-19 and
 Resnet-18 at the typical, high learning rate used to train the network (0.1) but that it is able to do so
 at a lower learning rate (0.01). Figures 42 and 43 explore several other learning rates. In general,
 iterative pruning cannot find winning tickets at any rate above 0.01 for either network; for higher
 learning rates, the pruned networks with the original initialization perform no better than when
 randomly reinitialized.

 I.5 WARMUP ITERATION

 In Section 4, we describe how adding linear warmup to the initial learning rate makes it possible to
 find winning tickets for VGG-19 and Resnet-18 at higher learning rates (and, thereby, winning tickets
 that reach higher accuracy). In Figures 44 and 45, we explore the number of iterations k over which
 warmup should occur.
 For VGG-19, we were able to find values of k for which iterative pruning could identify winning
 tickets when the network was trained at the original learning rate (0.1). For Resnet-18, warmup made
 it possible to increase the learning rate from 0.01 to 0.03, but no further. When exploring values of k,
 we therefore use learning rate 0.1 for VGG-19 and 0.03 for Resnet-18.
 In general, the greater the value of k, the higher the accuracy of the eventual winning tickets.
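 The schedules referred to throughout this appendix combine step-wise learning rate decay with an
 optional linear warmup over the first k iterations. The helper below is a small sketch of such a
 schedule, assuming the decay-by-ten staging described above; the function name, the way warmup
 interacts with the decay milestones, and the example values are illustrative assumptions rather than
 the exact implementation used for these experiments.

    def learning_rate(step, base_lr, warmup_steps, milestones, decay_factor=0.1):
        """Linear warmup to base_lr over warmup_steps, then decay at each milestone."""
        if warmup_steps > 0 and step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        lr = base_lr
        for milestone in milestones:
            if step >= milestone:
                lr *= decay_factor
        return lr

    # Example: a Resnet-18-style schedule with warmup to 0.03 over k = 20,000 steps and
    # stages of 20,000 / 5,000 / 5,000 iterations (drops after 20,000 and 25,000 steps).
    lr = learning_rate(step=22500, base_lr=0.03, warmup_steps=20000,
                       milestones=(20000, 25000))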
 Resnet-18. For values of k below 5000, accuracy improves rapidly as k increases. This relationship
 reaches a point of diminishing returns above k = 5000. For the experiments in Section 4, we select
 k = 20,000, which achieves the highest validation accuracy.

 VGG-19. For values of k below 5000, accuracy improves rapidly as k increases. This relationship
 reaches a point of diminishing returns above k = 5000. For the experiments in Section 4, we select
 k = 10,000, as there is little benefit to larger values of k.

 <
> + + Figure 38. Validation accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively + pruned with global (solid) and layer-wise (dashed) pruning. + + <
> + + Figure 39. Validation accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively + pruned with global (solid) and layer-wise (dashed) pruning. + + <
> + + Figure 40. Test accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively pruned with + layer-wise pruning. This is the same as Figure 7, except with layer-wise pruning rather than global + pruning. + + <
> + + Figure 41. Test accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively pruned with + layer-wise pruning. This is the same as Figure 8 except with layer-wise pruning rather than global + pruning. + + <
> + + Figure 42. Validation accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively + pruned and trained with various learning rates. + + <
> + + Figure 43. Validation accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively + pruned and trained with various learning rates. + + <
> + + Figure 44. Validation accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively + pruned and trained with varying amounts of warmup at learning rate 0.03. + + <
> 

 Figure 45. Validation accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively
 pruned and trained with varying amounts of warmup at learning rate 0.1.
<> <> <>


<> <> <>
The State of Sparsity in Deep Neural Networks

Trevor Gale *1  Erich Elsen *2  Sara Hooker 1

Abstract

We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. Additionally, we repeat the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.

1. Introduction

Deep neural networks achieve state-of-the-art performance in a variety of domains including image classification (He et al., 2016), machine translation (Vaswani et al., 2017), and text-to-speech (van den Oord et al., 2016; Kalchbrenner et al., 2018). While model quality has been shown to scale with model and dataset size (Hestness et al., 2017), the resources required to train and deploy large neural networks can be prohibitive: state-of-the-art models for tasks like image classification and machine translation commonly have tens of millions of parameters, and require billions of floating-point operations to make a prediction for a single input sample.
Sparsity has emerged as a leading approach to address these challenges. By sparsity, we refer to the property that a subset of the model parameters have a value of exactly zero.2 With zero valued weights, any multiplications (which dominate neural network computation) can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. It has been shown empirically that deep neural networks can tolerate high levels of sparsity (Han et al., 2015; Narang et al., 2017; Ullrich et al., 2017), and this property has been leveraged to significantly reduce the cost associated with the deployment of deep neural networks, and to enable the deployment of state-of-the-art models in severely resource constrained environments (Theis et al., 2018; Kalchbrenner et al., 2018; Valin & Skoglund, 2018).
Over the past few years, numerous techniques for inducing sparsity have been proposed and the set of models and datasets used as benchmarks has grown too large to reasonably expect new approaches to explore them all.
In addition to the lack of standardization in modeling tasks, the distribution of benchmarks tends to slant heavily towards convolutional architectures and computer vision tasks, and the tasks used to evaluate new techniques are frequently not representative of the scale and complexity of real-world tasks where model compression is most useful. These characteristics make it difficult to come away from the sparsity literature with a clear understanding of the relative merits of different approaches.
In addition to practical concerns around comparing techniques, multiple independent studies have recently proposed that the value of sparsification in neural networks has been misunderstood (Frankle & Carbin, 2018; Liu et al., 2018). While both papers suggest that sparsification can be viewed as a form of neural architecture search, they disagree on what is necessary to achieve this. Specifically, Liu et al. (2018) re-train learned sparse topologies with a random weight initialization, whereas Frankle & Carbin (2018) posit that the exact random weight initialization used when the sparse architecture was learned is needed to match the test set performance of the model sparsified during optimization.
2 The term sparsity is also commonly used to refer to the proportion of a neural network's weights that are zero valued. Higher sparsity corresponds to fewer weights, and smaller computational and storage requirements. We use the term in this way throughout this paper.

In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and l0 regularization to models of this scale. While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification.3

2. Sparsity in Neural Networks

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks.
Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Ström, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017). Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b;a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.
3 https://bit.ly/2ExE8Yj

Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).
Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).
A key feature of a sparsity inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3. Evaluating Sparsification Techniques at Scale

As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and l0 regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing the lower bound of the compression-accuracy trade-off any method should be expected to achieve.
Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A.

3.1. Magnitude Pruning
Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model pruning library.4 This technique allows for masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user-specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).
4 https://bit.ly/2T8hBGn
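The gradual sparsification schedule mentioned above can be made concrete with a short sketch. The cubic ramp below follows the form proposed by Zhu & Gupta (2017), and the mask is recomputed periodically from the current weight magnitudes so that previously masked weights can reactivate. The function names and the NumPy formulation are illustrative assumptions; this is not the TensorFlow model pruning library's API.

    import numpy as np

    def target_sparsity(step, final_sparsity, begin_step, end_step, initial_sparsity=0.0):
        """Cubic ramp from initial_sparsity to final_sparsity between begin_step and end_step."""
        if step < begin_step:
            return initial_sparsity
        if step >= end_step:
            return final_sparsity
        progress = (step - begin_step) / float(end_step - begin_step)
        return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

    def magnitude_mask(weights, sparsity):
        """Keep the largest-magnitude weights and zero out the rest."""
        k = int(round(sparsity * weights.size))
        if k == 0:
            return np.ones_like(weights)
        threshold = np.sort(np.abs(weights).ravel())[k - 1]
        return (np.abs(weights) > threshold).astype(weights.dtype)

    # Applied every few hundred training steps:
    # mask = magnitude_mask(w, target_sparsity(step, final_sparsity=0.9,
    #                                          begin_step=0, end_step=100000))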
3.2. Variational Dropout
Variational dropout was originally proposed as a reinterpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).
Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model.

3.3. l0 Regularization
l0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.
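To make the stochastic gates more concrete, the sketch below samples a hard-concrete gate and evaluates the two quantities used during training and evaluation, following the formulation of Louizos et al. (2017b). The stretch limits and temperature shown are commonly used defaults and, like the function names, are assumptions rather than the exact values from our experiments.

    import numpy as np

    GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0  # stretch limits and temperature

    def sample_gate(log_alpha, rng):
        """Sample a gate z in [0, 1] that is exactly 0 or 1 with non-zero probability."""
        u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
        s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / BETA))
        return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

    def expected_l0(log_alpha):
        """Probability that each gate is non-zero; summing over gates gives the l0 penalty."""
        return 1.0 / (1.0 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA))))

    def test_time_gate(log_alpha):
        """Deterministic gate estimator used at evaluation time."""
        s = 1.0 / (1.0 + np.exp(-log_alpha))
        return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

The effective weight is the product of the dense weight and its gate, so weights whose test-time gate evaluates to zero can be removed from the network entirely. The gap between the sampled gate and the deterministic estimator is also what allows training-time and test-time sparsity levels to differ, a point that becomes important for the ResNet-50 results in Section 5.1.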
Table 1. Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

<
>

3.4. Random Pruning Baseline
For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned at each step at random rather than based on magnitude, and it does not allow pruned weights to reactivate. This technique is intended to represent a lower bound of the accuracy-sparsity trade-off curve.

3.5. Experimental Framework
For magnitude pruning, we used the TensorFlow model pruning library. We implemented variational dropout and l0 regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our l0 regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and l0 regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model pruning library to randomly select weights as opposed to sorting them based on magnitude.
For each model, we kept the number of training steps constant across all techniques and performed extensive hyperparameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and l0 regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments.
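As a companion to the description of variational dropout in Section 3.2, the sketch below shows the post-training sparsification step: weights whose learned dropout rate is high (large log alpha) are removed. The threshold of log alpha = 3 is the value used by Molchanov et al. (2017); the variable names and the small constant added for numerical stability are assumptions of this sketch.

    import numpy as np

    def variational_dropout_mask(theta, log_sigma2, threshold=3.0):
        """Drop weights dominated by noise: log_alpha = log(sigma^2) - log(theta^2)."""
        log_alpha = log_sigma2 - np.log(np.square(theta) + 1e-8)
        mask = (log_alpha < threshold).astype(theta.dtype)
        return theta * mask, mask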
<
>

Figure 1. Sparsity-BLEU trade-off curves for the Transformer.
Top: Pareto frontiers for each of the four sparsification techniques applied to the Transformer. Bottom: All experimental results with each technique. Despite the diversity of approaches, the relative performance of all three techniques is remarkably consistent. Magnitude pruning notably outperforms more complex techniques for high levels of sparsity.

4. Sparse Neural Machine Translation

We adapted the Transformer (Vaswani et al., 2017) model for neural machine translation to use these four sparsification techniques, and trained the model on the WMT 2014 English-German dataset. We sparsified all fully-connected layers and embeddings, which make up 99.87% of all of the parameters in the model (the other parameters coming from biases and layer normalization). The constant hyperparameters used for all experiments are listed in Table 1. We followed the standard training procedure used by Vaswani et al. (2017), but did not perform checkpoint averaging. This setup yielded a baseline BLEU score of 27.29 averaged across five runs.
We extensively tuned the remaining hyperparameters for each technique. Details on what hyperparameters we explored, and the results of what settings produced the best models, can be found in Appendix D.

4.1. Sparse Transformer Results & Analysis
All results for the Transformer are plotted in Figure 1. Despite the vast differences in these approaches, the relative performance of all three techniques is remarkably consistent. While l0 regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models. While all techniques were able to outperform the random pruning technique, randomly removing weights produces surprisingly reasonable results, which is perhaps indicative of the model's ability to recover from damage during optimization.
What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights for each layer. This is in stark contrast to variational dropout and l0 regularization, where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a non-uniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and l0 regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.
Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both l0 regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules, and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, l0 regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.
It is also important to note that these results maintain a constant number of training steps across all techniques and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than l0 regularization and variational dropout respectively.
While the standard Transformer training scheme produces excellent results for machine translation, it has been shown that training the model for longer can improve its performance by as much as 2 BLEU (Ott et al., 2018). Thus, when compared at a fixed training cost, magnitude pruning has a distinct advantage over these more complicated techniques.

<
> + +Figure 2. Average sparsity in Transformer layers. Distributions calculated on the top performing model at 90% sparsity for each technique. l0 regularization and variational dropout are able to learn non-uniform distributions of sparsity, while magnitude pruning induces user-specified sparsity distributions (in this case, uniform). + +Table 2. Constant hyperparameters for all RN50 experiments. + +<
> 

5. Sparse Image Classification

To benchmark these four sparsity techniques on a large-scale computer vision task, we integrated each method into ResNet-50 and trained the model on the ImageNet large-scale image classification dataset. We sparsified all convolutional and fully-connected layers, which make up 99.79% of all of the parameters in the model (the other parameters coming from biases and batch normalization).
The hyperparameters we used for all experiments are listed in Table 2. Each model was trained for 128,000 iterations with a batch size of 1024 images, stochastic gradient descent with momentum, and the standard learning rate schedule (see Appendix E.1). This setup yielded a baseline top-1 accuracy of 76.69% averaged across three runs. We trained each model with 8-way data parallelism across 8 accelerators. Due to the extra parameters and operations required for variational dropout, the model was unable to fit into device memory in this configuration. For all variational dropout experiments, we used a per-device batch size of 32 images and scaled the model over 32 accelerators.

5.1. ResNet-50 Results & Analysis
Figure 3 shows results for magnitude pruning, variational dropout, and random pruning applied to ResNet-50. Surprisingly, we were unable to produce sparse ResNet-50 models with l0 regularization that did not significantly damage model quality. Across hundreds of experiments, our models were either able to achieve full test set performance with no sparsification, or sparsification with test set performance akin to random guessing. Details on all hyperparameter settings explored are included in Appendix E.

Figure 3. Sparsity-accuracy trade-off curves for ResNet-50.
Top: Pareto frontiers for variational dropout, magnitude pruning, and random pruning applied to ResNet-50. Bottom: All experimental results with each technique. We observe large variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50. Magnitude pruning and variational dropout achieve comparable performance for most sparsity levels, with variational dropout achieving the best results for high sparsity levels.

This result is particularly surprising given the success of l0 regularization on Transformer. One nuance of the l0 regularization technique of Louizos et al. (2017b) is that the model can have varying sparsity levels between the training and test-time versions of the model. At training time, a parameter with a dropout rate of 10% will be zero 10% of the time when sampled from the hard-concrete distribution. However, under the test-time parameter estimator, this weight will be non-zero.5 Louizos et al. (2017b) reported results applying l0 regularization to a wide residual network (WRN) (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset, and noted that they observed small accuracy loss at as low as 8% reduction in the number of parameters during training. Applying our weight-level l0 regularization implementation to WRN produces a model with comparable training time sparsity, but with no sparsity in the test-time parameters. For models that achieve test-time sparsity, we observe significant accuracy degradation on CIFAR-10. This result is consistent with our observation for l0 regularization applied to ResNet-50 on ImageNet.
The variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50 is striking.
While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind l0 regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces models on-par or better than magnitude pruning, and that l0 regularization is not able to produce sparse models at all. Variational dropout achieved particularly notable results in the high sparsity range, maintaining a top-1 accuracy over 70% with less than 4% of the parameters of a standard ResNet-50.
5 The fraction of time a parameter is set to zero during training depends on other factors, e.g. the β parameter of the hard-concrete distribution. The general point holds, however: the training and test-time sparsities are not necessarily equivalent, and there exists some dropout rate threshold below which a weight that is sometimes zero during training will be non-zero at test-time.

The distribution of sparsity across different layer types in the best variational dropout and magnitude pruning models at 95% sparsity is plotted in Figure 4. While we kept sparsity constant across all layers for magnitude and random pruning, variational dropout significantly reduces the amount of sparsity induced in the first and last layers of the model.

Figure 4. Average sparsity in ResNet-50 layers. Distributions calculated on the top performing model at 95% sparsity for each technique. Variational dropout is able to learn non-uniform distributions of sparsity, decreasing sparsity in the input and output layers that are known to be disproportionately important to model quality.

It has been observed that the first and last layers are often disproportionately important to model quality (Han et al., 2015; Bellec et al., 2017). In the case of ResNet-50, the first convolution comprises only .037% of all the parameters in the model. At 98% sparsity the first layer has only 188 non-zero parameters, for an average of less than 3 parameters per output feature map. With magnitude pruning uniformly sparsifying each layer, it is surprising that it is able to achieve any test set performance at all with so few parameters in the input convolution.
While variational dropout is able to learn to distribute sparsity non-uniformly across the layers, it comes at a significant increase in resource requirements. For ResNet-50 trained with variational dropout we observed a greater than 2x increase in memory consumption. When scaled across 32 accelerators, ResNet-50 trained with variational dropout completed training in 9.75 hours, compared to ResNet-50 with magnitude pruning finishing in 12.50 hours on only 8 accelerators. Scaled to a 4096 batch size and 32 accelerators, ResNet-50 with magnitude pruning can complete the same number of epochs in just 3.15 hours.

Figure 5. Sparsity-accuracy trade-off curves for ResNet-50 with modified sparsification scheme. Altering the distribution of sparsity across the layers and increasing training time yield significant improvement for magnitude pruning.

5.2. Pushing the Limits of Magnitude Pruning
Given that a uniform distribution of sparsity is suboptimal, and given the significantly smaller resource requirements for applying magnitude pruning to ResNet-50, it is natural to wonder how well magnitude pruning could perform if we were to distribute the non-zero weights more carefully and increase training time.
To understand the limits of the magnitude pruning heuristic, we modify our ResNet-50 training setup to leave the first convolutional layer fully dense, and only prune the final fully-connected layer to 80% sparsity. This heuristic is reasonable for ResNet-50, as the first layer makes up a small fraction of the total parameters in the model and the final layer makes up only .03% of the total FLOPs. While tuning the magnitude pruning ResNet-50 models, we observed that the best models always started and ended pruning during the third learning rate phase, before the second learning rate drop. To take advantage of this, we increase the number of training steps by 1.5x by extending this learning rate region. Results for ResNet-50 trained with this scheme are plotted in Figure 5.
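The modified sparsification scheme amounts to a simple re-allocation of the global sparsity budget across layers. The helper below sketches that bookkeeping: some layers are kept dense, the final fully-connected layer is fixed at 80% sparsity, and the remaining layers share a uniform rate chosen to hit the global target. The layer names and the function itself are illustrative assumptions, not the configuration code used for these experiments.

    def layer_sparsities(param_counts, global_sparsity, dense_layers=("first_conv",),
                         fixed_sparsity=None):
        """Assign per-layer sparsities that together meet a global sparsity target."""
        if fixed_sparsity is None:
            fixed_sparsity = {"final_fc": 0.80}
        total_params = sum(param_counts.values())
        target_zeros = global_sparsity * total_params
        fixed_zeros = sum(param_counts[n] * s for n, s in fixed_sparsity.items())
        free = [n for n in param_counts
                if n not in dense_layers and n not in fixed_sparsity]
        free_params = sum(param_counts[n] for n in free)
        uniform = min(1.0, max(0.0, (target_zeros - fixed_zeros) / free_params))
        sparsities = {n: 0.0 for n in dense_layers}
        sparsities.update(fixed_sparsity)
        sparsities.update({n: uniform for n in free})
        return sparsities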
With these modifications, magnitude pruning outperforms variational dropout at all but the highest sparsity levels while still using fewer resources. However, variational dropout's performance in the high sparsity range is particularly notable. With very low amounts of non-zero weights, we find it likely that the model's performance on the test set is closely tied to the precise allocation of weights across the different layers, and that variational dropout's ability to learn this distribution enables it to better maintain accuracy at high sparsity levels. This result indicates that efficient sparsification techniques that are able to learn the distribution of sparsity across layers are a promising direction for future work.
It is also worth noting that these changes produced models at 80% sparsity with a top-1 accuracy of 76.52%, only .17% off our baseline ResNet-50 accuracy and .41% better than the results reported by He et al. (2018), without the extra complexity and computational requirements of their reinforcement learning approach. This represents a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 trained on ImageNet.

6. Sparsification as Architecture Search

While sparsity is traditionally thought of as a model compression technique, two independent studies have recently suggested that the value of sparsification in neural networks is misunderstood, and that once a sparse topology is learned it can be trained from scratch to the full performance achieved when sparsification was performed jointly with optimization.
Frankle & Carbin (2018) posited that over-parameterized neural networks contain small, trainable subsets of weights, deemed "winning lottery tickets". They suggest that sparsity inducing techniques are methods for finding these sparse topologies, and that once found the sparse architectures can be trained from scratch with the same weight initialization that was used when the sparse architecture was learned. They demonstrated that this property holds across different convolutional neural networks and multi-layer perceptrons trained on the MNIST and CIFAR-10 datasets.
Liu et al. (2018) similarly demonstrated this phenomenon for a number of activation sparsity techniques on convolutional neural networks, as well as for weight-level sparsity learned with magnitude pruning. However, they demonstrate this result using a random initialization during re-training.
The implications of being able to train sparse architectures from scratch once they are learned are large: once a sparse topology is learned, it can be saved and shared as with any other neural network architecture. Re-training can then be done fully sparse, taking advantage of sparse linear algebra to greatly accelerate time-to-solution.
However, the combination of these two studies does not clearly establish how this potential is to be realized.
Beyond the question of whether or not the original random weight initialization is needed, both studies only explore convolutional neural networks (and small multi-layer perceptrons in the case of Frankle & Carbin (2018)). The majority of experiments in both studies also limited their analyses to the MNIST, CIFAR-10, and CIFAR-100 datasets. While these are standard benchmarks for deep learning models, they are not indicative of the complexity of real-world tasks where model compression is most useful. Liu et al. (2018) do explore convolutional architectures on the ImageNet dataset, but only at two relatively low sparsity levels (30% and 60%). They also note that weight-level sparsity on ImageNet is the only case where they are unable to reproduce the full accuracy of the pruned model.

<
> 

Figure 6. Scratch and lottery ticket experiments with magnitude pruning. Top: Results with Transformer. Bottom: Results with ResNet-50. Across all experiments, training from scratch using a learned sparse architecture is unable to reproduce the performance of models trained with sparsification as part of the optimization process.

To clarify the questions surrounding the idea of sparsification as a form of neural architecture search, we repeat the experiments of Frankle & Carbin (2018) and Liu et al. (2018) on ResNet-50 and Transformer. For each model, we explore the full range of sparsity levels (50%-98%) and compare to our well-tuned models from the previous sections.

6.1. Experimental Framework
The experiments of Liu et al. (2018) encompass taking the final learned weight mask from a magnitude pruning model, randomly re-initializing the weights, and training the model with the normal training procedure (i.e., learning rate, number of iterations, etc.). To account for the presence of sparsity at the start of training, they scale the variance of the initial weight distribution by the number of non-zeros in the matrix. They additionally train a variant where they increase the number of training steps (up to a factor of 2x) such that the re-trained model uses approximately the same number of FLOPs during training as a model trained with sparsification as part of the optimization process. They refer to these two experiments as "scratch-e" and "scratch-b" respectively.
Frankle & Carbin (2018) follow a similar procedure, but use the same weight initialization that was used when the sparse weight mask was learned and do not perform the longer training time variant.
For our experiments, we repeat the scratch-e, scratch-b and lottery ticket experiments with magnitude pruning on Transformer and ResNet-50. For scratch-e and scratch-b, we also train variants that do not alter the initial weight distribution. For the Transformer, we re-trained five replicas of the best magnitude pruning hyperparameter settings at each sparsity level and saved the weight initialization and final sparse weight mask. For each of the five learned weight masks, we train five identical replicas for the scratch-e, scratch-b, scratch-e with augmented initialization, scratch-b with augmented initialization, and the lottery ticket experiments. For ResNet-50, we followed the same procedure with three re-trained models and three replicas at each sparsity level for each of the five experiments. Figure 6 plots the averages and min/max of all experiments at each sparsity level.6
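The re-training variants described above differ only in how the weights behind the fixed sparse mask are initialized. The sketch below summarizes that difference; the 1/sqrt(density) scaling is one way to realize the variance scaling attributed to Liu et al. (2018), and the function name, variant labels, and Gaussian base initializer are assumptions of this sketch rather than the exact procedure used in our experiments.

    import numpy as np

    def retraining_init(original_init, mask, variant, rng, base_std=0.05):
        """Produce starting weights for re-training a fixed sparse topology."""
        if variant == "lottery_ticket":
            # Frankle & Carbin (2018): reuse the exact initialization from the original run.
            return original_init * mask
        # "Scratch" variants: draw a fresh random initialization.
        fresh = rng.normal(0.0, base_std, size=mask.shape)
        if variant == "scratch_augmented":
            # Scale up the variance to account for the fraction of weights that survive.
            density = mask.sum() / mask.size
            fresh = fresh / np.sqrt(density)
        return fresh * mask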
6.2. Scratch and Lottery Ticket Results & Analysis
Across all of our experiments, we observed that training from scratch using a learned sparse architecture is not able to match the performance of the same model trained with sparsification as part of the optimization process.
Across both models, we observed that doubling the number of training steps did improve the quality of the results for the scratch experiments, but was not sufficient to match the test set performance of the magnitude pruning baseline. As sparsity increased, we observed that the deviation between the models trained with magnitude pruning and those trained from scratch increased. For both models, we did not observe a benefit from using the augmented weight initialization for the scratch experiments.
For ResNet-50, we experimented with four different learning rate schemes for the scratch-b experiments. We found that scaling each learning rate region to double the number of epochs produced the best results by a wide margin. These results are plotted in Figure 6. Results for the ResNet-50 scratch-b experiments with the other learning rate variants are included with our release of hyperparameter tuning results.
For the lottery ticket experiments, we were not able to replicate the phenomenon observed by Frankle & Carbin (2018). The key difference between our experiments is the complexity of the tasks and scale of the models, and it seems likely that this is the main factor contributing to our inability to train these architectures from scratch.
For the scratch experiments, our results are consistent with the negative result observed by Liu et al. (2018) for ImageNet and ResNet-50 with unstructured weight pruning. By replicating the scratch experiments at the full range of sparsity levels, we observe that the quality of the models degrades relative to the magnitude pruning baseline as sparsity increases. For unstructured weight sparsity, it seems likely that the phenomenon observed by Liu et al. (2018) was produced by a combination of low sparsity levels and small-to-medium sized tasks. We'd like to emphasize that this result is only for unstructured weight sparsity, and that prior work (Liu et al., 2018) provides strong evidence that activation pruning behaves differently.
6 Two of the 175 Transformer experiments failed to train from scratch at all and produced BLEU scores less than 1.0. We omit these outliers in Figure 6.

7. Limitations of This Study

Hyperparameter exploration. For all techniques and models, we carefully hand-tuned hyperparameters and performed extensive sweeps encompassing thousands of experiments over manually identified ranges of values. However, the number of possible settings vastly outnumbers the set of values that can be practically explored, and we cannot eliminate the possibility that some techniques significantly outperform others under settings we did not try.
Neural architectures and datasets. Transformer and ResNet-50 were chosen as benchmark tasks to represent a cross section of large-scale deep learning tasks with diverse architectures. We cannot exclude the possibility that some techniques achieve consistently high performance across other architectures. More models and tasks should be thoroughly explored in future work.

8. Conclusion

In this work, we performed an extensive evaluation of three state-of-the-art sparsification techniques on two large-scale learning tasks. Notwithstanding the limitations discussed in Section 7, we demonstrated that complex techniques shown to yield state-of-the-art compression on small datasets perform inconsistently, and that simple heuristics can achieve comparable or better results on a reduced computational budget. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 with only magnitude pruning and highlight promising directions for research in sparsity inducing techniques.
Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance as a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression.
As such, we open-source our code, checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work.

Acknowledgements
We would like to thank Benjamin Caine, Jonathan Frankle, Raphael Gontijo Lopes, Sam Greydanus, and Keren Gu for helpful discussions and feedback on drafts of this paper.

References
Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. Deep Rewiring: Training Very Sparse Deep Networks. CoRR, abs/1711.05136, 2017.
Collins, M. D. and Kohli, P. Memory Bounded Deep Convolutional Networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442.
Dai, B., Zhu, C., and Wipf, D. P. Compressing Neural Networks using the Variational Information Bottleneck. CoRR, abs/1802.10399, 2018.
Frankle, J. and Carbin, M. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. CoRR, abs/1803.03635, 2018. URL http://arxiv.org/abs/1803.03635.
Gray, S., Radford, A., and Kingma, D. P. Block-sparse GPU kernels. https://blog.openai.com/block-sparse-gpu-kernels/, 2017.
Guo, Y., Yao, A., and Chen, Y. Dynamic Network Surgery for Efficient DNNs. In NIPS, 2016.
Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both Weights and Connections for Efficient Neural Networks. In NIPS, pp. 1135–1143, 2015.
Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, pp. 164–171. Morgan Kaufmann, 1992.
He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L., and Han, S. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815–832, 2018.
Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017.
Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2415–2424, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal Brain Damage. In NIPS, pp. 598–605. Morgan Kaufmann, 1989.
Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In NIPS, pp. 2178–2188, 2017.
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2755–2763, 2017.
Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the Value of Network Pruning. CoRR, abs/1810.05270, 2018.
Louizos, C., Ullrich, K., and Welling, M. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 3290–3300, 2017a.
Louizos, C., Welling, M., and Kingma, D. P. Learning Sparse Neural Networks through L0 Regularization. CoRR, abs/1712.01312, 2017b.
Luo, J., Wu, J., and Lin, W. Thinet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5068–5076, 2017.
Mitchell, T. J. and Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science. Nature Communications, 2018.
Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2498–2507, 2017.
Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR, abs/1611.06440, 2016.
Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. Exploring Sparsity in Recurrent Neural Networks. CoRR, abs/1704.05119, 2017.
Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 1–9, 2018.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pp. 1278–1286. JMLR.org, 2014.
Ström, N. Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks. In EUROSPEECH, 1997.
Theis, L., Korshunova, I., Tejani, A., and Huszár, F. Faster gaze prediction with dense networks and Fisher pruning. CoRR, abs/1801.05787, 2018. URL http://arxiv.org/abs/1801.05787.
Ullrich, K., Meeds, E., and Welling, M. Soft Weight-Sharing for Neural Network Compression. CoRR, abs/1702.04008, 2017.
Valin, J. and Skoglund, J. LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR, abs/1810.11846, 2018. URL http://arxiv.org/abs/1810.11846.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, pp. 125, 2016.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010, 2017.
Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017. URL http://arxiv.org/abs/1710.01878.

The State of Sparsity in Deep Neural Networks: Appendix

A. Overview of Sparsity Inducing Techniques

Here we provide a more detailed review of the three sparsity techniques we benchmarked.
A.1. Magnitude Pruning
Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. Many variants have been proposed (Collins & Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017), with the key differences lying in when weights are removed, whether weights should be sorted to remove a precise proportion or thresholded based on a fixed or decaying value, and whether or not weights that have been pruned still receive gradient updates and have the potential to return after being pruned.
Han et al. (2015) use iterative magnitude pruning and re-training to progressively sparsify a model. The target model is first trained to convergence, after which a portion of weights are removed and the model is re-trained with these weights fixed to zero. This process is repeated until the target sparsity is achieved. Guo et al. (2016) improve on this approach by allowing masked weights to still receive gradient updates, enabling the network to recover from incorrect pruning decisions during optimization. They achieve higher compression rates and interleave pruning steps with gradient update steps to avoid expensive re-training. Zhu & Gupta (2017) similarly allow gradient updates to masked weights, and make use of a gradual sparsification schedule with sorting-based weight thresholding to maintain accuracy while achieving a user-specified level of sparsification.
It's worth noting that magnitude pruning can easily be adapted to induce block or activation level sparsity by removing groups of weights based on their p-norm, average, max, or other statistics. Variants have also been proposed that maintain a constant level of sparsity during optimization to enable accelerated training (Mocanu et al., 2018).

A.2. Variational Dropout
Consider the setting of a dataset D of N i.i.d. samples (x, y) and a standard classification problem where the goal is to learn the parameters w of the conditional probability p(y|x, w). Bayesian inference combines some initial belief over the parameters w in the form of a prior distribution p(w) with observed data D into an updated belief over the parameters in the form of the posterior distribution p(w|D). In practice, computing the true posterior using Bayes' rule is computationally intractable and good approximations are needed. In variational inference, we optimize the parameters <> of some parameterized model <> such that <> is a close approximation to the true posterior distribution p(w|D) as measured by the Kullback-Leibler divergence between the two distributions. The divergence of our approximate posterior from the true posterior is minimized in practice by maximizing the variational lower-bound

<>

where <>

Using the Stochastic Gradient Variational Bayes (SGVB) (Kingma et al., 2015) algorithm to optimize this bound, <> reduces to the standard cross-entropy loss, and the KL divergence between our approximate posterior and prior over the parameters serves as a regularizer that enforces our initial belief about the parameters w.
In the standard formulation of variational dropout, we assume the weights are drawn from a fully-factorized Gaussian approximate posterior.

<>

where <> and <> are neural network parameters.
For each training step, we sample weights from this distribution and use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) to differentiate the loss w.r.t. the parameters through the sampling operation. Given that the weights are normally distributed, the distribution of the activations B after a linear operation like matrix multiplication or convolution is also Gaussian and can be calculated in closed form 7.

<>

with <> and <> where <> are the inputs to the layer.

7 We ignore correlation in the activations, as is done by Molchanov et al. (2017).

Thus, rather than sample weights, we can directly sample the activations at each layer. This step is known as the local reparameterization trick, and was shown by Kingma et al. (2015) to reduce the variance of the gradients relative to the standard formulation, in which a single set of sampled weights must be shared for all samples in the input batch for efficiency. Molchanov et al. (2017) showed that the variance of the gradients could be further reduced by using an additive noise reparameterization, where we define a new parameter

<>

Under this parameterization, we directly optimize the mean and variance of the neural network parameters.
Under the assumption of a log-uniform prior on the weights w, the KL divergence component of our objective function <> can be accurately approximated (Molchanov et al., 2017):

<>

After training a model with variational dropout, the weights with the largest α values can be removed. For all their experiments, Molchanov et al. (2017) removed weights with log α larger than 3.0, which corresponds to a dropout rate greater than 95%. Although they demonstrated good results, it is likely that the optimal log α threshold varies across different models and even different hyperparameter settings of the same model. We address this question in our experiments.

A.3. l0 Regularization
To optimize the l0-norm, we reparameterize the model weights θ as the product of a weight and a random variable drawn from the hard-concrete distribution.

<> where <> and <>

In this formulation, the <> parameter that controls the position of the hard-concrete distribution (and thus the probability that zj is zero) is optimized with gradient descent. <> and <> are fixed parameters that control the shape of the hard-concrete distribution. <> controls the curvature or temperature of the hard-concrete probability density function, and <> and <> stretch the distribution s.t. zj takes value 0 or 1 with non-zero probability.
On each training iteration, zj is sampled from this distribution and multiplied with the standard neural network weights. The expected l0-norm LC can then be calculated using the cumulative distribution function of the hard-concrete distribution and optimized directly with stochastic gradient descent.

<>

At test time, Louizos et al. (2017b) use the following estimate for the model parameters.

<>

Interestingly, Louizos et al. (2017b) showed that their objective function under the l0 penalty is a special case of a variational lower-bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior.

B. Variational Dropout Implementation Verification

To verify our implementation of variational dropout, we applied it to LeNet-300-100 and LeNet-5-Caffe on MNIST and compared our results to the original paper (Molchanov et al., 2017). We matched our hyperparameters to those used in the code released with the paper 8.
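For concreteness, the sketch below illustrates the additive-noise parameterization of Sec. A.2 and the log α thresholding used to remove weights; the layer shape, initialization, and helper names are illustrative assumptions rather than the benchmarked implementation.

import numpy as np

rng = np.random.default_rng(0)

# Variational parameters: per-weight mean theta and log-variance log_sigma2
# (the additive noise reparameterization optimizes these directly).
theta = rng.normal(scale=0.1, size=(784, 300))   # hypothetical layer shape
log_sigma2 = np.full_like(theta, -10.0)          # a common initialization

def sample_weights():
    # Reparameterization trick: w = theta + sigma * eps, with eps ~ N(0, 1).
    eps = rng.standard_normal(theta.shape)
    return theta + np.exp(0.5 * log_sigma2) * eps

def log_alpha(eps=1e-8):
    # alpha = sigma^2 / theta^2, so log alpha = log sigma^2 - log theta^2.
    return log_sigma2 - np.log(np.square(theta) + eps)

def sparsify(threshold=3.0):
    # Weights with log alpha above the threshold (dropout rate > ~95%) are removed.
    keep = log_alpha() <= threshold
    return theta * keep, 1.0 - keep.mean()

w = sample_weights()                  # weights used for one training step
pruned_theta, sparsity = sparsify()   # post-training pruning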
All results are listed in Table 3.

Table 3. Variational Dropout MNIST Reproduction Results.

<
> 

Our baseline LeNet-300-100 model achieved test set accuracy of 98.42%, slightly higher than the baseline of 98.36% reported in (Molchanov et al., 2017). Applying our variational dropout implementation to LeNet-300-100 with these hyperparameters produced a model with 97.52% global sparsity and 98.42% test accuracy.

8 https://github.com/ars-ashuha/variational-dropout-sparsifies-dnn

<
> 

Figure 7. Forward pass FLOPs for WRN-28-10 trained with l0 regularization. Our implementation achieves FLOPs reductions comparable to those reported in Louizos et al. (2017b).

The original paper produced a model with 98.57% global sparsity and 98.08% test accuracy. While our model achieves 0.34% higher test accuracy with 1% lower sparsity, we believe the discrepancy is mainly due to differences in our software packages: the authors of (Molchanov et al., 2017) used Theano and Lasagne for their experiments, while we use TensorFlow.
Given that our model achieves higher accuracy, we can decrease the log α threshold to trade accuracy for more sparsity. With a log α threshold of 2.0, our model achieves 98.5% global sparsity with a test set accuracy of 98.40%. With a log α threshold of 0.1, our model achieves 99.1% global sparsity with 98.13% test set accuracy, exceeding the sparsity and accuracy of the originally published results.
On LeNet-5-Caffe, our implementation achieved a global sparsity of 99.29% with a test set accuracy of 99.26%, versus the originally published results of 99.6% sparsity with 99.25% accuracy. Lowering the log α threshold to 2.0, our model achieves 99.5% sparsity with 99.25% test accuracy.

C. l0 Regularization Implementation Verification

The original l0 regularization paper uses a modified version of the proposed technique for inducing group sparsity in models, so our weight-level implementation is not directly comparable. However, to verify our implementation we trained a Wide ResNet (WRN) (Zagoruyko & Komodakis, 2016) on CIFAR-10 and compared results to those reported in the original publication for group sparsity.
As done by Louizos et al. (2017b), we apply l0 to the first convolutional layer in the residual blocks (i.e., where dropout would normally be used). We use the weight decay formulation for the re-parameterized weights, and scale the weight decay coefficient to maintain the same initial length scale of the parameters. We use the same batch size of 128 samples and the same initial <>, and train our model on a single GPU.
Our baseline WRN-28-10 implementation trained on CIFAR-10 achieved a test set accuracy of 95.45%. Using our l0 regularization implementation and an l0-norm weight of 0.0003, we trained a model that achieved 95.34% accuracy on the test set while achieving a consistent training-time FLOPs reduction comparable to that reported by Louizos et al. (2017b). The floating-point operations (FLOPs) required to compute the forward pass over the course of training WRN-28-10 with l0 are plotted in Figure 7.
During our re-implementation of the WRN experiments from Louizos et al. (2017b), we identified errors in the original publication's FLOP calculations that caused the number of floating-point operations in WRN-28-10 to be miscalculated. We've contacted the authors, and hope to resolve this issue to clarify their performance results.

D. Sparse Transformer Experiments

D.1. Magnitude Pruning Details
For our magnitude pruning experiments, we tuned four key hyperparameters: the starting iteration of the sparsification process, the ending iteration of the sparsification process, the frequency of pruning steps, and the combination of other regularizers (dropout and label smoothing) used during training. We trained models with 7 different target sparsities: 50%, 60%, 70%, 80%, 90%, 95%, and 98%. At each of these sparsity levels, we tried pruning frequencies of 1000 and 10000 steps.
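For reference, the following is a minimal sketch of gradual magnitude pruning with the cubic sparsity schedule of Zhu & Gupta (2017), the same cubic function referenced in Sec. D.2; the step counts and layer shape here are illustrative, not the exact settings swept in these experiments.

import numpy as np

def target_sparsity(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    # Cubic schedule of Zhu & Gupta (2017): sparsity ramps quickly at first and
    # flattens out as the end of the pruning window approaches.
    if step < begin_step:
        return 0.0
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / float(end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    # Keep the largest-magnitude weights; zero out roughly the smallest `sparsity` fraction.
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.abs(weights) > threshold

# Illustration: a 90% target, pruning every 10000 steps between steps 100000 and 300000.
w = np.random.default_rng(0).normal(size=(512, 512))
for step in range(0, 300001, 10000):
    s = target_sparsity(step, begin_step=100000, end_step=300000, final_sparsity=0.9)
    w = w * magnitude_mask(w, s)
# In the actual training runs, masked weights continue to receive gradient updates
# between pruning steps, so they can recover before the next mask is computed.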
During preliminary experiments we identified that the best settings for the training step at which to stop pruning were typically closer to the end of training. Based on this insight, we explored every possible combination of start and end points for the sparsity schedule in increments of 100000 steps with an ending step of 300000 or greater.
By default, the Transformer uses dropout with a dropout rate of 10% on the input to the encoder, decoder, and before each layer and performs label smoothing with a smoothing parameter of <>. We found that decreasing these other regularizers produced higher quality models in the mid to high sparsity range. For each hyperparameter combination, we tried three different regularization settings: standard label smoothing and dropout, label smoothing only, and no regularization.

D.2. Variational Dropout Details
For the Transformer trained with variational dropout, we extensively tuned the coefficient for the KL divergence component of the objective function to find models that achieved high accuracy with sparsity levels in the target range. We found that KL divergence weights in the range <>, where N is the number of samples in the training set, produced models in our target sparsity range. Molchanov et al. (2017) noted difficulty training some models from scratch with variational dropout, as large portions of the model adopt high dropout rates early in training before the model can learn a useful representation from the data. To address this issue, they use a gradual ramp-up of the KL divergence weight, linearly increasing the regularizer coefficient until it reaches the desired value.
For our experiments, we explored using a constant regularizer weight, linearly increasing the regularizer weight, and also increasing the regularizer weight following the cubic sparsity function used with magnitude pruning. For the linear and cubic weight schedules, we tried each combination of possible start and end points in increments of 100000 steps. For each hyperparameter combination, we also tried the three different combinations of dropout and label smoothing as with magnitude pruning. For each trained model, we evaluated the model with 11 <> thresholds in the range [0, 5]. For all experiments, we initialized all <> parameters to the constant value <>.

D.3. l0 Regularization Details
For Transformers trained with l0 regularization, we similarly tuned the coefficient for the l0-norm in the objective function. We observed that much higher magnitude regularization coefficients were needed to produce models with the same sparsity levels relative to variational dropout. We found that l0-norm weights in the range <> produced models in our target sparsity range.
For all experiments, we used the default settings for the parameters of the hard-concrete distribution: <>, and <>. We initialized the <> parameters to 2.197, corresponding to a 10% dropout rate.
For each hyperparameter setting, we explored the three regularizer coefficient schedules used with variational dropout and each of the three combinations of dropout and label smoothing.

D.4. Random Pruning Details
We identified in preliminary experiments that random pruning typically produces the best results by starting and ending pruning early and allowing the model to finish the rest of the training steps with the final sparse weight mask.
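For reference, the following is a minimal sketch of the hard-concrete gates used in the l0 experiments of Secs. A.3 and D.3. The constants shown are commonly used defaults from Louizos et al. (2017b) and stand in for the values elided above, and the layer shape is illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Hard-concrete shape constants; assumed defaults, not necessarily the exact
# (elided) settings used in these experiments.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_gate(log_alpha):
    # Stretched concrete sample, hard-thresholded into [0, 1].
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def expected_l0(log_alpha):
    # Probability that each gate is non-zero; the sum gives the expected l0 penalty.
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)).sum()

def test_time_gate(log_alpha):
    # Deterministic gate estimate used at evaluation time.
    return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

theta = rng.normal(scale=0.1, size=(512, 512))   # hypothetical layer
log_alpha = np.full_like(theta, 2.197)           # ~10% initial dropout rate
w_train = theta * sample_gate(log_alpha)         # weights used in a training step
w_test = theta * test_time_gate(log_alpha)       # weights used at test time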
For the random pruning experiments, we explored all hyperparameter combinations that we explored with magnitude pruning, and also included start/end pruning step combinations with an end step of less than 300000.

E. Sparse ResNet-50

E.1. Learning Rate
For all experiments, we used the learning rate scheme used by the official TensorFlow ResNet-50 implementation 9. With our batch size of 1024, this includes a linear ramp-up for 5 epochs to a learning rate of 0.4 followed by learning rate drops by a factor of 0.1 at epochs 30, 60, and 80.

9 https://bit.ly/2Wd2Lk0

E.2. Magnitude Pruning Details
For magnitude pruning on ResNet-50, we trained models with a target sparsity of 50%, 70%, 80%, 90%, 95%, and 98%. At each sparsity level, we tried starting pruning at steps 8k, 20k, and 40k. For each potential starting point, we tried ending pruning at steps 68k, 76k, and 100k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k steps and explored training with and without label smoothing. During preliminary experiments, we observed that removing weight decay from the model consistently caused significant decreases in test accuracy. Thus, for all hyperparameter combinations, we left weight decay on with the standard coefficient.
For a target sparsity of 98%, we observed that very few hyperparameter combinations were able to complete training without failing due to numerical issues. Out of all the hyperparameter configurations we tried, only a single model was able to complete training without erroring from the presence of NaNs. As explained in the main text, at high sparsity levels the first layer of the model has very few non-zero parameters, leading to instability during training and low test set performance. Pruned ResNet-50 models with the first layer left dense did not exhibit these issues.

E.3. Variational Dropout Details
For variational dropout applied to ResNet-50, we explored the same combinations of start and end points for the KL divergence weight ramp-up as we did for the start and end points of magnitude pruning. In the Transformer experiments, we did not observe a significant gain from using a cubic KL divergence weight ramp-up schedule, and thus only explored the linear ramp-up for ResNet-50. For each combination of start and end points for the KL divergence weight, we explored 9 different coefficients for the KL divergence loss term: 0.01/N, 0.03/N, 0.05/N, 0.1/N, 0.3/N, 0.5/N, 1/N, 10/N, and 100/N.
Contrary to our experience with Transformer, we found ResNet-50 with variational dropout to be highly sensitive to the initialization for the <> parameters. With the standard setting of -10, we couldn't match the baseline accuracy, and with an initialization of -20 our models achieved good test performance but no sparsity. After some experimentation, we were able to produce good results with an initialization of -15.
While with Transformer we saw a reasonable amount of variance in test set performance and sparsity with the same model evaluated at different log α thresholds, we did not observe the same phenomenon for ResNet-50. Across a range of log α values, we saw consistent accuracy and nearly identical sparsity levels. For all of the results reported in the main text, we used a log α threshold of 0.5, which we found to produce slightly better results than the standard threshold of 3.0.

E.4. l0 Regularization Details
For l0 regularization, we explored four different initial <> values corresponding to dropout rates of 1%, 5%, 10%, and 30%.
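The initial values referenced above follow from the dropout-rate correspondence noted in Sec. D.3 (2.197 for a 10% dropout rate). Assuming that same mapping, a short sketch of the conversion is:

import math

def log_alpha_init(dropout_rate):
    # Choose the initial location so that sigmoid(log_alpha) = 1 - dropout_rate,
    # consistent with 2.197 <-> 10% noted in Sec. D.3.
    return math.log((1.0 - dropout_rate) / dropout_rate)

for p in (0.01, 0.05, 0.10, 0.30):
    print(f"dropout {p:.0%}: initial value = {log_alpha_init(p):.3f}")
# dropout 1%: 4.595, dropout 5%: 2.944, dropout 10%: 2.197, dropout 30%: 0.847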
For each dropout rate, we extensively tuned the l0-norm weight to produce models in the desired sparsity range. After identifying the proper range of l0-norm coefficients, we ran experiments with 20 different coefficients in that range. For each combination of these hyperparameters, we tried all four combinations of other regularizers: standard weight decay and label smoothing, only weight decay, only label smoothing, and no regularization. For weight decay, we used the formulation for the reparameterized weights provided in the original paper, and followed their approach of scaling the weight decay coefficient based on the initial dropout rate to maintain a constant length-scale between the l0 regularized model and the standard model.
Across all of these experiments, we were unable to produce ResNet models that achieved a test set performance better than random guessing. For all experiments, we observed that training proceeded reasonably normally until the l0-norm loss began to drop, at which point the model incurred severe accuracy loss. We include the results of all hyperparameter combinations in our data release.
Additionally, we tried a number of tweaks to the learning process to improve the results, to no avail. We explored training the model for twice the number of epochs, training with much higher initial dropout rates, modifying the <> parameter for the hard-concrete distribution, and a modified test-time parameter estimator.

E.5. Random Pruning Details
For random pruning on ResNet-50, we shifted the set of possible start and end points for pruning earlier in training relative to those we explored for magnitude pruning. At each of the sparsity levels tried with magnitude pruning, we tried starting pruning at steps 0, 8k, and 20k. For each potential starting point, we tried ending pruning at steps 40k, 68k, and 76k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k and explored training with and without label smoothing.

E.6. Scratch-B Learning Rate Variants
For the scratch-b (Liu et al., 2018) experiments with ResNet-50, we explored four different learning rate schemes for the extended training time (2x the default number of epochs).
The first learning rate scheme we explored was uniformly scaling each of the five learning rate regions to last for double the number of epochs. This setup produced the best results by a wide margin. We report these results in the main text.
The second learning rate scheme was to keep the standard learning rate, and maintain the final learning rate for the extra training steps, as is common when fine-tuning deep neural networks. The third learning rate scheme was to maintain the standard learning rate, and continually drop the learning rate by a factor of 0.1 every 30 epochs. The last scheme we explored was to skip the learning rate warm-up, and drop the learning rate by 0.1 every 30 epochs. This learning rate scheme is closest to the one used by Liu et al. (2018). We found that this scheme underperformed relative to the scaled learning rate scheme with our training setup.
Results for all learning rate schemes are included with the released hyperparameter tuning data.
<> <> <>


<> <> <>
NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications

Tien-Ju Yang 1⋆[0000−0003−4728−0321], Andrew Howard 2, Bo Chen 2, Xiao Zhang 2, Alec Go 2, Mark Sandler 2, Vivienne Sze 1, and Hartwig Adam 2

1 Massachusetts Institute of Technology
2 Google Inc.
+ {tjy,sze}@mit.edu,{howarda,bochen,andypassion,ago,sandler,hadam}@google.com + + + Abstract. + + This work proposes an algorithm, called NetAdapt, that + automatically adapts a pre-trained deep neural network to a mobile plat- + form given a resource budget. While many existing algorithms simplify + networks based on the number of MACs or weights, optimizing those + indirect metrics may not necessarily reduce the direct metrics, such as + latency and energy consumption. To solve this problem, NetAdapt + incorporates direct metrics into its adaptation algorithm. These direct metrics + are evaluated using empirical measurements, so that detailed knowledge + of the platform and tool chain is not required. NetAdapt automatically + and progressively simplifies a pre-trained network until the resource bud- + get is met while maximizing the accuracy. Experiment results show that + NetAdapt achieves better accuracy versus latency tradeoffs on both + mobile CPU and mobile GPU, compared with the state-of-the-art automated + network simplification algorithms. For image classification on the + ImageNet dataset, NetAdapt achieves up to a 1.7× speedup in-measured + inference latency with equal or higher accuracy on MobileNets (V1&V2). + + + 1 Introduction + + Deep neural networks (DNNs or networks) have become an indispensable component + of artificial intelligence, delivering near or super-human accuracy on com- + mon vision tasks such as image classification and object detection. However, + DNN-based AI applications are typically too computationally intensive to be + deployed on resource-constrained platforms, such as mobile phones. This hinders + the enrichment of a large set of user experiences. + A significant amount of recent work on DNN design has focused on improving + the efficiency of networks. However, the majority of works are based on optimizing + the “indirect metrics”, such as the number of multiply-accumulate operations + (MACs) or the number of weights, as proxies for the resource consumption of + a network. Although these indirect metrics are convenient to compute and + integrate into the optimization framework, they may not be good approximations + to the “direct metrics” that matter for the real applications such as latency + + <
> + + Fig. 1.NetAdapt automatically adapts a pretrained network to a mobile platform + given a resource budget. This algorithm is guided by the direct metrics for resource + consumption. NetAdapt eliminates the requirement of platform-specific knowledge by + using empirical measurements to evaluate the direct metrics. At each iteration, Ne- + tAdapt generates many network proposals and measures the proposals on the target + platform. The measurements are used to guide NetAdapt to generate the next set of + network proposals at the next iteration. + + + and energy consumption. The relationship between an indirect metric and the + corresponding direct metric can be highly non-linear and platform-dependent as + observed by [15, 25, 26]. In this work, we will also demonstrate empirically that + a network with a fewer number of MACs can be slower when actually running + on mobile devices; specifically, we will show that a network of 19% less MACs + incurs 29% longer latency in practice (see Table 1). + There are two common approaches to designing efficient network architectures. + The first is designing a single architecture with no regard to the underlying + platform. It is hard for a single architecture to run optimally on all the platforms + due to the different platform characteristics. For example, the fastest architecture + on a desktop GPU may not be the fastest one on a mobile CPU with the + same accuracy. Moreover, there is little guarantee that the architecture could + meet the resource budget (e.g., latency) on all platforms of interest. The second + approach is manually crafting architectures for a given target platform based + on the platform’s characteristics. However, this approach requires deep knowledge + about the implementation details of the platform, including the toolchains, + the configuration and the hardware architecture, which are generally unavailable + given the proprietary nature of hardware and the high complexity of modern sys- + tems. Furthermore, manually designing a different architecture for each platform + can be taxing for researchers and engineers. + In this work, we propose a platform-aware algorithm, calledNetAdapt,to + address the aforementioned issues and facilitate platform-specific DNN deployment. NetAdapt 3 + NetAdapt (Fig. 1) incorporates direct metrics in the optimization loop, so + it does not suffer from the discrepancy between the indirect and direct metrics. + The direct metrics are evaluated by the empirical measurements taken from the + target platform. This enables the algorithm to support any platform without + detailed knowledge of the platform itself, although such knowledge could still be + incorporated into the algorithm to further improve results. In this paper, we use + latency as the running example of a direct metric and resource to target even + though our algorithm is generalizable to other metrics or a combination of them + (Sec. 4.3). + The network optimization of NetAdapt is carried out in an automatic way to + gradually reduce the resource consumption of a pretrained network while + maximizing the accuracy. The optimization runs iteratively until the resource budget + is met. Through this design, NetAdapt can generate not only a network that + meets the budget, but also a family of simplified networks with different trade- + offs, which allows dynamic network selection and further study. Finally, instead + of being a black box, NetAdapt is designed to be easy to interpret. 
For example, through studying the proposed network architectures and the corresponding empirical measurements, we can understand why a proposal is chosen and this sheds light on how to improve the platform and network design.
The main contributions of this paper are:
– A framework that uses direct metrics when optimizing a pretrained network to meet a given resource budget. Empirical measurements are used to evaluate the direct metrics such that no platform-specific knowledge is required.
– An automated constrained network optimization algorithm that maximizes accuracy while satisfying the constraints (i.e., the predefined resource budget). The algorithm outperforms the state-of-the-art automatic network simplification algorithms by up to 1.7× in terms of reduction in measured inference latency while delivering equal or higher accuracy. Moreover, a family of simplified networks with different trade-offs will be generated to allow dynamic network selection and further study.
– Experiments that demonstrate the effectiveness of NetAdapt on different platforms and on real-time-class networks, such as the small MobileNetV1, which is more difficult to simplify than larger networks.

2 Related Work

There is a large body of work that aims to simplify DNNs. We refer the readers to [21] for a comprehensive survey, and summarize the main approaches below.
The most related works are pruning-based methods. [6, 14, 16] aim to remove individual redundant weights from DNNs. However, most platforms cannot fully take advantage of unstructured sparse filters [26]. Hu et al. [10] and Srinivas et al. [20] focus on removing entire filters instead of individual weights. The drawback of these methods is the requirement of manually choosing the compression rate for each layer. MorphNet [5] leverages sparsifying regularizers to automatically determine the layerwise compression rate. ADC [8] uses reinforcement learning to learn a policy for choosing the compression rates. The crucial difference between all the aforementioned methods and ours is that they are not guided by the direct metrics, and thus may lead to sub-optimal performance, as we see in Sec. 4.3.
Energy-aware pruning [25] uses an energy model [24] and incorporates the estimated energy numbers into the pruning algorithm. However, this requires designing models to estimate the direct metrics of each target platform, which requires detailed knowledge of the platform including its hardware architecture [3], and the network-to-array mapping used in the toolchain [2]. NetAdapt does not have this requirement since it can directly use empirical measurements.
DNNs can also be simplified by approaches that involve directly designing efficient network architectures, decomposition or quantization. MobileNets [9, 18] and ShuffleNets [27] provide efficient layer operations and reference architecture design. Layer-decomposition-based algorithms [13, 23] exploit matrix decomposition to reduce the number of operations. Quantization [11, 12, 17] reduces the complexity by decreasing the computation accuracy. The proposed algorithm, NetAdapt, is complementary to these methods. For example, NetAdapt can adapt MobileNets to further push the frontier of efficient networks as shown in Sec. 4 even though MobileNets are more compact and much harder to simplify than the other larger networks, such as VGG [19].
+ + 3 Methodology: NetAdapt + + We propose an algorithm, called NetAdapt, that will allow a user to automatically + simplify a pretrained network to meet the resource budget of a platform + while maximizing the accuracy. NetAdapt is guided by direct metrics for resource + consumption, and the direct metrics are evaluated by using empirical measurements, + thus eliminating the requirement of detailed platform-specific knowledge. + + 3.1 Problem Formulation + NetAdapt aims to solve the following non-convex constrained problem: + + <> (1) + + where Net is a simplified network from the initial pretrained network, <> + computes the accuracy, <> evaluates the direct metric for resource con- + sumption of the jth resource, and <> is the budget of the jth resource and + the constraint on the optimization. The resource can be latency, energy, memory + footprint, etc., or a combination of these metrics. + Based on an idea similar to progressive barrier methods [1], NetAdapt breaks + this problem into the following series of easier problems and solves it iteratively: + + <> (2) + + + Algorithm 1:NetAdapt + + <> + + where <> is the network generated by the ith iteration, and Net_0 is the initial + pretrained network. As the number of iterations increases, the constraints (i.e., + current resource budget <> gradually become tighter. <>, + which is larger than zero, indicates how much the constraint tightens for the jth + resource in the ith iteration and can vary from iteration to iteration. This is + referred to as “resource reduction schedule”, which is similar to the concept of + learning rate schedule. The algorithm terminates when Res <> + is equal to or smaller thanBud j for every resource type. It outputs the final + adapted network and can also generate a sequence of simplified networks (i.e., + the highest accuracy network from each iteration <>) to provide the + efficient frontier of accuracy and resource consumption trade-offs. + + 3.2 Algorithm Overview + + For simplicity, we assume that we only need to meet the budget of one resource, + specifically latency. One method to reduce the latency is to remove filters from + the convolutional (CONV) or fully-connected (FC) layers. While there are other + ways to reduce latency, we will use this approach to demonstrate NetAdapt. + The NetAdapt algorithm is detailed in pseudo code in Algorithm 1 and in + Fig. 2. Each iteration solves Eq. 2 by reducing the number of filters in a single + CONV or FC layer (theChoose # of Filters and Choose Which Filters + blocks in Fig. 2). The number of filters to remove from a layer is guided by + empirical measurements. NetAdapt removes entire filters instead of individual + weights because most platforms can take advantage of removing entire filters, + + <
> + + Fig. 2.This figure visualizes the algorithm flow of NetAdapt. At each iteration, Ne- + tAdapt decreases the resource consumption by simplifying (i.e., removing filters from) + one layer. In order to maximize accuracy, it tries to simplify each layer individually + and picks the simplified network that has the highest accuracy. Once the target budget + is met, the chosen network is then fine-tuned again until convergence. + + and this strategy allows reducing both filters and feature maps, which play an + important role in resource consumption [25]. The simplified network is then + fine-tuned for a short length of time in order to restore some accuracy (the + Short-Term Fine-Tuneblock). + In each iteration, the previous three steps (highlighted in bold) are applied on + each of the CONV or FC layers individually 3 . As a result, NetAdapt generates + K (i.e., the number of CONV and FC layers) network proposals in one iteration, + each of which has a single layer modified from the previous iteration. The network + proposal with the highest accuracy is carried over to the next iteration (the + Pick Highest Accuracy block). Finally, once the target budget is met, the + chosen network is fine-tuned again until convergence (theLong-Term Fine-Tuneblock). + + + 3.3 Algorithm Details + + This section describes the key blocks in the NetAdapt algorithm (Fig. 2). + Choose Number of FiltersThis step focuses on determining how many + filters to preserve in a specific layer based on empirical measurements. NetAdapt + gradually reduces the number of filters in the target layer and measures the + resource consumption of each of the simplified networks. The maximum number + 3 The algorithm can also be applied to a group of multiple layers as a single unit + (instead of a single layer). For example, in ResNet [7], we can treat a residual block + as a single unit to speed up the adaptation process. + + <
> + + Fig. 3.This figure illustrates how layer-wise look-up tables are used for fast resource + consumption estimation. + + + of filters that can satisfy the current resource constraint will be chosen. Note + that when some filters are removed from a layer, the associated channels in the + following layers should also be removed. Therefore, the change in the resource + consumption of other layers needs to be factored in. + Choose Which FiltersThis step chooses which filters to preserve based on + the architecture from the previous step. There are many methods proposed in + the literature, and we choose the magnitude-based method to keep the algorithm + simple. In this work, the N filters that have the largest ℓ2-norm magnitude will + be kept, whereNis the number of filters determined by the previous step. More + complex methods can be adopted to increase the accuracy, such as removing the + filters based on their joint influence on the feature maps [25]. + Short-/Long-Term Fine-TuneBoth the short-term fine-tune and long- + term fine-tune steps in NetAdapt involve network-wise end-to-end fine-tuning. + Short-term fine-tune has fewer iterations than long-term fine-tune. + At each iteration of the algorithm, we fine-tune the simplified networks with + a relatively smaller number of iterations (i.e., short-term) to regain accuracy, in + parallel or in sequence. This step is especially important while adapting small + networks with a large resource reduction because otherwise the accuracy will + drop to zero, which can cause the algorithm to choose the wrong network proposal. + As the algorithm proceeds, the network is continuously trained but does not + converge. Once the final adapted network is obtained, we fine-tune the network + with more iterations until convergence (i.e., long-term) as the final step. + + + 3.4 Fast Resource Consumption Estimation + + As mentioned in Sec. 3.3, NetAdapt uses empirical measurements to determine + the number of filters to keep in a layer given the resource constraint. In theory, + we can measure the resource consumption of each of the simplified networks + on the fly during adaptation. However, taking measurements can be slow and + difficult to parallelize due to the limited number of available devices. Therefore, + it may be prohibitively expensive and become the computation bottleneck. + + <
> + + Fig. 4.The comparison between the estimated latency (using layer-wise look-up tables) + and the real latency on a single large core of Google Pixel 1 CPU while adapting the + 100% MobileNetV1 with the input resolution of 224 [9]. + + + We solve this problem by building layer-wise look-up tables with pre-measured + resource consumption of each layer. When executing the algorithm, we look up + the table of each layer, and sum up the layer-wise measurements to estimate + the network-wise resource consumption, which is illustrated in Fig. 3. The rea- + son for not using a network-wise table is that the size of the table will grow + exponentially with the number of layers, which makes it intractable for deep + networks. Moreover, layers with the same shape and feature map size only need + to be measured once, which is common for modern deep networks. + Fig. 4 compares the estimated latency (the sum of layer-wise latency from the + layer-wise look-up tables) and the real latency on a single large core of Google + Pixel 1 CPU while adapting the 100% MobileNetV1 with the input resolution of + 224 [9]. The real and estimated latency numbers are highly correlated, and the + difference between them is sufficiently small to be used by NetAdapt. + + + 4 Experiment Results + + In this section, we apply the proposed NetAdapt algorithm to MobileNets [9, 18], + which are designed for mobile applications, and experiment on the ImageNet + dataset [4]. We did not apply NetAdapt on larger networks like ResNet [7] and + VGG [19] because networks become more difficult to simplify as they become + smaller; these networks are also seldom deployed on mobile platforms. We benchmark + NetAdapt against three state-of-the-art network simplification methods: + Multipliers[9] are simple but effective methods for simplifying networks. + Two commonly used multipliers are the width multiplier and the resolution + multiplier; they can also be used together. Width multiplier scales the + number of filters by a percentage across all convolutional (CONV) and fully- + connected (FC) layers, and resolution multiplier scales the resolution of the + input image. We use the notation “50% MobileNetV1 (128)” to denote ap- + plying a width multiplier of 50% on MobileNetV1 with the input image + resolution of 128. + MorphNet[5] is an automatic network simplification algorithm based on sparsifying regularization. + ADC[8] is an automatic network simplification algorithm based on reinforcement learning. + + We will show the performance of NetAdapt on the small MobileNetV1 (50% + MobileNetV1 (128)) to demonstrate the effectiveness of NetAdapt on real-time- + class networks, which are much more difficult to simplify than larger networks. + To show the generality of NetAdapt, we will also measure its performance on + the large MobileNetV1 (100% MobileNetV1 (224)) across different platforms. + Lastly, we adapt the large MobileNetV2 (100% MobileNetV2 (224)) to push the + frontier of efficient networks. + + + 4.1 Detailed Settings for MobileNetV1 Experiments + + We perform most of the experiments and study on MobileNetV1 and detail the + settings in this section. + NetAdapt ConfigurationMobileNetV1 [9] is based on depthwise separable + convolutions, which factorize am×m standard convolution layer into am×m + depthwise layer and a 1×1 standard convolution layer called a pointwise layer. In + the experiments, we adapt each depthwise layer with the corresponding pointwise + layer and choose the filters to keep based on the pointwise layer. 
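As a concrete illustration of the layer-wise look-up tables of Sec. 3.4 and the "choose number of filters" step of Sec. 3.3, the sketch below estimates network latency by summing per-layer entries and picks the largest filter count that meets a latency target. The layer names and latency values are hypothetical, and a full implementation would also index the tables by input channel count so that removing filters in one layer is reflected in the entries of the following layer.

# lut[layer][num_filters] -> pre-measured latency (ms) for that layer configuration.
lut = {
    "conv3": {256: 4.1, 192: 3.2, 128: 2.3, 64: 1.4},
    "conv4": {256: 4.0, 192: 3.1, 128: 2.2, 64: 1.3},
}

def estimated_latency(config):
    # Network-wise latency is approximated by summing the per-layer table entries.
    return sum(lut[layer][filters] for layer, filters in config.items())

def max_filters_meeting_target(config, layer, target_ms):
    # Largest filter count for `layer` such that the whole network fits the target.
    for filters in sorted(lut[layer], reverse=True):
        trial = dict(config, **{layer: filters})
        if estimated_latency(trial) <= target_ms:
            return filters
    return min(lut[layer])   # fall back to the smallest measured configuration

config = {"conv3": 256, "conv4": 256}
print(estimated_latency(config))                          # 8.1
print(max_filters_meeting_target(config, "conv3", 7.0))   # 128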
When adapting + the small MobileNetV1 (50% MobileNetV1 (128)), the latency reduction (<> + in Eq. 2) starts at 0.5 and decays at the rate of 0.96 per iteration. When adapting + other networks, we use the same decay rate but scale the initial latency reduction + proportional to the latency of the initial pretrained network. + Network TrainingWe preserve ten thousand images from the training + set, ten images per class, as the holdout set. The new training set without the + holdout images is used to perform short-term fine-tuning, and the holdout set is + used to pick the highest accuracy network out of the simplified networks at each + iteration. The whole training set is used for the long-term fine-tuning, which is + performed once in the last step of NetAdapt. + Because the training configuration can have a large impact on the accuracy, + we apply the same training configuration to all the networks unless otherwise + stated to have a fairer comparison. We adopt the same training configuration as + MorphNet [5] (except that the batch size is 128 instead of 96). The learning rate + for the long-term fine-tuning is 0.045 and that for the short-term fine-tuning is + 0.0045. This configuration improves ADC network’s top-1 accuracy by 0.3% and + almost all multiplier networks’ top-1 accuracy by up to 3.8%, except for one data + point, whose accuracy is reduced by 0.2%. We use these numbers in the following + analysis. Moreover, all accuracy numbers are reported on the validation set to + show the true performance. + Mobile Inference and Latency MeasurementWe use Google’s Tensor- + Flow Lite engine [22] for inference on a mobile CPU and Qualcomm’s Snap- + dragon Neural Processing Engine (SNPE) for inference on a mobile GPU. For + experiments on mobile CPUs, the latency is measured on a single large core of + + <
> + + Fig. 5.The figure compares NetAdapt (adapting the small MobileNetV1) with the + multipliers [9] and MorphNet [5] on a mobile CPU of Google Pixel 1. + + + Google Pixel 1 phone. For experiments on mobile GPUs, the latency is measured + on the mobile GPU of Samsung Galaxy S8 with SNPE’s benchmarking tool. For + each latency number, we report the median of 11 latency measurements. + + 4.2 Comparison with Benchmark Algorithms + Adapting Small MobileNetV1 on a Mobile CPUIn this experiment, we + apply NetAdapt to adapt the small MobileNetV1 (50% MobileNetV1 (128)) to + a mobile CPU. It is one of the most compact networks and achieves real-time + performance. It is more challenging to simplify than other larger networks + (include the large MobileNet V1). The results are summarized and compared with + the multipliers [9] and MorphNet [5] in Fig. 5. We observe that NetAdapt + outperforms the multipliers by up to 1.7×faster with the same or higher accuracy. + For MorphNet, NetAdapt’s result is 1.6×faster with 0.3% higher accuracy. + + Adapting Large MobileNetV1 on a Mobile CPUIn this experiment, we + apply NetAdapt to adapt the large MobileNetV1 (100% MobileNetV1 (224)) + on a mobile CPU. It is the largest MobileNetV1 and achieves the highest ac- + curacy. Because its latency is approximately 8×higher than that of the small + MobileNetV1, we scale the initial latency reduction by 8×. The results are shown + and compared with the multipliers [9] and ADC [8] in Fig. 6. NetAdapt achieves + higher accuracy than the multipliers and ADC while increasing the speed by + 1.4× and 1.2×, respectively. + While the training configuration is kept the same when comparing to the + benchmark algorithms discussed above, we also show in Fig. 6 that the accuracy + of the networks adapted using NetAdapt can be further improved with a better + training configuration. After simply adding dropout and label smoothing, the + accuracy can be increased by 1.3%. Further tuning the training configuration + for each adapted network can give higher accuracy numbers, but it is not the + focus of this paper. + + <
> + + Fig. 6.The figure compares NetAdapt (adapting the large MobileNetV1) with the + multipliers [9] and ADC [8] on a mobile CPU of Google Pixel 1. Moreover, the accuracy + of the adapted networks can be further increased by up to 1.3% through using a better + training configuration (simply adding dropout and label smoothing). + + <
> + + Fig. 7.This figure compares NetAdapt (adapting the large MobileNetV1) with the + multipliers [9] and ADC [8] on a mobile GPU of Samsung Galaxy S8. Moreover, the + accuracy of the adapted networks can be further increased by up to 1.3% through using + a better training configuration (simply adding dropout and label smoothing). + + + Adapting Large MobileNetV1 on a Mobile GPUIn this experiment, we + apply NetAdapt to adapt the large MobileNetV1 on a mobile GPU to show the + generality of NetAdapt. Fig. 7 shows that NetAdapt outperforms other benchmark + algorithms by up to 1.2×speed-up with higher accuracy. Due to the + limitation of the SNPE tool, the layerwise latency breakdown only considers the + computation time and does not include the latency of other operations, such as + feature map movement, which can be expensive [25]. This affects the precision + of the look-up tables used for this experiment. Moreover, we observe that there + is an approximate 6.2ms (38% of the latency of the network before applying + NetAdapt) non-reducible latency. These factors cause a smaller improvement on + the mobile GPU compared with the experiments on the mobile CPU. Moreover, + when the better training configuration is applied as previously described, the + accuracy can be further increased by 1.3%. + + <
> <
> 

Fig. 8. The accuracy of different short-term fine-tuning iterations when adapting the small MobileNetV1 (without long-term fine-tuning) on a mobile CPU of Google Pixel 1. Zero iterations means no short-term fine-tuning.

Fig. 9. The comparison between before and after long-term fine-tuning when adapting the small MobileNetV1 on a mobile CPU of Google Pixel 1. Although the short-term fine-tuning preserves the accuracy well, the long-term fine-tuning gives the extra 3.4% on average (from 1.8% to 4.5%).

4.3 Ablation Studies

Impact of Direct Metrics. In this experiment, we use the indirect metric (i.e., the number of MACs) instead of the direct metric (i.e., the latency) to guide NetAdapt to investigate the importance of using direct metrics. When computing the number of MACs, we only consider the CONV and FC layers because batch normalization layers can be folded into the corresponding CONV layers, and the other layers are negligibly small. Table 1 shows that NetAdapt outperforms the benchmark algorithms with lower numbers of MACs and higher accuracy. This demonstrates the effectiveness of NetAdapt. However, we also observe that the network with lower numbers of MACs may not necessarily be faster. This shows the necessity of incorporating direct measurements into the optimization flow.

Impact of Short-Term Fine-Tuning. Fig. 8 shows the accuracy of adapting the small MobileNetV1 with different short-term fine-tuning iterations (without long-term fine-tuning). The accuracy rapidly drops to nearly zero if no short-term fine-tuning is performed (i.e., zero iterations). In this low accuracy region, the algorithm picks the best network proposal solely based on noise and hence

<
> 

Fig. 10. NetAdapt and the multipliers generate different simplified networks when adapting the small MobileNetV1 to match the latency of 25% MobileNetV1 (128).

gives poor performance. After fine-tuning a network for a short amount of time (ten thousand iterations), the accuracy is always kept above 20%, which allows the algorithm to make a better decision. Although further increasing the number of iterations improves the accuracy, we find that using forty thousand iterations leads to a good accuracy versus speed trade-off for the small MobileNetV1.

Impact of Long-Term Fine-Tuning. Fig. 9 illustrates the importance of performing the long-term fine-tuning. Although the short-term fine-tuning preserves the accuracy well, the long-term fine-tuning can still increase the accuracy by up to another 4.5% or 3.4% on average. Since the short-term fine-tuning has a short training time, the training is terminated far before convergence. Therefore, it is not surprising that the final long-term fine-tuning can further increase the accuracy.

Impact of Resource Reduction Schedules. Table 2 shows the impact of using three different resource reduction schedules, which are defined in Sec. 3.1. Empirically, using a larger resource reduction at each iteration increases the adaptation speed (i.e., reducing the total number of adaptation iterations) at the cost of accuracy. With the same number of total iterations, the result suggests that a smaller initial resource reduction with a slower decay is preferable.

4.4 Analysis of Adapted Network Architecture

The network architectures of the adapted small MobileNetV1 by using NetAdapt and the multipliers are shown and compared in Fig. 10. Both of them have similar latency as 25% MobileNetV1 (128). There are two interesting observations.

<
>

Table 3. The comparison between NetAdapt (adapting the large MobileNetV2 (100% MobileNetV2 (224)))
and the multipliers [18] on a mobile CPU of Google Pixel 1. We compare the latency at similar
accuracy and the accuracy at similar latency.


First, NetAdapt removes more filters in layers 7 to 10, but fewer in layer 6. Since the feature map
resolution is reduced in layer 6 but not in layers 7 to 10, we hypothesize that when the feature
map resolution is reduced, more filters are needed to avoid creating an information bottleneck.
The second observation is that NetAdapt keeps more filters in layer 13 (i.e., the last CONV layer).
One possible explanation is that the ImageNet dataset contains one thousand classes, so more
feature maps are needed by the last FC layer to do the correct classification.

4.5 Adapting Large MobileNetV2 on a Mobile CPU

In this section, we show encouraging early results of applying NetAdapt to MobileNetV2 [18].
MobileNetV2 introduces the inverted residual with linear bottleneck into MobileNetV1, making it
more efficient. Because MobileNetV2 utilizes residual connections, we only adapt individual inner
(expansion) layers or reduce all bottleneck layers of the same resolution in lockstep. The main
differences between the MobileNetV1 and MobileNetV2 experiment settings are that each network
proposal is short-term fine-tuned with ten thousand iterations, the initial latency reduction is
1 ms, the latency reduction decay is 0.995, the batch size is 96, and dropout and label smoothing
are used. NetAdapt achieves 1.1% higher accuracy or 1.2× faster speed than the multipliers, as
shown in Table 3.

5 Conclusion

In summary, we proposed an automated algorithm, called NetAdapt, to adapt a pretrained network to
a mobile platform given a real resource budget. NetAdapt can incorporate direct metrics, such as
latency and energy, into the optimization to maximize the adaptation performance based on the
characteristics of the platform. By using empirical measurements, NetAdapt can be applied to any
platform as long as we can measure the desired metrics, without any knowledge of the underlying
implementation of the platform. We demonstrated empirically that the proposed algorithm can
achieve a better accuracy-versus-latency trade-off (up to 1.7× faster with equal or higher
accuracy) compared with other state-of-the-art network simplification algorithms. In this work, we
aimed to highlight the importance of using direct metrics in the optimization of efficient
networks; we hope that future research efforts will take direct metrics into account in order to
further improve the performance of efficient networks.


Bibliography

[1] Audet, C., Dennis Jr., J.E.: A progressive barrier for derivative-free nonlinear programming.
    SIAM Journal on Optimization 20(1), 445–472 (2009)
[2] Chen, Y.H., Emer, J., Sze, V.: Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow
    for Convolutional Neural Networks. In: Proceedings of the 43rd Annual International Symposium
    on Computer Architecture (ISCA) (2016)
[3] Chen, Y.H., Krishna, T., Emer, J., Sze, V.: Eyeriss: An Energy-Efficient Reconfigurable
    Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52,
    127–138 (2016)
[4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale
    hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition
    (CVPR).
pp. 248–255. IEEE (2009)
[5] Gordon, A., Eban, E., Nachum, O., Chen, B., Yang, T.J., Choi, E.: Morphnet: Fast & simple
    resource-constrained structure learning of deep networks. In: IEEE Conference on Computer
    Vision and Pattern Recognition (CVPR) (2018)
[6] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient
    neural network. In: Advances in Neural Information Processing Systems. pp. 1135–1143 (2015)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: IEEE
    Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
[8] He, Y., Han, S.: Adc: Automated deep compression and acceleration with reinforcement learning.
    arXiv preprint arXiv:1802.03494 (2018)
[9] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M.,
    Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications.
    arXiv preprint arXiv:1704.04861 (2017)
[10] Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network Trimming: A Data-Driven Neuron Pruning
    Approach towards Efficient Deep Architectures. arXiv preprint arXiv:1607.03250 (2016)
[11] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks.
    In: Advances in Neural Information Processing Systems. pp. 4107–4115 (2016)
[12] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.:
    Quantization and training of neural networks for efficient integer-arithmetic-only inference.
    arXiv preprint arXiv:1712.05877 (2017)
[13] Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional
    neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530
    (2015)
[14] Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural
    Information Processing Systems (1990)
[15] Lai, L., Suda, N., Chandra, V.: Not all ops are created equal! In: SysML (2018)
[16] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural
    networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440 (2016)
[17] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using
    binary convolutional neural networks. In: European Conference on Computer Vision (ECCV) (2016)
[18] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Inverted residuals and linear
    bottlenecks: Mobile networks for classification, detection and segmentation. In: IEEE
    Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[19] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image
    Recognition. In: International Conference on Learning Representations (ICLR) (2014)
[20] Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural networks. arXiv
    preprint arXiv:1507.06149 (2015)
[21] Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural networks:
    A tutorial and survey. Proceedings of the IEEE 105(12), 2295–2329 (Dec 2017).
    https://doi.org/10.1109/JPROC.2017.2761740
[22] TensorFlow Lite: https://www.tensorflow.org/mobile/tflite/
[23] Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., Wang, Z.: Deep fried
    convnets. In: Proceedings of the IEEE International Conference on Computer Vision. pp.
1476–1483 (2015) + [24] Yang, Tien-Ju and Chen, Yu-Hsin and Emer, Joel and Sze, Vivienne: A + Method to Estimate the Energy Consumption of Deep Neural Networks. + In: Asilomar Conference on Signals, Systems and Computers (2017) + [25] Yang, Tien-Ju and Chen, Yu-Hsin and Sze, Vivienne: Designing energy- + efficient convolutional neural networks using energy-aware pruning. In: + IEEE Conference on Computer Vision and Pattern Recognition (CVPR) + (2017) + [26] Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel: + Customizing dnn pruning to the underlying hardware parallelism. In: Pro- + ceedings of the 44th Annual International Symposium on Computer Archi- + tecture (2017) + [27] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shuenet: An extremely ef- + ficient convolutional neural network for mobile devices. arXiv preprint + arXiv:1707.01083 (2017) +<> <> <> + + +<> <> <> + TOWARDS THE SYSTEMATIC REPORTING OF THE ENERGY AND CARBON FOOTPRINTS OF MACHINE LEARNING + + Peter Henderson y , Jieru Hu z , Joshua Romoff + Emma Brunskill y , Dan Jurafsky y , Joelle Pineau z; + y Stanford University, z Facebook, Mila, McGill University + + + February 14, 2020 + + ABSTRACT + + Accurate reporting of energy and carbon usage is essential for understanding the potential climate + impacts of machine learning research. We introduce a framework that makes this easier by providing a + simple interface for tracking realtime energy consumption and carbon emissions, as well as generating + standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient + reinforcement learning algorithms to incentivize responsible research in this area as an example for + other areas of machine learning. Finally, based on case studies using our framework, we propose + strategies for mitigation of carbon emissions and reduction of energy consumption. By making + accounting easier, we hope to further the sustainable development of machine learning experiments + and spur more research into energy efficient algorithms. + + 1 Introduction + + Global climate change is a scientifically well-recognized phenomenon and appears to be accelerated due to greenhouse + gas (GHG) emissions such as carbon dioxide or equivalents (CO 2eq ) (Crowley,2000;IPCC,2018). The harmful health + and safety impacts of global climate change are projected to “fall disproportionately on the poor and vulnerable” (IPCC, + 2018). Energy production remains a large factor in GHG emissions, contributing about 25% of GHG emissions in + 2010 (IPCC,2018). With the compute and energy demands of many modern machine learning (ML) methods growing + exponentially (Amodei and Hernandez,2018), ML systems have the potential to significantly contribute to carbon + emissions. Recent work has demonstrated these potential impacts through case studies and suggested various mitigating + strategies (Strubell et al.,2019;Schwartz et al.,2019). + Systematic and accurate measurements are needed to better estimate the broader energy and carbon footprints of ML – + in both research and production settings. Accurate accounting of carbon and energy impacts aligns incentives with + energy efficiency (Schwartz et al.,2019), raises awareness, and drives mitigation efforts (Sundar et al.,2018;LaRiviere + et al.,2016), among other benefits. 1 Yet, most ML research papers do not regularly report energy or carbon emissions + metrics. 
2 + + We hypothesize that part of the reason that much research does not report energy and carbon metrics is due to the + complexities of collecting them. Collecting carbon emission metrics requires understanding emissions from energy + grids, recording power outputs from GPUs and CPUs, and navigating among different tools to accomplish these tasks. + To reduce this overhead, we present experiment-impact-tracker a lightweight framework for consistent, easy, and + more accurate reporting of energy, compute, and carbon impacts of ML systems. + In Section4, we introduce the design and capabilities of our framework and the issues with accounting we aim to solve + with this new framework. Section5expands on the challenges of using existing accounting methods and discusses our + + 1 See Section4.1for an extended discussion on the importance of accounting. + 2 See Section3and AppendixBfor more information. + 3 https.//github.com/Breakend/experiment-impact-tracker + + + learnings from analyzing experiments with experiment-impact-tracker. For example, in an empirical case study on + image classification algorithms, we demonstrate that floating point operations (FPOs), a common measure of efficiency, + are often uncorrelated with energy consumption with energy metrics gathered by experiment-impact-tracker. + In Section6, we focus on recommendations for promoting energy-efficient research and mitigation strategies for carbon + emissions. Using our framework, we present aReinforcement Learning Energy Leaderboard in Section6.1to encourage + development of energy efficient algorithms. We also present a case study in machine translation to show how regional + energy grid differences can result in large variations inCO 2eq emissions. Emissions can be reduced by up to 30x just + by running experiments in locations powered by more renewable energy sources (Section6.2). Finally, we suggest + systemic and immediate changes based on our findings. + + •incentivizing energy-efficient research through leaderboards (Section6.1) + •running experiments in carbon-friendly regions (Section6.2) + •reducing overheads for utilizing efficient algorithms and resources (Section7.1) + •considering energy-performance trade-offs before deploying energy hungry models (Section7.2) + •selecting efficient test environment especially in RL (Section7.3) + •ensuring reproducibility to reduce energy consumption from replication difficulties (Section7.4) + •consistently reporting energy and carbon metrics (Section7.5) + + 2 Related Work + + Estimating GHG emissions and their downstream consequences is important for setting regulatory standards (U.S. + Environment Protection Agency,2013) and encouraging self-regulation (Byerly et al.,2018). In particular, these + estimates are used to set carbon emissions reduction targets and in turn set carbon prices for taxes or emissions trading + systems. 4 A large body of work has examined modeling and accounting of carbon emissions 5 at different levels of + granularity. at the global scale (IPCC,2018); using country-specific estimates (Ricke et al.,2018); targeting a particular + industrial sector like Information and Communication Technologies, for example, modeled byMalmodin et al.(2013); + or even targeting a particular application like bitcoin mining, for example, modeled byMora et al.(2018). 
+ At the application level, some work has already modeled carbon impacts specifically in computationally intensive + settings like bitcoin mining (Krause and Tolaymat,2018;Stoll et al.,2019;Zade et al.,2019;Mora et al.,2018). + Such application-specific efforts are important for prioritizing emissions mitigation strategies. without understanding + projected impacts, policy decisions could focus on ineffective regulation. However, with large amounts of heterogeneity + and endogeneity in the underlying data, it can be difficult to model all aspects of an application’s usage. For example, + one study suggested that “bitcoin emissions alone could push global warming above 2°C” (Mora et al.,2018). But + Masanet et al.(2019),Houy(2019), and others, criticized the underlying modeling assumptions which led to such large + estimates of carbon emissions. This shows that it is vital that these models provide accurate measurements if they are to + be used for informed decision making. + With ML models getting more computationally intensive (Amodei and Hernandez,2018), we want to better understand + how machine learning in research and industry impacts climate change. However, estimating aggregate climate change + impacts of ML research and applications would require many assumptions due to a current lack of reporting and + accounting. Instead, we aim to emphasize and aid systematic reporting strategies such that accurate field-wide estimates + can be conducted in the future. + Some recent work investigates climate impacts of machine learning research, specifically Strubell et al.(2019) + demonstrate the issue of carbon and energy impacts of large NLP models by evaluating estimated power usage and + carbon emissions for a set of case studies. The authors suggest that. “authors should report training time and sensitivity + to hyperparameters”, “academic researchers need equitable access to computation resources”, and “researchers should + prioritize computationally efficient hardware and algorithms”.Schwartz et al.(2019) provide similar proposals, + suggesting floating point operations (FPOs) as a guiding efficiency metric. Lacoste et al.(2019) recently provided a + website for estimating carbon emissions based on GPU type, experiment length, and cloud provider. In Section5, we + 4 An emissions trading system is a cap on total allowed carbon emissions for a company with permits issued. When a company + emits a certain amount of carbon, they trade in a permit, creating a market for emissions permits. This is a market-based approach to + incentivize emission reductions. See Ramstein et al.(2019) for a description of such carbon pricing efforts across different countries. + 5 See also assorted examinations on carbon accounting, standardized reporting, and policy recommendations (Stechemesser and + Guenther,2012; Dayarathna et al.,2015; IPCC,2018; Ajani et al.,2013; Bellassen and Stephan,2015;Andrew and Cortese,2011; + Tang and Demeritt, 2018;Cotter et al.,2011;Tol,2011;U.S. Environment Protection Agency,2013; Ricke et al.,2018). + discuss how while the estimation methods of these works provide some understanding of carbon and energy impacts, + nuances in the estimation methods may make them inaccurate – particularly in experiments which utilize combined CPU + and GPU workloads heavily. We build a framework aiming to provide more accurate and easier systematic reporting of + carbon and energy footprints. 
We also provide additional mitigation and reporting strategies – beyond those discussed + by these prior works – to emphasize how both companies and research labs can be more carbon and energy efficient. + It is worth noting that prior work has also examined the carbon impacts of research in other fields, focusing mostly on + emissions from conference travel (Spinellis and Louridas,2013;Astudillo and AzariJafari,2018;Hackel and Sparkman, + 2018). We provide a brief discussion on ML-related conference travel in AppendixA, but will focus mainly on accurate + accounting of energy and carbon footprints of ML compute. + + 3 Background + + We briefly provide a primer on energy and carbon accounting, which form the basis of our proposed framework for + measuring and reporting the ecological footprint of ML research. + + 3.1 Energy Accounting + + Energy accounting is fairly straightforward. The energy consumption of a system can be measured in Joules (J) or + Watt-hours (Wh), 6 representing the amount of energy needed to power the system. Life-cycle accounting might also + consider the energy required to manufacture components of the system – for example, the production of GPUs or + CPUs (Jones et al.,2013). However, we largely ignore life-cycle aspects of energy accounting due to the difficulties in + attributing manufacturing impacts on a per-experiment basis. Measuring data-center energy impacts also contain several + layers, focusing on hardware-centric and software-centric analyses. Many parts contribute to the power consumption + of any computational system. Dayarathna et al.(2015) survey energy consumption components of a data center and + their relative consumption. cooling (50%), lighting (3%), power conversion (11%), network hardware (10%), and + server/storage (26%). The server and storage component can further be broken down into contributions from DRAM, + CPUs, among other compute components. Accurate accounting for all of these components requires complex modeling + and varies depending on workload. Since we aim to provide a framework at the per-experiment software level, we only + account for aspects of energy consumption which expose interfaces for energy metrics. For the purpose of our work, this + is constrained to DRAM, CPUs, and GPUs. To account for all other components, we rely on a power usage effectiveness + (PUE) factor (Strubell et al.,2019). This factor rescales the available power metrics by an average projected overhead + of other components. With more available software interfaces, more robust modeling can be performed as reviewed by + Dayarathna et al.(2015). + + 3.2 Carbon Accounting + + Carbon accounting can be all-expansive, so we focus on a narrow definition provided by Stechemesser and Guenther + (2012). “carbon accounting at the project scale can be defined as the measuring and non-monetary valuation of carbon + and GHG emissions and offsetting from projects, and the monetary assessment of these emissions with offset credits to + inform project-owners and investors but also to establish standardized methodologies.” Carbon and GHG emissions are + typically measured in some form close to unitsCO 2eq . This is the amount of carbon – and other GHG converted to + carbon amounts – released into the atmosphere as a result of the project. Carbon offsetting is the amount of carbon + emissions saved as a result of the project. 
For example, a company may purchase renewable energy in excess of + the energy required for their project to offset for the carbon emissions they contributed. Since our goal is to inform + and assess carbon emissions of machine learning systems, we ignore carbon offsetting 7 . We also do not consider + carbon accounting in the financial sense, but do provide metrics on monetary impacts through the social cost of carbon + (SC-CO2). TheU.S. Environment Protection Agency(2013) uses this metric when developing administrative rules and + regulations. According to the EPA, “The SC-CO2 is a measure, in dollars, of the long-term damage done by a ton of + carbon dioxide (CO2) emissions in a given year. This dollar figure also represents the value of damages avoided for + a small emission reduction (i.e., the benefit of a CO2 reduction).” We rely on the per-country social cost of carbon + developed byRicke et al.(2018), which accounts for different risk profiles of country-level policies and GDP growth in + their estimates of SC-CO2. + Carbon emissions from a project can also consider life-cycle emissions (for example, manufacturing of CPUs may emit + carbon as part of the process). We do not consider these aspects of emissions. We instead, consider only carbon emissions + from energy consumption. A given energy grid powering an experiment will have a carbon intensity. the grams of + + 6 One Watt is a unit of power – equivalent to one Joule per second. + 7 See discussion in AppendixCfor more information on why. + + CO2 emitted per kWh of energy used. This carbon intensity is determined based on the energy sources supplying the + grid. Each energy source has its own carbon intensity accounted for through a full life-cycle analysis (IPCC,2015). For + example, coal power has a median carbon intensity of 820 gCO 2eq / kWh, while hydroelectricity has a mean carbon + intensity of 24 gCO 2eq / kWh. Carbon emissions for a compute system can be estimated by understanding the carbon + intensity of the local energy grid and the energy consumption of the system. Similar analyses have been done for + bitcoin (Krause and Tolaymat,2018). These analyses, however, attempt to extrapolate impacts of bitcoin mining in + general, while in this work we attempt to examine machine learning impacts on a per-experiment basis. + + 3.3 Current State of Reporting in Machine Learning Research + + We briefly examine the current state of accounting in the machine learning literature and review commonly reported + computational metrics. Here we look at a non-exhaustive list of reported metrics from papers we surveyed and group + them into different categories. 
+ + •Energy + –Energy in Joules (Assran et al.,2019) + –Power consumption in Watts (Canziani et al.,2016) + •Compute + –PFLOPs-hr (Amodei and Hernandez,2018), the floating point operations per second needed to run the + experiment in one hour + –Floating Point Operations (FPOs) or Multiply-Additions (Madds), typically reported as the computations + required to perform one forward pass through a neural network (Howard et al.,2017;Sandler et al.,2018; + Schwartz et al.,2019) + –The number of parameters defined by a neural network (often reported together with FPOs) (Howard + et al.,2017;Sandler et al.,2018) + –GPU/CPU utilization as a percentage (Assran et al.,2019;Dalton et al.,2019) + –GPU-hours or CPU-hours, the processor cycles utilized (or in the case of the GPU percentage utilized), + times the runtime (Soboczenski et al.,2018) + •Runtime + –Inference time, the time it takes to run one forward pass through a neural network, (Jeon and Kim,2018; + Qin et al.,2018) + –Wall clock training time, the total time it takes to train a network (Assran et al.,2019;Dalton et al.,2019). + –Hardware and time together (e.g., 8 v100 GPUs for 5 days) (Krizhevsky et al.,2012;Ott et al.,2018; + Gehring et al.,2017) + •Carbon Emissions + –US-average carbon emissions (Strubell et al.,2019) + + Example 1 To get a rough estimate of the prevalence of these metrics, we randomly sampled 100 NeurIPS papers from + the 2019 proceedings. In addition to the metrics above, we also investigate whether hardware information was reported + (important for extrapolating energy and carbon information with partial information). Of these papers, we found 1 + measured energy in some way, 45 measured runtime in some way, 46 provided the hardware used, 17 provided some + measure of computational complexity (e.g., compute-time, FPOs, parameters), and 0 provided carbon metrics. See + Appendix B for more details on methodology. + + Some of these metrics, when combined, can also be used to roughly estimate energy or carbon metrics. For example, + the experiment time (h) can be multiplied by the thermal design power (TDP) of the GPUs used (W) 8 . This results + in a Watt-hour energy metric. This can then be multiplied by the carbon intensity of the local energy grid to assess + the amount ofCO 2eq emitted. This method of estimation omits CPU usage and assumes a 100% GPU utilization. + Alternatively, Amodei and Hernandez(2018) use a utilization factor of 33% for GPUs. Similarly, the PFLOPs-hr metric + can by multiplied by TDP (Watts) and divided by the maximum computational throughput of the GPU (in PFLOPs). + This once again provides a Watt-hour energy metric. This, however, makes assumptions based on maximum efficiency + of a GPU and disregards variations in optimizations made by underlying frameworks (e.g., Tensorflow versus Pytorch; + AMD versus NVIDIA drivers). + + 8 This is a rough estimate of the maximum operating capacity of a GPU. + + As we will demonstrate using our framework (see Section5.2), the assumptions of these estimation methods lead to + significant inaccuracies. However, aggregating all necessary accounting information is not straightforward or easy; it + requires finding compatible tools, handling nuances on shared machines, among other challenges. + It is worth noting that some metrics focus on the computational requirements of training (which require additional + resources to compute gradients and backpropagate, in the case of neural networks) versus the computational requirements + of inference. 
The former is often more energy and carbon intensive in machine learning research, while the later is more + intensive in production systems (the cost of training is insignificant when compared to the lifetime costs of running + inference millions of times per day, every day). We will remain largely agnostic to this differentiation until some + discussions in Sections6.2and7.2. + + 4 A New Framework for Tracking Machine Learning Impacts + + 4.1 Motivation + + The goal of our experiment-impact-tracker framework is to provide an easy to deploy, reproducible, and quickly + understood mechanism for all machine learning papers to report carbon impact summaries, along with additional + appendices showing detailed energy, carbon, and compute metrics. + + Example 2A carbon impact summary generated by our framework can be found at the end of this paper in the Carbon + Impact Statement section. In brief, the experiments in our paper contributed 8.021 kg ofCO 2eq to the atmosphere and + used 24.344 kWh of electricity, having a USA-specific social cost of carbon of $0.38 ($0.00, $0.95) (Ricke et al.,2018). + + Such statements and informational reporting are important for, among other reasons, awareness, aligning incentives, + and enabling accurate cost-benefit analyses. + Awareness. Informational labels and awareness campaigns have been shown to be effective drivers of eco-friendly + behaviors (depending on the context) (Banerjee and Solomon,2003;Sundar et al.,2018;Newell and Siikamäki,2014; + Byerly et al.,2018). Without consistent and accurate accounting, many researchers will simply be unaware of the + impacts their models might have and will not pursue mitigating strategies. Consistent reporting also may provide social + incentives to reduce carbon impacts in research communities. + Aligning Incentives. While current reporting often focuses solely on performance metrics (accuracy in classification, + perplexity in language modeling, average return in reinforcement learning, etc), standardized reporting of energy in + addition to these metrics aligns incentives towards energy efficient models in research output (Schwartz et al.,2019). + Those who accurately report carbon emissions may have more incentive to reduce their carbon footprint. This may also + drive traffic to low-emission regions, spurring construction of more carbon-friendly data centers. 9 + + Cost-Benefit Analysis and Meta-Analysis. Cost-benefit analyses can be conducted with accurate energy metrics + reporting, but are impossible without it. For example, the estimated generated revenue of a model can be weighed + against the cost of electricity. In the case of models suggested by Rolnick et al.(2019), the carbon emissions saved by a + model can be weighed against the emissions generated by the model. Consistent reporting also opens the possibility for + performing meta-analyses on energy and carbon impacts (Henderson and Brunskill,2018). Larger extrapolations to + field-wide impacts of research conferences can also be assessed with more frequent reporting. + + 4.2 Design Considerations + + We consider five main principles when designing the framework for systematic reporting. usability, interpretability, + extensibility, reproducibility, and fault tolerance. + Usability. Perceived ease-of-use can be an important factor in adoption of new technologies and methods (Gefen and + Straub,2000). 
Since gathering key energy (kWh) and carbon (CO 2eq ) metrics requires specific knowledge about – and + aggregation of – different sources of information, there may be a barrier to the ease-of-use in the current status quo. As + a result, a core design consideration in developing tools for these metrics is usability, or ease-of-use. We accomplish + this by abstracting away and distilling required knowledge of information sources, keeping amount of required action + from the user to a minimum. + Interpretability. Along with ease-of-use, a key factor in adoption is perceived usefulness (Gefen and Straub,2000). + Since we wish for the reporting of carbon and energy metrics to become widespread, we consider perceived usefulness + + 9 See discussion in Section6.2on regional carbon emission differences. See discussion by LaRiviere et al.(2016) on how more accurate carbon accounting can result in reduced carbon emissions. + + through interpretability. We aim to make reporting tools within the framework useful through simple generation of + graphs and web pages from metrics for easy interpretation. We also provide a mechanism to generate a carbon impact + statement with the social cost of carbon. This dollar amount represents the projected damage from the experiment’s + carbon emissions and helps ground results in values that may be more interpretable. + Extensibility.We design the framework in a modular fashion to handle evolving driver support (see Section5) and + new metrics. The ML community can add new metrics, carbon intensity information, and other capabilities easily. For + each metric, a central data router stores a description, the function which gathers metric data, and a list of compatibility + checks (e.g., the metric can only be gathered on a Linux system). New metrics can be added to this router. 10 Similarly, + new carbon region and electricity grid information can be added as needed to similar centralized locations. 11 + + Reproducibility. Running an algorithm on different sets of hardware has been shown to affect the reproducibility of + algorithmic results (Gundersen and Kjensmo,2018;Sukhoy and Stoytchev,2019). Our framework aides in automating + reproducibility by logging additional metrics like hardware information, Python package versions, etc. These metrics can + help future work assess statistically significant differences in model energy requirements by accounting for controlled + and random variates (Boquet et al.,2019). + Fault tolerance.Mistakes in software are inevitable – as is discussed inSidor and Schulman(2017). We try to log all + rawinformation so that accounting can be recreated and updated based on new information. We also log the version + number of the tool itself, to ensure future comparisons do not mismatch information between versions that may have + changed. + + 4.3 Proposed Framework + + Theexperiment-impact-trackerrequires a simple code change to automatically gather available metrics and a script to + generate online appendices for reporting the data. Currently, on compatible Linux systems, we gather. 
+ + •all python packages and version numbers + •CPU and GPU hardware information + •experiment start and end-times + •the version of theexperiment-impact-trackerframework used + •the energy grid region the experiment is being run in (based on IP address) + •the average carbon intensity in the energy grid region + •CPU- and GPU-package power draw + •per-process utilization of CPUs and GPUs + •GPU performance states + •memory usage + •the realtime CPU frequency (in Hz) + •realtime carbon intensity (only supported in CA right now) + •disk write speed + + The code change required for immediate logging of metrics can be seen in Listing 1. In the background, the framework + launches a thread which polls system supported tools. For example, the thread pollspsutil(Rodola,2016) for measuring + CPU utilization. All of these metrics are logged in parallel with the main machine learning process as described in + Figure1. A script 12 is provided to generate an HTML web page showing graphs and tables for all these metrics, meant + to serve as an online appendix for research papers. 13 Results in the generated appendix can be aggregated across + multiple experiments to show averages along with standard error as recommended in prior work (Henderson et al., + 2018;Colas et al.,2018;Reimers and Gurevych,2017). + + 10 Seehttps.//breakend.github.io/experiment-impact-tracker/contributing_new_metric.html + 11 Seehttps.//breakend.github.io/experiment-impact-tracker/contributing_carbon_region.html. + 12 https.//github.com/Breakend/experiment-impact-tracker/blob/master/scripts/create-compute-appendix + 13 Appendices generated by our framework for Figure7and Figure3are available at.https.//breakend.github.io/ClimateChangeFromMachineLearningResearch/measuring_and_mitigating_energy_and_carbon_footprints_in_machine_learning/. Experiments in Figure5are available athttps.//breakend.github.io/RL-Energy-Leaderboard/ + reinforcement_learning_energy_leaderboard/index.html. + + <> + + Listing 1. Simple code addition required to log experiment details via our framework. + + + + <> + + Figure 1. A diagram demonstrating how the released version of the tool works. The main process launches a monitoring + thread which iterates over a list of metrics associated with function calls to other tools. For example, if available, we + call Intel RAPL to collect CPU power draw or querycaiso.orgto get realtime carbon intensity data for California. + Once all the data that is compatible with the current system is gathered, it is logged to a standardized log file and the + process repeats. The main thread may check in on this thread for exceptions, but the thread will not interrupt the main + process. Once the main thread exits, anatexithook (which is called whenever the main process exits, either successfully + or through an exception) gathers the final information (such as the time the experiment ended), logs it, and then ends + both the monitor and main process. + + + 4.3.1 Tracking Energy Consumption + Different hardware vendors provide different tooling for tracking energy consumption. Our framework hides these + complications from users. We currently use Intel’s RAPL tool with the powercap interface (David et al.,2010) to gather + CPU/DRAM power draw and Nvidia’snvidia-smi 14 for GPU power draw. We usepsutilfor gathering per-process CPU + utilization andnvidia-smifor per-process GPU utilization. 
We found that on a shared machine – as when running a job on Slurm – using Intel's RAPL would
provide energy metrics for the entire machine (including other jobs running on the worker). If two
experiments were launched with Slurm to the same worker, using measurements from RAPL without
corrections would double count energy usage from the CPU.
As a result, we assign energy credits on a per-process basis (though we log system-wide information
as well). We track the parent process, and any children spawned. Power credits are provided based
on relative usage of system resources. If a process uses 25% of the CPU (relative to the entire
system's usage), we will credit the process with 25% of the CPU-based power draw. This ensures that
any non-experiment-related background processes – software updates, weekly jobs, or multiple
experiments on the same machine – will not be taken into account during training.

14 https://developer.nvidia.com/nvidia-system-management-interface

We calculate total energy as:

e_total = PUE * sum over resources of (p_resource * e_resource)   (1)

where p_resource is the percentage of each system resource used by the attributable processes
relative to the total in-use resources and e_resource is the energy usage of that resource. This is
the per-process equivalent of the method which Strubell et al. (2019) use. We assume the same
constant power usage effectiveness (PUE) as Strubell et al. (2019). This value compensates for
excess energy from cooling or heating the data-center.

4.3.2 Carbon Accounting

<
>

Figure 2. Realtime carbon intensity (gCO2eq / kWh) collected during one experiment using our
framework. As the experiment continued, the sun rose in California, and with it the carbon
intensity decreased.

For calculating carbon emissions, we use the power estimate from the previous section in
kilowatt-hours (kWh) and multiply it by the carbon intensity of the local energy grid
(gCO2eq / kWh). To gather carbon intensity metrics for energy grids, we build on the open-source
portions of https://www.electricitymap.org and define regions based on map-based geometries, using
the smallest bounding region for a given location as the carbon intensity estimate of choice. For
example, for an experiment run in San Francisco, if the average carbon intensity is available for
both the USA and California, the latter will be used. We estimate the region the experiment is
conducted in based on the machine's IP address. Carbon intensities are gathered from the average
fallback values provided in the https://www.electricitymap.org code where available and
supplemented with additional metrics from various governmental or corporate reports. We note that
electricitymap.org estimates are based on a closed-source system and use the methodology described
by Tranberg et al. (2019). All estimates from electricitymap.org are of the regional supply, rather
than production (accounting for imports from other regions). Since https://caiso.com provides
realtime intensities including imports for free, for experiments run in California, we also provide
realtime carbon intensity information. We do this by polling https://caiso.com for the current
intensity of the California energy grid every five minutes. This helps gather even more accurate
estimates of carbon emissions to account for daily shifts in supply. For example, experiments run
in California during the day time use roughly 2/3 the carbon emissions of night-time experiments.
This is because much of California's renewable energy comes from solar plants. Figure 2 is an
automatically generated graph showing this phenomenon from an experiment using our framework. We
hope that as users find more accurate realtime or average measurements of regional supply-based
carbon intensities, they will add them to the tool for even more accurate measurements in the
future.

5 The Importance and Challenges of Accounting: Why a New Framework?

5.1 FPOs Can Be Misleading

Floating Point Operations (FPOs) are the de facto standard for reporting "efficiency" of a deep
learning model (Schwartz et al., 2019), and intuitively they should be correlated with energy
efficiency – after all, fewer operations should result in faster and more energy efficient
processing. However, this is not always the case.
Previously, Jeon and Kim (2018) demonstrated mechanisms for constructing networks with larger FPOs,
but lower inference time – discussing the "Trap of FLOPs". Similarly, Qin et al. (2018) show how
depthwise 3x3 convolutions comprised just 3.06% of an example network's multiply-add operations,
while utilizing 82.86% of the total training time in the FPO-efficient MobileNet architecture of
Howard et al. (2017). Underlying optimizations at the firmware, deep learning framework, memory, or
even hardware level can change energy efficiency and run-time. This discrepancy has led to GitHub
issues where users expect efficiency gains from FPO-efficient operations, but do not observe
them. 15
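To make the "Trap of FLOPs" concrete, the following small PyTorch sketch (not from the paper;
the layer shape, batch size, and repetition count are illustrative assumptions) counts multiply-adds
analytically for a standard 3x3 convolution and a depthwise-separable block, and times both on CPU:

import time
import torch
import torch.nn as nn

N, C_in, C_out, H, W, k = 8, 128, 128, 56, 56, 3   # illustrative layer shape, not from the paper

standard = nn.Conv2d(C_in, C_out, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(C_in, C_in, k, padding=1, groups=C_in),  # depthwise 3x3
    nn.Conv2d(C_in, C_out, 1),                          # pointwise 1x1
)

def macs_standard():
    # multiply-adds per example for a dense 3x3 convolution that preserves H x W
    return C_in * C_out * k * k * H * W

def macs_separable():
    # depthwise 3x3 plus pointwise 1x1
    return C_in * k * k * H * W + C_in * C_out * H * W

def wall_time(module, reps=10):
    x = torch.randn(N, C_in, H, W)
    with torch.no_grad():
        module(x)                      # warm-up pass
        start = time.perf_counter()
        for _ in range(reps):
            module(x)
    return (time.perf_counter() - start) / reps

print(f"standard : {macs_standard():.2e} MACs/example, {wall_time(standard) * 1e3:.1f} ms/batch")
print(f"separable: {macs_separable():.2e} MACs/example, {wall_time(separable) * 1e3:.1f} ms/batch")

On a typical CPU the measured wall-clock gap is usually much smaller than the roughly 8x gap in
MACs, which is exactly the mismatch between FPO counts and realized cost discussed above.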
Example 3 To investigate this empirically, we repeatedly run inference through pre-trained image
classification models and measure FPOs, parameters, energy usage, and experiment length using the
experiment-impact-tracker framework. As described in Figure 3, we find little correlation between
FPOs and energy usage or experiment runtime when comparing across different neural network
architectures. However, within an architecture – relying on the same operation types, but with
different numbers of operations – FPOs are almost perfectly correlated with energy and runtime
efficiency. Thus, while FPOs are useful for measuring relative ordering within architecture
classes, they are not adequate on their own to measure energy or even runtime efficiency.

<
> + + Figure 3. We run 50,000 rounds of inference on a single sampled image through pre-trained image classification models + and record kWh, experiment time, FPOs, and number of parameters (repeating 4 times on different random seeds). + References for models, code, and expanded experiment details can be found in AppendixD. We run a similar analysis + toCanziani et al.(2016) and find (left) that FPOs are not strongly correlated with energy consumption (R2 = 0.083, + Pearson 0.289) nor with time (R2 = 0.005, Pearson 0.074) when measured across different architectures. However, + within an architecture (right) correlations are much stronger. Only considering different versions of VGG, FPOs are + strongly correlated with energy (R2 =.999, Pearson 1.0) and time (R2 =.998, Pearson .999). Comparing parameters + against energy yields similar results (see AppendixDfor these results and plots against experiment runtime). + + + 5.2 Estimates with Partial Information Can Be Inaccurate + + The current state of accounting for energy and carbon varies across fields and papers (see Section 3). Few works, if any, + report all of the metrics that our framework collects. However, it is possible to extrapolate energy and carbon impacts + from some subsets of these metrics. This can give a very rough approximation of the energy used by an experiment in + kWh (see Section 3 for background). + + Example 4 We demonstrate how several such estimation methods compare against the more fine-grained accounting + methods we describe in Section4.16 As seen in Figure4, we find significant differences from when we track all data + (as through theexperiment-impact-trackerframework) to when we use partial data to extrapolate energy and carbon + emissions. Only using GPUs and the experiment time ignores memory or CPU effects; only using the average case US + region ignores regional differences. More details for this experiment can be found in AppendixE. + + We also note that the possible estimation differences in Figure4do not include possible errors from counting multiple + processes at once, as described in Section4.3.1. Clearly, without detailed accounting, it is easy to severely over- or + underestimate carbon or energy emissions by extrapolating from partial information. + 15 See for example.https.//github.com/tensorflow/tensorflow/issues/12132andhttps.//github.com/tensorflow/tensorflow/issues/12940 + 16 We also provide a script to do the rough calculation of energy and carbon footprints based on GPU type, IP address (which + is used to retrieve the location of the machine and that region’s carbon intensity), experiment length, and utilization factor. + https.//github.com/Breakend/experiment-impact-tracker/blob/master/scripts/get-rough-emissions-estimate + + <
> + + Figure 4. We compare carbon emissions (left) and kWh (right) of our Pong PPO experiment (see AppendixEfor more + details) by using different estimation methods. By only using country wide or even regional average estimates, carbon + emissions may be over or under-estimated (respectively). Similarly, by using partial information to estimate energy + usage (right, for more information about the estimation methods see AppendixE), estimates significantly differ from + when collecting all data in real time (as in our method). Clearly, without detailed accounting, it is easy to over- or + under-estimate carbon or energy emissions in a number of situations. Stars indicate level of significance. * p < .05, ** p + < .01, *** p < .001, **** p < .0001. Annotation provided via.https.//github.com/webermarcolivier/statannot. + + + 6 Encouraging Efficiency and Mitigating Carbon Impacts. Immediate Mitigation Strategies + + With experiment-impact-tracker, we hope to ease the burden of standardized reporting. We have demonstrated + differences in more detailed estimation strategies from the current status quo. In this Section, we examine how accurate + reporting can be used to drive immediate mitigating strategies for energy consumption and carbon emissions. + + 6.1 Energy Efficiency Leaderboards + + A body of recent work has emphasized making more computationally efficient models (Wu et al.,2019;Coleman + et al.,2019;Jiang et al.,2019), yet another line of work has focused on the opposite. building larger models with + more parameters to tackle more complex tasks (Amodei and Hernandez,2018;Sutton,2019). We suggest leaderboards + which utilize carbon emissions and energy metrics to promote an informed balance of performance and efficiency. + DawnBench (Wu et al.,2019) has done this in terms of runtime and cost, 17 but by doing the same for energy and carbon + emissions, baseline implementations can converge to more efficient climate-friendly settings. This can also help spread + information about the most energy and climate-friendly combinations of hardware, software, and algorithms such that + new work can be built on top of these systems instead of more energy-hungry configurations. + A Deep RL Energy Leaderboard. + To demonstrate how energy leaderboards can be used to disseminate information on energy efficiency, we create a Deep + RL Energy Leaderboard. 18 The website is generated using the same tool for creating HTML appendices described in + Section4. All information (except for algorithm performance on tasks) comes from theexperiment-impact-tracker + framework. We populate the leaderboard for two common RL benchmarking environments, PongNoFrameskip-v4 and + BreakNoFrameskip-v4 (Bellemare et al.,2013;Brockman et al.,2016;Mnih et al.,2013), and four baseline algorithms, + PPO (Schulman et al.,2017), A2C (Mnih et al.,2016), A2C with V-Traces (Espeholt et al.,2018;Dalton et al.,2019), + and DQN (Mnih et al.,2013). The experimental details and results can also be found in Figure5. We find that no + algorithm is the energy efficiency winner across both environments, though the PPO implementation provided byHill + et al.(2018) attains balance between efficiency and performance when using default settings across algorithms. + + Example 5To see how such a leaderboard might help save energy, consider a Deep RL class of 235 students. 19 For a + homework assignment, each student must run an algorithm 5 times on Pong. 
The class would save 888 kWh of energy

17 For image classification and question answering tasks.
18 https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html
19 See, for example, Stanford's CS 234.

<
> + + Figure 5. We evaluate A2C, PPO, DQN, and A2C+VTraces on PongNoFrameskip-v4 (left) and BreakoutNoFrameskip- + v4 (right), two common evaluation environments included in OpenAI Gym. We train for only 5M timesteps, less than + prior work, to encourage energy efficiency and evaluate for 25 episodes every 250k timesteps. We show the Average + Return across all evaluations throughout training (giving some measure of both ability and speed of convergence of an + algorithm) as compared to the total energy in kWh. Weighted rankings of Average Return per kWh place A2C+Vtrace + first on Pong and PPO first on Breakout. Using PPO versus DQN can yield significant energy savings, while retaining + performance on both environments (in the 5M samples regime). See AppendixFfor more details and results in terms of + asymptotic performance. + + + by using PPO versus DQN, while achieving similar performance. 20 This is roughly the same amount needed to power a + US home for one month. 21 + + We, thus, encourage the community to submit more data to the leaderboard to find even more energy efficient algorithms + and configurations. + + 6.2 Running In Carbon-Friendly Regions + + We noted in Section4that it is important to assess which energy grid experiments are run on due to the large differences + in carbon emissions between energy grids. Figure6showsCO 2eq intensities for an assortment of locations, cloud- + provider regions, and energy production methods. We note that an immediate drop in carbon emission can be made by + moving all training jobs to carbon-efficient energy grids. In particular, Quebec is the cleanest available cloud region + to our knowledge. Running a job in Quebec would result in carbon emission 30x lower than running a job in Estonia + (based on 2017 averages). + + Example 6To demonstrate this in practice, we run inference on two translation models 1000 times and measure energy + usage. We extrapolate the amount of emissions and the difference between the two algorithms if run in different energy + grids, seen in Figure7. The absolute difference in emissions between the two models is fairly small (though significant) + if run in Quebec (.09 gCO 2eq ), yet the gap increases as one runs the jobs in less carbon-friendly regions (at 3.04 g + CO 2eq in Estonia). + + We provide a script with our framework to show all cloud provider region with emission statistics to make this decision- + making process easier. 22 We note thatLacoste et al.(2019) provide a website using partial information estimation to + extrapolate carbon emissions based on cloud provider region, GPU type, and experiment length in hours. Their tool + may also be used for estimating carbon emissions in cloud-based experiments ahead of time. + For companies that train and deploy large models often, shifting these resources is especially important. ML training + is not usually latency bound. companies can run training in cloud regions geographically far away since training + models usually does not require round trip communication requirements. Contrary to some opinions, 23 there is not a + necessary need to eliminate computation-heavy models entirely, as shifting training resources to low carbon regions will + immediately reduce carbon emissions with little impact to production systems. For companies seeking to hit climate + + 20 These rankings may change with different code-bases and hyperparameters. 
21 https://www.eia.gov/tools/faqs/faq.php?id=97&t=3
22 See the get-region-emissions-info and lookup-cloud-region-info scripts.
23 https://www.theguardian.com/technology/2019/sep/17/tech-climate-change-luddites-data

<
> + + Figure 6. Carbon Intensity (gCO 2eq /kWh) of selected energy grid regions is shown from least carbon emissions (left) to + most carbon emissions (right). Red/unshaded boxes indicate carbon intensities of cloud provider regions. Blue/shaded + boxes indicate carbon intensities of various generation methods. Oil shale is the most carbon emitting method of energy + production in the Figure. Estonia is powered mainly by oil shale and thus is close to it in carbon intensity. Similarly, + Québec is mostly powered by hydroelectric methods and is close to it in carbon intensity. Cloud provider carbon + intensities are based on the regional energy grid in which they are located. Thus, us-west-1, located in California, has + the same carbon intensity as the state. Seehttps.//github.com/Breakend/experiment-impact-tracker/for + data sources of regional information. Energy source information fromKrey et al.(2014);International Energy Agency + (2015). + + + + change policy targets, promotion of carbon neutral regions and shifting of all machine learning systems to those regions + would accelerate reaching targets significantly and reduce the amount of offset purchasing required to meet goals (thus + saving resources). 24 It is worth noting that some companies like Google already purchase offsets (Google,2016), so it + may be unclear why shifting resources is necessary. We provide an extended discussion on this in AppendixC. As a + matter of total emissions reductions, running compute in carbon-friendly regions prevents emissions now, while offsets + may not come into effect for several years. Moreover, continuing offset purchasing at current levels, while shifting + resources to green regions would result in a net-negative carbon footprint. + + + + 7 Discussion. Systemic Changes + + + We demonstrated several use cases for accounting which can drive immediate mitigation strategies. However, the + question remains. how can we encourage systemic changes which lead to energy and carbon efficiency in ML systems? + + + 7.1 Green Defaults for Common Platforms and Tools + + Energy leaderboards help provide information on energy efficient configurations for the whole stack. However, to truly + spread energy efficient configurations, underlying frameworks should by default use the most energy-efficient settings + possible. This has been shown to be an effective way to drive pro-environmental behavior (Pichert and Katsikopoulos, + 2008). For example, Nvidia apex provides easy mixed-precision computing as an add-on which yields efficiency + gains. 25 However, it requires knowing this and using it.Merity(2019) also discusses the current difficulties in using + highly efficient components. Making such resources supported as defaults in frequently used frameworks, like PyTorch, + would immediately improve the efficiency of all downstream projects. We encourage maintainers of large projects to + prioritize and support such changes. + + + + 24 See, for example, Amazon’s goal.https.//press.aboutamazon.com/news-releases/news-release-details/amazon-co-founds-climate- + pledge-setting-goal-meet-paris + 25 https.//github.com/NVIDIA/apex + + <
> + + Figure 7. We use pre-trained En-Fr translation models downloaded from PyTorch Hub. a convolutional network (Gehring + et al.,2017) and transformer (Ott et al.,2018). We generate 1000 random sequences either between 3-50 words in + length using the essential_generators Python package.https.//pypi.org/project/essential-generators/. + We repeat with 20 random seeds. [Left] We show the true difference in energy consumption. [Right] We show estimated + kgCO 2eq released if the experiment had been conducted in a number of increasingly carbon-intensive energy grids. + Differences remain significant throughout, but the absolute difference increases as more carbon-intensive regions are + assumed. + + 7.2 How much is your performance gain worth? Balancing gains with cost + + While training jobs can easily be shifted to run in clean regions, there are often restrictions for inference-time use of + machine learning models which prevent such a move. Many companies are deploying large machine learning models + powered by GPUs for everyday services. + + Example 7 Production translation services, can process 100B words per day (Turovsky,2016). roughly 4.2 million + times our experiment in Figure 7. If all translation traffic were in Estonia, 12,768 kgCO 2eq (the carbon sequestered by + 16.7 acres of forest in one year (Agency,2008)) would be saved per day by using the more efficient model, yet if all + traffic were in Québec, 378 kgCO 2eq would be saved (the carbon sequestered by .5 acres of forest in one year (Agency, + 2008)). Considering the amounts of required compute, small differences in efficiency can scale to large emissions and + energy impacts. + + These services are latency-bound at inference time and thus cannot mitigate carbon emissions by shifting to different + regions. Instead, energy-efficiency is key. We encourage companies to consider weighing energy costs (both social and + monetary) with the performance gains of a new model before deploying it. In the case of our translation experiment in + Figure7, the pre-trained convolutional model we use is significantly more energy hungry across than the transformer + model we use. When deploying a new energy-hungry translation model, we ask companies to consider is the BLEU + score improvement really worth the energy cost of deploying it? Are there ways to route to different models to balance + this trade-off? For example, suppose an energy-hungry model only improves performance in some subset of the data. + Routing to this model only in that subset would maximize performance while minimizing energy footprint. We note + that considering such trade-offs is of increased importance for models aiming to reduce carbon emissions as described + by Rolnick et al.(2019). Deploying a large deep learning model for, say, improving the energy efficiency of a building, + is not worth it if the energy costs of the model outweigh the gains. We also leave an open question to economists to + help assess the welfare benefits of gains on a particular machine learning metric (e.g., how much is BLEU score worth + in a translation service). This would allow the social welfare of the metric to be balanced against the social cost of + carbon (Ricke et al.,2018) for deployment decisions. + Central to all of these cost-benefit analyses are accurate accounting. Our tool provides one step in consistent and + accurate accounting for such purposes. 
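The arithmetic behind Example 7 can be reproduced directly from the numbers above; here is a short
Python sketch (the per-experiment gaps come from Example 6 / Figure 7, and the 4.2-million scaling
factor is the paper's own estimate for 100B translated words per day):

# Back-of-the-envelope arithmetic behind Example 7, scaling the per-1000-sequence emission gap
# between the two translation models (Example 6 / Figure 7) up to ~4.2 million experiments/day.
gap_g_per_1000_seqs = {"Estonia": 3.04, "Quebec": 0.09}  # g CO2eq saved per 1000 sequences
daily_scale = 4.2e6                                       # experiments per day of production traffic

for region, gap_g in gap_g_per_1000_seqs.items():
    saved_kg_per_day = gap_g * daily_scale / 1000.0       # grams -> kilograms
    print(f"{region}: ~{saved_kg_per_day:,.0f} kg CO2eq saved per day by the efficient model")
# Estonia: ~12,768 kg/day and Quebec: ~378 kg/day, matching the figures quoted in Example 7.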
 7.3 Efficient testing environments

 In Section 7.1 we discuss the adoption of green default configurations, and Section 7.2 discusses cost-benefit analyses for deployments. Another consideration particular to research – especially RL – is the selection of the most efficient testing environments which assess the mechanism under test. For example, if an RL algorithm solves a particularly complex task in an interesting way, like solving a maze environment, is there a way to demonstrate the same phenomenon in a more efficient environment? Several works have developed efficient versions of RL environments which reduce run-times significantly. In particular, Dalton et al. (2019) improve the efficiency of Atari experiments by keeping resources on the GPU (and thus avoiding energy and time overheads from moving memory back and forth). Chevalier-Boisvert et al. (2018) develop a lightweight Grid World environment with efficient runtimes for low-overhead experiments. An important cost-benefit question for researchers is whether the same point can be proven in a more efficient setting.

 7.4 Reproducibility

 A key aspect of our work is helping to promote reproducibility by aiding in consistent reporting of experimental details. We encourage all researchers to release code and models (when it is socially and ethically responsible to do so) to prevent further carbon emissions. Replicating results is an important, if not required, part of research. If replication resources are not available, then more energy and emissions must be spent to replicate results – in the case of extremely large models, the social cost of carbon may be equivalently large. Thus, we ask researchers to also consider the energy and environmental impacts of replication efforts when weighing model and code release. We note that there may very well be cases where safety makes this trade-off lean in the direction of withholding resources, but this is likely rare in most current research. For production machine learning systems, we encourage developers to release models and codebases internally within a company. This may encourage re-use rather than spending energy resources developing similar products.

 26 See, for example, search, which now uses transformer networks at both Microsoft and Google: https://www.blog.google/products/search/search-language-understanding-bert/ and https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
 27 Efficient routing of traffic to regions has been considered before by Nguyen et al. (2012) and Berral et al. (2010). It may be worth considering efficient routing of traffic to particular models as well.

 7.5 Standardized reporting

 We suggest that all papers include standardized reporting of energy and carbon emissions. We also suggest adding a Carbon Impact Statement at the end of papers (just like ours below) which estimates the carbon emissions of the paper. This can be reported in a dollar amount via the country-specific social cost of carbon (Ricke et al., 2018). We provide a script 28 to parse logs from the experiment-impact-tracker framework and generate such a statement automatically. We suggest this to spread awareness and bring such considerations to the forefront. We also emphasize that research, even when compute intensive, is immensely important for progress.
It is unknown what sequence of papers may inspire a breakthrough (Stanley and Lehman, 2015) which would reduce emissions by more than any suggestion here. While emissions should be minimized when possible, we suggest that impact statements be used only for awareness.

 We also suggest that, when developing features which visualize compute intensity for cloud or internal workloads, developers consider providing built-in tools to visualize energy usage and carbon emissions. For example, the Colab Research Environment shows RAM and Disk capacity, 29 but could also show and provide access to these other metrics more easily. Providing similar informational labels (Byerly et al., 2018) within internal tooling could mitigate some energy and carbon impacts within companies.

 7.6 Badging

 Informational labeling has had a long history of being used in public policy (Banerjee and Solomon, 2003). In the USA, the "Energy Star" label has been used to guide customers to eco-friendly products. More recently, "badges" rewarded by the Psychological Science journal were shown to be effective, with a jump from 3% of articles reporting open data to 39% one year later. ACM has introduced similar reproducibility badges. 30 With consistent reporting of carbon and energy metrics, climate-friendly research badges can be introduced by conferences to recognize any paper that demonstrates a significant effort to mitigate its impacts. For example, a compute-intensive paper, when showing evidence of explicitly running resources in a clean region, can be rewarded with such a badge. Another badge could be awarded to papers that create energy-friendly algorithms with performance similar to the state-of-the-art. 31 The goal of these badges is to draw further attention to efficient versions of state-of-the-art systems and to encourage mitigation efforts while, again, not punishing compute-intensive experiments.

 7.7 Driver and Implementation Difficulties

 The experiment-impact-tracker framework abstracts away many of the previously mentioned difficulties in estimating carbon and energy impacts: it handles routing to appropriate tools for collecting information, aggregates information across tools to handle carbon calculations, finds carbon intensity information automatically, and corrects for multiple processes on one machine. Yet a few other challenges, which remain difficult to circumvent, may be hidden by using the framework.

 As Khan et al. (2018) discuss, and as we encountered ourselves, poor driver support makes tracking energy difficult. Not every chipset supports RAPL, nor does every Linux kernel. Neither NVIDIA nor Intel provides a first-party supported Python library for access to measurements. nvidia-smi per-process measurements in Docker containers are not supported. 32 A body of work has also looked at improving estimates of energy usage from RAPL by fitting a regression model to real energy usage patterns (Povoa et al., 2019; Kavanagh and Djemame, 2019; Ghosh et al., 2013; Song et al., 2013). The Slurm workload manager provides an energy accounting plugin, 33 but adding it requires administrator access. For those without access to Slurm, Intel's RAPL supports access to measurements through three mechanisms, but only one of these (the powercap interface, only available on Linux systems) does not require root access (see more discussion by Khan et al. (2018)); a minimal sketch of reading this interface is shown below.
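
 The following sketch illustrates the powercap mechanism mentioned above by reading the RAPL energy counters exposed under /sys/class/powercap. Domain names, file permissions, and counter wrap-around behavior vary across kernels and chipsets, so treat this as a minimal sketch of the interface rather than the accounting logic used inside experiment-impact-tracker.

 import glob
 import time

 def read_rapl_energy_uj():
     """Read cumulative package energy counters (microjoules) from the powercap interface."""
     readings = {}
     for domain in glob.glob("/sys/class/powercap/intel-rapl:*"):
         try:
             with open(f"{domain}/name") as f:
                 name = f.read().strip()
             with open(f"{domain}/energy_uj") as f:
                 readings[name] = int(f.read().strip())
         except (OSError, PermissionError):
             pass  # some kernels restrict read access to these files
     return readings

 # Estimate average power over a short interval by differencing the counters.
 before = read_rapl_energy_uj()
 time.sleep(1.0)
 after = read_rapl_energy_uj()
 for name in before:
     delta_joules = (after[name] - before[name]) / 1e6  # counters may wrap; ignored here
     print(f"{name}: ~{delta_joules:.2f} J over 1 s (~{delta_joules:.2f} W)")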
To promote widespread reporting, we avoid any tool which requires administrative access or would not be accessible on most Linux systems. Providing better-supported tools for user-level access to power metrics would make it possible to measure energy usage more robustly. Aggregating metrics and handling the intricacies of these downstream tools requires time and knowledge. We try to abstract as many of these challenges away as possible in the experiment-impact-tracker, though some driver-related issues require driver developer support.

 28 https://github.com/Breakend/experiment-impact-tracker/blob/master/scripts/generate-carbon-impact-statement
 29 https://colab.research.google.com/
 30 https://www.acm.org/publications/policies/artifact-review-badging
 31 See, for example, Clark et al. (2020), which creates a more efficient version of text encoder pre-training.
 32 https://github.com/NVIDIA/nvidia-docker/issues/179#issuecomment-242150861
 33 https://slurm.schedmd.com/acct_gather_energy_plugins.html

 We also note that carbon intensities for machines in cloud data centers may not reflect the regional carbon intensities. Some providers buy clean energy directly for some data centers, changing the realtime energy mix for that particular data center. We were unable to find any information regarding realtime energy mixes in such cases and thus could not account for these scenarios. If providers exposed realtime APIs for such information, this would help in generating more accurate estimates. Moreover, customized hardware in cloud provider regions does not always provide energy accounting mechanisms or interfaces. If cloud providers supported libraries for custom hardware, this could be used for more detailed accounting in a wider range of cloud-based compute scenarios.

 8 Concluding Remarks and Recommendations

 We have shown how the experiment-impact-tracker and associated tools can help ease the burden of consistent accounting and reporting of energy, compute, and carbon metrics; we encourage contributions to help expand the framework. We hope the Deep RL Energy Leaderboard helps spread information on energy-efficient algorithms and encourages research in efficiency. While we focus on the compute impacts of machine learning production and research, a plethora of other work considers the costs of transportation for conferences (Holden et al., 2017; Spinellis and Louridas, 2013; Bossdorf et al., 2010) and compute hardware manufacturing (Venkatesan, 2015). We encourage researchers and companies to consider these other sources of carbon impacts as well. Finally, we recap several points that we have highlighted for mitigating emissions and supporting consistent accountability.

 What can machine learning researchers do?

 •Run cloud jobs in low carbon regions only (see Section 6.2).
 •Report metrics as we do here; make energy-efficient configurations more accessible by reporting these results (see Section 7.5).
 •Work on energy-efficient systems and create energy leaderboards (see Section 6).
 •Release code and models whenever safe to do so (see Section 7.4).
 •Integrate energy-efficient configurations as defaults in baseline implementations (see Section 7.1).
 •Encourage climate-friendly initiatives at conferences (see Sections 7.6 and 7.5).

 What can industry machine learning developers and framework maintainers do?

 •Move training jobs to low carbon regions immediately. Make default launch configurations and documentation point to low carbon regions (see Section 6.2).
•Provide more robust tooling for energy tracking and carbon intensities (see Section 7.7).
 •Integrate energy-efficient operations as defaults in frameworks (see Section 7.1).
 •Release code and models (even just internally in the case of production systems) whenever safe to do so (see Section 7.4).
 •Consider energy-based costs versus benefits of deploying new models (see Section 7.2).
 •Report model-related energy metrics (see Section 7.5).

 We hope that, regardless of which tool is used to account for carbon and energy emissions, the insights we provide here will help promote responsible machine learning research and practices.

 Carbon Impact Statement

 This work contributed 8.021 kg of CO2eq to the atmosphere and used 24.344 kWh of electricity, having a USA-specific social cost of carbon of $0.38 ($0.00, $0.95). Carbon accounting information can be found here: https://breakend.github.io/ClimateChangeFromMachineLearningResearch/measuring_and_mitigating_energy_and_carbon_footprints_in_machine_learning/ and https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html. The social cost of carbon uses models from Ricke et al. (2018). This statement and carbon emissions information were generated using the experiment-impact-tracker framework described in this paper.

 References

 US Environmental Protection Agency. Greenhouse gas equivalencies calculator, 2008. URL https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator.
 Judith I Ajani, Heather Keith, Margaret Blakers, Brendan G Mackey, and Helen P King. Comprehensive carbon stock and flow accounting: a national framework to support climate change mitigation policy. Ecological Economics, 89:61–72, 2013.
 Dario Amodei and Danny Hernandez. AI and Compute. https://blog.openai.com/openai-five/, 2018.
 Jane Andrew and Corinne Cortese. Accounting for climate change and the self-regulation of carbon disclosures. In Accounting Forum, volume 35, pages 130–138. Taylor & Francis, 2011.
 Mahmoud ("Mido") Assran, Joshua Romoff, Nicolas Ballas, Joelle Pineau, and Mike Rabbat. Gossip-based actor-learner architectures for deep reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13299–13309. Curran Associates, Inc., 2019.
 Miguel F. Astudillo and Hessam AzariJafari. Estimating the global warming emissions of the LCAXVII conference: connecting flights matter. The International Journal of Life Cycle Assessment, 23(7):1512–1516, Jul 2018. ISSN 1614-7502.
 Abhijit Banerjee and Barry D Solomon. Eco-labeling for energy efficiency and sustainability: a meta-evaluation of US programs. Energy Policy, 31(2):109–123, 2003.
 Valentin Bellassen and Nicolas Stephan. Accounting for Carbon. Cambridge University Press, 2015.
 Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Josep Ll. Berral, Íñigo Goiri, Ramón Nou, Ferran Julià, Jordi Guitart, Ricard Gavaldà, and Jordi Torres. Towards energy-aware scheduling in data centers using machine learning. In Proceedings of the 1st International Conference on Energy-Efficient Computing and Networking, e-Energy '10, page 215–224, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300421.
+ Thomas Boquet, Laure Delisle, Denis Kochetkov, Nathan Schucher, Parmida Atighehchian, Boris Oreshkin, and + Julien Cornebise. DECoVaC. Design of Experiments with Controlled Variability Components. arXiv preprint + arXiv.1909.09859, 2019. + Oliver Bossdorf, Madalin Parepa, and Markus Fischer. Climate-neutral ecology conferences. just do it!Trends in + Ecology & Evolution, 25(2).61, 2010. + Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. + OpenAI Gym, 2016. + Hilary Byerly, Andrew Balmford, Paul J Ferraro, Courtney Hammond Wagner, Elizabeth Palchak, Stephen Polasky, + Taylor H Ricketts, Aaron J Schwartz, and Brendan Fisher. Nudging pro-environmental behavior. evidence and + opportunities.Frontiers in Ecology and the Environment, 16(3).159–168, 2018. + Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical + applications.arXiv preprint arXiv.1605.07678, 2016. + Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. Hardnet. A low memory traffic + network. InProceedings of the IEEE International Conference on Computer Vision, pages 3552–3561, 2019. + Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic Gridworld Environment for OpenAI Gym. + https.//github.com/maximecb/gym-minigrid, 2018. + Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. {ELECTRA}. Pre-training text encoders + as discriminators rather than generators. InInternational Conference on Learning Representations, 2020. + Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How many random seeds? statistical power analysis in deep + reinforcement learning experiments.arXiv preprint arXiv.1806.08295, 2018. + Cody Coleman, Daniel Kang, Deepak Narayanan, Luigi Nardi, Tian Zhao, Jian Zhang, Peter Bailis, Kunle Olukotun, + Chris Ré, and Matei Zaharia. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance + Benchmark.SIGOPS Oper. Syst. Rev., 53(1).14–25, July 2019. ISSN 0163-5980. + Julie Cotter, Muftah Najah, and Shihui Sophie Wang. Standardized reporting of climate change information in australia. + Sustainability accounting, management and policy journal, 2(2).294–321, 2011. + Thomas J Crowley. Causes of climate change over the past 1000 years.Science, 289(5477).270–277, 2000. + Steven Dalton, Iuri Frosio, and Michael Garland. GPU-Accelerated Atari Emulation for Reinforcement Learning, 2019. + Howard David, Eugene Gorbatov, Ulf R Hanebutte, Rahul Khanna, and Christian Le. RAPL. memory power estimation + and capping. In2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), pages + 189–194. IEEE, 2010. + Miyuru Dayarathna, Yonggang Wen, and Rui Fan. Data center energy consumption modeling. A survey. IEEE + Communications Surveys & Tutorials, 18(1).732–794, 2015. + Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, + Tim Harley, Iain Dunning, et al. IMPALA. Scalable Distributed Deep-RL with Importance Weighted Actor-Learner + Architectures. InInternational Conference on Machine Learning, pages 1406–1415, 2018. + David Gefen and Detmar W Straub. The relative importance of perceived ease of use in is adoption. A study of + e-commerce adoption.Journal of the association for Information Systems, 1(1).8, 2000. + Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence + learning. 
InProceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. + JMLR. org, 2017. + Sayan Ghosh, Sunita Chandrasekaran, and Barbara Chapman. Statistical modeling of power/energy of scientific kernels + on a multi-gpu system. In2013 International Green Computing Conference Proceedings, pages 1–6. IEEE, 2013. + Google. Google’s Green PPAs. What, How, and Why.https.//static.googleusercontent.com/media/www. + google.com/en//green/pdfs/renewable-energy.pdf, 2013. + Google. Achieving Our 100% Renewable Energy Purchasing Goal and Going Be- + yond. https.//static.googleusercontent.com/media/www.google.com/en//green/pdf/ + achieving-100-renewable-energy-purchasing-goal.pdf, 2016. + Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art. Reproducibility in artificial intelligence. InThirty-Second + AAAI Conference on Artificial Intelligence, 2018. + Leor Hackel and Gregg Sparkman. Evaluating the climate impact of psychological science. Costs and opportunities. + Affective Seminar, 2018. URLhttps.//osf.io/dg5ap/?show=view. + Peter Henderson and Emma Brunskill. Distilling information from a flood. A possibility for the use of meta-analysis + and systematic review in machine learning research. InCritiquing and Correcting Trends in Machine Learning + Workshop (CRACT) at NeurIPS, 2018. + Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement + learning that matters. InThirty-Second AAAI Conference on Artificial Intelligence, 2018. + Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, + Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and + Yuhuai Wu. Stable baselines.https.//github.com/hill-a/stable-baselines, 2018. + Matthew H Holden, Nathalie Butt, Alienor Chauvenet, Michaela Plein, Martin Stringer, and Iadine Chadès. Academic + conferences urgently need environmental policies.Nature ecology & evolution, 2017. + Nicolas Houy. Rational mining limits bitcoin emissions.Nature Climate Change, 9(9).655–655, 2019. + Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, + and Hartwig Adam. Mobilenets. Efficient convolutional neural networks for mobile vision applications.arXiv + preprint arXiv.1704.04861, 2017. + Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional + networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, + 2017. + Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet. + AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size.arXiv preprint arXiv.1602.07360, 2016. + International Energy Agency.CO2 Emissions from Fuel Combustion. 2015. + IPCC.Climate Change 2014. Mitigation of Climate Change. Working Group III Contribution to the IPCC Fifth + Assessment Report. Cambridge University Press, 2015. + IPCC.Global Warming of 1.5 °C. 2018. + Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. InAdvances in Neural + Information Processing Systems, pages 5951–5961, 2018. + Angela H. Jiang, Daniel L. K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, + Michael Kaminksy, Michael Kozuch, Zachary C. Lipton, and Padmanabhan Pillai. 
Accelerating Deep Learning by + Focusing on the Biggest Losers.arXiv e-prints, art. arXiv.1910.00762, Oct 2019. + Alex K Jones, Liang Liao, William O Collinge, Haifeng Xu, Laura A Schaefer, Amy E Landis, and Melissa M Bilec. + Green computing. A life cycle perspective. In2013 International Green Computing Conference Proceedings, pages + 1–6. IEEE, 2013. + Richard Kavanagh and Karim Djemame. Rapid and accurate energy models through calibration with ipmi and rapl. + Concurrency and Computation. Practice and Experience, 31(13).e5124, 2019. + Kashif Nizam Khan, Mikael Hirki, Tapio Niemi, Jukka K. Nurminen, and Zhonghong Ou. RAPL in Action. Experiences + in Using RAPL for Power Measurements.ACM Trans. Model. Perform. Eval. Comput. Syst., 3(2).9.1–9.26, March + 2018. ISSN 2376-3639. + Max J Krause and Thabet Tolaymat. Quantification of energy and carbon costs for mining cryptocurrencies.Nature + Sustainability, 1(11).711, 2018. + V. Krey, O. Masera, G. Blanford, T. Bruckner, R. Cooke, K. Fisher-Vanden, H. Haberl, E. Hertwich, E. Kriegler, + D. Mueller, S. Paltsev, L. Price, S. Schlömer, D. Ürge-Vorsatz, D. van Vuuren, and T. Zwickel. Annex 2 - metrics and + methodology. InClimate Change 2014. Mitigation of Climate Change. IPCC Working Group III Contribution to + AR5. Cambridge University Press, November 2014. URLhttp.//pure.iiasa.ac.at/id/eprint/11109/. + Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural + Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,Advances in Neural Information + Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. + Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of + machine learning.arXiv preprint arXiv.1910.09700, 2019. + Jacob LaRiviere, Gavin Mccormick, and Sho Kawano. How better accounting can more cheaply reduce carbon + emissions.Policy Brief, 4, 2016. + Jens Malmodin, Pernilla Bergmark, and Dag Lundén. The future carbon footprint of the ict and e&m sectors.on + Information and Communication Technologies, page 12, 2013. + Eric Masanet, Arman Shehabi, Nuoa Lei, Harald Vranken, Jonathan Koomey, and Jens Malmodin. Implausible + projections overestimate near-term bitcoin co2 emissions.Nature Climate Change, 9(9).653–654, 2019. + Stephen Merity. Single Headed Attention RNN. Stop Thinking With Your Head.arXiv preprint arXiv.1911.11423, + 2019. + Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin + Riedmiller. Playing Atari With Deep Reinforcement Learning. InNIPS Deep Learning Workshop. 2013. + Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, + and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. InInternational conference on + machine learning, pages 1928–1937, 2016. + Camilo Mora, Randi L Rollins, Katie Taladay, Michael B Kantar, Mason K Chock, Mio Shimada, and Erik C Franklin. + Bitcoin emissions alone could push global warming above 2 °C.Nature Climate Change, 8(11).931, 2018. + Richard G Newell and Juha Siikamäki. Nudging energy efficiency behavior. The role of information labels.Journal of + the Association of Environmental and Resource Economists, 1(4).555–598, 2014. + Kim Khoa Nguyen, Mohamed Cheriet, Mathieu Lemay, Victor Reijs, Andrew Mackarel, and Alin Pastrama. + Environmental-aware virtual data center network.Computer Networks, 2012. 
+ Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. InProceedings of the + Third Conference on Machine Translation. Research Papers, Brussels, Belgium, 2018. Association for Computational + Linguistics. + Daniel Pichert and Konstantinos V. Katsikopoulos. Green defaults. Information presentation and pro-environmental + behaviour.Journal of Environmental Psychology, 28(1).63 – 73, 2008. ISSN 0272-4944. doi. https.//doi.org/10.1016/ + j.jenvp.2007.09.004. URLhttp.//www.sciencedirect.com/science/article/pii/S0272494407000758. + Lucas Venezian Povoa, Cesar Marcondes, and Hermes Senger. Modeling energy consumption based on resource + utilization. InInternational Conference on Computational Science and Its Applications, pages 225–240. Springer, + 2019. + Zheng Qin, Zhaoning Zhang, Dongsheng Li, Yiming Zhang, and Yuxing Peng. Diagonalwise Refactorization. An + Efficient Training Method for Depthwise Convolutions. In2018 International Joint Conference on Neural Networks + (IJCNN), pages 1–8. IEEE, 2018. + Celine Ramstein, Goran Dominioni, Sanaz Ettehad, Long Lam, Maurice Quant, Jialiang Zhang, Louis Mark, Sam + Nierop, Tom Berg, Paige Leuschner, et al. State and trends of carbon pricing 2019, 2019. + Nils Reimers and Iryna Gurevych. Reporting Score Distributions Makes a Difference. Performance Study of LSTM- + networks for Sequence Tagging. InEMNLP, 2017. + Katharine Ricke, Laurent Drouet, Ken Caldeira, and Massimo Tavoni. Country-level social cost of carbon.Nature + Climate Change, 2018. + Giampaolo Rodola. Psutil package. a cross-platform library for retrieving information on running processes and system + utilization, 2016. + David Rolnick, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran, Andrew Slavin + Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan Maharaj, + Evan D. Sherwin, S. Karthik Mukkavilli, Konrad P. Kording, Carla Gomes, Andrew Y. Ng, Demis Hassabis, John C. + Platt, Felix Creutzig, Jennifer Chayes, and Yoshua Bengio. Tackling Climate Change with Machine Learning.arXiv + e-prints, art. arXiv.1906.05433, Jun 2019. + Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2. Inverted + residuals and linear bottlenecks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, + pages 4510–4520, 2018. + John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization + algorithms.arXiv preprint arXiv.1707.06347, 2017. + Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI.arXiv e-prints, art. arXiv.1907.10597, Jul + 2019. + Sam Shead. AI Researchers Left Disappointed As NIPS Sells Out In Under 12 Min- + utes. Forbes, Sep 2018. URL https.//www.forbes.com/sites/samshead/2018/09/05/ + ai-researchers-left-disappointed-as-nips-sells-out-in-under-12-minutes/#7dda67fc20e9. + Yoav Shoham, Erik Brynjolfsson, Jack Clark, John Etchemendy, Barbara Grosz, Terah Lyons, James Manyika, Saurabh + Mishra, and Juan Carlos Niebles. The ai index 2019 annual report.AI Index Steering Committee, Human-Centered + AI Initiative, Stanford University., 2019. + Szymon Sidor and John Schulman. Openai baselines. Dqn (blogpost). 2017. + Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.arXiv + preprint arXiv.1409.1556, 2014. 
+ Frank Soboczenski, Michael D Himes, Molly D O’Beirne, Simone Zorzan, Atilim Gunes Baydin, Adam D Cobb, + Yarin Gal, Daniel Angerhausen, Massimo Mascaro, Giada N Arney, et al. Bayesian deep learning for exoplanet + atmospheric retrieval.arXiv preprint arXiv.1811.03390, 2018. + Shuaiwen Leon Song, Kevin Barker, and Darren Kerbyson. Unified performance and power modeling of scientific + workloads. InProceedings of the 1st International Workshop on Energy Efficient Supercomputing, page 4. ACM, + 2013. + Diomidis Spinellis and Panos Louridas. The carbon footprint of conference papers.PloS one, 8(6).e66508, 2013. + Kenneth O Stanley and Joel Lehman.Why greatness cannot be planned. The myth of the objective. Springer, 2015. + Kristin Stechemesser and Edeltraud Guenther. Carbon accounting. a systematic literature review.Journal of Cleaner + Production, 36.17–38, 2012. + Christian Stoll, Lena Klaaßen, and Ulrich Gallersdörfer. The carbon footprint of bitcoin.Joule, 2019. + Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. + arXiv preprint arXiv.1906.02243, 2019. + Vladimir Sukhoy and Alexander Stoytchev. Eliminating the Variability of Cross-Validation Results with LIBLINEAR + due to Randomization and Parallelization. 2019. + Shyam Sundar, Ashish Kumar Mishra, and Ram Naresh. Modeling the impact of media awareness programs on + mitigation of carbon dioxide emitted from automobiles.Modeling Earth Systems and Environment, 4(1).349–357, + 2018. + Richard Sutton. The bitter lesson.Incomplete Ideas (blog), March, 13, 2019. + Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent + Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InComputer Vision and Pattern Recognition + (CVPR), 2015. + Samuel Tang and David Demeritt. Climate change and mandatory carbon reporting. Impacts on business process and + performance.Business Strategy and the Environment, 27(4).437–455, 2018. + Richard SJ Tol. The social cost of carbon.Annu. Rev. Resour. Econ., 3(1).419–443, 2011. + Bo Tranberg, Olivier Corradi, Bruno Lajoie, Thomas Gibon, Iain Staffell, and Gorm Bruun Andresen. Real-time carbon + accounting method for the european electricity markets.Energy Strategy Reviews, 26.100367, 2019. + Barak Turovsky. Ten years of Google Translate.Google Official Blog, 2016. + U.S. Environment Protection Agency. Social Cost of Carbon.https.//www.epa.gov/sites/production/files/2016- + 12/documents/social_cost_of_carbon_fact_sheet.pdf, 2013. + Chandramouli Venkatesan. Comparative Carbon Footprint Assessment of the Manufacturing and Use Phases of Two + Generations of AMD Accelerated Processing Units, 2015. + Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay Less Attention with Lightweight and + Dynamic Convolutions. InInternational Conference on Learning Representations, 2019. + Michel Zade, Jonas Myklebost, Peter Tzscheutschler, and Ulrich Wagner. Is bitcoin the only problem? a scenario model + for the power demand of blockchains.Frontiers in Energy Research, 7, 2019. + Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.arXiv preprint arXiv.1605.07146, 2016. + Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet. An extremely efficient convolutional neural + network for mobile devices. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, + pages 6848–6856, 2018. 
A Conference Travel

 Prior work has also examined conference travel for various fields as a major source of impact (Spinellis and Louridas, 2013; Astudillo and AzariJafari, 2018; Hackel and Sparkman, 2018). For example, Spinellis and Louridas (2013) found that the CO2eq emissions from travel per conference participant were about 801 kgCO2eq, Astudillo and AzariJafari (2018) estimated around 883 kgCO2eq emissions per participant, and Hackel and Sparkman (2018) estimate around 910 kg of CO2eq emissions per participant. Interestingly, these separate papers all align around the same carbon emissions numbers per conference participant. Using this and ML conference participant statistics, we can gain some (very) rough insight into the carbon emissions caused by conference travel (not including food purchases, accommodations, and travel within the conference city).

 Conference participation has grown particularly popular in ML research, attracting participants from industry and academia. In 2018 the Neural Information Processing Systems (NeurIPS) conference sold out registrations in 12 minutes (Shead, 2018). In 2019, according to the AI Index Report 2019 (Shoham et al., 2019), conferences had the following attendance: CVPR (9,227); IJCAI (3,015); AAAI (3,227); NeurIPS (13,500); IROS (3,509); ICML (6,481); ICLR (2,720); AAMAS (701); ICAPS (283); UAI (334). The larger conferences also showed continued growth: NeurIPS showed year-over-year growth of 41% from 2018 to 2019. Given only these conferences, their 2019 attendance (42,997 participants in total), and the lower 801 kgCO2eq average emissions estimate per participant (Spinellis and Louridas, 2013), this adds up to roughly 34,440,597 kgCO2eq emitted in 2019 from ML-related conference travel (not considering co-location and many other factors).

 B NeurIPS Sampling on Metric Reporting

 We randomly sampled 100 NeurIPS papers from the 2019 proceedings; of these papers we found that 1 measured energy in some way, 45 measured runtime in some way, 46 provided the hardware used, 17 provided some measure of computational complexity (e.g., compute-time, FPOs, parameters), and 0 provided carbon metrics. We sampled from the NeurIPS proceedings page: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019. We first automatically check for keywords (below) related to energy, compute, and carbon. We then examine the context of the word to classify it as relating to hardware details (e.g., Nvidia Titan X GPU), computational efficiency (e.g., FPOs, MAdds, GPU-hours), runtime (e.g., the experiment ran for 8 hours), energy (e.g., a plot of performance over Joules or Watts), or carbon (e.g., we estimate 10 kg CO2eq were emitted). We also manually validate papers for similar metrics that did not appear in the keyword search. If a paper did not contain experiments, we removed it and randomly redrew a new paper. In many cases, metrics are only provided for some subset of experiments (or for particular ablation experiments); we nonetheless count these as reporting the metric. Where a neural network diagram or architecture description was provided, we did not consider this to be reporting a compute metric.
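
 As a rough illustration of this automated screening step, the sketch below flags sentences that contain any of the keywords (using abbreviated versions of the term lists given after this sketch). In our sampling, the surrounding context of each hit was then classified manually, so this is a simplified stand-in rather than the exact procedure.

 import re

 # Keyword lists as defined below (Appendix B); shortened here for brevity.
 CATEGORIES = {
     "energy": ["watt", "kwh", "joule", "rapl", "energy", "power"],
     "carbon": ["co2", "carbon", "emissions"],
     "compute": ["flop", "fpo", "parameters", "gpu-hours", "multiply-add"],
     "hardware": ["nvidia", "intel", "gpu", "cpu", "tpu", "titan"],
     "time": ["hours", "days", "runtime", "run-time"],
 }

 def flag_sentences(paper_text):
     """Return sentences containing any keyword, grouped by category, for manual review."""
     flagged = {category: [] for category in CATEGORIES}
     for sentence in re.split(r"(?<=[.!?])\s+", paper_text):
         lowered = sentence.lower()
         for category, terms in CATEGORIES.items():
             if any(term in lowered for term in terms):
                 flagged[category].append(sentence.strip())
     return flagged

 # Example: a sentence mentioning hardware and runtime but no energy or carbon metrics.
 print(flag_sentences("We train for 12 hours on a single Nvidia Titan X GPU."))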
compute_terms = ["flop", "fpo", "pflop", "tflops", "tflop", "parameters", "params", "pflops", "flops", "fpos", "gpu-hours", "cpu-hours", "cpu-time", "gpu-time", "multiply-add", "madd"]
 hardware_terms = ["nvidia", "intel", "amd", "radeon", "gtx", "titan", "v100", "tpu", "ryzen", "cpu", "gpu"]
 time_terms = ["seconds", "second", "hour", "hours", "day", "days", "time", "experiment length", "run-time", "runtime"]
 energy_terms = ["watt", "kWh", "joule", "joules", "wh", "kwhs", "watts", "rapl", "energy", "power"]
 carbon_terms = ["co2", "carbon", "emissions"]

 C Carbon Discussion

 But cloud providers claim 100% carbon neutrality in my region, why do I need to shift my resources?

 While we estimate energy mixes based on regional grids, cloud providers sometimes aim for carbon neutrality through a mixture of mechanisms which may change the energy mix being provided to a data center in an otherwise carbon-intensive energy grid, or which otherwise offset unclean energy usage. Data centers draw energy from the local energy grids, and as a result the mix of energy they consume largely depends on the composition of the power running in those grids. If the local energy grid is powered by a mix of fuel and renewable energy, a data center will inevitably consume fuel energy as well.

 Because consumers do not know the origin of the physical electricity from the utility grid, it is difficult to assign ownership of renewable energy consumption. The Environmental Protection Agency (EPA) uses renewable energy certificates (RECs) to track the generation and consumption of renewable energy: one REC is issued when one megawatt-hour (MWh) of electricity is generated from a renewable source and delivered to the energy grid. 34 Consumers can then purchase RECs from a renewable energy provider and apply them to their electricity usage. This means consumers can claim they run on renewable energy by purchasing RECs from providers that do not actually power the energy grids from which they draw electricity. Although this means that consumers' realtime carbon footprints will still be decided by the composition of renewable and fuel energy in their local energy grids, more renewable energy can flow onto the grid through the purchase of RECs, and future development of renewable sources is supported. Google, to offset its carbon emissions, uses RECs and power purchase agreements (PPAs) with renewable energy providers to ensure that more renewable energy powers the same electricity grids that its data centers are in. 35 Google then sells the renewable energy as it becomes available back to the electricity grids and strips away the RECs. Over one year, Google applies equal amounts of RECs to its data centers' total energy consumption. This method helps green energy provider development by creating long-term demand. However, PPAs provide RECs for future renewables, not only current energy on the grid, which may remain unchanged. As Google states: "While the renewable facility output is not being used directly to power a Google data center, the PPA arrangement assures that additional renewable generation sufficient to power the data center came on line in the area."

 We can see that even if a cloud provider's data centers are carbon neutral, the actual CO2eq emissions can vary widely and depend on the region and even the time of day (solar energy cannot be generated at night).
We suggest that cloud providers release tools for understanding the carbon intensity of each data center region, regardless of offset purchasing. While purchases of PPAs and RECs are valuable for driving renewable energy adoption in otherwise dirty regions, for machine learning model training, where the resources can be moved, we believe shifting resources to low-intensity regions is more beneficial for long-term carbon impacts. Other cloud-based jobs, where latency requirements prevent shifting resources, will continue to drive PPA/REC purchasing, and consequently renewable energy demand.

 D ImageNet Experiments

 We load pre-trained models available through PyTorch Hub (see https://pytorch.org/hub) – namely AlexNet (Krizhevsky et al., 2012), DenseNet (Huang et al., 2017), GoogLeNet (Szegedy et al., 2015), HardNet (Chao et al., 2019), MobileNetV2 (Sandler et al., 2018), ShuffleNet (Zhang et al., 2018), SqueezeNet (Iandola et al., 2016), VGG (Simonyan and Zisserman, 2014), and Wide ResNets (Zagoruyko and Komodakis, 2016). We run 50,000 rounds of inference on a single image through the pre-trained image classification models and run an analysis similar to that of Canziani et al. (2016). We repeat experiments with 4 random seeds.

 34 https://www.epa.gov/greenpower/renewable-energy-certificates-recs
 35 We note that this process is likely similar for most cloud providers, but Google is the most open with their methodology, so we are able to gain more insight from the materials they publish. Information described here is mainly put together from Google (2016) and Google (2013).
 36 https://static.googleusercontent.com/media/www.google.com/en/us/green/pdfs/renewable-energy.pdf

 We count FLOPs and parameters using the thop package (for package version numbers see the automated logs in the online appendix linked above): https://github.com/Lyken17/pytorch-OpCounter

 Code for running the experiment is available at: https://github.com/Breakend/ClimateChangeFromMachineLearningResearch/blob/master/paper_specific/run_inference.py

 An online appendix showing all per-experiment details can be seen here: https://breakend.github.io/ClimateChangeFromMachineLearningResearch/measuring_and_mitigating_energy_and_carbon_footprints_in_machine_learning/

 The plot of FPOs versus runtime can be seen in Figure 8, and plots against the number of parameters can be seen in Figure 9. The number of parameters similarly has no strong correlation with energy consumption (R2 = 0.002, Pearson 0.048), nor with time (R2 = 0.14, Pearson 0.373). We note that our runtime results likely differ from Canziani et al. (2016) due to the architectural differences in the model sets we use.

 For parameter plots, see Figure 9; for extended time and energy figures, see Figure 8.

 <
>

 Figure 8. We seek to investigate the connection between FPOs, energy usage, and experiment time, similarly to Canziani et al. (2016). We run 50,000 rounds of inference on a single image through pre-trained image classification models available through PyTorch Hub (see https://pytorch.org/hub) – namely (Krizhevsky et al., 2012; Huang et al., 2017; Szegedy et al., 2015; Chao et al., 2019; Sandler et al., 2018; Zhang et al., 2018; Iandola et al., 2016; Simonyan and Zisserman, 2014; Zagoruyko and Komodakis, 2016). We record experiment time and the kWh of energy used to run the experiments and repeat experiments 4 times, averaging results. We find that FPOs are not strongly correlated with energy consumption (R2 = 0.083, Pearson 0.289) nor with time (R2 = 0.005, Pearson 0.074). The number of parameters (plotted in the Appendix) similarly has no strong correlation with energy consumption (R2 = 0.002, Pearson 0.048), nor with time (R2 = 0.14, Pearson 0.373). We note, however, that within an architecture correlations are much stronger. For example, only considering different versions of VGG, FPOs are strongly correlated with energy (R2 = .999, Pearson 1.0) and time (R2 = .998, Pearson .999). See the Appendix for experiment details, code, and data links. Our runtime results likely differ from Canziani et al. (2016) due to the architectural differences in the model sets we use.

 E Estimation Methods

 We use our PPO Pong experiment (see Appendix F for more details) as the experiment under comparison. For carbon emission estimates, we use three estimation methods: realtime emissions data for California (collected by our framework from caiso.org) multiplied by the power usage at that time, integrated over the length of the experiment; the total energy usage recorded by our method multiplied by the California average carbon intensity; and the total energy usage recorded by our method multiplied by the EPA US average carbon intensity (Strubell et al., 2019). For energy estimates, we use: (1) the experiment time multiplied by the number of GPUs, a utilization factor of 1/3 or 1, and the Thermal Design Power (TDP) – which can be thought of as the maximum Watt draw – of the GPU (Amodei and Hernandez, 2018); (2) the measured GPU-hours of our tool multiplied by the TDP; a rough calculation of PFLOPs-hr (following the methodology

 <
>

 Figure 9. The same experiments as in Figure 3, plotting parameters as the varying factor instead. See Figure 3 for correlation values.

 of Amodei and Hernandez (2018)) by the PFLOPs/TDP of the GPU; (3) our tool's accounting method, which tracks energy from GPU readings, accounts for CPU time/energy, and measures utilization in realtime.

 F Reinforcement Learning

 We investigate the energy efficiency of four baseline RL algorithms: PPO (Hill et al., 2018; Schulman et al., 2017), A2C (Hill et al., 2018; Mnih et al., 2016), A2C with VTraces (Espeholt et al., 2018; Dalton et al., 2019), and DQN (Hill et al., 2018; Mnih et al., 2016). We evaluate on PongNoFrameskip-v4 (left) and BreakoutNoFrameskip-v4 (right), two common evaluation environments included in OpenAI Gym (Bellemare et al., 2013; Brockman et al., 2016; Mnih et al., 2013).

 We train for only 5M timesteps, less than prior work, to encourage energy efficiency (Mnih et al., 2016, 2013). We use default settings from the code provided in stable-baselines (Hill et al., 2018) and cule (Dalton et al., 2019); we only modify evaluation code slightly. Modifications can be found here:

 •https://github.com/Breakend/rl-baselines-zoo-1 (for stable-baselines modifications)
 •https://github.com/Breakend/cule (for cule modifications)

 Since we compare both on-policy and off-policy methods, for fairness all evaluation is based on 25 separate rollouts completed every 250k timesteps. This is to ensure parity across algorithms. We execute these in parallel together as seen in the cule code: https://github.com/Breakend/cule/blob/master/examples/a2c/test.py.

 While the average return across all evaluation episodes (e.g., averaging together the evaluation at 250k timesteps and every evaluation step until 5M timesteps) can be seen in the main text, the asymptotic return (for the final round of evaluation episodes) can be seen in Figure 10. Plots comparing experiment runtime to asymptotic and average returns (respectively) can be seen in Figure 11 and Figure 12.

 Our online leaderboard can be seen at: https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html

 We note that while DQN underperforms compared to PPO here, better hyperparameters may be found such that DQN is the more energy-efficient algorithm. Moreover, we only use the 5M-sample regime, whereas prior work has used 10M or more samples for training, so the DQN results seen here would correspond to earlier points in training in other papers.

 <
> + + Figure 10. Pong (left) and Breakout (right) asymptotic return. + + <
> + + Figure 11. Pong (left) and Breakout (right) as a function of experiment length and asymptotic return. + + <
>

 Figure 12. Pong (left) and Breakout (right) as a function of experiment length and average return.
<> <> <>


<> <> <>
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

Minsoo Rhu Natalia Gimelshein Jason Clemons Arslan Zulfiqar Stephen W. Keckler NVIDIA Santa Clara, CA 95050
{mrhu, ngimelshein, jclemons, azulfiqar, skeckler}@nvidia.com

Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016.

Abstract

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.

I. INTRODUCTION

Deterministic progress in deep learning notwithstanding, deep neural networks (DNNs) have recently been successfully deployed in various application domains such as computer vision [1], speech recognition [2], and natural language processing [3] thanks to their superior performance compared to traditional state-of-the-art approaches. Such proliferation of deep learning techniques has led several software frameworks to be developed in recent years to analyze and facilitate the design of neural networks [4, 5, 6, 7]. The list of available frameworks continues to expand, with developers constantly adding more features and improving computational efficiency to foster research in the area of deep learning. Due to the tremendous compute horsepower offered by graphics processing units (GPUs), these frameworks provide strong backend support for GPU software libraries such as cuDNN [8]. In fact, almost every group today involved in training neural networks is deploying GPUs for accelerated deep learning [9].

While these popular machine learning (ML) frameworks facilitate the study of DNNs, a major limitation of their use is that the DRAM capacity limits of the GPU(s) in the system eventually limit the size of the DNN that can be trained (Section II-C). To work around the memory capacity bottleneck [10, 11], ML practitioners must either use less desirable DNN architectures (e.g., smaller number of

<
>

Fig. 1: GPU memory usage when using the baseline, network-wide allocation policy (left axis). The right axis shows the maximum fraction of this baseline allocation actually utilized when traversing through the network layer-wise. The numbers next to the names of each network refer to the batch size throughout this paper. Studied DNNs are detailed in Section IV-C.

layers, smaller batch sizes, less performant but more memory-efficient convolutional algorithms) or parallelize the DNN across multiple GPUs [12]. Figure 1 highlights how the memory consumption trends of the ImageNet [13] winning DNNs have evolved over time. AlexNet [1], for instance, only contained 5 convolutional layers with 2 fully-connected layers and required a "mere" 1.1 GB of memory allocation for training, which is well below the 12 GB memory capacity of the state-of-the-art NVIDIA Titan X. The more recent VGG-16 [14], on the other hand, contains 16 convolutional layers and 3 fully-connected layers, incurring a total of 28 GB of memory usage for batch size 256. Because a single GPU can only accommodate a batch size of 64 for VGG-16, training with batch 256 requires parallelization across multiple GPUs, or the network must be sequentially executed multiple times with smaller batches. With the most recent ImageNet winning network adopting more than a hundred convolutional layers [15], the trend in deep learning is to move towards larger and deeper network designs [14, 16, 17, 18]. As a result, alleviating the rigid physical memory limitations of GPUs is becoming increasingly important.

In this paper, we propose virtualized Deep Neural Network (vDNN), a runtime memory management solution that virtualizes the memory usage of deep neural networks across both GPU and CPU memories. Our vDNN allows ML practitioners to deploy larger and deeper networks beyond the physical capacity of available GPUs, enabling them to focus more on their algorithms while the system architecture and runtime system transparently manage the allocation, placement, movement, and release of their data. The motivation behind vDNN is based on the following three key observations: 1) DNNs trained via stochastic gradient descent (SGD) are designed and structured with multiple layers [19]; 2) the training of these neural networks involves a series of layer-wise computations, the order of which is statically fixed and repeated for millions to billions of iterations throughout the entire training process; and 3) even though the GPU can, at any given time, only process a single layer's computation (due to the layer-wise computational characteristics of SGD-based DNN training), popular ML frameworks adopt a network-wide memory allocation policy because DNN training requires the intermediate feature maps of all the layers in the network to be backed up in GPU memory for gradient updates (Section II-C). In other words, existing memory management schemes overprovision the memory allocations to accommodate the usage of the entire network's layers, even though the GPU is only using a subset of this allocation for the layer-wise requirements. We observe that such memory underutilization becomes more severe for deeper networks, leading to 53% to 79% of allocated memory not being used at all at any given time (Figure 1). The goal of vDNN is to conservatively allocate GPU memory for the immediate usage of a given layer's computation so that the maximum and average memory usage is drastically reduced, allowing researchers to train larger networks.
To achieve this goal, vDNN exploits the data dependencies of allocated data structures, particularly the intermediate feature maps that account for the majority of memory usage (Section II-C), and either releases or moves these intermediate data between GPU and CPU memory. Specifically, vDNN either 1) aggressively releases these feature maps from GPU memory if no further reuse exists, or 2) offloads them to (and later prefetches them from) CPU memory if further reuse does exist but is not immediately required. By exploiting the inter-layer memory access and reuse patterns of DNNs, our vDNN memory manager intelligently overlaps the normal DNN computations with the offload/prefetch/release operations, effectively virtualizing the memory usage of DNNs with little to no performance loss. The operations of vDNN are completely transparent to programmers and enable them to train larger and deeper neural networks that consume memory well beyond the limits of the physical memory of GPUs today. The key contributions of our work are:

• This work is the first to present a detailed, quantitative analysis of GPU-based DNN training, as opposed to recent literature targeting energy-efficient accelerators for DNN inference [20, 21, 22, 23, 24, 25, 26, 27, 28, 29].

• To the best of our knowledge, our work is the first that provides an in-depth characterization study of the memory access characteristics of DNNs and their effect on the GPU memory system from an architectural perspective.

• This work identifies the key limitations of current ML frameworks' memory management policies as they require the network-wide memory usage of the target DNN to monolithically fit within the physical capacity of the GPU. We demonstrate this by showing that existing frameworks fail in training 6 out of the 10 studied DNNs when their memory allocation size (14 GB to 67 GB) exceeds the GPU memory budget (12 GB in NVIDIA's Titan X).

• We propose, implement, and evaluate a runtime memory manager called vDNN that virtualizes the memory usage of neural networks across CPU and GPU memories. Our vDNN solution reduces the average GPU memory usage of these 6 memory-hungry networks by 73% to 98%, allowing them to be trained on a single Titan X card. Compared to a hypothetical, oracular GPU containing enough memory to hold the entire DNN, vDNN incurs 1% to 18% performance overhead.

II. BACKGROUND AND MOTIVATION

This section provides an overview of modern DNNs, the memory management policies of current ML frameworks, and their key limitations that motivate this work.

A. DNN Architecture

Convolutional neural networks are one of the most popular ML algorithms for high-accuracy computer vision tasks. While other types of networks are also gaining traction (e.g., recurrent neural networks for natural language processing), all of these DNNs are trained using a backward propagation algorithm [19] via stochastic gradient descent (SGD). For clarity of exposition, and owing to their state-of-the-art performance in the ImageNet competition, this paper mainly focuses on the feedforward-style convolutional neural networks commonly seen in AlexNet [1], OverFeat [30], GoogLeNet [17], and VGG [14]. However, the key intuitions of our work are equally applicable to any neural network that exhibits layer-wise computational characteristics and is trained via SGD, detailed later in this section. A minimal sketch of this layer-wise training pattern, and of the memory footprint it implies under a network-wide allocation policy, is given below.
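
The following self-contained sketch (our illustration, not code from vDNN or from any ML framework) shows why a network-wide allocation policy keeps every layer's feature maps resident: the forward pass produces one feature map per layer, and each of them is needed again, in reverse order, during the backward pass.

 # A minimal, framework-free sketch of layer-wise SGD training (illustration only).
 # Key point: under the baseline policy every layer's input feature map X stays allocated
 # from its forward pass until its backward pass, even though only one layer computes at a time.

 class Layer:
     def __init__(self, weight):
         self.weight = weight

     def forward(self, x):
         return [self.weight * v for v in x]               # stand-in for CONV/FC/etc.

     def backward(self, x, grad_out):
         grad_in = [self.weight * g for g in grad_out]     # dX = dY * dY/dX
         grad_w = sum(g * v for g, v in zip(grad_out, x))  # dW needs the saved input X
         return grad_in, grad_w

 layers = [Layer(0.5), Layer(2.0), Layer(1.5)]
 x = [1.0, -2.0, 3.0]

 # Forward: each output becomes the next layer's input; all inputs are cached for backward.
 cached_inputs = []
 for layer in layers:
     cached_inputs.append(x)
     x = layer.forward(x)

 # Backward: traverse layers in reverse, consuming the cached feature maps.
 grad = [1.0] * len(x)                                     # stand-in for dLoss/dY of the last layer
 for layer, saved_x in zip(reversed(layers), reversed(cached_inputs)):
     grad, grad_w = layer.backward(saved_x, grad)
     layer.weight -= 0.01 * grad_w                         # SGD weight update

At any instant only the layer currently being processed needs its saved input on the GPU; this is precisely the observation vDNN exploits by releasing or offloading the remaining entries and prefetching them back just before their backward step.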
DNNs are designed using a combination of multiple types of layers, which are broadly categorized as convolutional layers (CONV), activation layers (ACTV), pooling layers (POOL), and fully-connected layers (FC). A neural network is structured as a sequence of multiple instances of these layers. DNNs for computer vision tasks in particular are broadly structured into the following two modules: 1) the feature extraction layers that detect distinguishable features across input images, and 2) the classification layers that analyze the extracted features and classify the image into a given image category. Feature extraction layers are generally designed using CONV/ACTV/POOL layers and are positioned as the initial part of the DNN. The classification layers are built up using the FC layers and are found at the end of the DNN computation sequence. The general trend in deep learning is to design the network with a large number of feature extraction layers so that a deep hierarchy of features is trained for robust image classification [14, 15, 17].

Fig. 2: Memory allocations required for linear networks using the baseline memory manager (bold arrows). For inference, the sum of all green (W) and red (X) arrows is allocated. For training, two additional data structures for dX and dY are required: both are sized to the maximum of all blue (dY) arrows and are reused while traversing back the layers during backward propagation. An optional temporary buffer, called workspace in cuDNN [8] (yellow arrow, WS), is needed in certain convolutional algorithms. The workspace buffer is sized with the maximum workspace requirement among all layers and is reused during backward propagation.

B. DNN Training vs. Inference

A neural network needs to be trained before it can be deployed for an inference or classification task. Training entails learning and updating the weights of the layers of a neural network by performing the operations of the forward and backward propagation algorithms [19]. The direction of traversal, as well as the mathematical operations that must be performed, differ for forward and backward propagation.

Forward Propagation. Forward propagation is performed from the first (input) layer to the last (output) layer, whereas backward propagation is performed in the opposite direction (last to first layer), from right to left in Figure 2. Intuitively, forward propagation traverses the network layer-wise and performs the aforementioned feature extraction and classification tasks on a given input, leading to an image classification. During forward propagation, each layer applies a mathematical operation to its input feature maps (X) and stores the results as output feature maps (Y). For linear feedforward DNNs, the resulting Y of layer(n-1) is directly used as the input X by layer(n) (Figure 2). The computation flow of forward propagation is therefore a serialized process, as layer(n) can initiate its operation only when the preceding layer(n-1) has finished its computation and forwarded its output Y to layer(n)'s input X. Non-linear network topologies can contain one-to-many (fork) and many-to-one (join) inter-layer dependencies, but forward propagation still involves a series of layer-wise computations as detailed in Figure 3. Note that the GPU can only process a single layer's computation at any given time due to such inter-layer data dependencies.
As a result, the minimum per-layer memory allocations required are determined by the layer's input-output relationships and its mathematical function1.

1 Popular activation functions (sigmoid/tanh/ReLU [1]) can be refactored into an in-place algorithm using element-wise computation. Both Caffe and Torch leverage this in-place memory optimization and only allocate memory space for Y and dY for forward (Y) and backward (both Y and dY) propagation [31]. This paper adopts this in-place optimization for both baseline and vDNN for a conservative evaluation.

<
>
Fig. 3: (a) The computation graph and its inter-layer dependencies of a GoogLeNet-style, non-linear feedforward network during forward propagation. Refcnt refers to the number of consumer layers that depend on the current, producer layer's Y. The order in which the GPU processes each layer's forward computation is shown in (b), from layer(1) to layer(5), highlighting the layer-wise computation of DNN training. The producer-consumer relationship is reversed during backward propagation.

For instance, a CONV layer using the most memory-efficient convolutional algorithm (e.g., implicit GEMM in cuDNN [8]2) requires three data structures: the input/output feature maps (X and Y) and the weights of the layer (W) for forward propagation. Employing a fast-Fourier-transform (FFT) based convolution algorithm, however, requires an additional, temporary workspace (WS) buffer to manage the transformed maps.

Backward Propagation. For DNNs that are not fully trained, the inferred image category might be incorrect. As a result, a loss function is used to derive the magnitude of the inference error at the end of forward propagation. Specifically, the gradient of the loss function is derived with respect to the last layer(N)'s output:

<> (1)

The value in Equation 1 is forwarded to the last layer(N) as its input gradient maps (dY), and the output gradient maps (dX) are derived based on the chain rule [19]:

<> (2)

Because the output <> is the product of the input <> with <>, deriving the value of dX for layer(N) generally requires memory for both its input/output gradient maps (dY and dX) and also the input/output feature maps (X and Y) for this layer. For linear networks, the calculated dX of layer(N) is directly passed on to the preceding layer(N-1) to be used as dY for layer(N-1)'s dX derivation (Figure 2).

2 cuDNN (version 4.0) provides six different convolutional algorithms. Implicit GEMM requires the least memory allocation as no additional workspace is needed. FFT-based convolutional algorithms, on the other hand, incur larger memory allocations because of the additional data structures required to store the feature maps transformed into the frequency domain. More details are available in [8, 32].
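To make the chain-rule memory requirement above concrete, the following minimal sketch (plain numpy, not the cuDNN implementation) shows a single fully-connected layer whose backward pass needs the feature maps X that were stashed during forward propagation. The layer shapes and names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: a fully-connected layer whose backward pass needs the X saved
# during forward propagation, mirroring Equation 2. Shapes are illustrative.
class FCLayer:
    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.01
        self.X = None            # stashed input feature maps (the reuse vDNN exploits)

    def forward(self, X):
        self.X = X               # must stay resident (or be offloaded) until backward
        return X @ self.W        # Y = X * W

    def backward(self, dY):
        dW = self.X.T @ dY       # weight gradient needs the saved X
        dX = dY @ self.W.T       # dX, passed on to layer(n-1) as its dY
        return dX, dW

rng = np.random.default_rng(0)
layer = FCLayer(256, 128, rng)
Y = layer.forward(rng.standard_normal((32, 256)))
dX, dW = layer.backward(np.ones_like(Y))
```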
<
>

Fig. 4: Breakdown of GPU memory usage based on its functionality (left axis). The right axis shows the fraction of allocated memory consumed by feature maps.

This chain rule is similarly used to derive the gradients of the weights to update the network model.

Similar to forward propagation, backward propagation is also performed layer-wise on the respective incoming gradient maps, dYs. Once backward propagation reaches the first layer, the weights are adjusted using the weight gradients so that the prediction error is reduced for the next classification task. Hence, training a network involves both forward and backward propagation, which are repeated for millions to billions of iterations. Because of the stochastic nature of SGD-based backward propagation, the network input is generally batched with hundreds of images (e.g., 128 and 256 images for the best performing AlexNet and VGG-16), which increases the memory allocation size but helps the network model better converge to an optimal solution.

C. Motivation: Scalable and Memory-Efficient DNN Design

To aid the design and deployment of neural networks, a variety of ML frameworks have been developed in recent years, including Caffe, Torch, Neon, TensorFlow, and Theano [9]. The rich set of features offered by these frameworks, coupled with their ability to accelerate DNN training and inference using GPUs, greatly simplifies the process of implementing neural networks. Despite their flexibility, popular ML frameworks suffer from severe limitations in the way they allocate and manage memory.

To illustrate the shortcomings of ML frameworks in managing memory, consider the example shown in Figure 2. When training a DNN using existing ML frameworks, the memory required across all of the layers of the network must fit within the physical GPU memory capacity. The key reason for this GPU-side, network-wide memory allocation strategy is to reap performance benefits. More specifically, page-migration based virtualization solutions that expose both CPU and GPU memory for page allocations (regardless of whether the virtualization feature is provided by future CUDA runtime extensions or programming models such as OpenMP (4.0) [33]) must transfer pages via PCIe, which involves several latency-intensive processes such as CPU interrupts for system calls, page-table updates, TLB updates/shootdowns, and the actual page transfer. Prior work [34] reported that the latency to page in a single 4 KB page to the GPU is 20 to 50 µs, meaning the PCIe bandwidth utilization using page-migration is 80 to 200 MB/sec, as opposed to the DMA-initiated cudaMemcpy that achieves an average 12.8 GB/sec out of the 16 GB/sec maximum PCIe bandwidth. As the amount of data to be paged in/out via PCIe can be tens of GBs for very deep networks (Figure 15), ML frameworks will suffer from huge performance penalties when relying on page-migration for training DNNs.

Fig. 5: Per layer memory usage of VGG-16 (256). For brevity, we only show the memory usage during forward propagation and for layers that contain weights (CONV and FC). The left axis corresponds to the sum of workspace and per layer input/output feature maps. The right axis corresponds to the memory consumption for storing weights. The memory usage during backward propagation follows similar trends to this figure.
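The 80 to 200 MB/sec figure quoted above follows directly from dividing the 4 KB page size by the reported 20 to 50 µs page-in latency; a quick sanity check in plain Python:

```python
# Quick sanity check of the page-migration bandwidth numbers quoted above:
# a 4 KB page moved in 20-50 us of end-to-end latency.
page_bytes = 4 * 1024
for latency_us in (20, 50):
    bw_mb_per_s = page_bytes / (latency_us * 1e-6) / 1e6
    print(f"{latency_us} us/page -> {bw_mb_per_s:.0f} MB/s")
# ~205 MB/s at 20 us and ~82 MB/s at 50 us, versus ~12.8 GB/s for bulk cudaMemcpy.
```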
Note that because of the layer-wise gradient update rule of the backward propagation algorithm (a property of the chain rule, Section II-B), each layer's feature maps (X) are later reused during its own backward propagation pass. This means that all Xs must still be available in GPU memory until backward computation is completed. Figure 4 shows the amount of memory usage based on its functionality and the growing significance of feature maps as networks become deeper. Because deeper networks need to keep track of a larger number of Xs, the fraction of memory allocated for feature maps grows monotonically as the number of layers increases. Training the network itself is still done layer-wise, however, regardless of the depth of the neural network. The baseline network-wide memory allocation policy is therefore both extremely wasteful and not scalable because it does not take into account the layer-wise nature of DNN training.

Figure 5 shows the per layer memory usage of VGG-16 during forward propagation, which provides the following key observations. First, the intermediate feature maps and workspace (left axis) incur an order of magnitude higher memory usage compared to the weights (right axis) of each layer. Second, most of these intermediate data structures are concentrated on the feature extraction layers and are less significant in the later classifier layers. Third, the weights, while smaller in size compared to these intermediate data, are mostly concentrated on the classifier layers due to their full connectivity. Lastly, the per layer memory usage is much smaller than the 28 GB of memory required by the baseline policy (Figure 1), showing significant opportunities for memory savings with a fine-grained, layer-wise memory management policy.
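A rough back-of-the-envelope estimate illustrates why the early feature maps dominate. The layer shapes below are the standard VGG-16 configuration, and the batch size of 256 with 4-byte values is an assumption for illustration, not a measurement from the paper.

```python
# Rough illustration of why feature maps dominate: per-layer output feature map
# size for the first few VGG-16 CONV layers, assuming batch 256 and 4-byte
# (FP32) values. Shapes are the standard VGG-16 configuration, not measured data.
batch, bytes_per_val = 256, 4
layers = [  # (name, out_channels, out_height, out_width)
    ("conv1_1", 64, 224, 224),
    ("conv1_2", 64, 224, 224),
    ("conv2_1", 128, 112, 112),
    ("conv3_1", 256, 56, 56),
]
for name, c, h, w in layers:
    gib = batch * c * h * w * bytes_per_val / 2**30
    print(f"{name}: ~{gib:.1f} GiB of output feature maps (Y)")
# Each early layer alone produces GiB-scale Y that the baseline policy keeps
# resident for the entire iteration, while its weights are only a few MiB.
```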
<
>

Fig. 6: VGG-16's per layer computation latency for forward and backward propagation (left axis). The right axis shows the reuse distance of each layer's input feature maps, X. We define the reuse distance of a layer(n)'s X as the latency between the completion of layer(n)'s forward propagation and the start of the same layer(n)'s backward propagation.

III. VIRTUALIZED DNN

The design objective of our virtualized DNN (vDNN) memory manager is to virtualize the memory usage of DNNs, using both GPU and CPU memory, while minimizing its impact on performance. vDNN is completely transparent to the programmer as the allocation, placement, movement, and release of data is seamlessly orchestrated by the system architecture and the runtime system. Such abstraction enables ML practitioners to focus more on their ML algorithm and not have to worry about the low-level details of GPU memory management. vDNN primarily optimizes the memory usage of the feature extraction layers as the majority of memory usage is concentrated on these layers, accounting for 81% of memory usage on AlexNet and 96% on VGG-16 (256). More specifically, we target the feature maps of these feature extraction layers as these intermediate data structures account for the majority of GPU memory usage (Figure 4 and Figure 5). The intuitions of vDNN can also be applied to weights and to the classification layers, but with less of a memory saving benefit.

A. Design Principle

Previous sections highlighted the fact that the memory requirement per individual layer is substantially smaller than what is actually provisioned with the baseline, network-wide memory allocation policy. vDNN adopts a sliding-window based, layer-wise memory management strategy in which the runtime memory manager conservatively allocates memory from its memory pool for the immediate usage of the layer that is currently being processed by the GPU. Intermediate data structures that are not needed by the current layer are targeted for memory release to reduce memory usage.

<
>
Fig. 7: Execution flow of a linear network during forward propagation. The figure assumes that layer(N) is currently being processed by the GPU. During this layer's forward computation, the data associated with the arrows marked with black Xs (all preceding layers' input feature maps) are not used and can safely be released from the memory pool.
<
>

Fig. 8: Execution flow of a linear network during backward propagation. The figure assumes that layer(2) is currently being processed by the GPU. Data associated with the arrows marked with black Xs can safely be released because they will not be reused during the training of this input image batch.

Forward Propagation. As discussed in Section II-C, deep networks have to keep track of a large number of intermediate feature maps (Xs) that are extracted during forward propagation. Once a given layer(n)'s forward computation is complete, however, layer(n)'s X is not reused until the GPU comes back to the same layer(n)'s corresponding backward computation. Because the reuse distance of layer(n)'s X is on the order of milliseconds to seconds (e.g., more than 60 ms and 1200 ms for the first layer of AlexNet and VGG-16 (64), respectively), deep networks end up allocating a significant number of Xs that effectively camp inside the GPU memory without immediate usage (Figure 6). As a result, tackling these Xs for memory optimization is crucial for efficient utilization of GPU memory as these intermediate data account for a significant fraction of memory allocations (Figure 4). vDNN therefore conditionally offloads these intermediate Xs to CPU memory via the system interconnect (e.g., PCIe, NVLINK [35]) if they are targeted for memory release. Section III-C details the vDNN memory transfer policy that decides which layers are chosen for offloading their X. Once the offload operation is complete, vDNN releases the offloaded X from the memory pool to reduce GPU memory usage.
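The forward-pass bookkeeping described above can be summarized with a simplified sketch. The helper names (gpu_pool, cpu_pool, should_offload) are illustrative assumptions and the dictionaries only stand in for device and host allocations; this is not vDNN's actual runtime interface.

```python
# Simplified sketch of the forward-pass bookkeeping described above; names such
# as gpu_pool, cpu_pool, and should_offload are illustrative, not vDNN's API.
def forward_pass(layers, x, should_offload):
    gpu_pool, cpu_pool = {}, {}          # stand-ins for device / host memory
    for n, layer in enumerate(layers):
        gpu_pool[n] = x                  # layer(n)'s input feature map X
        y = layer(x)                     # serialized, layer-wise computation
        if should_offload(n):
            cpu_pool[n] = gpu_pool.pop(n)   # offload X to host, then release it
        # else: X stays resident in GPU memory until layer(n)'s backward pass
        x = y                            # Y of layer(n) becomes X of layer(n+1)
    return x, gpu_pool, cpu_pool

# Example: offload the (large) feature maps of the first few layers only.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
out, gpu_pool, cpu_pool = forward_pass(layers, 1.0, lambda n: n < 2)
print(out, sorted(gpu_pool), sorted(cpu_pool))   # X of layer(2) kept on the GPU
```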
Care must be taken, however, when evaluating the feasibility of offloading a layer's input X. This is because, for non-linear network topologies, multiple layers can be the consumers of a previously computed layer's output feature maps (Y). For instance, layer(2) and layer(3) in Figure 3 are both using the output Y of layer(1) as their input X. Offloading and consequently releasing the input X of layer(2) before reaching layer(3)'s forward computation is problematic, as these two layers share the same data structure for the input X. vDNN therefore keeps track of the inter-layer dependencies in the form of a dataflow graph (e.g., Refcnt in Figure 3) and allows the offload/release operation to be initiated only when the currently processing layer is the last consumer of its input feature maps. Figure 7 is an example execution flow of a linear DNN during forward propagation, highlighting when it becomes safe to release a layer's X.

<
>

Fig. 9: Performance effect of offload and prefetch. FWD(n) and BWD(n) are the forward and backward computations for layer(n), respectively. OFF(n) is the offloading of layer(n)'s X and PRE(n) is the corresponding prefetch operation for layer(n).

Backward Propagation. Similar to forward propagation, vDNN aggressively releases data structures that are not needed for training the remaining layers' backward computation. During layer(n)'s backward propagation, layer(n+1)'s Y and dY are no longer required because the GPU has already completed the gradient updates for this layer (Figure 8). Again, by leveraging the layer-wise DNN backward propagation, vDNN immediately frees up a layer's Y and dY once this layer's backward computation is complete. X and dX are not released as the preceding layer's backward propagation will be needing these values for gradient derivation. Note that if a layer has offloaded its X to host memory, vDNN should guarantee that the offloaded data is copied back to GPU memory before the gradient update is initiated. Naively copying back the data on-demand will serialize the backward computation behind the memory copying operation of X. vDNN therefore launches a prefetch operation for layer(n)'s offloaded feature maps, which is overlapped with layer(m)'s backward computation, with n < m.
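A schematic sketch of this prefetch overlap follows, using a background thread as a stand-in for a separate copy stream. The layer-selection policy and helper names are illustrative assumptions rather than vDNN's actual scheduling algorithm.

```python
# Schematic sketch of overlapping prefetch with backward computation, using a
# background thread as a stand-in for a separate CUDA copy stream. The layer
# choice (prefetch layer n while computing layer m > n) and helper names are
# illustrative assumptions, not vDNN's scheduling policy.
from concurrent.futures import ThreadPoolExecutor

def backward_pass(num_layers, cpu_pool, gpu_pool, backward_compute, pick_prefetch):
    with ThreadPoolExecutor(max_workers=1) as copy_stream:
        pending = None
        for m in reversed(range(num_layers)):          # backward: last -> first
            n = pick_prefetch(m, cpu_pool)             # some earlier layer n < m, or None
            if n is not None:
                # issue the host->device copy asynchronously ...
                pending = (n, copy_stream.submit(cpu_pool.pop, n))
            backward_compute(m, gpu_pool)              # ... and overlap it with BWD(m)
            if pending is not None:
                n, fut = pending
                gpu_pool[n] = fut.result()             # X(n) is back before BWD(n) starts
                pending = None

cpu_pool = {0: "X0", 1: "X1"}
gpu_pool = {2: "X2"}
backward_pass(3, cpu_pool, gpu_pool,
              backward_compute=lambda m, pool: None,
              pick_prefetch=lambda m, pool: m - 1 if (m - 1) in pool else None)
print(sorted(gpu_pool))   # all offloaded Xs restored: [0, 1, 2]
```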
of current FPGA architectures causing it, and present suggested architectural solutions that can reduce this gap.

3 COMPUTING ARCHITECTURES

We implement three different highly optimized state-of-the-art CAs for accelerating CNN inference tasks in RTL using parameterizable SystemVerilog HDL. We refer to the three CAs as ASU-like [26, 27], Intel-DLA-like [2], and Chain-NN-like [50]. We implement all the hardware computational blocks required to execute all the layers described in Section 2.1 for three different CNN models: AlexNet, VGG-16, and ResNet-50. We also implement the control logic required to run the CAs starting from reading the input features and weights from on-chip buffers, transferring them to the computational blocks, and writing the final results in the output feature buffers. The on-chip buffer sizes and the parallelization factors for each of the nested CONV loops are fixed in both the FPGA and ASIC implementations for each of these CAs according to the optimal design point originally reported in References [2, 27, 50]. For consistency and to enable fair comparisons, we also use a fixed-point data representation for all three CAs with 16-bit features and 8-bit weights as in Reference [27], which causes less than 2% accuracy degradation. We consider the external memory interface and direct memory access engines to be out of the scope of this work, as they do not affect the conclusions we seek to draw about the performance and area gaps or the bottlenecks of current FPGA architectures in accelerating CNNs. However, our performance models put off-chip data transfer into consideration according to any external memory interface that we specify. In our experiments, we report two sets of results: one assuming infinite bandwidth and the other assuming one bank of DDR4 memory at 1200 MHz with a total bandwidth of 17 GB/s similar to that used in Reference [2].

We carefully chose those three CAs out of numerous architectures proposed in the literature to be diverse; the wide variations between them help ensure our analysis of FPGA vs. ASIC efficiency has broad applicability. The main differences between the three CAs, summarized in Table 1, are:

• All three CAs have different parallelization schemes. In other words, the array of MAC units in each CA has a different number of dimensions, leading to different execution orders, tiling and unrolling factors for the CONV loops in Algorithm 1. Output tiles of size (POM × POX × POY), (POM × POX × 1), and (POM × 1 × 1) are produced by the ASU-like, Intel-DLA-like, and Chain-NN-like PE arrays, respectively.

• The Intel-DLA-like CA uses a mathematical optimization for CONV layers with kernels of size 3×3 known as the Winograd Transform [22], which reduces the number of MAC operations needed to compute convolutions. However, the ASU-like and Chain-NN-like CAs perform conventional sliding-window convolution operations. This enables us to explore different convolution schemes with different degrees of control logic complexity and observe their effect on the area and performance gaps.

• The three CAs implement their weight buffers differently.
The Chain-NN-like CA stores the kernel weights in small distributed buffers such that every MAC unit has its local scratchpad for weights implemented in the FPGA's soft logic (MLABs). In contrast, both the ASU-like and Intel-DLA-like CAs have larger weight buffers implemented using on-chip memory blocks (BRAMs) to feed a group of MAC units. In FC layers, the Intel-DLA-like CA also interchanges the roles of weight and feature buffers.

• The CAs differ in whether and how they use double-buffering to hide memory transfer time. The ASU-like CA uses double-buffering for weights to hide the computation time of FC layers by filling one buffer from off-chip memory while using the weights in the other buffer for computations. The Intel-DLA-like CA uses double-buffering by interchanging input and output buffers after each layer to eliminate any external memory transfers if all the output feature maps of a layer can fit in on-chip buffers. The Chain-NN-like CA does not use any double-buffering techniques.

None of the three CAs is available as an open-source implementation, and hence we implemented them from scratch to carry out the study presented in this article under controlled conditions (e.g., RTL implementation, same FPGA platform, same weight and activation precisions, etc.) to enable fair comparisons and focus only on the architectural aspects of these CAs. In Sections 3.1, 3.2, and 3.3, we describe the details of the three CAs we implemented and any extensions added to them for the sake of our study.

Fig. 2. ASU-like CA tiling schemes and hardware architecture.

3.1 ASU-like CA

This CA was proposed in Reference [27] by Ma et al. from Arizona State University (ASU) and then expanded in Reference [26] to support the ELTWISE and BNORM layers used in recent CNN models. The core of this CA, shown in Figure 2(c), is a three-dimensional MAC unit array of size POM × POX × POY that can compute both CONV and FC layers.

Feature maps and weights are tiled to minimize external memory transfers by either buffering all weights or all input feature maps in on-chip memory at any layer of the CNN model. In the shallower layers of the network, all the weights but only N_OY + K − 1 rows of the input feature maps are buffered on-chip such that 0

<
>

Fig. 10. Area gap between FPGA and ASIC implementations for different blocks of: (a) BSC, (b) LRN, and (c) ELT. The percentages represent the contribution of each component to the total area of the FPGA implementation.

Interestingly, the computational performance gap is not consistent among different CAs; however, different variations of the same CA have similar gap results. The Intel-DLA-like CA has the smallest ASIC-to-FPGA computational performance ratio (≈2.9) compared to the ASU-like and Chain-NN-like CAs (≈4.6 and 6.2, respectively). We believe that the reason is that the Intel-DLA-like CA has a modular daisy-chain architecture, which is more routing-friendly and benefits the FPGA implementation more than the ASIC one due to the relatively slow speed of FPGA routing.

5.3 Area Gap

On average, the FPGA implementations have 8.7× larger area than their ASIC counterparts and the gap is, in contrast to the performance gap, fairly similar across different variations of the three CAs. To understand the reasons for this gap, Figures 10(a), 10(b), and 10(c) illustrate the area ratio of different components in the FPGA implementations to those in the ASIC implementations for the BSC, LRN, and ELT variations, respectively. The percentages written above the bars represent the area breakdown of each FPGA implementation into different components and hence indicate the contribution of each component to the overall area gap. We notice that the convolution engine, which has the largest contribution to total area (up to 60% in some cases) and thus the strongest impact on the total area gap, has an FPGA-to-ASIC area ratio ranging from 13 to 31 for different variations of the three CAs. The Intel-DLA-like CA uses the Winograd transform to significantly reduce MAC operations in convolution, which costs almost the same area as the convolution engine in the FPGA implementation. However, the Winograd transform and inverse transform blocks in this CA have FPGA-to-ASIC area ratios of 28 and 26, respectively, which are almost twice the area gap of the convolution engine, since they contain a large number of multi-input adders implemented in the FPGA's soft fabric compared to the convolution engine, which is mostly implemented in hard DSP blocks. The smallest area gap is in the feature and weight buffers, since the RAMs in the FPGA and the ASIC implementations are both custom SRAM blocks. However, the buffer area ratios are still significant (≈3–5) because of the area overhead of the programmable routing in BRAM tiles as well as the underutilization of some of the M20K blocks on the FPGA, whereas in the ASIC implementations, we use memories with the exact required sizes. The NORM block has area ratios of 32 and 28 and consumes 22% and 14% of the total area in the ASU-like and Intel-DLA-like CAs, respectively, since it is a heavily arithmetic block and is mostly implemented in the soft fabric. However, it only consumes 3% of the total area in the Chain-NN-like CA, which produces outputs in one dimension only and therefore does not normalize output features at different locations in parallel. The POOL, ELTWISE and BNORM blocks have large area ratios; however, they have small overall areas and hence limited impact on the total gap.

An interesting observation is that the area gap in the convolution engine of the Intel-DLA-like CA is significantly less than that of the other two CAs: an area ratio of 13 compared to 20 and 29 in the ASU-like and Chain-NN-like CAs, respectively. This is because the Intel-DLA-like CA uses the hard adders in the DSP blocks to implement its dot-product unit, while the other two CAs pay for the area of the complete DSP block on the FPGA but only make use of the multipliers inside it and thus have a higher area gap compared to their ASIC counterparts. This observation motivates the investigation of new DSP block designs that could bring more of the convolution engine functionality inside the hard DSP block. For instance, the ASU-like CA needs two separate accumulators for the two independent 18-bit multipliers, which is not supported in current DSP blocks. Hence, the DSP block accumulators are wasted and soft logic is used to implement the accumulators. The convolution engine of the Chain-NN-like CA has the highest area gap as it implements input multiplexing, accumulation, and output de-multiplexing in the soft fabric.

5.4 Architectural Insights

Based on the results of Sections 5.1 and 5.2, we can draw several architectural insights:

• According to the resource utilization results in Figure 8(b), the limiting factor is the DSP block count available on-chip, with close to 100% resource utilization in most cases.
One direct approach to gain higher performance is adding more DSP blocks to current FPGAs, especially given that a DSP-focused device spends only 5% of its core area on DSP blocks [21]. This requires a careful architectural study to determine the optimal ratio and area distribution between DSPs, BRAMs, and ALMs for DL-tuned FPGAs that are still flexible enough and suitable for other applications as well. These architectural explorations require a suite of DL benchmark circuits such as the one we developed in this work, and which we plan to expand and open-source in future work.

• As shown in Figure 10, the area gap of the convolution engine of the Intel-DLA-like CA is significantly less than that of the other two CAs, since it makes better use of the DSP block's available functionalities such as the internal adders and hard cascade chains. By looking at the ASIC area breakdown of the convolution engine, we can see that about 72% of the logic in the convolution engine of the Intel-DLA-like CA was implemented inside hard DSP blocks on the FPGA compared to only 32% and 35% in the ASU-like and Chain-NN-like CAs, respectively, and the rest is implemented in the soft fabric. We believe that small changes to the DSP block architecture could capture more of the convolution engine hardware inside the hard circuitry of the DSP block. For example, adding an operation mode that configures the two internal adders as independent accumulators for two independent 18-bit MACs (such as in the ASU-like CA) or having a small circular shift register accumulator for interleaving dot-product operations (as in the Intel-DLA-like CA) would save soft logic. Neither of the DSP block enhancements would add much logic to the block, nor would they require more block routing ports (inputs and outputs), and therefore the DSP block area increase would be minimal. To increase the DSP block count on-chip, as mentioned in our first suggestion, we not only wish to avoid significant block area increase, but also to remove DSP block functionalities that are unnecessary for DL applications and would not cause severe performance degradation when implemented in the soft fabric. For example, removing the built-in constant coefficient banks in the Arria 10 DSP blocks should be evaluated as they are not usable by any of our CAs.

• In this study, we used 16- and 8-bit fixed-point precision for features and weights, respectively, in all CAs to ensure fair comparisons. However, the most suitable precision for CNN inference is debatable and varies widely in the literature from single-precision floating-point down to ternary and binary [28]. Currently, DSP blocks from Intel and Xilinx support a limited number of precisions. For instance, a DSP block in Intel Arria 10, and similarly Stratix 10, FPGAs supports two 18-bit, one 27-bit, or one single-precision floating-point multiplication, whereas a DSP slice in Xilinx Virtex UltraScale FPGAs supports one 27×18 multiplication. Designers can sometimes fit more low-precision multiplies that match certain patterns using clever tricks, such as performing two 8-bit multiplies that share one operand using a single Xilinx DSP slice [8]. Even with these operand packing tricks, using lower precision leaves a portion of the DSP block logic idle. We can avoid this by designing DSP blocks that natively support low-precision multiplications and reuse routing ports and multiplier sub-arrays to keep the area overhead minimal.
• When implementing the three CAs, we noticed that the required on-chip buffers are either deep central buffers for input and output features or smaller and more distributed buffers for the weights. When we tried to extend the double-buffering technique used in the Intel-DLA-like CA to more layers of models larger than AlexNet by implementing deeper stream buffers, it resulted in a net performance degradation as the operating frequency dropped significantly due to depth stitching of M20K BRAMs to implement those deep buffers. However, when implementing the small weight buffers of the Chain-NN-like CA in MLABs, the high utilization of the soft fabric also resulted in lower operating frequency. This observation indicates that having only M20K BRAMs and MLABs to implement on-chip memories might not be a good fit for DL acceleration on FPGAs. This also requires a more detailed architectural study to determine the best size and ratio of on-chip BRAMs and their effect on the overall performance using DL-representative benchmarks, and we believe our parameterized CAs can form the start of this benchmark set. In addition, the memory-richness of FPGAs can be enhanced by employing emerging technologies such as Magnetic Tunneling Junction memories, which can provide bigger yet more dense BRAMs for memory-intensive applications as shown in Reference [54].

6 CONCLUSION

In this article, we implemented three highly optimized state-of-the-art CAs for accelerating CNN inference: ASU-like, Intel-DLA-like, and Chain-NN-like CAs. We implemented three variations of each CA (BSC, LRN, and ELT) for three different CNN models (VGG-16, AlexNet, and ResNet-50, respectively) on an Intel Arria 10 FPGA device and compared them to 28nm ASIC implementations of the same CAs to quantify the programmability cost that comes with using FPGAs on the performance and area of DL accelerators. Across different variations of the three CAs, we observed a consistent area gap with an average FPGA-to-ASIC area ratio of 8.7×, to which the convolution engine contributes the most with area ratios ranging from 13 to 31 for different CAs. The performance gap, unlike the area gap, varies significantly across different CAs. The computational performance of the ASIC implementations is 2.8× to 6.3× faster than that of the FPGA implementations when assuming infinite external memory bandwidth. We find that the Intel-DLA-like CA has the smallest performance gap compared to its ASIC counterpart, indicating that focusing on modular and routing-friendly designs is of great importance for building efficient FPGA-based DL accelerators. Finally, we suggest several FPGA DSP and RAM architecture changes for future work that could reduce the area and performance gaps and enable more efficient DL acceleration on FPGAs.

ACKNOWLEDGMENTS

The authors thank Martin Langhammer, Debbie Marr, and Eriko Nurvitadhi for helpful discussions, as well as Huawei, Intel, and NSERC for funding support.

REFERENCES
[1] M. Abadi et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the OSDI. 265–283.
[2] U. Aydonat et al. 2017. An OpenCL (TM) deep learning accelerator on Arria 10. In Proceedings of the FPGA. 55–64.
[3] Y. Chen et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the MICRO. 609–622.
[4] Y. Chen et al. 2017.
Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of the JSSC, Vol. 52. 127–138.
[5] S. Chetlur et al. 2014. cuDNN: Efficient primitives for deep learning. arXiv:1410.0759.
[6] E. Chung and J. Fowers. 2017. Accelerating persistent neural networks at data center scale. In Proceedings of the HOT CHIPS, Vol. 29.
[7] F. Colombo et al. 2017. Deep artificial composer: A creative neural network model for automated melody generation. In Proceedings of the EvoMUSART. 81–96.
[8] Y. Fu et al. 2016. Deep learning with INT8 optimization on Xilinx devices. In white paper of Xilinx.
[9] L. Gatys et al. 2015. A neural algorithm of artistic style. arXiv:1508.06576.
[10] A. Graves et al. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the ICASSP. 6645–6649.
[11] Y. Guan et al. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In Proceedings of the FCCM. 152–159.
[12] Matthew R. Guthaus et al. 2016. OpenRAM: An open-source memory compiler. In Proceedings of the ICCAD.
[13] P. Gysel et al. 2016. Hardware-oriented approximation of convolutional neural networks. arXiv:1604.03168.
[14] K. He et al. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the ICCV. 1026–1034.
[15] K. He et al. 2016. Deep residual learning for image recognition. In Proceedings of the CVPR. 770–778.
[16] S. Herculano-Houzel. 2009. The human brain in numbers: A linearly scaled-up primate brain. In Frontiers in Human Neuroscience, Vol. 3.
[17] S. Ioffe and C. Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the ICML. 448–456.
[18] Y. Jia et al. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.
[19] N. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ISCA. 1–12.
[20] A. Krizhevsky et al. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the NIPS. 1097–1105.
[21] M. Langhammer and B. Pasca. 2015. Floating-point DSP block architecture for FPGAs. In Proceedings of the FPGA. 117–125.
[22] A. Lavin and S. Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the CVPR. 4013–4021.
[23] Z. Liu et al. 2016. Automatic code generation of convolutional neural networks in FPGA implementation. In Proceedings of the FPT. 61–68.
[24] L. Lu et al. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Proceedings of the FCCM. 101–108.
[25] Y. Ma et al. 2016. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA. In Proceedings of the FPL. 1–8.
[26] Y. Ma et al. 2017. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks. In Proceedings of the FPL. 1–8.
[27] Y. Ma et al. 2017. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the FPGA. 45–54.
[28] A. Mishra et al. 2017. WRPN: Wide reduced-precision networks. arXiv:1709.01134.
[29] E. Nurvitadhi et al. 2016. Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the FPT. 77–84.
[30] K. Ovtcharov et al. 2015. Accelerating deep convolutional neural networks using specialized hardware.
In Microsoft Research Whitepaper, Vol. 2.
[31] A. Prost-Boucle et al. 2017. Scalable high-performance architecture for convolutional ternary neural networks on FPGA. In Proceedings of the FPL. 1–7.
[32] A. Putnam et al. 2014. A reconfigurable fabric for accelerating large-scale data center services. In Proceedings of the ISCA. 13–24.
[33] J. Qiu et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the FPGA. 26–35.
[34] R. Rashid et al. 2014. Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS. In Proceedings of the FPT. 20–27.
[35] D. E. Rumelhart et al. 1985. Learning Internal Representations by Error Propagation. Technical Report.
[36] O. Russakovsky et al. 2015. ImageNet large scale visual recognition challenge. In Proceedings of the IJCV, Vol. 115. 211–252.
[37] H. Sharma et al. 2016. From high-level deep neural models to FPGAs. In Proceedings of the MICRO. 1–12.
[38] F. Shen et al. 2016. Weighted residuals for very deep networks. In Proceedings of the ICSAI. 936–941.
[39] Y. Shen et al. 2016. Overcoming resource underutilization in spatial CNN accelerators. In Proceedings of the FPL. 1–4.
[40] Y. Shen et al. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In Proceedings of the ISCA. 535–547.
[41] D. Silver et al. 2017. Mastering the game of go without human knowledge. In Nature, Vol. 550. 354–359.
[42] N. Suda et al. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the FPGA. 16–25.
[43] A. Suleiman et al. 2017. Towards closing the energy gap between HOG and CNN features for embedded vision. arXiv:1703.05853.
[44] I. Sutskever et al. 2014. Sequence to sequence learning with neural networks. In Proceedings of the NIPS. 3104–3112.
[45] C. Szegedy et al. 2015. Going deeper with convolutions. In Proceedings of the CVPR.
[46] Kosuke Tatsumura et al. 2016. High density, low energy, magnetic tunnel junction based block RAMs for memory-rich FPGAs. In Proceedings of the FPT. 4–11.
[47] Y. Umuroglu et al. 2017. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the FPGA. 65–74.
[48] S. Venieris and C. Bouganis. 2016. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In Proceedings of the FCCM. 40–47.
[49] G. Venkatesh et al. 2017. Accelerating deep convolutional networks using low-precision and sparsity. In Proceedings of the ICASSP. 2861–2865.
[50] S. Wang et al. 2017. Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks. In Proceedings of the DATE. 1032–1037.
[51] Y. Wang et al. 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the DAC. 1–6.
[52] X. Wei et al. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the DAC. 1–6.
[53] H. Wong et al. 2011. Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture. In Proceedings of the FPGA. 5–14.
[54] S. Yazdanshenas et al. 2017. Don't forget the memory: Automatic block RAM modelling, optimization, and architecture exploration. In Proceedings of the FPGA. 115–124.
[55] C. Zhang et al. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the FPGA. 161–170.
[56] C.
Zhang et al. 2016. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the ISLPED. 326–331.
[57] C. Zhang and V. Prasanna. 2017. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In Proceedings of the FPGA. 35–44.
<> <> <>
\ No newline at end of file
diff --git a/Corpus/Scalable Gradients for Stochastic Differential Equations.txt b/Corpus/Scalable Gradients for Stochastic Differential Equations.txt
deleted file mode 100644
index da0f938..0000000
Binary files a/Corpus/Scalable Gradients for Stochastic Differential Equations.txt and /dev/null differ
diff --git a/Corpus/Scaling Laws for Neural Language Models.txt b/Corpus/Scaling Laws for Neural Language Models.txt
deleted file mode 100644
index d9ed768..0000000
Binary files a/Corpus/Scaling Laws for Neural Language Models.txt and /dev/null differ
diff --git a/Corpus/Structured Pruning of Convolutional Neural Networks via L1 Regularization - CHEN YANG.txt b/Corpus/Structured Pruning of Convolutional Neural Networks via L1 Regularization - CHEN YANG.txt
deleted file mode 100644
index f89abe5..0000000
Binary files a/Corpus/Structured Pruning of Convolutional Neural Networks via L1 Regularization - CHEN YANG.txt and /dev/null differ
diff --git a/Corpus/THE LOTTERY TICKET HYPOTHESIS.txt b/Corpus/THE LOTTERY TICKET HYPOTHESIS.txt
deleted file mode 100644
index c15f6d0..0000000
Binary files a/Corpus/THE LOTTERY TICKET HYPOTHESIS.txt and /dev/null differ
diff --git a/Corpus/TOWARDS THE SYSTEMATIC REPORTING OF THE ENERGY AND CARBON FOOTPRINTS OF MACHINE LEARNING.txt b/Corpus/TOWARDS THE SYSTEMATIC REPORTING OF THE ENERGY AND CARBON FOOTPRINTS OF MACHINE LEARNING.txt
deleted file mode 100644
index 4f12479..0000000
Binary files a/Corpus/TOWARDS THE SYSTEMATIC REPORTING OF THE ENERGY AND CARBON FOOTPRINTS OF MACHINE LEARNING.txt and /dev/null differ
diff --git a/Corpus/The 4 Research Techniques to Train Deep Neural Network Models More Efficiently.txt b/Corpus/The 4 Research Techniques to Train Deep Neural Network Models More Efficiently.txt
deleted file mode 100644
index 05d66fd..0000000
--- a/Corpus/The 4 Research Techniques to Train Deep Neural Network Models More Efficiently.txt
+++ /dev/null
@@ -1,535 +0,0 @@

The 4 Research Techniques to Train Deep Neural Network Models More Efficiently

James Le · Oct 29, 2019 · 9 min read

Deep learning and unsupervised feature learning have shown great promise in many practical applications. State-of-the-art performance has been reported in several domains, ranging from speech recognition and image recognition to text processing and beyond.

It's also been observed that increasing the scale of deep learning—with respect to numbers of training examples, model parameters, or both—can drastically improve accuracy.
These results have led to a surge of interest in scaling up the training and inference algorithms used for these models and in improving optimization techniques for both.

The use of GPUs is a significant advance in recent years that makes the training of modestly-sized deep networks practical. A known limitation of the GPU approach is that the training speed-up is small when the model doesn't fit in a GPU's memory (typically less than 6 gigabytes).

To use a GPU effectively, researchers often reduce the size of the dataset or parameters so that CPU-to-GPU transfers are not a significant bottleneck. While data and parameter reduction work well for small problems (e.g. acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images).

In the previous post, we talked about 5 different algorithms for efficient deep learning inference. In this article, we'll discuss the upper right part of the quadrant on the left. What are the best research techniques to train deep neural networks more efficiently?

1 — Parallelization Training

Let's start with parallelization. As the figure below shows, the number of transistors keeps increasing over the years. But single-threaded performance and frequency are plateauing in recent years. Interestingly, the number of cores is increasing.

So what we really need to know is how to parallelize the problem to take advantage of parallel processing. There are a lot of opportunities to do that in deep neural networks.

For example, we can do data parallelism: feeding 2 images into the same model and running them at the same time. This does not affect latency for any single input. It doesn't make it shorter, but it makes the batch size larger. It also requires coordinated weight updates during training.

For example, in Jeff Dean's paper "Large Scale Distributed Deep Networks," there's a parameter server (as a master) and a couple of model workers (as slaves) running their own pieces of training data and updating the gradient to the master.

Another idea is model parallelism — splitting up the model and distributing each part to different processors or different threads. For example, imagine we want to run convolution in the image below by doing a 6-dimension "for" loop. What we can do is cut the input image by 2x2 blocks, so that each thread/processor handles 1/4 of the image. Also, we can parallelize the convolutional layers by the output or input feature map regions, and the fully-connected layers by the output activation.
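The data-parallel idea above can be sketched in a few lines. The following is a minimal synchronous illustration in numpy (workers compute gradients on their own data shard, a "parameter server" averages them before updating the shared weights); it is an assumption-laden toy on a linear least-squares model, not the asynchronous system from the cited paper.

```python
import numpy as np

# Minimal sketch of synchronous data parallelism: each "worker" computes a
# gradient on its own slice of the batch, and the parameter server averages the
# gradients before updating the shared weights. Toy linear model for illustration.
rng = np.random.default_rng(0)
X, true_w = rng.standard_normal((128, 10)), rng.standard_normal(10)
y = X @ true_w

w = np.zeros(10)                       # parameters held by the "parameter server"
num_workers, lr = 4, 0.05
for step in range(500):
    grads = []
    for shard_x, shard_y in zip(np.array_split(X, num_workers),
                                np.array_split(y, num_workers)):
        err = shard_x @ w - shard_y    # each worker uses only its data shard
        grads.append(2 * shard_x.T @ err / len(shard_x))
    w -= lr * np.mean(grads, axis=0)   # coordinated (averaged) weight update

print(np.allclose(w, true_w, atol=1e-3))   # recovers the true weights
```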
2 — Mixed Precision Training

Larger models usually require more compute and memory resources to train. These requirements can be lowered by using reduced precision representation and arithmetic.

Performance (speed) of any program, including neural network training and inference, is limited by one of three factors: arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced precision math. For example, half-precision math throughput in recent GPUs is 2× to 8× higher than for single-precision. In addition to speed improvements, reduced precision formats also reduce the amount of memory required for training.

Modern deep learning training systems use a single-precision (FP32) format. In their paper "Mixed Precision Training," researchers from NVIDIA and Baidu addressed training with reduced precision while maintaining model accuracy.

Specifically, they trained various neural networks using the IEEE half-precision format (FP16). Since the FP16 format has a narrower dynamic range than FP32, they introduced three techniques to prevent model accuracy loss: maintaining a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros, and FP16 arithmetic with accumulation in FP32.

Using these techniques, they demonstrated that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training. Experimental results include convolutional and recurrent network architectures, trained for classification, regression, and generative tasks.

Applications include image classification, image generation, object detection, language modeling, machine translation, and speech recognition. The proposed methodology requires no changes to models or training hyperparameters.
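A minimal sketch of two of the techniques listed above (an FP32 master copy of the weights and a constant loss scale, with FP16 compute) can be simulated with numpy dtypes. The tiny linear model, the scale of 1024, and the step counts are illustrative assumptions, not the cited paper's exact recipe.

```python
import numpy as np

# Minimal sketch of mixed-precision training with numpy dtypes: FP16 compute,
# a constant loss-scale factor, and an FP32 master copy of the weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16)).astype(np.float16)
y = rng.standard_normal(64).astype(np.float16)

master_w = np.zeros(16, dtype=np.float32)     # FP32 master copy of the weights
loss_scale, lr = 1024.0, 0.01

for step in range(100):
    w16 = master_w.astype(np.float16)         # FP16 copy used for compute
    err = x @ w16 - y                          # FP16 forward pass
    # Scale the gradient so small values stay representable in FP16 ...
    grad16 = (x.T @ (2 * err / len(x))) * np.float16(loss_scale)
    # ... then unscale and update the master weights in full FP32 precision.
    master_w -= lr * (grad16.astype(np.float32) / loss_scale)
```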
3 — Model Distillation

Model distillation refers to the idea of model compression by teaching a smaller network exactly what to do, step-by-step, using a bigger, already-trained network. The 'soft labels' refer to the output feature maps by the bigger network after every convolution layer. The smaller network is then trained to learn the exact behavior of the bigger network by trying to replicate its outputs at every level (not just the final loss).

The method was first proposed by Bucila et al., 2006 and generalized by Hinton et al., 2015. In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is — the output of a softmax function on the teacher model's logits.

So how do teacher-student networks exactly work? (A minimal sketch of the distillation loss follows the steps below.)

The highly-complex teacher network is first trained separately using the complete dataset. This step requires high computational performance and thus can only be done offline (on high-performing GPUs).

While designing a student network, correspondence needs to be established between intermediate outputs of the student network and the teacher network. This correspondence can involve directly passing the output of a layer in the teacher network to the student network, or performing some data augmentation before passing it to the student network.

Next, the data are forward-passed through the teacher network to get all intermediate outputs, and then data augmentation (if any) is applied to the same.

Finally, the outputs from the teacher network are back-propagated through the student network so that the student network can learn to replicate the behavior of the teacher network.
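As a concrete illustration of the softened-target loss described above (Hinton et al., 2015), here is a minimal numpy sketch. The logits and the temperature T = 4.0 are illustrative values, and real training would combine this term with the usual hard-label loss.

```python
import numpy as np

# Minimal sketch of the distillation loss: the student is trained to match the
# teacher's softened class probabilities at temperature T.
def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    p_teacher = softmax(teacher_logits, T)          # soft targets
    log_p_student = np.log(softmax(student_logits, T))
    # Cross-entropy against the teacher's soft distribution, scaled by T^2 so the
    # gradient magnitude stays comparable to the hard-label loss.
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1)) * T**2

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((8, 10)) * 5   # confident teacher
student_logits = rng.standard_normal((8, 10))
print(distillation_loss(student_logits, teacher_logits))
```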
4 — Dense-Sparse-Dense Training

The research paper "Dense-Sparse-Dense Training for Deep Neural Networks" was published back in 2017 by researchers from Stanford, NVIDIA, Baidu, and Facebook. Applying Dense-Sparse-Dense (DSD) takes 3 sequential steps:

Dense: Normal neural net training…business as usual. It's notable that even though DSD acts as a regularizer, the usual regularization methods such as dropout and weight regularization can be applied as well. The authors don't mention batch normalization, but it would work as well.

Sparse: We regularize the network by removing connections with small weights. From each layer in the network, a percentage of the layer's weights that are closest to 0 in absolute value is selected to be pruned. This means that they are set to 0 at each training iteration. It's worth noting that the pruned weights are selected only once, not at each SGD iteration. Eventually, the network recovers the pruned weights' knowledge and condenses it in the remaining ones. We train this sparse net until convergence.

Dense: First, we re-enable the pruned weights from the previous step. The net is again trained normally until convergence. This step increases the capacity of the model. It can use the recovered capacity to store new knowledge. The authors note that the learning rate should be 1/10th of the original. Since the model is already performing well, the lower learning rate helps preserve the knowledge gained in the previous step.

Removing pruning in the dense step allows the training to escape saddle points to eventually reach a better minimum. This lower minimum corresponds to improved training and validation metrics.

Saddle points are areas in the multidimensional space of the model that might not be a good solution but are hard to escape from. The authors hypothesize that the lower minimum is achieved because the sparsity in the network moves the optimization problem to a lower-dimensional space. This space is more robust to noise in the training data.

The authors tested DSD on image classification (CNN), caption generation (RNN), and speech recognition (LSTM). The proposed method improved accuracy across all three tasks. It's quite remarkable that DSD works across domains.

DSD improved all CNN models tested — ResNet50, VGG, and GoogLeNet. The improvement in absolute top-1 accuracy was respectively 1.12%, 4.31%, and 1.12%. This corresponds to a relative improvement of 4.66%, 13.7%, and 3.6%. These results are remarkable for such finely-tuned models!

DSD was applied to NeuralTalk, an amazing model that generates a description from an image. To verify that the Dense-Sparse-Dense method works on an LSTM, the CNN part of NeuralTalk is frozen. Only the LSTM layers are trained. Very high (80%, chosen using the validation set) pruning was applied at the Sparse step. Still, this gives the NeuralTalk BLEU score an average relative improvement of 6.7%. It's fascinating that such a minor adjustment produces this much improvement.

Applying DSD to speech recognition (Deep Speech 1) achieves an average relative improvement of Word Error Rate of 3.95%. On a similar but more advanced Deep Speech 2 model, Dense-Sparse-Dense is applied iteratively two times. On the first iteration 50% of the weights are pruned, and then 25% of the weights are pruned. After these two DSD iterations, the average relative improvement is 6.5%.
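To make the Sparse step above concrete, here is a minimal numpy sketch: pick a magnitude threshold once, pin the pruned weights to zero during subsequent updates, and simply drop the mask for the final Dense step. The 50% sparsity and the fake gradients are illustrative, not the exact DSD training recipe.

```python
import numpy as np

# Minimal sketch of the Sparse step: mask the fraction of weights closest to
# zero, keep them at zero during further training, then re-enable them.
rng = np.random.default_rng(0)
w = rng.standard_normal(1000)

sparsity = 0.5
threshold = np.quantile(np.abs(w), sparsity)   # weights below this get pruned
mask = (np.abs(w) >= threshold).astype(w.dtype)

lr = 0.01
for step in range(10):                         # "train this sparse net"
    fake_grad = rng.standard_normal(w.shape)
    w = (w - lr * fake_grad) * mask            # pruned weights stay exactly zero

print(np.mean(w == 0))                          # ~0.5, as requested
# Final Dense step: stop applying the mask and continue training, typically
# with a learning rate around 1/10th of the original.
```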
Conclusion

I hope that I've managed to explain these research techniques for efficient training of deep neural networks in a transparent way. Work on this post allowed me to grasp how novel and clever these techniques are. A solid understanding of these approaches will allow you to incorporate them into your model training procedure when needed.

diff --git a/Corpus/The State of Sparsity in Deep Neural Networks - Trevor Gale.txt b/Corpus/The State of Sparsity in Deep Neural Networks - Trevor Gale.txt
deleted file mode 100644
index ba90caa..0000000
--- a/Corpus/The State of Sparsity in Deep Neural Networks - Trevor Gale.txt
+++ /dev/null
@@ -1,678 +0,0 @@

The State of Sparsity in Deep Neural Networks

Trevor Gale *1†  Erich Elsen *2  Sara Hooker 1†

arXiv:1902.09574v1 [cs.LG] 25 Feb 2019

Abstract

We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. Additionally, we repeat the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.
Ad- associated with the deployment of deep neural networks, - ditionally, we repeat the experiments performed and to enable the deployment of state-of-the-art models in - byFrankle & Carbin(2018) andLiu et al.(2018) severely resource constrained environments (Theis et al., - at scale and show that unstructured sparse archi- 2018;Kalchbrenner et al.,2018;Valin & Skoglund,2018). - tectures learned through pruning cannot be trained Over the past few years, numerous techniques for induc-from scratch to the same test set performance as ing sparsity have been proposed and the set of models anda model trained with joint sparsification and op- datasets used as benchmarks has grown too large to rea-timization. Together, these results highlight the sonably expect new approaches to explore them all. Inneed for large-scale benchmarks in the field of addition to the lack of standardization in modeling tasks, themodel compression. We open-source our code, distribution of benchmarks tends to slant heavily towardstop performing model checkpoints, and results of convolutional architectures and computer vision tasks, andall hyperparameter configurations to establish rig- the tasks used to evaluate new techniques are frequentlyorous baselines for future work on compression not representative of the scale and complexity of real-worldand sparsification. tasks where model compression is most useful. These char- - acteristics make it difficult to come away from the sparsity - literature with a clear understanding of the relative merits - 1. Introduction of different approaches. - Deep neural networks achieve state-of-the-art performance In addition to practical concerns around comparing tech- - in a variety of domains including image classification (He niques, multiple independent studies have recently proposed - et al.,2016), machine translation (Vaswani et al.,2017), that the value of sparsification in neural networks has been - and text-to-speech (van den Oord et al.,2016;Kalchbren- misunderstood (Frankle & Carbin,2018;Liu et al.,2018). - ner et al.,2018). While model quality has been shown to While both papers suggest that sparsification can be viewed - scale with model and dataset size (Hestness et al.,2017), as a form of neural architecture search, they disagree on - the resources required to train and deploy large neural net- what is necessary to achieve this. Specifically,Liu et al. - works can be prohibitive. State-of-the-art models for tasks 2 The term sparsity is also commonly used to refer to the pro- - * Equal contribution y This work was completed as part of the portion of a neural networks weights that are zero valued. Higher - Google AI Residency 1 Google Brain 2 DeepMind. Correspondence sparsity corresponds to fewer weights, and smaller computational - to: Trevor Gale. and storage requirements. We use the term in this way throughout - this paper. The State of Sparsity in Deep Neural Networks - - (2018) re-train learned sparse topologies with a random Some of the earliest techniques for sparsifying neural net- - weight initialization, whereasFrankle & Carbin(2018) posit works make use of second-order approximation of the loss - that the exact random weight initialization used when the surface to avoid damaging model quality (LeCun et al., - sparse architecture was learned is needed to match the test 1989;Hassibi & Stork,1992). More recent work has - set performance of the model sparsified during optimization. 
In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and l0 regularization to models of this scale. While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification³.

³ https://bit.ly/2ExE8Yj

2. Sparsity in Neural Networks

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks.

Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Ström, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017).

Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b;a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.

Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).

Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).

A key feature of a sparsity inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3. Evaluating Sparsification Techniques at Scale
As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and l0 regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing the lower bound of the compression-accuracy trade-off any method should be expected to achieve.

Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A.

3.1. Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model pruning library⁴. This technique allows for masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).

⁴ https://bit.ly/2T8hBGn

Table 1. Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

Hyperparameter           Value
dataset                  translatewmtendepacked
training iterations      500000
batch size               2048 tokens
learning rate schedule   standard transformerbase
optimizer                Adam
sparsity range           50% - 98%
beam search              beam size 4; length penalty 0.6
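To make the gradual sparsification schedule with sorting-based thresholding from Section 3.1 concrete, here is a minimal sketch following the cubic schedule of Zhu & Gupta (2017). The function names, the pruning frequency, and the begin/end steps are illustrative assumptions, not the TensorFlow model pruning library's actual API.

import numpy as np

def target_sparsity(step, final_sparsity, begin_step, end_step, initial_sparsity=0.0):
    # Cubic ramp from Zhu & Gupta (2017): sparsity rises quickly at first,
    # then levels off as it approaches final_sparsity at end_step.
    if step < begin_step:
        return 0.0
    progress = min(1.0, (step - begin_step) / float(end_step - begin_step))
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    # Sorting-based thresholding: zero the smallest-magnitude fraction of weights.
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def masked_forward_weights(dense_weights, step, mask, pruning_frequency=1000,
                           final_sparsity=0.9, begin_step=100000, end_step=300000):
    # The dense weights keep receiving gradient updates; only the forward pass
    # uses dense_weights * mask, so weights masked at one step can reactivate
    # the next time the mask is recomputed.
    if step % pruning_frequency == 0:
        mask = magnitude_mask(dense_weights,
                              target_sparsity(step, final_sparsity, begin_step, end_step))
    return dense_weights * mask, mask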
3.2. Variational Dropout

Variational dropout was originally proposed as a re-interpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).

Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model.

3.3. l0 Regularization

l0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.

3.4. Random Pruning Baseline

For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned at each step at random rather than based on magnitude, and does not allow pruned weights to reactivate. This technique is intended to represent a lower bound of the accuracy-sparsity trade-off curve.

3.5. Experimental Framework

For magnitude pruning, we used the TensorFlow model pruning library. We implemented variational dropout and l0 regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our l0 regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and l0 regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model pruning library to randomly select weights as opposed to sorting them based on magnitude.

For each model, we kept the number of training steps constant across all techniques and performed extensive hyperparameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and l0 regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments.
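The reparameterization described in Section 3.3 can be sketched in a few lines. The snippet below samples hard-concrete gates and computes the expected l0 penalty from the gate distribution's CDF evaluated at zero, following Louizos et al. (2017b); the distribution parameters (beta = 2/3, gamma = -0.1, zeta = 1.1) are commonly used defaults, and all names are our own illustrative choices rather than the authors' released code.

import math
import torch

BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1  # hard-concrete shape and stretch parameters

def sample_gates(log_alpha):
    # Stochastic gates z in [0, 1] with non-zero probability of being exactly 0 or 1.
    u = torch.rand_like(log_alpha)
    s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + log_alpha) / BETA)
    return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

def expected_l0(log_alpha):
    # P(gate != 0) summed over weights: 1 - CDF(0) of the stretched concrete distribution.
    return torch.sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()

def deterministic_gates(log_alpha):
    # Test-time estimator: replace the noise with its mean and clamp to [0, 1].
    return (torch.sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

# During training, a layer's effective weight is weight * sample_gates(log_alpha),
# and the loss adds a coefficient times expected_l0(log_alpha).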
The State of Sparsity in Deep Neural Networks - - - - - - - - - - - - - - Figure 2.Average sparsity in Transformer layers.Distributions - calculated on the top performing model at 90% sparsity for each - technique.l0 regularization and variational dropout are able to - learn non-uniform distributions of sparsity, while magnitude prun- - ing induces user-specified sparsity distributions (in this case, uni- - form). - form the random pruning technique, randomly removing - weights produces surprisingly reasonable results, which is - perhaps indicative of the models ability to recover from - Figure 1.Sparsity-BLEU trade-off curves for the Transformer. damage during optimization. - Top: Pareto frontiers for each of the four sparsification techniques - applied to the Transformer. Bottom: All experimental results with What is particularly notable about the performance of mag- - each technique. Despite the diversity of approaches, the relative nitude pruning is that our experiments uniformly remove the - performance of all three techniques is remarkably consistent. Mag- same fraction of weights for each layer. This is in stark con- - nitude pruning notably outperforms more complex techniques for trast to variational dropout andl0 regularization, where the - high levels of sparsity. distribution of sparsity across the layers is learned through - the training process. Previous work has shown that a non- - 4. Sparse Neural Machine Translation uniform sparsity among different layers is key to achieving - high compression rates (He et al.,2018), and variational - We adapted the Transformer (Vaswani et al.,2017) model dropout andl0 regularization should theoretically be able to - for neural machine translation to use these four sparsifica- leverage this feature to learn better distributions of weights - tion techniques, and trained the model on the WMT 2014 for a given global sparsity. - English-German dataset. We sparsified all fully-connected - layers and embeddings, which make up 99.87% of all of Figure2shows the distribution of sparsity across the differ- - the parameters in the model (the other parameters coming ent layer types in the Transformer for the top performing - from biases and layer normalization). The constant hyper- model at 90% global sparsity for each technique. Bothl0 - parameters used for all experiments are listed in table1. We regularization and variational dropout learn to keep more - followed the standard training procedure used byVaswani parameters in the embedding, FFN layers, and the output - et al.(2017), but did not perform checkpoint averaging. transforms for the multi-head attention modules and induce - This setup yielded a baseline BLEU score of 27.29 averaged more sparsity in the transforms for the query and value in- - across five runs. puts to the attention modules. Despite this advantage,l0 - regularization and variational dropout did not significantly - We extensively tuned the remaining hyperparameters for outperform magnitude pruning, even yielding inferior re- - each technique. Details on what hyperparameters we ex- sults at high sparsity levels. - plored, and the results of what settings produced the best - models can be found in AppendixD. It is also important to note that these results maintain a - constant number of training steps across all techniques and - that the Transformer variant with magnitude pruning trains4.1. 
Sparse Transformer Results & Analysis 1.24x and 1.65x faster thanl0 regularization and variational - All results for the Transformer are plotted in figure1. De- dropout respectively. While the standard Transformer train- - spite the vast differences in these approaches, the relative ing scheme produces excellent results for machine transla- - performance of all three techniques is remarkably consis- tion, it has been shown that training the model for longer - tent. Whilel0 regularization and variational dropout pro- can improve its performance by as much as 2 BLEU (Ott - duce the top performing models in the low-to-mid sparsity et al.,2018). Thus, when compared for a fixed training cost - range, magnitude pruning achieves the best results for highly magnitude pruning has a distinct advantage over these more - sparse models. While all techniques were able to outper- complicated techniques. The State of Sparsity in Deep Neural Networks - - - Table 2.Constant hyperparameters for all RN50 experiments. - Hyperparameter Value - dataset ImageNet - training iterations 128000 - batch size 1024 images - learning rate schedule standard - optimizer SGD with Momentum - sparsity range 50% - 98% - - - - 5. Sparse Image Classification - To benchmark these four sparsity techniques on a large- - scale computer vision task, we integrated each method into - ResNet-50 and trained the model on the ImageNet large- - scale image classification dataset. We sparsified all convolu- - tional and fully-connected layers, which make up 99.79% - of all of the parameters in the model (the other parameters Figure 3.Sparsity-accuracy trade-off curves for ResNet-50. - coming from biases and batch normalization). Top: Pareto frontiers for variational dropout, magnitude pruning, - and random pruning applied to ResNet-50. Bottom: All experi- The hyperparameters we used for all experiments are listed mental results with each technique. We observe large variation in - in Table2. Each model was trained for 128000 iterations performance for variational dropout andl0 regularization between - with a batch size of 1024 images, stochastic gradient descent Transformer and ResNet-50. Magnitude pruning and variational - with momentum, and the standard learning rate schedule dropout achieve comparable performance for most sparsity levels, - (see AppendixE.1). This setup yielded a baseline top-1 with variational dropout achieving the best results for high sparsity - accuracy of 76.69% averaged across three runs. We trained levels. - each model with 8-way data parallelism across 8 accelera- - tors. Due to the extra parameters and operations required for will be non-zero. 5 .Louizos et al.(2017b) reported results - variational dropout, the model was unable to fit into device applyingl0 regularization to a wide residual network (WRN) - memory in this configuration. For all variational dropout (Zagoruyko & Komodakis,2016) on the CIFAR-10 dataset, - experiments, we used a per-device batch size of 32 images and noted that they observed small accuracy loss at as low - and scaled the model over 32 accelerators. as 8% reduction in the number of parameters during training. - Applying our weight-levell0 regularization implementation - 5.1. ResNet-50 Results & Analysis to WRN produces a model with comparable training time - sparsity, but with no sparsity in the test-time parameters.Figure3shows results for magnitude pruning, variational For models that achieve test-time sparsity, we observe sig-dropout, and random pruning applied to ResNet-50. 
Surpris- nificant accuracy degradation on CIFAR-10. This result isingly, we were unable to produce sparse ResNet-50 mod- consistent with our observation forl els withl 0 regularization applied - 0 regularization that did not significantly damage to ResNet-50 on ImageNet.model quality. Across hundreds of experiments, our models - were either able to achieve full test set performance with The variation in performance for variational dropout andl0 - no sparsification, or sparsification with test set performance regularization between Transformer and ResNet-50 is strik- - akin to random guessing. Details on all hyperparameter ing. While achieving a good accuracy-sparsity trade-off, - settings explored are included in AppendixE. variational dropout consistently ranked behindl0 regulariza- - tion on Transformer, and was bested by magnitude pruningThis result is particularly surprising given the success ofl0 for sparsity levels of 80% and up. However, on ResNet-50regularization on Transformer. One nuance of thel0 regular- we observe that variational dropout consistently producesization technique ofLouizos et al.(2017b) is that the model - can have varying sparsity levels between the training and 5 The fraction of time a parameter is set to zero during training - test-time versions of the model. At training time, a parame- depends on other factors, e.g. theparameter of the hard-concrete - ter with a dropout rate of 10% will be zero 10% of the time distribution. However, this point is generally true that the training - and test-time sparsities are not necessarily equivalent, and that when sampled from the hard-concrete distribution. How- there exists some dropout rate threshold below which a weight that - ever, under the test-time parameter estimator, this weight is sometimes zero during training will be non-zero at test-time. The State of Sparsity in Deep Neural Networks - - - - - - - - - - - - - - Figure 4.Average sparsity in ResNet-50 layers.Distributions Figure 5.Sparsity-accuracy trade-off curves for ResNet-50 - calculated on the top performing model at 95% sparsity for each with modified sparsification scheme. Altering the distribution - technique. Variational dropout is able to learn non-uniform dis- of sparsity across the layers and increasing training time yield - tributions of sparsity, decreasing sparsity in the input and output significant improvement for magnitude pruning. - layers that are known to be disproportionately important to model - quality. 5.2. Pushing the Limits of Magnitude Pruning - Given that a uniform distribution of sparsity is suboptimal, - and the significantly smaller resource requirements for ap- - plying magnitude pruning to ResNet-50 it is natural to won- - models on-par or better than magnitude pruning, and that der how well magnitude pruning could perform if we were to - l0 regularization is not able to produce sparse models at distribute the non-zero weights more carefully and increase - all. Variational dropout achieved particularly notable results training time. - in the high sparsity range, maintaining a top-1 accuracy To understand the limits of the magnitude pruning heuristic,over 70% with less than 4% of the parameters of a standard we modify our ResNet-50 training setup to leave the firstResNet-50. convolutional layer fully dense, and only prune the final - The distribution of sparsity across different layer types in the fully-connected layer to 80% sparsity. 
This heuristic is - best variational dropout and magnitude pruning models at reasonable for ResNet-50, as the first layer makes up a small - 95% sparsity are plotted in figure4. While we kept sparsity fraction of the total parameters in the model and the final - constant across all layers for magnitude and random prun- layer makes up only .03% of the total FLOPs. While tuning - ing, variational dropout significantly reduces the amount of the magnitude pruning ResNet-50 models, we observed that - sparsity induced in the first and last layers of the model. the best models always started and ended pruning during - the third learning rate phase, before the second learning rateIt has been observed that the first and last layers are often drop. To take advantage of this, we increase the number ofdisproportionately important to model quality (Han et al., training steps by 1.5x by extending this learning rate region.2015;Bellec et al.,2017). In the case of ResNet-50, the Results for ResNet-50 trained with this scheme are plottedfirst convolution comprises only .037% of all the parame- in figure5.ters in the model. At 98% sparsity the first layer has only - 188 non-zero parameters, for an average of less than 3 pa- With these modifications, magnitude pruning outperforms - rameters per output feature map. With magnitude pruning variational dropout at all but the highest sparsity levels while - uniformly sparsifying each layer, it is surprising that it is still using less resources. However, variational dropout’s per- - able to achieve any test set performance at all with so few formance in the high sparsity range is particularly notable. - parameters in the input convolution. With very low amounts of non-zero weights, we find it likely - that the models performance on the test set is closely tied toWhile variational dropout is able to learn to distribute spar- precise allocation of weights across the different layers, andsity non-uniformly across the layers, it comes at a significant that variational dropout’s ability to learn this distributionincrease in resource requirements. For ResNet-50 trained enables it to better maintain accuracy at high sparsity levels.with variational dropout we observed a greater than 2x in- This result indicates that efficient sparsification techniquescrease in memory consumption. When scaled across 32 that are able to learn the distribution of sparsity across layersaccelerators, ResNet-50 trained with variational dropout are a promising direction for future work. completed training in 9.75 hours, compared to ResNet-50 - with magnitude pruning finishing in 12.50 hours on only 8 Its also worth noting that these changes produced mod- - accelerators. Scaled to a 4096 batch size and 32 accelerators, els at 80% sparsity with top-1 accuracy of 76.52%, only - ResNet-50 with magnitude pruning can complete the same .17% off our baseline ResNet-50 accuracy and .41% better - number of epochs in just 3.15 hours. than the results reported byHe et al.(2018), without the The State of Sparsity in Deep Neural Networks - - extra complexity and computational requirements of their - reinforcement learning approach. This represents a new - state-of-the-art sparsity-accuracy trade-off for ResNet-50 - trained on ImageNet. - - 6. 
Sparsification as Architecture Search - While sparsity is traditionally thought of as a model com- - pression technique, two independent studies have recently - suggested that the value of sparsification in neural net- - works is misunderstood, and that once a sparse topology - is learned it can be trained from scratch to the full perfor- - mance achieved when sparsification was performed jointly - with optimization. - Frankle & Carbin(2018) posited that over-parameterized - neural networks contain small, trainable subsets of weights, - deemed ”winning lottery tickets”. They suggest that sparsity - inducing techniques are methods for finding these sparse - topologies, and that once found the sparse architectures can - be trained from scratch withthe same weight initialization Figure 6.Scratch and lottery ticket experiments with magni- that was used when the sparse architecture was learned. tude pruning.Top: results with Transformer. Bottom: Results They demonstrated that this property holds across different with ResNet-50. Across all experiments, training from scratch - convolutional neural networks and multi-layer perceptrons using a learned sparse architecture is unable to re-produce the - trained on the MNIST and CIFAR-10 datasets. performance of models trained with sparsification as part of the - optimization process. Liu et al.(2018) similarly demonstrated this phenomenon - for a number of activation sparsity techniques on convolu- - tional neural networks, as well as for weight level sparsity To clarify the questions surrounding the idea of sparsifi-learned with magnitude pruning. However, they demon- cation as a form of neural architecture search, we repeatstrate this result using a random initialization during re- the experiments ofFrankle & Carbin(2018) andLiu et al.training. (2018) on ResNet-50 and Transformer. For each model, - The implications of being able to train sparse architectures we explore the full range of sparsity levels (50% - 98%) - from scratch once they are learned are large: once a sparse and compare to our well-tuned models from the previous - topology is learned, it can be saved and shared as with sections. - any other neural network architecture. Re-training then - can be done fully sparse, taking advantage of sparse linear 6.1. Experimental Framework - algebra to greatly accelerate time-to-solution. However, the The experiments ofLiu et al.(2018) encompass taking thecombination of these two studies does not clearly establish final learned weight mask from a magnitude pruning model,how this potential is to be realized. randomly re-initializing the weights, and training the model - Beyond the question of whether or not the original random with the normal training procedure (i.e., learning rate, num- - weight initialization is needed, both studies only explore ber of iterations, etc.). To account for the presence of spar- - convolutional neural networks (and small multi-layer per- sity at the start of training, they scale the variance of the - ceptrons in the case ofFrankle & Carbin(2018)). The initial weight distribution by the number of non-zeros in the - majority of experiments in both studies also limited their matrix. They additionally train a variant where they increase - analyses to the MNIST, CIFAR-10, and CIFAR-100 datasets. 
the number of training steps (up to a factor of 2x) such that - While these are standard benchmarks for deep learning mod- the re-trained model uses approximately the same number of - els, they are not indicative of the complexity of real-world FLOPs during training as model trained with sparsification - tasks where model compression is most useful.Liu et al. as part of the optimization process. They refer to these two - (2018) do explore convolutional architectures on the Ima- experiments as ”scratch-e” and ”scratch-b” respectively. - geNet datasets, but only at two relatively low sparsity levels Frankle & Carbin(2018) follow a similar procedure, but use(30% and 60%). They also note that weight level sparsity the same weight initialization that was used when the sparseon ImageNet is the only case where they are unable to re- weight mask was learned and do not perform the longerproduce the full accuracy of the pruned model. training time variant. The State of Sparsity in Deep Neural Networks - - For our experiments, we repeat the scratch-e, scratch-b and sparsity levels, we observe that the quality of the models - lottery ticket experiments with magnitude pruning on Trans- degrades relative to the magnitude pruning baseline as spar- - former and ResNet-50. For scratch-e and scratch-b, we also sity increases. For unstructured weight sparsity, it seems - train variants that do not alter the initial weight distribution. likely that the phenomenon observed byLiu et al.(2018) - For the Transformer, we re-trained five replicas of the best was produced by a combination of low sparsity levels and - magnitude pruning hyperparameter settings at each spar- small-to-medium sized tasks. We’d like to emphasize that - sity level and save the weight initialization and final sparse this result is only for unstructured weight sparsity, and that - weight mask. For each of the five learned weight masks, prior workLiu et al.(2018) provides strong evidence that - we train five identical replicas for the scratch-e, scratch- activation pruning behaves differently. - b, scratch-e with augmented initialization, scratch-b with - augmented initialization, and the lottery ticket experiments. 7. Limitations of This Study For ResNet-50, we followed the same procedure with three - re-trained models and three replicas at each sparsity level Hyperparameter exploration. For all techniques and - for each of the five experiments. Figure6plots the averages models, we carefully hand-tuned hyperparameters and per- - and min/max of all experiments at each sparsity level 6 . formed extensive sweeps encompassing thousands of exper- - iments over manually identified ranges of values. However, - 6.2. Scratch and Lottery Ticket Results & Analysis the number of possible settings vastly outnumbers the set - of values that can be practically explored, and we cannotAcross all of our experiments, we observed that training eliminate the possibility that some techniques significantlyfrom scratch using a learned sparse architecture is not able outperform others under settings we did not try.to match the performance of the same model trained with - sparsification as part of the optimization process. Neural architectures and datasets. Transformer and - ResNet-50 were chosen as benchmark tasks to represent aAcross both models, we observed that doubling the number cross section of large-scale deep learning tasks with diverseof training steps did improve the quality of the results for architectures. 
We can’t exclude the possibility that somethe scratch experiments, but was not sufficient to match the techniques achieve consistently high performance acrosstest set performance of the magnitude pruning baseline. As other architectures. More models and tasks should be thor-sparsity increased, we observed that the deviation between oughly explored in future work.the models trained with magnitude pruning and those trained - from scratch increased. For both models, we did not observe - a benefit from using the augmented weight initialization for 8. Conclusion - the scratch experiments. In this work, we performed an extensive evaluation of three - For ResNet-50, we experimented with four different learn- state-of-the-art sparsification techniques on two large-scale - ing rates schemes for the scratch-b experiments. We found learning tasks. Notwithstanding the limitations discussed in - that scaling each learning rate region to double the number section7, we demonstrated that complex techniques shown - of epochs produced the best results by a wide margin. These to yield state-of-the-art compression on small datasets per- - results are plotted in figure6. Results for the ResNet-50 form inconsistently, and that simple heuristics can achieve - scratch-b experiments with the other learning rate variants comparable or better results on a reduced computational bud- - are included with our release of hyperparameter tuning re- get. Based on insights from our experiments, we achieve a - sults. new state-of-the-art sparsity-accuracy trade-off for ResNet- - 50 with only magnitude pruning and highlight promisingFor the lottery ticket experiments, we were not able to repli- directions for research in sparsity inducing techniques.cate the phenomenon observed byFrankle & Carbin(2018). - The key difference between our experiments is the complex- Additionally, we provide strong counterexamples to two re- - ity of the tasks and scale of the models, and it seems likely cently proposed theories that models learned through prun- - that this is the main factor contributing to our inability to ing techniques can be trained from scratch to the same test - train these architecture from scratch. set performance of a model learned with sparsification as - part of the optimization process. Our results highlight theFor the scratch experiments, our results are consistent with need for large-scale benchmarks in sparsification and modelthe negative result observed by (Liu et al.,2018) for Im- compression. As such, we open-source our code, check-ageNet and ResNet-50 with unstructured weight pruning. points, and results of all hyperparameter configurations to By replicating the scratch experiments at the full range of establish rigorous baselines for future work. - 6 Two of the 175 Transformer experiments failed to train from - scratch at all and produced BLEU scores less than 1.0. We omit - these outliers in figure6 The State of Sparsity in Deep Neural Networks - - Acknowledgements Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., - Casagrande, N., Lockhart, E., Stimberg, F., van den Oord,We would like to thank Benjamin Caine, Jonathan Frankle, A., Dieleman, S., and Kavukcuoglu, K. Efficient NeuralRaphael Gontijo Lopes, Sam Greydanus, and Keren Gu for Audio Synthesis. InProceedings of the 35th Interna-helpful discussions and feedback on drafts of this paper. tional Conference on Machine Learning, ICML 2018, - Stockholmsmassan, Stockholm, Sweden, July 10-15, 2018¨ , - References pp. 2415–2424, 2018. 
- Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. Kingma, D. P. and Welling, M. Auto-encoding variational - Deep Rewiring: Training Very Sparse Deep Networks. bayes.CoRR, abs/1312.6114, 2013. - CoRR, abs/1711.05136, 2017. Kingma, D. P., Salimans, T., and Welling, M. Variational - Collins, M. D. and Kohli, P. Memory Bounded Deep Con- dropout and the local reparameterization trick. CoRR, - volutional Networks.CoRR, abs/1412.1442, 2014. URL abs/1506.02557, 2015. - http://arxiv.org/abs/1412.1442. LeCun, Y., Denker, J. S., and Solla, S. A. Optimal Brain - Dai, B., Zhu, C., and Wipf, D. P. Compressing Neural Damage. InNIPS, pp. 598–605. Morgan Kaufmann, - Networks using the Variational Information Bottleneck. 1989. - CoRR, abs/1802.10399, 2018. Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. - InNIPS, pp. 2178–2188, 2017.Frankle, J. and Carbin, M. The Lottery Ticket Hy- - pothesis: Training Pruned Neural Networks. CoRR, Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang,abs/1803.03635, 2018. URLhttp://arxiv.org/ C. Learning Efficient Convolutional Networks throughabs/1803.03635. Network Slimming. InIEEE International Conference - on Computer Vision, ICCV 2017, Venice, Italy, OctoberGray, S., Radford, A., and Kingma, D. P. Block- 22-29, 2017, pp. 2755–2763, 2017.sparse gpu kernels.https://blog.openai.com/ - block-sparse-gpu-kernels/, 2017. Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. - Rethinking the Value of Network Pruning. CoRR, - Guo, Y., Yao, A., and Chen, Y. Dynamic Network Surgery abs/1810.05270, 2018. - for Efficient DNNs. InNIPS, 2016. Louizos, C., Ullrich, K., and Welling, M. Bayesian Com- - Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both pression for Deep Learning. InAdvances in Neural In- - Weights and Connections for Efficient Neural Network. formation Processing Systems 30: Annual Conference - InNIPS, pp. 1135–1143, 2015. on Neural Information Processing Systems 2017, 4-9 De- - cember 2017, Long Beach, CA, USA, pp. 3290–3300, - Hassibi, B. and Stork, D. G. Second order derivatives for 2017a. - network pruning: Optimal brain surgeon. InNIPS, pp. - 164–171. Morgan Kaufmann, 1992. Louizos, C., Welling, M., and Kingma, D. P. Learn- - ing Sparse Neural Networks through L0Regularization. - He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learn- CoRR, abs/1712.01312, 2017b. - ing for Image Recognition. In2016 IEEE Conference on Luo, J., Wu, J., and Lin, W. Thinet: A Filter Level PruningComputer Vision and Pattern Recognition, CVPR 2016, Method for Deep Neural Network Compression. InIEEELas Vegas, NV, USA, June 27-30, 2016, pp. 770–778, International Conference on Computer Vision, ICCV2016. 2017, Venice, Italy, October 22-29, 2017, pp. 5068–5076, - 2017.He, Y., Lin, J., Liu, Z., Wang, H., Li, L., and Han, S. AMC: - automl for model compression and acceleration on mo- Mitchell, T. J. and Beauchamp, J. J. Bayesian Variablebile devices. InComputer Vision - ECCV 2018 - 15th Selection in Linear Regression.Journal of the AmericanEuropean Conference, Munich, Germany, September 8- Statistical Association, 83(404):1023–1032, 1988.14, 2018, Proceedings, Part VII, pp. 815–832, 2018. - Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., - Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, Gibescu, M., and Liotta, A. Scalable Training of Artifi- - H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and cial Neural Networks with Adaptive Sparse Connectivity - Zhou, Y. Deep learning scaling is predictable, empirically. 
Inspired by Network Science.Nature Communications, - CoRR, abs/1712.00409, 2017. 2018. The State of Sparsity in Deep Neural Networks - - Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational Zagoruyko, S. and Komodakis, N. Wide Residual Networks. - Dropout Sparsifies Deep Neural Networks. InProceed- InProceedings of the British Machine Vision Conference - ings of the 34th International Conference on Machine 2016, BMVC 2016, York, UK, September 19-22, 2016, - Learning, ICML 2017, Sydney, NSW, Australia, 6-11 Au- 2016. - gust 2017, pp. 2498–2507, 2017. Zhu, M. and Gupta, S. To prune, or not to prune: exploring - Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. the efficacy of pruning for model compression.CoRR, - Pruning Convolutional Neural Networks for Resource Ef- abs/1710.01878, 2017. URLhttp://arxiv.org/ - ficient Transfer Learning.CoRR, abs/1611.06440, 2016. abs/1710.01878. - - Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. Ex- - ploring Sparsity in Recurrent Neural Networks.CoRR, - abs/1704.05119, 2017. - - Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling - Neural Machine Translation. InProceedings of the Third - Conference on Machine Translation: Research Papers, - WMT 2018, Belgium, Brussels, October 31 - November 1, - 2018, pp. 1–9, 2018. - - Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic - Backpropagation and Approximate Inference in Deep - Generative models. InICML, volume 32 ofJMLR - Workshop and Conference Proceedings, pp. 1278–1286. - JMLR.org, 2014. - - Strom, N. Sparse Connection and Pruning in Large Dynamic¨ - Artificial Neural Networks. InEUROSPEECH, 1997. - - Theis, L., Korshunova, I., Tejani, A., and Huszar, F. Faster´ - gaze prediction with dense networks and Fisher pruning. - CoRR, abs/1801.05787, 2018. URLhttp://arxiv. - org/abs/1801.05787. - - Ullrich, K., Meeds, E., and Welling, M. Soft Weight- - Sharing for Neural Network Compression. CoRR, - abs/1702.04008, 2017. - - Valin, J. and Skoglund, J. Lpcnet: Improving Neural - Speech Synthesis Through Linear Prediction. CoRR, - abs/1810.11846, 2018. URLhttp://arxiv.org/ - abs/1810.11846. - - van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., - Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., - and Kavukcuoglu, K. Wavenet: A Generative Model for - Raw Audio. InThe 9th ISCA Speech Synthesis Workshop, - Sunnyvale, CA, USA, 13-15 September 2016, pp. 125, - 2016. - - Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, - L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Atten- - tion is All you Need. InAdvances in Neural Information - Processing Systems 30: Annual Conference on Neural In- - formation Processing Systems 2017, 4-9 December 2017, - Long Beach, CA, USA, pp. 6000–6010, 2017. The State of Sparsity in Deep Neural Networks: Appendix - - - - - A. Overview of Sparsity Inducing Techniques p(w)with observed dataDinto an updated belief over the - parameters in the form of the posterior distributionp(wjD).Here we provide a more detailed review of the three sparsity In practice, computing the true posterior using Bayes’ ruletechniques we benchmarked. is computationally intractable and good approximations are - needed. In variational inference, we optimize the parame-A.1. 
Magnitude Pruning tersof some parameterized modelq (w)such thatq (w) - Magnitude-based weight pruning schemes use the magni- is a close approximation to the true posterior distribution - tude of each weight as a proxy for its importance to model p(wjD)as measured by the Kullback-Leibler divergence - quality, and remove the least important weights according between the two distributions. The divergence of our ap- - to some sparsification schedule over the course of training. proximate posterior from the true posterior is minimized in - Many variants have been proposed (Collins & Kohli,2014; practice by maximizing the variational lower-bound - Han et al.,2015;Guo et al.,2016;Zhu & Gupta,2017), - with the key differences lying in when weights are removed, L() =D Lwhether weights should be sorted to remove a precise pro- KL (q (w)jjp(w)) + D () - - portion or thresholded based on a fixed or decaying value, PwhereLand whether or not weights that have been pruned still re- D () = Eq (w) [logp(yjx;w)] - (x;y)2D - ceive gradient updates and have the potential to return after Using the Stochastic Gradient Variational Bayes (SGVB)being pruned. (Kingma et al.,2015) algorithm to optimize this bound, - Han et al.(2015) use iterative magnitude pruning and re- LD ()reduces to the standard cross-entropy loss, and the - training to progressively sparsify a model. The target model KL divergence between our approximate posterior and prior - is first trained to convergence, after which a portion of over the parameters serves as a regularizer that enforces our - weights are removed and the model is re-trained with these initial belief about the parametersw. - weights fixed to zero. This process is repeated until the In the standard formulation of variational dropout, we as-target sparsity is achieved.Guo et al.(2016) improve on sume the weights are drawn from a fully-factorized Gaussianthis approach by allowing masked weights to still receive approximate posterior.gradient updates, enabling the network to recover from in- - correct pruning decisions during optimization. They achieve - higher compression rates and interleave pruning steps with wij q (wij ) =N(ij ; ij 2 )ij gradient update steps to avoid expensive re-training.Zhu - & Gupta(2017) similarly allow gradient updates to masked Whereandare neural network parameters. For eachweights, and make use of a gradual sparsification schedule training step, we sample weights from this distribution andwith sorting-based weight thresholding to maintain accuracy use thereparameterization trick(Kingma & Welling,2013; while achieving a user specified level of sparsification. Rezende et al.,2014) to differentiate the loss w.r.t. the pa- - Its worth noting that magnitude pruning can easily be rameters through the sampling operation. Given the weights - adapted to induce block or activation level sparsity by re- are normally distributed, the distribution of the activations - moving groups of weights based on their p-norm, average, Bafter a linear operation like matrix multiplication or con- - max, or other statistics. Variants have also been proposed volution is also Gaussian and can be calculated in closed - that maintain a constant level of sparsity during optimization form 7 . - to enable accelerated training (Mocanu et al.,2018). - q (bmj jA) N(mj ; mj ) - A.2. Variational Dropout - Consider the setting of a datasetDofNi.i.d. 
samples PK PK with (x;y)and a standard classification problem where the goal mj = ami ij andmj = a2 mi ij 2 and iji=1 i=1 - is to learn the parameterswof the conditional probability whereami 2Aare the inputs to the layer. Thus, rather - p(yjx;w). Bayesian inference combines some initial belief 7 We ignore correlation in the activations, as is done by over the parameterswin the form of a prior distribution Molchanov et al.(2017) The State of Sparsity in Deep Neural Networks: Appendix - - than sample weights, we can directly sample the activations andandstretch the distribution s.t.zj takes value 0 or 1 - at each layer. This step is known as thelocal reparame- with non-zero probability. - terization trick, and was shown byKingma et al.(2015) to On each training iteration,zreduce the variance of the gradients relative to the standard j is sampled from this distri- - bution and multiplied with the standard neural networkformulation in which a single set of sampled weights must weights. The expectedlbe shared for all samples in the input batch for efficiency. 0 -normLC can then be calcu- - lated using the cumulative distribution function of the hard-Molchanov et al.(2017) showed that the variance of the gra- concrete distribution and optimized directly with stochasticdients could be further reduced by using anadditive noise gradient descent.reparameterization, where we define a new parameter - - 2 =ij ij 2ij Xjj Xjj LC = (1Qs (0j)) = sigmoid(log - Under this parameterization, we directly optimize the mean j j log )j=1 j=1 - and variance of the neural network parameters. - Under the assumption of a log-uniform prior on the weights At test-time,Louizos et al.(2017b) use the following esti- - w, the KL divergence component of our objective function mate for the model parameters. - DKL (q (wij )jjp(wij ))can be accurately approximated - (Molchanov et al.,2017): - =~ z^ - z^=min(1;max(0;sigmoid(log)() +)) - DKL (q (wij )jjp(wij )) - k1 (k2 +k3 logij )0:5log(1 +1 +kij 1 ) Interestingly,Louizos et al.(2017b) showed that their ob- - k jective function under thel1 = 0:63576 k2 = 1:87320 k3 = 1:48695 0 penalty is a special case of a - variational lower-bound over the parameters of the network - under a spike and slab (Mitchell & Beauchamp,1988) prior.After training a model with variational dropout, the weights - with the highestvalues can be removed. For all their - experiments,Molchanov et al.(2017) removed weights with B. Variational Dropout Implementation - loglarger than 3.0, which corresponds to a dropout rate Verification - greater than 95%. Although they demonstrated good results, - it is likely that the optimalthreshold varies across different To verify our implementation of variational dropout, we - models and even different hyperparameter settings of the applied it to LeNet-300-100 and LeNet-5-Caffe on MNIST - same model. We address this question in our experiments. and compared our results to the original paper (Molchanov - et al.,2017). We matched our hyperparameters to those - used in the code released with the paper 8 . All results areA.3.l0 Regularization listed in table3 - To optimize thel0 -norm, we reparameterize the model - weightsas the product of a weight and a random vari- Table 3.Variational Dropout MNIST Reproduction Results. able drawn from the hard-concrete distribution. 
Network Experiment Sparsity (%) Accuracy (%) - original (Molchanov et al.,2017) 98.57 98.08 - ours (log= 3.0) 97.52 98.42LeNet-300-100 ours (log= 2.0) 98.50 98.40 - ours (log= 0.1) 99.10 98.13 - j =~j zj original (Molchanov et al.,2017) 99.60 99.25 - wherez LeNet-5-Caffe ours (log= 3.0) 99.29 99.26 - j min(1;max(0;s));s=s() + ours (log= 2.0) 99.50 99.25 - s=sigmoid((logulog(1u) +log)=) - andu U(0;1) Our baseline LeNet-300-100 model achieved test set accu- - racy of 98.42%, slightly higher than the baseline of 98.36% - reported in (Molchanov et al.,2017). Applying our varia-In this formulation, theparameter that controls the posi- tional dropout implementation to LeNet-300-100 with these tion of the hard-concrete distribution (and thus the proba- hyperparameters produced a model with 97.52% global spar-bility thatzj is zero) is optimized with gradient descent., sity and 98.42% test accuracy. The original paper produced, andare fixed parameters that control the shape of the - hard-concrete distribution.controls the curvature ortem- 8 https://github.com/ars-ashuha/variational-dropout-sparsifies- - peratureof the hard-concrete probability density function, dnn The State of Sparsity in Deep Neural Networks: Appendix - - Our baseline WRN-28-10 implementation trained on - CIFAR-10 achieved a test set accuracy of 95.45%. Using - ourl0 regularization implementation and al0 -norm weight - of .0003, we trained a model that achieved 95.34% accuracy - on the test set while achieving a consistent training-time - FLOPs reduction comparable to that reported byLouizos - et al.(2017b). Floating-point operations (FLOPs) required - to compute the forward over the course of training WRN- - 28-10 withl0 are plotted in figure7. - During our re-implementation of the WRN experiments - Figure 7.Forward pass FLOPs for WRN-28-10 trained withl0 fromLouizos et al.(2017b), we identified errors in the orig- regularization.Our implementation achieves FLOPs reductions inal publications FLOP calculations that caused the number comparable to those reported inLouizos et al.(2017b). of floating-point operations in WRN-28-10 to be miscalcu- - lated. We’ve contacted the authors, and hope to resolve this - issue to clarify their performance results. - a model with 98.57% global sparsity, and 98.08% test accu- - racy. While our model achieves .34% higher tests accuracy D. Sparse Transformer Experiments with 1% lower sparsity, we believe the discrepancy is mainly - due to difference in our software packages: the authors of D.1. Magnitude Pruning Details - (Molchanov et al.,2017) used Theano and Lasagne for their For our magnitude pruning experiments, we tuned four keyexperiments, while we use TensorFlow. hyperparameters: the starting iteration of the sparsification - Given our model achieves highest accuracy, we can decrease process, the ending iteration of the sparsification process, - thelogthreshold to trade accuracy for more sparsity. With the frequency of pruning steps, and the combination of other - alogthreshold of 2.0, our model achieves 98.5% global regularizers (dropout and label smoothing) used during train- - sparsity with a test set accuracy of 98.40%. With alog ing. We trained models with 7 different target sparsities: - threshold of 0.1, our model achieves 99.1% global sparsity 50%, 60%, 70%, 80%, 90%, 95%, and 98%. At each of - with 98.13% test set accuracy, exceeding the sparsity and these sparsity levels, we tried pruning frequencies of 1000 - accuracy of the originally published results. and 10000 steps. 
During preliminary experiments we identi- - fied that the best settings for the training step to stop pruningOn LeNet-5-Caffe, our implementation achieved a global at were typically closer to the end of training. Based on thissparsity of 99.29% with a test set accuracy of 99.26%, ver- insight, we explored every possible combination of start andsus the originaly published results of 99.6% sparsity with end points for the sparsity schedule in increments of 10000099.25% accuracy. Lowering thelogthreshold to 2.0, our steps with an ending step of 300000 or greater.model achieves 99.5% sparsity with 99.25% test accuracy. - By default, the Transformer uses dropout with a dropout - C.l rate of 10% on the input to the encoder, decoder, and before 0 Regularization Implementation each layer and performs label smoothing with a smooth- Verification ing parameter of .1. We found that decreasing these other - The originall regularizers produced higher quality models in the mid to 0 regularization paper uses a modified version - of the proposed technique for inducing group sparsity in high sparsity range. For each hyperparameter combination, - models, so our weight-level implementation is not directly we tried three different regularization settings: standard la- - comparable. However, to verify our implementation we bel smoothing and dropout, label smoothing only, and no - trained a Wide ResNet (WRN) (Zagoruyko & Komodakis, regularization. - 2016) on CIFAR-10 and compared results to those reported - in the original publication for group sparsity. D.2. Variational Dropout Details - - As done byLouizos et al.(2017b), we applyl For the Transformer trained with variational dropout, we 0 to the - first convolutional layer in the residual blocks (i.e., where extensively tuned the coefficient for the KL divergence - dropout would normally be used). We use the weight decay component of the objective function to find models that - formulation for the re-parameterized weights, and scale the achieved high accuracy with sparsity levels in the target - weight decay coefficient to maintain the same initial length range. We found that KL divergence weights in the range - scale of the parameters. We use the same batch size of 128 [:1 ;1 ], whereNis the number of samples in the training N N - samples and the same initial log, and train our model on a set, produced models in our target sparsity range. - single GPU. The State of Sparsity in Deep Neural Networks: Appendix - - (Molchanov et al.,2017) noted difficulty training some mod- E. Sparse ResNet-50 - els from scratch with variational dropout, as large portions - of the model adopt high dropout rates early in training be- E.1. Learning Rate - fore the model can learn a useful representation from the For all experiments, the we used the learning rate schemedata. To address this issue, they use a gradual ramp-up of the used by the official TensorFlow ResNet-50 implementation 9 .KL divergence weight, linearly increasing the regularizer With our batch size of 1024, this includes a linear ramp-upcoefficient until it reaches the desired value. for 5 epochs to a learning rate of .4 followed by learning - For our experiments, we explored using a constant regu- rate drops by a factor of 0.1 at epochs 30, 60, and 80. - larizer weight, linearly increasing the regularizer weight, - and also increasing the regularizer weight following the E.2. Magnitude Pruning Details - cubic sparsity function used with magnitude pruning. 
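The per-weight KL term whose coefficient is being tuned here is the approximation from Appendix A.2 (Molchanov et al., 2017). A minimal sketch of that term, and of the post-training log alpha threshold of 3.0 mentioned in the text, is below; tensor and function names are illustrative, and the expression is reproduced up to an additive constant.

import torch

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants from Molchanov et al. (2017)

def approx_negative_kl(log_alpha):
    # Approximate -KL(q || p) under the log-uniform prior, summed over weights,
    # up to an additive constant; the training loss subtracts this term.
    return (K1 * torch.sigmoid(K2 + K3 * log_alpha)
            - 0.5 * torch.log1p(torch.exp(-log_alpha))).sum()

def post_training_mask(log_alpha, threshold=3.0):
    # Weights with log alpha > 3.0 (a dropout rate above roughly 95%) are removed.
    return (log_alpha <= threshold).float()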
For the linear and cubic weight schedules, we tried each combination of possible start and end points in increments of 100000 steps. For each hyperparameter combination, we also tried the three different combinations of dropout and label smoothing as with magnitude pruning. For each trained model, we evaluated the model with 11 log α thresholds in the range [0, 5]. For all experiments, we initialized all log σ² parameters to the constant value -10.

D.3. l0 Regularization Details

For Transformers trained with l0 regularization, we similarly tuned the coefficient for the l0-norm in the objective function. We observed that much higher magnitude regularization coefficients were needed to produce models with the same sparsity levels relative to variational dropout. We found that l0-norm weights in the range [1/N, 100/N] produced models in our target sparsity range.

For all experiments, we used the default settings for the parameters of the hard-concrete distribution: β = 2/3, γ = -0.1, and ζ = 1.1. We initialized the log α parameters to 2.197, corresponding to a 10% dropout rate.

For each hyperparameter setting, we explored the three regularizer coefficient schedules used with variational dropout and each of the three combinations of dropout and label smoothing.

D.4. Random Pruning Details

We identified in preliminary experiments that random pruning typically produces the best results by starting and ending pruning early and allowing the model to finish the rest of the training steps with the final sparse weight mask. For our experiments, we explored all hyperparameter combinations that we explored with magnitude pruning, and also included start/end pruning step combinations with an end step of less than 300000.
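The 2.197 initialization quoted in D.3 is consistent with initializing log α to log((1 - p)/p) for an initial dropout rate p; the short sketch below only illustrates that mapping, applied to the initial rates swept for the l0-regularized ResNet-50 later in this appendix (E.4). The helper name is ours, and the parameterization is an assumption stated here because it reproduces the quoted value.

import math

def init_log_alpha(dropout_rate):
    """Initial log-alpha of a hard-concrete gate for a target initial dropout rate."""
    return math.log(1.0 - dropout_rate) - math.log(dropout_rate)

for p in (0.01, 0.05, 0.10, 0.30):
    print("dropout rate %.2f -> initial log alpha = %.3f" % (p, init_log_alpha(p)))
# dropout rate 0.10 -> initial log alpha = 2.197, matching the Transformer setting above.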
E. Sparse ResNet-50

E.1. Learning Rate

For all experiments, we used the learning rate scheme of the official TensorFlow ResNet-50 implementation (https://bit.ly/2Wd2Lk0). With our batch size of 1024, this includes a linear ramp-up for 5 epochs to a learning rate of .4 followed by learning rate drops by a factor of 0.1 at epochs 30, 60, and 80.

E.2. Magnitude Pruning Details

For magnitude pruning on ResNet-50, we trained models with a target sparsity of 50%, 70%, 80%, 90%, 95%, and 98%. At each sparsity level, we tried starting pruning at steps 8k, 20k, and 40k. For each potential starting point, we tried ending pruning at steps 68k, 76k, and 100k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k steps and explored training with and without label smoothing. During preliminary experiments, we observed that removing weight decay from the model consistently caused significant decreases in test accuracy. Thus, for all hyperparameter combinations, we left weight decay on with the standard coefficient.

For a target sparsity of 98%, we observed that very few hyperparameter combinations were able to complete training without failing due to numerical issues. Out of all the hyperparameter configurations we tried, only a single model was able to complete training without erroring from the presence of NaNs. As explained in the main text, at high sparsity levels the first layer of the model has very few non-zero parameters, leading to instability during training and low test set performance. Pruned ResNet-50 models with the first layer left dense did not exhibit these issues.

E.3. Variational Dropout Details

For variational dropout applied to ResNet-50, we explored the same combinations of start and end points for the KL-divergence weight ramp-up as we did for the start and end points of magnitude pruning. For all Transformer experiments, we did not observe a significant gain from using a cubic KL-divergence weight ramp-up schedule, and thus we only explored the linear ramp-up for ResNet-50. For each combination of start and end points for the KL-divergence weight, we explored 9 different coefficients for the KL-divergence loss term: .01/N, .03/N, .05/N, .1/N, .3/N, .5/N, 1/N, 10/N, and 100/N.

Contrary to our experience with Transformer, we found ResNet-50 with variational dropout to be highly sensitive to the initialization of the log σ² parameters. With the standard setting of -10, we couldn't match the baseline accuracy, and with an initialization of -20 our models achieved good test performance but no sparsity. After some experimentation, we were able to produce good results with an initialization of -15.

While with Transformer we saw a reasonable amount of variance in test set performance and sparsity with the same model evaluated at different log α thresholds, we did not observe the same phenomenon for ResNet-50. Across a range of log α values, we saw consistent accuracy and nearly identical sparsity levels. For all of the results reported in the main text, we used a log α threshold of 0.5, which we found to produce slightly better results than the standard threshold of 3.0.
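As a sketch of how a trained variational dropout model is converted to a sparse one at a given log α threshold (the 0.5 and 3.0 values discussed above): this assumes the usual rule log α = log σ² - log θ², with a weight zeroed once its log α exceeds the threshold, and all names below are ours rather than the paper's.

import numpy as np

def sparsify(theta, log_sigma2, log_alpha_threshold=3.0):
    """Zero out weights whose learned dropout rate is too high.

    log_alpha = log(sigma^2 / theta^2); a large log_alpha means the weight is
    dominated by noise, so it is pruned when log_alpha exceeds the threshold.
    """
    log_alpha = log_sigma2 - np.log(np.square(theta) + 1e-8)
    keep = log_alpha < log_alpha_threshold
    return theta * keep, 1.0 - keep.mean()   # sparse weights and global sparsity

theta = np.random.randn(512, 512) * 0.05
log_sigma2 = np.full(theta.shape, -8.0)
sparse_theta, sparsity = sparsify(theta, log_sigma2, log_alpha_threshold=0.5)

Lowering the threshold prunes more weights, which matches the sparsity/accuracy trade-off described for the LeNet models earlier in this appendix.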
E.4. l0 Regularization Details

For l0 regularization, we explored four different initial log α values corresponding to dropout rates of 1%, 5%, 10%, and 30%. For each dropout rate, we extensively tuned the l0-norm weight to produce models in the desired sparsity range. After identifying the proper range of l0-norm coefficients, we ran experiments with 20 different coefficients in that range. For each combination of these hyperparameters, we tried all four combinations of other regularizers: standard weight decay and label smoothing, only weight decay, only label smoothing, and no regularization. For weight decay, we used the formulation for the reparameterized weights provided in the original paper, and followed their approach of scaling the weight decay coefficient based on the initial dropout rate to maintain a constant length-scale between the l0-regularized model and the standard model.

Across all of these experiments, we were unable to produce ResNet models that achieved a test set performance better than random guessing. For all experiments, we observed that training proceeded reasonably normally until the l0-norm loss began to drop, at which point the model incurred severe accuracy loss. We include the results of all hyperparameter combinations in our data release.

Additionally, we tried a number of tweaks to the learning process to improve the results, to no avail. We explored training the model for twice the number of epochs, training with much higher initial dropout rates, modifying the β parameter of the hard-concrete distribution, and a modified test-time parameter estimator.

E.5. Random Pruning Details

For random pruning on ResNet-50, we shifted the set of possible start and end points for pruning earlier in training relative to those we explored for magnitude pruning. At each of the sparsity levels tried with magnitude pruning, we tried starting pruning at step 0, 8k, and 20k. For each potential starting point, we tried ending pruning at steps 40k, 68k, and 76k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k and explored training with and without label smoothing.

E.6. Scratch-B Learning Rate Variants

For the scratch-b (Liu et al., 2018) experiments with ResNet-50, we explored four different learning rate schemes for the extended training time (2x the default number of epochs).

The first learning rate scheme we explored was uniformly scaling each of the five learning rate regions to last for double the number of epochs. This setup produced the best results by a wide margin. We report these results in the main text.

The second learning rate scheme was to keep the standard learning rate, and maintain the final learning rate for the extra training steps, as is common when fine-tuning deep neural networks. The third learning rate scheme was to maintain the standard learning rate, and continually drop the learning rate by a factor of 0.1 every 30 epochs. The last scheme we explored was to skip the learning rate warm-up, and drop the learning rate by 0.1 every 30 epochs. This learning rate scheme is closest to the one used by Liu et al. (2018). We found that this scheme underperformed relative to the scaled learning rate scheme with our training setup.

Results for all learning rate schemes are included with the released hyperparameter tuning data.
\ No newline at end of file
diff --git a/Corpus/Tien-Ju_Yang_NetAdapt_Platform-Aware_Neural_ECCV_2018_paper.txt b/Corpus/Tien-Ju_Yang_NetAdapt_Platform-Aware_Neural_ECCV_2018_paper.txt
deleted file mode 100644
index 610ac21..0000000
Binary files a/Corpus/Tien-Ju_Yang_NetAdapt_Platform-Aware_Neural_ECCV_2018_paper.txt and /dev/null differ
diff --git a/Corpus/You Cannot Improve What You Do not Measure FPGA vs. ASIC Efficiency Gaps for ConvolutionalNeural Network Inference.txt
deleted file mode 100644
index 61dd649..0000000
--- a/Corpus/You Cannot Improve What You Do not Measure FPGA vs. ASIC Efficiency Gaps for ConvolutionalNeural Network Inference.txt
+++ /dev/null
@@ -1,1187 +0,0 @@

You Cannot Improve What You Do not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

ANDREW BOUTROS, SADEGH YAZDANSHENAS, and VAUGHN BETZ, Department of Electrical and Computer Engineering, University of Toronto

Recently, deep learning (DL) has become best-in-class for numerous applications but at a high computational cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16 and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly from 2.8× to 6.3×, while the area gap is consistent across CAs with an 8.7 average FPGA-to-ASIC area ratio. Among different blocks of the CAs, the convolution engine, constituting up to 60% of the total area, has a high area ratio ranging from 13 to 31.
Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking the on-chip memories to reduce the programmability gap for DL applications.

CCS Concepts: • Hardware → Reconfigurable logic and FPGAs; Hardware accelerators; Reconfigurable logic applications;

Additional Key Words and Phrases: Deep learning, convolutional neural networks, FPGA, ASIC

ACM Reference format:
Andrew Boutros, Sadegh Yazdanshenas, and Vaughn Betz. 2018. You Cannot Improve What You Do not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference. ACM Trans. Reconfigurable Technol. Syst. 11, 3, Article 20 (December 2018), 23 pages. https://doi.org/10.1145/3242898

1 INTRODUCTION

Recent advances in deep learning (DL) have led to breakthroughs in a myriad of fields, achieving unprecedented accuracy in tasks that were thought to be inherently unsuitable for our computing machines to perform. It has become, in a very short time span, the de-facto standard for numerous applications ranging from simple image classification [36], machine translation [44], and speech recognition [10] to generating artistic paintings [9], composing music [7], and beating world champions in complex board games [41]. Interestingly, the basic foundations of DL and the algorithm currently used to train deep neural networks (DNNs), known as back-propagation, were established in the 1980s [35]. But it was not until recent years that it experienced a resurgence of interest [20], powered by both the abundance of data required for training and the availability of the tremendous compute-power necessary to train and deploy those models.

However, the main drawback of DNNs remains to be their high computational complexity when compared to conventional detection and classification computer vision algorithms based on hand-crafted features.
For example, a relatively simple eight-layer convolutional neural network (CNN), AlexNet [20], has a computational complexity of 25.8 GOP/Mpixel for its convolutional layers, which is 36.9× higher than that of a conventional histogram of oriented gradients feature extractor [43]. This gap grows even wider as we seek to improve the accuracy of CNNs by building deeper, bigger and more complex models that can surpass human-level performance on visual recognition tasks [14]. The ImageNet large-scale visual recognition challenge witnessed a 15× increase in operations required per image inference in return for an 11.7% reduction in classification error between 2012 and 2015 [15, 36]. This substantial increase in compute requirements motivates high-performance and energy-efficient hardware accelerators to replace or co-exist with conventional CPUs in executing both CNN training and inference tasks.

The training of CNN models is commonly performed in floating-point representation on graphics processing units (GPUs) having thousands of cores and large external memory bandwidth. It does not require much effort to deploy existing models or train new ones on GPUs using various frameworks (e.g., Caffe [18] and TensorFlow [1]) that exploit highly optimized GPU libraries such as Nvidia CuDNN [5] for dense and sparse matrix operations. Although GPUs can deliver high performance by performing batch computations, they are extremely power-hungry. This is affordable for training, which has no constraints on output latency and is carried out a limited number of times during the development phase. However, when it comes to inference, this is not ideal for a wide class of applications that have limited power budget and tight latency constraints such as mobile embedded platforms, self-driving cars or large-scale datacenter services.

To achieve the best performance and energy-efficiency, many researchers have focused on building custom application-specific integrated circuits (ASICs) for accelerating CNN inference workloads. Some examples are DaDianNao [3] that accelerates different types of DNNs using a multi-chip architecture and Eyeriss [4] that focuses on energy-efficient acceleration of convolutional layers by maximizing data re-use, performing data compression and using a zero-skipping technique. Despite being an attractive solution, ASICs do not offer enough flexibility to accommodate the rapid evolution of CNN models and the emergence of new types of layers used in them including the branching, elementwise addition and batch normalization layers as in more recent models (e.g., GoogLeNet [45] and ResNet [15]). As well, the high non-recurring engineering (NRE) cost and time for design, verification and fabrication of a large ASIC chip makes it difficult to keep pace with the rapid model improvements in this space.

As a trade-off between performance, power-efficiency, and flexibility, FPGAs offer an interesting design point between GPUs and ASICs and recently have had much success in accelerating datacenter workloads in general [32] and more specifically CNN inference tasks [30]. In contrast to GPUs, FPGAs are generally more energy-efficient. A high-end Titan X Nvidia GPU can consume up to 5× more power compared to a high-end Intel Arria 10 FPGA running AlexNet inference tasks [2].
Several studies have also shown that CNN inference does not require high-precision floating-point computations and can be carried out using fixed-point arithmetic for less than 1% accuracy degradation [13]. This wide variety of precisions used in CNN inference matches well with FPGAs as they can execute non-standard custom bit-width datapaths with much higher efficiency and flexibility than GPUs. However, they have a shorter turn-around time, less NRE cost, and can be re-configured to support new models and layer types when compared to ASIC accelerators. Another interesting advantage for FPGAs is that they offer a variety of I/Os that support different communication protocols. This is useful when the CNN accelerator is a part of a larger system and receives inputs from different types of digital and analog sensors, as is the case in automotive applications. However, FPGAs run at significantly lower frequencies due to their reconfigurability overhead and thus have lower raw performance compared to both GPUs and ASICs.

For this reason and despite their drawbacks, several companies have developed ASIC solutions to meet the processing needs of high-performance DL applications. A recent example of that is Google's Tensor Processing Unit [19] that was deployed in datacenters to accelerate inference tasks for various types of DNNs. It has almost 17× more multiply accumulate (MAC) units, 5.6× more on-chip memory and runs at 3.5× higher frequency when compared to Microsoft's Catapult V1 [32] that uses Intel Stratix V FPGAs. In this work, we study the area and performance gap between FPGAs and ASICs in accelerating inference tasks using multiple CNN computing architectures (CAs) to highlight the limitations of current FPGA architectures and how they affect the overall performance of DL accelerators. The motive behind this study is twofold: first, it shows which design practices are more suitable for FPGA platforms and make the best use of current FPGA architectures; second, it provides FPGA architects with data on where FPGAs have the largest efficiency gap compared to ASICs, which can lead to insights on how current FPGA architectures could be modified to shrink this gap and deliver higher performance in a domain with extremely high demand such as DL.

In this article, we make the following contributions:

• We implement highly optimized RTL designs for three state-of-the-art CAs that use different parallelization schemes to accelerate CNNs. We then extend each of these previously published architectures to support all layer types required to implement three different CNN models: AlexNet, VGG-16, and ResNet-50 to ensure our comparisons consider a broadly representative set of CNN models and implementations.

• We present a quantitative comparison of area and performance results to measure the gap between the same CAs implemented on a high-end Intel Arria 10 FPGA and a 28nm ASIC.

• We trace back the bottlenecks resulting in this gap and pinpoint the limitations of current FPGA architectures in accelerating CNNs.

2 BACKGROUND

Deep Neural Networks are a class of machine-learning algorithms that were developed to mimic the information-processing paradigm in biological nervous systems.
The human brain as an example has an average of around 86 billion neurons [16] connected in a complex network in which each neuron receives inputs from its surrounding neurons and fires an activation if those inputs are greater than a specific threshold. Inspired by this system, DNNs typically consist of several layers, each of which has d^(l) neurons, where l is the layer number ranging from 1 to L. Each artificial neuron performs a biased weighted sum of all its inputs followed by a non-linear activation function to produce its output as shown in Equation (1), where x_i^(l) is the output of neuron i of layer l, w_ij^(l) is the weight parameter between the neuron j in layer l and neuron i in layer l-1, w_0j^(l) is the bias term and θ is the non-linear activation function that can be a sigmoid, tanh, or rectified linear unit (ReLU) function. This equation can be viewed as a series of MAC operations, which form the majority of computations in DNNs:

$x_j^{(l)} = \theta\left( w_{0j}^{(l)} + \sum_{i=1}^{d^{(l-1)}} x_i^{(l-1)} w_{ij}^{(l)} \right)$.    (1)

Fig. 1. Different layer types in an example CNN.

CNNs are a subset of DNNs in which the connections between neurons of successive layers are sparse. Each neuron receives inputs only from neighboring neurons of the previous layer, or its so-called receptive field. This significantly reduces the number of weights and MAC operations required and achieves high accuracy in applications with spatial or temporal correlation between input samples such as image classification, gesture and speech recognition. Sections 2.1 and 2.2 describe the main layers of a CNN and present a summary of the previous related work on accelerating CNNs on FPGAs.

2.1 Overview of CNN Layers

CNN models typically consist of different layer types cascaded together such that the output of a specific layer is consumed by the subsequent one in a feed-forward scheme during inference. In Figure 1, we show an example CNN, and we illustrate the functionality of each of the layer types subsequently explained in this section.

2.1.1 Convolutional (CONV) Layers. A CONV layer takes a set of N_IM two-dimensional input feature maps. It accumulates the results of 2D convolutions with stride S between each input feature map and its corresponding K×K kernel of learnable weights to produce a two-dimensional output feature map. This is performed using N_OM different sets of kernels to generate N_OM output feature maps that are consumed by the subsequent layer. CONV layers are very compute-intensive and represent the majority of computation in a CNN, which motivated many designers to focus on accelerating only the CONV and not all CNN layers [55]. We also notice that as CNN models get deeper, the portion of CONV layer operations compared to the total number of operations increases, as they constitute 91.6%, 99.1%, and 99.8% of the total operations count for AlexNet, VGG-16, and ResNet-50, respectively.

The computation of CONV layers can be summarized using the six nested loops in Algorithm 1; they are highly parallelizable and can achieve high gains through hardware acceleration. However, it is a non-trivial optimization problem to choose the tiling and unrolling factors of those loops to achieve the best performance within the limited available hardware resources [27].
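As a concrete reading of Equation (1), the sketch below computes all of layer l's outputs as biased weighted sums, one multiply-accumulate per weight, followed by a non-linearity; the layer sizes and the choice of ReLU for θ are illustrative assumptions.

import numpy as np

def dense_layer(x_prev, W, b):
    """x_j = theta(w_0j + sum_i x_i * w_ij), with theta = ReLU, for all neurons j at once."""
    pre_activation = b + x_prev @ W      # one MAC per (i, j) weight
    return np.maximum(pre_activation, 0.0)

x_prev = np.random.randn(128)            # d^(l-1) = 128 outputs of the previous layer
W = 0.1 * np.random.randn(128, 64)       # w_ij connecting layer l-1 to the 64 neurons of layer l
b = np.zeros(64)                         # bias terms w_0j
x = dense_layer(x_prev, W, b)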
Typically, a

ALGORITHM 1: Nested loops for CONV layers computation
Loop 1: for (j = 0; j
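Algorithm 1 is truncated in this copy; as a rough reconstruction of the six-loop computation it refers to (loops over output feature maps, input feature maps, the two output spatial dimensions, and the K × K kernel window), here is a plain Python sketch. The loop ordering, the variable names, and the omission of zero-padding are our assumptions.

import numpy as np

def conv_layer(ifmaps, weights, stride=1):
    """Direct six-loop convolution.

    ifmaps:  (N_IM, H, W)          input feature maps
    weights: (N_OM, N_IM, K, K)    one K x K kernel per (output map, input map) pair
    returns: (N_OM, H_out, W_out)  output feature maps
    """
    n_im, h, w = ifmaps.shape
    n_om, _, k, _ = weights.shape
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    ofmaps = np.zeros((n_om, h_out, w_out))
    for j in range(n_om):                     # Loop 1: output feature maps
        for i in range(n_im):                 # Loop 2: input feature maps
            for y in range(h_out):            # Loop 3: output rows
                for x in range(w_out):        # Loop 4: output columns
                    for ky in range(k):       # Loop 5: kernel rows
                        for kx in range(k):   # Loop 6: kernel columns
                            ofmaps[j, y, x] += (ifmaps[i, y * stride + ky, x * stride + kx]
                                                * weights[j, i, ky, kx])
    return ofmaps

# Example: 3 input maps of size 8x8, 4 output maps, 3x3 kernels, stride 1 -> output shape (4, 6, 6).
out = conv_layer(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))

Tiling and unrolling these loops, as discussed above, amounts to choosing how to partition these iteration spaces across the parallel compute units and on-chip buffers.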