
   Neural Ordinary Differential Equations

Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud
      University of Toronto, Vector Institute
   {rtqichen, rubanova, jessebett, duvenaud}@cs.toronto.edu

                                   Abstract
We introduce a new family of deep neural network models. Instead of specifying a
discrete sequence of hidden layers, we parameterize the derivative of the hidden
state using a neural network. The output of the network is computed using a black-
box differential equation solver. These continuous-depth models have constant
memory cost, adapt their evaluation strategy to each input, and can explicitly trade
numerical precision for speed. We demonstrate these properties in continuous-depth
residual networks and continuous-time latent variable models. We also construct
continuous normalizing flows, a generative model that can train by maximum
likelihood, without partitioning or ordering the data dimensions. For training, we
show how to scalably backpropagate through any ODE solver, without access to its
internal operations. This allows end-to-end training of ODEs within larger models.

                               1   Introduction
                                                                                                         
Models such as residual networks, recurrent neural network decoders, and normalizing flows build 
complicated transformations by composing a sequence of transformations to a hidden state:    

                          <<FORMULA>>           (1)          
                                                                        
where t ∈ {0 . . . T } and h_t ∈ R^D. These iterative updates can be seen as an Euler discretization of a
continuous transformation (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018).                    
What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous
dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:

                          <<FORMULA>>           (2)

Starting from the input layer h(0), we can define the output layer h(T ) to be the solution to this
ODE initial value problem at some time T . This value can be computed by a black-box differential
equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the
solution with the desired accuracy. Figure 1 contrasts these two approaches.
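As a minimal sketch of this limit, the snippet below applies residual-style updates h ← h + Δt · f(h) with a growing number of layers and compares the result with the exact ODE solution. The dynamics f(h) = −h and the helper name `residual_stack` are illustrative assumptions, chosen only because the closed-form solution h(T) = h(0) e^{−T} is available; they are not the models used in the experiments.

```python
import math

def f(h):
    # Hypothetical hidden-state dynamics; stands in for a neural network.
    return -h

def residual_stack(h0, num_layers, T=1.0):
    """Apply num_layers residual updates, i.e. Euler steps of size T / num_layers."""
    dt = T / num_layers
    h = h0
    for _ in range(num_layers):
        h = h + dt * f(h)
    return h

exact = 1.0 * math.exp(-1.0)  # closed-form h(T) for dh/dt = -h, h(0) = 1, T = 1
errors = {n: abs(residual_stack(1.0, n) - exact) for n in (4, 16, 64, 256)}
for n, err in errors.items():
    print(f"{n:4d} layers: |h(T) - exact| = {err:.2e}")
```

Under these assumptions the gap to the exact solution shrinks as layers are added, which is the sense in which the discrete updates approach the continuous transformation.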
Defining and evaluating models using ODE solvers has several benefits:
Memory efficiency In Section 2, we show how to compute gradients of a scalar-valued loss with
respect to all inputs of any ODE solver, without backpropagating through the operations of the solver.
Not storing any intermediate quantities of the forward pass allows us to train our models with constant
memory cost as a function of depth, a major bottleneck of training deep models.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Adaptive computation Euler’s method is perhaps the simplest method for solving ODEs. There
have since been more than 120 years of development of efficient and accurate ODE solvers (Runge,
1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth
of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to
achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with
problem complexity. After training, accuracy can be reduced for real-time or low-power applications.
Scalable and invertible normalizing flows An unexpected side-benefit of continuous transforma-
tions is that the change of variables formula becomes easier to compute. In Section 4, we derive
this result and use it to construct a new class of invertible density models that avoids the single-unit
bottleneck of normalizing flows, and can be trained directly by maximum likelihood.
Continuous time-series models Unlike recurrent neural networks, which require discretizing
observation and emission intervals, continuously-defined dynamics can naturally incorporate data
which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.
2    Reverse-mode automatic differentiation of ODE solutions
The main technical difficulty in training continuous-depth networks is performing reverse-mode
differentiation (also known as backpropagation) through the ODE solver. Differentiating through
the operations of the forward pass is straightforward, but incurs a high memory cost and introduces
additional numerical error.
We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity
method (Pontryagin et al., 1962). This approach computes gradients by solving a second, aug-
mented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly
with problem size, has low memory cost, and explicitly controls numerical error.
Consider optimizing a scalar-valued loss function L(), whose input is the result of an ODE solver:
                                      
                                    <<FORMULA>>       (3)               

To optimize L, we require gradients with respect to θ. The first step is to determine how the gradient
of the loss depends on the hidden state z(t) at each instant. This quantity is called the adjoint a(t) = ∂L/∂z(t).
Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:

                                    <<FORMULA>>       (4)

We can compute ∂L/∂z(t0 ) by another call to an ODE solver. This solver must run backwards, starting from the initial
value of ∂L/∂z(t1 ). One complication is that solving this ODE requires knowing the value of z(t) along its entire
trajectory. However, we can simply recompute z(t) backwards in time together with the adjoint, starting from its final
value z(t1 ).

Figure 2 (caption, partial): If the loss depends directly on the state at multiple observation times, the
adjoint state must be updated in the direction of the partial derivative of the loss with respect to
each observation.

Computing the gradients with respect to the parameters θ requires evaluating a third integral,
which depends on both z(t) and a(t):

                                  <<FORMULA>>         (5)

The vector-Jacobian products <<FORMULA>> and <<FORMULA>> in (4) and (5) can be efficiently evaluated by
automatic differentiation, at a time cost similar to that of evaluating f . All integrals for solving z, a,
and <<FORMULA>> can be computed in a single call to an ODE solver, which concatenates the original state, the
adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the
necessary dynamics, and call an ODE solver to compute all gradients at once.

                                 <<ALGORITHM>>
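The following is a minimal sketch of the augmented backward solve, under simplifying assumptions: a scalar toy system dz/dt = θ z, the loss L = z(t1), and a fixed-step RK4 integrator standing in for a black-box solver. The names `rk4` and `adjoint_gradients` are hypothetical. The augmented state concatenates z, the adjoint a, and the accumulated parameter gradient, and is integrated from t1 back to t0.

```python
import math

def rk4(func, y, t0, t1, steps=200):
    """Fixed-step RK4 for dy/dt = func(y, t); integrates backwards if t1 < t0."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = func(y, t)
        k2 = func([a + 0.5 * h * b for a, b in zip(y, k1)], t + 0.5 * h)
        k3 = func([a + 0.5 * h * b for a, b in zip(y, k2)], t + 0.5 * h)
        k4 = func([a + h * b for a, b in zip(y, k3)], t + h)
        y = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def adjoint_gradients(theta, z0, t0, t1):
    """Gradients of L = z(t1) for the toy dynamics dz/dt = theta * z."""
    # Forward pass: only the final state is kept.
    (z1,) = rk4(lambda y, t: [theta * y[0]], [z0], t0, t1)

    def augmented(y, t):
        z, a, _ = y
        dz = theta * z    # f(z, t, theta)
        da = -a * theta   # -a * df/dz
        dg = -a * z       # -a * df/dtheta
        return [dz, da, dg]

    # Backward pass from t1 to t0, starting at [z(t1), dL/dz(t1) = 1, 0].
    _, dL_dz0, dL_dtheta = rk4(augmented, [z1, 1.0, 0.0], t1, t0)
    return z1, dL_dz0, dL_dtheta

theta, z0, T = 0.5, 2.0, 1.0
z1, dL_dz0, dL_dtheta = adjoint_gradients(theta, z0, 0.0, T)
print("dL/dz0    :", dL_dz0, "analytic:", math.exp(theta * T))
print("dL/dtheta :", dL_dtheta, "analytic:", z0 * T * math.exp(theta * T))
```

For this toy system the gradients can be checked against the closed form z(t1) = z0 e^{θT}. In a real model the products a · ∂f/∂z and a · ∂f/∂θ would come from automatic differentiation rather than hand-written derivatives.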

Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends
on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate
solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint
must be adjusted in the direction of the corresponding partial derivative ∂L/∂z(ti ).
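A minimal sketch of this segmented backward pass is given below, again for the assumed toy system dz/dt = θ z with a fixed-step RK4 integrator, and with the illustrative loss L = Σ_i z(t_i). The function names are hypothetical. Before each backward segment, the adjoint is incremented by ∂L/∂z(t_i), which equals 1 for this loss.

```python
import math

def rk4(func, y, t0, t1, steps=200):
    """Fixed-step RK4 for dy/dt = func(y, t); integrates backwards if t1 < t0."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = func(y, t)
        k2 = func([a + 0.5 * h * b for a, b in zip(y, k1)], t + 0.5 * h)
        k3 = func([a + 0.5 * h * b for a, b in zip(y, k2)], t + 0.5 * h)
        k4 = func([a + h * b for a, b in zip(y, k3)], t + h)
        y = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def gradients_multi_obs(theta, z0, t0, obs_times):
    """Gradients of L = sum_i z(t_i) for the toy dynamics dz/dt = theta * z."""
    # Forward pass to the last observation time.
    (z,) = rk4(lambda y, t: [theta * y[0]], [z0], t0, obs_times[-1])

    def augmented(y, t):
        z, a, _ = y
        return [theta * z, -a * theta, -a * z]

    a, g = 0.0, 0.0
    starts = [t0] + list(obs_times[:-1])
    # One backward solve per consecutive pair of output times.
    for t_prev, t_i in zip(reversed(starts), reversed(obs_times)):
        a += 1.0  # adjust the adjoint by dL/dz(t_i), which is 1 for this loss
        z, a, g = rk4(augmented, [z, a, g], t_i, t_prev)
    return a, g

theta, z0 = 0.5, 2.0
obs_times = [0.5, 1.0, 1.5]
dL_dz0, dL_dtheta = gradients_multi_obs(theta, z0, 0.0, obs_times)
print(dL_dz0, sum(math.exp(theta * t) for t in obs_times))
print(dL_dtheta, z0 * sum(t * math.exp(theta * t) for t in obs_times))
```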
The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of
Algorithm 1 including derivatives w.r.t. t0 and t1 can be found in Appendix C. Detailed derivations
are provided in Appendix B. Appendix D provides Python code which computes all derivatives for
scipy.integrate.odeint by extending the autograd automatic differentiation package. This
code also supports all higher-order derivatives. We have since released a PyTorch (Paszke et al.,
2017) implementation, including GPU-based implementations of several standard ODE solvers at
github.com/rtqichen/torchdiffeq.

                   3    Replacing residual networks with ODEs for supervised learning

In this section, we experimentally investigate the training of neural ODEs for supervised learning.
Software To solve ODE initial value problems numerically, we use the implicit Adams method
implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being
an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires
solving a nonlinear optimization problem at every step. This setup makes direct backpropagation
through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd
framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden
state dynamics and their derivatives on the GPU using TensorFlow, which were then called from the
Fortran ODE solvers, which were called from Python autograd code.

Model Architectures We experiment with a small residual network which downsamples the
input twice then applies 6 standard residual blocks (He et al., 2016b), which are replaced by an ODESolve
module in the ODE-Net variant. We also test a network with the same architecture but where gradients are
backpropagated directly through a Runge-Kutta integrator, referred to as RK-Net. Table 1 shows test error,
number of parameters, and memory cost. L denotes the number of layers in the ResNet, and L̃ is the number
of function evaluations that the ODE solver requests in a single forward pass, which can be interpreted
as an implicit number of layers. We find that ODE-Nets and RK-Nets can achieve around the same
performance as the ResNet.
Error Control in ODE-Nets ODE solvers can approximately ensure that the output is within a
given tolerance of the true solution. Changing this tolerance changes the behavior of the network.
We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is
proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a
trade-off between accuracy and computational cost. One could train with high accuracy, but switch to
a lower accuracy at test time.
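The trade-off between tolerance and number of function evaluations can be illustrated with a small adaptive integrator. The sketch below is an assumption-laden stand-in for the solvers used in the experiments: it uses an embedded Heun-Euler pair with a basic step-size controller on the hypothetical scalar dynamics dy/dt = −2y, and the function name `adaptive_heun` is illustrative.

```python
import math

def adaptive_heun(f, y0, t0, t1, tol):
    """Integrate dy/dt = f(y, t) from t0 to t1 (t1 > t0) with an embedded
    Heun-Euler pair. Returns (y(t1), number of function evaluations)."""
    t, y, nfe = t0, y0, 0
    h = (t1 - t0) / 10.0
    while t < t1:
        h = min(h, t1 - t)
        k1 = f(y, t)
        k2 = f(y + h * k1, t + h)
        nfe += 2
        # Local error estimate: difference between the Heun and Euler updates.
        err = 0.5 * h * abs(k2 - k1)
        if err <= tol:
            t += h
            y += 0.5 * h * (k1 + k2)  # accept the second-order (Heun) update
        # Rescale the step, with the change limited to the range [0.2, 2].
        h *= min(2.0, max(0.2, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return y, nfe

def dynamics(y, t):
    # Hypothetical dynamics with known solution y(t) = exp(-2 t).
    return -2.0 * y

y_loose, nfe_loose = adaptive_heun(dynamics, 1.0, 0.0, 1.0, tol=1e-2)
y_tight, nfe_tight = adaptive_heun(dynamics, 1.0, 0.0, 1.0, tol=1e-6)
exact = math.exp(-2.0)
print("tol=1e-2: NFE =", nfe_loose, "error =", abs(y_loose - exact))
print("tol=1e-6: NFE =", nfe_tight, "error =", abs(y_tight - exact))
```

In this toy setting a tighter tolerance costs more function evaluations and yields a smaller error, which mirrors the trade-off described above.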
         Figure 3: Statistics of a trained ODE-Net. (NFE = number of function evaluations.)
Figure 3c shows a surprising result: the number of evaluations in the backward pass is roughly
half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory
efficient, but also more computationally efficient than directly backpropagating through the integrator,
because the latter approach will need to backprop through each function evaluation in the forward
pass.
Network Depth It’s not clear how to define the ‘depth’ of an ODE solution. A related quantity is
the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver
and dependent on the initial state or input. Figure 3d shows that the number of function evaluations
increases throughout training, presumably adapting to increasing complexity of the model.

                     4    Continuous Normalizing Flows

The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and
the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to
compute exact changes in probability if samples are transformed through a bijective function f :

                      <<FORMULA>>                             (6)

An example is the planar normalizing flow (Rezende and Mohamed, 2015):

                     <<FORMULA>>                             (7)
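As a small illustration, the sketch below implements one planar flow layer in the form given by Rezende and Mohamed (2015), z' = z + u h(wᵀz + b), assuming h = tanh. The log-determinant uses the closed form log|1 + h'(wᵀz + b) uᵀw| and is checked against a finite-difference Jacobian in two dimensions. The parameter values and helper names are arbitrary choices for illustration.

```python
import math

def planar_flow(z, u, w, b):
    """One planar flow layer z' = z + u * tanh(w.z + b) and its log|det J|."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    dh = 1.0 - h ** 2
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # Matrix determinant lemma: det(I + dh * u w^T) = 1 + dh * (u . w)
    log_det = math.log(abs(1.0 + dh * sum(ui * wi for ui, wi in zip(u, w))))
    return z_new, log_det

def numeric_log_det_2d(z, u, w, b, eps=1e-6):
    """Finite-difference check of log|det dz'/dz| for a 2-D input."""
    cols = []
    for j in range(2):
        zp = list(z); zp[j] += eps
        zm = list(z); zm[j] -= eps
        fp, _ = planar_flow(zp, u, w, b)
        fm, _ = planar_flow(zm, u, w, b)
        cols.append([(p - m) / (2 * eps) for p, m in zip(fp, fm)])
    # cols[j][i] holds dz'_i / dz_j
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z, u, w, b = [0.3, -0.7], [0.5, 0.2], [1.0, -0.4], 0.1
z_new, log_det = planar_flow(z, u, w, b)
log_det_fd = numeric_log_det_2d(z, u, w, b)
print(log_det, log_det_fd)
# The change of variables then gives log p(z') = log p(z) - log_det.
```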

Generally, the main bottleneck to using the change of variables formula is computing the determinant
of the Jacobian ∂f/∂z, which has a cubic cost in either the dimension of z, or the number
of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow
layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).
Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the
computation of the change in normalizing constant:
Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable
with probability p(z(t)) dependent on time. Let dz/dt = f (z(t), t) be a differential equation describing
a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z
and continuous in t, then the change in log probability also follows a differential equation,

                                 <<FORMULA>>                 (8)

Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation.
Also unlike standard finite flows, the differential equation f does not need to be bijective, since if
uniqueness is satisfied, then the entire transformation is automatically bijective.
As an example application of the instantaneous change of variables, we can examine the continuous
analog of the planar flow, and its change in normalization constant:

                               <<FORMULA>>                            (9)

Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving
this combined ODE.
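The following sketch illustrates Theorem 1 on an assumed scalar example rather than the planar flow: the dynamics f(z) = sin(z) are hypothetical, and in one dimension the trace of the Jacobian is simply df/dz = cos(z). The state z and the log-density are integrated together with fixed-step RK4, and the result is compared with the ordinary change of variables formula, where the derivative of the flow map is estimated by finite differences.

```python
import math

def f(z):
    # Hypothetical scalar dynamics; stands in for a learned function.
    return math.sin(z)

def df_dz(z):
    # In one dimension, tr(df/dz) is just this derivative.
    return math.cos(z)

def flow(z0, logp0, T=1.0, steps=400):
    """RK4 on the combined state: dz/dt = f(z), dlogp/dt = -tr(df/dz)."""
    def rhs(z):
        return f(z), -df_dz(z)
    z, logp, h = z0, logp0, T / steps
    for _ in range(steps):
        k1 = rhs(z)
        k2 = rhs(z + 0.5 * h * k1[0])
        k3 = rhs(z + 0.5 * h * k2[0])
        k4 = rhs(z + h * k3[0])
        logp += h / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        z += h / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    return z, logp

z0 = 0.8
logp0 = -0.5 * z0 ** 2 - 0.5 * math.log(2 * math.pi)  # standard normal at z0
zT, logpT = flow(z0, logp0)

# Reference: log p(z(T)) = log p(z(0)) - log|dz(T)/dz(0)|, via finite differences.
eps = 1e-5
dzT_dz0 = (flow(z0 + eps, 0.0)[0] - flow(z0 - eps, 0.0)[0]) / (2 * eps)
logpT_ref = logp0 - math.log(abs(dzT_dz0))
print(logpT, logpT_ref)
```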
Using multiple hidden units with linear cost While det is not a linear function, the trace function
is, which implies tr(Σ_n J_n) = Σ_n tr(J_n). Thus if our dynamics is given by a sum of functions then
the differential equation for the log density is also a sum:


                              <<FORMULA>>                              (10)
 
This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in
the number of hidden units M . Evaluating such ‘wide’ flow layers using standard normalizing flows
costs O(M^3), meaning that standard NF architectures use many layers of only a single hidden unit.
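A minimal sketch of the linear-cost trace is shown below, assuming dynamics of the planar form f(z) = Σ_n u_n tanh(w_nᵀz + b_n) with M hidden units. Under that assumption each unit contributes (u_n · w_n) tanh'(w_nᵀz + b_n) to the trace, so the cost is O(M D) with no determinant. The random parameters and helper names are illustrative, and the result is checked against a finite-difference trace.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dynamics(z, units):
    """f(z) = sum_n u_n * tanh(w_n . z + b_n)."""
    out = [0.0] * len(z)
    for u, w, b in units:
        h = math.tanh(dot(w, z) + b)
        out = [o + ui * h for o, ui in zip(out, u)]
    return out

def trace_of_jacobian(z, units):
    """tr(df/dz) as a sum of per-unit traces: one dot product per hidden unit."""
    return sum(dot(u, w) * (1.0 - math.tanh(dot(w, z) + b) ** 2)
               for u, w, b in units)

def numeric_trace(z, units, eps=1e-6):
    """Finite-difference trace, used only as a check."""
    total = 0.0
    for i in range(len(z)):
        zp = list(z); zp[i] += eps
        zm = list(z); zm[i] -= eps
        total += (dynamics(zp, units)[i] - dynamics(zm, units)[i]) / (2 * eps)
    return total

rng = random.Random(0)
D, M = 3, 64  # state dimension and number of hidden units (arbitrary)
units = [([rng.uniform(-1, 1) for _ in range(D)],
          [rng.uniform(-1, 1) for _ in range(D)],
          rng.uniform(-1, 1)) for _ in range(M)]
z = [0.2, -0.5, 0.9]
tr_fast = trace_of_jacobian(z, units)
tr_check = numeric_trace(z, units)
print(tr_fast, tr_check)
```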
Time-dependent dynamics We can specify the parameters of a flow as a function of t, making the
differential equation f (z(t), t) change with t. This parameterization is a kind of hypernetwork
(Ha et al., 2016). We also introduce a gating mechanism for each hidden unit, 

                              <<FORMULA>>

where σn (t) ∈ (0, 1) is a neural network that learns when the dynamic fn (z) should be applied. We
call these models continuous normalizing flows (CNF).

4.1    Experiments with Continuous Normalizing Flows

We first compare continuous and discrete planar flows at learning to sample from a known distribution.
We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with
K = M layers, and sometimes much more expressive.
Density matching We configure the CNF as described above, and train for 10,000 iterations
using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using
RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we
minimize KL(q(x) ‖ p(x)) as the loss function where q is the flow model and the target density p(·)
can be evaluated. Figure 4 shows that CNF generally achieves lower loss.
Maximum Likelihood Training A useful property of continuous-time normalizing flows is that
we can compute the reverse transformation for about the same cost as the forward pass, which cannot
be said for normalizing flows. This lets us train the flow on a density estimation task by performing
maximum likelihood estimation, which maximizes E_{p(x)}[log q(x)] where q(·) is computed using
the appropriate change of variables theorem, then afterwards reverse the CNF to generate random
samples from q(x).
For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 5
shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the
transformed distribution after a small amount of time which shows the locations of the initial planar
flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that
the particles can be evenly spread into circles. While the CNF transformations are smooth and
interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting
the two moons dataset in Figure 5b.

               5  A generative latent function time-series model

Applying neural networks to irregularly-sampled data such as medical records, network traffic, or
neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the
latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-
defined latent variables. Missing data can be addressed using generative time-series models (Álvarez
and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data
imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of
an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).
We present a continuous-time, generative approach to modeling time series. Our model represents
each time series by a latent trajectory. Each trajectory is determined from a local initial state, z_{t_0}, and
a global set of latent dynamics shared across all time series. Given observation times t_0, t_1, . . . , t_N
and an initial state z_{t_0}, an ODE solver produces z_{t_1}, . . . , z_{t_N}, which describe the latent state at each
observation. We define this generative model formally through a sampling procedure:
                             <<FORMULA>>                                    (11)
                             <<FORMULA>>                                    (12)
                             <<FORMULA>>                                    (13)
Function f is a time-invariant function that takes the value z at the current time step and outputs the
gradient: ∂z(t)/∂t = f (z(t), θf ). We parametrize this function using a neural net. Because f is time-
invariant, given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating
this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
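The sampling procedure can be sketched as follows, under loudly stated assumptions: the latent space is 2-dimensional, the dynamics f are a hand-written damped rotation instead of a neural network, the prior is a standard normal, and the decoder adds Gaussian noise to the latent state. The names `ode_solve` and `sample_series` are hypothetical. The observation times are deliberately irregular.

```python
import random

def f(z):
    # Hypothetical time-invariant dynamics (a damped rotation); stands in for a neural net.
    x, y = z
    return [-0.1 * x - y, x - 0.1 * y]

def ode_solve(z0, times, steps_per_unit=200):
    """Latent states at each entry of `times` (times[0] is t_0), via fixed-step RK4."""
    z, out = list(z0), [list(z0)]
    for ta, tb in zip(times[:-1], times[1:]):
        n = max(1, round(abs(tb - ta) * steps_per_unit))
        h = (tb - ta) / n
        for _ in range(n):
            k1 = f(z)
            k2 = f([a + 0.5 * h * b for a, b in zip(z, k1)])
            k3 = f([a + 0.5 * h * b for a, b in zip(z, k2)])
            k4 = f([a + h * b for a, b in zip(z, k3)])
            z = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
                 for a, p, q, r, s in zip(z, k1, k2, k3, k4)]
        out.append(list(z))
    return out

def sample_series(times, rng, noise_std=0.05):
    z_t0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]   # draw the initial latent state
    latents = ode_solve(z_t0, times)                     # latent state at each time
    observations = [[zi + rng.gauss(0.0, noise_std) for zi in z]
                    for z in latents]                    # noisy decoding of each state
    return latents, observations

times = [0.0, 0.13, 0.9, 1.0, 2.7]  # irregularly spaced observation times
latents, observations = sample_series(times, random.Random(0))
for t, x in zip(times, observations):
    print(f"t = {t:4.2f}  x = ({x[0]:+.3f}, {x[1]:+.3f})")
```

Because f does not depend on t, solving directly from the first to the last time gives the same latent state as stepping through the intermediate observation times, up to solver error.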
Training and Prediction We can train this latent-variable model as a variational autoen-
coder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our
recognition net is an RNN, which consumes the data sequentially backwards in time, and out-
puts qφ (z0 |x1 , x2 , . . . , xN ). A detailed algorithm can be found in Appendix E. Using ODEs as a
generative model allows us to make predictions for arbitrary time points t1 ...tM on a continuous
timeline.
Poisson Process likelihoods The fact that an observation occurred often tells us something about the
latent state. For example, a patient may be more likely to take a medical test if they are sick. The rate
of events can be parameterized by a function of the latent state: p(event at time t | z(t)) = λ(z(t)).
Given this rate function, the likelihood of a set of independent observation times in the interval
[tstart , tend ] is given by an inhomogeneous Poisson process (Palm, 1943):

      log p(t1 . . . tN | tstart , tend ) = Σ_{i=1}^{N} log λ(z(ti )) − ∫_{tstart}^{tend} λ(z(t)) dt

We can parameterize λ(·) using another neural network. Con-
veniently, we can evaluate both the latent trajectory and the
Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate
learned by such a model on a toy dataset.
A Poisson process likelihood on observation
times can be combined with a data likelihood to
jointly model all observations and the times at
which they were made.
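A minimal sketch of evaluating this likelihood alongside the latent trajectory is given below. It uses the standard inhomogeneous Poisson process log-likelihood, the sum of log λ(z(t_i)) over events minus the integral of λ(z(t)) over the interval, and appends the cumulative rate to the ODE state so that a single solve yields both. The latent dynamics dz/dt = −z and the rate λ(z) = z are hypothetical choices with a closed-form answer, and a fixed-step RK4 integrator stands in for the ODE solver.

```python
import math

def rk4(func, y, t0, t1, steps=200):
    """Fixed-step RK4 for dy/dt = func(y, t)."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = func(y, t)
        k2 = func([a + 0.5 * h * b for a, b in zip(y, k1)], t + 0.5 * h)
        k3 = func([a + 0.5 * h * b for a, b in zip(y, k2)], t + 0.5 * h)
        k4 = func([a + h * b for a, b in zip(y, k3)], t + h)
        y = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def rate(z):
    # Hypothetical positive rate function; stands in for a neural network.
    return z

def augmented(y, t):
    z, _ = y
    return [-z, rate(z)]  # [latent dynamics, d(cumulative rate)/dt]

def poisson_log_likelihood(z_start, event_times, t_start, t_end):
    y, t, log_rates = [z_start, 0.0], t_start, 0.0
    for ti in sorted(event_times):
        y = rk4(augmented, y, t, ti)       # advance to the next event
        log_rates += math.log(rate(y[0]))  # log-rate at the event time
        t = ti
    y = rk4(augmented, y, t, t_end)        # finish the interval
    return log_rates - y[1]                # sum of log-rates minus integrated rate

events = [0.2, 0.5, 1.1]
log_lik = poisson_log_likelihood(2.0, events, 0.0, 1.5)
# Closed form for z(t) = 2 exp(-t): sum(log 2 - t_i) - 2 (1 - exp(-1.5))
expected = sum(math.log(2.0) - ti for ti in events) - 2.0 * (1.0 - math.exp(-1.5))
print(log_lik, expected)
```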

5.1   Time-series Latent ODE Experiments 

We investigate the ability of the latent ODE
model to fit and extrapolate time series. The
recognition network is an RNN with 25 hidden
units. We use a 4-dimensional latent space. We
parameterize the dynamics function f with a
one-hidden-layer network with 20 hidden units.
The decoder computing p(xti |zti ) is another              
neural network with one hidden layer with 20                       
hidden units. Our baseline was a recurrent neu-                   
ral net with 25 hidden units trained to minimize                  
negative Gaussian log-likelihood. We trained a                     
second version of this RNN whose inputs were
concatenated with the time difference to the next
observation to aid RNN with irregular observations.
Bi-directional spiral dataset We generated a dataset of 1000 2-dimensional spirals, each
starting at a different point, sampled at 100 equally-spaced timesteps. The dataset contains
two types of spirals: half are clockwise while the other half counter-clockwise. To make the
task more realistic, we add Gaussian noise to the observations.

Figure 8 (caption, partial): [...] neural network. (b): Reconstructions and extrapolations by a latent
neural ODE. Blue curve shows model prediction. Red shows extrapolation. (c) A projection of inferred
4-dimensional latent ODE trajectories onto their first two dimensions. Color indicates the direction of
the corresponding trajectory. The model has learned latent dynamics which [...]

Figure 9 (caption, partial): [...] progression through time, starting at purple and ending at red. Note
that the trajectories on the left are counter-clockwise, while the trajectories on the right are clockwise.
Time series with irregular time points To generate irregular timestamps, we randomly sample
points from each trajectory without replacement (n = {30, 50, 100}). We report predictive root-
mean-squared error (RMSE) on 100 time points extending beyond those that were used for training.
Table 2 shows that the latent ODE has substantially lower predictive RMSE.

We observed that reconstructions and extrapolations are consistent with the ground truth
regardless of number of observed points and despite the noise.
Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimen-
sions of the latent space. The trajectories form two separate clusters of trajectories, one decoding to
clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change
smoothly as a function of the initial point z(t0 ), switching from a clockwise to a counter-clockwise
spiral.

                        6    Scope and Limitations

Minibatching The use of mini-batches is less straightforward than for standard neural networks.
One can still batch together evaluations through the ODE solver by concatenating the states of each
batch element together, creating a combined ODE with dimension D × K. In some cases, controlling
error on all batch elements together might require evaluating the combined system K times more
often than if each system was solved individually. However, in practice the number of evaluations did
not increase substantially when using minibatches.
Uniqueness When do continuous dynamics have a unique solution? Picard’s existence theo-
rem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is
unique if the differential equation is uniformly Lipschitz continuous in z and continuous in t. This
theorem holds for our model if the neural network has finite weights and uses Lipschitz nonlinearities,
such as tanh or relu.
Setting tolerances Our framework allows the user to trade off speed for precision, but requires
the user to choose an error tolerance on both the forward and reverse passes during training. For
sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation
experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading
performance.
Reconstructing forward trajectories Reconstructing the state trajectory by running the dynamics
backwards can introduce extra numerical error if the reconstructed trajectory diverges from the
original. This problem can be addressed by checkpointing: storing intermediate values of z on the
forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We
did not find this to be a practical problem, and we informally checked that reversing many layers of
continuous normalizing flows with default tolerances recovered the initial states.
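The kind of informal check described here can be sketched as follows, under assumed conditions: hypothetical scalar dynamics, a fixed-step RK4 integrator in place of an adaptive solver, and two step counts chosen only to contrast a fine and a coarse solve. The state is integrated forwards, then backwards from the final value, and the reconstruction error at the initial time is reported.

```python
import math

def rk4(f, z, t0, t1, steps):
    """Fixed-step RK4 for scalar dz/dt = f(z, t); runs backwards if t1 < t0."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z += h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z

def f(z, t):
    # Hypothetical dynamics; stands in for a learned network.
    return math.sin(z) + 0.5 * math.cos(t)

def reconstruction_error(z0, T, steps):
    z1 = rk4(f, z0, 0.0, T, steps)       # forward pass, keep only z(T)
    z0_rec = rk4(f, z1, T, 0.0, steps)   # run the dynamics backwards from z(T)
    return abs(z0_rec - z0)

err_fine = reconstruction_error(0.3, 2.0, steps=200)
err_coarse = reconstruction_error(0.3, 2.0, steps=5)
print("fine steps  :", err_fine)
print("coarse steps:", err_coarse)
```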
                        7    Related Work

The use of the adjoint method for training continuous-time neural networks was previously pro-
posed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The
interpretation of residual networks (He et al., 2016a) as approximate ODE solvers spurred research
into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al.,
2017). We demonstrate these same properties in more generality by directly using an ODE solver.
Adaptive computation One can adapt computation time by training secondary neural networks
to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al.,
2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training
and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied,
computationally cheap, and generalizable rules for adapting the amount of computation.
Constant memory backprop through reversibility Recent work developed reversible versions
of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives
the same constant memory advantage as our approach. However, these methods require restricted
architectures, which partition the hidden units. Our approach does not have these restrictions.
Learning differential equations Much recent work has proposed learning differential equations
from data. One can train feed-forward or recurrent neural networks to approximate a differential
equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applica-
tions such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting
Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit
differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and
interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic
variational inference to recover the solution of a given stochastic differential equation.
Differentiating through ODE solvers The dolfin library (Farrell et al., 2013) implements adjoint
computation for general ODE and PDE solutions, but only by backpropagating through the individual
operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient
estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity
analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is
linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint
method to train bespoke latent dynamic models.
In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained
end-to-end with any other differentiable model components. While use of vector-Jacobian products
for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson
et al., In Press, 2018), we highlight the potential of a general integration of black-box ODE solvers
into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling.
8    Conclusion
We investigated the use of black-box ODE solvers as a model component, developing new models
for time-series modeling, supervised learning, and density estimation. These models are evaluated
adaptively, and allow explicit control of the tradeoff between computation speed and accuracy.
Finally, we derived an instantaneous version of the change of variables formula, and developed
continuous-time normalizing flows, which can scale to large layer sizes.
9    Acknowledgements
We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya,
Hossein Soleimani, Eldad Haber, Ken Caluwaerts, Daniel Flam-Shepherd, and Harry Braviner for
feedback. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful
discussions. We also thank Yuval Frommer for pointing out an unsupported claim about parameter
efficiency.
References
Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output
   Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.
Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks.
   In International Conference on Machine Learning, pages 136–145, 2017.
Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, 2013.
Joel A E Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi – A
   software framework for nonlinear optimization and optimal control. Mathematical Programming
   Computation, In Press, 2018.
Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind.
   Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18
   (153):1–153, 2018.
Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
   normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betan-
   court. The Stan math library: Reverse-mode automatic differentiation in c++. arXiv preprint
   arXiv:1509.07164, 2015.
Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible
   architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.
Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks
   from dynamical systems view. In International Conference on Learning Representations, 2018.
   URL https://openreview.net/forum?id=SyJS-OgR-.
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural
   networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.
   URL https://doi.org/10.1038/s41598-018-24271-9.
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun.
   Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st
   Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning
   Research, pages 301–318. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/
   v56/Choi16.html.
Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw-
   Hill Education, 1955.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
   estimation. arXiv preprint arXiv:1410.8516, 2014.
Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song.
   Recurrent marked temporal point processes: Embedding event history to vector. In International
   Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.
Patrick Farrell, David Ham, Simon Funke, and Marie Rognes. Automated derivation of the adjoint of
   high-level transient finite element programs. SIAM Journal on Scientific Computing, 2013.
Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and
   Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint,
   2017.
J. Futoma, S. Hariharan, and K. Heller. Learning to Detect Sepsis with a Multitask Gaussian Process
   RNN Classifier. ArXiv e-prints, 2017.
Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network:
   Backpropagation without storing activations. In Advances in Neural Information Processing
   Systems, pages 2211–2221, 2017.
Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
   arXiv:1603.08983, 2016.
David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34
   (1):014004, 2017.
E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems.
   Springer, 1987.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
   recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
   pages 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
   networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
   6a overview of mini-batch gradient descent, 2012.
Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in
   recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
   arXiv:1412.6980, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference
   on Learning Representations, 2014.
Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.
   Improved variational inference with inverse autoregressive flow. In Advances in Neural Information
   Processing Systems, pages 4743–4751, 2016.
W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
   Mathematik und Physik, 46:435–453, 1901.
Yann LeCun, D Touretzky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation.
   In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21–28. CMU,
   Pittsburgh, Pa: Morgan Kaufmann, 1988.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
   document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Yang Li. Time-dependent representation for neural event sequence prediction. arXiv preprint
   arXiv:1708.00065, 2017.
Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with
   RNNs: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning
   for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253–
   270. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/v56/Lipton16.html.
Z. Long, Y. Lu, X. Ma, and B. Dong. PDE-Net: Learning PDEs from Data. ArXiv e-prints, 2017.
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks:
   Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121,
   2017.
Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Reverse-mode differentiation of
   native Python. In ICML workshop on Automatic Machine Learning, 2015.
Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating
   multivariate point process. In Advances in Neural Information Processing Systems, pages 6757–
   6767, 2017.
Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for
   ODE based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.
Conny Palm. Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics, 1943.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
   Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
   pytorch. 2017.
Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE
  Transactions on Neural networks, 6(5):1212–1228, 1995.
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathemat-
   ical theory of optimal processes. 1962.
M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial
   differential equations. Journal of Computational Physics, pages 125–141, 2018.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data-
   driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236, 2018a.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for
   time-dependent and nonlinear partial differential equations. SIAM Journal on Scientific Computing,
  40(1):A172–A198, 2018b.
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate
   inference in deep generative models. In Proceedings of the 31st International Conference on
  Machine Learning, pages 1278–1286, 2014.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv
   preprint arXiv:1505.05770, 2015.
C. Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
  167–178, 1895.
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations.
   arXiv preprint arXiv:1804.04272, 2018.
T. Ryder, A. Golightly, A. S. McGough, and D. Prangle. Black-box Variational Inference for
   Stochastic Differential Equations. ArXiv e-prints, 2018.
Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta
   means. In Advances in Neural Information Processing Systems 25, 2014.
Peter Schulam and Suchi Saria. What-if reasoning with counterfactual Gaussian processes. arXiv
   preprint arXiv:1703.10651, 2017.
Hossein Soleimani, James Hensman, and Suchi Saria. Scalable joint models for reliable uncertainty-
   aware event prediction. IEEE transactions on pattern analysis and machine intelligence, 2017a.
Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatment-response models for coun-
   terfactual reasoning with continuous-time, continuous-valued interventions. arXiv preprint
   arXiv:1704.02038, 2017b.
Jos Stam. Stable fluids. In Proceedings of the 26th annual conference on Computer graphics and
   interactive techniques, pages 121–128. ACM Press/Addison-Wesley Publishing Co., 1999.
Paul Stapor, Fabian Froehlich, and Jan Hasenauer. Optimization and uncertainty analysis of ODE
   models using second order adjoint sensitivity analysis. bioRxiv, page 272005, 2018.
Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow.
   arXiv preprint arXiv:1611.09630, 2016.
Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latent-space physics: Towards learning the
   temporal evolution of fluid flow. arXiv preprint arXiv:1802.10123, 2018.
Hong Zhang and Adrian Sandu. Fatode: a library for forward, adjoint, and tangent linear integration
   of ODEs. SIAM Journal on Scientific Computing, 36(5):C504–C523, 2014.
                                           

         Appendix A   Proof of the Instantaneous Change of Variables Theorem

Theorem (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability
p(z(t)) dependent on time. Let dz/dt = f (z(t), t) be a differential equation describing a continuous-in-time
transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the
change in log probability also follows a differential equation:

                                 <<FORMULA>>

Proof. To prove this theorem, we take the infinitesimal limit of finite changes of log p(z(t)) through time. First
we denote the transformation of z over an ε change in time as
                                 <<FORMULA>>                                           (14)
We assume that f is Lipschitz continuous in z(t) and continuous in t, so every initial value problem has a unique
solution by Picard’s existence theorem. We also assume z(t) is bounded. These conditions imply that f, Tε, and
∂Tε/∂z are all bounded. In the following, we use these conditions to exchange limits and products.

                                 <<FORMULA>>

We can write the differential equation <<FORMULA>> using the discrete change of variables formula, and the
definition of the derivative:

                                 <<FORMULA>>                                                                    (15)

                                 <<FORMULA>>                                                                    (16)

                                 <<FORMULA>>                                           (by L’Hôpital’s rule)    (17)

                                 <<FORMULA>>                                                                    (18)

                                 <<FORMULA>>                                                                    (19)
    
                                 <<FORMULA>>                                                                    (20)

The derivative of the determinant can be expressed using Jacobi’s formula, which gives

                                 <<FORMULA>>                                                                    (21)

                                 <<FORMULA>>                                                                    (22)
    
                                 <<FORMULA>>                                                                    (23)


Substituting Tε with its Taylor series expansion and taking the limit, we complete the proof.

                                 <<FORMULA>>                                                                    (24)

                                 <<FORMULA>>                                                                    (25)

                                 <<FORMULA>>                                                                    (26)
    
                                 <<FORMULA>>                                                                    (27)


                                          A.1     Special Cases

Planar CNF. Let f(z) = u h(w^T z + b); then ∂f/∂z = u (∂h/∂z)^T. Since the trace of an outer product is the inner
product, we have

                                <<FORMULA>>                                                                     (28)

This is the parameterization we use in all of our experiments.
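As a sanity check of the outer-product identity, the sketch below compares the closed-form trace h'(w^T z + b) u^T w against a finite-difference estimate of the Jacobian trace. It assumes h = tanh and uses made-up values for u, w, b, and z; the function names are illustrative and do not come from the paper's code.

```python
import math

def planar_f(z, u, w, b):
    """Planar dynamics f(z) = u * tanh(w.z + b) for list-valued vectors."""
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    return [ui * math.tanh(pre) for ui in u]

def trace_closed_form(z, u, w, b):
    """tr(df/dz) = h'(w.z + b) * (u.w), using h = tanh so h' = 1 - tanh^2."""
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    dh = 1.0 - math.tanh(pre) ** 2
    return dh * sum(ui * wi for ui, wi in zip(u, w))

def trace_finite_difference(z, u, w, b, eps=1e-6):
    """Sum of central-difference estimates of the diagonal entries df_i/dz_i."""
    total = 0.0
    for i in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[i] += eps
        zm[i] -= eps
        total += (planar_f(zp, u, w, b)[i] - planar_f(zm, u, w, b)[i]) / (2 * eps)
    return total

# Arbitrary illustrative values (assumptions, not taken from the experiments).
u, w, b = [0.5, -1.0, 2.0], [1.5, 0.3, -0.7], 0.1
z = [0.2, -0.4, 0.9]

exact = trace_closed_form(z, u, w, b)
approx = trace_finite_difference(z, u, w, b)
dlogp_dt = -exact  # instantaneous change in log-density for this planar flow
```

The closed form needs a single inner product, whereas the finite-difference route needs two evaluations of f per dimension, which is why the planar parameterization keeps the trace cheap.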
Hamiltonian CNF. The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits
the data into two equal partitions and is a volume-preserving transformation, implying that ∂ log p(z(t))/∂t = 0. We
can verify this. Let

                               <<FORMULA>>                                 (29)

Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volume-preserving flow.
A.2     Connection to Fokker-Planck and Liouville PDEs
The Fokker-Planck equation is a well-known partial differential equation (PDE) that describes the probability
density function of a stochastic differential equation as it changes with time. We relate the instantaneous change
of variables to the special case of Fokker-Planck with zero diffusion, the Liouville equation.
As with the instantaneous change of variables, let z(t) ∈ R^D evolve through time following dz(t)/dt = f(z(t), t).
Then the Liouville equation describes the change in density at z, a fixed point in space, as a PDE,

                              <<FORMULA>>                                    (30)

However, (30) cannot be easily used as it requires the partial derivative ∂p(z, t)/∂z, which is typically approximated
using finite differences. This type of PDE has its own literature on efficient and accurate simulation (Stam, 1999).
Instead of evaluating p(·, t) at a fixed point, if we follow the trajectory of a particle z(t), we obtain

                              <<FORMULA>>

(The first term is the partial derivative from the first argument, z(t); the second is the partial derivative from the second argument, t.)

                              <<FORMULA>>                                      (31)

We arrive at the instantaneous change of variables by taking the log,

                              <<FORMULA>>                                      (32)

While still a PDE, (32) can be combined with z(t) to form an ODE of size D + 1,

                              <<FORMULA>>                                       (33)

Compared to the Fokker-Planck and Liouville equations, the instantaneous change of variables is of more
practical use because it can be solved numerically much more easily, requiring only an extra state of size D for following
the trajectory of z(t). By contrast, an approach based on a finite-difference approximation of the Liouville equation
would require a grid size that is exponential in D.
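To make the D + 1 dimensional ODE concrete, the following sketch integrates a state and its log-density together for a one-dimensional linear flow, where the answer is known in closed form. The dynamics f(z) = Az, the standard-normal initial density, and the fixed-step RK4 integrator are all assumptions chosen for illustration; they are not the setup used in the experiments.

```python
import math

def rk4_step(rhs, state, t, h):
    """One classical fourth-order Runge-Kutta step for a list-valued state."""
    k1 = rhs(state, t)
    k2 = rhs([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
    k3 = rhs([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
    k4 = rhs([s + h * k for s, k in zip(state, k3)], t + h)
    return [s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

A = -0.5  # hypothetical linear dynamics f(z, t) = A * z

def augmented_rhs(state, t):
    """Right-hand side for the joint state [z, log p(z(t), t)]."""
    z, _logp = state
    dz = A * z
    dlogp = -A  # -tr(df/dz); in one dimension the trace is just df/dz = A
    return [dz, dlogp]

def log_normal_pdf(x, var):
    """Log-density of a zero-mean Gaussian with variance var."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * x * x / var

z0, t1, steps = 0.8, 1.0, 100
state = [z0, log_normal_pdf(z0, 1.0)]  # z(0) drawn from N(0, 1)
h = t1 / steps
for i in range(steps):
    state = rk4_step(augmented_rhs, state, i * h, h)
z1, logp1 = state

# Under z(t) = z0 * exp(A t), a N(0, 1) variable becomes N(0, exp(2 A t)).
expected_logp1 = log_normal_pdf(z1, math.exp(2.0 * A * t1))
```

Only one extra scalar is carried along the particle's trajectory, in contrast to a grid-based discretization of the density over all of R^D.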
Appendix B             A Modern Proof of the Adjoint Method
We present an alternative proof to the adjoint method (Pontryagin et al., 1962) that is short and easy to follow.
B.1        Continuous Backpropagation

Let z(t) follow the differential equation dz(t)/dt = f(z(t), t, θ), where θ are the parameters. We will prove that if
we define an adjoint state

                                                                <<FORMULA>>                                                            (34)

then it follows the differential equation

                                                               <<FORMULA>>                                                             (35)

For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors.
The adjoint state is the gradient with respect to the hidden state at a specified time t. In standard neural networks,
the gradient of a hidden layer h_t depends on the gradient from the next layer h_{t+1} by the chain rule

                                                                 <<FORMULA>>                                                            (36)

With a continuous hidden state, we can write the transformation after an ε change in time as

                                                                 <<FORMULA>>                                                            (37)
 
                                                                 <<FORMULA>>                                                            (38)

The proof of (35) follows from the definition of derivative:

              <<FORMULA>>                                                                                                               (39)

              <<FORMULA>>                                                                                     (by Eq 38)                (40)

              <<FORMULA>>                                                                        (Taylor series around z(t))            (41)

              <<FORMULA>>                                                                                                               (42)

              <<FORMULA>>                                                                                                               (43)

             <<FORMULA>>                                                                                                                (44)
  
             <<FORMULA>>                                                                                                                (45)

We pointed out the similarity between the adjoint method and backpropagation (eq. 38). As in backpropagation,
the ODE for the adjoint state needs to be solved backwards in time. We specify the constraint at the last time
point, which is simply the gradient of the loss with respect to the state at the last time point, and can obtain the
gradients with respect to the hidden state at any time, including the initial value.

                                 <<FORMULA>>                   (46)

Here we assumed that the loss function L depends only on the last time point t_N. If L also depends on
intermediate time points t_1, t_2, ..., t_{N−1}, we can repeat the adjoint step for each of the intervals [t_{N−1}, t_N],
[t_{N−2}, t_{N−1}], ... in backward order and sum up the obtained gradients.
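For intuition, the adjoint ODE (35) can be checked by hand on scalar linear dynamics. The choice f(z, t, θ) = θz is a hypothetical example, not one used in the text:

```latex
% Assumed dynamics: dz/dt = \theta z, so the flow map is known in closed form.
\frac{dz(t)}{dt} = \theta z(t)
\quad\Longrightarrow\quad
z(t_N) = e^{\theta (t_N - t)}\, z(t),
\qquad
a(t) = \frac{\partial L}{\partial z(t)}
     = \frac{\partial L}{\partial z(t_N)}\, e^{\theta (t_N - t)} .

% Differentiating a(t) with respect to t recovers the adjoint ODE (35):
\frac{da(t)}{dt} = -\theta\, a(t) = -a(t)\, \frac{\partial f}{\partial z} .
```

The adjoint grows as we move backwards from t_N when θ > 0, mirroring how the forward state grows as we move forwards.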
B.2        Gradients wrt. θ and t
We can generalize (35) to obtain gradients with respect to θ (a constant with respect to t) and the initial and end times,
t_0 and t_N. We view θ and t as states with constant differential equations and write

                                  <<FORMULA>>                                                 (47)

We can then combine these with z to form an augmented state with corresponding differential equation and
adjoint state,

                                    <<FORMULA>>                 (48)

Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the
previous section still hold as this is a special case of a time-variant ODE. The Jacobian of f has the form

                                      <<FORMULA>>                                (49)

where each 0 is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain

                                    <<FORMULA>>                                  (50)

The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain
the total gradient with respect to the parameters, by integrating over the full interval and setting a_θ(t_N) = 0.

                                       <<FORMULA>>                               (51)

Finally, we also get gradients with respect to t0 and tN , the start and end of the integration interval.

                                       <<FORMULA>>                               (52)

Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver.
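The sketch below runs these gradient computations end to end on a problem with a closed-form answer. It assumes scalar dynamics f(z, t, θ) = θz, a loss L = z(t_1)²/2, and a fixed-step RK4 integrator in place of a black-box adaptive solver; every name is illustrative. The expression used for the gradient with respect to t_0 is derived from first principles for this example (shifting the start time by δ perturbs the state by −δ f) rather than copied from (52).

```python
import math

def rk4_step(rhs, state, t, h):
    """One classical fourth-order Runge-Kutta step; h may be negative."""
    k1 = rhs(state, t)
    k2 = rhs([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
    k3 = rhs([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
    k4 = rhs([s + h * k for s, k in zip(state, k3)], t + h)
    return [s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def integrate(rhs, state, t_start, t_end, steps=200):
    """Fixed-step integration from t_start to t_end (backwards if t_end < t_start)."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        state = rk4_step(rhs, state, t, h)
        t += h
    return state

THETA = 0.7  # hypothetical parameter value

def f(z, t, theta):
    return theta * z

def df_dz(z, t, theta):
    return theta

def df_dtheta(z, t, theta):
    return z

def forward_rhs(state, t):
    return [f(state[0], t, THETA)]

def backward_rhs(state, t):
    """Augmented dynamics for [z, a, a_theta], solved from t1 back to t0."""
    z, a, _a_theta = state
    return [f(z, t, THETA),
            -a * df_dz(z, t, THETA),       # adjoint ODE (35)
            -a * df_dtheta(z, t, THETA)]   # accumulates dL/dtheta

t0, t1, z0 = 0.0, 1.0, 1.5
(z1,) = integrate(forward_rhs, [z0], t0, t1)

dL_dz1 = z1  # gradient of L = 0.5 * z(t1)^2 at the final state
z0_rec, dL_dz0, dL_dtheta = integrate(backward_rhs, [z1, dL_dz1, 0.0], t1, t0)

dL_dt1 = dL_dz1 * f(z1, t1, THETA)
dL_dt0 = -dL_dz0 * f(z0_rec, t0, THETA)
```

The backward pass reconstructs z(t) alongside the adjoint, so no forward activations are stored. For this linear example the results can be compared with z(t_1) e^{θ(t_1 − t_0)} for the gradient with respect to z(t_0) and z(t_1)² (t_1 − t_0) for the gradient with respect to θ.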

            Appendix C              Full Adjoint sensitivities algorithm

This more detailed version of Algorithm 1 includes gradients with respect to the start and end times of integration.
Algorithm 2 Complete reverse-mode derivative of an ODE initial value problem

Input: dynamics parameters θ, start time t0 , stop time t1 , final state z(t1 ), loss gradient ∂L/∂z(t1 )

                  <<ALGORITHM>>

Note that we’ve overloaded t to be both a part of the state and the (dummy) independent variable. The
distinction is clear given context, so we keep t as the independent variable for consistency with the rest of the
text.

                     Appendix D                Autograd Implementation

                        <<ALGORITHM>>

                     Appendix E               Algorithm for training the latent ODE model

To obtain the latent representation z_{t_0}, we traverse the sequence using an RNN and obtain the parameters of the distribution
q(z_{t_0} | {x_{t_i}, t_i}_i, θ_enc). The algorithm follows a standard VAE algorithm with an RNN variational posterior and
an ODESolve model:
                                       <<ALGORITHM>>
                                        <<FORMULA>>                         (53)
                                       <<ALGORITHM>>

                     Appendix F               Extra Figures

                                       <<FIGURE>>

<|endoftext|>


<|startoftext|> 

              Learning differential equations that are easy to solve

                      Jacob Kelly∗                                Jesse Bettencourt∗
         University of Toronto, Vector Institute         University of Toronto, Vector Institute
             jkelly@cs.toronto.edu                          jessebett@cs.toronto.edu
             Matthew James Johnson                               David Duvenaud
               Google Brain                                 University of Toronto, Vector Institute

            mattjj@google.com                           duvenaud@cs.toronto.edu

                              Abstract


Differential equations parameterized by neural networks become expensive to solve
numerically as training progresses. We propose a remedy that encourages learned
dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate
for the time cost of standard numerical solvers, using higher-order derivatives
of solution trajectories. These derivatives are efficient to compute with Taylor-
mode automatic differentiation. Optimizing this additional objective trades model
performance against the time cost of solving the learned dynamics. We demonstrate
our approach by training substantially faster, while nearly as accurate, models in
supervised classification, density estimation, and time-series modelling tasks.

                    1       Introduction

Differential equations describe a system’s behavior by specifying its instantaneous dynamics. 
Historically, differential equations have been derived from theory, such as Newtonian mechanics, 
Maxwell’s equations, or epidemiological models of infectious disease, with parameters inferred 
from observations. Solutions to these equations usually cannot be expressed in closed-form, 
requiring numerical approximation. Recently, ordinary differential equations parameterized by 
millions of learned parameters, called neural ODEs, have been fit for latent time series models, 
density models, or as a replacement for very deep neural networks (Rubanova et al., 2019; Grathwohl
et al., 2019; Chen et al., 2018). These models are not constrained to match a theoretical
model, and sometimes substantially different dynamics can give nearly indistinguishable predictions.
This raises the possibility that we can find nearly equivalent models that are substantially easier
and faster to solve. Yet standard training methods have no way to penalize the complexity of the 
dynamics being learned.                                                  

                          <<FIGURE>>

Equal Contribution. Code available at: github.com/jacobjinkelly/easy-neural-ode

How can we learn dynamics that are faster to solve numerically without substantially changing their
predictions? Many of the computational advantages of a continuous-time formulation come from
using adaptive solvers, and most of the time cost of these solvers comes from repeatedly evaluating
the dynamics function, which in our settings is a moderately-sized neural network. So, we’d like to
reduce the number of function evaluations (NFE) required for these solvers to reach a given error
tolerance. Ideally, we would add a term penalizing the NFE to the training objective, and let a
gradient-based optimizer trade off between solver cost and predictive performance. But because NFE
is integer-valued, we need to find a differentiable surrogate.
The NFE taken by an adaptive solver depends on how far it can extrapolate the trajectory forward
without introducing too much error. For example, for a standard adaptive-step Runge-Kutta solver
with order m, the step size is approximately inversely proportional to the norm of the local mth total
derivative of the solution trajectory with respect to time. That is, a larger mth derivative leads to a
smaller step size and thus more function evaluations. Thus, we propose to minimize the norm of this
total derivative during training, as a way to control the time required to solve the learned dynamics.
In this paper, we investigate the effect of this speed regularization in various models and solvers.
We examine the relationship between the solver order and the regularization order, and characterize
the tradeoff between speed and performance. In most instances, we find that solver speed can be
approximately doubled without a substantial increase in training loss. We also provide an extension
to the JAX program transformation framework that provides Taylor-mode automatic differentiation,
which is asymptotically more efficient for computing the required total derivatives than standard
nested gradients.
Our work compares against and generalizes that of Finlay et al. (2020), who proposed regularizing
dynamics in the FFJORD density estimation model, and showed that it stabilized dynamics enough
in that setting to allow the use of fixed-step solvers during training.
2    Background
An ordinary differential equation (ODE) specifies the instantaneous change of a vector-valued state
<<FORMULA>>, computing the state at a later time:

                                    <<FORMULA>>

is called an initial value problem (IVP). For example, f could describe the equations of motion for a
particle, or the transmission and recovery rates for a virus across a population. Usually, the required
integral has no analytic solution, and must be approximated numerically.
Adaptive-step Runge-Kutta ODE Solvers Runge-Kutta methods (Runge, 1895; Kutta, 1901)
approximate the solution trajectories of ODEs through a series of small steps, starting at time t0 .
At each step, they choose a step size h, and fit a local approximation to the solution, ẑ(t), using
several evaluations of f. When h is sufficiently small, the numerical error of an mth-order method
is bounded by ‖ẑ(t + h) − z(t + h)‖ ≤ c h^(m+1) for some constant c (Hairer et al., 1993). So, for an
mth-order method, the local error grows approximately in proportion to the size of the mth coefficient
in the Taylor expansion of the true solution. All else being equal, controlling this coefficient for all
dimensions of z(t) will allow larger steps to be taken without surpassing the error tolerance.
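The link between higher derivatives and solver cost can be seen with a very small adaptive solver. The sketch below uses an embedded Heun-Euler pair (orders 1 and 2) with a textbook step-size controller, which is an assumption made for brevity and not the dopri5 solver used in the experiments. It counts function evaluations on two trajectories z(t) = sin(ωt) whose second derivatives differ in magnitude by a factor of ω².

```python
import math

def adaptive_heun_euler(f, z0, t0, t1, tol=1e-4, h0=0.1):
    """Integrate dz/dt = f(t, z) for scalar z; return (z(t1), NFE).

    The local error estimate is the gap between an Euler step and a Heun
    step, which is roughly proportional to h^2 times the second derivative.
    """
    t, z, h, nfe = t0, z0, h0, 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)
        k1 = f(t, z)
        k2 = f(t + h, z + h * k1)
        nfe += 2
        err = 0.5 * h * abs(k2 - k1)
        if err <= tol:
            # Accept the step, advancing with the higher-order Heun estimate.
            t += h
            z += 0.5 * h * (k1 + k2)
        # Standard controller: rescale h so the next error is near tol.
        factor = 2.0 if err == 0.0 else 0.9 * math.sqrt(tol / err)
        h *= min(2.0, max(0.2, factor))
    return z, nfe

def make_dynamics(omega):
    """dz/dt = omega * cos(omega * t), so z(t) = sin(omega * t) when z(0) = 0."""
    return lambda t, z: omega * math.cos(omega * t)

z_slow, nfe_slow = adaptive_heun_euler(make_dynamics(1.0), 0.0, 0.0, 2.0)
z_fast, nfe_fast = adaptive_heun_euler(make_dynamics(8.0), 0.0, 0.0, 2.0)
```

Both runs use the same tolerance, yet the ω = 8 trajectory forces much smaller steps and therefore a larger NFE, which is the effect the regularizer is meant to counteract.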
Neural Ordinary Differential Equations The dynamics function f can be a moderately-sized
neural network, and its parameters θ trained by gradient descent. Solving the resulting IVP is
analogous to evaluating a very deep residual network in which the number of layers corresponds
to the number of function evaluations of the solver (Chang et al., 2017; Ruthotto & Haber, 2018;
Chen et al., 2018). Solving such continuous-depth models using adaptive numerical solvers has
several computational advantages over standard discrete-depth network architectures. However, this
approach is often slower than using a fixed-depth network, due to an inability to control the number
of steps required by an adaptive-step solver.

                        3   Regularizing Higher-Order Derivatives for Speed

The ability of Runge-Kutta methods to take large and accurate steps is limited by the Kth-order
Taylor coefficients of the solution trajectory. We would like these coefficients to be small. Specifically,
we propose to regularize the squared norm of the Kth-order total derivatives of the state with respect
to time, integrated along the entire solution trajectory:

                                     <<FORMULA>>                                  (1)

where ‖·‖_2^2 is the squared ℓ2 norm, and the dependence on the dynamics parameters θ is implicit
through the solution z(t) integrating dz(t)/dt = f(z(t), t, θ).

During training, we weight this regularization term by a hyperparameter λ and add it to our original loss
to get our regularized objective:

                                     <<FORMULA>>                                   (2)

What kinds of solutions are allowed when R_K = 0? For K = 0,

                                     <<FORMULA>>

we have ‖z(t)‖^2 = 0, so the only possible solution is z(t) = 0.

For K = 1, we have ‖f(z(t), t)‖^2 = 0, so all solutions are constant, flat trajectories. For K = 2, solutions are
straight-line trajectories. Higher values of K shrink higher derivatives, but don’t penalize lower-order dynamics.
For instance, a quadratic trajectory will have R_3 = 0. Setting the Kth-order dynamics to exactly zero everywhere
automatically makes all higher orders zero as well. Figure 1 shows that regularizing R_3 on a toy 1D neural ODE
reduces NFE.
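The quadratic case can be worked out explicitly. With constant vectors a, b, and c (symbols introduced here only for illustration), the regularizer of eq. (1) gives:

```latex
z(t) = a + b\,t + c\,t^{2}
\quad\Longrightarrow\quad
\frac{d^{2} z}{dt^{2}} = 2c, \qquad \frac{d^{3} z}{dt^{3}} = 0,

\mathcal{R}_{2} = \int_{t_0}^{t_1} \lVert 2c \rVert_2^2 \, dt = 4\,\lVert c \rVert_2^2\,(t_1 - t_0),
\qquad
\mathcal{R}_{3} = 0 .
```

So the second-order penalty charges the trajectory for its curvature, while the third-order penalty leaves it entirely free.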

                                    <<FIGURE>>
                                                                                     
Which orders should we regularize? We propose matching the order of the regularizer to that of the solver being
used. We conjecture that regularizing dynamics of lower orders than that of the solver restricts the model
unnecessarily, and that letting the lower orders remain unregularized should not increase NFE very much. Figure 2
shows empirically which orders of Runge-Kutta solvers can efficiently solve which orders of toy polynomial
trajectories. We test these conjectures on real models and datasets in section 6.2.

The solution trajectory and our regularization term can be computed in a single call to an ODE solver
by augmenting the system with the integrand in eq. (1).
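A minimal sketch of this augmentation for K = 2 follows. It assumes the toy dynamics f(z) = −z, for which the second total derivative is available by hand (d²z/dt² = (∂f/∂z) f = z), and a fixed-step RK4 integrator; in practice the derivative would come from automatic differentiation and the solver would be adaptive.

```python
import math

def rk4_step(rhs, state, t, h):
    """One classical fourth-order Runge-Kutta step for a list-valued state."""
    k1 = rhs(state, t)
    k2 = rhs([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
    k3 = rhs([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
    k4 = rhs([s + h * k for s, k in zip(state, k3)], t + h)
    return [s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def f(z, t):
    """Toy dynamics (an assumption for this example)."""
    return -z

def d2z_dt2(z, t):
    """Second total derivative along the solution: (df/dz) * f = (-1) * (-z)."""
    return z

def augmented_rhs(state, t):
    """Joint dynamics of the state z and the running regularizer r."""
    z, _r = state
    return [f(z, t), d2z_dt2(z, t) ** 2]

z0, t0, t1, steps = 2.0, 0.0, 1.0, 100
h = (t1 - t0) / steps
state = [z0, 0.0]  # the regularizer accumulator starts at zero
for i in range(steps):
    state = rk4_step(augmented_rhs, state, t0 + i * h, h)
z1, r2 = state

# Closed form for comparison: integral of (z0 * exp(-t))^2 over [t0, t1].
r2_exact = z0 ** 2 * (1.0 - math.exp(-2.0 * (t1 - t0))) / 2.0
```

Because the accumulator r is just one more state dimension, the solver's own error control applies to the regularizer as well as to z(t).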

            4   Efficient Higher Order Differentiation with Taylor Mode

The number of terms in higher-order forward derivatives grows exponentially in K, becoming
prohibitively expensive for K = 5, and causing substantial slowdowns even for K = 2 and K = 3.
Luckily, there exists a generalization of forward-mode automatic differentiation (AD), known as
Taylor mode, which can compute the total derivative exactly for a cost of only O(K^2). We found
that this asymptotic improvement reduced wall-clock time by an order of magnitude, even for K as
low as 3.
First-order forward-mode AD Standard forward-mode AD computes, for a function f(x) and
an input perturbation vector v, the product (∂f/∂x) v. This Jacobian-vector product, or JVP, can be
computed efficiently without explicitly instantiating the Jacobian. This implicit computation of JVPs
is straightforward whenever f is a composition of operations for which implicit JVP rules are
known.
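The idea of composing per-operation JVP rules can be sketched with dual numbers in a few lines of plain Python. The `Dual` class, the `sin` wrapper, and the `jvp` helper below are hypothetical names written for this illustration; they cover only addition, multiplication, and sine, and are not the JAX implementation.

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""

    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    @staticmethod
    def lift(x):
        """Treat plain numbers as constants with zero tangent."""
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied to the tangents.
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__

def sin(x):
    """JVP rule for sine: d sin(x) = cos(x) dx."""
    x = Dual.lift(x)
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)

def jvp(fun, primals, tangents):
    """Evaluate fun and its Jacobian-vector product in one forward pass."""
    duals = [Dual(p, v) for p, v in zip(primals, tangents)]
    out = fun(*duals)
    return out.primal, out.tangent

def example(x0, x1):
    return x0 * sin(x1) + 3.0 * x0

value, directional = jvp(example, [2.0, 0.5], [1.0, -1.0])

# Hand-derived check: df/dx0 = sin(x1) + 3 and df/dx1 = x0 * cos(x1).
expected = (math.sin(0.5) + 3.0) * 1.0 + (2.0 * math.cos(0.5)) * (-1.0)
```

The Jacobian is never formed: each primitive only pushes one tangent value forward alongside its primal output.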
Higher-order Jacobian-vector products Forward-mode AD can be generalized to higher orders
to compute Kth-order Jacobians contracted K times against the perturbation vector: (∂^K f/∂x^K) v^{⊗K}.
Similarly, this can also be computed without representing any Jacobian matrices explicitly.

A naïve approach to higher-order forward mode is to recursively apply first-order forward mode.
Specifically, nesting JVPs K times gives the right answer: <<FORMULA>> but
causes an unnecessary exponential slowdown, costing O(exp(K)). This is because expressions that
appear in lower derivatives also appear in higher derivatives, but the work to compute them is not shared
across orders.
Taylor Mode Taylor-mode AD generalizes first-order forward mode to compute the first K derivatives
exactly with a time cost of only O(K^2) or O(K log K), depending on the operations involved. Instead
of providing rules for propagating perturbation vectors, one provides rules for propagating truncated
Taylor series. Some example rules are shown in table 1. For more details see the Appendix and
Griewank & Walther (2008, Chapter 13). We provide an open source implementation of Taylor mode AD
in the JAX Python library (Bradbury et al., 2018).

Function             Taylor propagation rule
<<y = z + cw>>       <<y_{[k]} = z_{[k]} + c w_{[k]}>>
<<y = z * w>>        <<y_{[k]} = \sum_{j=0}^{k} z_{[j]} w_{[k-j]}>>
<<y = z / w>>        <<y_{[k]} = \frac{1}{w_0} \left[ z_{[k]} - \sum_{j=0}^{k-1} y_{[j]} w_{[k-j]} \right]>>
<<y = \exp(z)>>      <<\tilde{y}_{[k]} = \sum_{j=1}^{k} y_{[k-j]} \tilde{z}_{[j]}>>
<<s = \sin(z)>>      <<\tilde{s}_{[k]} = \sum_{j=1}^{k} \tilde{z}_{[j]} c_{[k-j]}>>
<<c = \cos(z)>>      <<\tilde{c}_{[k]} = \sum_{j=1}^{k} -\tilde{z}_{[j]} s_{[k-j]}>>

Table 1: Rules for propagating Taylor polynomial coefficients through standard functions. These rules
generalize standard first-order derivatives. Notation <<z_{[i]} = \frac{1}{i!} z_i>> and <<\tilde{y}_{[i]} = \frac{i}{i!} y_i>>.
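The rules in table 1 can be exercised directly on lists of normalized Taylor coefficients. The following is a minimal pure-Python sketch for scalar inputs, not the JAX implementation; the function names are hypothetical, and the tilde quantities are expanded as k times the kth coefficient, following the notation in the table caption.

```python
import math

def taylor_mul(z, w):
    """Coefficients of y = z * w: y[k] = sum_{j=0..k} z[j] * w[k-j]."""
    return [sum(z[j] * w[k - j] for j in range(k + 1)) for k in range(len(z))]

def taylor_div(z, w):
    """Coefficients of y = z / w, solved order by order from z = y * w."""
    y = []
    for k in range(len(z)):
        acc = sum(y[j] * w[k - j] for j in range(k))
        y.append((z[k] - acc) / w[0])
    return y

def taylor_exp(z):
    """Coefficients of y = exp(z): k*y[k] = sum_{j=1..k} y[k-j] * (j*z[j])."""
    y = [math.exp(z[0])]
    for k in range(1, len(z)):
        y.append(sum(y[k - j] * j * z[j] for j in range(1, k + 1)) / k)
    return y

def taylor_sincos(z):
    """Coefficients of s = sin(z) and c = cos(z), propagated jointly."""
    s, c = [math.sin(z[0])], [math.cos(z[0])]
    for k in range(1, len(z)):
        s_k = sum(j * z[j] * c[k - j] for j in range(1, k + 1)) / k
        c_k = sum(-j * z[j] * s[k - j] for j in range(1, k + 1)) / k
        s.append(s_k)
        c.append(c_k)
    return s, c

# Input polynomial z(t) = t, so the outputs are the Taylor series of exp(t) at 0.
print(taylor_exp([0.0, 1.0, 0.0, 0.0]))  # coefficients 1, 1, 1/2, 1/6 up to rounding
```

Each rule reuses the lower-order output coefficients it has already computed, which is the source of the polynomial rather than exponential cost in K.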

                     5     Experiments
                                                                                    
We consider three different tasks in which continuous-depth or continuous-time models might have
computational advantages over standard discrete-depth models: supervised learning, continuous
generative modeling of time-series (Rubanova et al., 2019), and density estimation using continuous
normalizing flows (Grathwohl et al., 2019). Unless specified otherwise, we use the standard
dopri5 Runge-Kutta 4(5) solver (Dormand & Prince, 1980; Shampine, 1986).

5.1   Supervised Learning

We construct a model for MNIST classification: it takes as input a flattened MNIST image, integrates
it through dynamics given by a simple MLP, and then applies a linear classification layer. In fig. 3
we compare the NFE and training error of a model trained with and without regularizing R3.

               <<FIGURE>>

Figure 3: Number of function evaluations (NFE) and training error during training. Speed
regularization (solid) decreases the NFE throughout training without substantially changing the
training error.
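To make the NFE metric concrete, the sketch below counts function evaluations in a toy adaptive solver. It is an illustration only: it uses an embedded Euler/Heun pair rather than dopri5, handles scalar states only, and the function name, tolerances, and the two example dynamics are assumptions made for this example.

```python
import math

def adaptive_heun(f, z0, t0, t1, rtol=1e-4, atol=1e-6):
    """Integrate dz/dt = f(z, t) with an embedded Euler/Heun pair.

    Returns (z(t1), number of function evaluations)."""
    nfe = 0

    def counted(z, t):
        nonlocal nfe
        nfe += 1
        return f(z, t)

    t, z, h = t0, z0, (t1 - t0) / 10.0
    while t < t1:
        h = min(h, t1 - t)
        k1 = counted(z, t)
        k2 = counted(z + h * k1, t + h)
        z_heun = z + 0.5 * h * (k1 + k2)
        err = abs(0.5 * h * (k2 - k1))  # |Heun step - Euler step|
        tol = atol + rtol * max(abs(z), abs(z_heun))
        if err <= tol:  # accept the step
            t, z = t + h, z_heun
        # Step-size update for an error estimate that scales as h^2.
        h *= min(5.0, max(0.2, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return z, nfe

easy = lambda z, t: -z                          # slowly varying trajectory
hard = lambda z, t: 20.0 * math.cos(20.0 * t)   # large higher derivatives in t
_, nfe_easy = adaptive_heun(easy, 1.0, 0.0, 1.0)
_, nfe_hard = adaptive_heun(hard, 1.0, 0.0, 1.0)
print(nfe_easy, nfe_hard)
```

The second system has much larger higher-order time derivatives along its trajectory, so the step-size controller takes shorter steps and the NFE grows, which is the effect speed regularization is meant to counteract.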
                                                                  
5.2   Continuous Generative Time Series Models

As in Rubanova et al. (2019), we use the Latent ODE architecture for modelling trajectories of ICU
patients using the PhysioNet Challenge 2012 dataset (Silva et al., 2012). This variational
autoencoder architecture uses an RNN recognition network, and models the state dynamics using an
ODE in a latent space. In the supervised learning setting described in the previous section, only
the final state affects model predictions. In contrast, time-series models’ predictions also depend
on the value of the trajectory at all intermediate times when observations were made. So, we might
expect speed regularization to be ineffective due to these extra constraints on the dynamics.
However, fig. 4 shows that, without changing their overall shape, the latent dynamics can be
adjusted to reduce their NFE by a factor of 3.

Figure 4: Regularizing dynamics in a latent ODE modeling PhysioNet clinical data. Shown is a
representative 2-dimensional slice of 20-dimensional dynamics. We reduce average NFE from 281 to 90
while only incurring an 8% increase in loss.
                                                   
5.3   Density Estimation with Continuous Normalizing Flows

Our third task is unsupervised density estimation, using a scalable variant of continuous normalizing
flows called FFJORD (Grathwohl et al., 2019). We fit the MINIBOONE tabular dataset from
Papamakarios et al. (2017) and the MNIST image dataset (LeCun et al., 2010). We use the respective
single-flow architectures from Grathwohl et al. (2019).
Grathwohl et al. (2019) noted that the NFE required to numerically integrate their dynamics could
become prohibitively expensive throughout training. Table 2 shows that we can reduce NFE by 38%
for only a 0.6% increase in negative log-likelihood, measured in bits/dim.
How to train your Neural ODE We compare against the approach of Finlay et al. (2020), who
design two regularization terms specifically for stabilizing the dynamics of FFJORD models:
       
                        <<FORMULA>>

The first term is designed to encourage straight-line paths, and the second, stochastic, term is designed
to reduce overfitting. Finlay et al. (2020) used fixed-step solvers during training for some datasets.
We compare these two regularizers when training with each of adaptive and fixed-step solvers, and
evaluating with an adaptive solver, in section 6.3.

6   Analysis and Discussion

6.1   Trading off function evaluations for loss

What does the trade off between accuracy and speed look like? Ideally, we could reduce the solver
time a lot without substantially reducing model performance. Indeed, this is demonstrated in all three
settings we explored. Figure 5 shows that generally, model performance starts getting substantially
worse only after a 50% reduction in solver time when controlling R2.

               <<FIGURE>>

Figure 5: Tuning the regularization of R2 trades off between training loss and solver speed in three
different applications of neural ODEs. Horizontal axes show average number of function evaluations,
and vertical axes show unregularized training loss, both at the end of training.

6.2 Order of regularization vs. order of solver

Which order of total derivatives should we regularize for a particular solver? As mentioned earlier,
we conjecture that the best choice would be to match the order of the solver being used. Regularizing
too low an order might needlessly constrain the dynamics and make it harder to fit the data, while
regularizing too high an order might leave the dynamics difficult to solve for a lower-order solver.
However, we also expect that optimizing higher-order derivatives might be challenging, since these
higher derivatives can change quickly even for small changes to the dynamics parameters.
Figures 6 and 7 investigate this question on the task of MNIST classification. Figure 6 compares the
effectiveness of regularizing different orders when using a solver of a particular order. For a 2nd
order solver, regularizing K = 2 produces a strictly better trade-off between performance and speed,
as expected. For higher-order solvers, including ones with adaptive order, we found that regularizing
orders above K = 3 gave little benefit.

               <<FIGURE>>

Figure 7 investigates the relationship between RK and the quantity it is meant to be a surrogate
for: NFE. We observe a clear monotonic relationship between the two, for all orders of solver and
regularization.

               6.3          Do we reduce training time?

Our approach produces models that are fastest to evaluate at test time. However, when we train
with adaptive solvers we do not improve overall training time, due to the additional expense of
computing our regularizer. Training with a fixed-grid solver is faster, but can be unstable if dynamics
are unregularized. Both the regularization of Finlay et al. (2020) and ours allow us to use fixed-grid
solvers and reduce training time. However, ours is 2.4× slower than that of Finlay et al. (2020) for FFJORD because
their regularization re-uses terms already computed in the FFJORD training objective. For objectives
where these cannot be re-used, like MNIST classification, our method is 1.7× slower, but achieves
better test-time NFE.

               6.4       Are we making the solver overconfident?

Because we optimize dynamics in a way specifically designed to make the solver take longer steps,
we might fear that we are “adversarially attacking” our solver, making it overconfident in its ability
to extrapolate. Figure 8c shows that this is not the case for MNIST classification.

               6.5       Does speed regularization overfit?

Finlay et al. (2020) motivated one of their regularization terms by the possibility of overfitting: having
faster dynamics only for the examples in the training set, but still slow on the test set. However, they
did not check whether overfitting was occurring. In fig. 8b we confirm that our regularized dynamics
have nearly identical average solve time on a held-out test set, on MNIST classification.

                             7        Related Work

Although the field of numerical ODE solvers is extremely mature, as far as we know, there has
been almost no work specifically on tuning differential equations to be faster to solve. The closest
related work is Grathwohl et al. (2019), who mention attempting to use weight decay and spectral
normalization to reduce NFE, and of course Finlay et al. (2020), who, among other contributions,
introduced the use of fixed-step solvers for stable training.

                                 <<FIGURE>>

Figure 8: Figure 8c: We observe that the actual solver error is about equally well-calibrated for
regularized dynamics as random dynamics, indicating that regularization does not make the solver
overconfident. Figure 8b: There is negligible overfitting of solver speed. ??: Speed regularization
does not usefully improve generalization. For large λ, our method reduces overfitting, but increases
overall test error due to under-fitting.

Stabilizing dynamics Simard et al. (1991) regularized the dynamics of discrete-time recurrent
neural networks to improve their stability, by constraining the norm of the Jacobian of the dynamics
function in the direction of its largest eigenvalue. However, this approach has an O(D^3) time cost.
De Brouwer et al. (2019) introduced a parameterization of neural ODEs analogous to instantaneous
Gated Recurrent Unit (GRU) recurrent neural network architectures in order to stabilize training
dynamics. Dupont et al. (2019) provided theoretical arguments that adding extra dimensions to the
state of a neural ODE should make training easier, and showed that this helped reduce NFE during
training.
Gradually increasing depth Chang et al. (2017) noted the connection between residual networks
and ODEs, and took advantage of this connection to gradually make resnets deeper during training,
in order to save time. One can view the increase in NFE while training neural ODEs as an automatic, but
uncontrolled, version of their method. Their results suggest we might benefit from introducing a
speed regularization schedule that gradually tapers off during training.
Gradient Regularization Novak et al. (2018) and Drucker & LeCun (1992) regularized the gradients
of neural networks to improve generalization.
Table 2: Density Estimation on MNIST using FFJORD. For adaptive solvers, indicated by ∞ Steps,
our approach is slowest to train, but requires the fewest NFE once trained. For fixed-step solvers our
approach achieves lower bits/dim and NFE when comparing across fixed-grid solvers using the same
number of steps. Fixed step solvers that diverged due to instability are indicated by NaN bits/dim.
  
                        8    Scope

The initial speedups obtained in this paper are not yet enough to make neural ODEs competitive with
standard fixed-depth architectures in terms of speed for standard supervised learning. However, there
are many applications where continuous-depth architectures provide a unique advantage. Besides
density models such as FFJORD and time series models, continuous-depth architectures have been
applied in solving mean-field games (Ruthotto et al., 2019), image segmentation (Pinckaers & Litjens,
2019), image super-resolution (Scao, 2020), and molecular simulations (Wang et al., 2020). These
applications, which already use continuous-time models, could benefit from the speed regularization
proposed in this paper.
While we investigated only ODEs in this paper, this approach could presumably be extended straight-
forwardly to neural stochastic differential equations fit by adaptive solvers (Li et al., 2020) and other
flavors of parametric differential equations fit by gradient descent (Rackauckas et al., 2019).

                      9    Limitations

Hyperparameters The hyperparameter λ needs to be chosen to balance speed and training loss.
On the other hand, neural ODEs don’t require choosing the outer number of layers, which needs to
be chosen separately for each stack of layers in standard architectures.
One also needs to choose solver order and tolerances, and these can substantially affect solver speed.
We did not investigate loosening tolerances, or modifying other parameters of the solver. The default
tolerance of 1.4e-8 for both atol and rtol behaved well in all our experiments.
One also needs to choose K. Higher K seems to generally work better, but is slower per step at
training time. In principle, if one can express their utility explicitly in terms of training loss and NFE,
it may be possible to tune λ automatically during training using the predictable relationship between
RK and NFE shown in fig. 7.
Slower overall training Although speed regularization reduces the overall NFE during training, it
makes each step more expensive. In our density estimation experiments (table 2), the overall effect
was about 70% slower training, compared to no regularization, when using adaptive solvers.
However, test-time evaluation is much faster, since there is no slowdown per step.
10     Conclusions
This paper is an initial attempt at controlling the integration time of differential equations by regular-
izing their dynamics. This is an almost unexplored problem, and there are almost certainly better
quantities to optimize than the ones examined in this paper.
Based on these initial experiments, we propose three practical takeaways:
       1. Across all tasks, tuning the regularization usually gave at least a 2x speedup without
          substantially hurting model performance.
       2. Overall training time with speed regularization is in general about 30% to 50% slower with
          adaptive solvers.
       3. For standard solvers, regularizing orders higher than R2 or R3 provided little additional
          benefit.
Future work It may be possible to adapt solver architectures to take advantage of flexibility in
choosing the dynamics. Standard solver design has focused on robustly and accurately solving a
given set of differential equations. However, in a learning setting, we could consider simply rejecting
some kinds of dynamics as being too difficult to solve, analogous to other kinds of constraints we put
on models to encourage statistical regularization.
                                                   
                        Acknowledgements

We thank Barak Perlmutter, Ken Jackson, Ricky T.Q. Chen, Will Grathwohl, Chris Finlay, and
Chris Rackauckas for feedback and helpful discussions. Resources used in preparing this research
were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and
companies sponsoring the Vector Institute.

                        References

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-
   Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL
   http://github.com/google/jax.
Chang, B., Meng, L., Haber, E., Tung, F., and Begert, D. Multi-level residual networks from
   dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential
   equations. In Advances in neural information processing systems, pp. 6571–6583, 2018.
De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. GRU-ODE-Bayes: Continuous modeling of
   sporadically-observed time series. In Advances in Neural Information Processing Systems, pp.
  7377–7388, 2019.
Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of computa-
   tional and applied mathematics, 6(1):19–26, 1980.
Drucker, H. and LeCun, Y. Improving generalization performance using double backpropagation.
   IEEE Trans. Neural Networks, 3(6):991–997, 1992. doi: 10.1109/72.165600. URL
   https://doi.org/10.1109/72.165600.
Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Advances in Neural Information
  Processing Systems, pp. 3134–3144, 2019.
Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. M. How to train your neural ODE.
   arXiv preprint arXiv:2002.02798, 2020.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form
   continuous dynamics for scalable reversible generative models. International Conference on
  Learning Representations, 2019.
Griewank, A. and Walther, A. Evaluating derivatives. 2008.
Hairer, E., Norsett, S., and Wanner, G. Solving Ordinary Differential Equations I: Nonstiff Problems,
   volume 8. 1993. doi: 10.1007/978-3-540-78862-1.
Kutta, W. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
  Mathematik und Physik, 46:435–453, 1901.
LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online].
  Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Li, X., Chen, R. T. Q., Wong, T.-K. L., and Duvenaud, D. Scalable gradients for stochastic differential
   equations. In Artificial Intelligence and Statistics, 2020.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and
   generalization in neural networks: an empirical study. In 6th International Conference on Learning
  Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
  Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation.
  Advances in Neural Information Processing Systems, 2017.
Pinckaers, H. and Litjens, G. Neural ordinary differential equations for semantic segmentation of
   individual colon glands. arXiv preprint arXiv:1910.10470, 2019.
Rackauckas, C., Innes, M., Ma, Y., Bettencourt, J., White, L., and Dixit, V. Diffeqflux.jl-a Julia
   library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent ordinary differential equations for irregularly-
   sampled time series. In Advances in Neural Information Processing Systems, pp. 5321–5331,
   2019.
Runge, C. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
  167–178, 1895.
Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations. Journal
   of Mathematical Imaging and Vision, pp. 1–13, 2018.
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and Fung, S. W. A machine learning framework for
   solving high-dimensional mean field game and mean field control problems. CoRR, abs/1912.01825,
   2019. URL http://arxiv.org/abs/1912.01825.
Scao, T. L. Neural differential equations for single image super-resolution. arXiv preprint
   arXiv:2005.00865, 2020.
Shampine, L. F. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):
  135–150, 1986. ISSN 00255718, 10886842. URL http://www.jstor.org/stable/2008219.
Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. Predicting in-hospital mortality of
   ICU patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in
  Cardiology, pp. 245–248, 2012.
Simard, P., Raysz, J. P., and Victorri, B. Shaping the state space landscape in recurrent networks. In
  Advances in neural information processing systems, pp. 105–112, 1991.
Wang, W., Axelrod, S., and Gómez-Bombarelli, R. Differentiable molecular simulations for control
   and learning. arXiv preprint arXiv:2003.00868, 2020.

                  Appendix A            Taylor-mode Automatic Differentiation

                              A.1    Taylor Polynomials

To clarify the relationship between the presentation in Chapter 13 of Griewank & Walther (2008) and
our results we give the distinction between the Taylor coefficients and derivative coefficients, also
known, unhelpfully, as Tensor coefficients.

For a sufficiently smooth vector-valued function f : R^n → R^m and the polynomial

                         <<x(t) = x_{[0]} + x_{[1]} t + x_{[2]} t^2 + x_{[3]} t^3 + \cdots + x_{[d]} t^d \in \mathbb{R}^n>>                                 (5)

we are interested in the d-truncated Taylor expansion

                          <<y(t) = f(x(t)) + O(t^{d+1})>>                                                                      (6)

                          <<\equiv y_{[0]} + y_{[1]} t + y_{[2]} t^2 + y_{[3]} t^3 + \cdots + y_{[d]} t^d \in \mathbb{R}^m>>                                    (7)

with the notation that <<FORMULA>> is the Taylor coefficient, which is the normalized derivative coefficient.

The Taylor coefficients of the expansion, y[j] , are smooth functions of the i ≤ j coefficients x[i],

                                       <<FORMULA>>                                                                        (8)

                                       <<FORMULA>>                                                                        (9)

                                       <<FORMULA>>                                                                       (10)

                                       <<FORMULA>>                                                                       (11)

These, as given in Griewank & Walther (2008), are written in terms of the normalized Taylor
coefficients. This obscures their direct relationship with the derivatives, which we make explicit.
Consider the polynomial eq. (5) with Taylor coefficients expanded so their normalization is clear.
Further, let’s use suggestive notation that these coefficients correspond to the higher derivatives of 
x with respect to t, making x(t) a Taylor polynomial. That is <<FORMULA>>.
                                       <<FORMULA>>                                                                       (12)

                                       <<FORMULA>>                                                                       (13)

                                       <<FORMULA>>                                                                       (14)

Again, we are interested in the polynomial eq. (7), but with the normalization terms explicit

                                       <<FORMULA>>                                                                       (15)

Now we can expand the expressions for the Taylor coefficients y[i] to expressions for derivative
coefficients yi = i!y[i].

The coefficients of the Taylor expansion, yj , are smooth functions of the i ≤ j coefficients xi,

                                       <<FORMULA>>                                                                       (16)

                                       <<FORMULA>>                                                                       (17)

                                       <<FORMULA>>                                                                       (18)

                                       <<FORMULA>>                                                                       (19)

                                       <<FORMULA>>                                                                       (20)

                                       <<FORMULA>>                                                                       (21)

Therefore, eqs. (16), (17), (19) and (21) show that the derivative coefficients yi are exactly the ith
order derivatives of the composition f (x(t)) with respect to t. The key insight of this exercise
is that by writing the derivative coefficients explicitly we reveal that the expressions for the terms,
eqs. (16) to (18) and (20), involve terms previously computed for lower order terms.
In general, it will be useful to consider that the derivative coefficient yk is a function of all lower
order input derivatives

                                             <<y_k = y_k(x_0, \ldots, x_k)>>.                                  (22)

We provide the API to compute this in JAX by indexing the kth output of jet

                                      <<y_k = \mathrm{jet}(f, x_0, (x_1, \ldots, x_k))[k]>>.
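As a small usage sketch, the call above can be tried with `jax.experimental.jet.jet`, which takes a tuple of primals and a tuple of per-argument series and returns the output primal together with the output series. The choice f = sin, the point x0 = 0.5, and the input polynomial x(t) = x0 + t are assumptions made only for illustration; with that input, the output derivative coefficients reduce to the ordinary derivatives of f at x0.

```python
import math

import jax.numpy as jnp
from jax.experimental.jet import jet

f = jnp.sin
x0 = 0.5

# Input polynomial x(t) = x0 + t: derivative coefficients x1 = 1, x2 = x3 = 0.
y0, (y1, y2, y3) = jet(f, (x0,), ((1.0, 0.0, 0.0),))

# For this input, y_k is the kth derivative of sin evaluated at x0.
expected = (math.sin(x0), math.cos(x0), -math.sin(x0), -math.cos(x0))
print([float(v) for v in (y0, y1, y2, y3)], expected)
```

Note that the series passed to and returned by `jet` are derivative coefficients, not the normalized Taylor coefficients, which matches the convention of eq. (22).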

                  A.2    Relationship with Differential Equations

                           A.2.1    Autonomous Form

We can transform the initial value problem

                                     <<FORMULA>>                                (23)

into an autonomous dynamical system by augmenting the system to include the independent variable
with trivial dynamics (Hairer et al., 1993):

                               <<FORMULA>>                              (24)

We do this for notational convenience, and because it makes clear that derivatives with respect to t
are meant in the “total” sense. This alleviates the potential ambiguity of ∂t f (x(t), t), which could
mean either the derivative with respect to the second argument or the derivative through x(t) by the
chain rule <<FORMULA>>.
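The augmentation can be sketched in a few lines of plain Python. The helper name `make_autonomous` and the example dynamics are hypothetical; the state is represented as a list whose last entry holds t.

```python
def make_autonomous(f):
    """Turn f(x, t) into g(s) acting on the augmented state s = x + [t]."""
    def g(s):
        x, t = s[:-1], s[-1]
        return list(f(x, t)) + [1.0]  # trivial dynamics for time: dt/dt = 1
    return g

# Hypothetical non-autonomous dynamics, used only for illustration.
def f(x, t):
    return [t * xi for xi in x]

g = make_autonomous(f)
print(g([1.0, 2.0, 3.0]))  # state x = [1, 2] at time t = 3
```

Because g depends on the augmented state alone, every derivative with respect to t taken through g is automatically a total derivative.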

            A.2.2    Taylor Coefficients for ODE Solution with jet

Recall that jet gives us the coefficients for yi as a function of f and the coefficients xj for j ≤ i.
We can use jet and the relationship x_{k+1} = y_k to recursively compute the coefficients of the
solution polynomial.

                     Algorithm 1 Taylor Coefficients for ODE Solution by Recursive Jet

                                    <<ALGORITHM>>
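The recursion described above can be illustrated on a scalar ODE with a known solution. This is a sketch under stated assumptions rather than a transcription of algorithm 1: `jet_square` is a hypothetical stand-in for a call to jet with the specific choice f(z) = z^2, implemented with the Leibniz rule on derivative coefficients, and the test problem dz/dt = z^2 with z(0) = 1 has the solution 1/(1 - t).

```python
from math import comb, factorial

def jet_square(x):
    """Derivative coefficients y_0..y_k of y(t) = x(t)**2 from x_0..x_k.

    Stands in for jet(f, x0, (x1, ..., xk)) with f(z) = z**2 (Leibniz rule)."""
    return [sum(comb(k, j) * x[j] * x[k - j] for j in range(k + 1))
            for k in range(len(x))]

def ode_taylor_coefficients(jet_f, z0, order):
    """Derivative coefficients x_0..x_order of the solution of dz/dt = f(z)."""
    x = [z0]
    for _ in range(order):
        y = jet_f(x)     # y_0..y_k as functions of x_0..x_k
        x.append(y[-1])  # x_{k+1} = y_k
    return x

coeffs = ode_taylor_coefficients(jet_square, 1.0, 5)

# Evaluate the truncated Taylor polynomial of the solution at a small time t.
t = 0.1
approx = sum(c / factorial(k) * t**k for k, c in enumerate(coeffs))
print(coeffs, approx, 1.0 / (1.0 - t))
```

For this problem the kth derivative coefficient equals k!, so the normalized Taylor coefficients are all 1, as expected from the geometric series for 1/(1 - t).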

                        A.3    Regularizing Taylor Terms

Computing the Taylor coefficients for the ODE solution as in algorithm 1 will give a local approx-
imation to the ODE solution. If infinitely many Taylor coefficients could be computed this would
give the exact solution. The order of the final Taylor coefficient, determining the truncation of the
polynomial, gives the order of the approximation.
If the higher order Taylor coefficients of the solution are large, then truncation will result in a local
approximation that quickly diverges from the solution. However, if the higher Taylor coefficients are
small then the local approximation will remain close to the solution. This motivates our regularization
method. The effect of our regularizer on the Taylor expansion of a solution to a neural ODE can be
seen in fig. 9.

                  Appendix B         Experimental Details

Experiments were conducted using GPU-based ODE solvers. Training gradients were computed
using the adjoint method, in which the trajectory is reconstructed backwards in time during
backpropagation to save memory. As in Finlay et al. (2020), we normalize our regularization term in eq. (1) by
the dimension of the vector-valued trajectory z(t) so that we may choose λ free of scaling by the
dimension of the problem.

               B.1    Efficient computation of the gradient of regularization term

To optimize our regularized objective, we must compute its gradient. We use the adjoint method
as described in Chen et al. (2018) to differentiate through the solution to the ODE. In particular, to
optimize our model we only need to compute the gradient of the regularization term. The adjoint
method gives the gradient of the ODE solution as a solution to an augmented ODE.

                               <<FIGURE>>

Figure 9: Left: The dynamics and a trajectory of a neural ODE trained on a toy supervised learning
problem. The dynamics are poorly approximated by a 6th-order local Taylor series, and require 92
NFE when solved by a 5th-order Runge-Kutta solver. Right: Regularizing the 6th-order derivatives of
trajectories gives dynamics that are easier to solve numerically, requiring only 68 NFE.

                     B.2   Supervised Learning

The dynamics function f : R^d × R → R^d is given by an MLP as follows:

                                            <<z1 = σ(x)>>
                                        <<h1 = W1 [z1 ; t] + b1>>
                                            <<z2 = σ(h1 )>>
                                        <<y = W2 [z2 ; t] + b2>>

where <<[·; ·]>> denotes concatenation of a scalar onto a column vector. The parameters are
<<W1 ∈ R^{h×d}>>, <<b1 ∈ R^h>> and <<W2 ∈ R^{d×h}>>, <<b2 ∈ R^d>>. Here we use 100 hidden units, i.e. <<h = 100>>. We have
<<d = 784>>, the dimension of an MNIST image.
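The four equations above can be written out as a small pure-Python sketch. Two choices here are assumptions, not details given in the text: the nonlinearity σ is taken to be tanh, and each weight matrix has one extra column so that it can multiply the concatenated vector [z; t]. Small dimensions are used in place of d = 784 and h = 100.

```python
import math
import random

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def dynamics(x, t, W1, b1, W2, b2, sigma=math.tanh):
    """MLP dynamics f(x, t); sigma = tanh is an assumed nonlinearity."""
    z1 = [sigma(x_i) for x_i in x]
    h1 = [a + b for a, b in zip(matvec(W1, z1 + [t]), b1)]  # W1 [z1; t] + b1
    z2 = [sigma(h_i) for h_i in h1]
    return [a + b for a, b in zip(matvec(W2, z2 + [t]), b2)]  # W2 [z2; t] + b2

# Tiny illustrative sizes; the extra column in each matrix multiplies t.
random.seed(0)
d, h = 4, 3
W1 = [[random.gauss(0.0, 0.1) for _ in range(d + 1)] for _ in range(h)]
b1 = [0.0] * h
W2 = [[random.gauss(0.0, 0.1) for _ in range(h + 1)] for _ in range(d)]
b2 = [0.0] * d
print(dynamics([0.1, 0.2, 0.3, 0.4], 0.5, W1, b1, W2, b2))
```

The output has the same dimension as the input state, as required for it to serve as the right-hand side of an ODE.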
We train with a batch size of 100 for 160 epochs. We use the standard training set of 60,000 images,
and the standard test set of 10,000 images as a validation/test set. We optimize our model using SGD
with momentum with β = 0.9. Our learning rate schedule is 1e-1 for the first 60 epochs, 1e-2 until
epoch 100, 1e-3 until epoch 140, and 1e-4 for the final 20 epochs.
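The learning rate schedule just described is piecewise constant and can be stated as a short function. The function name is hypothetical, and counting epochs from 0 to 159 is an assumption about indexing.

```python
def learning_rate(epoch):
    """Piecewise-constant schedule over 160 epochs, with epochs counted from 0."""
    if epoch < 60:
        return 1e-1
    if epoch < 100:
        return 1e-2
    if epoch < 140:
        return 1e-3
    return 1e-4

print([learning_rate(e) for e in (0, 59, 60, 99, 100, 139, 140, 159)])
```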
B.3   Continuous Generative Modelling of Time-Series
The PhysioNet dataset consists of observations of 41 distinct traits over a time period of 48 hours.
We remove the parameters “Age”, “Gender”, “Height”, and “ICUType” as these attributes do not vary
in time. We also quantize the measurements for each attribute by the hour by averaging multiple
measurements within the same hour. This leaves 49 unique time stamps (the extra time stamp for
observations at exactly the endpoint of the 48 hour observation period). We report all our losses on
this quantized data. We performed this rather coarse quantization for computational reasons having
to do with our particular implementation of this model. The validation split was obtained by taking
a random split of 20% of the trajectories from the full dataset. In total there are 8000 trajectories.
Code is included for processing the dataset, and links to downloading the data may be found in the
code for Rubanova et al. (2019). All other experimental details may be found in the main body and
appendices of Rubanova et al. (2019).
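The hourly quantization can be sketched as follows for a single attribute. This is not the released preprocessing code: the function name is hypothetical, times are assumed to be given in hours from admission, and assigning each measurement to the hour obtained by rounding down is an assumption, chosen because it yields buckets 0 to 47 plus one extra bucket for observations at exactly 48 hours.

```python
from collections import defaultdict

def quantize_hourly(times_hours, values):
    """Average all measurements of one attribute that fall in the same hour."""
    buckets = defaultdict(list)
    for t, v in zip(times_hours, values):
        buckets[int(t)].append(v)  # round down; t = 48.0 gets its own bucket
    return {hour: sum(vs) / len(vs) for hour, vs in sorted(buckets.items())}

# Two measurements in hour 0 are averaged; the endpoint observation is kept.
print(quantize_hourly([0.2, 0.7, 1.5, 48.0], [1.0, 3.0, 5.0, 7.0]))
```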

                     B.4   Continuous Normalizing Flows

For the model trained on the MINIBOONE tabular dataset from Papamakarios et al. (2017), we used
the same architecture as in Table 4 in the appendix of Grathwohl et al. (2019). We chose the number
of epochs and a learning rate schedule based on manual tuning on the validation set, in contrast
to Grathwohl et al. (2019) who tuned these automatically using early stopping and an automatic
heuristic for the learning rate decay using evaluation on a validation set. In particular, we trained for
500 epochs with a learning rate of 1e-3 for the first 300 epochs, 1e-4 until epoch 425, and 1e-5
for the remaining 75 epochs. The number of epochs and learning rate schedule was determined by
evaluating the model on the validation set every 10 epochs, and decaying the learning rate by a factor
of 10 once the loss on the validation set stopped improving for several evaluations, with the goal of
matching or improving upon the log-likelihood reported in Grathwohl et al. (2019). The data was
obtained as made available from Papamakarios et al. (2017), which was already processed and split
into train/validation/test. In particular, the training set has 29556 examples, the validation set has
3284 examples, and the test set has 3648 examples, which consist of 43 features.
It is important to note that we implemented a single-flow model for the MNIST dataset, while the
original comparison in Finlay et al. (2020) was on a multi-flow model. This accounts for the discrepancy
between our bits/dim and NFE and those reported in Finlay et al. (2020).
All other experimental details are as in Grathwohl et al. (2019).

                              B.5   Hardware

MNIST supervised learning, PhysioNet time-series, and MNIST FFJORD experiments were trained
and evaluated on an NVIDIA Tesla P100 GPU. Tabular data FFJORD experiments were evaluated on an
NVIDIA Tesla P100 GPU but trained on an NVIDIA Tesla T4 GPU. All experiments except for MNIST
FFJORD were trained with double precision for purposes of reproducibility.

                     Appendix C         Additional Results

                               C.1   Overfitting of NFE


                                    <<FIGURE>>

                Figure 10: The difference in NFE is tracked by the variance of NFE.

In fig. 10 we note that there is a striking correspondence in the variance of NFE across individual
examples (in both the train set (dark red) and test set (light red)) and the absolute difference in NFE
between examples in the training set and test set. This suggests that any difference in the average
NFE between training examples and test examples is explained by noise in the estimate of the true
average NFE. It is also interesting that speed regularization does not have a monotonic relationship
with the variance of NFE. We speculate that this may interact with the relationship between the NFE
for a particular example and how difficult that example is for the model to classify correctly.

                     C.2         Trading off function evaluations with a surrogate loss

In fig. 11 and fig. 12 we confirm that our method poses a suitable tradeoff not only on the loss being
optimized, but also on the potentially non-differentiable loss which we truly care about. On MNIST,
we get a similar Pareto curve when plotting classification error as opposed to cross-entropy loss, and
similarly on the time-series modelling task we see that we get a similar Pareto curve on MSE loss as
compared to IWAE loss. The Pareto curves are plotted for R3 and R2, respectively.

                                                   <<FIGURE>>

                                         Figure 11: MNIST Classification                                                                                                 
                                         
                                                   <<FIGURE>>

                                         Figure 12: Physionet Time-Series

                                    C.3         Wall-clock Time

We include additional tables with wall-clock time and training with fixed grid solvers in table 3 and
table 4.


                           Appendix D          Comparison to How to Train Your Neural ODE

The terms from Finlay et al. (2020) are

                                    <<FORMULA>>

and an estimate of
                                    <<FORMULA>>

                       Table 3: Classification on MNIST

                                    <<TABLE>>

These are combined with a weighted average and integrated along the solution trajectory.
These terms are motivated by the expansion

                                    <<FORMULA>>

Namely, eq. (3) regularizes the first total derivative of the solution, f (z(t), t), along the trajectory, and
eq. (4) regularizes a stochastic estimate of the Frobenius norm of the spatial derivative, ∇z f (z(t), t),
along the solution trajectory.
In contrast, R2 regularizes the norm of the second total derivative directly. In particular, this takes
into account the ∂f/∂t term. In other words, this accounts for the explicit dependence of f on time,
while eq. (3) and eq. (4) capture only the implicit dependence on time through z(t).
Even in the case of an autonomous system, that is, where ∂f/∂t is identically 0 and the dynamics f only
depend implicitly on time, these terms still differ. Namely, R2 integrates the following along the
solution trajectory:

                                       <<FORMULA>>

while Finlay et al. (2020) penalizes the respective norms of the matrix ∇z f (z(t), t) and vector
f (z(t), t) separately.
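
To make the distinction concrete, the following is a small sketch for an autonomous linear field f(z) = Az, chosen purely for illustration. In that case the second total derivative of the solution is ∇z f · f = A(Az). The helper names, the use of squared norms, and the two example matrices are assumptions, not part of either paper.

```python
def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def sq_norm(v):
    return sum(x * x for x in v)


def second_derivative_penalty(A, z):
    """||(grad_z f) f||^2 for the autonomous linear field f(z) = A z."""
    f = matvec(A, z)              # first derivative of the solution: f(z) = A z
    return sq_norm(matvec(A, f))  # second derivative of the solution: A f


def separate_penalties(A, z):
    """(||f||^2, ||grad_z f||_F^2), i.e. the two norms penalised separately."""
    f = matvec(A, z)
    frob_sq = sum(a * a for row in A for a in row)
    return sq_norm(f), frob_sq


A_rot = [[0.0, -1.0], [1.0, 0.0]]    # rotation: curved trajectories
A_shear = [[0.0, 1.0], [0.0, 0.0]]   # shear: straight-line trajectories

rot_joint = second_derivative_penalty(A_rot, [1.0, 0.0])
rot_separate = separate_penalties(A_rot, [1.0, 0.0])
shear_joint = second_derivative_penalty(A_shear, [0.0, 1.0])
shear_separate = separate_penalties(A_shear, [0.0, 1.0])
```

For the shear field the particle moves in a straight line at constant speed, so the second derivative vanishes, yet both separately penalised norms are nonzero. This is one simple way the two kinds of terms can disagree.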

                     Table 4: Density Estimation on Tabular Data (MINIBOONE)

                                       <<TABLE>>

<|endoftext|>




          How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization

                Chris Finlay 1 Jörn-Henrik Jacobsen 2 Levon Nurbekyan 3 Adam M Oberman 1
                                                                                                                        
                    Abstract

         Training neural ODEs on large datasets has not
         been tractable due to the necessity of allowing
         the adaptive numerical ODE solver to refine its
         step size to very small values. In practice this
         leads to dynamics equivalent to many hundreds
         or even thousands of layers. In this paper, we
         overcome this apparent difficulty by introducing
         a theoretically-grounded combination of both op-
         timal transport and stability regularizations which
         encourage neural ODEs to prefer simpler dynam-
         ics out of all the dynamics that solve a problem
         well. Simpler dynamics lead to faster conver-
         gence and to fewer discretizations of the solver,
         considerably decreasing wall-clock time without
         loss in performance. Our approach allows us to
         train neural ODE-based generative models to the
         same performance as the unregularized dynamics,                 
         with significant reductions in training time. This
         brings neural ODEs closer to practical relevance             
         in large-scale applications.

                                       <<FIGURE>>   

                        Figure 1. Optimal transport map and a generic normalizing flow.

                                                                      Indeed, it was observed that there is a striking similarity
 1. Introduction                                                      between ResNets and the numerical solution of ordinary
                                                                        differential equations (E, 2017; Haber & Ruthotto, 2017;
Recent research has bridged dynamical systems, a                     Ruthotto & Haber, 2018; Chen et al., 2018; 2019). In these
workhorse of mathematical modeling, with neural networks,            works, deep networks are interpreted as discretizations of
the de facto function approximator for high dimensional data.        an underlying dynamical system, where time indexes the
The great promise of this pairing is that the vast mathemat-         “depth” of the network and the parameters of the discretized
ical machinery stemming from dynamical systems can be                dynamics are learned. An alternate viewpoint was taken by
leveraged for modelling high dimensional problems in a               neural ODEs (Chen et al., 2018), where the dynamics of
dimension-independent fashion.                                       the neural network are approximated by an adaptive ODE
Connections between neural networks and ordinary differ-             solver on the fly. This latter approach is quite compelling
ential equations (ODEs) were almost immediately noted                as it does not require specifying the number of layers of the
after residual networks (He et al., 2016) were first proposed.       network beforehand. Furthermore, it allows the learning of
                                                                     homeomorphisms without any structural constraints on the
                                                                     function computed by the residual block.
                                                                     Neural ODEs have shown great promise in the physical sciences 
                                                                     (Köhler et al., 2019), in modeling irregular time series
                                                                     (Rubanova et al., 2019), mean field games (Ruthotto et al.,
                                                                     2019), continuous-time modeling (Yildiz et al., 2019; Kanaa
                                                                     et al., 2019), and for generative modeling through normaliz-
                                                                     ing flows with free-form Jacobians (Grathwohl et al., 2019).


Recent work has even adapted neural ODEs to the stochas-        based on (ODE) which abstain from a priori fixing step-size.
tic setting (Li et al., 2020). Despite these successes, some    Chen et al.’s method is a continuous-time generalization of
hurdles still remain. In particular, although neural ODEs are   residual networks, where the dynamics are generated by an
memory efficient, they can take a prohibitively long time to    adaptive ODE solver that chooses step-size on-the-fly.
train, which is arguably one of the main stumbling blocks
                                                                Because of their adaptive nature, neural ODEs can be more
towards their widespread adoption.
                                                                flexible than ResNets in certain scenarios, such as when
In this work we reduce the training time of neural ODEs         trading between model speed and accuracy. Moreover given
by regularizing the learned dynamics, complementing other       a fixed network depth, the memory footprint of neural ODEs
recent approaches to this end such as augmented neural          is orders of magnitude smaller than a standard ResNet dur-
ODEs (Dupont et al., 2019). Without further constraints on      ing training. They therefore show great potential on a host
their dynamics, high dimensional neural ODEs may learn          of applications, including generative modeling and density
dynamics which minimize an objective function, but which        estimation. An apparent drawback of neural ODEs is their
generate irregular solution trajectories. See for example       long training time: although a learned function f (· ; θ) may
Figure 1b, where an unregularized flow exhibits undesirable     generate a map that solves a problem particularly well, the
properties due to unnecessarily fluctuating dynamics. As        computational cost of numerically integrating (ODE) may
a solution, we propose two theoretically motivated regular-     be so prohibitive that it is not tractable in practice. In this
ization terms arising from an optimal transport viewpoint       paper we demonstrate this need not be so: with proper reg-
of the learned map, which encourage well-behaved dynam-         ularization, it is possible to learn f (· ; θ) so that (ODE) is
ics (see 1a left). We empirically demonstrate that proper       easily and quickly solved.
regularization leads to significant speed-up in training time
without loss in performance, thus bringing neural ODEs          2.1. FFJORD
closer to deployment on large-scale datasets. Our methods
are validated on the problem of generative modelling and        In density estimation and generative modeling, we wish
density estimation, as an example of where neural ODEs          to estimate an unknown data distribution p(x) from which
have shown impressive results, but could easily be applied      we have drawn N samples. Maximum likelihood seeks to
elsewhere.                                                      approximate p(x) with a parameterized distribution pθ (x)
                                                                by minimizing the Kullback-Leibler divergence between the
In summary, our proposed regularized neural ODE (RN-            two, or equivalently minimizing
ODE) achieves the same performance as the baseline, while
reducing the wall-clock training time by many hours or even                                    
days.                                                                               <<FORMULA>>              (1)
                                                                                             
2. Neural ODEs & Continuous normalizing                         Continuous normalizing flows (Grathwohl et al., 2019; Chen
    flows                                                       et al., 2018) parameterize pθ (x) using a vector field f :
                                                                R^d × R → R^d as follows. Let z(x, T ) be the solution map
Neural ODEs simplify the design of deep neural networks         given by running the dynamics (ODE) for fixed time T .
by formulating the forward pass of a deep network as the        Suppose we are given a known distribution q at final time T ,
solution of an ordinary differential equation. Initial work     such as the normal distribution. Change of variables tells us
along these lines was motivated by the similarity of the eval-  that the distribution pθ (x) may be evaluated through
uation of one layer of a ResNet and the Euler discretization
of an ODE. Suppose the block in the t-th layer of a ResNet          <<log pθ (x) = log q (z(x, T )) + log det | ∇ z(x, T )|>>          (2)
is given by the function f (x, t; θ), where θ are the block’s
parameters. Then the evaluation of this layer of the ResNet     Evaluating the log determinant of the Jacobian is difficult.
is simply x_{t+1} = x_t + f (x_t , t; θ). Now, instead consider      Grathwohl et al. (2019) exploit the following identity from
the following ODE                                               fluid mechanics (Villani, 2003, p 114)

                <<FORMULA>>                        (ODE)                 <<∂/∂t log det | ∇ z(x, t)| = div (f ) (z(x, t), t)>>       (3)

The Euler discretization of this ODE with step-size <<τ>> is        where <<div(·)>> is the divergence operator, <<div(f ) (x) =
<<z_{t+1} = z_t + τ f (z_t , t; θ)>>, which is nearly identical to the      Σ_i ∂_{x_i} f_i (x)>>. By the fundamental theorem of calculus, we
forward evaluation of the ResNet’s layer (setting step-size          1
                                                                       In the normalizing flow literature divergence is typically writ-
<<τ = 1>> gives equality). Armed with this insight, Chen et al.     ten explicitly as the trace of the Jacobian, however we use div (·)
(2018) suggested a method for training neural networks          which is more common elsewhere.
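
The correspondence between a residual layer and an Euler step described in Section 2 can be sketched as follows. The function `block` is a hypothetical stand-in for the residual block f(x, t; θ), with a single scalar parameter chosen only to keep the example small; it is not the architecture used in the paper.

```python
import math


def block(x, t, theta):
    """Hypothetical stand-in for a residual block f(x, t; theta)."""
    return [math.tanh(theta * xi + t) for xi in x]


def resnet_layer(x, t, theta):
    """One ResNet layer: x_{t+1} = x_t + f(x_t, t; theta)."""
    return [xi + fi for xi, fi in zip(x, block(x, t, theta))]


def euler_step(z, t, theta, tau):
    """One Euler step: z_{t+1} = z_t + tau * f(z_t, t; theta)."""
    return [zi + tau * fi for zi, fi in zip(z, block(z, t, theta))]


def euler_solve(z0, theta, T, n_steps):
    """Integrate the ODE on [0, T] with n_steps Euler steps of equal size."""
    tau = T / n_steps
    z = list(z0)
    for k in range(n_steps):
        z = euler_step(z, k * tau, theta, tau)
    return z


x = [0.1, -0.4, 2.0]
layer_out = resnet_layer(x, 0.0, 0.5)
euler_out = euler_step(x, 0.0, 0.5, 1.0)   # step size tau = 1
```

With step size 1 the Euler step and the residual layer produce the same output. Taking more, smaller steps with `euler_solve` corresponds to a deeper network that shares the same block.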

                                                         <<FIGURE>>
Figure 2. Log-likelihood (measured in bits/dim) on the validation set as a function of wall-clock time. Rolling average of three hours, with
90% confidence intervals.

may then rewrite (2) in integral form                                    From this simple motivating example, the need for regular-
                                                                         ity of the vector field is apparent. Without placing demands
                                                                         on the vector field f , it is entirely possible that the learned
 <<log pθ (x) = log q (z(x, T )) + ∫_0^T div (f ) (z(x, s), s) ds>>
                                                                         dynamics will be poorly conditioned. This is not just a theo-
                                                               (4)       retical exercise: because the dynamics must be solved with
Remark 2.1 (Divergence trace estimate). In (Grathwohl                    a numerical integrator, poorly conditioned dynamics will
et al., 2019), the divergence is estimated using an unbiased             lead to difficulties during numerical integration of (ODE).
Monte-Carlo trace estimate (Hutchinson, 1990; Avron &                    Indeed, later we present results demonstrating a clear corre-
Toledo, 2011),                                                           lation between the number of time steps an adaptive solver
                                                                         takes to solve (ODE), and the regularity of f .       
             <<FORMULA>>             (5)                                 How can the regularity of the vector field be measured? One
                                                                         motivating approach is to measure the force experienced by
                                                                         a particle z(t) under the dynamics generated by the vector
By using the substitution (4), the task of maximizing log-               field f , which is given by the total derivative of f with
likelihood shifts from choosing pθ to minimize (1), to learn-            respect to time
ing the flow generated by a vector field f . This results in a
normalizing flow with a free-form Jacobian and reversible                      
dynamics, and was named FFJORD by Grathwohl et al.                                           <<FORMULA>>                      (6)
                                                                                      
2.2. The need for regularity                                                                  <<FORMULA>>                  (7)

The vector field learned through FFJORD that maximizes                  Well conditioned flows will place constant, or nearly con-
the log-likelihood is not unique, and raises troubling prob-             stant, force on particles as they travel. Thus, in this work we
lems related to the regularity of the flow. For a simple                 propose regularizing the dynamics with two penalty terms,
example, refer to Figure 1, where we plot two normaliz-                  one term regularizing f and the other ∇ f . The first penalty,
ing flows, both mapping a toy one-dimensional distribution               presented in Section 3, is a measure of the distance travelled
to the unit Gaussian, and where both maximize the log-                   under the flow f , and can alternately be interpreted as the
likelihood of exactly the same sample of particles. Figure               kinetic energy of the flow. This penalty term is based off
1a presents a “regular” flow, where particles travel in straight         of numerical methods in optimal transport, and encourages
lines with constant speed. In contrast, Figure 1b                        particles to travel in straight lines with constant speed. The
shows a flow that still maximizes the log-likelihood, but                second penalty term, discussed in Section 4, performs regu-
that has undesirable properties, such as rapidly varying local           larization on the Jacobian of the vector field. Taken together
trajectories and non-constant speed.                                     the two terms ensure that the force experienced by a particle
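
The Monte-Carlo divergence estimate of Remark 2.1 can be illustrated on a linear vector field f(z) = Az, whose Jacobian is the constant matrix A and whose divergence is tr(A). This is a sketch under that simplifying assumption: the matrix, the sample count, and the helper names are illustrative, and a real implementation would obtain the vector-Jacobian product from automatic differentiation rather than from an explicit matrix.

```python
import random


def vjp(A, eps):
    """Vector-Jacobian product eps^T A for the constant Jacobian A."""
    d = len(A)
    return [sum(eps[i] * A[i][j] for i in range(d)) for j in range(len(A[0]))]


def hutchinson_divergence(A, n_samples, rng):
    """Average of eps^T A eps over Gaussian probes eps; its expectation is tr(A)."""
    d = len(A)
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        row = vjp(A, eps)
        total += sum(r * e for r, e in zip(row, eps))
    return total / n_samples


A = [[0.5, 2.0, 0.0],
     [-1.0, -0.2, 0.3],
     [0.0, 0.4, 0.1]]
exact_divergence = sum(A[i][i] for i in range(3))
estimated_divergence = hutchinson_divergence(A, 20000, random.Random(0))
```

The estimate is unbiased but noisy, so a single probe per sample gives a stochastic divergence whose error averages out over many samples.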

under the flow is constant or nearly so.                         3.1. Linking normalizing flows to optimal transport

These two regularizers will promote dynamics that follow         Now suppose we wish to minimize (9a), with q(z) a unit
numerically easy-to-integrate paths, thus greatly speeding       normal distribution, and p(x) a data distribution, unknown
up training time.                                                to us, but from which we have drawn N samples, and which
                                                                 we model as a discrete distribution of Dirac masses. Enforc-
3. Optimal transport maps &                                      ing the initial condition is trivial because we have sampled
                                                                 from p directly. The continuity equation (9b) need not be
   Benamou-Brenier
                                                                 enforced because we are tracking a finite number of sam-
There is a remarkable similarity between density estimation      pled particles. However the final time condition ρT = q
using continuous time normalizing flows, and the calcula-        cannot be implemented directly, since we do not have di-
tion of the optimal transport map between two densities          rect control on the form ρT (z) takes. Instead, introduce
using the Benamou-Brenier formulation (Benamou & Bre-            a Kullback-Leibler term to (9a) penalizing discrepancy
nier, 2000; Santambrogio, 2015). While a review of optimal       between ρT and q. This penalty term has an elegant simpli-
transport theory is far outside the scope of this paper, here    fication when p(x) is modeled as a distribution of a finite
we provide an informal summary of key ideas relevant to          number of masses, as is done in generative modeling. Set-
continuous normalizing flows. The quadratic-cost optimal         ting ρ0 = pθ a brief derivation yields
transport map between two densities p(x) and q(x) is a map       
z : R^d → R^d minimizing the transport cost                       
                                                                                <<FORMULA>>                 (10)

              <<FORMULA>>                  (8)
                                                                 With this simplification (9a) becomes

subject to the constraint that ∫_A q(z) dz = ∫_{z^{-1}(A)} p(x) dx, 
in other words that the measure of any set A is preserved                      
under the map z. In a seminal work, Benamou & Brenier                            <<FORMULA>>                (11)
(2000) showed that rather than solving for minimizers of (8)
directly, an indirect (but computationally efficient) method
is available by writing z(x, T ) as the solution map of a
flow under a vector field f (as in (ODE)) for time T , by        For further details on this derivation consult the supplemen-
minimizing                                                       tary materials.
                                                                 The connection between the Benamou-Brenier formulation
                      <<FORMULA>>                          (9a)   of the optimal transport problem on a discrete set of points
                                                                 and continuous normalizing flows is apparent: the optimal
                                                                 transport problem (11) is a regularized form of the continu-
                      <<FORMULA>>                          (9b)  ous normalizing flow optimization problem (1). We there-
                      <<ρ0 (x) = p>>,                      (9c)  fore expect that adding a kinetic energy regularization term
                      <<ρT (z) = q>>.                      (9d)  to FFJORD will encourage solution trajectories to prefer
                                                                 straight lines with constant speed.
The objective function (9a) is a measure of the kinetic
energy of the flow. The constraint (9b) ensures probability
mass is conserved. The latter two constraints guarantee the      4. Unbiased Frobenius norm regularization of
learned distribution agrees with the source p and target q.          the Jacobian
Note that the kinetic energy (9a) is an upper bound on the
                                                                 Referring to equation (7), one can see that even if f is regu-
transport cost, with equality only at optimality.
                                                                 larized to be small, via a kinetic energy penalty term, if the
The optimal flow f minimizing (9) has several particularly      Jacobian is large then the force experienced by a particle
appealing properties. First, particles induced by the opti-      may also still be large. As a result, the error of the numerical
mal flow f travel in straight lines. Second, particles travel    integrator can be large, which may lead an adaptive solver
with constant speed. Moreover, under suitable conditions         to make many function evaluations. This relationship is
on the source and target distributions, the optimal solution     apparent in Figure 3, where we empirically demonstrate the
map is unique (Villani, 2008). Therefore the solution map        correlation between the number of function evaluations of
z(x, t) is entirely characterized by the initial and final posi- f taken by the adaptive solver, and the size of the Jacobian
tions: z(x, t) = (1 − t/T ) z(x, 0) + (t/T ) z(x, T ). Consequently,   norm of f . The correlation is remarkably strong: dynamics
given an optimal f it is extraordinarily easy to solve (ODE)     governed by a poorly conditioned Jacobian matrix require
numerically with minimal computational effort.                   the adaptive solver to take many small time steps.
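
The claim that straight, constant-speed trajectories minimize kinetic energy can be checked numerically for a single particle. The sketch below compares a straight path with a perturbed path sharing the same endpoints. By the Cauchy-Schwarz inequality, the integral of the squared speed over [0, T] is at least the squared displacement divided by T, with equality for the straight path. The endpoints, the perturbation, and the per-particle normalization are assumptions chosen for illustration and may differ by a constant factor from the objective in the text.

```python
import math


def kinetic_energy(path, T, n=2000):
    """Approximate the integral of |dz/dt|^2 over [0, T] by finite differences."""
    dt = T / n
    total = 0.0
    prev = path(0.0)
    for k in range(1, n + 1):
        cur = path(k * dt)
        # |displacement / dt|^2 * dt, summed over the uniform grid
        total += sum((c - p) ** 2 for c, p in zip(cur, prev)) / dt
        prev = cur
    return total


T = 1.0
z0, zT = [0.0, 0.0], [1.0, 2.0]


def straight(t):
    """z(t) = (1 - t/T) z0 + (t/T) zT, the constant-speed straight line."""
    s = t / T
    return [(1 - s) * a + s * b for a, b in zip(z0, zT)]


def wiggly(t):
    """Same endpoints, but with an oscillation that vanishes at t = 0 and t = T."""
    base = straight(t)
    bump = 0.3 * math.sin(4 * math.pi * t / T)
    return [base[0] + bump, base[1] - bump]


transport_cost = sum((b - a) ** 2 for a, b in zip(z0, zT)) / T
energy_straight = kinetic_energy(straight, T)
energy_wiggly = kinetic_energy(wiggly, T)
```

The straight path attains the lower bound, while the oscillating path reaches the same endpoint at a much higher kinetic energy, which is the behaviour the kinetic regularizer is meant to discourage.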


Algorithm 1 RNODE: regularized neural ODE training of
FFJORD
         <<ALGORITHM>>

                                                                                 <<FIGURE>>

                                                                 Figure 3. Number of function evaluations vs Jacobian Frobenius
                                                                norm of flows on CIFAR10 during training with vanilla FFJORD,
                                                                 using an adaptive ODE solver.
                                                                 Avron & Toledo, 2011). For real matrix B, an unbiased
                     <<FORMULA>>                                 estimate of the trace is given by

                                                                                    <<FORMULA>>                 (14)

                                                                 where <<FORMULA>> is drawn from a unit normal distribution. 
                                                                 Thus the squared Frobenius norm can be easily estimated by 
                                                                 setting B = AA^T.
Moreover, in particle-based methods, the kinetic energy          Turning to the Jacobian <<FORMULA>> of a vector valued func-
term forces dynamics to travel in straight lines only on         tion f : R^d → R^d , recall that the vector-Jacobian product
data seen during training, and so the regularity of the map      <<FORMULA>> may be quickly computed through reverse-mode
is only guaranteed on trajectories taken by training data.       automatic differentiation. Therefore an unbiased Monte-
The issue here is one of generalization: the map may be          Carlo estimate of the Frobenius norm of the Jacobian is
irregular on off-distribution or perturbed images, and cannot    readily available
be remedied by the kinetic energy term during training alone.
In the context of generalization, Jacobian regularization is                    <<FORMULA>>                         (15)
analogous to gradient regularization, which has been shown       
to improve generalization (Drucker & LeCun, 1992; Novak                         <<FORMULA>>                         (16)
et al., 2018).

For these reasons, we also propose regularizing the Jacobian     Conveniently, in the FFJORD framework the quantity
through its Frobenius norm. The Frobenius norm ‖ · ‖_F of a      <<FORMULA>> must be computed during the estimate of the prob-
real matrix A can be thought of as the ℓ2 norm of the matrix     ability distribution under the flow, in the Monte-Carlo esti-
A vectorized                                                     mate of the divergence term (5). Thus Jacobian Frobenius
                      <<FORMULA>>                         (12)   norm regularization is available with essentially no extra
                                                                 computational cost.
Equivalently it may be computed as
                                                                 5. Algorithm description
                               
                     <<‖A‖_F = \sqrt{tr(A A^T)}>>          (13)   All together, we propose modifying the objective function
                                                                 of the FFJORD continuous normalizing flow (Grathwohl
and is the Euclidean norm of the singular values of a matrix.    et al., 2019) with the two regularization penalties of Sec-
In trace form, the Frobenius norm lends itself to estimation     tions 3 & 4. The proposed method is called RNODE, short
using a Monte-Carlo trace estimator (Hutchinson, 1990;           for regularized neural ODE. Pseudo-code of the method is
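
The trace form of the Frobenius norm and its Monte-Carlo estimate from Section 4 can be sketched on an explicit matrix. Here the matrix A plays the role of the Jacobian, and the product ε^T A stands in for the vector-Jacobian product that reverse-mode automatic differentiation would supply. The matrix, sample count, and function names are assumptions for illustration.

```python
import random


def frobenius_sq(A):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(a * a for row in A for a in row)


def matmul_transpose(A):
    """Return the matrix A A^T."""
    return [[sum(x * y for x, y in zip(ri, rj)) for rj in A] for ri in A]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def frobenius_sq_estimate(A, n_samples, rng):
    """Mean of ||eps^T A||^2 over Gaussian probes; its expectation is tr(A A^T)."""
    d = len(A)
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [sum(eps[i] * A[i][j] for i in range(d)) for j in range(len(A[0]))]
        total += sum(x * x for x in v)
    return total / n_samples


A = [[0.5, 2.0, 0.0],
     [-1.0, -0.2, 0.3],
     [0.0, 0.4, 0.1]]
exact = frobenius_sq(A)
via_trace = trace(matmul_transpose(A))
estimate = frobenius_sq_estimate(A, 20000, random.Random(0))
```

Because the estimate only needs ε^T A, it can reuse the same vector-Jacobian product that the divergence estimate already computes.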

                                                         <<TABLE>>

Table 1. Log-likelihood (in bits/dim) and training time (in hours) on validation images with uniform dequantization. Results on clean
images are found in the supplemental materials. For comparison we report both the results of the original FFJORD paper (Grathwohl
et al., 2019) and our own independent run of FFJORD (“vanilla”) on CIFAR10 and MNIST. Vanilla FFJORD did not train on ImageNet64
(denoted by “x”). Also reported are results for other flow-based generative modeling papers. Our method (FFJORD with RNODE) has
comparable log-likelihood as FFJORD but is significantly faster.
   

                                                              <<FIGURE>>

Figure 4. Quality of generated samples on 5bit CelebA-HQ64 with RNODE. Here temperature annealing (Kingma & Dhariwal,
2018) with T = 0.7 was used to generate visually appealing images. For full sized CelebA-HQ256 samples, consult the supplementary
materials.

presented in Algorithm 1. The optimization problem to be                   Here E, l, and n are respectively the kinetic energy, the
solved is                                                                  log determinant of the Jacobian, and the integral of the
                                                                           Frobenius norm of the Jacobian.
                                                                           Both the divergence term and the Jacobian Frobenius norm
                                                                           are approximated with Monte-Carlo trace estimates. In our
                     <<FORMULA>>                                           implementation, the Jacobian Frobenius estimate reuses
                                                                           the computation ε^T ∇ f from the divergence estimate for
                                                                           efficiency. We remark that the kinetic energy term only
                     <<FORMULA>>                                           requires the computation of a dot product. Thus just as
                                                                           in FFJORD, our implementation scales linearly with the
                     <<FORMULA>>             (17)                          number of time steps taken by the ODE solver.

                                                                           Gradients of the objective function with respect to the net-
where z(x, t) is determined by numerically solving (ODE).                  work parameters are computed using the adjoint sensitivity
Note that we take the mean over number of samples and                      method (Pontryagin et al., 1962; Chen et al., 2018).
input dimension. This is to ensure that the choice of regu-
larization strength λK and λJ is independent of dimension
size and sample size.                                                      6. Experimental design
To compute the three integrals and the log-probability under               Here we demonstrate the benefits of regularizing neural
q of z(x, T ) at final time T , we augment the dynamics of                 ODEs on generative models, an application where neu-
the ODE with three extra terms, so that the entire system                  ral ODEs have shown strong empirical performance. We
solved by the numerical integrator is                                      use four datasets: CIFAR10 (Krizhevsky & Hinton, 2009),
                                                                           MNIST (LeCun & Cortes, 1998), downsampled ImageNet
                                                                           (64x64) (van den Oord et al., 2016), and 5bit CelebA-HQ
                                                                           (256x256) (Karras et al., 2017). We use an identical neural
        <<FORMULA>>                                   (RNODE)              architecture to that of Grathwohl et al. (2019). The dynamics
                                                                           (Kingma & Dhariwal, 2018) trained with 40 GPUs for a week;
                                                                           in contrast we train with four GPUs in just under a week.
 
                                                      <<FIGURE>>

Figure 5. Ablation study of the effect of the two regularizers, comparing two measures of flow regularity during training with a fixed
step-size ODE solver. Figure 5a: mean Jacobian Frobenius norm as a function of training epoch. Figure 5b: mean kinetic energy of the
flow as a function of training epoch. Figure 5c: number of function evaluations.

are defined by a neural network <<f(z, t; \theta(t)) : \mathbb{R}^d \times \mathbb{R}_+ \mapsto \mathbb{R}^d>>
where <<θ(t)>> is piecewise constant in time. On MNIST we use 10 pieces; CIFAR10 uses 14;
downsampled ImageNet uses 18; and CelebA-HQ uses 26 pieces. Each piece is a 4-layer deep
convolutional network comprised of 3x3 kernels and softplus activation functions. Intermediary
layers have 64 hidden dimensions, and time t is concatenated to the spatial input z. The
integration time of each piece is [0, 1]. Weight matrices are chosen to imitate the multi-scale
architecture of Real NVP (Dinh et al., 2017), in that images are ‘squeezed’ via a permutation to
halve image height and width but quadruple the number of channels. Divergence of f is estimated
using the Gaussian Monte-Carlo trace estimator with one sample of fixed noise per solver
time-step.

On MNIST and CIFAR10 we train with a batch size of 200 and train for 100 epochs on a single
GPU³, using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e−3. On the two
larger datasets, we train with four GPUs, using a per-GPU batch size of respectively 3 and 50
for CelebA-HQ and ImageNet. Data is preprocessed by perturbing with uniform noise followed by
the logit transform.

The reference implementation of FFJORD solves the dynamics using a Runge-Kutta 4(5) adaptive
solver (Dormand & Prince, 1980) with error tolerances 1e−5 and initial step size 1e−2. We have
found that using less accurate solvers on the reference implementation of FFJORD results in
numerically unstable training dynamics. In contrast, a simple fixed-grid four stage Runge-Kutta
solver suffices for RNODE during training on MNIST and CIFAR10, using a step size of 0.25. The
step size was determined based on a simple heuristic of starting with 0.5 and decreasing the
step size by a factor of two until the discrete dynamics were stable and achieved good
performance. The Runge-Kutta 4(5) adaptive solver was used on the two larger datasets. We have
also observed that RNODE improves the training time of the adaptive solvers as well, requiring
many fewer function evaluations; however in Python we have found that the fixed grid solver is
typically quicker at a specified number of function evaluations. At test time RNODE uses the
same adaptive solver as FFJORD.

We always initialize RNODE so that <<f(z, t) = 0>>; thus training begins with an initial
identity map. This is done by zeroing the parameters of the last layer in each piece (block),
following Goyal et al. (2017). The identity map is an appropriate choice because it has zero
transport cost and zero Frobenius norm. Moreover the identity map is trivially solvable by any
numerical solver, thus training begins without any effort required on the solver’s behalf.

On all datasets we set both the kinetic energy regularization coefficient λK and the Jacobian
norm coefficient λJ to 0.01.

7. Results
A comparison of RNODE against FFJORD and other flow-based generative models is presented in
Table 1. We report both our running of “vanilla” FFJORD and the results as originally reported
in (Grathwohl et al., 2019). We highlight that RNODE runs roughly 2.8x faster than FFJORD on
both datasets, while achieving or surpassing the performance of FFJORD. This can further be
seen in Figure 2 where we plot bits per dimension (<<-\frac{1}{d} \log_2 p(x)>>, a normalized
measure of log-likelihood) on the validation set as a function of training epoch, for both
datasets. Visual inspection of the sample quality reveals no qualitative difference between

                                          <<FIGURE>>

          Figure 6. Quality of generated samples with and without regularization on MNIST, left, and CIFAR10, right.

regularized and unregularized approaches; refer to Figure 6. Generated images for downsampled
ImageNet and CelebA-HQ are deferred to the supplementary materials; we provide smaller
generated images for networks trained on CelebA-HQ 64x64 in Figure 4.

Surprisingly, our run of “vanilla” FFJORD achieved slightly better performance than the results
reported in (Grathwohl et al., 2019). We suspect the discrepancy in performance and run times
between our implementation of FFJORD and that of the original paper is due to batch size:
Grathwohl et al. use a batch size of 900 and train on six GPUs, whereas on MNIST and CIFAR10 we
use a batch size of 200 and train on a single GPU.

We were not able to train vanilla FFJORD on ImageNet64, due to numerical underflow in the
adaptive solver’s time step. This issue cannot be remedied by increasing the solver’s error
tolerance, for this would bias the log-likelihood estimates on validation.

7.1. Ablation study on MNIST

In Figure 5, we compare the effect of each regularizer by itself on the training dynamics with
the fixed grid ODE solver on the MNIST dataset. Without any regularization at all, training
dynamics are numerically unstable and fail after just under 50 epochs. This is precisely when
the Jacobian norm grows large; refer to Figure 5a. Figure 5a demonstrates that each regularizer
by itself is able to control the Jacobian norm. The Jacobian regularizer is better suited to
this task, although it is interesting that the kinetic energy regularizer also improves the
Jacobian norm. Unsurprisingly, Figure 5b demonstrates that the addition of the kinetic energy
regularizer encourages flows to travel a minimal distance. In addition, we see that the
Jacobian norm alone also has a beneficial effect on the distance particles travel. Overall,
the results support our theoretical reasoning empirically.

8. Previous generative flows inspired by optimal transport

Zhang et al. (2018) define a neural ODE flow where the dynamics are given as the gradient of a
scalar potential function. This interpretation has deep connections to optimal transport: the
optimal transport map is the gradient of a convex potential function. Yang & Karniadakis (2019)
continue along these lines, and define an optimal transport map, again as a scalar potential
gradient. Yang & Karniadakis (2019) enforce that the learned map is in fact an optimal
transport map by penalizing their objective function with a term measuring violations of the
continuity equation. Ruthotto et al. (2019) place generative flows within a broader context of
mean field games, and as an example consider a neural ODE gradient potential flow solving the
optimal transport problem in up to 100 dimensions. We also note the recent work of Twomey et
al. (2019), who proposed regularizing neural ODEs with an Euler-step discretization of the
kinetic energy term to enforce ‘straightness’, although connections to optimal transport were
not discussed.

When a flow is the gradient of a scalar potential, the change of variables formula (4)
simplifies so that the divergence term is replaced by the Laplacian of the scalar potential.
Although mathematically parsimonious and theoretically well-motivated, we chose not to
implement our flow as the gradient of a scalar potential function due to computational
                       How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
constraints: such an implementation would require ‘triple       through CIFAR, and companies sponsoring the Vector Insti-
backprop’ (twice to compute or approximate the Laplacian,       tute (www.vectorinstitute.ai/#partners).
and once more for the parameter gradient). Ruthotto et al.
(2019) circumvented this problem by utilizing special struc-    References
tural properties of residual networks to efficiently compute
the Laplacian.                                                  Avron, H. and Toledo, S. Randomized algorithms for esti-
                                                                   mating the trace of an implicit symmetric positive semi-
                                                                   definite matrix. J. ACM, 58(2):8:1–8:34, 2011. doi:
9. Discussion
                                                                   10.1145/1944345.1944349. URL https://doi.org/
In practice, RNODE is simple to implement, and only re-            10.1145/1944345.1944349.
quires augmenting the dynamics (ODE) with two extra
scalar equations (one for the kinetic energy term, and an-      Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duve-
other for the Jacobian penalty). In the setting of FFJORD,         naud, D., and Jacobsen, J. Invertible residual networks.
because we may recycle intermediary terms used in the              In Chaudhuri, K. and Salakhutdinov, R. (eds.), Pro-
divergence estimate, the computational cost of evaluating          ceedings of the 36th International Conference on Ma-
these two extra equations is minimal. RNODE introduces             chine Learning, ICML 2019, 9-15 June 2019, Long
two extra hyperparameters related to the strength of the reg-      Beach, California, USA, volume 97 of Proceedings
ularizers; we have found these required almost no tuning.          of Machine Learning Research, pp. 573–582. PMLR,
                                                                   2019. URL http://proceedings.mlr.press/
Although the problem of classification was not considered          v97/behrmann19a.html.
in this work, we believe RNODE may offer similar im-
provements both in training time and the regularity of the      Benamou, J.-D. and Brenier, Y. A computational fluid me-
classifier learned. In the classification setting we expect the    chanics solution to the Monge-Kantorovich mass transfer
computational overhead of calculating the two extra terms          problem. Numerische Mathematik, 84(3):375–393, 2000.
should be marginal relative to gains made in training time.
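
As a rough illustration of how the two regularization terms can share work with the divergence
estimate, the sketch below uses a toy linear vector field f(z) = A z, whose Jacobian is simply
A. The names `A` and `regularization_terms` are illustrative and are not taken from the RNODE
code; a real implementation would presumably obtain the vector-Jacobian product by automatic
differentiation rather than from an explicit matrix.

```python
import numpy as np

def regularization_terms(A, z, eps):
    """Toy dynamics f(z) = A @ z with Jacobian A (hypothetical example).

    eps is one Gaussian noise vector, as in a one-sample trace estimate.
    """
    f = A @ z                # velocity at z
    vjp = eps @ A            # eps^T J, the quantity autodiff would return
    div_est = vjp @ eps      # eps^T J eps: estimates trace(J) = div f
    frob_est = vjp @ vjp     # ||eps^T J||^2: estimates ||J||_F^2
    kinetic = f @ f          # ||f||^2: a single dot product
    return div_est, frob_est, kinetic

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
z = rng.normal(size=d)

# Average many one-sample estimates to check that they are unbiased.
samples = np.array([regularization_terms(A, z, rng.normal(size=d))
                    for _ in range(20000)])
mean_div, mean_frob, kinetic = samples.mean(axis=0)
print("divergence estimate:", mean_div, "exact:", np.trace(A))
print("Frobenius estimate: ", mean_frob, "exact:", np.sum(A ** 2))
```

The same product `vjp` feeds both the divergence estimate and the Jacobian penalty, so in this
toy setting the penalty costs one extra dot product per evaluation.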
                                                                Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duve-
                                                                   naud, D. Neural Ordinary Differential Equations. In
10. Conclusion                                                     Advances in Neural Information Processing Systems 31:
We have presented RNODE, a regularized method for neu-             Annual Conference on Neural Information Processing
ral ODEs. This regularization approach is theoretically            Systems 2018, NeurIPS 2018, 3-8 December 2018,
well-motivated, and encourages neural ODEs to learn well-          Montréal, Canada, pp. 6572–6583, 2018. URL http:
behaved dynamics. As a consequence, numerical integration          //papers.nips.cc/paper/7892-neural-
of the learned dynamics is straightforward and relatively          ordinary-differential-equations.
easy, which means fewer discretizations are needed to solve     Chen, T. Q., Behrmann, J., Duvenaud, D., and Jacobsen,
the dynamics. In many circumstances, this allows for the re-       J. Residual flows for invertible generative modeling.
placement of adaptive solvers with fixed grid solvers, which       In Wallach, H. M., Larochelle, H., Beygelzimer,
can be more efficient during training. This leads to a sub-        A., d’Alché-Buc, F., Fox, E. B., and Garnett, R.
stantial speed up in training time, while still maintaining        (eds.), Advances in Neural Information Processing
the same empirical performance, opening the use of neural          Systems 32: Annual Conference on Neural Information
ODEs to large-scale applications.                                  Processing Systems 2019, NeurIPS 2019, 8-14 Decem-
                                                                   ber 2019, Vancouver, BC, Canada, pp. 9913–9923,
Acknowledgements                                                   2019.     URL http://papers.nips.cc/paper/
                                                                   9183-residual-flows-for-invertible-
C. F. and A. O. were supported by a grant from the Innova-         generative-modeling.
tive Ideas Program of the Healthy Brains and Healthy Lives
initiative (HBHL) through McGill University.                    Dinh, L., Sohl-Dickstein, J., and Bengio, S. Den-
L. N. was supported by AFOSR MURI FA9550-18-1-0502,                sity estimation using real NVP. In 5th International
AFOSR Grant No. FA9550-18-1-0167, and ONR Grant No.                Conference on Learning Representations, ICLR 2017,
N00014-18-1-2527.                                                  Toulon, France, April 24-26, 2017, Conference Track Pro-
                                                                   ceedings, 2017. URL https://openreview.net/
A. O. was supported by the Air Force Office of Scientific          forum?id=HkpbnH9lx.
Research under award number FA9550-18-1-0167
                                                                Dormand, J. R. and Prince, P. J. A family of embedded
Resources used in preparing this research were provided, in        Runge-Kutta formulae. Journal of computational and
part, by the Province of Ontario, the Government of Canada         applied mathematics, 6(1):19–26, 1980.
Drucker, H. and LeCun, Y. Improving generalization per-      Hutchinson, M. F. A stochastic estimator of the trace of the
  formance using double backpropagation. IEEE Trans.            influence matrix for Laplacian smoothing splines. Com-
  Neural Networks, 3(6):991–997, 1992. doi: 10.1109/            munications in Statistics-Simulation and Computation,
  72.165600.       URL https://doi.org/10.1109/                 19(2):433–450, 1990.
  72.165600.
                                                             Kanaa, D., Voleti, V., Kahou, S., and Pal, C. Simple video
Dupont, E., Doucet, A., and Teh, Y. W. Augmented                generation using neural ODEs. Workshop on Learning
  neural ODEs. In Wallach, H. M., Larochelle, H.,               with Rich Experience, Advances in Neural Information
  Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Gar-       Processing Systems 32: Annual Conference on Neural
  nett, R. (eds.), Advances in Neural Information Pro-          Information Processing Systems 2019, NeurIPS 2019,
  cessing Systems 32: Annual Conference on Neural               8-14 December 2019, Vancouver, BC, Canada, 2019.
  Information Processing Systems 2019, NeurIPS 2019,
  8-14 December 2019, Vancouver, BC, Canada, pp.             Karras, T., Aila, T., Laine, S., and Lehtinen, J. Pro-
  3134–3144, 2019. URL http://papers.nips.cc/                   gressive growing of gans for improved quality, stabil-
  paper/8577-augmented-neural-odes.                             ity, and variation. CoRR, abs/1710.10196, 2017. URL
                                                                http://arxiv.org/abs/1710.10196.
E, W. A Proposal on Machine Learning via Dynam-
  ical Systems. Communications in Mathematics and            Kingma, D. P. and Ba, J. Adam: A method for stochastic op-
  Statistics, 5(1):1–11, March 2017. ISSN 2194-671X.            timization. In 3rd International Conference on Learning
  doi: 10.1007/s40304-017-0103-z. URL https://                  Representations, ICLR 2015, San Diego, CA, USA, May
  doi.org/10.1007/s40304-017-0103-z.                            7-9, 2015, Conference Track Proceedings, 2015. URL
                                                                http://arxiv.org/abs/1412.6980.
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P.,
                                                             Kingma, D. P. and Dhariwal, P. Glow: Generative flow
  Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
                                                                with invertible 1x1 convolutions. In Bengio, S., Wallach,
  He, K. Accurate, large minibatch SGD: training ima-
                                                                H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N.,
  genet in 1 hour. CoRR, abs/1706.02677, 2017. URL
                                                                and Garnett, R. (eds.), Advances in Neural Information
  http://arxiv.org/abs/1706.02677.
                                                                Processing Systems 31: Annual Conference on Neural
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever,      Information Processing Systems 2018, NeurIPS 2018,
  I., and Duvenaud, D. FFJORD: free-form continu-               3-8 December 2018, Montréal, Canada, pp. 10236–
  ous dynamics for scalable reversible generative mod-          10245, 2018.        URL http://papers.nips.cc/
  els. In 7th International Conference on Learning Rep-         paper/8224-glow-generative-flow-with-
  resentations, ICLR 2019, New Orleans, LA, USA, May            invertible-1x1-convolutions.
  6-9, 2019, 2019. URL https://openreview.net/
                                                             Köhler, J., Klein, L., and Noé, F. Equivariant flows: sam-
  forum?id=rJxgknCcK7.
                                                                pling configurations for multi-body systems with sym-
Haber, E. and Ruthotto, L. Stable architectures for deep        metric energies. arXiv preprint arXiv:1910.00753, 2019.
  neural networks. Inverse Problems, 34(1):014004, 2017.     Krizhevsky, A. and Hinton, G.              Learning multiple
He, K., Zhang, X., Ren, S., and Sun, J. Deep resid-             layers of features from tiny images. Technical re-
  ual learning for image recognition. In 2016 IEEE              port, University of Toronto, 2009. URL http://
  Conference on Computer Vision and Pattern Recognition,        www.cs.toronto.edu/~kriz/cifar.html.
  CVPR 2016, Las Vegas, NV, USA, June 27-30,                 LeCun, Y. and Cortes, C. The MNIST database of handwritten
  2016, pp. 770–778. IEEE Computer Society, 2016. doi:          digits. 1998. URL http://yann.lecun.com/
  10.1109/CVPR.2016.90. URL https://doi.org/                    exdb/mnist/.
  10.1109/CVPR.2016.90.
                                                             Li, X., Wong, T. L., Chen, R. T. Q., and Duvenaud, D. Scal-
Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P.        able gradients for stochastic differential equations. CoRR,
  Flow++: Improving flow-based generative models with           abs/2001.01328, 2020. URL http://arxiv.org/
  variational dequantization and architecture design. In        abs/2001.01328.
  Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings
  of the 36th International Conference on Machine Learn-     Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and
  ing, ICML 2019, 9-15 June 2019, Long Beach, California,       Sohl-Dickstein, J. Sensitivity and generalization in neural
  USA, volume 97 of Proceedings of Machine Learning             networks: an empirical study. In 6th International Con-
  Research, pp. 2722–2730. PMLR, 2019. URL http:                ference on Learning Representations, ICLR 2018, Van-
  //proceedings.mlr.press/v97/ho19a.html.                       couver, BC, Canada, April 30 - May 3, 2018, Conference
  Track Proceedings. OpenReview.net, 2018. URL https:            Processing Systems 2019, NeurIPS 2019, 8-14 December
  //openreview.net/forum?id=HJC2SzZCW.                           2019, Vancouver, BC, Canada, pp. 13412–13421, 2019.
                                                                 URL      http://papers.nips.cc/paper/9497-
Pontryagin, L. S., Mishchenko, E., Boltyanskii, V., and          ode2vae-deep-generative-second-order-
  Gamkrelidze, R. The mathematical theory of optimal             odes-with-bayesian-neural-networks.
  processes. 1962.
                                                              Zhang, L., E, W., and Wang, L. Monge-Ampère flow for
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent or-        generative modeling. CoRR, abs/1809.10188, 2018. URL
  dinary differential equations for irregularly-sampled time     http://arxiv.org/abs/1809.10188.
  series. In Advances in Neural Information Processing
  Systems, pp. 5321–5331, 2019.
Ruthotto, L. and Haber, E. Deep neural networks motivated
  by partial differential equations. Journal of Mathematical
  Imaging and Vision, pp. 1–13, 2018.
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and
  Fung, S. W. A machine learning framework for solv-
  ing high-dimensional mean field game and mean field
  control problems. CoRR, abs/1912.01825, 2019. URL
  http://arxiv.org/abs/1912.01825.
Santambrogio, F. Benamou-Brenier and other continu-
  ous numerical methods, pp. 219–248. Springer Interna-
  tional Publishing, Cham, 2015. ISBN 978-3-319-20828-
  2. doi: 10.1007/978-3-319-20828-2_6. URL https:
  //doi.org/10.1007/978-3-319-20828-2_6.
Twomey, N., Kozlowski, M., and Santos-Rodríguez, R. Neural
  ODEs with stochastic vector field mixtures. CoRR,
  abs/1905.09905, 2019. URL http://arxiv.org/
  abs/1905.09905.
van den Oord, A., Kalchbrenner, N., and Kavukcuoglu,
  K.       Pixel recurrent neural networks.           CoRR,
  abs/1601.06759, 2016. URL http://arxiv.org/
  abs/1601.06759.
Villani, C. Topics in Optimal Transportation. Graduate
  studies in mathematics. American Mathematical Society,
  2003. ISBN 9780821833124.
Villani, C. Optimal Transport: Old and New. Grundlehren
  der mathematischen Wissenschaften. Springer Berlin Hei-
  delberg, 2008. ISBN 9783540710509. URL https://
  books.google.ca/books?id=hV8o5R7_5tkC.
Yang, L. and Karniadakis, G. E. Potential flow gener-
  ator with L2 Optimal Transport regularity for gener-
  ative models. CoRR, abs/1908.11462, 2019. URL
  http://arxiv.org/abs/1908.11462.
Yildiz, C., Heinonen, M., and Lähdesmäki, H. ODE2VAE:
  deep generative second order ODEs with Bayesian neural
  networks. In Wallach, H. M., Larochelle, H., Beygelz-
  imer, A., d’Alché-Buc, F., Fox, E. B., and Garnett,
  R. (eds.), Advances in Neural Information Processing
  Systems 32: Annual Conference on Neural Information

A. Details of Section 3.1: Benamou-Brenier                         Hence, multiplying the objective function in (20) by λ and
    formulation in Lagrangian coordinates                          ignoring the f -independent term Ex∼p log p(x) we obtain
                                                                   an equivalent objective function
The Benamou-Brenier formulation of the optimal transporta-                   
tion (OT) problem in Eulerian coordinates is                        
                                                                      <<FORMULA>>                      (21)

                  <<FORMULA>>                             (18a)

                                                                   Finally, if we assume that {x_i}_{i=1}^N are i.i.d. samples from p,
                  <<FORMULA>>                             (18b)    we obtain the empirical objective function

                  <<ρ0 (x) = p>>,                        (18c)         

                  <<ρT (z) = q>>.                      (18d)                 <<FORMULA>>                         (22)

 The connection between continuous normalizing flows
(CNF) and OT becomes transparent once we rewrite (18) in
Lagrangian coordinates. Indeed, for regular enough velocity
                                                                   B. Additional results
fields f one has that the solution of the continuity equation      Here we present additional generated samples on the two
(18b), (18c) is given by ρt = z(·, t)♯p where z is the flow        larger datasets considered, CelebA-HQ and ImageNet64. In
                                                                   addition bits/dim on clean images are reported in Table 2.
         <<FORMULA>>

The relation ρt = z(·, t)♯p means that for arbitrary test
function φ we have that

               <<\int \phi(x) \rho_t(x, t) \, dx = \int \phi(z(x, t)) \, p(x) \, dx>>
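
The pushforward identity can be checked numerically in one dimension. The following sketch
assumes, purely for illustration, that p is the standard normal density and that the flow map
at a fixed time is the affine map z(x) = 2x + 1, so that ρt is the N(1, 4) density. The test
function φ and all names are hypothetical choices.

```python
import numpy as np

def phi(x):
    return np.cos(x)              # arbitrary smooth test function

def z_map(x):
    return 2.0 * x + 1.0          # assumed flow map at a fixed time t

def rho_t(y):
    # Density of z_map(x) when x ~ N(0, 1), i.e. a normal with mean 1 and std 2.
    return np.exp(-0.5 * ((y - 1.0) / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

# Left-hand side: integral of phi * rho_t, by a Riemann sum on a wide grid.
y = np.linspace(-30.0, 30.0, 200001)
lhs = np.sum(phi(y) * rho_t(y)) * (y[1] - y[0])

# Right-hand side: expectation of phi(z(x)) under x ~ p, by Monte Carlo.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
rhs = np.mean(phi(z_map(x)))

print(lhs, rhs)
```

The right-hand side needs only samples from p and the flow map, which is what makes the
Lagrangian form convenient when ρt itself is not available on a grid.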

Therefore (18) can be rewritten as

   <<\min_{f} \int_0^T \!\! \int \|f(z(x, t), t)\|^2 \, p(x) \, dx \, dt>>               (19a)

   <<\text{subject to} \quad \dot{z}(x, t) = f(z(x, t), t)>>,       (19b)

                      <<z(x, 0) = x>>,                     (19c)

                      <<z(\cdot, T)_\sharp p = q>>.                  (19d)
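
In Lagrangian coordinates the objective (19a) becomes an average over particles of the squared
speed integrated along each trajectory. A minimal sketch, assuming a toy velocity field
f(z, t) = −z and a fixed-grid four-stage Runge-Kutta scheme, augments the state with a running
kinetic energy. The function names, the velocity field, and the step count are illustrative
assumptions, not the paper's implementation.

```python
import numpy as np

def f(z, t):
    return -z                                  # toy velocity field (assumption)

def augmented(state, t):
    # state = (z, e): particle positions and accumulated kinetic energy.
    z, e = state
    v = f(z, t)
    return v, np.sum(v * v, axis=-1)           # dz/dt = f,  de/dt = ||f||^2

def rk4_solve(state, t0, t1, n_steps):
    """Classical fixed-grid RK4 applied to the augmented system."""
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        k1 = augmented(state, t)
        k2 = augmented(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), t + 0.5 * h)
        k3 = augmented(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), t + 0.5 * h)
        k4 = augmented(tuple(s + h * k for s, k in zip(state, k3)), t + h)
        state = tuple(s + h / 6.0 * (a + 2 * b + 2 * c + d)
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
        t += h
    return state

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))                 # samples x_i ~ p
z_T, energy = rk4_solve((x, np.zeros(len(x))), 0.0, 1.0, n_steps=4)
print(energy.mean())                           # Monte Carlo estimate of (19a)
```

Four steps on [0, 1] correspond to a step size of 0.25. For this toy field the exact flow is
z(x, t) = x e^{−t}, which gives a closed form to compare against.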

Note that ρt is eliminated in this formulation. The terminal
condition (18d) is trivial to implement in Eulerian coordi-
nates (grid-based methods) but not so simple in Lagrangian
ones (19d) (grid-free methods). To enforce (19d) we intro-
duce a penalty term in the objective function that measures
the deviation of z(·, T)♯p from q. Thus, the penalized
objective function is
               <<FORMULA>>          (20)
where λ > 0 is the penalization strength. Next, we observe
that this objective function can be written as an expectation
with respect to x ∼ p. Indeed, the Kullback-Leibler di-
vergence is invariant under coordinate transformations, and
therefore

         <<FORMULA>>
              
                  <<FIGURE>>

Figure 7. Quality of FFJORD RNODE generated images on ImageNet-64.

               <<FIGURE>>

Figure 8. Quality of FFJORD RNODE generated images on CelebA-HQ. We use temperature annealing, as described in (Kingma &
Dhariwal, 2018), to generate visually appealing images, with T = 0.5, . . . , 1.

Table 2. Additional results and model statistics of FFJORD RNODE. Here we report validation bits/dim on both validation images, and on
validation images with uniform variational dequantization (i.e., perturbed by uniform noise). We also report the number of trainable model
parameters.
                          <<TABLE>>
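
To make the quantities in Table 2 concrete, the sketch below shows uniform dequantization of
8-bit pixel values and the conversion of a log-likelihood in nats to bits per dimension. The
helper names and the fake image are hypothetical, and the uniform reference model is only a
sanity check, not a model from the paper.

```python
import numpy as np

def dequantize(pixels, rng):
    """Add uniform noise in [0, 1) to integer pixel values (0..255)."""
    return pixels.astype(np.float64) + rng.uniform(size=pixels.shape)

def bits_per_dim(log_prob_nats, num_dims):
    """Convert the natural-log likelihood of one sample to bits per dimension."""
    return -log_prob_nats / (num_dims * np.log(2.0))

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 32, 32))    # one fake 8-bit image
x = dequantize(pixels, rng)
d = x.size

# Sanity check with a model that is uniform on [0, 256)^d:
# log p(x) = -d * log(256), which should give 8 bits per dimension.
log_p_uniform = -d * np.log(256.0)
print(bits_per_dim(log_p_uniform, d))
```

Any model reported below 8 bits/dim on 8-bit data is therefore doing better than a uniform
guess over pixel values.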

<|endoftext|>


<|startoftext|>

                     A guide to convolution arithmetic for deep
                                      learning

                               The authors of this guide would like to thank David Warde-Farley,
                             Guillaume Alain and Caglar Gulcehre for their valuable feedback. We
                             are likewise grateful to all those who helped improve this tutorial with
                             helpful comments, constructive criticisms and code contributions. Keep
                             them coming!
                               Special thanks to Ethan Schoonover, creator of the Solarized color
                             scheme,¹ whose colors were used for the figures.

                                                    Feedback
                               Your feedback is welcomed! We did our best to be as precise,
                             informative and up to the point as possible, but should there be anything you
                             feel might be an error or could be rephrased to be more precise or
                             comprehensible, please don’t refrain from contacting us. Likewise, drop us a
                             line if you think there is something that might fit this technical report
                             and you would like us to discuss – we will make our best effort to update
                             this document.

                                            Source code and animations
                               The code used to generate this guide along with its figures is available
                             on GitHub.² There the reader can also find an animated version of the
                             figures.


                      1 Introduction
                        1.1 Discrete convolutions
                        1.2 Pooling

                      2 Convolution arithmetic
                        2.1 No zero padding, unit strides
                        2.2 Zero padding, unit strides
                            2.2.1 Half (same) padding
                            2.2.2 Full padding
                        2.3 No zero padding, non-unit strides
                        2.4 Zero padding, non-unit strides

                      3 Pooling arithmetic

                      4 Transposed convolution arithmetic
                        4.1 Convolution as a matrix operation
                        4.2 Transposed convolution
                        4.3 No zero padding, unit strides, transposed
                        4.4 Zero padding, unit strides, transposed
                            4.4.1 Half (same) padding, transposed
                            4.4.2 Full padding, transposed
                        4.5 No zero padding, non-unit strides, transposed
                        4.6 Zero padding, non-unit strides, transposed

                      5 Miscellaneous convolutions
                        5.1 Dilated convolutions


                                                                        Chapter 1


                      Introduction


                      Deep convolutional neural networks (CNNs) have been at the heart of
                      spectacular advances in deep learning. Although CNNs have been used as early as the
                      nineties to solve character recognition tasks (Le Cun et al., 1997), their current
                      widespread application is due to much more recent work, when a deep CNN
                      was used to beat state-of-the-art in the ImageNet image classification challenge
                      (Krizhevsky et al., 2012).
                        Convolutional neural networks therefore constitute a very useful tool for
                      machine learning practitioners. However, learning to use CNNs for the first time
                      is generally an intimidating experience. A convolutional layer’s output shape
                      is affected by the shape of its input as well as the choice of kernel shape, zero
                      padding and strides, and the relationship between these properties is not
                      trivial to infer. This contrasts with fully-connected layers, whose output size is
                      independent of the input size. Additionally, CNNs also usually feature a
                      pooling stage, adding yet another level of complexity with respect to fully-connected
                      networks. Finally, so-called transposed convolutional layers (also known as
                      fractionally strided convolutional layers) have been employed in more and more work
                      as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015;
                      Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with
                      convolutional layers has been explained with various degrees of clarity.
                        This guide’s objective is twofold:

                        1. Explain the relationship between convolutional layers and transposed
                           convolutional layers.
                        2. Provide an intuitive understanding of the relationship between input shape,
                           kernel shape, zero padding, strides and output shape in convolutional,
                           pooling and transposed convolutional layers.

                        In order to remain broadly applicable, the results shown in this guide are
                      independent of implementation details and apply to all commonly used machine
                      learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012),
                      Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).

                        This chapter briefly reviews the main building blocks of CNNs, namely
                      discrete convolutions and pooling. For an in-depth treatment of the subject, see
                      Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).


                      1.1 Discrete convolutions

                      The bread and butter of neural networks is affine transformations: a vector
                      is received as input and is multiplied with a matrix to produce an output (to
                      which a bias vector is usually added before passing the result through a
                      non-linearity). This is applicable to any type of input, be it an image, a sound
                      clip or an unordered collection of features: whatever their dimensionality, their
                      representation can always be flattened into a vector before the transformation.
                        Images, sound clips and many other similar kinds of data have an intrinsic
                      structure. More formally, they share these important properties:

                        They are stored as multi-dimensional arrays.
                        They feature one or more axes for which ordering matters (e.g., width and
                          height axes for an image, time axis for a sound clip).
                        One axis, called the channel axis, is used to access different views of the
                          data (e.g., the red, green and blue channels of a color image, or the left
                          and right channels of a stereo audio track).

                        These properties are not exploited when an affine transformation is applied;
                      in fact, all the axes are treated in the same way and the topological information
                      is not taken into account. Still, taking advantage of the implicit structure of
                      the data may prove very handy in solving some tasks, like computer vision and
                      speech recognition, and in these cases it would be best to preserve it. This is
                      where discrete convolutions come into play.
                        A discrete convolution is a linear transformation that preserves this notion
                      of ordering. It is sparse (only a few input units contribute to a given output
                      unit) and reuses parameters (the same weights are applied to multiple locations
                      in the input).
                        Figure 1.1 provides an example of a discrete convolution. The light blue
                      grid is called the input feature map. To keep the drawing simple, a single input
                      feature map is represented, but it is not uncommon to have multiple feature
                      maps stacked one onto another. 1 A kernel (shaded area) of value

                                            <<FIGURE>>

                           Figure 1.1: Computing the output values of a discrete convolution.


                                            <<FIGURE>>


                          Figure 1.2: Computing the output values of a discrete convolution for N = 2, i_1 = i_2 = 5, k_1 = k_2 = 3, s_1 = s_2 = 2, and p_1 = p_2 = 1.


                      slides across the input feature map. At each location, the product between
                      each element of the kernel and the input element it overlaps is computed and
                      the results are summed up to obtain the output in the current location. The
                      procedure can be repeated using different kernels to form as many output feature
                      maps as desired (Figure 1.3). The final outputs of this procedure are called
                      output feature maps.2 If there are multiple input feature maps, the kernel will
                      have to be 3-dimensional – or, equivalently, each one of the feature maps will
                      be convolved with a distinct kernel – and the resulting feature maps will be
                      summed up elementwise to produce the output feature map.
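
                        The procedure described above can be sketched in a few lines of NumPy. This
                      is a minimal illustration rather than how any framework implements it: the
                      function name conv2d and the array layout (input maps first, then the spatial
                      axes) are assumptions made for this example, and the sketch covers only unit
                      strides without zero padding.

```python
import numpy as np

def conv2d(x, w):
    """Naive discrete convolution (no kernel flip, unit stride, no padding).

    x: input feature maps, shape (m, i1, i2)            -- assumed layout
    w: collection of kernels, shape (n, m, k1, k2)      -- assumed layout
    Returns the output feature maps, shape (n, o1, o2).
    """
    n, m, k1, k2 = w.shape
    o1 = x.shape[1] - k1 + 1  # number of kernel placements along axis 1
    o2 = x.shape[2] - k2 + 1  # number of kernel placements along axis 2
    out = np.zeros((n, o1, o2))
    for a in range(o1):
        for b in range(o2):
            # Part of the input currently overlapped by the kernel.
            patch = x[:, a:a + k1, b:b + k2]
            # Multiply elementwise, then sum over input maps and kernel axes.
            out[:, a, b] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))     # two 5x5 input feature maps
w = rng.standard_normal((3, 2, 3, 3))  # three output maps, 3x3 kernels
y = conv2d(x, w)
print(y.shape)  # (3, 3, 3)
```

                        Summing over the input-map axis inside tensordot corresponds to the
                      elementwise sum of per-map convolutions mentioned in the text.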
                        The convolution depicted in Figure 1.1 is an instance of a 2-D convolution,
                      but it can be generalized to N-D convolutions. For instance, in a 3-D convolution,
                      the kernel would be a cuboid and would slide across the height, width and
                      depth of the input feature map.
                        The collection of kernels defining a discrete convolution has a shape
                      corresponding to some permutation of (n, m, k_1, ..., k_N), where


                                      n = number of output feature maps,
                                      m = number of input feature maps,
                                      k_j = kernel size along axis j.

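                        The following properties affect the output size o_j of a convolutional layer
                      along axis j:

                                      i_j: input size along axis j,
                                      k_j: kernel size along axis j,
                                      s_j: stride (distance between two consecutive positions of the
                                           kernel) along axis j,
                                      p_j: zero padding (number of zeros concatenated at the beginning
                                           and at the end of an axis) along axis j.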

                      For instance, Figure 1.2 shows a 3x3 kernel applied to a 5x5 input padded
                      with a 1x1 border of zeros using 2x2 strides.
                        Note that strides constitute a form of subsampling. As an alternative to
                      being interpreted as a measure of how much the kernel is translated, strides can
                      also be viewed as how much of the output is retained. For instance, moving
                      the kernel by hops of two is equivalent to moving the kernel by hops of one but
                      retaining only odd output elements (Figure 1.4).
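
                        The equivalence between striding and subsampling can be checked numerically.
                      The sketch below is illustrative only: the helper name correlate2d_valid is
                      an assumption, and the inputs are square for brevity. Note that the "odd"
                      output elements of the text (first, third, ...) are indices 0, 2, ... in
                      Python's zero-based indexing.

```python
import numpy as np

def correlate2d_valid(x, w, stride=1):
    """Slide a square kernel w over a square input x without padding."""
    k = w.shape[0]
    # Top-left corners of every kernel placement along one axis.
    starts = range(0, x.shape[0] - k + 1, stride)
    return np.array([[np.sum(x[a:a + k, b:b + k] * w) for b in starts]
                     for a in starts])

x = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input
w = np.ones((3, 3))                           # 3x3 kernel

strided = correlate2d_valid(x, w, stride=2)   # hops of two
dense = correlate2d_valid(x, w, stride=1)     # hops of one
subsampled = dense[::2, ::2]                  # keep one in s = 2 outputs

assert np.allclose(strided, subsampled)
print(strided.shape, dense.shape)  # (2, 2) (3, 3)
```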
                        1 An example of this is what was referred to earlier as channels for images and sound clips.
                        2 While there is a distinction between convolution and cross-correlation from a signal
                      processing perspective, the two become interchangeable when the kernel is learned. For the sake
                      of simplicity and to stay consistent with most of the machine learning literature, the term
                      convolution will be used in this guide.

                                            <<FIGURE>>

                      Figure 1.3: A convolution mapping from two input feature maps to three output
                      feature maps using a 3x2x3x3 collection of kernels w. In the left pathway,
                      input feature map 1 is convolved with kernel w_{1,1} and input feature map 2 is
                      convolved with kernel w_{1,2}, and the results are summed together elementwise
                      to form the first output feature map. The same is repeated for the middle and
                      right pathways to form the second and third feature maps, and all three output
                      feature maps are grouped together to form the output.

                                            <<FIGURE>>

                      Figure 1.4: An alternative way of viewing strides. Instead of translating the
                      3x3 kernel by increments of s = 2 (left), the kernel is translated by increments
                      of 1 and only one in s = 2 output elements is retained (right).


                      1.2 Pooling

                      In addition to discrete convolutions themselves, pooling operations make up
                      another important building block in CNNs. Pooling operations reduce the size
                      of feature maps by using some function to summarize subregions, such as taking
                      the average or the maximum value.
                        Pooling works by sliding a window across the input and feeding the content
                      of the window to a pooling function. In some sense, pooling works very much
                      like a discrete convolution, but replaces the linear combination described by the
                      kernel with some other function. Figure 1.5 provides an example for average
                      pooling, and Figure 1.6 does the same for max pooling.
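
                        The sliding-window view of pooling can be sketched as follows. The function
                      name pool2d and its arguments are assumptions made for illustration; the
                      pooling function is passed in as a parameter to stress that only the
                      summarizing function differs between average and max pooling.

```python
import numpy as np

def pool2d(x, k, stride, func):
    """Apply func to every k x k window of a square input x."""
    # Top-left corners of every window placement along one axis.
    starts = range(0, x.shape[0] - k + 1, stride)
    return np.array([[func(x[a:a + k, b:b + k]) for b in starts]
                     for a in starts])

x = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input

avg = pool2d(x, k=3, stride=1, func=np.mean)  # 3x3 average pooling
mx = pool2d(x, k=3, stride=1, func=np.max)    # 3x3 max pooling

print(avg.shape, mx.shape)  # (3, 3) (3, 3)
```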
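                        The following properties affect the output size o_j of a pooling layer along
                      axis j:

                                      i_j: input size along axis j,
                                      k_j: pooling window size along axis j,
                                      s_j: stride (distance between two consecutive positions of the
                                           pooling window) along axis j.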


                                                  <<FIGURE>>


                     Figure 1.5: Computing the output values of a 3x3 average pooling operation on a 5x5 input using 1x1 strides.

                                                  <<FIGURE>>


                     Figure 1.6: Computing the output values of a 3x3 max pooling operation on a 5x5 input using 1x1 strides.


                                                 Chapter 2

                      Convolution arithmetic


                      The analysis of the relationship between convolutional layer properties is eased
                      by the fact that they don’t interact across axes, i.e., the choice of kernel size,
                      stride and zero padding along axis j only affects the output size of axis j.
                      Because of that, this chapter will focus on the following simplified setting:

                        2-D discrete convolutions (N = 2),
                        square inputs (i_1 = i_2 = i),
                        square kernel size (k_1 = k_2 = k),
                        same strides along both axes (s_1 = s_2 = s),
                        same zero padding along both axes (p_1 = p_2 = p).

                        This facilitates the analysis and the visualization, but keep in mind that the
                      results outlined here also generalize to the N-D and non-square cases.


                      2.1 No zero padding, unit strides

                      The simplest case to analyze is when the kernel just slides across every position
                      of the input (i.e., s = 1 and p = 0). Figure 2.1 provides an example for i = 4
                      and k = 3.
                        One way of defining the output size in this case is by the number of possible
                      placements of the kernel on the input. Let’s consider the width axis: the kernel
                      starts on the leftmost part of the input feature map and slides by steps of one
                      until it touches the right side of the input. The size of the output will be equal
                      to the number of steps made, plus one, accounting for the initial position of the
                      kernel (Figure 2.8a). The same logic applies for the height axis.
                        More formally, the following relationship can be inferred:

                          Relationship 1. For any i and k, and for s = 1 and p = 0,

                                                     o = (i - k) + 1.


                      2.2 Zero padding, unit strides

                      To factor in zero padding (i.e., only restricting to s = 1), let’s consider its effect
                      on the effective input size: padding with p zeros changes the effective input size
                      from i to i + 2p. In the general case, Relationship 1 can then be used to infer
                      the following relationship:

                          Relationship 2. For any i, k and p, and for s = 1,

                                                o = (i - k) + 2p + 1.

                      Figure 2.2 provides an example for i = 5, k = 4 and p = 2.
                        In practice, two specific instances of zero padding are used quite extensively
                      because of their respective properties. Let’s discuss them in more detail.

                      2.2.1 Half (same) padding
                      Having the output size be the same as the input size (i.e., o = i) can be a
                      desirable property:

                          Relationship 3. For any i and for k odd (k = 2n + 1, n \in \mathbb{N}),
                          s = 1 and p = \lfloor k/2 \rfloor = n,

                                               o = i + 2 \lfloor k/2 \rfloor - (k - 1) = i + 2n - 2n = i.

                      This is sometimes referred to as half (or same) padding. Figure 2.3 provides an
                      example for i = 5, k = 3 and (therefore) p = 1.

                      2.2.2 Full padding
                      While convolving a kernel generally decreases the output size with respect to
                      the input size, sometimes the opposite is required. This can be achieved with
                      proper zero padding:

                          Relationship 4. For any i and k, and for p = k - 1 and s = 1,

                                                o = i + 2(k - 1) - (k - 1) = i + (k - 1).


                                                <<FIGURE>>

                      Figure 2.1: (No padding, unit strides) Convolving a 3x3 kernel over a 4x4
                      input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0).


                                                <<FIGURE>>

                      Figure 2.2: (Arbitrary padding, unit strides) Convolving a 4x4 kernel over a
                      5x5 input padded with a 2x2 border of zeros using unit strides (i.e., i = 5,
                      k = 4, s = 1 and p = 2).


                                              <<FIGURE>>


                      Figure 2.3: (Half padding, unit strides) Convolving a 3x3 kernel over a 5x5
                      input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).


                                            <<FIGURE>>


                      Figure 2.4: (Full padding, unit strides) Convolving a 3x3 kernel over a 5x5
                      input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).


                      This is sometimes referred to as full padding, because in this setting every
                      possible partial or complete superimposition of the kernel on the input feature
                      map is taken into account. Figure 2.4 provides an example for i = 5, k = 3 and
                      (therefore) p = 2.
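
                        A one-dimensional analogue of the no-padding, half-padding and full-padding
                      cases is available through the mode argument of NumPy's np.convolve
                      ("valid", "same" and "full"). The sketch below checks the output lengths
                      against the sizes discussed above and against explicit zero padding. It is
                      only an illustration: np.convolve flips the kernel (a true convolution),
                      which does not affect the output sizes, and the example values are arbitrary.

```python
import numpy as np

i, k = 5, 3
signal = np.arange(1.0, i + 1)        # input of size i = 5
kernel = np.array([1.0, 0.0, -1.0])   # odd kernel of size k = 3

valid = np.convolve(signal, kernel, mode="valid")  # p = 0
same = np.convolve(signal, kernel, mode="same")    # p = floor(k / 2)
full = np.convolve(signal, kernel, mode="full")    # p = k - 1

assert len(valid) == (i - k) + 1   # no padding
assert len(same) == i              # half (same) padding
assert len(full) == i + (k - 1)    # full padding

# The same results follow from zero padding the input by hand and
# then convolving without any further padding.
same_by_hand = np.convolve(np.pad(signal, k // 2), kernel, mode="valid")
full_by_hand = np.convolve(np.pad(signal, k - 1), kernel, mode="valid")
assert np.allclose(same, same_by_hand)
assert np.allclose(full, full_by_hand)
```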


                      2.3 No zero padding, non-unit strides

                      All relationships derived so far only apply for unit-strided convolutions.
                      Incorporating non-unit strides requires another inference leap. To facilitate
                      the analysis, let’s momentarily ignore zero padding (i.e., s > 1 and p = 0).
                      Figure 2.5 provides an example for i = 5, k = 3 and s = 2.
                        Once again, the output size can be defined in terms of the number of possible
                      placements of the kernel on the input. Let’s consider the width axis: the kernel
                      starts as usual on the leftmost part of the input, but this time it slides by steps
                      of size s until it touches the right side of the input. The size of the output is
                      again equal to the number of steps made, plus one, accounting for the initial
                      position of the kernel (Figure 2.8b). The same logic applies for the height axis.
                        From this, the following relationship can be inferred:

                          Relationship 5. For any i, k and s, and for p = 0,

                                               o = \lfloor (i - k) / s \rfloor + 1.

                      The floor function accounts for the fact that sometimes the last possible step
                      does not coincide with the kernel reaching the end of the input, i.e., some input
                      units are left out (see Figure 2.7 for an example of such a case).
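
                        The counting argument can be made concrete by literally sliding a kernel
                      along one axis and comparing the count with the floor expression
                      floor((i - k) / s) + 1. The function names below are hypothetical and exist
                      only for this illustration.

```python
def count_placements(i, k, s):
    """Count kernel positions along one axis by explicit sliding."""
    position, count = 0, 0
    while position + k <= i:   # the kernel must fit inside the input
        count += 1
        position += s          # slide by one step of size s
    return count

def leftover_inputs(i, k, s):
    """Input units past the last kernel placement (never covered)."""
    return (i - k) % s

# The explicit count agrees with floor((i - k) / s) + 1.
for i in range(3, 12):
    for k in range(1, i + 1):
        for s in range(1, 5):
            assert count_placements(i, k, s) == (i - k) // s + 1

print(count_placements(5, 3, 2))  # 2, as in Figure 2.5
print(leftover_inputs(8, 3, 2))   # 1: effective input 6 + 2*1 of Figure 2.7
```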


                      2.4 Zero padding, non-unit strides

                      The most general case (convolving over a zero padded input using non-unit
                      strides) can be derived by applying Relationship 5 on an effective input of size
                      i + 2p, in analogy to what was done for Relationship 2:

                          Relationship 6. For any i, k, p and s,

                                             o = \lfloor (i + 2p - k) / s \rfloor + 1.

                      As before, the floor function means that in some cases a convolution will produce
                      the same output size for multiple input sizes. More specifically, if i + 2p - k is
                      a multiple of s, then any input size j = i + a, a \in \{0, ..., s - 1\} will produce
                      the same output size. Note that this ambiguity applies only for s > 1.
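
                        The general case, o = floor((i + 2p - k) / s) + 1, fits in a one-line helper.
                      The function name conv_output_size is an assumption for this sketch; the
                      checks use the parameter values given in the captions of Figures 2.1 to 2.7.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis: floor((i + 2p - k) / s) + 1."""
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1

assert conv_output_size(4, 3) == 2            # Figure 2.1
assert conv_output_size(5, 4, s=1, p=2) == 6  # Figure 2.2
assert conv_output_size(5, 3, s=1, p=1) == 5  # Figure 2.3 (half padding)
assert conv_output_size(5, 3, s=1, p=2) == 7  # Figure 2.4 (full padding)
assert conv_output_size(5, 3, s=2, p=0) == 2  # Figure 2.5
assert conv_output_size(5, 3, s=2, p=1) == 3  # Figure 2.6
assert conv_output_size(6, 3, s=2, p=1) == 3  # Figure 2.7

# With s = 2, input sizes 5 and 6 map to the same output size.
print([conv_output_size(i, 3, s=2, p=1) for i in (5, 6)])  # [3, 3]
```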

                                            <<FIGURE>>

                                            <<FIGURE>>

                        Figure 2.6 shows an example with i = 5, k = 3, s = 2 and p = 1, while
                      Figure 2.7 provides an example for i = 6, k = 3, s = 2 and p = 1. Interestingly,
                      despite having different input sizes these convolutions share the same output
                      size. While this doesn’t affect the analysis for convolutions, this will complicate
                      the analysis in the case of transposed convolutions.


                                            <<FIGURE>>

                      Figure 2.5: (No zero padding, arbitrary strides) Convolving a 3x3 kernel over
                      a 5x5 input using 2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0).

                                            <<FIGURE>>

                      Figure 2.6: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
                      5x5 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 5,
                      k = 3, s = 2 and p = 1).

                                            <<FIGURE>>

                      Figure 2.7: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
                      6x6 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 6,
                      k = 3, s = 2 and p = 1). In this case, the bottom row and right column of the
                      zero padded input are not covered by the kernel.

                      (a) The kernel has to slide two steps to the right to touch the right side of
                      the input (and equivalently downwards). Adding one to account for the initial
                      kernel position, the output size is 3x3.

                      (b) The kernel has to slide one step of size two to the right to touch the right
                      side of the input (and equivalently downwards). Adding one to account for the
                      initial kernel position, the output size is 2x2.

                                            <<FIGURE>>

                                     Figure 2.8: Counting kernel positions.


                                                 Chapter 3

                      Pooling arithmetic

                      In a neural network, pooling layers provide invariance to small translations of
                      the input. The most common kind of pooling is max pooling, which consists
                      in splitting the input in (usually non-overlapping) patches and outputting the
                      maximum value of each patch. Other kinds of pooling exist, e.g., mean or
                      average pooling, which all share the same idea of aggregating the input locally
                      by applying a non-linearity to the content of some patches (Boureau et al.,
                      2010a,b, 2011; Saxe et al., 2011).
                        Some readers may have noticed that the treatment of convolution arithmetic
                      only relies on the assumption that some function is repeatedly applied onto
                      subsets of the input. This means that the relationships derived in the previous
                      chapter can be reused in the case of pooling arithmetic. Since pooling does not
                      involve zero padding, the relationship describing the general case is as follows:

                          Relationship 7. For any i, k and s,

                                              o = \lfloor (i - k) / s \rfloor + 1.

                      This relationship holds for any type of pooling.


                                                Chapter 4

                      Transposed convolution arithmetic

                        
                      The need for transposed convolutions generally arises from the desire to use a
                      transformation going in the opposite direction of a normal convolution, i.e., from
                      something that has the shape of the output of some convolution to something
                      that has the shape of its input while maintaining a connectivity pattern that
                      is compatible with said convolution. For instance, one might use such a
                      transformation as the decoding layer of a convolutional autoencoder or to project
                      feature maps to a higher-dimensional space.
                        Once again, the convolutional case is considerably more complex than the
                      fully-connected case, which only requires using a weight matrix whose shape has
                      been transposed. However, since every convolution boils down to an efficient
                      implementation of a matrix operation, the insights gained from the fully-connected
                      case are useful in solving the convolutional case.
                        As with convolution arithmetic, the discussion of transposed convolution
                      arithmetic is simplified by the fact that transposed convolution properties
                      don’t interact across axes.
                        The chapter will focus on the following setting:

                        2-D transposed convolutions (N = 2),
                        square inputs (i_1 = i_2 = i),
                        square kernel size (k_1 = k_2 = k),
                        same strides along both axes (s_1 = s_2 = s),
                        same zero padding along both axes (p_1 = p_2 = p).

                      Once again, the results outlined generalize to the N-D and non-square cases.


                      4.1 Convolution as a matrix operation

                      Take for example the convolution represented in Figure 2.1. If the input and
                      output were to be unrolled into vectors from left to right, top to bottom, the
                      convolution could be represented as a sparse matrix C where the non-zero elements
                      are the elements w_{i,j} of the kernel (with i and j being the row and column
                      of the kernel respectively):
                    
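                      C = \begin{pmatrix}
                      w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 & 0 \\
                      0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 \\
                      0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 \\
                      0 & 0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2}
                      \end{pmatrix}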

                        This linear operation takes the input matrix flattened as a 16-dimensional
                      vector and produces a 4-dimensional vector that is later reshaped as the 2x2
                      output matrix.
                        Using this representation, the backward pass is easily obtained by
                      transposing C; in other words, the error is backpropagated by multiplying the loss
                      with C^T. This operation takes a 4-dimensional vector as input and produces
                      a 16-dimensional vector as output, and its connectivity pattern is compatible
                      with C by construction.
                        Notably, the kernel w defines both the matrices C and C^T used for the
                      forward and backward passes.
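
                        The matrix view can be verified numerically for the 3x3 kernel and 4x4
                      input of Figure 2.1. The sketch below builds C explicitly, which is wasteful
                      and done purely for illustration; the function name conv_matrix is an
                      assumption, and zero-based kernel indices are used.

```python
import numpy as np

def conv_matrix(w, i):
    """Sparse matrix C of a unit-stride, unpadded convolution.

    The input (i x i) and output (o x o) are unrolled from left to
    right, top to bottom, so output (a, b) is row a * o + b of C.
    """
    k = w.shape[0]
    o = i - k + 1
    C = np.zeros((o * o, i * i))
    for a in range(o):
        for b in range(o):
            for u in range(k):
                for v in range(k):
                    # Kernel element (u, v) touches input element (a + u, b + v).
                    C[a * o + b, (a + u) * i + (b + v)] = w[u, v]
    return C

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
x = rng.standard_normal((4, 4))

C = conv_matrix(w, 4)
y = (C @ x.ravel()).reshape(2, 2)  # forward pass: 16 -> 4 dimensions

# Reference result obtained by sliding the kernel directly.
direct = np.array([[np.sum(w * x[a:a + 3, b:b + 3]) for b in range(2)]
                   for a in range(2)])
assert np.allclose(y, direct)

back = C.T @ np.ones(4)            # backward pass: 4 -> 16 dimensions
print(C.shape, back.shape)  # (4, 16) (16,)
```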


                      4.2 Transposed convolution

                      Let’s now consider what would be required to go the other way around, i.e.,
                      map from a 4-dimensional space to a 16-dimensional space, while keeping the
                      connectivity pattern of the convolution depicted in Figure 2.1. This operation
                      is known as a transposed convolution.
                        Transposed convolutions – also called fractionally strided convolutions or
                      deconvolutions 1 – work by swapping the forward and backward passes of a
                      convolution. One way to put it is to note that the kernel defines a convolution, but
                      whether it’s a direct convolution or a transposed convolution is determined by
                      how the forward and backward passes are computed.
                        For instance, although the kernel w defines a convolution whose forward and
                      backward passes are computed by multiplying with C and C^T respectively, it
                      also defines a transposed convolution whose forward and backward passes are
                      computed by multiplying with C^T and (C^T)^T = C respectively. 2
                        Finally note that it is always possible to emulate a transposed convolution
                      with a direct convolution. The disadvantage is that it usually involves adding
                      many columns and rows of zeros to the input, resulting in a much less efficient
                      implementation.
                        Building on what has been introduced so far, this chapter will proceed
                      somewhat backwards with respect to the convolution arithmetic chapter, deriving the
                      properties of each transposed convolution by referring to the direct convolution
                      with which it shares the kernel, and defining the equivalent direct convolution.

                        1 The term “deconvolution” is sometimes used in the literature, but we advocate against it
                      on the grounds that a deconvolution is mathematically defined as the inverse of a convolution,
                      which is different from a transposed convolution.
                        2 The transposed convolution operation can be thought of as the gradient of some
                      convolution with respect to its input, which is usually how transposed convolutions are
                      implemented in practice.


                      4.3 No zero padding, unit strides, transposed

                      The simplest way to think about a transposed convolution on a given input is
                      to imagine such an input as being the result of a direct convolution applied on
                      some initial feature map. The transposed convolution can be then considered as
                      the operation that allows recovering the shape 3 of this initial feature map.
                        Let’s consider the convolution of a 3x3 kernel on a 4x4 input with unitary
                      stride and no padding (i.e., i = 4, k = 3, s = 1 and p = 0). As depicted in
                      Figure 2.1, this produces a 2x2 output. The transpose of this convolution will
                      then have an output of shape 4x4 when applied on a 2x2 input.
                        Another way to obtain the result of a transposed convolution is to apply an
                      equivalent – but much less efficient – direct convolution. The example described
                      so far could be tackled by convolving a 3x3 kernel over a 2x2 input padded
                      with a 2x2 border of zeros using unit strides (i.e., i' = 2, k' = k, s' = 1 and
                      p' = 2), as shown in Figure 4.1. Notably, the kernel’s and stride’s sizes remain
                      the same, but the input of the transposed convolution is now zero padded. 4
                        One way to understand the logic behind zero padding is to consider the
                      connectivity pattern of the transposed convolution and use it to guide the design
                      of the equivalent convolution. For example, the top left pixel of the input of the
                      direct convolution only contributes to the top left pixel of the output, the top
                      right pixel is only connected to the top right output pixel, and so on.
                        To maintain the same connectivity pattern in the equivalent convolution it is
                      necessary to zero pad the input in such a way that the first (top-left) application
                      of the kernel only touches the top-left pixel, i.e., the padding has to be equal to
                      the size of the kernel minus one.
                        Proceeding in the same fashion it is possible to determine similar observations
                      for the other elements of the image, giving rise to the following relationship:
                        3 Note that the transposed convolution does not guarantee to recover the input itself, as it
                      is not defined as the inverse of the convolution, but rather just returns a feature map that has
                      the same width and height.
                        4 Note that although equivalent to applying the transposed matrix, this visualization adds
                      a lot of zero multiplications in the form of zero padding. This is done here for illustration
                      purposes, but it is inefficient, and software implementations will normally not perform the
                      useless zero multiplications.

                      Relationship 8. A convolution described by s = 1, p = 0 and k
                          has an associated transposed convolution described by k' = k, s' = s
                          and p' = k - 1 and its output size is

                                            <<FORMULA>>

                        Interestingly, this corresponds to a fully padded convolution with unit strides.
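
The equivalence described above can be checked numerically. The following NumPy sketch unrolls the 3x3-over-4x4 convolution of Figure 4.1 into a matrix, applies the transpose of that matrix to a 2x2 input, and compares the result with a direct unit-stride convolution over the same input padded with k - 1 = 2 zeros. The helper names `correlate2d_valid` and `conv_matrix` are illustrative and do not come from the text. One detail the sketch makes explicit, which the passage above does not spell out, is that the equivalent direct convolution slides the kernel rotated by 180 degrees.

```python
import numpy as np

def correlate2d_valid(x, w):
    """Unit-stride, non-padded sliding-window product of kernel w over input x."""
    k = w.shape[0]
    o = x.shape[0] - k + 1
    out = np.zeros((o, o))
    for r in range(o):
        for c in range(o):
            out[r, c] = np.sum(x[r:r + k, c:c + k] * w)
    return out

def conv_matrix(w, i):
    """Matrix C such that C @ x.ravel() equals correlate2d_valid(x, w).ravel()."""
    k = w.shape[0]
    o = i - k + 1
    C = np.zeros((o * o, i * i))
    for r in range(o):
        for c in range(o):
            patch = np.zeros((i, i))
            patch[r:r + k, c:c + k] = w      # kernel placed at output position (r, c)
            C[r * o + c] = patch.ravel()
    return C

rng = np.random.default_rng(0)
i, k = 4, 3
w = rng.standard_normal((k, k))
y = rng.standard_normal((i - k + 1, i - k + 1))   # 2x2 input of the transposed convolution

# Transposed convolution computed as multiplication by the transposed matrix.
via_transpose = (conv_matrix(w, i).T @ y.ravel()).reshape(i, i)

# Equivalent direct convolution: pad with k - 1 zeros, slide the 180-degree rotated kernel.
padded = np.pad(y, k - 1)
via_padding = correlate2d_valid(padded, w[::-1, ::-1])

print(via_transpose.shape)                     # (4, 4)
print(np.allclose(via_transpose, via_padding))  # True
```

The explicit matrix is only practical for tiny inputs; it is used here because it mirrors the definition of the transposed convolution directly.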


                      4.4 Zero padding, unit strides, transposed

                      Knowing that the transpose of a non-padded convolution is equivalent to
                      convolving a zero padded input, it would be reasonable to suppose that the
                      transpose of a zero padded convolution is equivalent to convolving an input padded
                      with fewer zeros.
                        It is indeed the case, as shown in Figure 4.2 for i = 5, k = 4 and p = 2.
                        Formally, the following relationship applies for zero padded convolutions:

                          Relationship 9. A convolution described by s = 1, k and p has an
                          associated transposed convolution described by k' = k, s' = s and
                          p' = k - p - 1 and its output size is

                                           <<FORMULA>>
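
The arithmetic of Relationship 9 can be sketched in a few lines of Python. The padding rule p' = k - p - 1 is taken from the text. The output-size expression o' = i' + (k - 1) - 2p is an assumption of this sketch, chosen because it inverts the unit-stride output size o = i + 2p - k + 1 and agrees with Figure 4.2. The function names are illustrative.

```python
def direct_output_size(i, k, p):
    """Output size of a direct unit-stride convolution: o = i + 2p - k + 1."""
    return i + 2 * p - k + 1


def transposed_unit_stride(i_t, k, p):
    """Padding and output size of the convolution equivalent to the transpose
    of a unit-stride convolution with kernel size k and zero padding p.

    i_t is the input size of the transposed convolution (i' in the text).
    """
    p_t = k - p - 1                # p' = k - p - 1, as stated in Relationship 9
    o_t = i_t + (k - 1) - 2 * p    # assumed output-size formula (see the note above)
    return p_t, o_t


# Figure 4.2: i = 5, k = 4, p = 2 gives a direct output of size 6,
# and the transpose maps that 6x6 map back to 5x5 using p' = 1.
o = direct_output_size(5, 4, 2)
print(o, transposed_unit_stride(o, 4, 2))   # 6 (1, 5)
```

Setting p = 0 recovers the non-padded case of Relationship 8, where p' = k - 1.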

                      4.4.1 Half (same) padding, transposed
                      By applying the same inductive reasoning as before, it is reasonable to expect
                      that the equivalent convolution of the transpose of a half padded convolution
                      is itself a half padded convolution, given that the output size of a half padded
                      convolution is the same as its input size. Thus the following relation applies:

                          Relationship 10. A convolution described by k = 2n + 1, n ∈ N,
                          s = 1 and p = ⌊k/2⌋ = n has an associated transposed convolution
                          described by k' = k, s' = s and p' = p and its output size is

                                           <<FORMULA>>


                                           <<FIGURE>>

                        Figure 4.3 provides an example for i = 5, k = 3 and (therefore) p = 1.

                      4.4.2 Full padding, transposed
                      Knowing that the equivalent convolution of the transpose of a non-padded
                      convolution involves full padding, it is unsurprising that the equivalent of the
                      transpose of a fully padded convolution is a non-padded convolution:

                                          <<FIGURE>>

                      Figure 4.1: The transpose of convolving a 3x3 kernel over a 4x4 input using
                      unit strides (i.e., i = 4, k = 3, s = 1 and p = 0). It is equivalent to convolving
                      a 3x3 kernel over a 2x2 input padded with a 2x2 border of zeros using unit
                      strides (i.e., i' = 2, k' = k, s' = 1 and p' = 2).

                                          <<FIGURE>>

                      Figure 4.2: The transpose of convolving a 4x4 kernel over a 5x5 input padded
                      with a 2x2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and
                      p = 2). It is equivalent to convolving a 4x4 kernel over a 6x6 input padded
                      with a 1x1 border of zeros using unit strides (i.e., i' = 6, k' = k, s' = 1 and
                      p' = 1).

                                           <<FIGURE>>

                      Figure 4.3: The transpose of convolving a 3x3 kernel over a 5x5 input using
                      half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1). It is
                      equivalent to convolving a 3x3 kernel over a 5x5 input using half padding
                      and unit strides (i.e., i' = 5, k' = k, s' = 1 and p' = 1).


                          Relationship 11. A convolution described by s = 1, k and p = k - 1
                          has an associated transposed convolution described by k' = k, s' = s
                          and p' = 0 and its output size is

                                          <<FIGURE>>

                        Figure 4.4 provides an example for i = 5, k = 3 and (therefore) p = 2.


                      4.5 No zero padding, non-unit strides, transposed

                      Using the same kind of inductive logic as for zero padded convolutions, one
                      might expect that the transpose of a convolution with s > 1 involves an
                      equivalent convolution with s < 1. As will be explained, this is a valid intuition,
                      which is why transposed convolutions are sometimes called fractionally strided
                      convolutions.
                        Figure 4.5 provides an example for i = 5, k = 3 and s = 2 which helps
                      understand what fractional strides involve: zeros are inserted between input
                      units, which makes the kernel move around at a slower pace than with unit
                      strides. 5
                        For the moment, it will be assumed that the convolution is non-padded
                      (p = 0) and that its input size i is such that i - k is a multiple of s. In that
                      case, the following relationship holds:

                          Relationship 12. A convolution described by p = 0, k and s and
                          whose input size is such that i - k is a multiple of s, has an associated
                          transposed convolution described by ĩ', k' = k, s' = 1 and p' = k - 1,
                          where ĩ' is the size of the stretched input obtained by adding s - 1
                          zeros between each input unit, and its output size is

                                            <<FORMULA>>
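
The stretching step of Relationship 12 can be illustrated with a short NumPy sketch. The function names `stretch` and `equivalent_output_size` are illustrative. The sketch assumes that the stretched size is ĩ' = i' + (i' - 1)(s - 1), which follows from adding s - 1 zeros between each of the i' - 1 adjacent pairs of input units, and that the equivalent convolution is a unit-stride convolution padded with p' = k - 1 zeros.

```python
import numpy as np

def stretch(x, s):
    """Insert s - 1 zeros between neighbouring units of a square 2-D input."""
    n = x.shape[0]
    size = n + (n - 1) * (s - 1)            # stretched size (ĩ' in the text)
    out = np.zeros((size, size), dtype=x.dtype)
    out[::s, ::s] = x                       # original units land on every s-th position
    return out

def equivalent_output_size(i_t, k, s):
    """Output size of the unit-stride convolution over the stretched, padded input."""
    stretched = i_t + (i_t - 1) * (s - 1)
    p_t = k - 1                             # p' = k - 1, as stated in Relationship 12
    return stretched + 2 * p_t - k + 1

x = np.arange(1, 5).reshape(2, 2)           # i' = 2, as in Figure 4.5
print(stretch(x, 2))
# [[1 0 2]
#  [0 0 0]
#  [3 0 4]]
print(equivalent_output_size(2, k=3, s=2))  # 5
```

The result matches the Figure 4.5 example, where a 2x2 input is mapped back to the 5x5 size of the original input.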

                      4.6 Zero padding, non-unit strides, transposed

                      When the convolution's input size i is such that i + 2p - k is a multiple of s,
                      the analysis can be extended to the zero padded case by combining Relationship 9
                      and Relationship 12:
                        5 Doing so is inefficient and real-world implementations avoid useless multiplications by
                      zero, but conceptually it is how the transpose of a strided convolution can be thought of.
 
                                          <<FIGURE>> 

                      Figure 4.4: The transpose of convolving a 3x3 kernel over a 5x5 input using
                      full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2). It is equivalent
                      to convolving a 3x3 kernel over a 7x7 input using unit strides (i.e., i' = 7,
                      k' = k, s' = 1 and p' = 0).

                                          <<FIGURE>>

                      Figure 4.5: The transpose of convolving a 3x3 kernel over a 5x5 input using
                      2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0). It is equivalent to convolving
                      a 3x3 kernel over a 2x2 input (with 1 zero inserted between inputs) padded
                      with a 2x2 border of zeros using unit strides (i.e., i' = 2, ĩ' = 3, k' = k, s' = 1
                      and p' = 2).

                                        <<FIGURE>>

                      Figure 4.6: The transpose of convolving a 3x3 kernel over a 5x5 input padded
                      with a 1x1 border of zeros using 2x2 strides (i.e., i = 5, k = 3, s = 2 and
                      p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
                      1 zero inserted between inputs) padded with a 1x1 border of zeros using unit
                      strides (i.e., i' = 3, ĩ' = 5, k' = k, s' = 1 and p' = 1).


                          Relationship 13. A convolution described by k, s and p and whose
                          input size i is such that i + 2p - k is a multiple of s has an associated
                          transposed convolution described by ĩ', k' = k, s' = 1 and p' =
                          k - p - 1, where ĩ' is the size of the stretched input obtained by
                          adding s - 1 zeros between each input unit, and its output size is

                                          <<FORMULA>>


                                          <<FIGURE>>

                        Figure 4.6 provides an example for i = 5, k = 3, s = 2 and p = 1.
                        The constraint on the size of the input i can be relaxed by introducing
                      another parameter a ∈ {0, ..., s - 1} that allows to distinguish between the s
                      different cases that all lead to the same i':

                          Relationship 14. A convolution described by k, s and p has an
                          associated transposed convolution described by a, ĩ', k' = k, s' = 1
                          and p' = k - p - 1, where ĩ' is the size of the stretched input obtained
                          by adding s - 1 zeros between each input unit, and a = (i + 2p - k)
                          mod s represents the number of zeros added to the bottom and right
                          edges of the input, and its output size is

                                         <<FORMULA>>


                                         <<FIGURE>>

                        Figure 4.7 provides an example for i = 6, k = 3, s = 2 and p = 1.
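
The general case of Relationship 14 can be checked with a small round-trip computation. This is a sketch under stated assumptions: the direct output size is taken to be floor((i + 2p - k) / s) + 1 (the general formula the text refers to as Relationship 6), and the transposed output size is computed as a unit-stride convolution over the stretched input padded with p' zeros on every side plus a extra zeros on the bottom and right edges. The function names are illustrative.

```python
def conv_output_size(i, k, s, p):
    """Output size of a direct convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def transposed_output_size(i, k, s, p):
    """Round trip: size after the convolution, then after its transpose.

    Returns (i', stretched size, a, o').
    """
    i_t = conv_output_size(i, k, s, p)       # i', input of the transposed convolution
    a = (i + 2 * p - k) % s                  # extra zeros on the bottom and right edges
    stretched = i_t + (i_t - 1) * (s - 1)    # size after inserting s - 1 zeros between units
    p_t = k - p - 1                          # p' = k - p - 1
    # Unit-stride convolution over the stretched input, padded by p' (plus a).
    o_t = stretched + 2 * p_t + a - k + 1
    return i_t, stretched, a, o_t


print(transposed_output_size(5, 3, 2, 1))    # (3, 5, 0, 5), the Figure 4.6 example
print(transposed_output_size(6, 3, 2, 1))    # (3, 5, 1, 6), the Figure 4.7 example
```

Both inputs produce the same i' = 3, and the parameter a is what lets the transposed convolution recover 5 in one case and 6 in the other.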

                                        <<FIGURE>>

                      Figure 4.7: The transpose of convolving a 3x3 kernel over a 6x6 input padded
                      with a 1x1 border of zeros using 2x2 strides (i.e., i = 6, k = 3, s = 2 and
                      p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
                      1 zero inserted between inputs) padded with a 1x1 border of zeros (with an
                      additional border of size 1 added to the bottom and right edges) using unit
                      strides (i.e., i' = 3, ĩ' = 5, a = 1, k' = k, s' = 1 and p' = 1).


                                                 Chapter 5


                      Miscellaneous convolutions

                      5.1 Dilated convolutions

                      Readers familiar with the deep learning literature may have noticed the term
                      “dilated convolutions” (or “atrous convolutions”, from the French expression
                      convolutions à trous) appear in recent papers. Here we attempt to provide an
                      intuitive understanding of dilated convolutions. For a more in-depth description
                      and to understand in what contexts they are applied, see Chen et al. (2014); Yu
                      and Koltun (2015).
                        Dilated convolutions “inflate” the kernel by inserting spaces between the
                      kernel elements. The dilation “rate” is controlled by an additional hyperparameter
                      d. Implementations may vary, but there are usually d - 1 spaces inserted between
                      kernel elements such that d = 1 corresponds to a regular convolution.
                        Dilated convolutions are used to cheaply increase the receptive field of output
                      units without increasing the kernel size, which is especially effective when
                      multiple dilated convolutions are stacked one after another. For a concrete example,
                      see Oord et al. (2016), in which the proposed WaveNet model implements an
                      autoregressive generative model for raw audio which uses dilated convolutions
                      to condition new audio frames on a large context of past audio frames.
                        To understand the relationship tying the dilation rate d and the output size
                      o, it is useful to think of the impact of d on the effective kernel size. A kernel
                      of size k dilated by a factor d has an effective size

                                          <<FORMULA>>

                      This can be combined with Relationship 6 to form the following relationship for
                      dilated convolutions:

                          Relationship 15. For any i, k, p and s, and for a dilation rate d,

                                       <<FORMULA>>


                                        <<FIGURE>>
                      Figure 5.1: (Dilated convolution) Convolving a 3x3 kernel over a 7x7 input
                      with a dilation factor of 2 (i.e., i = 7, k = 3, d = 2, s = 1 and p = 0).


                      Figure 5.1 provides an example for i = 7, k = 3 and d = 2.
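
The effect of dilation on the output size can be sketched as follows. The sketch assumes the effective kernel size is k + (k - 1)(d - 1), which follows from inserting d - 1 spaces between each of the k - 1 adjacent pairs of kernel elements, and that this effective size is substituted for k in the general output-size formula floor((i + 2p - k) / s) + 1. The function names are illustrative.

```python
import numpy as np

def dilate_kernel(w, d):
    """Insert d - 1 zeros between neighbouring elements of a square kernel."""
    k = w.shape[0]
    k_eff = k + (k - 1) * (d - 1)           # effective kernel size
    out = np.zeros((k_eff, k_eff), dtype=w.dtype)
    out[::d, ::d] = w                       # kernel elements land on every d-th position
    return out

def dilated_output_size(i, k, s, p, d):
    """Output size with the effective kernel size plugged into
    floor((i + 2p - k_eff) / s) + 1."""
    k_eff = k + (k - 1) * (d - 1)
    return (i + 2 * p - k_eff) // s + 1

w = np.ones((3, 3), dtype=int)
print(dilate_kernel(w, 2).shape)                      # (5, 5)
print(dilated_output_size(i=7, k=3, s=1, p=0, d=2))   # 3
```

With d = 2 the 3x3 kernel covers a 5x5 window while still holding only nine weights, which is why the 7x7 input of Figure 5.1 yields a 3x3 output.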


                                                  Bibliography


                      Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
                       G. S., Davis, A., Dean, J., Devin, M., et al. (2015). Tensorflow: Large-scale
                       machine learning on heterogeneous systems. Software available from
                       tensorflow.org.
                      Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron,
                       A., Bouchard, N., Warde-Farley, D., and Bengio, Y. (2012). Theano: new
                       features and speed improvements. arXiv preprint arXiv:1211.5590.
                      Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins,
                       G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: A cpu and
                       gpu math compiler in python. In Proc. 9th Python in Science Conf, pages
                       1–7.
                      Boureau, Y., Bach, F., LeCun, Y., and Ponce, J. (2010a). Learning mid-level
                       features for recognition. In Proc. International Conference on Computer
                       Vision and Pattern Recognition (CVPR’10). IEEE.
                      Boureau, Y., Ponce, J., and LeCun, Y. (2010b). A theoretical analysis of feature
                       pooling in vision algorithms. In Proc. International Conference on Machine
                       learning (ICML’10).
                      Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the
                       locals: multi-way local pooling for image recognition. In Proc. International
                       Conference on Computer Vision (ICCV’11). IEEE.
                      Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014).
                       Semantic image segmentation with deep convolutional nets and fully
                       connected crfs. arXiv preprint arXiv:1412.7062.
                      Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A matlab-like
                       environment for machine learning. In BigLearn, NIPS Workshop, number
                       EPFL-CONF-192376.
                      Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. Book in
                       preparation for MIT Press.


                      Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. (2016). Generating images
                       with recurrent adversarial networks. arXiv preprint arXiv:1602.05110.
                      Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.,
                       Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast
                       feature embedding. In Proceedings of the ACM International Conference on
                       Multimedia, pages 675–678. ACM.
                      Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification
                       with deep convolutional neural networks. In Advances in neural information
                       processing systems, pages 1097–1105.
                      Le Cun, Y., Bottou, L., and Bengio, Y. (1997). Reading checks with multilayer
                       graph transformer networks. In Acoustics, Speech, and Signal Processing,
                       1997. ICASSP-97., 1997 IEEE International Conference on, volume 1, pages
                       151–154. IEEE.
                      Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for
                       semantic segmentation. In Proceedings of the IEEE Conference on Computer
                       Vision and Pattern Recognition, pages 3431–3440.
                      Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A.,
                       Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A
                       generative model for raw audio. arXiv preprint arXiv:1609.03499.
                      Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation
                       learning with deep convolutional generative adversarial networks. arXiv
                       preprint arXiv:1511.06434.
                      Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011).
                       On random weights and unsupervised feature learning. In L. Getoor and
                       T. Scheffer, editors, Proceedings of the 28th International Conference on
                       Machine Learning (ICML-11), ICML ’11, pages 1089–1096, New York, NY, USA.
                       ACM.
                      Visin, F., Kastner, K., Courville, A. C., Bengio, Y., Matteucci, M., and Cho,
                       K. (2015). Reseg: A recurrent neural network for object segmentation.
                      Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated
                       convolutions. arXiv preprint arXiv:1511.07122.
                      Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding
                       convolutional networks. In Computer vision–ECCV 2014, pages 818–833. Springer.
                      Zeiler, M. D., Taylor, G. W., and Fergus, R. (2011). Adaptive deconvolutional
                       networks for mid and high level feature learning. In Computer Vision (ICCV),
                       2011 IEEE International Conference on, pages 2018–2025. IEEE.

<|endoftext|>


<|startoftext|>

                  A Survey of Model Compression and Acceleration for Deep Neural Networks

                 Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE

        IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION)

         Abstract—Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to [...] without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey recently developed techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, and the other techniques are introduced after that. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. Then we go through a few very recent additional successful methods, for example, dynamic capacity networks and stochastic depths networks. After that, we survey the evaluation metrics, the main datasets used for evaluating the model performance, and recent benchmarking efforts. Finally, we conclude this paper and discuss remaining challenges and possible directions on this topic.

         Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration.

         Yu Cheng is a Researcher from Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA.
         Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China.
         Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

                                                     I. INTRODUCTION

         In recent years, deep neural networks have received lots of attention, been applied to different applications and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. Usually, it takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. As another example, the top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model to get reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

         As larger neural networks with more layers and nodes [...] becomes critical, especially for some real-time applications such as online learning and incremental learning. In addition, recent years witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle fundamental challenges in deploying deep learning systems to portable devices with limited resources (e.g. memory, CPU, energy, bandwidth). Efficient deep learning methods can have significant impacts on distributed systems, embedded devices, and FPGA for Artificial Intelligence. For example, the ResNet-50 [5] with 50 convolutional layers needs over 95MB memory for storage and over 3.8 billion floating number multiplications when processing an image. After discarding some redundant weights, the network still works as usual but saves more than 75% of parameters and 50% computational time. For devices like cell phones and FPGAs with only several megabyte resources, how to compact the models used on them is also important.

         Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent works on compressing and accelerating deep neural networks, which attracted a lot of attention from the deep learning community and already achieved lots of progress in the past years.

         We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of the deep CNNs. The approaches based on transferred/compact convolutional filters design special structural convolutional filters to reduce the parameter space and save storage/computation. The knowledge distillation methods learn a distilled model and train a more compact neural network to reproduce the output of a larger network.

         In Table I, we briefly summarize these four types of methods. Generally, the parameter pruning & sharing, low-rank factorization and knowledge distillation approaches can


                                                TABLE I

                                            <<TABLE>>

        be used in DNN models with fully connected layers and convolutional layers, achieving comparable performances. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filters based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment, which is straightforward, while parameter pruning & sharing uses different methods such as vector quantization, binary coding and sparse constraints to perform the task. Generally it will take several steps to achieve the goal.

         Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained ones or trained from scratch, while the transferred/compact filter and knowledge distillation models only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We will describe the details of each theme, their properties, strengths and drawbacks in the following sections.

                                <<FIGURE>>
         Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compression model.

              II. PARAMETER PRUNING AND SHARING

         Early works showed that network pruning is effective in reducing the network complexity and addressing the over-fitting problem [6]. Pruning was originally introduced to reduce the structure in neural networks and hence improve generalization; it has since been widely studied to compress DNN models, by removing parameters which are not crucial to the model performance. These techniques can be further classified into three sub-categories: quantization and binarization, parameter sharing, and structural matrix.

        A. Quantization and Binarization

         Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.

         The method proposed in [10] quantized the link weights using weight sharing and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved the state-of-the-art performance among all parameter quantization based methods. The work in [11] showed that Hessian weights could be used to measure the importance of network parameters, and proposed to minimize the average Hessian-weighted quantization error when clustering network parameters.

         The extreme case is a 1-bit representation of each weight, that is, binary weight neural networks. Many works directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [15] showed that networks trained with back propagation could be resilient to specific weight distortions, including binary weights.

         Drawbacks: the accuracy of the binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of such binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.


         To address this issue, the work in [16] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [17] reduced the time on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to significant changes.

        B. Pruning and Sharing

         Network pruning and sharing has been used both to reduce network complexity and to address the over-fitting issue. An early approach to pruning was the Biased Weight Decay [18]. The Optimal Brain Damage [19] and the Optimal Brain Surgeon [20] methods reduced the number of connections based on the Hessian of the loss function, and their work suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. The training procedure of those methods followed the training-from-scratch approach.

         A recent trend in this direction is to prune redundant, non-informative weights in a pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among neurons, and proposed a data-free pruning method to remove redundant neurons. Han et al. [22] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [23] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [24], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re-)training procedure. The above pruning schemes typically prune connections in CNNs.

         connected layers, which is often the bottleneck in terms of memory consumption. These network layers use the nonlinear transforms f(x, M) = σ(Mx), where σ(·) is an element-wise nonlinear operator, x is the input vector, and M is the m × n matrix of parameters [29]. When M is a large general dense matrix, storing it requires mn parameters and computing matrix-vector products takes O(mn) time. Thus, an intuitive way to prune parameters is to impose M as a parameterized structural matrix. An m × n matrix that can be described using much fewer parameters than mn is called a structured matrix. Typically, the structure should not only reduce the memory cost, but also dramatically accelerate the inference and training stage via fast matrix-vector multiplication and gradient computations.

         Following this direction, the work in [30], [31] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector r = <<FORMULA>>, a circulant matrix R ∈ R^{d×d} is defined as:

                                <<FORMULA>>            (1)

         thus the memory cost becomes O(d) instead of O(d^2). This circulant structure also enables the use of Fast Fourier Transform (FFT) to speed up the computation. Given a d-dimensional vector r, the above 1-layer circulant neural network in Eq. 1 has time complexity of O(d log d).

         In [32], a novel Adaptive Fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform matrix R ∈ R^{n×d} was defined as:

                                <<FORMULA>>            (2)
         There is also growing interest in training compact CNNs whereS,GandBare random diagonal matrices. 2
        with sparsity constraints. Those sparsity constraints are typ- <<FORMULA>> is a random permutation matrix, and H denotes
        ically introduced in the optimization problem asl0 orl1 - the Walsh-Hadamard matrix. Reparameterizing a fully con-
        norm regularizers. The work in [25] imposed group sparsity nected layer with d inputs and n outputs using the Adaptive
        constraint on the convolutional ﬁlters to achieve structured Fast food transform reduces the storage and the computational
        brain Damage, i.e., pruning entries of the convolution kernels costs from O(n^d) to O(n) and from O(n^d) to O(n*log(d)),
        in a group-wise fashion. In [26], a group-sparse regularizer respectively.
        on neurons was introduced during the training stage to learn   The work in [29] showed the effectiveness of the new
        compact CNNs with reduced ﬁlters. Wenet al.[27] added a notion of parsimony in the theory of structured matrices. Their
        structured sparsity regularizer on each layer to reduce trivial proposed method can be extended to various other structured
        ﬁlters, channels or even layers. In the ﬁlter-level pruning, all matrix classes, including block and multi-level Toeplitz-like
        the above works used l2-norm regularizers. The work in [28] [33] matrices related to multi-dimensional convolution [34].
        usedl1 -norm to select and prune unimportant ﬁlters.       Following this idea, [35] proposed a general structured efﬁ-
         Drawbacks: there are some potential issues of the pruning cient linear layer for CNNs.
        and sharing. First, pruning with l1 or l2 regularization requires   Drawbacks: one problem of this kind of approaches is that
        more iterations to converge than general. In addition, all the structural constraint will hurt the performance since the
        pruning criteria require manual setup of sensitivity for layers, constraint might bring bias to the model. On the other hand,
        which demands ﬁne-tuning of the parameters and could be how to ﬁnd a proper structural matrix is difﬁcult. There is no
        cumbersome for some applications.                   theoretical way to derive it out.

        C. Designing Structural Matrix                          
        
        III. LOW-RANK FACTORIZATION AND SPARSITY


         In architectures that contain fully-connected layers, it is   Convolution operations contribute the bulk of most com-
        critical to explore this redundancy of parameters in fully- putations in deep CNNs, thus reducing the convolution layer        IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 4


                                                                      TABLE II
                            COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES ON ILSVRC-2012.

                                                                    <<TABLE>>


                                       <<FIGURE>>

        Fig. 2. A typical framework of the low-rank regularization method. The left    
        is the original convolutional layer and the right is the low-rank constraint    
        convolutional layer with rank-K.                             
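
To make the rank-K constraint in Fig. 2 concrete, the following sketch counts the parameters of a
dense m x n weight matrix against a rank-K factorization W = UV, and checks on a toy example that
applying the two factors in sequence reproduces the dense product. It is only a minimal illustration
in plain Python: the helper names and the 4096 x 4096 layer size are assumptions chosen for the
example, not values reported in the surveyed works.

```python
# Minimal sketch; helper names and sizes are illustrative assumptions.

def dense_params(m, n):
    """Weights stored by a dense m x n matrix W."""
    return m * n


def low_rank_params(m, n, k):
    """Weights stored when W is replaced by U (m x k) and V (k x n)."""
    return k * (m + n)


def matvec(a, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def matmul(a, b):
    """Plain matrix-matrix product, used only to build the toy dense W."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


# Parameter counts for a hypothetical 4096 x 4096 fully connected layer.
m, n = 4096, 4096
for k in (16, 64, 256):
    ratio = dense_params(m, n) / low_rank_params(m, n, k)
    print(f"rank {k}: {low_rank_params(m, n, k)} parameters, {ratio:.0f}x fewer")

# Toy check: a rank-1 matrix W = U V gives the same output as U (V x).
U = [[1], [2], [3]]          # 3 x 1 factor
V = [[4, 5, 6]]              # 1 x 3 factor
W = matmul(U, V)             # 3 x 3 dense equivalent
x = [1, 0, -1]
assert matvec(W, x) == matvec(U, matvec(V, x))
```

The factorized form only saves storage when K(m + n) < mn, i.e. K < mn / (m + n); for the square
layer above that means K < 2048.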
                                                      
would improve the compression rate as well as the overall speedup. The convolution kernels can be
viewed as a 4D tensor. Ideas based on tensor decomposition are derived from the intuition that there
is a significant amount of redundancy in the 4D tensor, which makes decomposition a particularly
promising way to remove the redundancy. The fully-connected layer can be viewed as a 2D matrix, so
low-rankness can also help there.

Low-rank filters have long been used to accelerate convolution, for example, high dimensional DCT
(discrete cosine transform) and wavelet systems constructed with tensor products from 1D DCT
transforms and 1D wavelets, respectively. Learning separable 1D filters was introduced by Rigamonti
et al. [36], following the dictionary learning idea. For some simple DNN models, a few low-rank
approximation and clustering schemes for the convolutional kernels were proposed in [37]. They
achieved 2× speedup for a single convolutional layer with 1% drop in classification accuracy. The
work in [38] proposed using different tensor decomposition schemes, reporting a 4.5× speedup with 1%
drop in accuracy in text recognition.

The low-rank approximation was done layer by layer. The parameters of one layer were fixed after it
was done, and the layers above were fine-tuned based on a reconstruction error criterion. These are
typical low-rank methods for compressing 2D convolutional layers, as described in Figure 2. Following
this direction, Canonical Polyadic (CP) decomposition was proposed for the kernel tensors in [39].
Their work used nonlinear least squares to compute the CP decomposition. In [40], a new algorithm for
computing the low-rank tensor decomposition for training low-rank constrained CNNs from scratch was
proposed. It used Batch Normalization (BN) to transform the activation of the internal hidden units.
In general, both the CP and the BN decomposition schemes in [40] (BN Low-rank) can be used to train
CNNs from scratch. However, there are a few differences between them. For example, finding the best
low-rank approximation in CP decomposition is an ill-posed problem, and the best rank-K (K is the
rank number) approximation may not exist sometimes. For the BN scheme, in contrast, the decomposition
always exists. We perform a simple comparison of both methods, shown in Table II. The actual speedup
and the compression rates are used to measure their performances.

As we mentioned before, the fully connected layers can be viewed as a 2D matrix and thus the above
mentioned methods can also be applied there. There are several classical works on exploiting
low-rankness in fully connected layers. For instance, Misha et al. [41] reduced the number of dynamic
parameters in deep models using the low-rank method. [42] explored a low-rank matrix factorization of
the final weight layer in a DNN for acoustic modeling. In [3], Lu et al. adopted truncated SVD
(singular value decomposition) to decompose the fully connected layer for designing compact
multi-task deep learning architectures.

Drawbacks: low-rank approaches are straightforward for model compression and acceleration. The idea
complements recent advances in deep learning, such as dropout, rectified units and maxout. However,
the implementation is not that easy since it involves a decomposition operation, which is
computationally expensive. Another issue is that current methods perform low-rank approximation layer
by layer, and thus cannot perform global parameter compression, which is important as different
layers hold different information. Finally, factorization requires extensive model retraining to
achieve convergence when compared to the original model.

IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS

CNNs are parameter efficient due to exploring the translation invariant property of the
representations to the input image, which is the key to the success of training very deep models
without severe over-fitting. Although a strong theory is currently missing, a large amount of
empirical evidence supports the notion that both the translation invariant property and the
convolutional weight sharing are important for good predictive performance. The idea of using
transferred convolutional filters to compress CNN models is motivated by recent works in [43], which
introduced the equivariant group theory. Let $x$ be an input, $\Phi(\cdot)$ be a network or layer and
$T(\cdot)$ be the transform matrix. The concept of equivariance is defined as:

$$T' \Phi(x) = \Phi(T x) \qquad (3)$$

indicating that transforming the input $x$ by the transform $T(\cdot)$ and then passing it through the
network or layer $\Phi(\cdot)$ should give the same result as first mapping $x$ through the network
and then transforming the representation. Note that in Eq. (3), the transforms $T(\cdot)$ and
$T'(\cdot)$ are not necessarily the same as they operate on different objects. According to this
theory, it is reasonable to apply transforms to layers or filters $\Phi(\cdot)$ to compress the whole
network models. From empirical observation, deep CNNs also benefit from using a large set of
convolutional filters by applying certain transform $T(\cdot)$ to a


small set of base filters since it acts as a regularizer for the model.

Following this direction, there are many recent works proposed to build a convolutional layer from a
set of base filters [43]–[46]. What they have in common is that the transform $T(\cdot)$ lies in the
family of functions that only operate in the spatial domain of the convolutional filters. For
example, the work in [45] found that the lower convolution layers of CNNs learned redundant filters
to extract both positive and negative phase information of an input signal, and defined $T(\cdot)$ to
be the simple negation function:

$$T(W_x) = W_x^- \qquad (4)$$

where $W_x$ is the basis convolutional filter and $W_x^-$ is the filter consisting of the shifts whose
activation is opposite to that of $W_x$ and selected after max-pooling operation. By doing this, the
work in [45] can easily achieve 2× compression rate on all the convolutional layers. It is also shown
that the negation transform acts as a strong regularizer to improve the classification accuracy. The
intuition is that the learning algorithm with pair-wise positive-negative constraint can lead to
useful convolutional filters instead of redundant ones.

In [46], it was observed that magnitudes of the responses from convolutional kernels had a wide
diversity of pattern representations in the network, and it was not proper to discard weaker signals
with a single threshold. Thus a multi-bias non-linearity activation function was proposed to generate
more patterns in the feature space at low computational cost. The transform $T(\cdot)$ was defined as:

                    <<FORMULA>>           (5)

where $\delta$ were the multi-bias factors. The work in [47] considered a combination of rotation by
a multiple of 90° and horizontal/vertical flipping with:

                    <<FORMULA>>           (6)

where $W^{T_\theta}$ was the transformation matrix which rotated the original filters with angle
$\theta \in \{90, 180, 270\}$. In [43], the transform was generalized to any angle learned from data,
and $\theta$ was directly obtained from data. Both works [47] and [43] can achieve good
classification performance.

The work in [44] defined $T(\cdot)$ as the set of translation functions applied to 2D filters:

                    <<FORMULA>>           (7)

where $T(\cdot, x, y)$ denoted the translation of the first operand by $(x, y)$ along its spatial
dimensions, with proper zero padding at borders to maintain the shape. The proposed framework can be
used to 1) improve the classification accuracy as a regularized version of maxout networks, and 2)
achieve parameter efficiency by flexibly varying their architectures to compress networks.

Table III briefly compares the performance of different methods with transferred convolutional
filters, using VGGNet (16 layers) as the baseline model. The results are reported on CIFAR-10 and
CIFAR-100 datasets with Top-5 error. It is observed that they can achieve reduction in parameters
with little or no drop in classification accuracy.

                                        TABLE III
              A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND CIFAR-100.

                                        <<TABLE>>

Drawbacks: there are a few issues to be addressed for approaches that apply transform constraints to
convolutional filters. First, these methods can achieve competitive performance for wide/flat
architectures (like VGGNet) but not thin/deep ones (like GoogleNet, Residual Net). Secondly, the
transfer assumptions sometimes are too strong to guide the learning, making the results unstable in
some cases.

Using a compact filter for convolution can directly reduce the computation cost. The key idea is to
replace the loose and over-parametric filters with compact blocks to improve the speed, which
significantly accelerates CNNs on several benchmarks. Decomposing 3×3 convolution into two 1×1
convolutions was used in [48], which achieved significant acceleration on object recognition.
SqueezeNet [49] was proposed to replace 3×3 convolution with 1×1 convolution, which created a compact
neural network with about 50× fewer parameters and comparable accuracy when compared to AlexNet.

V. KNOWLEDGE DISTILLATION

To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first
proposed by Caruana et al. [50]. They trained a compressed model to mimic an ensemble of strong
classifiers using pseudo-labeled data, reproducing the output of the original larger network. But the
work is limited to shallow models. The idea has been recently adopted in [51] as knowledge
distillation (KD) to compress deep and wide networks into shallower ones, where the compressed model
mimicked the function learned by the complex model. The main idea of KD based approaches is to shift
knowledge from a large teacher model into a small one by learning the class distributions output via
softmax.

The work in [52] introduced a KD compression framework, which eased the training of deep networks by
following a student-teacher paradigm, in which the student was penalized according to a softened
version of the teacher's output. The framework compressed an ensemble of teacher networks into a
student network of similar depth. The student was trained to predict the output and the
classification labels. Despite its simplicity, KD demonstrates promising results in various image
classification tasks. The work in [53] aimed to address the network compression problem by taking
advantage of depth in neural networks. It proposed an approach to train thin but deep networks,
called FitNets, to compress wide and shallower (but still deep) networks. The method extended the
idea to allow for thinner and deeper student models. In order to learn from the intermediate
representations of the teacher network, FitNet made the student mimic the full feature maps


of the teacher. However, such assumptions are too strict since the capacities of teacher and student
may differ greatly.

All the above approaches are validated on MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark
datasets, and experimental results show that these methods match or outperform the teacher's
performance, while requiring notably fewer parameters and multiplications.

There are several extensions along this direction of knowledge distillation. The work in [54] trained
a parametric student model to approximate a Monte Carlo teacher. The proposed framework used online
training, and used deep neural networks for the student model. Different from previous works, which
represented the knowledge using the softened label probabilities, [55] represented the knowledge by
using the neurons in the higher hidden layer, which preserved as much information as the label
probabilities but are more compact. The work in [56] accelerated the experimentation process by
instantaneously transferring the knowledge from a previous network to each new deeper or wider
network. The techniques are based on the concept of function-preserving transformations between
neural network specifications. Zagoruyko et al. [57] proposed Attention Transfer (AT) to relax the
assumption of FitNet. They transferred the attention maps that are summaries of the full activations.

Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the
computational cost. However, there are a few disadvantages. One is that KD can only be applied to
classification tasks with softmax loss function, which hinders its usage. Another drawback is that
the model assumptions are sometimes too strict to make the performance competitive with other types
of approaches.

VI. OTHER TYPES OF APPROACHES

We first summarize the works utilizing attention-based methods. Note that attention-based mechanisms
[58] can reduce computations significantly by learning to selectively focus or “attend” to a few,
task-relevant input regions. The work in [59] introduced the dynamic capacity network (DCN) that
combined two types of modules: the small sub-networks with low capacity, and the large ones with high
capacity. The low-capacity sub-networks were active on the whole input to first find the
task-relevant areas, and then the attention mechanism was used to direct the high-capacity
sub-networks to focus on the task-relevant regions. By doing this, the size of the CNN model has been
significantly reduced.

Following this direction, the work in [60] introduced the conditional computation idea, which only
computes the gradient for some important neurons. It proposed a sparsely-gated mixture-of-experts
layer (MoE). The MoE module consisted of a number of experts, each a simple feed-forward neural
network, and a trainable gating network that selected a sparse combination of the experts to process
each input. In [61], dynamic deep neural networks (D2NN) were introduced, which were a type of
feed-forward deep neural network that selected and executed a subset of D2NN neurons based on the
input.

There have been other attempts to reduce the number of parameters of neural networks by replacing the
fully connected layer with global average pooling [44], [62]. Network architectures such as GoogleNet
or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this
idea. However, these architectures have not fully optimized the utilization of the computing
resources inside the network. This problem was noted by Szegedy et al. [62] and motivated them to
increase the depth and width of the network while keeping the computational budget constant.

The work in [63] targeted the Residual Network based model with a spatially varying computation time,
called stochastic depth, which enabled the seemingly contradictory setup to train short networks and
use deep networks at test time. It started with very deep networks, while during training, for each
mini-batch, it randomly dropped a subset of layers and bypassed them with the identity function.
Following this direction, the work in [64] proposed pyramidal residual networks with stochastic
depth. In [65], Wu et al. proposed an approach that learns to dynamically choose which layers of a
deep network to execute during inference so as to best reduce total computation. Veit et al.
exploited convolutional networks with adaptive inference graphs to adaptively define their network
topology conditioned on the input image [66].

Other approaches to reduce the convolutional overheads include using FFT based convolutions [67] and
fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called
stochastic spatial sampling pooling, which speeds up the pooling operations by a more general
stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks
termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. Those
works only aim to speed up the computation but not reduce the memory storage.

VII. BENCHMARKS, EVALUATION AND DATABASES

In the past five years the deep learning community has made great efforts in benchmark models. One of
the most well-known models used in compression and acceleration for CNNs is Alexnet [1], which has
been occasionally used for assessing the performance of compression. Other popular standard models
include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network
with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two
convolutional layers and two fully connected layers. Recently, more and more state-of-the-art
architectures are used as baseline models in many works, including network in networks (NIN) [73],
VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly
used in several typical compression methods.

The standard criteria to measure the quality of model compression and acceleration are the
compression and the speedup rates. Assume that $a$ is the number of the parameters in the original
model $M$ and $a^*$ is that of the compressed model $M^*$, then the compression rate
$\alpha(M, M^*)$ of $M^*$ over $M$ is

$$\alpha(M, M^*) = \frac{a}{a^*}. \qquad (8)$$
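
As a worked instance of the compression rate in Eq. (8), the sketch below counts the parameters of
LeNet-300-100 and divides by the parameter count of a compressed model. The 784-dimensional input and
10-way output are assumptions (typical for MNIST, not stated above), the compressed count is a
made-up figure for illustration, and the function names are hypothetical.

```python
# Minimal sketch of Eq. (8); names and the compressed size are illustrative assumptions.

def fc_param_count(layer_sizes, with_bias=True):
    """Count weights (and optionally biases) of a fully connected network.

    layer_sizes lists the width of every layer, input and output included.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out
        if with_bias:
            total += fan_out
    return total


def compression_rate(a, a_star):
    """Eq. (8): parameters of the original model over those of the compressed model."""
    if a_star <= 0:
        raise ValueError("the compressed model must have at least one parameter")
    return a / a_star


# LeNet-300-100: two hidden layers with 300 and 100 neurons.
# Assumed here: 784 inputs (28 x 28 images) and 10 output classes.
a = fc_param_count([784, 300, 100, 10])

# Hypothetical compressed model that keeps one tenth of the parameters.
a_star = a // 10

print("original parameters:  ", a)
print("compressed parameters:", a_star)
print("compression rate:     ", compression_rate(a, a_star))
```

A rate above 1 means the compressed model is smaller than the original; a rate of exactly 1 means
nothing was saved.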


                          TABLE IV
   SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.

                      <<TABLE>>
                                   low-rank factorization [40]
      Network in network [73]      low-rank factorization [40]
      Residual networks [75]       compact filters [49], stochastic depth [63]
                                   parameter sharing [24]
                                   parameter pruning [20], [22]

Another widely used measurement is the index space saving defined in several papers [30], [35] as

                     <<FORMULA>>           (9)

where $a$ and $a^*$ are the number of the dimension of the index space in the original model and that
of the compressed model, respectively.

Similarly, given the running time $s$ of $M$ and $s^*$ of $M^*$, the speedup rate $\delta(M, M^*)$ is
defined as:

$$\delta(M, M^*) = \frac{s}{s^*}. \qquad (10)$$

Most work used the average training time per epoch to measure the running time, while in [30], [35],
the average testing time was used. Generally, the compression rate and speedup rate are highly
correlated, as smaller models often result in faster computation for both the training and the
testing stages.

Good compression methods are expected to achieve almost the same performance as the original model
with far fewer parameters and less computational time. However, for different applications with
different CNN designs, the relation between parameter size and computational time may be different.
For example, it is observed that for deep CNNs with fully connected layers, most of the parameters
are in the fully connected layers; while for image classification tasks, floating-point operations
are mainly in the first few convolutional layers since each filter is convolved with the whole image,
which is usually very large at the beginning. Thus compression and acceleration of the network should
focus on different types of layers for different applications.

VIII. DISCUSSION AND CHALLENGES

In this paper, we summarized recent efforts on compressing and accelerating deep neural networks
(DNNs). Here we discuss more details about how to choose different compression approaches, and
possible challenges/solutions in this area.

A. General Suggestions

There is no golden rule to measure which approach is the best. How to choose the proper method
really depends on the applications and requirements. Here is some general guidance we can provide:

• If the applications need compacted models from pre-trained models, you can choose either pruning &
  sharing or low rank factorization based methods. If you need end-to-end solutions for your problem,
  the low rank and transferred convolutional filters approaches could be considered.
• For applications in some specific domains, methods with human prior (like the transferred
  convolutional filters, structural matrix) sometimes have benefits. For example, when doing medical
  image classification, transferred convolutional filters could work well as medical images (like
  organs) do have the rotation transformation property.
• Usually the approaches of pruning & sharing could give a reasonable compression rate while not
  hurting the accuracy. Thus for applications which require stable model accuracy, it is better to
  utilize pruning & sharing.
• If your problem involves small/medium size datasets, you can try the knowledge distillation
  approaches. The compressed student model can take the benefit of transferring knowledge from the
  teacher model, making it robust on datasets which are not large.
• As we mentioned before, techniques of the four groups are orthogonal. It is reasonable to combine
  two or three of them to maximize the performance. For some specific applications, like object
  detection, which require both convolutional and fully connected layers, you can compress the
  convolutional layers with a low rank based method and the fully connected layers with a pruning
  technique.

B. Technique Challenges

Techniques for deep model compression and acceleration are still in the early stage and the
following challenges still need to be addressed.

• Most of the current state-of-the-art approaches are built on well-designed CNN models, which have
  limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle
  more complicated tasks, they should provide more plausible ways to configure the compressed models.
• Pruning is an effective way to compress and accelerate CNNs. The current pruning techniques are
  mostly designed to eliminate connections between neurons. On the other hand, pruning channels can
  directly reduce the feature map width and shrink the model into a thinner one. It is efficient but
  also challenging because removing channels might dramatically change the input of the following
  layer.
• As we mentioned before, methods of structural matrix and transferred convolutional filters impose
  prior human knowledge on the model, which could significantly affect the performance and stability.
  It is critical to investigate how to control the impact of that prior knowledge.
• The methods of knowledge distillation provide many benefits such as directly accelerating the model
  without special hardware or implementations. It is still worth developing KD-based approaches and
  exploring how to improve their performance.
• Hardware constraints in various small platforms (e.g., mobile, robotic, self-driving car) are still
  a major problem


  to hinder the extension of deep CNNs. How to make full use of the limited computational resources
  and how to design special compression methods for such platforms are still challenges that need to
  be addressed.
• Despite the great achievements of these compression approaches, the black box mechanism is still
  the key barrier to their adoption. Exploring the knowledge interpretability is still an important
  problem.

C. Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on the recent learning-to-learn
strategies [76], [77]. This framework provides a mechanism allowing the algorithm to automatically
learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement
learning to efficiently sample the design space and improve the model compression has also been
tried [78].

Channel pruning provides the efficiency benefit on both CPU and GPU because no special
implementation is required. But it is also challenging to handle the input configuration.

see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).

IX. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and broader community for their feedback on this
survey. In particular, we would like to thank Hong Zhao from the Department of Automation of
Tsinghua University for her help on modifying the paper. This research is supported by National
Science Foundation of China with Grant number 61401169.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional
    neural networks,” in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level
    performance in face verification,” in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, “Fully-adaptive feature sharing
    in multi-task networks with applications in person attribute classification,” CoRR, vol.
    abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao,
                                                       M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale One possible solution is to use the training-based channel     distributed deep networks,” inNIPS, 2012.
        pruning methods [79], which focus on imposing sparse con-  [5]K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
        straints on weights during training. However, training from     recognition,”CoRR, vol. abs/1512.03385, 2015.
                                                    [6]Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing scratch for such method is costly for very deep CNNs. In     deep convolutional networks using vector quantization,”CoRR, vol.
        [80], the authors provided an iterative two-step algorithm to     abs/1412.6115, 2014.
        effectively prune channels in each layer.                 [7]Y. W. Q. H. Jiaxiang Wu, Cong Leng and J. Cheng, “Quantized
                                                       convolutional neural networks for mobile devices,” inIEEE Conference Exploring new types of knowledge in the teacher models     on Computer Vision and Pattern Recognition (CVPR), 2016.
        and transferring it to the student models is useful for the  [8]V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of
        knowledge distillation (KD) approaches. Instead of directly re-     neural networks on cpus,” inDeep Learning and Unsupervised Feature
                                                       Learning Workshop, NIPS 2011, 2011. ducing and transferring parameters, passing selectivity knowl-  [9]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep
        edge of neurons could be helpful. One can derive a way to     learning with limited numerical precision,” inProceedings of the
        select essential neurons related to the task [81], [82]. The     32Nd International Conference on International Conference on Machine
                                                       Learning - Volume 37, ser. ICML’15, 2015, pp. 1737–1746. intuition is that if a neuron is activated in certain regions [10]S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
        or samples, that implies these regions or samples share some     deep neural networks with pruning, trained quantization and huffman
        common properties that may relate to the task.              coding,”International Conference on Learning Representations (ICLR),
                                                       2016. For methods with the convolutional ﬁlters and the structural [11]Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network
        matrix, we can conclude that the transformation lies in the     quantization,”CoRR, vol. abs/1612.01543, 2016.
        family of functions that only operations on the spatial dimen- [12]M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training deep
                                                       neural networks with binary weights during propagations,” inAdvances sions. Hence to address the imposed prior issue, one solution is     in Neural Information Processing Systems 28: Annual Conference on
        to provide a generalization of the aforementioned approaches     Neural Information Processing Systems 2015, December 7-12, 2015,
        in two aspects: 1) instead of limiting the transformation to     Montreal, Quebec, Canada, 2015, pp. 3123–3131.
                                                   [13]M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural net- belong to a set of predeﬁned transformations, let it be the     works with weights and activations constrained to +1 or -1,”CoRR, vol.
        whole family of spatial transformations applied on 2D ﬁlters     abs/1602.02830, 2016.
        or matrix, and 2) learn the transformation jointly with all the [14]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net:
                                                       Imagenet classiﬁcation using binary convolutional neural networks,” in model parameters.                                  ECCV, 2016.
         Regarding the use of CNNs in small platforms, proposing [15]P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha,
        some general/uniﬁed approaches is one direction. Wanget al.     “Deep neural networks are robust to weight binarization and other non-
        [83] presented a feature map dimensionality reduction method     linear distortions,”CoRR, vol. abs/1606.01981, 2016.
                                                   [16]L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep by excavating and removing redundancy in feature maps gen-     networks,”CoRR, vol. abs/1611.01600, 2016.
        erated from different ﬁlters, which could also preserve intrinsic [17]Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks
        information of the original network. The idea can be applied     with few multiplications,”CoRR, vol. abs/1510.03009, 2015.
                                                   [18]S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network to make CNNs more applicable for different platforms. The     construction with back-propagation,” inAdvances in Neural Information
        work in [84] proposed a one-shot whole network compression     Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177–185.
        scheme consisting of three components: rank selection, low- [19]Y. L. Cun, J. S. Denker, and S. A. Solla, “Advances in neural information
                                                       processing systems 2,” D. S. Touretzky, Ed., 1990, ch. Optimal Brain rank tensor decomposition, and ﬁne-tuning to make deep     Damage, pp. 598–605.
        CNNs work in mobile devices.                      [20]B. Hassibi, D. G. Stork, and S. C. R. Com, “Second order derivatives
         Despite the classiﬁcation task, people are also adapting the     for network pruning: Optimal brain surgeon,” inAdvances in Neural
                                                       Information Processing Systems 5. Morgan Kaufmann, 1993, pp. 164– compacted models in other tasks [85]–[87]. We would like to     171.          IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 9


          [21] S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural networks,”
              in Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea,
              UK, September 7-10, 2015, 2015, pp. 31.1–31.12.
          [22] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections
              for efficient neural networks,” in Proceedings of the 28th International Conference
              on Neural Information Processing Systems, ser. NIPS’15, 2015.
          [23] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural
              networks with the hashing trick.” JMLR Workshop and Conference Proceedings, 2015.
          [24] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network
              compression,” CoRR, vol. abs/1702.04008, 2017.
          [25] V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain damage,” in
              2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las
              Vegas, NV, USA, June 27-30, 2016, 2016, pp. 2554–2564.
          [26] H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact cnns,” in
              European Conference on Computer Vision, Amsterdam, the Netherlands, October 2016,
              pp. 662–677.
          [27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep
              neural networks,” in Advances in Neural Information Processing Systems 29,
              D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., 2016,
              pp. 2074–2082.
          [28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for
              efficient convnets,” CoRR, vol. abs/1608.08710, 2016.
          [29] V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for small-footprint
              deep learning,” in Advances in Neural Information Processing Systems 28, C. Cortes,
              N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 3088–3096.
          [30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, “An
              exploration of parameter redundancy in deep networks with circulant projections,” in
              International Conference on Computer Vision (ICCV), 2015.
          [31] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, “Fast
              neural networks with circulant projections,” CoRR, vol. abs/1502.03436, 2015.
          [32] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang,
              “Deep fried convnets,” in International Conference on Computer Vision (ICCV), 2015.
          [33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz,
              Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidelberg: Springer Berlin
              Heidelberg, 1991, pp. 215–236.
          [34] M. V. Rakhuba and I. V. Oseledets, “Fast multidimensional convolution in low-rank
              tensor formats via cross approximation,” SIAM J. Scientific Computing, vol. 37,
              no. 2, 2015.
          [35] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, “Acdc: A structured
              efficient linear layer,” in International Conference on Learning Representations
              (ICLR), 2016.
          [36] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, “Learning separable filters,” in
              2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA,
              June 23-28, 2013, 2013, pp. 2754–2761.
          [37] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear
              structure within convolutional networks for efficient evaluation,” in Advances in
              Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes,
              N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 1269–1277.
          [38] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural
              networks with low rank expansions,” in Proceedings of the British Machine Vision
              Conference. BMVA Press, 2014.
          [39] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky,
              “Speeding-up convolutional neural networks using fine-tuned cp-decomposition,” CoRR,
              vol. abs/1412.6553, 2014.
          [40] C. Tai, T. Xiao, X. Wang, and W. E, “Convolutional neural networks with low-rank
              regularization,” CoRR, vol. abs/1511.06067, 2015.
          [41] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, “Predicting
              parameters in deep learning,” in Advances in Neural Information Processing Systems
              26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds., 2013,
              pp. 2148–2156. [Online]. Available:
              http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1053.pdf
          [42] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank
              matrix factorization for deep neural network training with high-dimensional output
              targets,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2013.
          [43] T. S. Cohen and M. Welling, “Group equivariant convolutional networks,” arXiv
              preprint arXiv:1602.07576, 2016.
          [44] S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural networks,” in
              Advances In Neural Information Processing Systems, 2016, pp. 1082–1090.
          [45] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving
              convolutional neural networks via concatenated rectified linear units,” arXiv
              preprint arXiv:1603.05201, 2016.
          [46] H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in deep neural
              networks,” arXiv preprint arXiv:1604.00676, 2016.
          [47] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic symmetry in
              convolutional neural networks,” in Proceedings of the 33rd International Conference
              on Machine Learning - Volume 48, ser. ICML’16, 2016.
          [48] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the
              impact of residual connections on learning.” CoRR, vol. abs/1602.07261, 2016.
          [49] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, small, low
              power fully convolutional neural networks for real-time object detection for
              autonomous driving,” CoRR, vol. abs/1612.01051, 2016.
          [50] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings
              of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data
              Mining, ser. KDD ’06, 2006, pp. 535–541.
          [51] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural
              Information Processing Systems 27: Annual Conference on Neural Information
              Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014,
              pp. 2654–2662.
          [52] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural
              network,” CoRR, vol. abs/1503.02531, 2015.
          [53] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets:
              Hints for thin deep nets,” CoRR, vol. abs/1412.6550, 2014.
          [54] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, “Bayesian dark
              knowledge,” in Advances in Neural Information Processing Systems 28, C. Cortes,
              N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 3420–3428.
          [55] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression by distilling
              knowledge from neurons,” in Proceedings of the Thirtieth AAAI Conference on
              Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., 2016,
              pp. 3560–3566.
          [56] T. Chen, I. J. Goodfellow, and J. Shlens, “Net2net: Accelerating learning via
              knowledge transfer,” CoRR, vol. abs/1511.05641, 2015.
          [57] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the
              performance of convolutional neural networks via attention transfer,” CoRR,
              vol. abs/1612.03928, 2016.
          [58] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning
              to align and translate,” CoRR, vol. abs/1409.0473, 2014.
          [59] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C.
              Courville, “Dynamic capacity networks,” in Proceedings of the 33rd International
              Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016,
              2016, pp. 2549–2558.
          [60] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean,
              “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,”
              2017.
          [61] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, “Deep
              dynamic neural networks for multimodal gesture segmentation and recognition,” IEEE
              Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, 2016.
          [62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
              V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer
              Vision and Pattern Recognition (CVPR), 2015.
          [63] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep Networks with
              Stochastic Depth, 2016.
          [64] Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal residual networks with
              separated stochastic depth,” CoRR, vol. abs/1612.01230, 2016.
          [65] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris,
              “Blockdrop: Dynamic inference paths in residual networks,” in CVPR, 2018.
          [66] A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,”
              2018.
          [67] M. Mathieu, M. Henaff, and Y. Lecun, Fast training of convolutional networks
              through FFTs, 2014.
          [68] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in 2016
              IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas,
              NV, USA, June 27-30, 2016, 2016, pp. 4013–4021.
          [69] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Feris, “S3pool:
              Pooling with stochastic spatial sampling,” CoRR, vol. abs/1611.05138, 2016.
          [70] F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving pooling in deep
              networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
              Recognition, 2018.
          [71] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
              document recognition,” in Proceedings of the IEEE, 1998, pp. 2278–2324.
          [72] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for
              simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, 2014.
          [73] M. Lin, Q. Chen, and S. Yan, “Network in network,” in ICLR, 2014.
          [74] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
              image recognition,” CoRR, vol. abs/1409.1556, 2014.
          [75] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
              recognition,” arXiv preprint arXiv:1512.03385, 2015.
          [76] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul,
              and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in
              Neural Information Processing Systems (NIPS), 2016.
          [77] D. Ha, A. Dai, and Q. Le, “Hypernetworks,” in International Conference on Learning
              Representations 2016, 2016.
          [78] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model
              compression and acceleration on mobile devices,” in The European Conference on
              Computer Vision (ECCV), September 2018.
          [79] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,”
              pp. 2270–2278, 2016.
          [80] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural
              networks,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
          [81] Z. Huang and N. Wang, “Data-driven sparse structure selection for deep neural
              networks,” ECCV, 2018.
          [82] Y. Chen, N. Wang, and Z. Zhang, “Darkrank: Accelerating deep metric learning via
              cross sample similarities transfer,” in Proceedings of the Thirty-Second AAAI
              Conference on Artificial Intelligence, (AAAI-18), New Orleans, Louisiana, USA,
              February 2-7, 2018, 2018, pp. 2852–2859.
          [83] Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond filters: Compact feature map for
              portable deep model,” in Proceedings of the 34th International Conference on Machine
              Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh,
              Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug
              2017, pp. 3703–3711.
          [84] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep
              convolutional neural networks for fast and low power mobile applications,” CoRR,
              vol. abs/1511.06530, 2015.
          [85] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object
              detection models with knowledge distillation,” in Advances in Neural Information
              Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
              S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 742–751.
          [86] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted
              residuals and linear bottlenecks,” in The IEEE Conference on Computer Vision and
              Pattern Recognition (CVPR), June 2018.
          [87] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer,
              Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for
              modern convolutional object detectors,” in 2017 IEEE Conference on Computer Vision
              and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017,
              pp. 3296–3297.
          [88] Y. Cheng, Q. Fan, S. Pankanti, and A. Choudhary, “Temporal sequence modeling for
              video event detection,” in The IEEE Conference on Computer Vision and Pattern
              Recognition (CVPR), June 2014.
          [89] L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong, M. Hill, G. Hua,
              J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X. Yu, “Ibm research and columbia
              university trecvid-2012 multimedia event detection (med), multimedia event
              recounting (mer), and semantic indexing (sin) systems,” 2012.

        Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he
        was a Research Staff Member at IBM T.J. Watson Research Center. Yu got his Ph.D. from
        Northwestern University in 2015 and his bachelor's degree from Tsinghua University in
        2010. His research is about deep learning in general, with specific interests in deep
        generative models, model compression, and transfer learning. He regularly serves on the
        program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

        Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the
        Harbin Institute of Technology, China, in 2015. Currently he is pursuing his Ph.D. degree
        at the Department of Automation, Tsinghua University, Beijing, P.R. China. His current
        research interests are about deep learning, particularly few-shot learning and deep
        generative models. He also works on a lot of applications in computer vision and robotics
        vision.

        Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of
        Electronic Information and Communications, Wuhan, China. He received his Ph.D. in the
        School of Electrical and Computer Engineering at the Georgia Institute of Technology in
        2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S.
        degree in the Department of Electronics and Information Engineering from HUST, Wuhan,
        China, in 2006 and 2008, respectively. His current research interests include big data
        analytics and machine learning, security and privacy, and information networks.

        Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from
        Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another
        Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is
        currently a Professor with the Department of Automation, Tsinghua University. He serves as
        the Associate Dean, School of Information Science and Technology, and Head of the
        Department of Automation. His current research interests include artificial intelligence,
        robotics, image processing, control theory, and control of spacecraft.

<|endoftext|>


<|startoftext|>

            Analysis and Design of Echo State Networks

            Mustafa C. Ozturk
            can@cnel.ufl.edu

            Dongming Xu
            dmxu@cnel.ufl.edu

            Jose C. Principe
            principe@cnel.ufl.edu

            Computational NeuroEngineering Laboratory, Department of Electrical and
            Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.


            The design of echo state network (ESN) parameters relies on the selec-
            tion of the maximum eigenvalue of the linearized system around zero
            (spectral radius). However, this procedure does not quantify in a sys-
            tematic manner the performance of the ESN in terms of approximation
            error. This article presents a functional space approximation framework
            to better understand the operation of ESNs and proposes an information-
            theoretic metric, the average entropy of echo states, to assess the richness
            of the ESN dynamics. Furthermore, it provides an interpretation of the
            ESN dynamics rooted in system theory as families of coupled linearized
            systems whose poles move according to the input signal dynamics. With
            this interpretation, a design methodology for functional approximation
            is put forward where ESNs are designed with uniform pole distributions
            covering the frequency spectrum to abide by the richness metric, irre-
            spective of the spectral radius. A single bias parameter at the ESN input,
            adapted with the modeling error, conﬁgures the ESN spectral radius to
            the input-output joint space. Function approximation examples compare
            the proposed design methodology versus the conventional design.


            1 Introduction

            Dynamic computational models require the ability to store and access the
            time history of their inputs and outputs. The most common dynamic neural
            architecture is the time-delay neural network (TDNN) that couples delay
            lines with a nonlinear static architecture where all the parameters (weights)
            are adapted with the backpropagation algorithm. The conventional delay
            line utilizes ideal delay operators, but delay lines with local ﬁrst-order re-
            cursive ﬁlters have been proposed by Werbos (1992) and extensively stud-
            ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera,
            1993). Chains of ﬁrst-order integrators are interesting because they effec-
            tively decrease the number of delays necessary to create time embeddings


           (Principe, 2001). Recurrent neural networks (RNNs) implement a differ-
           ent type of embedding that is largely unexplored. RNNs are perhaps the
           most biologically plausible of the artiﬁcial neural network (ANN) models
           (Anderson, Silverstein, Ritz, & Jones, 1977; Hopﬁeld, 1984; Elman, 1990),
           but are not well understood theoretically (Siegelmann & Sontag, 1991;
           Siegelmann, 1993; Kremer, 1995). One of the main practical problems with
           RNNs is the difficulty of adapting the system weights. Various algorithms,
           such as backpropagation through time (Werbos, 1990) and real-time recur-
           rent learning (Williams & Zipser, 1989), have been proposed to train RNNs;
           however, these algorithms suffer from computational complexity, resulting
           in slow training, complex performance surfaces, the possibility of instabil-
           ity, and the decay of gradients through the topology and time (Haykin,
           1998). The problem of decaying gradients has been addressed with spe-
           cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alter-
           native second-order training methods based on extended Kalman ﬁltering
           (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov,
           Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp
           et al., 1998) provide more reliable performance and have enabled practical
           applications in identiﬁcation and control of dynamical systems (Kechri-
           otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado,
           Kambhampati, & Warwick, 1995).
             Recently, two new recurrent network topologies have been proposed: the
           echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and
           the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram,
           2002). ESNs possess a highly interconnected and recurrent topology of
           nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001)
           and contain information about the history of input and output patterns.
           The outputs of these internal PEs (echo states) are fed to a memoryless but
           adaptive readout network (generally linear) that produces the network out-
           put. The interesting property of ESN is that only the memoryless readout is
           trained, whereas the recurrent topology has ﬁxed connection weights. This
           reduces the complexity of RNN training to simple linear regression while
           preserving a recurrent topology, but obviously places important constraints
           on the overall architecture that have not yet been fully studied. Similar ideas
           have been explored independently by Maass and formalized in the LSM
           architecture. LSMs, although formulated quite generally, are mostly im-
           plemented as neural microcircuits of spiking neurons (Maass et al., 2002),
           whereas ESNs are dynamical ANN models. Both attempt to model biolog-
           ical information processing using similar principles. We focus on the ESN
           formulation in this letter.

             The echo state condition is defined in terms of the spectral radius (the
           largest among the absolute values of the eigenvalues of a matrix, denoted
           by ‖·‖) of the reservoir’s weight matrix (‖W‖ < 1). This condition states
           that the dynamics of the ESN is uniquely controlled by the input, and the
           effect of the initial states vanishes. The current design of ESN parameters           
           relies on the selection of spectral radius. However, there are many possible
           weight matrices with the same spectral radius, and unfortunately they do
           not all perform at the same level of mean square error (MSE) for functional
           approximation. A similar problem exists in the design of the LSM. LSMs
           have been shown to possess universal approximation given the separation
           property (SP) for the liquid (reservoir in ESNs) and the approximation
           property (AP) for the readout (Maass et al., 2002). SP is quantiﬁed by a
           kernel-quality measure proposed in Maass, Legenstein, and Bertschinger
           (2005) that is based on the rank of a matrix formed by the system states
           corresponding to different input signals. The kernel quality is a measure
           for the complexity and diversity of nonlinear operations carried out by the
           liquid on its input stream in order to boost the classiﬁcation power of a
           subsequent linear decision hyperplane (Maass et al., 2005). A variation of
           SP has been proposed in Bertschinger and Natschläger (2004), and it has
           been argued that complex calculations can be best carried out by networks
           on the boundary between ordered and chaotic dynamics.

           In this letter, we are interested in studying the ESN for functional approx-
           imation (filters that map input functions u(·) of time on output functions y(·)
           of time). We see two major shortcomings with the current ESN approach
           that uses echo state condition as a design principle. First, the impact of ﬁxed
           reservoir parameters for function approximation means that the informa-
           tion about the desired response is conveyed only to the output projection.
           This is not optimal, and strategies to select different reservoirs for different
           applications have not been devised. Second, imposing a constraint only on
           the spectral radius is a weak condition to properly set the parameters of
           the reservoir, as experiments show (different randomizations with the same
           spectral radius perform differently for the same problem; see Figure 2).
             This letter aims to address these two problems by proposing a frame-
           work, a metric, and a design principle for ESNs. The framework is a signal
           processing interpretation of basis and projections in functional spaces to
           describe and understand the ESN architecture. According to this interpre-
           tation, the ESN states implement a set of basis functionals (representation
           space) constructed dynamically by the input, while the readout simply
           projects the desired response onto this representation space. The metric
           to describe the richness of the ESN dynamics is an information-theoretic
           quantity, the average state entropy (ASE). Entropy measures the amount of
           information contained in a given random variable (Shannon, 1948). Here,
           the random variable is the instantaneous echo state from which the en-
           tropy for the overall state (vector) is estimated. The probability density
           function (pdf) in a differential geometric framework should be thought of
           as a volume form; that is, in our case, the pdf of the state vector describes
           the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946)
           established information as a coordinate free metric in the state manifold.
           Therefore, entropy becomes a global descriptor of information that quanti-
           ﬁes the volume of the manifold deﬁned by the random variable. Due to the
           time dependency of the states, the state entropy averaged over time (ASE)
           is an appropriate estimate of the volume of the state manifold.
             The design principle speciﬁes that one should consider independently
           the correlation among the basis and the spectral radius. In the absence of any
           information about the desired response, the ESN states should be designed
           with the highest ASE, independent of the spectral radius. We interpret the
           ESN dynamics as a combination of time-varying linear systems obtained
           from the linearization of the ESN nonlinear PE in a small, local neighbor-
           hood of the current state. The design principle means that the poles of the
           linearized ESN reservoir should have uniform pole distributions to gener-
           ate echo states with the most diverse pole locations (which correspond to
           the uniformity of time constants). Effectively, this will create the least cor-
           related bases for a given spectral radius, which corresponds to the largest
           volume spanned by the basis set. When the designer has no other informa-
           tion about the desired response to set the basis, this principle distributes
           the system’s degrees of freedom uniformly in space. It approximates for
           ESNs the well-known property of orthogonal basis. The unresolved issue
           that ASE does not quantify is how to set the spectral radius, which depends
           again on the desired mapping. The concept of memory depth as explained
           in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the
           issues associated with the spectral radius. The correlation time of the de-
           sired response (as estimated by the ﬁrst zero of the autocorrelation function)
           gives an indication of the type of spectral radius required (long correlation
           time requires high spectral radius). Alternatively, a simple adaptive bias is
           added at the ESN input to control the spectral radius integrating the infor-
           mation from the input-output joint space in the ESN bases. For sigmoidal
           PEs, the bias adjusts the operating points of the reservoir PEs, which has
           the net effect of adjusting the volume of the state manifold as required to
           approximate the desired response with a small error. This letter shows that
           ESNs designed with this strategy obtain systematically better results in a
           set of experiments when compared with the conventional ESN design.


           2 Analysis of Echo State Networks

              2.1 Echo States as Bases and Projections. Let us consider the ar-
           chitecture and recursive update equation of a typical ESN more closely.
           Consider the recurrent discrete-time neural network given in Figure 1
           with M input units, N internal PEs, and L output units. The value of
           the input units at time n is <<u(n)=[u1 (n),u2 (n),...,uM (n)]^T>> , of the internal
           units is <<x(n)=[x1 (n),x2 (n),...,xN (n)]^T>> , and of the output units is <<y(n)=
           [y1 (n),y2 (n),...,yL (n)]^T>> . The connection weights are given in an N×M
           weight matrix <<W_in =(w_ij^in)>> for connections between the input and the inter-
           nal PEs, in an N×N matrix <<W=(w_ij)>> for connections between the internal
           PEs, in an L×N matrix <<W_out =(w_ij^out)>> for connections from PEs to the
          Input Layer Dynamical Reservoir Read-out

                            <<FIGURE>>

           Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed-
           weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent net-
           work is a reservoir of highly interconnected dynamical components, states of
           which are called echo states. The memoryless linear readout is trained to pro-
           duce the output.


           output units, and in an N×L matrix <<W_back =(w_ij^back)>> for the connections
           that project back from the output to the internal PEs (Jaeger, 2001). The
           activation of the internal PEs (echo state) is updated according to

                             <<x(n+1)=f(W_in u(n+1)+Wx(n)+W_back y(n))>>,             (2.1)

           where f=(f1 ,f2 ,...,fN ) are the internal PEs’ activation functions. Here, all
           fi ’s are hyperbolic tangent functions ((e^x − e^−x)/(e^x + e^−x)). The output from the readout
           network is computed according to

               <<y(n+1)=f_out (W_out x(n+1))>>,                           (2.2)

           where <<f_out =(f_out,1 ,f_out,2 ,...,f_out,L )>> are the output units’ nonlinear functions (Jaeger, 2001, 2002a).
           Generally, the readout is linear so f_out is identity.
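
A minimal NumPy sketch of the state update and the linear readout described above may help fix the notation. It assumes no output feedback (W_back = 0, as in the experiments of this letter), tanh PEs, and an identity f_out. The function names, the network sizes, and the random weights are illustrative assumptions, not taken from the letter.

```python
import numpy as np

def run_reservoir(u, W_in, W, x0=None):
    """Iterate x(n+1) = tanh(W_in u(n+1) + W x(n)); returns states of shape (T, N)."""
    u = np.asarray(u, dtype=float)
    if u.ndim == 1:
        u = u[:, None]                      # treat a 1-D signal as M = 1 input unit
    N = W.shape[0]
    x = np.zeros(N) if x0 is None else np.asarray(x0, dtype=float)
    states = np.empty((u.shape[0], N))
    for n, u_n in enumerate(u):
        x = np.tanh(W_in @ u_n + W @ x)     # echo state update without feedback
        states[n] = x
    return states

def readout(states, W_out):
    """Linear readout y(n) = W_out x(n), applied to every time step."""
    return states @ W_out.T

# Hypothetical sizes and weights, chosen only to exercise the functions.
rng = np.random.default_rng(0)
M, N, L, T = 1, 20, 1, 50
W_in = rng.choice([-1.0, 1.0], size=(N, M))
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius 0.9
W_out = rng.normal(size=(L, N))                   # untrained readout, for shape only

u = np.sin(2 * np.pi * np.arange(T) / 20)
X = run_reservoir(u, W_in, W)
Y = readout(X, W_out)
```

Only W_out would be adapted in training; W_in and W stay fixed after construction.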
             ESNs resemble the RNN architecture proposed in Puskorius and
           Feldkamp (1996) and also used by Sanchez (2004) in brain-machine
           interfaces. The critical difference is the dimensionality of the hidden re-
           current PE layer and the adaptation of the recurrent weights. We submit
           that the ideas of approximation theory in functional spaces (bases and pro-
           jections), so useful in adaptive signal processing (Principe, 2001), should
           be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued
           function of a real-valued vector

              <<u(t)=[u1 (t),u2 (t),...,uM (t)]^T>>.

           In functional approximation, the goal is to estimate the behavior of h(u(t))
           as a combination of simpler functions ϕi (t), called the basis functionals,
           such that its approximant, ĥ(u(t)), is given by

                   <<ĥ(u(t)) = Σ_i ai ϕi (u(t))>>.

           Here, ai ’s are the projections of h(u(t)) onto each basis function. One of
           the central questions in practical functional approximation is how to choose
           the set of bases to approximate a given desired signal. In signal processing,
           the choice normally goes for a complete set of orthogonal basis, independent
           of the input. When the basis set is complete and can be made as large
           as required, ﬁxed bases work wonders (e.g., Fourier decompositions). In
           neural computing, the basic idea is to derive the set of bases from the
           input signal through a multilayered architecture. For instance, consider a
           single hidden layer TDNN with N PEs and a linear output. The hidden-
           layer PE outputs can be considered a set of nonorthogonal basis functionals
           dependent on the input,

                    <<ϕi (u(t)) = g(Σ_j bij uj (t))>>.

           bij ’s are the input layer weights, and g is the PE nonlinearity. The approxi-
           mation produced by the TDNN is then

                    <<ĥ(u(t)) = Σ_i ai g(Σ_j bij uj (t))>>,                                (2.3)

           where ai ’s are the weights of the output layer. Notice that the bij ’s adapt
           the bases and the ai ’s adapt the projection in the projection space. Here the
           goal is to restrict the number of bases (number of hidden layer PEs) because
           their number is coupled with the number of parameters to adapt, which
           has an impact on generalization and training set size, for example. Usually,
           since all of the parameters of the network are adapted, the best basis in the
           joint (input and desired signals) space as well as the best projection can be
           achieved and represents the optimal solution. The output of the TDNN is
           a linear combination of its internal representations, but to achieve a basis
           set (even if nonorthogonal), linear independence among the ϕi (u(t))’s must
           be enforced. Ito, Shah and Poon, and others have shown that this is indeed
           the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
           the scope of this article.
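
The fixed-basis case mentioned above (a complete orthogonal set, independent of the input) can be illustrated with a short NumPy sketch: a desired signal is projected onto a growing Fourier basis by least squares, and the approximation error is recorded. The square-wave target, the sampling grid, and the function names are hypothetical choices made only for this illustration.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200, endpoint=False)
target = np.sign(np.sin(2 * np.pi * t))   # hypothetical desired signal (square wave)

def fourier_basis(t, n_harmonics):
    """Columns are the fixed basis functions: a constant plus sine/cosine pairs."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

def project(basis, target):
    """Least-squares projection coefficients a_i and the resulting approximant."""
    a, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return a, basis @ a

errors = []
for n_harmonics in (1, 3, 9):
    _, approx = project(fourier_basis(t, n_harmonics), target)
    errors.append(np.mean((target - approx) ** 2))
```

Because the three bases are nested, the error cannot increase as harmonics are added; in the TDNN and ESN cases the basis columns would instead depend on the input signal.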

             The ESN (and the RNN) architecture can also be studied in this frame-
           work. The states of equation 2.1 correspond to the basis set, which are
           recursively computed from the input, output, and previous states through
           W_in, W, and W_back. Notice, however, that none of these weight matrices is
           adapted, that is, the functional bases in the ESN are uniquely deﬁned by the
           input and the initial selection of weights. In a sense, ESNs are trading the
           adaptive connections in the RNN hidden layer for a brute force approach
           of creating ﬁxed diversiﬁed dynamics in the hidden layer.
             For an ESN with a linear readout network, the output equation
           (y(n+1) = W_out x(n+1)) has the same form as equation 2.3, where the ϕi ’s and
           ai ’s are replaced by the echo states and the readout weights, respectively.
           The readout weights are adapted on the training data, which means that the
           ESN is able to ﬁnd the optimal projection in the projection space, just like
           the RNN or the TDNN.

             A similar perspective of basis and projections for information processing
           in biological networks has been proposed by Pouget and Sejnowski (1997).
           They explored the possibility that the response of neurons in parietal cortex
           serves as basis functions for the transformations from the sensory input
           to the motor responses. They proposed that “the role of spatial represen-
           tations is to code the sensory inputs and posture signals in a format that
           simpliﬁes subsequent computation, particularly in the generation of motor
           commands”.

             The central issue in ESN design is exactly the nonadaptive nature of
           the basis set. Parameter sets in the reservoir that provide linearly inde-
           pendent states and possess a given spectral radius may deﬁne drastically
           different projection spaces because the correlation among the bases is not
           constrained. A simple experiment was designed to demonstrate that the se-
           lection of the ESN parameters by constraining the spectral radius is not the
           most suitable for function approximation. Consider a 100-unit ESN where
           the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let
           the ESN generate the seventh power of the input signal. Different realiza-
           tions of a randomly connected 100-unit ESN were constructed where the
           entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025,
           and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input
           weights are set to +1 or −1 with equal probabilities, and W_back is set to
           zero. Input is applied for 300 time steps, and the echo states are calculated
           using equation 2.1. The next step is to train the linear readout. One method

                                      <<FIGURE>>

           Figure 2: Performances of ESNs for different realizations of W with the same
           weight distribution. The weight values are set to 0.4, −0.4, and 0 with proba-
           bilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius
           of 0.88. In the 50 realizations, MSEs vary from 5.9×10^−9 to 8.9×10^−5. Results
           show that for each set of random weights that provide the same spectral ra-
           dius, the correlation or degree of redundancy among the bases will change, and
           different performances are encountered in practice.


           to determine the optimal output weight matrix, W_out, in the mean square
           error (MSE) sense (where MSE is defined by <<FORMULA>>) is to use the
           Wiener solution given by Haykin (2001):

                                        <<FORMULA>>

           Here, E[.] denotes the expected value operator, and d denotes the desired
           signal. Figure 2 depicts the MSE values for 50 different realizations of
           the ESNs. As observed, even though each ESN has the same sparseness
           and spectral radius, the MSE values obtained vary greatly among differ-
           ent realizations. The minimum MSE value obtained among the 50 realiza-
           tions is 5.9×10^−9, whereas the maximum MSE is 8.9×10^−5. This experiment
           demonstrates that a design strategy that is based solely on the spectral
           radius is not sufﬁcient to specify the system architecture for function ap-
           proximation. This shows that for each set of random weights that provide
           the same spectral radius, the correlation or degree of redundancy among the
           bases will change, and different performances are encountered in practice.
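
The experiment above can be approximated with the following NumPy sketch. Several details are assumptions rather than the authors' procedure: each random W is rescaled so that its spectral radius is exactly 0.88, no initial transient is discarded, the readout is fitted by ordinary least squares (the sample-based counterpart of the Wiener solution), and the MSE is the plain mean of squared errors. Seeds, the number of trials, and function names are hypothetical.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of W."""
    return np.max(np.abs(np.linalg.eigvals(W)))

def run_trial(seed, N=100, T=300, rho=0.88):
    rng = np.random.default_rng(seed)
    # Sparse reservoir: entries 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95.
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W *= rho / spectral_radius(W)            # assumption: force the same radius
    w_in = rng.choice([-1.0, 1.0], size=N)   # input weights +1 or -1

    n = np.arange(T)
    u = np.sin(2 * np.pi * n / (10 * np.pi)) # input signal from the text
    d = u ** 7                               # desired response: seventh power

    x = np.zeros(N)
    X = np.empty((T, N))
    for k in range(T):
        x = np.tanh(w_in * u[k] + W @ x)     # state update with W_back = 0
        X[k] = x

    w_out, *_ = np.linalg.lstsq(X, d, rcond=None)   # least-squares readout
    mse = np.mean((d - X @ w_out) ** 2)
    return spectral_radius(W), mse

results = [run_trial(seed) for seed in range(10)]
```

Comparing the `mse` values across seeds shows how much the fit varies among realizations that share one spectral radius; the numbers will not match Figure 2, since the random draws and estimation details differ.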

             2.2 ESN Dynamics as a Combination of Linear Systems.
             
           It is well known that the dynamics of a nonlinear system can be approximated by
           that of a linear system in a small neighborhood of an equilibrium point
           (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis
           with hyperbolic tangent nonlinearities and approximate the ESN dynam-
           ics by the dynamics of the linearized system in the neighborhood of the
           current system state. Hence, when the system operating point varies over
           time, the linear system approximating the ESN dynamics changes. We are
           particularly interested in the movement of the poles of the linearized ESN.
           Consider the update equation for the ESN without output feedback given
           by

               <<x(n+1)=f(Win u(n+1)+Wx(n))>>.

           Linearizing the system around the current state x(n), one obtains the
           Jacobian matrix, <<J(n+1)>>, deﬁned by
                
                <<J(n+1) = [f'(net_i(n)) w_ij] = F(n) W, with F(n) = diag(f'(net_1(n)), ..., f'(net_N(n)))>>.      (2.5)

           Here, net_i(n) is the ith entry of the vector <<(W_in u(n+1)+Wx(n))>>, and w_ij
           denotes the (i,j)th entry of W. The poles of the linearized system at time
           n+1 are given by the eigenvalues of the Jacobian matrix J(n+1). 1 As the
           amplitude of each PE changes, the local slope changes, and so the poles of
           the linearized system are time varying, although the parameters of ESN are
           fixed.

             1 The transfer function of a linear system <<x(n+1)=Ax(n)+Bu(n)>> is
           <<X(z)/U(z) = (zI−A)^−1 B = Adjoint(zI−A) B / det(zI−A)>>. The poles of the
           transfer function can be obtained by solving <<det(zI−A)=0>>. The solution
           corresponds to the eigenvalues of A.

           In order to visualize the movement of the poles, consider an ESN with
           100 states. The entries of the internal weight matrix are chosen to be 0,
           0.4 and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a
           spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with
           equal probabilities. A sinusoidal signal with a period of 100 is fed to the
           system, and the echo states are computed according to equation 2.1. Then
           the Jacobian matrix and the eigenvalues are calculated using equation 2.5.
           Figure 3 shows the pole tracks of the linearized ESN for different input
           values. A single ESN with ﬁxed parameters implements a combination of
           many linear systems with varying pole locations, hence many different
           time constants that modulate the richness of the reservoir of dynamics as a
           function of input amplitude. Higher-amplitude portions of the signal tend
           to saturate the nonlinear function and cause the poles to shrink toward
           the origin of the z-plane (decreases the spectral radius), which results in a
           system with a large stability margin. When the input is close to zero, the
           poles of the linearized ESN are close to the maximal spectral radius chosen,
           decreasing the stability margin. When compared to its linear counterpart,
           an ESN with the same number of states results in a detailed coverage of
           the z-plane dynamics, which illustrates the power of nonlinear systems.
           Similar results can be obtained using signals of different shapes at the ESN
           input.
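
The pole-tracking procedure described above can be sketched in NumPy as follows. The sketch assumes tanh PEs, so the local slope is 1 − tanh²(net_i), and it forms the Jacobian as the row-scaled matrix diag(slopes)·W. The seed, the run length of 300 steps, and the function name are hypothetical.

```python
import numpy as np

def linearized_poles(W, w_in, u_next, x):
    """Poles of the ESN linearized around state x, plus the next state.

    net = w_in * u(n+1) + W x(n); J(n+1) = diag(1 - tanh(net)^2) W.
    """
    net = w_in * u_next + W @ x
    slopes = 1.0 - np.tanh(net) ** 2       # derivative of tanh at each PE
    J = slopes[:, None] * W                # scale row i of W by slope i
    return np.linalg.eigvals(J), np.tanh(net)

rng = np.random.default_rng(1)
N = 100
# Entries 0, 0.4, -0.4 with probabilities 0.9, 0.05, 0.05, as in the text.
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.95
w_in = rng.choice([-1.0, 1.0], size=N)
u = np.sin(2 * np.pi * np.arange(300) / 100)       # sinusoid with period 100

x = np.zeros(N)
radii = np.empty(len(u))
for n in range(len(u)):
    poles, x = linearized_poles(W, w_in, u[n], x)
    radii[n] = np.max(np.abs(poles))       # largest pole magnitude at this step
```

Plotting `radii` against `u`, or scattering `poles` in the complex plane at selected steps, gives a picture comparable in kind to the pole tracks discussed in the text.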
             A key corollary of the above analysis is that the spectral radius of an
           ESN can be adjusted using a constant bias signal at the ESN input without
           changing the recurrent connection matrix, W. The application of a nonzero
           constant bias will move the operating point to regions of the sigmoid func-
           tion closer to saturation and always decrease the spectral radius due to the
           shape of the nonlinearity. 2 The relevance of bias in terms of overall system
           performance has also been discussed in Jaeger (2002b) and Bertschinger
           and Natschläger (2004), but here we approach it from a system theory per-
           spective and explain its effect on reservoir dynamics.

           3 Average State Entropy as a Measure of the Richness of ESN Reservoir

           Previous research was aware of the inﬂuence of diversity of the recurrent
           layer outputs on the overall performance of ESNs and LSMs. Several met-
           rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,


             2 Assume W has nondegenerate eigenvalues and corresponding linearly independent
           eigenvectors. Then consider the eigendecomposition of W, where <<FORMULA>>, P is the
           eigenvector matrix and D is the diagonal matrix of eigenvalues <<FORMULA>> of W. Since F(n) and D
           are diagonal, <<FORMULA>> is the eigendecomposition
           of <<J(n+1)>>. Here, each entry of <<FORMULA>>, is an eigenvalue of J. Therefore,
           <<FORMULA>> since <<FORMULA>>.

                              <<FIGURE>>

           Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input
           goes through a cycle. An ESN with ﬁxed parameters implements a combination
           of linear systems with varying pole locations. (A) One cycle of sinusoidal signal
           with a period of 100. (B–E) The positions of poles of the linearized systems
           when the input values are at B, C, D, and E in Figure 3A. (F) The cumulative
           pole locations show the movement of the poles as the input changes. Due to
           the varying pole locations, different time constants modulate the richness of
           the reservoir of dynamics as a function of input amplitude. Higher-amplitude
           signals tend to saturate the nonlinear function and cause the poles to shrink
           toward the origin of the z-plane (decreases the spectral radius), which results in
           a system with a large stability margin. When the input is close to zero, the poles
           of the linearized ESN are close to the maximal spectral radius chosen, decreasing
           the stability margin. An ESN with more states results in a detailed coverage of
           the z-plane dynamics, which illustrates the power of nonlinear systems, when
           compared to their linear counterpart.

           Here, our approach of bases and projections leads to a new metric.
           We propose the instantaneous state entropy to quantify the distribution of
           instantaneous amplitudes across the ESN states. Entropy of the instanta-
           neous ESN states is appropriate to quantify performance in function ap-
           proximation because the ESN output is a mere weighted combination of
           the instantaneous value of the ESN states. If the echo state’s instantaneous
           amplitudes are concentrated on only a few values across the ESN state dy-
           namic range, the ability to approximate an arbitrary desired response by
           weighting the states is limited (and wasteful due to redundancy between
           the different states), and performance will suffer. On the other hand, if the
           ESN states provide a diversity of instantaneous amplitudes, it is much eas-
           ier to achieve the desired mapping. Hence, the instantaneous entropy of the
           states appears as a good measure to quantify the richness of dynamics with
           instantaneous mappers. Due to the time structure of signals, the average
           state entropy (ASE), deﬁned as the state entropy averaged over time, will be
           the parameter used to quantify the diversity in the dynamical reservoir of
           the ESN. Moreover, entropy has been proposed as an appropriate measure
           of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE
           measures the volume of the echo state manifold spanned by trajectories.
             Renyi’s quadratic entropy is employed here because it is a global measure
           of information. In addition, an efficient nonparametric estimator of Renyi’s
           entropy, which avoids explicit pdf estimation, has been developed (Principe,
           Xu, & Fisher, 2000). Renyi’s entropy with parameter γ for a random variable
           X with a pdf <<f_X(x)>> is given by Renyi (1970):


                        <<H_γ(X) = (1/(1−γ)) log ∫ f_X(x)^γ dx>>.


           Renyi’s quadratic entropy is obtained for γ=2 (for γ→1, Shannon’s en-
           tropy is obtained). Given N samples {x1 ,x2 ,...,xN } drawn from the un-
           known pdf to be estimated, Parzen windowing approximates the underly-
           ing pdf by

                        <<f̂_X(x) = (1/N) Σ_{i=1}^{N} Kσ (x − xi )>>,

           where Kσ is the kernel function with the kernel size σ. Then Renyi’s
           quadratic entropy can be estimated by (Principe et al., 2000)

                          <<FORMULA>>


             The instantaneous state entropy is estimated using equation 3.1 where
           the samples are the entries of the state vector x(n)=[x1 (n),x2 (n),...,xN (n)]^T
           of an ESN with N internal PEs. Results will be shown with a gaussian kernel
           with kernel size chosen to be 0.3 of the standard deviation of the entries
           of the state vector. We will show that ASE is a more sensitive parameter to
           quantify the approximation properties of ESNs by experimentally demon-
           strating that ESNs with different spectral radius and even with the same
           spectral radius display different ASEs.
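
A possible NumPy sketch of the instantaneous state entropy and its time average is given below. It is a plug-in estimate obtained by inserting a Gaussian Parzen estimate into Renyi's quadratic entropy; integrating the squared estimate analytically yields pairwise Gaussian kernels of variance 2σ². Whether equation 3.1 is written with kernel size σ or σ√2 cannot be recovered from this text, so that choice is an assumption, as are the function names and the small floor on σ that guards against a constant state vector.

```python
import numpy as np

def renyi_quadratic_entropy(samples, sigma):
    """H2 = -log( (1/N^2) sum_i sum_j G(x_i - x_j; variance 2 sigma^2) )."""
    x = np.asarray(samples, dtype=float)
    diffs = x[:, None] - x[None, :]        # all pairwise differences
    var = 2.0 * sigma ** 2                 # variance of the convolved kernel
    kernel = np.exp(-diffs ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return -np.log(kernel.mean())

def average_state_entropy(states, kernel_scale=0.3):
    """Average over time of the entropy of each state vector x(n).

    states has shape (T, N); the kernel size is kernel_scale times the
    standard deviation of the entries of x(n), following the text.
    """
    entropies = []
    for x in states:
        sigma = max(kernel_scale * np.std(x), 1e-12)   # assumed numerical floor
        entropies.append(renyi_quadratic_entropy(x, sigma))
    return float(np.mean(entropies))
```

With states collected as a (T, N) array from any reservoir simulation, `average_state_entropy(states)` returns a single scalar that can be compared across reservoirs driven by the same input.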

             Let us consider the same 100-unit ESN that we used in the previous
           section built with three different spectral radii 0.2, 0.5, 0.8 with an input
           signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks.
           The instantaneous state entropy is also calculated at each time step using
           equation 3.1 and plotted in Figure 4B. First, note that the instantaneous
           state entropy changes over time with the distribution of the echo states as
           we would expect, since state entropy is dependent on the input signal that
           also changes in this case. Second, as the spectral radius increases in the
           simulation, the diversity in the echo states increases. For the spectral radius
           of 0.2, echo state’s instantaneous amplitudes are concentrated on only a
           few values, which is wasteful due to redundancy between different states.
           In practice, to quantify the overall representation ability over time, we will
           use ASE, which takes values −0.735, −0.007, and 0.335 for the spectral
           radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral
           radius, several ASEs are possible. Figure 4C shows ASEs from 50 different
           realizations of ESNs with the same spectral radius of 0.5, which means that
           ASE is a ﬁner descriptor of the dynamics of the reservoir. Although we
           have presented an experiment with sinusoidal signal, similar results are
           obtained for other inputs as long as the input dynamic range is properly
           selected.

             Maximizing ASE means that the diversity of the states over time is the
           largest and should provide a basis set that is as uncorrelated as possible.
           This condition is unfortunately not a guarantee that the ESN so designed
           will perform the best, because the basis set in ESNs is created independent
           of the desired response and the application may require a small spectral
           radius. However, we maintain that when the desired response is not ac-
           cessible for the design of the ESN bases or when the same reservoir is
           to be used for a number of problems, the default strategy should be to
           maximize the ASE of the state vector. The following section addresses
           the design of ESNs with high ASE values and a simple mechanism to
           adjust the reservoir dynamics without changing the recurrent connection
           weights.

           4 Designing Echo State Networks

             4.1 Design of the Echo State Recurrent Connections. According to the
           interpretation of ESNs as coupled linear systems, the design of the internal
           connection matrix, W, will be based on the distribution of the poles of the
           linearized system around zero state. Our proposal is to design the ESN
           such that the linearized system has uniform pole distribution inside the
           unit circle of the z-plane. With this design scenario, the system dynamics
           will include uniform coverage of time constants arising from the uniform
           distribution of the poles, which also decorrelates as much as possible the
           basis functionals. This principle was chosen by analogy to the identiﬁcation
           of linear systems using Kautz filters (Kautz, 1954), which shows that the best
           approximation of a given transfer function by a linear system with ﬁnite
           order is achieved when poles are placed in the neighborhood of the spectral
           resonances. When no information is available about the desired response,
           we should uniformly spread the poles to anticipate good approximation to
           arbitrary mappings.

             We again use a maximum entropy principle to distribute the poles inside
           the unit circle uniformly. The constraints of a circle as boundary conditions
           for discrete linear systems and complex conjugate locations are easy to
           include for the pole distribution (Thogula, 2003). The poles are ﬁrst initial-
           ized at random locations; the quadratic Renyi’s entropy is calculated by
           equation 3.1, and poles are moved such that the entropy of the new dis-
           tribution is increased over iterations (Erdogmus & Principe, 2002). This
           method is efﬁcient to ﬁnd uniform coverage of the unit circle with an arbi-
           trary number of poles. The system with the uniform pole locations can be
           interpreted using linear system theory. The poles that are close to the unit
           circle correspond to many sharp bandpass ﬁlters specializing in different
           frequency regions, whereas the inner poles realize ﬁlters of larger frequency
           support. Moreover, different orientations (angles) of the poles create ﬁlters
           of different center frequencies.
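
           The entropy-driven pole placement described above can be sketched as
           follows. Equation 3.1 is not reproduced in this section, so the sketch
           assumes a Gaussian Parzen-window estimator of Renyi's quadratic entropy;
           the function names, the kernel size sigma, the step size, and the
           accept-only-if-entropy-increases rule are illustrative assumptions, not
           the procedure of Erdogmus and Principe (2002).

```python
import numpy as np

def renyi_quadratic_entropy(z, sigma):
    """Parzen-window estimate of Renyi's quadratic entropy of complex points z
    (assumed Gaussian kernel of size sigma)."""
    d2 = np.abs(z[:, None] - z[None, :]) ** 2   # squared pairwise distances
    s2 = 2.0 * sigma ** 2                       # variance of two convolved kernels
    potential = np.mean(np.exp(-d2 / (2.0 * s2)) / (2.0 * np.pi * s2))
    return -np.log(potential)

def with_conjugates(upper):
    """Close a set of poles under complex conjugation."""
    return np.concatenate([upper, np.conj(upper)])

def spread_poles(n_pairs, radius=0.9, sigma=0.2, n_iter=300, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial poles inside the upper half of the disc |z| <= radius.
    r = radius * np.sqrt(rng.uniform(size=n_pairs))
    theta = rng.uniform(0.0, np.pi, size=n_pairs)
    upper = r * np.exp(1j * theta)
    h_init = h = renyi_quadratic_entropy(with_conjugates(upper), sigma)
    for _ in range(n_iter):
        z = with_conjugates(upper)
        diff = upper[:, None] - z[None, :]
        weight = np.exp(-np.abs(diff) ** 2 / (4.0 * sigma ** 2))
        push = (weight * diff).sum(axis=1)      # kernel-weighted repulsion
        trial = upper + step * push
        mag = np.abs(trial)
        # Project poles that left the disc back onto its boundary.
        trial = np.where(mag > radius, trial * radius / np.maximum(mag, 1e-12), trial)
        h_trial = renyi_quadratic_entropy(with_conjugates(trial), sigma)
        if h_trial > h:                         # keep only moves that raise the entropy
            upper, h = trial, h_trial
        else:
            step *= 0.5                         # otherwise retry with a smaller step
    return with_conjugates(upper), h_init, h

poles, h_before, h_after = spread_poles(n_pairs=10)
```

           Working on the upper half plane and mirroring keeps the pole set closed
           under conjugation, so the resulting characteristic polynomial stays real.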

             Now the problem is to construct an internal weight matrix from the pole
           locations (eigenvalues of W). In principle, we would like to create a sparse

                                    <<FIGURE>>

           Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs
           of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and 0.8,
           from top to bottom, respectively. The diversity of echo states increases when the
           spectral radius increases. Within the dynamic range of the echo states, systems
           with smaller spectral radius can generate only uneven representations, while
           for a spectral radius of 0.8, outputs of echo states distribute almost uniformly within their
           dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1.
           Information contained in the echo states is changing over time according to the
           input amplitude. Therefore, the richness of representation is controlled by the
           input amplitude. Moreover, the value of ASE increases with spectral radius.
           (C) ASEs from 50 different realizations of ESNs with the same spectral radius
           of 0.5. The plot shows that ASE is a ﬁner descriptor of the dynamics of the
           reservoir than the spectral radius. 

           matrix, so we started with the sparsest matrix (with an inverse), which is
           the direct canonical structure given by Kailath (1980):

                  <<FORMULA>>

           The characteristic polynomial of W is

                  <<FORMULA>>,                        (4.2)

           where p_i's are the eigenvalues and a_i's are the coefficients of the character-
           istic polynomial of W. Here, we know the pole locations of the linear system
           obtained from the linearization of the ESN, so using equation 4.2, we can
           obtain the characteristic polynomial and construct the W matrix in the canon-
           ical form using equation 4.1. We will call the ESN constructed based on
           the uniform pole principle ASE-ESN. All other possible solutions with the
           same eigenvalues can be obtained by Q^{-1} W Q, where Q is any nonsingular
           matrix.
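
           As a minimal sketch of this construction, the snippet below expands a
           conjugate-closed set of poles into polynomial coefficients and places
           them in a companion matrix. The layout used here (negated coefficients
           in the first row, ones on the subdiagonal) is an assumption about the
           form of equation 4.1, and the example poles and function name are
           hypothetical.

```python
import numpy as np

def companion_from_poles(poles):
    """Real matrix in direct canonical (companion) form with the given eigenvalues."""
    # np.poly returns [1, a_1, ..., a_N]; the coefficients are real when the
    # poles come in complex-conjugate pairs, so the imaginary part is dropped.
    coeffs = np.real(np.poly(poles))
    n = len(poles)
    W = np.zeros((n, n))
    W[0, :] = -coeffs[1:]            # first row: negated polynomial coefficients
    W[1:, :-1] = np.eye(n - 1)       # ones on the subdiagonal
    return W

pairs = np.array([0.5 + 0.4j, -0.3 + 0.6j, 0.1 + 0.8j])
poles = np.concatenate([pairs, np.conj(pairs)])
W = companion_from_poles(poles)

# Any similarity transform Q^{-1} W Q keeps the same eigenvalues.
rng = np.random.default_rng(0)
Q = rng.normal(size=W.shape)
W_similar = np.linalg.solve(Q, W @ Q)
```

           The companion form has at most 2N − 1 nonzero entries, which is why it
           is a natural sparse starting point; for large N, computing polynomial
           coefficients from roots can become numerically delicate.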

             To corroborate our hypothesis, we would like to show that the linearized
           ESN designed with the recurrent weight matrix having the eigenvalues
           uniformly distributed inside the unit circle creates higher ASE values for a
           given spectral radius compared to other ESNs with random internal con-
           nection weight matrices. We will consider an ESN with 30 states and use our
           procedure to create the W matrix for ASE-ESN for different spectral radii
           in <<[0.1, 0.95]>>. Similarly, we constructed ESNs with sparse random W
           matrices with different sparseness constraints. This corresponds to a weight
           distribution having the values 0, c, and −c with probabilities <<p_1>>, <<(1−p_1)/2>>,
           and <<(1−p_1)/2>>, where p_1 defines the sparseness of W and c is a constant
           that takes a specific value depending on the spectral radius. We also created
           W matrices with values uniformly distributed between −1 and 1 (U-ESN)
           and scaled to obtain a given spectral radius (Jaeger & Haas, 2004). Then,
           for different W_in matrices, we run the ASE-ESNs with the sinusoidal input
           given in section 3 and calculate ASE. Figure 5 compares the ASE values
           averaged over 1000 realizations. As observed from the ﬁgure, the ASE-ESN
           with uniform pole distribution generates higher ASE on average for all
           spectral radii compared to ESNs with sparse and uniform random connec-
           tions. This approach is indeed conceptually similar to Jeffreys’ maximum
           entropy prior (Jeffreys, 1946): it will provide a consistently good response
           for the largest class of problems. Concentrating the poles of the linearized


                                    <<FIGURE>>

           Figure 5: Comparison of ASE values obtained for ASE-ESN having W with
           uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN
           with uniformly distributed weights between −1 and 1. Randomly generated
           weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the
           networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole
           distribution generates a higher ASE on average for all spectral radii compared
           to ESNs with random connections.


           system in certain regions of the space provides good performance only if
           the desired response has energy in this part of the space, as is well known
           from the theory of Kautz ﬁlters (Kautz, 1954).
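
           The two random baselines used in this comparison can be generated along
           the following lines. This is a sketch under stated assumptions: the
           function names are hypothetical, the constant c is obtained by rescaling
           a {0, +1, −1} matrix to the target spectral radius, and a draw whose
           spectral radius is exactly zero is not handled.

```python
import numpy as np

def spectral_radius(W):
    return np.max(np.abs(np.linalg.eigvals(W)))

def sparse_random_reservoir(n, p_zero, rho, rng):
    """Entries 0, +c, -c with probabilities p_zero, (1 - p_zero)/2, (1 - p_zero)/2."""
    q = (1.0 - p_zero) / 2.0
    W = rng.choice([0.0, 1.0, -1.0], size=(n, n), p=[p_zero, q, q])
    return W * (rho / spectral_radius(W))   # the scale factor plays the role of c

def uniform_reservoir(n, rho, rng):
    """U-ESN: weights uniform in [-1, 1], rescaled to spectral radius rho."""
    W = rng.uniform(-1.0, 1.0, size=(n, n))
    return W * (rho / spectral_radius(W))

rng = np.random.default_rng(0)
W_sparse = sparse_random_reservoir(30, p_zero=0.8, rho=0.5, rng=rng)
W_uniform = uniform_reservoir(30, rho=0.5, rng=rng)
```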

             4.2 Design of the Adaptive Bias.
             
           In conventional ESNs, only the output weights are trained, optimizing the 
           projections of the desired response onto the basis functions (echo states). 
           Since the dynamical reservoir is ﬁxed,
           the basis functions are only input dependent. However, since function ap-
           proximation is a problem in the joint space of the input and desired signals,
           a penalty in performance will be incurred. From the linearization analysis
           that shows the crucial importance of the operating point of the PE non-
           linearity in deﬁning the echo state dynamics, we propose to use a single
           external adaptive bias to adjust the effective spectral radius of an ESN. No-
           tice that according to linearization analysis, bias can only reduce the spectral
           radius. The information for adaptation of bias is the MSE in training, which
           modulates the spectral radius of the system with the information derived
           from the approximation error. With this simple mechanism, some informa-
           tion from the input-output joint space is incorporated in the definition of the
           projection space of the ESN. The beauty of this method is that the spectral
           radius can be adjusted by a single parameter that is external to the system
           without changing reservoir weights.
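
           The effect of the bias on the linearized dynamics can be illustrated with
           a small numerical sketch. It assumes tanh PEs, a bias that enters through
           the input weights, zero input, and a randomly generated reservoir; the
           function name and all numerical values are illustrative. For a general
           nonnormal W, scaling rows by slopes below one is not guaranteed to shrink
           the spectral radius, so the snippet only shows the tendency that the
           linearization analysis suggests.

```python
import numpy as np

def effective_spectral_radius(W, w_in, b, n_settle=500):
    """Spectral radius of a tanh ESN linearized at the operating point set by bias b."""
    x = np.zeros(W.shape[0])
    for _ in range(n_settle):            # let the state settle under the constant bias
        x = np.tanh(w_in * b + W @ x)
    slopes = 1.0 - x ** 2                # tanh derivative at each PE's operating point
    J = slopes[:, None] * W              # Jacobian diag(slopes) @ W
    return np.max(np.abs(np.linalg.eigvals(J))), slopes

rng = np.random.default_rng(1)
n = 30
W = rng.choice([0.0, 1.0, -1.0], size=(n, n), p=[0.8, 0.1, 0.1])
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9
w_in = rng.choice([-1.0, 1.0], size=n)

rho_no_bias, slopes_no_bias = effective_spectral_radius(W, w_in, b=0.0)
rho_biased, slopes_biased = effective_spectral_radius(W, w_in, b=3.0)
```

           With zero bias the operating point is the origin, every slope equals one,
           and the effective spectral radius coincides with that of W. A nonzero
           bias pushes the PEs toward saturation, where the slopes are smaller.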

             The training of bias can be easily accomplished. Indeed, since the pa-
           rameter space is only one-dimensional, a simple line search method can be
           efﬁciently employed to optimize the bias. Among different line search al-
           gorithms, we will use a search that uses Fibonacci numbers in the selection
           of points to be evaluated (Wilde, 1964). The Fibonacci search method min-
           imizes the maximum number of evaluations needed to reduce the interval
           of uncertainty to within the prescribed length. In our problem, a bias value
           is picked according to Fibonacci search. For each value of bias, training
           data are applied to the ESN, and the echo states are calculated. Then the
           corresponding optimal output weights and the objective function (MSE)
           are evaluated to pick the next bias value.
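
           A minimal sketch of a Fibonacci line search is given below. The
           objective passed to it is a stand-in: in the setting described above it
           would run the ESN with the candidate bias, solve for the optimal output
           weights, and return the training MSE. The function name, the search
           interval, and the toy quadratic objective are assumptions for
           illustration, and the search presumes a unimodal objective on the
           interval.

```python
def fibonacci_search(objective, a, b, n_evals=25):
    """Minimize a unimodal objective on [a, b] using n_evals >= 3 evaluations."""
    fib = [1, 1]
    for _ in range(2, n_evals + 1):
        fib.append(fib[-1] + fib[-2])
    n = n_evals
    x1 = a + fib[n - 2] / fib[n] * (b - a)
    x2 = a + fib[n - 1] / fib[n] * (b - a)
    f1, f2 = objective(x1), objective(x2)
    for k in range(1, n - 1):
        if f1 > f2:
            # The minimum lies in [x1, b]; the old x2 becomes the new x1.
            a = x1
            x1, f1 = x2, f2
            x2 = a + fib[n - k - 1] / fib[n - k] * (b - a)
            f2 = objective(x2)
        else:
            # The minimum lies in [a, x2]; the old x1 becomes the new x2.
            b = x2
            x2, f2 = x1, f1
            x1 = a + fib[n - k - 2] / fib[n - k] * (b - a)
            f1 = objective(x1)
    return 0.5 * (a + b)

# Toy stand-in for "train the readout with this bias and return the MSE".
def toy_mse(bias):
    return (bias - 2.0) ** 2

best_bias = fibonacci_search(toy_mse, 0.0, 5.0)
```

           Each iteration reuses one of the two previous evaluations, so only one
           new ESN run is needed per step.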
             Alternatively, gradient-based methods can be used to optimize the
           bias because of their simplicity and low computational cost. The system
           update equation with an external bias signal, b, is given by

               <<x(n+1)=f(W_in u(n+1)+W_in b+Wx(n))>>.

           The update equation for b is given by

                <<FORMULA>>

             Here, O is the MSE defined previously. This algorithm may suffer from
           problems similar to those observed in gradient-based training of recurrent
           networks. However, we observed that the performance surface is
           rather simple. Moreover, since the search parameter is one-dimensional,
           the gradient vector can assume only one of the two directions. Hence, im-
           precision in the gradient estimation should affect the speed of convergence
           but normally not change the correct gradient direction.

           5 Experiments

           This section presents a variety of experiments in order to test the validity
           of the ESN design scheme proposed in the previous section.

             5.1 Short-Term Memory Capacity.

             This experiment compares the short-term memory (STM) capacity of ESNs
           with the same spectral radius using
           the framework presented in Jaeger (2002a). Consider an ESN with a sin-
           gle input signal, <<u(n)>>, optimally trained with the desired signal <<u(n−k)>>,
           for a given delay k. Denoting the optimal output signal y_k(n), the k-delay
           STM capacity of a network, MC_k, is defined as the squared correlation coef-
           ficient between <<u(n−k)>> and <<FORMULA>> (Jaeger, 2002a). The STM capacity, MC,
           of the network is defined as <<FORMULA>>. STM capacity measures how accu-
           rately the delayed versions of the input signal are recovered with optimally
           trained output units. Jaeger (2002a) has shown that the memory capacity
           for recalling an independent and identically distributed (i.i.d.) input by an
           N-unit RNN with linear output units is bounded by N.
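
           The definition above can be turned into a short numerical sketch.
           Following Jaeger (2002a), MC is taken here as the sum of MC_k over the
           delays considered. The tanh nonlinearity, the least-squares readout, the
           signal lengths, the washout, and the rescaling of W to a spectral radius
           of exactly 0.9 are assumptions made for illustration, and the function
           names are hypothetical.

```python
import numpy as np

def run_reservoir(W, w_in, u):
    """Echo states of x(n+1) = tanh(w_in * u(n+1) + W x(n)), starting from x = 0."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(u), W.shape[0]))
    for n, u_n in enumerate(u):
        x = np.tanh(w_in * u_n + W @ x)
        states[n] = x
    return states

def memory_capacity(W, w_in, u_train, u_test, max_delay=40, washout=100):
    """Return (MC_k for k = 1..max_delay, their sum); requires washout >= max_delay."""
    def features(u):
        # Echo states plus a direct input-to-output connection.
        return np.column_stack([run_reservoir(W, w_in, u), u])[washout:]
    phi_train, phi_test = features(u_train), features(u_test)
    mc_k = np.empty(max_delay)
    for k in range(1, max_delay + 1):
        d_train = u_train[washout - k:len(u_train) - k]   # desired signal u(n - k)
        d_test = u_test[washout - k:len(u_test) - k]
        w_out, *_ = np.linalg.lstsq(phi_train, d_train, rcond=None)
        y_test = phi_test @ w_out
        mc_k[k - 1] = np.corrcoef(d_test, y_test)[0, 1] ** 2
    return mc_k, mc_k.sum()

rng = np.random.default_rng(0)
n_pe = 20
W = rng.choice([0.0, 0.47, -0.47], size=(n_pe, n_pe), p=[0.8, 0.1, 0.1])
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))       # assumed exact rescaling
w_in = rng.choice([-0.1, 0.1], size=n_pe)
u_train = rng.uniform(-0.5, 0.5, size=2000)
u_test = rng.uniform(-0.5, 0.5, size=1000)
mc_k, mc_total = memory_capacity(W, w_in, u_train, u_test)
```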
             We use ESNs with 20 PEs and a single input unit. ESNs are driven
           by an i.i.d. random input signal, <<u(n)>>, that is uniformly distributed over
           [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions
           of the input, <<u(n−1),...,u(n−40)>>. We used four different ESNs: R-ESN,
           U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN
           used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47,
           −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a
           sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of
           U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spec-
           tral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed
           with uniform poles. BASE-ESN has the same recurrent weight matrix as
           ASE-ESN and an adaptive bias at its input. In each ESN, the input weights
           are set to 0.1 or −0.1 with equal probability, and direct connections from the
           input to the output are allowed, whereas W_back is set to 0 (Jaeger, 2002a).
           The echo states are calculated using equation 2.1 for 200 samples of the
           input signal, and the ﬁrst 100 samples corresponding to initial transient
           are eliminated. Then the output weight matrix is calculated using equation
           2.4. For the BASE-ESN, the bias is trained for each task. All networks are
           run with a test input signal, and the corresponding output and MC_k are
           calculated. Figure 6 shows the k-delay STM capacity (averaged over 100
           trials) of each ESN for delays 1,...,40 for the test signal. The STM capac-
           ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70,
           and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-
           ESN and BASE-ESN) have MCs that are much larger than that of the randomly
           generated ESN given in Jaeger (2002a), in spite of all having the same spec-
           tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical
           maximum value of N = 20. A closer look at the figure shows that R-ESN per-
           forms slightly better than ASE-ESN for delays less than 9. In fact, for small
           k, large ASE degrades the performance because the tasks do not need long
           memory depth. However, the drawback of high ASE for small k is recov-
           ered in BASE-ESN, which reduces the ASE to the appropriate level required
           for the task. Overall, the addition of the bias to the ASE-ESN increases the
           STM capacity from 16.70 to 16.90. On the other hand, U-ESN has only slightly
           better STM capacity than R-ESN, even though its weights take many more
           distinct values than the three weight values used in R-ESN. It is also
           significant to note that the MC will be very poor for an ESN with smaller
           spectral radius even with an adaptive bias, since the problem requires large
           ASE and bias can only reduce ASE. This experiment demonstrates the

                                       <<FIGURE>>

           Figure 6: The k-delay STM capacity of each ESN for delays 1,...,40 computed
           using the test signal. The results are averaged over 100 different realizations of
           each ESN type with the specifications given in the text for different W and W_in
           matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are
           13.09, 13.55, 16.70, and 16.90, respectively.


           suitability of maximizing ASE in tasks that require a substantial memory
           length.

             5.2 Binary Parity Check.
             
             The effect of the adaptive bias was marginal
           in the previous experiment since the nature of the problem required large
           ASE values. However, there are tasks in which the optimal solutions re-
           quire smaller ASE values and smaller spectral radius. Those are the tasks
           where the adaptive bias becomes a crucial design parameter in our design
           methodology.
             Consider an ESN with 100 internal units and a single input unit. The ESN is
           driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal
           is to train an ESN to generate the m-bit parity corresponding to the last m bits
           received, where m is 3,...,8. As in the previous experiment, we used
           the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly
           connected ESN where the entries of the W matrix are set to 0, 0.06, −0.06
           with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse
           connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN
           are designed with a spectral radius of 0.9. The input weights are set to 1 or −1
           with equal probability, and direct connections from the input to the output
           are allowed, whereas W_back is set to 0. The echo states are calculated using
           equation 2.1 for 1000 samples of the input signal, and the first 100 samples
           corresponding to the initial transient are eliminated. Then the output weight

                                         <<FIGURE>>

           Figure 7: The number of wrong decisions made by each ESN for m = 3,...,8
           in the binary parity check problem. The results are averaged over 100 differ-
           ent realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and W_in
           matrices with the specifications given in the text. The total numbers of wrong
           decisions for m = 3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and
           699.

           matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias
           is trained for each task. The binary decision is made by a threshold detector
           that compares the output of the ESN to 0.5. Figure 7 shows the number of
           wrong decisions (averaged over 100 different realizations) made by each
           ESN for <<m=3,...,8>>.
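
           The desired signal and the error count for this task can be sketched as
           follows. The function names are hypothetical, the first m − 1 samples
           (for which fewer than m bits have been received) are marked as undefined
           with −1, and the example output values are made up for illustration.

```python
import numpy as np

def parity_targets(u, m):
    """m-bit parity of the last m input bits; -1 marks samples with too little history."""
    u = np.asarray(u, dtype=int)
    d = np.full(len(u), -1)
    for n in range(m - 1, len(u)):
        d[n] = u[n - m + 1:n + 1].sum() % 2
    return d

def count_wrong_decisions(y, d, threshold=0.5):
    """Threshold the ESN output at 0.5 and count disagreements with the parity target."""
    y = np.asarray(y, dtype=float)
    valid = d >= 0
    decisions = (y[valid] > threshold).astype(int)
    return int(np.sum(decisions != d[valid]))

u = [1, 0, 0, 1, 1, 1]
d = parity_targets(u, m=3)
y = [0.9, 0.9, 0.9, 0.2, 0.1, 0.7]       # hypothetical ESN outputs
n_wrong = count_wrong_decisions(y, d)
```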
             The total numbers of wrong decisions for <<m=3,...,8>> of R-ESN, ASE-
           ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs
           poorly since the nature of the problem requires a short time constant for
           fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the
           R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions.
           BASE-ESN performs a lot better than ASE-ESN and slightly better than
           the R-ESN since the adaptive bias reduces the spectral radius effectively.
           Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN,
           since the task requires access to longer input history, which compromises
           the need for fast response. Indeed, the bias in the BASE-ESN takes effect
           when there are errors (m > 4) and when the task benefits from smaller
           spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and
           2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide
           range of bias values that result in similar MSE values (between 0 and 3). In 
           summary, this experiment clearly demonstrates the power of the bias signal
           to conﬁgure the ESN reservoir according to the mapping task.

             5.3 System Identiﬁcation.
             This section presents a function approxima-
           tion task where the aim is to identify a nonlinear dynamical system. The
           unknown system is deﬁned by the difference equation

               <<y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n))>>,

           where

                <<f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu)>>.

           The input to the system is chosen to be <<sin(2πn/25)>>.
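
           The data for this task can be generated directly from the difference
           equation. The sketch below assumes zero initial conditions,
           y(0) = y(1) = 0, which the text does not specify; the function names and
           the number of samples are illustrative.

```python
import numpy as np

def f(u):
    """Static nonlinearity of the unknown system."""
    return 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) + 0.1 * np.sin(5 * np.pi * u)

def simulate_plant(n_samples):
    """Simulate y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n)) with u(n) = sin(2 pi n / 25)."""
    n = np.arange(n_samples)
    u = np.sin(2 * np.pi * n / 25)
    y = np.zeros(n_samples)              # assumed zero initial conditions
    for t in range(1, n_samples - 1):
        y[t + 1] = 0.3 * y[t] + 0.6 * y[t - 1] + f(u[t])
    return u, y

u, y = simulate_plant(500)
```

           The pair (u, y) then serves as the input and desired signal for training
           the ESN readout.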
             We used three different ESNs (R-ESN, ASE-ESN, and BASE-ESN) with
           30 internal units and a single input unit. The W matrix of each ESN is scaled
           such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN
           where the entries of the W matrix are set to 0, 0.35, −0.35 with probabilities 0.8,
           0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with
           equal probability, and direct connections from the input to the output are
           allowed, whereas W_back is set to 0. The optimal output weights are calculated
           using equation 2.4. The MSE values (averaged over 100 realizations) for R-
           ESN and ASE-ESN are 1.23x10^−5 and 1.83x10^−6, respectively. The addition
           of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10^−6
           to 3.27x10^−9.

           6 Discussion

           The great appeal of echo state networks (ESNs) and liquid state machines
           (LSMs) is their ability to construct arbitrary mappings of signals with rich
           and time-varying temporal structures without requiring adaptation of the
           free parameters of the recurrent layer. The echo state condition allows the
           recurrent connections to be ﬁxed with training limited to the linear output
           layer. However, the literature did not elucidate how to properly choose
           the recurrent parameters for system identiﬁcation applications. Here, we
           provide an alternate framework that interprets the echo states as a set
           of functional bases formed by ﬁxed nonlinear combinations of the input.
           The linear readout at the output stage simply computes the projection of
           the desired output space onto this representation space. We further in-
           troduce an information-theoretic criterion, ASE, to better understand and
           evaluate the capability of a given ESN to construct such a representation
           layer. The average entropy of the distribution of the echo states quantiﬁes
           the volume spanned by the bases. As such, this volume should be the largest
           to achieve the smallest correlation among the bases and be able to cope with
           arbitrary mappings. However, not all function approximation problems re-
           quire the same memory depth, which is coupled to the spectral radius. The
           effective spectral radius of an ESN can be optimized for the given problem
           with the help of an external bias signal that is adapted using the joint input-
           output space information. The interesting property of this method when
           applied to ESN built from sigmoidal nonlinearities is that it allows the ﬁne
           tuning of the system dynamics for a given problem with a single external
           adaptive bias input and without changing internal system parameters. In
           our opinion, the combination of the largest possible ASE and the adapta-
           tion of the spectral radius by the bias produces the most parsimonious pole
           location of the linearized ESN when no knowledge about the mapping is
           available to optimally locate the basis functionals. Moreover, the bias can be
           easily trained with either a line search method or a gradient-based method
           since it is one-dimensional. We have illustrated experimentally that the de-
           sign of the ESN using the maximization of ASE with the adaptation of the
           spectral radius by the bias has provided consistently better performance
           across tasks that require different memory depths. This means that this
           two-parameter design methodology is preferred to the spectral radius
           criterion proposed by Jaeger, and it is still easily incorporated in the ESN
           design.

             Experiments demonstrate that the ASE for ESN with uniform linearized
           poles is maximized when the spectral radius of the recurrent weight matrix
           approaches one (instability). It is interesting to relate this observation with
           the computational properties found in dynamical systems “at the edge of
           chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchﬁeld, 1993;
           Bertschinger & Natschläger, 2004). Langton stated that when cellular au-
           tomata rules are evolved to perform a complex computation, evolution will
           tend to select rules with “critical” parameter values, which correlate with
           a phase transition between ordered and chaotic regimes. Recently, similar
           conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004).
           Langton’s interpretation of edge of chaos was questioned by Mitchell et al.
           (1993). Here, we provide a system-theoretic view and explain the computa-
           tional behavior with the diversity of dynamics achieved with linearizations
           that have poles close to the unit circle. According to our results, the spectral
           radius of the optimal ESN in function approximation is problem dependent,
           and in general it is impossible to forecast the computational performance
           as the system approaches instability (the spectral radius of the recurrent
           weight matrix approaches one). However, allowing the system to modu-
           late the spectral radius by either the output or internal biasing may allow
           a system close to instability to solve various problems requiring different
           spectral radii.

             Our emphasis here is mostly on ESNs without output feedback connec-
           tions. However, the proposed design methodology can also be applied to
           ESNs with output feedback. Both feedforward and feedback connections
           contribute to specify the bases to create the projection space. At the same
           time, there are applications where the output feedback contributes to the
           system dynamics in a different fashion. For example, it has been shown that
           a ﬁxed weight (fully trained) RNN with output feedback can implement a
           family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992).
           In meta-learning, the role of output feedback in the network is to bias the
           system to different regions of dynamics, providing multiple input-output
           mappings required (Santiago & Lendaris, 2004). However, results could not
           be replicated with ESNs (Prokhorov, 2005). We believe that more work has
           to be done on output feedback in the context of ESNs but also suspect that
           the echo state condition may be a restriction on the system dynamics for
           this type of problem.

             There are many interesting issues to be researched in this exciting new
           area. Besides an evaluation tool, ASE may also be utilized to train the ESN’s
           representation layer in an unsupervised fashion. In fact, we can easily adapt,
           with the SIG (stochastic information gradient) described in Erdogmus, Hild,
           and Principe (2003), extra weights linking the outputs of recurrent states to
           maximize output entropy. Output entropy maximization is a well-known
           metric to create independent components (Bell & Sejnowski, 1995), and
           here it means that the echo states will become as independent as possible.
           This would circumvent the linearization of the dynamical system to set the
           recurrent weights and would ﬁne-tune continuously in an unsupervised
           manner the parameters of the ESN among different inputs. However, it
           goes against the idea of a ﬁxed ESN reservoir.

             The reservoir of recurrent PEs can be thought of as a new form of a time-
           to-space mapping. Unlike the delay line that forms an embedding (Takens,
           1981), this mapping may have the advantage of ﬁltering noise and produce
           representations with better SNRs to the peaks of the input, which is very
           appealing for signal processing and seems to be used in biology. However,
           further theoretical work is necessary in order to understand the embedding
           capabilities of ESNs. One of the disadvantages of the ESN correlated basis
           is in the design of the readout. Gradient-based algorithms will be very
           slow to converge (due to the large eigenvalue spread of modes), and even
           if recursive methods are used, their stability may be compromised by the
           condition number of the matrix. However, our recent results incorporating
           an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of
           solving this problem.

             Finally we would like to brieﬂy comment on the implications of these
           models to neurobiology and computational neuroscience. The work by
           Pouget and Sejnowski (1997) has shown that the available physiological
           data are consistent with the hypothesis that the response of a single neuron
           in the parietal cortex serves as a basis function generated by the sensory
           input in a nonlinear fashion. In other words, the neurons transform the
           sensory input into a format (representation space) such that the subsequent
           computation is simpliﬁed. Then, whenever a motor command (output of
           the biological system) needs to be generated, this simple computation to
           read out the neuronal activity is done. There is an intriguing similarity
           between the interpretation of the neuronal activity by Pouget and Sejnowski
           and our interpretation of echo states in ESN. We believe that similar ideas
           can be applied to improve the design of microcircuit implementations of
           LSMs. First, the framework of functional space interpretation (bases and
           projections) is also applicable to microcircuits. Second, the ASE measure
           may be directly utilized for LSM states because the states are normally low-
           pass-ﬁltered before the readout. However, the control of ASE by changing
           the liquid dynamics is unclear. Perhaps global control of thresholds or bias
           current will be able to accomplish bias control as in ESN with sigmoid
           PEs.


           Acknowledgments

           This work was partially supported by NSF ECS-0422718, NSF CNS-0540304,
           and ONR N00014-1-1-0405.


           References

           Amari, S.-I. (1990).Differential-geometrical methods in statistics.NewYork:Springer.
           Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categor-
             ical perception, and probability learning: Some applications of a neural model.
             Psychological Review, 84, 413–451.
           Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach
             to blind separation and blind deconvolution.Neural Computation, 7(6), 1129–
             1159.
           Bertschinger,N.,&Natschlager,T.(2004).Real-timecomputationattheedgeofchaos¨
             in recurrent neural networks.Neural Computation, 16(7), 1413–1436.
           Cox,R.T.(1946).Probability,frequency,andreasonableexpectation.AmericanJournal
             of Physics, 14(1), 1–13.
           de Vries, B. (1991).Temporal processing with neural networks—the development of the
             gamma model. Unpublished doctoral dissertation, University of Florida.
           Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural
             network for system identiﬁcation and control.IEEE Proceedings of Control Theory
             and Applications, 142(4), 307–314.
           Elman, J. L. (1990). Finding structure in time.Cognitive Science, 14(2), 179–211.
           Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation:
             Stochastic information gradient.Signal Processing Letters, 10(8), 242–245.
           Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for
             adaptive system training.IEEE Transactions on Neural Networks, 13(5), 1035–1044.
           Feldkamp,L.A.,Prokhorov,D.V.,Eagen,C.,&Yuan,F.(1998).Enhancedmultistream
             Kalman ﬁlter training for recurrent networks. In J. Suykens, & J. Vandewalle
             (Eds.),Nonlinear modeling: Advanced black-box techniques(pp. 29–53). Dordrecht,
             Netherlands: Kluwer.           136 M. Ozturk, D. Xu, and J. Pr´ıncipe


           Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle
             River, NJ: Prentice Hall.
           Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice
             Hall.
           Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
             9(8), 1735–1780.
           Hopfield, J. (1984). Neurons with graded response have collective computational
             properties like those of two-state neurons. Proceedings of the National Academy of
             Sciences, 81, 3088–3092.
           Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer
             Mathematics, 5(1), 189–203.
           Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural
             networks (Tech. Rep. No. 148). Bremen: German National Research Center for
             Information Technology.
           Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152).
             Bremen: German National Research Center for Information Technology.
           Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL,
             EKF and the “echo state network” approach (Tech. Rep. No. 159). Bremen: German
             National Research Center for Information Technology.
           Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
             and saving energy in wireless communication. Science, 304(5667), 78–80.
           Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems.
             Proceedings of the Royal Society of London, A 196, 453–461.
           Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
           Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit
             Theory, 1(3), 29–39.
           Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks
             for adaptive communication channel equalization. IEEE Transactions on Neural
             Networks, 5(2), 267–278.
           Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks.
             IEEE Transactions on Neural Networks, 6(5), 1000–1004.
           Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation
             theory (2nd ed.). New York: Springer-Verlag.
           Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
           Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the
             computational power and generalization capability of neural microcircuits. In
             L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing
             systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press.
           Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without
             stable states: A new framework for neural computation based on perturbations.
             Neural Computation, 14(11), 2531–2560.
           Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos:
             Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
           Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J.
             Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301).
             Singapore: World Scientific.


           Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex
             using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
           Principe, J. (2001). Dynamic neural networks and optimal signal processing. In
             Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28).
             Boca Raton, FL: CRC Press.
           Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter: A new
             class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal
             Processing, 41(2), 649–656.
           Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin
             (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
           Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of
             International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
           Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed
             weights in recurrent neural networks: An overview. In Proc. of International Joint
             Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
           Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical
             systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural
             Networks, 5(2), 279–297.
           Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods
             applied to on-vehicle idle speed control. Proceedings of IEEE, 84(10), 1407–1420.
           Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev,
             M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with
             echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and
             Signal Processing. Philadelphia.
           Renyi, A. (1970). Probability theory. New York: Elsevier.
           Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis.
             Unpublished doctoral dissertation, University of Florida.
           Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction
             networks: Reformulating fixed weight neural networks. In Proc. of International Joint
             Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
           Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in
             multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
           Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
             Journal, 27, 623–656.
           Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished
             doctoral dissertation, Rutgers University.
           Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied
             Mathematics Letters, 4(6), 77–80.
           Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
             Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing
             systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
           Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S.
             Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
           Thogula, R. (2003). Information theoretic self-organization of multiple agents.
             Unpublished master’s thesis, University of Florida.
           Werbos, P. (1990). Backpropagation through time: What it does and how to do it.
             Proceedings of IEEE, 78(10), 1550–1560.


           Werbos, P. (1992). Neurocontrol and supervised learning: An overview and
             evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New
             York: Van Nostrand Reinhold.
           Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
           Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running
             fully recurrent neural networks. Neural Computation, 1, 270–280.
<|endoftext|>


<|startoftext|>
                         Bayesian Compression for Deep Learning

                     Christos Louizos             Karen Ullrich                Max Welling
                     University of Amsterdam      University of Amsterdam      University of Amsterdam
                     TNO Intelligent Imaging                                   CIFAR
                     c.louizos@uva.nl             k.ullrich@uva.nl             m.welling@uva.nl


                                             Abstract

                       Compression and computational efﬁciency in deep learning have become a problem
                       of great signiﬁcance. In this work, we argue that the most principled and effective
                       way to attack this problem is by adopting a Bayesian point of view, where through
                       sparsity inducing priors we prune large parts of the network. We introduce two
                       novelties in this paper: 1) we use hierarchical priors to prune nodes instead of
                       individual weights, and 2) we use the posterior uncertainties to determine the
                       optimal ﬁxed point precision to encode the weights. Both factors signiﬁcantly
                       contribute to achieving the state of the art in terms of compression rates, while
                       still staying competitive with methods designed to optimize for speed or energy
                       efﬁciency.


                 1 Introduction

                 While deep neural networks have become extremely successful in a wide range of applications,
                 often exceeding human performance, they remain difficult to apply in many real-world scenarios. For
                 instance, making billions of predictions per day comes with substantial energy costs given the energy
                 consumption of common Graphics Processing Units (GPUs). Also, real-time predictions are often
                 about a factor of 100 away in terms of speed from what deep NNs can deliver, and sending NNs with
                 millions of parameters through band-limited channels is still impractical. As a result, running them on
                 hardware-limited devices such as smartphones, robots or cars requires substantial improvements on
                 all of these issues. For all those reasons, compression and efficiency have become a topic of interest
                 in the deep learning community.
                 While all of these issues are certainly related, compression and performance optimizing procedures
                 might not always be aligned. As an illustration, consider the convolutional layers of Alexnet, which
                 account for only 4% of the parameters but 91% of the computation [68]. Compressing these layers
                 will not contribute much to the overall memory footprint.
                 There is a variety of approaches to address these problem settings. However, most methods have
                 the common strategy of reducing both the neural network structure and the effective ﬁxed point
                 precision for each weight. A justiﬁcation for the former is the ﬁnding that NNs suffer from signiﬁcant
                 parameter redundancy [14]. Methods in this line of thought are network pruning, where unnecessary
                 connections are being removed [40,24,21], or student-teacher learning where a large network is
                 used to train a signiﬁcantly smaller network [5, 27].
                 From a Bayesian perspective, network pruning and reducing the bit precision of the weights are aligned
                 with achieving high accuracy, because Bayesian methods search for the optimal model structure
                 (which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters
                 through the bits-back argument [28] (which leads to removing insignificant bits). This relation is
                 made explicit in the MDL principle [20], which is known to be related to Bayesian inference.

                 In this paper we will use the variational Bayesian approximation for Bayesian inference which has
                 also been explicitly interpreted in terms of model compression [28]. By employing sparsity inducing
                 priors for hidden units (and not individual weights) we can prune neurons including all their ingoing
                 and outgoing weights. This avoids more complicated and inefﬁcient coding schemes needed for
                 pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the
                 variational posterior uncertainty to assess which bits are signiﬁcant and remove the ones which
                 ﬂuctuate too much under approximate posterior sampling. From this we derive the optimal ﬁxed
                 point precision per layer, which is still practical on chip.

                 2 Variational Bayes and Minimum Description Length

                 A fundamental idea in information theory is the minimum description length (MDL) principle [20].
                 It relates to compression directly in that it defines the best hypothesis to be the one that communicates
                 the sum of the model (complexity cost <<\mathcal{L}^C>>) and the data misfit (error cost <<\mathcal{L}^E>>) with the minimum
                 number of bits [59, 60]. It is well understood that variational inference can be reinterpreted from an
                 MDL point of view [56, 72, 28, 30, 19]. More specifically, assume that we are presented with a dataset
                 D that consists of N input-output pairs <<\{(x_1, y_1), \dots, (x_N, y_N)\}>>. Let <<p(D|w) = \prod_{n=1}^{N} p(y_n|x_n, w)>>
                 be a parametric model, e.g. a deep neural network, that maps inputs x to their corresponding outputs
                 y using parameters w governed by a prior distribution <<p(w)>>. In this scenario, we wish to approximate
                 the intractable posterior distribution <<p(w|D) = p(D|w)p(w)/p(D)>> with a fixed-form approximate
                 posterior <<q_\phi(w)>> by optimizing the variational parameters <<\phi>> according to:

                             <<\mathcal{L}(\phi) = \underbrace{\mathbb{E}_{q_\phi(w)}[\log p(D|w)]}_{\mathcal{L}^E} + \underbrace{\mathbb{E}_{q_\phi(w)}[\log p(w)] + H(q_\phi(w))}_{\mathcal{L}^C}>>;        (1)

                 where <<H(\cdot)>> denotes the entropy and <<\mathcal{L}(\phi)>> is known as the evidence lower bound (ELBO) or negative
                 variational free energy. As indicated in eq. 1, <<\mathcal{L}(\phi)>> naturally decomposes into a minimum cost for
                 communicating the targets <<\{y_n\}_{n=1}^{N}>> under the assumption that the sender and receiver agreed on a
                 prior <<p(w)>> and that the receiver knows the inputs <<\{x_n\}_{n=1}^{N}>> and the form of the parametric model.
                 By using sparsity inducing priors for groups of weights that feed into a neuron, the Bayesian
                 mechanism will start pruning hidden units that are not strictly necessary for prediction, thus achieving
                 compression. But there is also a second mechanism by which Bayes can help us compress. By
                 explicitly entertaining noisy weight encodings through <<q_\phi(w)>> we can benefit from the bits-back
                 argument [28, 30] due to the entropy term; this is in contrast to infinitely precise weights that lead to
                 <<H(\delta(w)) = -\infty>> (footnote 2). Nevertheless, in practice the data misfit term <<\mathcal{L}^E>> is intractable for neural network
                 models under a noisy weight encoding, so as a solution Monte Carlo integration is usually employed.
                 Continuous <<q_\phi(w)>> allow for the reparametrization trick [36, 58]. Here, we replace sampling from
                 <<q_\phi(w)>> by a deterministic function of the variational parameters <<\phi>> and random samples from some
                 noise variables <<\epsilon>>:

                            <<\mathcal{L}(\phi) = \mathbb{E}_{p(\epsilon)}[\log p(D|f(\phi, \epsilon))] + \mathbb{E}_{q_\phi(w)}[\log p(w)] + H(q_\phi(w))>>;        (2)

                 where <<w = f(\phi, \epsilon)>>. By applying this trick, we obtain unbiased stochastic gradients of the ELBO
                 with respect to the variational parameters <<\phi>>, thus resulting in a standard optimization problem that is
                 fit for stochastic gradient ascent. The efficiency of the gradient estimator resulting from eq. 2 can be
                 further improved for neural networks by utilizing local reparametrizations [37] (which we will use in
                 our experiments); they provide variance reduction in an efficient way by locally marginalizing the
                 weights at each layer and instead sampling the distribution of the pre-activations.
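
The reparametrization step can be illustrated with a small numerical sketch. It assumes a fully factorized Gaussian posterior with variational parameters (mu, log_sigma), so that f maps unit Gaussian noise to mu + exp(log_sigma) * eps. The toy regression likelihood, the standard normal prior and all function names are assumptions made for illustration; they are not the models used in our experiments.

```python
import math
import random

def sample_weights(mu, log_sigma, rng):
    """Reparametrized draw w = f(phi, eps) = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]

def gaussian_entropy(log_sigma):
    """Entropy of a factorized Gaussian: sum_i 0.5 * log(2 * pi * e * sigma_i^2)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e) + ls for ls in log_sigma)

def log_likelihood(w, data):
    """Toy model (assumption): y ~ N(w[0] * x, 1) for each (x, y) pair."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (y - w[0] * x) ** 2 for x, y in data)

def log_prior(w):
    """Standard normal prior over the weights, chosen only for illustration."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * wi ** 2 for wi in w)

def elbo_estimate(mu, log_sigma, data, n_samples=64, seed=0):
    """Monte Carlo estimate of E_q[log p(D|w)] + E_q[log p(w)] + H(q)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = sample_weights(mu, log_sigma, rng)
        total += log_likelihood(w, data) + log_prior(w)
    return total / n_samples + gaussian_entropy(log_sigma)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
# A posterior centred near the true slope scores a higher bound than one centred at zero.
print(elbo_estimate([2.0], [-2.0], data), elbo_estimate([0.0], [-2.0], data))
```

For a fixed seed the noise draws are identical across calls, so the estimate is a deterministic function of (mu, log_sigma). This is the property that lets an automatic differentiation framework propagate gradients through the sampled weights.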

                 3 Related Work

                 One of the earliest ideas and most direct approaches for tackling efficiency is pruning. Originally
                 introduced by [40], pruning has recently been demonstrated to be applicable to modern architectures
                 [25, 21]. It has been demonstrated that up to 99.5% of the parameters
                 can be pruned in common architectures. There have been quite a few encouraging results obtained
                 by (empirical) Bayesian approaches that employ weight pruning [19, 7, 52, 70, 51]. Nevertheless,
                 weight pruning is in general inefficient for compression since the matrix format of the weights is not
                 taken into consideration, therefore the Compressed Sparse Column (CSC) format has to be employed.
                 Moreover, note that in conventional CNNs most flops are used by the convolution operation. Inspired
                 by this observation, several authors proposed pruning schemes that take these considerations into
                 account [73, 74] or even go as far as efficiency-aware architectures to begin with [32, 15, 31]. From
                 the Bayesian viewpoint, similar pruning schemes have been explored in [47, 53, 39, 34].

                    2 In practice this term is a large constant determined by the weight precision.

                 Given an optimal architecture, NNs can further be compressed by quantization. More precisely, there
                 are two common techniques. First, the set of accessible weights can be reduced drastically. As an
                 extreme example, [13, 48, 57, 76] and [11] trained NNs to use only binary or ternary weights with
                 floating point gradients. This approach, however, needs significantly more parameters than
                 its ordinary counterparts. Work by [18] explores various techniques beyond binary quantization:
                 k-means quantization, product quantization and residual quantization. Later studies extend this set to
                 optimal fixed point [44] and hashing quantization [10]. [25] apply k-means clustering and subsequent
                 center training. From a practical point of view, however, all these are fairly impractical at
                 test time. For the computation of each feature map in a net, the original weight matrix must be
                 reconstructed from the indexes in the matrix and a codebook that contains all the original weights.
                 This is an expensive operation, which is why some studies propose a different approach than set
                 quantization. Second, precision quantization simply reduces the bit size per weight. This has a great advantage
                 over set quantization at inference time since feature maps can simply be computed with lower-precision
                 weights. Several studies show that this has little to no effect on network accuracy when using 16-bit
                 weights [49, 22, 12, 71, 9]. Somewhat orthogonal to the above discussion but certainly relevant are
                 approaches that customize the implementation of CNNs for hardware-limited devices [31, 4, 62].
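
The practical difference between the two quantization families at inference time can be sketched as follows. This is a minimal illustration with hypothetical helper names and a made-up three-entry codebook; it does not reproduce any of the cited methods.

```python
def reconstruct_from_codebook(indexes, codebook):
    """Set quantization: rebuild the dense weight matrix from per-weight codebook indexes."""
    return [[codebook[i] for i in row] for row in indexes]

def quantize_fixed_point(weights, frac_bits):
    """Precision quantization: round each weight to a multiple of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return [[round(w / step) * step for w in row] for row in weights]

codebook = [-0.5, 0.0, 0.25]        # hypothetical shared values, e.g. cluster centers
indexes = [[0, 1, 2], [2, 2, 1]]    # one small integer stored per weight

# Set quantization needs this lookup before the weights can be used in a layer.
dense = reconstruct_from_codebook(indexes, codebook)

# Precision quantization yields weights that can be used directly at lower precision.
low_precision = quantize_fixed_point([[0.30, -0.52], [0.11, 0.74]], frac_bits=3)

print(dense)
print(low_precision)
```

The lookup in the first function is the reconstruction step that makes set quantization costly at test time, whereas the second function produces weights that need no further decoding.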


                 4 Bayesian compression with scale mixtures of normals


                 Consider the following prior over a parameter w where its scale z is governed by a distribution <<p(z)>>:


                                       <<z \sim p(z); \quad w \sim \mathcal{N}(w; 0, z^2)>>;                    (3)


                 with <<z^2>> serving as the variance of the zero-mean normal distribution over w. By treating the scales
                 of w as random variables we can recover marginal prior distributions over the parameters that have
                 heavier tails and more mass at zero; this subsequently biases the posterior distribution over w to
                 be sparse. This family of distributions is known as scale mixtures of normals [6, 2] and it is quite
                 general, as a lot of well-known sparsity inducing distributions are special cases.
                 One example of the aforementioned framework is the spike-and-slab distribution [50], the gold
                 standard for sparse Bayesian inference. Under the spike-and-slab, the mixing density of the scales is a
                 Bernoulli distribution, thus the marginal <<p(w)>> has a delta “spike” at zero and a continuous “slab” over
                 the real line. Unfortunately, this prior leads to computationally expensive inference since we have
                 to explore a space of <<2^M>> models, where M is the number of model parameters. Dropout [29, 67],
                 one of the most popular regularization techniques for neural networks, can be interpreted as positing a
                 spike-and-slab distribution over the weights where the variance of the “slab” is zero [17, 45]. Another
                 example is the Laplace distribution, which arises by considering <<p(z^2) = \mathrm{Exp}(\lambda)>>. The mode of
                 the posterior distribution under a Laplace prior is known as the Lasso [69] estimator and has been
                 previously used for sparsifying neural networks in [73, 61]. While computationally simple, the
                 Lasso estimator is prone to “shrinking” large signals [8] and only provides point estimates of
                 the parameters. As a result it does not provide uncertainty estimates, it can potentially overfit and,
                 according to the bits-back argument, it is inefficient for compression.
                 For these reasons, in this paper we will tackle the problem of compression and efficiency in neural
                 networks by adopting a Bayesian treatment and inferring an approximate posterior distribution over
                 the parameters under a scale mixture prior. We will consider two choices for the prior over the scales
                 <<p(z)>>: the hyperparameter-free log-uniform prior [16, 37] and the half-Cauchy prior, which results in
                 a horseshoe [8] distribution. Both of these distributions correspond to a continuous relaxation of the
                 spike-and-slab prior and we provide a brief discussion of their shrinkage properties in Appendix C.
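
As a quick numerical illustration of why scale mixtures of normals have heavier tails than a single Gaussian, the sketch below draws from the hierarchy of eq. 3 with an exponential distribution over the variance (the Laplace case) and compares the kurtosis with that of a plain Gaussian. The rate, the sample size and the function names are arbitrary choices made for illustration.

```python
import math
import random

def sample_scale_mixture(n, rate, seed=0):
    """Draw w ~ N(0, z^2) with z^2 ~ Exponential(rate); marginally w is Laplace."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, math.sqrt(rng.expovariate(rate))) for _ in range(n)]

def kurtosis(xs):
    """Fourth standardized moment; equals 3 for a Gaussian and 6 for a Laplace."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

rng = random.Random(1)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(50000)]
mixture = sample_scale_mixture(50000, rate=1.0)

# The mixture shows a clearly larger kurtosis, i.e. heavier tails and a sharper peak at zero.
print(kurtosis(gaussian), kurtosis(mixture))
```

Other mixing densities for the scale give different marginals; only the exponential case is checked here.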

                 4.1 Reparametrizing variational dropout for group sparsity

                 One potential choice for <<p(z)>> is the improper log-uniform prior [37], <<p(z) \propto |z|^{-1}>>. It turns out that
                 we can recover the log-uniform prior over the weights w if we marginalize over the scales z:

                                              <<p(w) \propto \int \frac{1}{|z|} \mathcal{N}(w|0, z^2)\, dz = \frac{1}{|w|}>>.                (4)

                 This alternative parametrization of the log-uniform prior is known in the statistics literature as the
                 normal-Jeffreys prior and was introduced by [16]. This formulation allows us to “couple” the
                 scales of weights that belong to the same group (e.g. neuron or feature map), by simply sharing the
                 corresponding scale variable z in the joint prior (footnote 3):

                                              <<p(W, z) \propto \prod_{i}^{A} \frac{1}{|z_i|} \prod_{i,j}^{A,B} \mathcal{N}(w_{ij}|0, z_i^2)>>;                  (5)
   
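                 where W is the weight matrix of a fully connected neural network layer with A being the
                 dimensionality of the input and B the dimensionality of the output. Now consider performing variational
                 inference with a joint approximate posterior parametrized as follows:

                                             <<q_\phi(W, z) = \prod_{i=1}^{A} \mathcal{N}(z_i|\mu_{z_i}, \mu_{z_i}^2 \alpha_i) \prod_{i,j}^{A,B} \mathcal{N}(w_{ij}|z_i \mu_{ij}, z_i^2 \sigma_{ij}^2)>>;                  (6)

                 where <<\alpha_i>> is the dropout rate [67, 37, 51] of the given group. As explained in [37, 51], the multiplicative
                 parametrization of the approximate posterior over z suffers from high variance gradients; therefore
                 we will follow [51] and re-parametrize it in terms of <<\sigma_{z_i}^2 = \mu_{z_i}^2 \alpha_i>>, hence optimize w.r.t. <<\sigma_{z_i}^2>>.
                 The lower bound under this prior and approximate posterior becomes:

                                              <<\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z) q_\phi(W|z)}[\log p(D|W)] - \mathbb{E}_{q_\phi(z)}[KL(q_\phi(W|z)\,||\,p(W|z))] - KL(q_\phi(z)\,||\,p(z))>>.                    (7)

                 Under this particular variational posterior parametrization the negative KL-divergence from the
                 conditional prior <<p(W|z)>> to the approximate posterior <<q_\phi(W|z)>> is independent of z:

                              <<KL(q_\phi(W|z)\,||\,p(W|z)) = \frac{1}{2} \sum_{i,j}^{A,B} \left( \log \frac{z_i^2}{z_i^2 \sigma_{ij}^2} + \frac{z_i^2 \sigma_{ij}^2}{z_i^2} + \frac{z_i^2 \mu_{ij}^2}{z_i^2} - 1 \right)>>.       (8)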

                 This independence can be better understood if we consider a non-centered parametrization of the
                 prior [55]. More specifically, consider reparametrizing the weights as <<\tilde{w}_{ij} = w_{ij}/z_i>>; this will then result
                 in <<p(W|z)p(z) = p(\tilde{W})p(z)>>, where <<p(\tilde{W}) = \prod_{i,j} \mathcal{N}(\tilde{w}_{ij}|0, 1)>> and <<W = \mathrm{diag}(z)\tilde{W}>>. Now if
                 we perform variational inference under the <<p(\tilde{W})p(z)>> prior with an approximate posterior that has
                 the form of <<q_\phi(\tilde{W}, z) = q_\phi(\tilde{W}) q_\phi(z)>>, with <<q_\phi(\tilde{W}) = \prod_{i,j} \mathcal{N}(\tilde{w}_{ij}|\mu_{ij}, \sigma_{ij}^2)>>, then we see that we
                 arrive at the same expressions for the negative KL-divergence from the prior to the approximate
                 posterior. Finally, the negative KL-divergence from the normal-Jeffreys scale prior <<p(z)>> to the
                 Gaussian variational posterior <<q_\phi(z)>> depends only on the “implied” dropout rate, <<\alpha_i = \sigma_{z_i}^2/\mu_{z_i}^2>>, and
                 takes the following form [51]:

                                               <<-KL(q_\phi(z)\,||\,p(z)) \approx \sum_{i}^{A} \left( k_1 \sigma(k_2 + k_3 \log \alpha_i) - 0.5\, m(-\log \alpha_i) - k_1 \right)>>;                  (9)
                                          
                 where <<\sigma(\cdot)>>, <<m(\cdot)>> are the sigmoid and softplus functions respectively (footnote 4) and <<k_1 = 0.63576>>,
                 <<k_2 = 1.87320>>, <<k_3 = 1.48695>>. We can now prune entire groups of parameters by simply specifying a threshold
                 for the variational dropout rate of the corresponding group, e.g. <<\log \alpha_i = (\log \sigma_{z_i}^2 - \log \mu_{z_i}^2) \geq t>>.
                 It should be mentioned that this prior parametrization readily allows for a more flexible marginal
                 posterior over the weights as we now have a compound distribution, <<q_\phi(W) = \int q_\phi(W|z) q_\phi(z)\, dz>>; this
                 is in contrast to the original parametrization and the Gaussian approximations employed by [37, 51].
                 Furthermore, this approach generalizes the low variance additive parametrization of variational
                 dropout proposed for weight sparsity in [51] to group sparsity (which was left as an open question
                 in [51]) in a principled way.

                    3 Strictly speaking, the result of eq. 4 only holds when each weight has its own scale and not when that scale is
                    shared across multiple weights. Nevertheless, in practice we obtain a prior that behaves in a similar way, i.e. it
                    biases the variational posterior to be sparse.
                    4 <<\sigma(x) = (1 + \exp(-x))^{-1}, \quad m(x) = \log(1 + \exp(x))>>
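
The scale KL approximation of eq. 9 and the thresholding rule can be sketched as follows. This is a minimal sketch, assuming the per-group form of the approximation from [51] as we understand it, namely k1 * sigmoid(k2 + k3 * log_alpha) - 0.5 * softplus(-log_alpha) - k1. The function names and the threshold value 3.0 used in the example are hypothetical.

```python
import math

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants quoted in the text

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def neg_kl_scale(log_alpha):
    """Approximate negative KL contribution of one group (assumed form, see lead-in)."""
    return K1 * sigmoid(K2 + K3 * log_alpha) - 0.5 * softplus(-log_alpha) - K1

def log_dropout_rate(mu_z, sigma_z):
    """Implied log dropout rate: log(sigma_z^2) - log(mu_z^2)."""
    return 2.0 * math.log(sigma_z) - 2.0 * math.log(abs(mu_z))

def prune_mask(mu_z, sigma_z, threshold):
    """Keep group i (mask 1) only while its log dropout rate stays below the threshold."""
    return [1 if log_dropout_rate(m, s) < threshold else 0 for m, s in zip(mu_z, sigma_z)]

# First group: small relative noise, kept. Second group: noise dominates the mean, pruned.
print(prune_mask([1.0, 0.01], [0.1, 1.0], threshold=3.0))
print([round(neg_kl_scale(a), 4) for a in (-5.0, 0.0, 5.0)])
```

The negative KL term approaches zero as the dropout rate grows, so the regularizer favours groups with large log_alpha, which are exactly the ones removed by the mask.
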
                 At test time, in order to have a single feedforward pass we replace the distribution over W at each
                 layer with a single weight matrix, the masked variational posterior mean:

                                                 <<\hat{W} = \mathrm{diag}(m)\, \mathbb{E}_{q(z) q(\tilde{W})}[\mathrm{diag}(z)\tilde{W}] = \mathrm{diag}(m \odot \mu_z) M_W>>;                         (10)

                 where m is a binary mask determined according to the group variational dropout rate and <<M_W>> are
                 the means of <<q_\phi(\tilde{W})>>. We further use the variational posterior marginal variances (footnote 5) for this particular
                 posterior approximation:

                                                <<\mathbb{V}(w_{ij})_{NJ} = \sigma_{z_i}^2 (\sigma_{ij}^2 + \mu_{ij}^2) + \sigma_{ij}^2 \mu_{z_i}^2>>;                           (11)

                 to assess the bit precision of each weight in the weight matrix. More specifically, we employed the
                 mean variance across the weight matrix <<\hat{W}>> to compute the unit round-off necessary to represent the
                 weights. This method gives us the number of significant bits, and by adding 3 exponent bits and 1 sign
                 bit we arrive at the final bit precision for the entire weight matrix <<\hat{W}>> (footnote 6). We provide more details in
                 Appendix B.
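
The test-time quantities above can be sketched in a few lines. The sketch assumes independent Gaussian factors for the scale z_i and the non-centered weight, so that the marginal variance of their product follows from the standard formula for the variance of a product of independent variables. Averaging over every entry of the matrix is our reading of "mean variance", and the conversion from this mean to a bit count (Appendix B) is not reproduced. All function names are hypothetical.

```python
def masked_posterior_mean(mask, mu_z, mean_w):
    """Test-time weights: row i of the mean matrix scaled by mask[i] * mu_z[i]."""
    return [[m * mz * w for w in row] for m, mz, row in zip(mask, mu_z, mean_w)]

def marginal_variance(mu_z, sigma_z, mean_w, sigma_w):
    """Var(z_i * w_ij) for independent Gaussians z_i and w_ij, computed per weight."""
    return [
        [sz ** 2 * (sw ** 2 + mw ** 2) + sw ** 2 * mz ** 2 for mw, sw in zip(row_m, row_s)]
        for mz, sz, row_m, row_s in zip(mu_z, sigma_z, mean_w, sigma_w)
    ]

def mean_variance(var_matrix):
    """Average marginal variance over all entries of the matrix."""
    flat = [v for row in var_matrix for v in row]
    return sum(flat) / len(flat)

mask = [1, 0]                          # second group pruned
mu_z, sigma_z = [0.5, 0.1], [0.1, 0.3]
mean_w = [[2.0, -1.0], [0.3, 0.4]]
sigma_w = [[0.2, 0.1], [0.5, 0.5]]

w_hat = masked_posterior_mean(mask, mu_z, mean_w)
variances = marginal_variance(mu_z, sigma_z, mean_w, sigma_w)
print(w_hat, mean_variance(variances))
```

A larger mean variance indicates that the weights tolerate coarser rounding, which is the intuition behind deriving the bit precision from the posterior uncertainty.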

                 4.2 Group horseshoe with half-Cauchy scale priors

                 Another choice for <<p(z)>> is a proper half-Cauchy distribution: <<C^+(0, s) = 2(s\pi(1 + (z/s)^2))^{-1}>>; it
                 induces a horseshoe prior [8] distribution over the weights, which is a well-known sparsity inducing
                 prior in the statistics literature. More formally, the prior hierarchy over the weights is expressed as
                 (in a non-centered parametrization):

                                                  <<s \sim C^+(0, \tau_0); \quad \tilde{z}_i \sim C^+(0, 1); \quad \tilde{w}_{ij} \sim \mathcal{N}(0, 1); \quad w_{ij} = \tilde{w}_{ij} \tilde{z}_i s>>;                           (12)

                 where <<\tau_0>> is the free parameter that can be tuned for specific desiderata. The idea behind the horseshoe
                 is that of “global-local” shrinkage; the global scale variable s pulls all of the variables towards
                 zero whereas the heavy-tailed local variables <<\tilde{z}_i>> can compensate and allow for some weights to escape.
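                 Instead of directly working with the half-Cauchy priors we will employ a decomposition of the
                 half-Cauchy that relies upon (inverse) gamma distributions [54] as this will allow us to compute
                 the negative KL-divergence from the scale prior <<p(z)>> to an approximate log-normal scale posterior
                 <<q_\phi(z)>> in closed form (the derivation is given in Appendix D). More specifically, we have that the
                 half-Cauchy prior can be expressed in a non-centered parametrization as:

                                                    <<p(\tilde{\beta}) = IG(0.5, 1); \quad p(\tilde{\alpha}) = G(0.5, k^2); \quad z^2 = \tilde{\alpha}\tilde{\beta}>>;                       (13)

                 where <<IG(\cdot, \cdot)>>, <<G(\cdot, \cdot)>> correspond to the inverse Gamma and Gamma distributions in the scale
                 parametrization, and z follows a half-Cauchy distribution with scale k. Therefore we will re-express
                 the whole hierarchy as:

                                                  <<s_b \sim IG(0.5, 1); \quad s_a \sim G(0.5, \tau_0^2); \quad \tilde{\beta}_i \sim IG(0.5, 1); \quad \tilde{\alpha}_i \sim G(0.5, 1); \quad \tilde{w}_{ij} \sim \mathcal{N}(0, 1); \quad w_{ij} = \tilde{w}_{ij} \sqrt{s_a s_b \tilde{\alpha}_i \tilde{\beta}_i}>>;                           (14)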
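
The (inverse) Gamma decomposition of the half-Cauchy can be checked numerically. The sketch below assumes the decomposition stated above in the scale parametrization: a Gamma variable with shape 0.5 and scale k^2 multiplied by an inverse Gamma variable with shape 0.5 and scale 1 gives z^2. It then compares the empirical distribution of z with the analytic half-Cauchy CDF. The function names, the sample size and the value of k are illustrative choices.

```python
import math
import random

def sample_half_cauchy_via_gamma(n, k, seed=0):
    """Draw z = sqrt(a * b) with a ~ Gamma(0.5, scale=k^2) and b ~ InvGamma(0.5, scale=1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a = rng.gammavariate(0.5, k ** 2)       # Gamma with shape 0.5 and scale k^2
        b = 1.0 / rng.gammavariate(0.5, 1.0)    # inverse Gamma as the reciprocal of a Gamma draw
        samples.append(math.sqrt(a * b))
    return samples

def empirical_cdf(samples, x):
    """Fraction of samples that are at most x."""
    return sum(1 for s in samples if s <= x) / len(samples)

def half_cauchy_cdf(x, k):
    """CDF of the half-Cauchy with scale k: (2 / pi) * arctan(x / k)."""
    return 2.0 / math.pi * math.atan(x / k)

z_samples = sample_half_cauchy_via_gamma(20000, k=2.0)
for x in (1.0, 2.0, 5.0):
    # The two columns should agree up to Monte Carlo error.
    print(x, round(empirical_cdf(z_samples, x), 3), round(half_cauchy_cdf(x, 2.0), 3))
```

Working with the two factors separately is what makes log-normal posteriors convenient here, since products of log-normal variables remain log-normal.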

                 It should be mentioned that the improper log-uniform prior is the limiting case of the horseshoe prior
                 when the shapes of the (inverse) Gamma hyperpriors on <<FORMULA>> go to zero [8]. In fact, several well
                 known shrinkage priors can be expressed in this form by altering the shapes of the (inverse) Gamma
                 hyperpriors [3]. For the variational posterior we will employ the following mean ﬁeld approximation:

                                                <<FORMULA>>.

                 where <<LN(\cdot, \cdot)>> is a log-normal distribution. It should be mentioned that a similar form of
                 non-centered variational inference for the horseshoe has also been successfully employed for undirected
                 models in [33]. Notice that we can also apply local reparametrizations [37] when we are sampling
                 <<\sqrt{\tilde{\alpha}_i \tilde{\beta}_i}>> and <<\sqrt{s_a s_b}>> by exploiting properties of the log-normal distribution (footnote 7), thus forming the
                 implied:

                                                    <<FORMULA>>                           (17)

                    6 Notice that the fact that we are using mean-field variational approximations (which we chose for simplicity)
                    can potentially underestimate the variance, thus leading to higher bit precisions for the weights. We leave the
                    exploration of more involved posteriors for future work.
                    
                 As a threshold rule for group pruning we will use the negative log-mode 8 of the local log-normal r.v.
                 <<FORMULA>> , i.e. prune when <<FORMULA>>, with <<FORMULA>>. This ignores <<FORMULA>> and <<FORMULA>>, but nonetheless we found <<FORMULA>> dependencies among the zi elements induced by the common scale
                 that it works well in practice. Similarly with the group normal-Jeffreys prior, we will replace the
                 distribution overWat each layer with the masked variational posterior mean during test time:

                                                       <<FORMULA>>;                        (19)

                 wheremis a binary mask determined according to the aforementioned threshold,MW are the means
                 ofq(W~)and;2 are the means and variances of the local log-normals over <<FORMULA>>. Furthermore,
                 similarly to the group normal-Jeffreys approach, we will use the variational posterior marginal
                 variances:
                                                      <<FORMULA>>;                           (20)

                 to compute the ﬁnal bit precision for the entire weight matrix W.
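
The threshold rule above can be illustrated with a small sketch. It relies on two standard facts about log-normals: the mode of LN(mu, sigma^2) is exp(mu - sigma^2), so its negative log-mode is sigma^2 - mu, and the product of independent log-normals is again log-normal with summed log-means and log-variances. That each local scale is the product of a shared scale and a group-specific scale, that the two are independent under the mean-field posterior, and that a group is pruned when the negative log-mode reaches the threshold are assumptions of this sketch; all function and variable names are hypothetical.

```python
def combine_scales(mu_s, var_s, mu_local, var_local):
    """Log-mean and log-variance of the product of two independent log-normals."""
    return mu_s + mu_local, var_s + var_local


def negative_log_mode(mu, var):
    """The mode of LN(mu, var) is exp(mu - var), so -log(mode) = var - mu."""
    return var - mu


def pruning_mask(mu_s, var_s, mu_locals, var_locals, threshold):
    """Return 1 for groups to keep and 0 for groups to prune.

    Assumption: a group is pruned when the negative log-mode of its
    combined scale is at least `threshold` (i.e. the mode is tiny).
    """
    mask = []
    for mu_local, var_local in zip(mu_locals, var_locals):
        mu_z, var_z = combine_scales(mu_s, var_s, mu_local, var_local)
        mask.append(0 if negative_log_mode(mu_z, var_z) >= threshold else 1)
    return mask


def masked_mean_weights(mean_weights, mask):
    """Zero out the rows (groups) of the posterior-mean weight matrix that are pruned."""
    return [[m * w for w in row] for row, m in zip(mean_weights, mask)]
```

For instance, with a shared scale of log-mean -1 and log-variance 0.1, a group with local parameters (0, 0.1) has negative log-mode 1.2 and is kept at threshold 3, whereas a group with local parameters (-6, 0.5) has negative log-mode 7.6 and is pruned.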

                 5 Experiments

We validated the compression and speed-up capabilities of our models on the well-known architectures
LeNet-300-100 [41] and LeNet-5-Caffe (footnote 9) on MNIST [42] and, similarly to [51], VGG [63] (footnote 10) on
CIFAR 10 [38]. The groups of parameters were constructed by coupling the scale variables for each
filter for the convolutional layers and for each input neuron for the fully connected layers. We provide
the algorithms that describe the forward pass using local reparametrizations for fully connected
and convolutional layers with each of the employed approximate posteriors in Appendix F. For the
horseshoe prior we set the scale $\tau_0$ of the global half-Cauchy prior to a reasonably small value, e.g.
$\tau_0$ = 1e-5. This further increases the prior mass at zero, which is essential for sparse estimation
and compression. We also found that constraining the standard deviations as described in [46] and
“warm-up” [65] help in avoiding bad local optima of the variational objective. Further details about
the experimental setup can be found in Appendix A. Determining the threshold for pruning can
easily be done by manual inspection, as there are usually two well-separated clusters (signal and noise).
We provide a sample visualization in Appendix E.

                 5.1 Architecture learning & bit precisions

We will first demonstrate the group sparsity capabilities of our methods by illustrating the learned
architectures in Table 1, along with the inferred bit precision per layer. As we can observe, our
methods infer significantly smaller architectures for LeNet-300-100 and LeNet-5-Caffe, compared
to Sparse Variational Dropout, Generalized Dropout and Group Lasso. Interestingly, we observe
that for the VGG network almost all of the big 512 feature map layers are drastically reduced to around
10 feature maps whereas the initial layers are mostly kept intact. Furthermore, all of the Bayesian
methods considered require far fewer than the standard 32 bits per layer to represent the weights,
sometimes even allowing for 5 bit precisions.

7 The product of log-normal r.v.s is another log-normal and a power of a log-normal r.v. is another log-normal.
8 Empirically, it slightly better separates the scales compared to the negative log-mean <<FORMULA>>.
9 https://github.com/BVLC/caffe/tree/master/examples/mnist
10 The adapted CIFAR 10 version described at http://torch.ch/blog/2015/07/30/cifar.html.

                 Table 1: Learned architectures with Sparse VD [51], Generalized Dropout (GD) [66] and Group
                 Lasso (GL) [73]. Bayesian Compression (BC) with group normal-Jeffreys (BC-GNJ) and group
horseshoe (BC-GHS) priors correspond to the proposed models. We show the number of neurons left
after pruning along with the average bit precisions for the weights at each layer.

                                        <<TABLE>>

                 5.2 Compression Rates

Table 2: Compression results for our methods. “DC” corresponds to Deep Compression method
introduced at [25], “DNS” to the method of [21] and “SWS” to the Soft-Weight Sharing of [70].
Numbers marked with * are best case guesses.

<<TABLE>>

For the actual compression task we compare our method to current work in three different scenarios:
(i) compression achieved only by pruning, here, for non-group methods we use the CSC format
to store parameters; (ii) compression based on the former but with reduced bit precision per layer
(only for the weights); and (iii) the maximum compression rate as proposed by [25]. We believe
these to be relevant scenarios because (i) can be applied with already existing frameworks such as
Tensorflow [1], and (ii) is a practical scheme given that upcoming GPUs and frameworks will be designed to
work with low and mixed precision arithmetic [43,23]. For (iii), we perform k-means clustering on
the weights with k=32 and consequently store a weight index that points to a codebook of available
weights. Note that the latter achieves the highest compression rate but is fairly impractical at
test time, since the original matrix needs to be restored for each layer. As we can observe in Table 2,
our methods are competitive with the state of the art for LeNet-300-100 while offering significantly
better compression rates on the LeNet-5-Caffe architecture, without any loss in accuracy. Do note
that group sparsity and weight sparsity can be combined so as to further prune some weights when a
particular group is not removed; thus we can potentially further boost compression performance on,
e.g., LeNet-300-100. For the VGG network we observe that training from a random initialization
yielded consistently lower accuracy (around 1%-2% less) compared to initializing the means of the
approximate posterior from a pretrained network, similarly to [51]; thus we only report the latter
results (footnote 11). After initialization we trained the VGG network regularly for 200 epochs using Adam with
the default hyperparameters. We observe a small drop in accuracy for the final models when using
the deterministic version of the network for prediction, but averaging across multiple
samples restores the original accuracy. Note that in general we can maintain the original accuracy on
VGG without sampling by simply finetuning with a small learning rate, as done in [51]. This will
still induce (less) sparsity, but unfortunately it does not lead to good compression, as the bit precision
remains very high because the marginal variances of the weights are not appropriately increased.
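
Scenario (iii) can be sketched as follows: cluster the surviving weights into k=32 values and store, per weight, only an index into that codebook. The sketch below uses a plain one-dimensional Lloyd iteration with random initialization; the initialization, iteration count and all function names are assumptions made for illustration and are not taken from the experiments reported here.

```python
import math
import random


def kmeans_1d(values, k, iters=25, seed=0):
    """Plain Lloyd iterations on scalars; returns (codebook, index per value)."""
    rng = random.Random(seed)
    centers = sorted(rng.sample(values, k))  # initialize from k data points

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centers[c]))

    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(v)
            sums[j] += v
            counts[j] += 1
        # Empty clusters keep their previous center.
        centers = [sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)]
    return centers, [nearest(v) for v in values]


def index_bits(k):
    """Bits needed to store one codebook index."""
    return max(1, math.ceil(math.log2(k)))


def dequantize(codebook, indices):
    """Restore the (approximate) weights from the codebook, as needed at test time."""
    return [codebook[i] for i in indices]
```

With k=32 each stored index needs 5 bits, which is where the extra compression comes from; the `dequantize` step is the restoration cost mentioned above.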

                 5.3 Speed and energy consumption

We demonstrate that our method is competitive with [73], denoted as GL, a method that explicitly
prunes convolutional kernels to reduce compute time. We measure the time and energy consumption
of one forward pass of a mini-batch with batch size 8192 through LeNet-5-Caffe. We average over $10^4$
forward passes, and all experiments were run with Tensorflow 1.0.1, CUDA 8.0 and the respective cuDNN.
We use either 16 CPUs run in parallel (CPU) or a Titan X (GPU). Note that we only use the pruned
architecture, as lower bit precision would further increase the speed-up but is not implementable in
any common framework. Further, all methods we compare to in the latter experiments would barely
show an improvement at all, since they do not learn to prune groups but only parameters. In Figure 1
we present our results. As expected, the largest effect on the speed-up is caused by GPU usage.
However, both our models and the best competing models reach a speed-up factor of around 8x. We
can further save about 3x energy costs by applying our architecture instead of the original one on a
GPU. For larger networks the speed-up is even higher: for the VGG experiments with batch size 256
we have a speed-up factor of 51x.

                                                <<FIGURE>>

Figure 1: Left: Average time a batch of 8192 samples takes to pass through LeNet-5-Caffe. Numbers on
top of the bars represent the speed-up factor relative to the CPU implementation of the original network.
Right: Energy consumption of the GPU for the same process (when run on GPU).

                 6 Conclusion

                 We introduced Bayesian compression, a way to tackle efﬁciency and compression in deep neural
                 networks in a uniﬁed and principled way. Our proposed methods allow for theoretically principled
                 compression of neural networks, improved energy efﬁciency with reduced computation while naturally
                 learning the bit precisions for each weight. This serves as a strong argument in favor of Bayesian
                 methods for neural networks, when we are concerned with compression and speed up.

                   11 We also tried to ﬁnetune the same network with Sparse VD, but unfortunately it increased the error
                 considerably (around 3% extra error), therefore we do not report those results.

Acknowledgments
                   We would like to thank Dmitry Molchanov, Dmitry Vetrov, Klamer Schutte and Dennis Koelma for
                   valuable discussions and feedback. This research was supported by TNO, NWO and Google.


                   References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
    M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv
    preprint arXiv:1603.04467, 2016.
[2] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical
    Society. Series B (Methodological), pages 99–102, 1974.
[3] A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of gaussians. In Advances in neural
    information processing systems, pages 523–531, 2011.
[4] E. Azarkhish, D. Rossi, I. Loi, and L. Benini. Neurostream: Scalable and energy efficient deep learning
    with smart memory cubes. arXiv preprint arXiv:1701.06420, 2017.
[5] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural information processing
    systems, pages 2654–2662, 2014.
[6] E. Beale, C. Mallows, et al. Scale mixing of symmetric distributions with zero means. The Annals of
    Mathematical Statistics, 30(4):1145–1151, 1959.
[7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks.
    Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11
    July 2015, 2015.
[8] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97
    (2):465–480, 2010.
[9] S. Chai, A. Raghavan, D. Zhang, M. Amer, and T. Shields. Low precision neural networks using subband
    decomposition. arXiv preprint arXiv:1703.08595, 2017.
[10] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural
    networks. arXiv preprint arXiv:1506.04449, 2015.
[11] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations
    constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[12] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision
    multiplications. arXiv preprint arXiv:1412.7024, 2014.
[13] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary
    weights during propagations. In Advances in Neural Information Processing Systems, pages 3105–3113,
    2015.
[14] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in
    Neural Information Processing Systems, pages 2148–2156, 2013.
[15] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference
    complexity. arXiv preprint arXiv:1703.08651, 2017.
[16] M. A. Figueiredo. Adaptive sparseness using jeffreys’ prior. Advances in neural information processing
    systems, 1:697–704, 2002.
[17] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep
    learning. ICML, 2016.
[18] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector
    quantization. ICLR, 2015.
[19] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information
    Processing Systems, pages 2348–2356, 2011.
[20] P. D. Grünwald. The minimum description length principle. MIT press, 2007.
[21] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances In Neural
    Information Processing Systems, pages 1379–1387, 2016.
[22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical
    precision. CoRR, abs/1502.02551, 392, 2015.
[23] P. Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks. Master’s thesis,
    University of California, 2016.
[24] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks.
    In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[25] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning,
    trained quantization and huffman coding. ICLR, 2016.
[26] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on
    imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages
    1026–1034, 2015.
[27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint
    arXiv:1503.02531, 2015.
[28] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length
    of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages
    5–13. ACM, 1993.
[29] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
    networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[30] A. Honkela and H. Valpola. Variational learning and bits-back coding: an information-theoretic view to
    bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.
[31] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam.
    Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint
    arXiv:1704.04861, 2017.
[32] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level
    accuracy with 50x fewer parameters and < 0.5 mb model size. ICLR, 2017.
[33] J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint
    arXiv:1602.03807, 2016.
[34] T. Karaletsos and G. Rätsch. Automatic relevance determination for deep generative models. arXiv preprint
    arXiv:1505.07765, 2015.
[35] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning
    Representations (ICLR), San Diego, 2015.
[36] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning
    Representations (ICLR), 2014.
[37] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick.
    Advances in Neural Information Processing Systems, 2015.
[38] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
[39] N. D. Lawrence. Note relevance determination. In Neural Nets WIRN Vietri-01, pages 128–133. Springer,
    2002.
[40] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS,
    volume 2, pages 598–605, 1989.
[41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
    Proceedings of the IEEE, 86(11):2278–2324, 1998.
[42] Y. LeCun, C. Cortes, and C. J. Burges. The mnist database of handwritten digits, 1998.
[43] D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks.
    Workshop ICML, 2016.
[44] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks.
    arXiv preprint arXiv:1511.06393, 2015.
[45] C. Louizos. Smart regularization of deep architectures. Master’s thesis, University of Amsterdam, 2015.
[46] C. Louizos and M. Welling. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.
    ArXiv e-prints, Mar. 2017.
[47] D. J. MacKay. Probable networks and plausible predictions - a review of practical bayesian methods for
    supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[48] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with
    fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
[49] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to
    weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981, 2016.
[50] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the
    American Statistical Association, 83(404):1023–1032, 1988.
[51] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv
    preprint arXiv:1701.05369, 2017.
[52] E. Nalisnick, A. Anandkumar, and P. Smyth. A scale mixture perspective of multiplicative noise in neural
    networks. arXiv preprint arXiv:1506.03208, 2015.
[53] R. M. Neal. Bayesian learning for neural networks. PhD thesis, Citeseer, 1995.
[54] S. E. Neville, J. T. Ormerod, M. Wand, et al. Mean field variational bayes for continuous sparse signal
    shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113–1151, 2014.
[55] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of
    hierarchical models. Statistical Science, pages 59–73, 2007.
[56] C. Peterson. A mean field theory learning algorithm for neural networks. Complex systems, 1:995–1019,
    1987.
[57] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary
    convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer,
    2016.
[58] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in
    deep generative models. In Proceedings of the 31th International Conference on Machine Learning, ICML
    2014, Beijing, China, 21-26 June 2014, pages 1278–1286, 2014.
[59] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[60] J. Rissanen. Stochastic complexity and modeling. The annals of statistics, pages 1080–1100, 1986.
[61] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural
    networks. arXiv preprint arXiv:1607.00485, 2016.
[62] S. Shi and X. Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units.
    arXiv preprint arXiv:1704.07724, 2017.
[63] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
    ICLR, 2015.
[64] M. Sites. Ieee standard for floating-point arithmetic. 2008.
[65] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders.
    arXiv preprint arXiv:1602.02282, 2016.
[66] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
[67] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to
    prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958,
    2014.
[68] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and
    survey. arXiv preprint arXiv:1703.09039, 2017.
[69] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
    Series B (Methodological), pages 267–288, 1996.
[70] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. ICLR, 2017.
[71] G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision
    and sparsity. arXiv preprint arXiv:1610.00324, 2016.
[72] C. S. Wallace. Classification by minimum-message-length inference. In International Conference on
    Computing and Information, pages 72–81. Springer, 1990.
[73] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In
    Advances In Neural Information Processing Systems, pages 2074–2082, 2016.
[74] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using
    energy-aware pruning. CVPR, 2017.
[75] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[76] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. ICLR, 2017.


                   Appendix

                   A. Detailed experimental setup

We implemented our methods in Tensorflow [1] and optimized the variational parameters using
Adam [35] with the default hyperparameters. The means of the conditional Gaussian $q(W|z)$
were initialized with the scheme proposed in [26], whereas the logs of the standard deviations were
initialized by sampling from N(-9, 1e-4). The parameters of $q(z)$ were initialized such that the
overall mean of $z$ is approximately 1 and the overall variance is very low (1e-8); this ensures that all of the
groups are active during the initial training iterations.

Table 3: Floating point formats

<<TABLE>>

As for the standard deviation constraints: for the LeNet-300-100 architecture we constrained the
standard deviation of the first layer to be $\leq 0.2$, whereas for LeNet-5-Caffe we constrained
the standard deviation of the first layer to be $\leq 0.5$. The remaining standard deviations were left
unconstrained. For the VGG network we constrained the standard deviations of the 64 and 128
feature map layers to be $\leq 0.1$, the standard deviations of the 256 feature map layers to be $\leq 0.2$,
and left the rest of the standard deviations unconstrained. We also found it beneficial to incorporate
“warm-up” [65], i.e., we annealed the negative KL-divergence from the prior to the approximate
posterior with a linear schedule for the first 100 epochs. We initialized the means of the approximate
posterior with the weights and biases obtained from a VGG network trained with batch normalization
and dropout on CIFAR 10. For our method we disabled batch normalization during training.
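
The warm-up schedule can be sketched as a coefficient on the negative KL term that grows linearly over the first 100 epochs. That the coefficient starts at exactly 0 and is capped at 1 afterwards is an assumption of this sketch, and the function names are hypothetical.

```python
def kl_weight(epoch, warmup_epochs=100):
    """Linear warm-up coefficient: 0 at epoch 0, 1 from `warmup_epochs` onwards."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))


def annealed_objective(expected_log_lik, neg_kl, epoch, warmup_epochs=100):
    """Variational lower bound with the negative-KL term scaled by the warm-up weight."""
    return expected_log_lik + kl_weight(epoch, warmup_epochs) * neg_kl
```

Early in training the objective is dominated by the likelihood term, so groups are not pruned before the weights have had a chance to fit the data.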
As for preprocessing the data: for MNIST the only preprocessing we did was to rescale the digits to
lie in the [-1, 1] range, and for CIFAR 10 we used the preprocessed dataset provided by [75].
                 Furthermore, do note that by pruning a given ﬁlter at a particular convolutional layer we can also
                 prune the parameters corresponding to that feature map for the next layer. This similarly holds for
                 fully connected layers; if we drop a given input neuron then the weights corresponding to that node
                 from the previous layer can also be pruned.
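
For two consecutive fully connected layers, this propagation of pruning can be sketched as below. The weight layout (rows index inputs, columns index outputs) and the function name are assumptions made for illustration.

```python
def prune_hidden_units(w_in, w_out, keep):
    """Remove pruned hidden units from both layers that touch them.

    w_in:  weights into the hidden layer, rows = inputs, columns = hidden units.
    w_out: weights out of the hidden layer, rows = hidden units, columns = outputs.
    keep:  one boolean per hidden unit; False means its group in w_out was pruned.
    """
    kept = [j for j, flag in enumerate(keep) if flag]
    # Dropping row j of w_out (the pruned group) makes column j of w_in useless too.
    w_in_pruned = [[row[j] for j in kept] for row in w_in]
    w_out_pruned = [w_out[j] for j in kept]
    return w_in_pruned, w_out_pruned
```

Pruning one of three hidden units in a 2-3-2 network therefore removes one column from the first weight matrix and one row from the second.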

                 B. Standards for Floating-Point Arithmetic

Floating-point values eventually need to be represented in a binary basis in a computer. The most
common standard today is the IEEE 754-2008 convention [64]. It defines $x$-bit base-2 formats,
officially referred to as binary$x$, with $x \in \{16, 32, 64, 128\}$. The formats are also widely known as
half, single, double and quadruple precision floats, respectively, and are used in almost all programming
languages as a standard. The format considers 3 kinds of bits: one sign bit, $w$ exponent bits and $p$
precision bits.

                                            <<FIGURE>>

Figure 2: A symbolic representation of the binary$x$ format [64].


The sign bit determines the sign of the number to be represented. The exponent $E$ is a $w$-bit signed
integer, e.g. for single precision $w = 8$ and thus $E \in [-127, 128]$. In practice, the exponent range
is smaller since the first and the last number are reserved for special numbers. The true significand or
mantissa includes $t$ bits on the right of the binary point. There is an implicit leading bit with value
one. A value is consequently decomposed as follows

                                    <<FORMULA>>                          (21)
                                                
                                    <<FORMULA>>                          (22)

In Table 3, we summarize common and less common floating point formats.
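
As a concrete illustration of this decomposition, the sketch below splits a single-precision (binary32) number into its three fields using only the standard library. It uses the IEEE 754 layout of binary32: 1 sign bit, 8 exponent bits stored with a bias of 127, and 23 stored fraction bits. The function names are hypothetical, and the reconstruction covers normal numbers only.

```python
import struct


def decompose_binary32(x):
    """Return (sign, biased_exponent, fraction) of x rounded to binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction


def reconstruct_normal(sign, biased_exponent, fraction):
    """Inverse of the decomposition for normal numbers (1 <= biased_exponent <= 254)."""
    significand = 1.0 + fraction / 2 ** 23  # implicit leading bit with value one
    return (-1.0) ** sign * significand * 2.0 ** (biased_exponent - 127)
```

For example, -6.5 equals -1.625 x 2^2, so its fields are sign 1, biased exponent 129 and fraction 0.625 x 2^23.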

It is, however, also possible to design a self-defined format. There are 3 important quantities
when choosing the right specification: overflow, underflow and unit round-off, also known as machine
precision. Each one can be computed knowing the number of exponent and significant bits. In
our work, for example, we consider a format that uses significantly fewer exponent bits, since network
parameters usually vary between [-10, 10]. We set the unit round-off equal to the precision and thus
can compute the significant bits necessary to represent a specific weight.
Beyond designing a tailored floating point format for deep learning, recent work has also explored the
possibility of deep learning with mixed formats [43,23]. For example, imagine the activations having
high precision while the weights can be low precision.
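
A rough sketch of how these three quantities follow from the bit counts is given below. It assumes an IEEE-style format (biased exponent with the all-zeros and all-ones codes reserved, round-to-nearest) and counts the implicit leading bit in the number of significand bits p; whether the custom format used in this work follows exactly these conventions is an assumption, and the function names are hypothetical.

```python
import math


def format_limits(w, p):
    """(smallest normal, largest finite, unit round-off) for w exponent bits
    and p significand bits (implicit bit included), IEEE-style conventions."""
    e_max = 2 ** (w - 1) - 1
    e_min = 1 - e_max
    underflow = 2.0 ** e_min
    overflow = (2.0 - 2.0 ** (1 - p)) * 2.0 ** e_max
    unit_roundoff = 2.0 ** (-p)
    return underflow, overflow, unit_roundoff


def significand_bits_for(target):
    """Smallest p whose unit round-off 2**-p does not exceed `target` (0 < target < 1)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return math.ceil(-math.log2(target))
```

With w=8 and p=24 this reproduces the familiar single-precision limits, and a target relative precision of 0.01 needs only 7 significand bits.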

                 C. Shrinkage properties of the normal-Jeffreys and horseshoe priors

                            <<FIGURE>>

Figure 3: Comparison of the behavior of the log-uniform / normal-Jeffreys (NJ) prior and the
horseshoe (HS) prior (where $s = 1$). Both priors behave similarly at zero but the normal-Jeffreys has
an extremely heavy tail (thus making it non-normalizable).

In this section we provide some insights about the behavior of each of the priors we employ.
Following the excellent analysis of [8], we can perform a change of variables and express the scale
mixture distribution of eq. 3 in the main paper in terms of a shrinkage coefficient, $\kappa$:

                                                     <<FORMULA>>                  (23) 

It is easy to observe that eq. 23 corresponds to a continuous relaxation of the spike-and-slab prior:
when $\kappa = 0$ we have that <<FORMULA>>, i.e. no shrinkage/regularization for $w$; when
$\kappa = 1$ we have that <<FORMULA>>, i.e. $w$ is exactly zero; and when $\kappa = \frac{1}{2}$ we have that
<<FORMULA>>. Now by examining the implied prior on the shrinkage coefficient $\kappa$ for both
the log-uniform and the horseshoe priors we can better study their behavior. As explained in the
analysis referenced above, the half-Cauchy prior on $z$ corresponds to a beta prior on the shrinkage
coefficient, <<FORMULA>>, whereas the normal-Jeffreys / log-uniform prior on $z$ corresponds to
$p(\kappa) = \mathcal{B}(\cdot, \cdot)$ with <<FORMULA>>.
The densities of both of these distributions can be seen in Figure 3b. As we can observe, the log-uniform
prior posits a distribution that concentrates almost all of its mass at either $\kappa \approx 0$ or $\kappa \approx 1$,
essentially either pruning the parameter or keeping it close to the maximum likelihood estimate due
to <<FORMULA>>. In contrast, the horseshoe prior maintains enough probability mass for
the in-between values of $\kappa$ and thus can, potentially, offer better regularization and generalization.
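
The correspondence between the half-Cauchy prior and a beta prior on the shrinkage coefficient can be checked numerically. The sketch below assumes the standard horseshoe definition kappa = 1 / (1 + z^2) with unit global scale, under which a standard half-Cauchy z yields kappa distributed as Beta(1/2, 1/2) (the arcsine distribution); whether this matches the exact form of eq. 23 is an assumption, and all names are hypothetical.

```python
import math
import random


def shrinkage(z):
    """Shrinkage coefficient for a scale z (standard horseshoe definition, assumed)."""
    return 1.0 / (1.0 + z * z)


def sample_half_cauchy(rng):
    """Inverse-CDF sample from a standard half-Cauchy: tan(pi * u / 2), u ~ U[0, 1)."""
    return math.tan(0.5 * math.pi * rng.random())


def arcsine_cdf(x):
    """CDF of Beta(1/2, 1/2) on [0, 1]."""
    return 2.0 / math.pi * math.asin(math.sqrt(x))


rng = random.Random(0)
kappas = [shrinkage(sample_half_cauchy(rng)) for _ in range(20000)]
empirical_cdf_at_01 = sum(k <= 0.1 for k in kappas) / len(kappas)
empirical_mean = sum(kappas) / len(kappas)
```

The empirical fraction of samples below 0.1 should be close to `arcsine_cdf(0.1)`, and the empirical mean close to 1/2, reflecting the U-shaped density that keeps mass both near 0 and near 1 while still covering the in-between values.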

                 D. Negative KL-divergences for log-normal approximating posteriors

Let <<FORMULA>> be a log-normal approximating posterior. Here we will derive the negative
KL-divergences to $q(z)$ from inverse gamma, gamma and half-normal distributions.
Let $p(z)$ be an inverse gamma distribution, i.e. $p(z) = \mathcal{IG}(\alpha, \beta)$. The negative KL-divergence can
be expressed as follows:
                     
                              <<FORMULA>>         (24)


                The second term is the entropy of the log-normal distribution which has the following form:

                                    <<FORMULA>>         (25)

                 The first term is the negative cross-entropy of the log-normal approximate posterior from the
                 inverse-Gamma prior:
                                  <<FORMULA>>        (26)

                                  <<FORMULA>>        (27)

                 Since the natural logarithm of a log-normal distribution <<FORMULA>> follows a normal distribution
                 <<FORMULA>> we have that <<FORMULA>>. Furthermore we have that if <<FORMULA>> then <<FORMULA>>, therefore
                 <<FORMULA>>. Putting everything together we have that:

                                  <<FORMULA>>         (28) 

                 Therefore the negative KL-divergence is:

                                          <<FORMULA>>                  (29)

                 Now let $p(z)$ be a Gamma prior, i.e. $p(z) = \mathcal{G}(\alpha, \beta)$. We have that the negative cross-entropy
                 changes to:
                                  <<FORMULA>>        (30)

                                <<FORMULA>>      (31)
                                                            
                                <<FORMULA>>        (32)

                 Therefore the negative KL-divergence is:

                                          <<FORMULA>>                   (33)
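
The closed forms in this section can be sanity-checked numerically. The sketch below is rederived from the standard densities rather than copied from eqs. 29 and 33, and it assumes a shape/scale parameterization for the inverse Gamma and a shape/rate parameterization for the Gamma; both choices, and all function names, are assumptions made for illustration. Each closed form is compared against a Gauss-Hermite quadrature estimate of $\mathbb{E}_q[\log p(z) - \log q(z)]$.

```python
import math
import numpy as np

def _lognormal_entropy(mu, sigma):
    # Entropy of LogNormal(mu, sigma^2).
    return mu + math.log(sigma) + 0.5 * (1.0 + math.log(2.0 * math.pi))

def neg_kl_lognormal_invgamma(mu, sigma, alpha, beta):
    """-KL(q || p), q = LogNormal(mu, sigma^2), p = InvGamma(shape=alpha, scale=beta)."""
    # Uses E_q[log z] = mu and E_q[1/z] = exp(-mu + sigma^2 / 2).
    cross = (alpha * math.log(beta) - math.lgamma(alpha)
             - (alpha + 1.0) * mu - beta * math.exp(-mu + 0.5 * sigma ** 2))
    return cross + _lognormal_entropy(mu, sigma)

def neg_kl_lognormal_gamma(mu, sigma, alpha, beta):
    """-KL(q || p), q = LogNormal(mu, sigma^2), p = Gamma(shape=alpha, rate=beta)."""
    # Uses E_q[log z] = mu and E_q[z] = exp(mu + sigma^2 / 2).
    cross = (alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1.0) * mu - beta * math.exp(mu + 0.5 * sigma ** 2))
    return cross + _lognormal_entropy(mu, sigma)

def quad_neg_kl(mu, sigma, log_p, deg=40):
    """Gauss-Hermite estimate of E_q[log p(z) - log q(z)] for log-normal q."""
    x, w = np.polynomial.hermite.hermgauss(deg)
    log_z = mu + math.sqrt(2.0) * sigma * x
    log_q = -log_z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - x ** 2
    return float(np.sum(w * (log_p(np.exp(log_z)) - log_q)) / math.sqrt(math.pi))

mu, sigma, alpha, beta = 0.3, 0.5, 2.0, 1.5
const = alpha * math.log(beta) - math.lgamma(alpha)
log_ig = lambda z: const - (alpha + 1.0) * np.log(z) - beta / z
log_ga = lambda z: const + (alpha - 1.0) * np.log(z) - beta * z

gap_ig = abs(neg_kl_lognormal_invgamma(mu, sigma, alpha, beta) - quad_neg_kl(mu, sigma, log_ig))
gap_ga = abs(neg_kl_lognormal_gamma(mu, sigma, alpha, beta) - quad_neg_kl(mu, sigma, log_ga))
```

If the paper's Gamma uses a scale rather than a rate parameter, replace `beta` by its reciprocal in the Gamma expressions.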

                 Now, by employing the results above we can express the negative KL-divergence from
                 <<FORMULA>> to <<FORMULA>> as follows:

                                              <<FORMULA>>

                 with the KL-divergence for the weight distribution $q(\tilde{W})$ given by eq. 8 in the main paper.

                            E. Visualizations

                                        <<FIGURE>>

                 Figure 4: Distribution of the thresholds for the Sparse Variational Dropout 4a, Bayesian Compression
                 with group normal-Jeffreys (BC-GNJ) 4b and group Horseshoe (BC-GHS) 4c priors for the three
                 layer LeNet-300-100 architecture. It is easily observed that there are usually two well separable
                 groups with BC-GNJ and BC-GHS, thus making the choice for the threshold easy. Smaller values
                 indicate signal whereas larger values indicate noise (i.e. useless groups).

                                        <<FIGURE>>

                 Figure 5: Distribution of the bit precisions for the Sparse Variational Dropout 5a, Bayesian
                 Compression with group normal-Jeffreys (BC-GNJ) 5b and group Horseshoe (BC-GHS) 5c priors for the
                 three layer LeNet-300-100 architecture. All of the methods usually require far fewer than 32 bits for
                 the weights.

                 F. Algorithms for the feedforward pass

                 Algorithms 1, 2, 3, 4 describe the forward pass using local reparametrizations for fully connected and
                 convolutional layers with the approximate posteriors for the Bayesian Compression (BC) with group
                 normal-Jeffreys (BC-GNJ) and group Horseshoe (BC-GHS) priors employed in the experiments. For
                 the fully connected layers we couple the scales for each input neuron whereas for the convolutional
                 layers we couple the scales for each output feature map. $M_w$, $\Sigma_w$ are the means and variances of each layer,
                 $H$ is a minibatch of activations of size $K$. For the first layer we have that $H = X$ where $X$ is the
                 minibatch of inputs. For the convolutional layers $N_f$ is the number of convolutional filters, $*$ is the
                 convolution operator and we assume the [batch, height, width, feature maps] convention.

                   Algorithm 1 Fully connected BC-GNJ layer h. 
                   
                            <<ALGORITHM>>
                   
                   Algorithm 2 Convolutional BC-GNJ layer h.
                
                            <<ALGORITHM>>

                 Algorithm 3 Fully connected BC-GHS layer h.
                 
                            <<ALGORITHM>>
                 
                 Algorithm 4 Convolutional BC-GHS layer h.

                            <<ALGORITHM>>           

<|endoftext|>


<|startoftext|>
Channel Pruning for Accelerating Very Deep Neural Networks 
Yihui He*  Xiangyu Zhang  Jian Sun  
Xi'an Jiaotong University  Megvii Inc.  Megvii Inc.  
Xi'an, 710049, China  Beijing, 100190, China  Beijing, 100190, China  
heyihui@stu.xjtu.edu.cn  zhangxiangyu@megvii.com  sunjian@megvii.com  

Abstract 
In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5× speed-up along with only a 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet and Xception, and suffers only 1.4% and 1.0% accuracy loss under a 2× speed-up respectively, which is significant. 
1. Introduction 
Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one. 
Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, feature map width (number of channels) cannot be reduced, which makes it difficult to decompose the 1 × 1 convolutional layers favored by modern networks (e.g., GoogLeNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required. 

<<FIGURE>>

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces number of channels in each layer (focus of this paper). 
Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited. 
In this paper, we propose a new inference-time approach for channel pruning, utilizing inter-channel redundancy. Inspired by the improvement of tensor factorization through feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternating steps: channel selection and feature map reconstruction. In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels by linear least squares. We alternate between the two steps. Further, we approximate the network layer-by-layer, with accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]). 

<<FIGURE>>

Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation in which two channels are pruned for feature map B. Thus the corresponding channels of filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C, k_h × k_w: kernel size. 
For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms the previous state of the art. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4% and 1.0% accuracy loss respectively. 

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22]. 
Optimized implementation based methods [35, 47, 27, 4] accelerate convolution with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity. 
Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers up to 50×. However, in practice, the actual speed-up may depend heavily on the implementation. 
Tensor factorization [22, 28, 13, 24] decomposes weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into a combination of 3 × 3 and 1 × 1 layers, driven by feature map redundancy. 
Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely explored. 
Inference-time channel pruning is challenging, as reported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, and mainly operate on the fully connected layers. For data-free approaches [31, 3], results at high speed-up ratios (e.g., 5×) have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible for very deep models and large datasets. From our observation, [31] is sometimes even worse than the naive solution (Sec. 4.1.1). 

3. Approach 

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks. 

3.1. Formulation 

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. The filters that produce these channels can also be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction. We need to reconstruct the following feature maps using the selected channels. 
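The bookkeeping described above, removing input slices of the consuming filters and whole filters of the producing layer, can be illustrated with a small NumPy sketch. It treats 1 × 1 convolutions as matrix multiplications, and the sizes, the `keep` indices and all variable names are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c_in, c_mid, c_out = 6, 5, 8, 4           # hypothetical sizes
A = rng.normal(size=(N, c_in))               # input feature map (1x1 convs as matmuls)
W_prev = rng.normal(size=(c_mid, c_in))      # filters that produce feature map B
W_next = rng.normal(size=(c_out, c_mid))     # filters that consume feature map B
relu = lambda a: np.maximum(a, 0.0)

keep = np.array([0, 2, 3, 7])                # hypothetical surviving channels of B

# Reference: keep every weight but zero out the pruned channels of B.
mask = np.zeros(c_mid)
mask[keep] = 1.0
C_masked = (relu(A @ W_prev.T) * mask) @ W_next.T

# Pruned network: drop the filters that produce pruned channels and the
# matching input slices of the next layer's filters.
W_prev_small = W_prev[keep]                  # shape (len(keep), c_in)
W_next_small = W_next[:, keep]               # shape (c_out, len(keep))
C_small = relu(A @ W_prev_small.T) @ W_next_small.T
```

Both computations give the same output, but the second one stores and multiplies fewer weights, which is where the speed-up of channel pruning comes from.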
Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels by linear least squares. We alternate between the two steps. 
Formally, to prune a feature map with $c$ channels, we consider applying $n \times c \times k_h \times k_w$ convolutional filters $W$ on <<FORMULA>> input volumes $X$ sampled from this feature map, which produces an $N \times n$ output matrix $Y$. Here, $N$ is the number of samples, $n$ is the number of output channels, and $k_h$, $k_w$ are the kernel size. For simple representation, the bias term is not included in our formulation. To prune the input channels from $c$ to the desired <<FORMULA>>, while minimizing reconstruction error, we formulate our problem as follows: 

<<FORMULA>>       (1)

$\|\cdot\|_F$ is the Frobenius norm. $X_i$ is the $N \times k_h k_w$ matrix sliced from the $i$th channel of the input volumes $X$, $i = 1, \dots, c$. $W_i$ is the $n \times k_h k_w$ filter weights sliced from the $i$th channel of $W$. $\beta$ is the coefficient vector of length $c$ for channel selection, and $\beta_i$ is the $i$th entry of $\beta$. Notice that, if $\beta_i = 0$, $X_i$ is no longer useful and can be safely pruned from the feature map. $W_i$ can also be removed. 
Optimization: Solving the minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the $\ell_0$ to $\ell_1$ regularization: 

<<FORMULA>>       (2)

$\lambda$ is a penalty coefficient. By increasing $\lambda$, there will be more zero terms in $\beta$ and one can get a higher speed-up ratio. We also add the constraint $\forall i\; \|W_i\|_F = 1$ to this formulation, which avoids the trivial solution. 
Now we solve this problem in two steps. First, we fix $W$ and solve $\beta$ for channel selection. Second, we fix $\beta$ and solve $W$ to minimize the reconstruction error. 
(i) The subproblem of $\beta$. In this case, $W$ is fixed. We solve $\beta$ for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection. 

<<FORMULA>>       (3) 
Here $Z_i = X_i W_i^\top$ (size $N \times n$). We ignore the $i$th channel if $\beta_i = 0$. 
(ii) The subproblem of $W$. In this case, $\beta$ is fixed. We utilize the selected channels to minimize the reconstruction error. We can find the optimal solution by least squares: 

<<FORMULA>>. (4)

Here <<FORMULA>> (size $N \times c k_h k_w$). $W'$ is the $n \times c k_h k_w$ reshaped $W$, <<FORMULA>>. After obtaining the result $W'$, it is reshaped back to $W$. Then we assign <<FORMULA>>, so that the constraint <<FORMULA>> is satisfied.
We alternately optimize (i) and (ii). In the beginning, $W$ is initialized from the trained model, <<FORMULA>>, namely no penalty, and <<k = c>>. We gradually increase <<FORMULA>>. For each
change of <<FORMULA>>, we iterate these two steps until k is stable. 

After <<FORMULA>> is satisfied, we obtain the final solution $W$ from <<FORMULA>>. In practice, we found that the two-step iteration is time consuming. So we apply (i) multiple times, 

<<FORMULA>>

until <<FORMULA>> is satisfied. Then apply (ii) just once, to obtain 

<<FORMULA>>

the final result. From our observation, this result is comparable with that of the two-step iteration. Therefore, in the following experiments, we adopt this approach for efficiency. 
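The practical variant described above (run the LASSO step with growing penalty until few enough channels survive, then reconstruct once by least squares) might be sketched as follows. This is only an illustration under stated assumptions: the paper uses scikit-learn solvers, whereas `lasso_cd` here is a hand-written coordinate descent; the names `prune_layer`, `lam0`, `growth` and the synthetic data are hypothetical; and the reconstruction step drops the $\beta$ scaling of the inputs, since least squares on the kept channels absorbs it.

```python
import numpy as np

def lasso_cd(A, y, lam, n_samples, iters=100):
    """Coordinate descent for (1/(2*n_samples))*||y - A b||^2 + lam*||b||_1."""
    b = np.zeros(A.shape[1])
    col_sq = (A ** 2).sum(axis=0) / n_samples
    r = y.astype(float).copy()                # residual y - A b
    for _ in range(iters):
        for j in range(A.shape[1]):
            if col_sq[j] == 0.0:
                continue
            r += A[:, j] * b[j]               # remove column j from the fit
            rho = A[:, j] @ r / n_samples
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= A[:, j] * b[j]
    return b

def prune_layer(X, W, Y, c_keep, lam0=1e-4, growth=1.5):
    """X: (N, c, k) input patches, W: (n, c, k) filters, Y: (N, n) targets."""
    N, c, k = X.shape
    # Column i holds the flattened Z_i = X_i W_i^T.
    Z = np.einsum('nck,ock->cno', X, W).reshape(c, -1).T
    beta, lam = np.ones(c), lam0
    while np.count_nonzero(beta) > c_keep:    # step (i), repeated with larger penalty
        beta = lasso_cd(Z, Y.ravel(), lam, N)
        lam *= growth
    keep = np.flatnonzero(beta)               # may be empty if growth is too coarse
    Xk = X[:, keep, :].reshape(N, -1)
    Wk, *_ = np.linalg.lstsq(Xk, Y, rcond=None)   # step (ii), applied once
    return keep, Wk.T.reshape(-1, len(keep), k)

# Synthetic layer in which only channels 1, 4 and 6 carry large weights.
rng = np.random.default_rng(0)
N, c, k, n = 200, 8, 4, 5
X = rng.normal(size=(N, c, k))
scale = np.full(c, 0.01)
scale[[1, 4, 6]] = 1.0
W = rng.normal(size=(n, c, k)) * scale[None, :, None]
Y = np.einsum('nck,ock->no', X, W)

keep, W_new = prune_layer(X, W, Y, c_keep=3)
Y_hat = np.einsum('nck,ock->no', X[:, keep], W_new)
rel_err = np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)
```

On this toy problem the LASSO step discards the five low-energy channels first, and the single least squares step recovers the output closely from the remaining three.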
Discussion: Some recent works [48, 1, 17] (though training based) also introduce the $\ell_1$-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduced sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, but feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time. 

3.2. Whole Model Pruning 
Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This could be formalized as: 

<<FORMULA>> (5)

Different from Eqn. 1, $Y$ is replaced by $Y'$, which is from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning. 
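A toy comparison may help to see why the target matters. The sketch below uses two fully connected layers as stand-ins for convolutions and fixed, hypothetical channel subsets (`keep0`, `keep1`) instead of LASSO selection. It fits the second layer once against the output of the current, already pruned input under the original weights (the Eqn. 1 target) and once against the output of the original model (the Eqn. 5 target), then measures both against the original model's output.

```python
import numpy as np

def fit(inputs, target):
    """Least squares weights mapping inputs to target."""
    W, *_ = np.linalg.lstsq(inputs, target, rcond=None)
    return W

rng = np.random.default_rng(0)
N, c0, c1, c2 = 500, 6, 8, 4                 # hypothetical layer widths
relu = lambda a: np.maximum(a, 0.0)
X0 = rng.normal(size=(N, c0))
W1 = rng.normal(size=(c0, c1))
W2 = rng.normal(size=(c1, c2))

# Original (unpruned) model.
H1 = relu(X0 @ W1)
Y2_orig = H1 @ W2                            # Y' in Eqn. 5

# Layer 1 is pruned first, so layer 2 now sees a perturbed input.
keep0 = [0, 1, 2, 3]
W1p = fit(X0[:, keep0], X0 @ W1)
H1p = relu(X0[:, keep0] @ W1p)

# Layer 2: same kept channels, two different regression targets.
keep1 = [0, 2, 4, 5, 7]
W2_eq1 = fit(H1p[:, keep1], H1p @ W2)        # target from the current input
W2_eq5 = fit(H1p[:, keep1], Y2_orig)         # target from the original model

err_eq1 = np.linalg.norm(Y2_orig - H1p[:, keep1] @ W2_eq1)
err_eq5 = np.linalg.norm(Y2_orig - H1p[:, keep1] @ W2_eq5)
```

Because the Eqn. 5 fit minimizes the distance to the original output directly, `err_eq5` can never exceed `err_eq1` for the same kept channels.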

3.3. Pruning Multi-Branch Networks 
The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1 × 1, 3 × 3, 1 × 1, Fig. 3, left). Layers other than the first and last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows. 

<<FIGURE>>

Figure 3. Illustration of multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement, $c_x$ denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width can be reduced. As for the last layer, rather than approximating $Y_2$, we try to approximate <<Y1+Y2>> directly (Sec. 3.3 Last layer of residual branch). 
Last layer of residual branch: As shown in Fig. 3, the output layer of a residual block takes two inputs: feature maps $Y_1$ and $Y_2$ from the shortcut and the residual branch. We aim to recover $Y_1 + Y_2$ for this block. Here, $Y_1$, $Y_2$ are the original feature maps before pruning. $Y_2$ could be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so $Y_1$ cannot be recovered directly. To compensate for this error, the optimization goal of the last layer is changed from $Y_2$ to $Y_1 - Y_1' + Y_2$, which does not change our optimization. Here, $Y_1'$ is the current feature map after the previous layers are pruned. When pruning, volumes should be sampled correspondingly from these two branches. 
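The changed regression target can be illustrated on synthetic data. In this sketch the last layer of the residual branch is a plain linear map, and the way the shortcut error is generated (partly predictable from the branch input `X`) is an assumption chosen only to make the effect visible; none of the names come from the paper.

```python
import numpy as np

def fit(inputs, target):
    """Least squares weights mapping inputs to target."""
    W, *_ = np.linalg.lstsq(inputs, target, rcond=None)
    return W

rng = np.random.default_rng(0)
N, c, n = 400, 6, 5                          # hypothetical sizes
X = rng.normal(size=(N, c))                  # current input of the last branch layer
Y2 = X @ rng.normal(size=(c, n))             # original residual-branch output
Y1 = rng.normal(size=(N, n))                 # original shortcut output
# Shortcut output after earlier blocks were pruned (synthetic accumulated error).
Y1p = Y1 + 0.2 * X @ rng.normal(size=(c, n)) + 0.05 * rng.normal(size=(N, n))

W_naive = fit(X, Y2)                         # approximate Y2 only
W_comp = fit(X, Y1 - Y1p + Y2)               # compensated target

block_out = Y1 + Y2                          # what the block should produce
err_naive = np.linalg.norm(block_out - (Y1p + X @ W_naive))
err_comp = np.linalg.norm(block_out - (Y1p + X @ W_comp))
```

Fitting the compensated target is exactly the least squares problem for the block output `Y1p + X @ W`, so `err_comp` is never larger than `err_naive`.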
First layer of residual branch: As illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. The difference is that we sample the selected channels on the shared feature maps to construct a new input for the later convolution, shown in Fig. 3 (right). The computational cost of this operation can be ignored. More importantly, after introducing feature map sampling, the convolution is still "regular". 
Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply our Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original approach. From our experiments, it improves top-5 accuracy by 0.5% for 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it outputs "irregular" convolutional layers, which need special library implementation support. We do not adopt it in the following experiments. 

4. Experiment 

We evaluate our approach for the popular VGG Nets [43], ResNet [18], Xception [7] on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11]. 
For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for solver implementation. For channel pruning, we found that it is enough to extract 5000 images, and 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224 × 224 pixels. We could gain more performance with fine-tuning. We use a batch size of 128 and learning rate 1e-4. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224 × 224 and mirror. 
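Merging Batch Normalization into the preceding convolution is a standard inference-time identity. A minimal NumPy sketch is given below, assuming per-output-channel BN with fixed (running) statistics; the function name `fold_bn`, the argument names and the check with a 1 × 1 convolution are illustrative assumptions, not the authors' Caffe code.

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) + b - mean) / sqrt(var + eps) + beta into conv.

    W: (n, c, kh, kw) filters, b: (n,) bias; BN parameters have shape (n,).
    """
    s = gamma / np.sqrt(var + eps)            # per-output-channel scale
    return W * s[:, None, None, None], (b - mean) * s + beta

# Check the identity on a 1x1 convolution written as a matrix product.
rng = np.random.default_rng(0)
N, c, n = 10, 3, 4
x = rng.normal(size=(N, c))
W = rng.normal(size=(n, c, 1, 1))
b = rng.normal(size=n)
gamma, beta = rng.normal(size=n), rng.normal(size=n)
mean, var = rng.normal(size=n), rng.uniform(0.5, 2.0, size=n)

y_bn = gamma * ((x @ W[:, :, 0, 0].T + b) - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_bn(W, b, gamma, beta, mean, var)
y_folded = x @ W_f[:, :, 0, 0].T + b_f
```

After folding, the layer is a plain convolution followed by ReLU, which matches the setting the pruning formulation assumes.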

4.1. Experiments with VGG-16 

VGG-16 [43] is a 16-layer single-path convolutional neural network, with 13 convolutional layers. It is widely used in recognition, detection and segmentation, etc. Single view top-5 accuracy for VGG-16 is 89.9% (footnote 1). 

4.1.1 Single Layer Pruning 

In this subsection, we evaluate single layer acceleration performance using our algorithm in Sec. 3.1. For better understanding, we compare our algorithm with two naive channel selection strategies. first k selects the first k channels. max response selects channels based on corresponding filters that have a high absolute weight sum [31]. For fair comparison, we obtain the feature map indexes selected by each of them, then perform reconstruction (Sec. 3.1 (ii)). We hope that this could demonstrate the importance of channel selection. Performance is measured by the increase of error after a certain layer is pruned without fine-tuning, shown in Fig. 4. 
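The two naive baselines are simple to state in code. The sketch below is one possible reading: it assumes that the "corresponding filters" are the filters of the preceding layer that produce each channel, stored as `W_prev` with shape (c, c_prev, k_h, k_w). The function names and the tiny example are hypothetical.

```python
import numpy as np

def first_k(W_prev, k):
    """Keep the first k channels, regardless of the weights."""
    return np.arange(k)

def max_response(W_prev, k):
    """Keep the k channels whose producing filters have the largest sum of absolute weights."""
    score = np.abs(W_prev).sum(axis=(1, 2, 3))
    return np.sort(np.argsort(score)[::-1][:k])

# Four single-weight filters with absolute sums 0.1, 3.0, 2.0 and 0.5.
W_prev = np.array([0.1, -3.0, 2.0, 0.5]).reshape(4, 1, 1, 1)
sel_first = first_k(W_prev, 2)
sel_max = max_response(W_prev, 2)
```

Either index set can then be passed to the least squares reconstruction of Sec. 3.1 (ii), which is how the baselines are compared on equal terms.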
As expected, error increases as the speed-up ratio increases. Our approach is consistently better than other approaches in different convolutional layers under different speed-up ratios. Unexpectedly, sometimes max response is even worse than first k. We argue that max response ignores correlations between different filters. Filters with large absolute weight may have strong correlation. Thus selection based on filter weights is less meaningful. Correlation on feature maps is worth exploiting. We can find that channel selection affects reconstruction error a lot. Therefore, it is important for channel pruning. 

Footnote 1: http://www.vlfeat.org/matconvnet/pretrained/ 

<<FIGURE>>

Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify the importance of channel selection referred to in Sec. 3.1, we considered two naive baselines. first k selects the first k feature maps. max response selects channels based on the absolute sum of the corresponding filter weights [31]. Our approach is consistently better (smaller is better). 

<<TABLE>>

Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better). 
Also notice that channel pruning gradually becomes harder from shallower to deeper layers. This indicates that shallower layers have much more redundancy, which is consistent with [52]. We could prune more aggressively on shallower layers in whole model acceleration. 


4.1.2 Whole Model Pruning 
Whole model acceleration results under 2×, 4×, 5× are shown in Table 1. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively for shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1:1.5. conv5_x layers are not pruned, since they only contribute 9% of the computation in total and are not redundant. 
After fine-tuning, we could reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22], without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and the absolute speed-up ratio on GPU is much higher (Table 3). 
Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52]. As demonstrated in Table 2, our 3 cardinalities acceleration (spatial factorization, channel factorization, and channel pruning, denoted by 3C) outperforms the previous state of the art. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: <<FORMULA>>. 

<<TABLE>>

Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better). 
We apply spatial factorization, channel factorization, and our channel pruning together sequentially layer-by-layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality. 


4.1.3 Comparisons of Absolute Performance 
We further evaluate the absolute performance of acceleration on GPU. Results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN 5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged from 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead. They could not gain much absolute speed-up. Though our approach also encountered performance degradation, it generalizes better on GPU than other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because current libraries and hardware prefer a single large convolution instead of several small ones. 

4.1.4 Comparisons with Training from Scratch 
Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with from scratch counterparts. To be fair, we evaluated both the from scratch counterpart, and a normal setting network that has the same computational complexity and same architecture. 
As shown in Table 4, we observed that it is difficult for from scratch counterparts to reach competitive accuracy. Our model outperforms the from scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] showing that a model could be easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers. 
For from scratch (uniformed), the number of filters in each layer is reduced by half (e.g., reduce conv1_1 from 64 to 32). We can observe that normal setting networks of the same complexity could not reach the same accuracy either. This supports our idea that there is much redundancy in networks while training. However, redundancy can be removed at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches. 
Notice that there is a 0.6% gap between the from scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even for a thinner one. Further research could adapt our approach for thin model exploration. 

4.1.5 Acceleration for Detection 
VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16 for Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainable images and 5k test images. The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN. 
The actual running time of Faster R-CNN is 220ms / image. The convolutional layers contribute about 64%. We got an actual time of 94ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop for our 2× model, which is not harmful for practical consideration. 

4.2. Experiments with Residual Architecture Nets 
For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, 1 × 1 convolution is favored, which can hardly be factorized. 

4.2.1 ResNet Pruning 
ResNet complexity uniformly drops on each residual block. Guided by single layer experiments (Sec. 4.1.1), we still prefer reducing shallower layers more heavily than deeper ones. 
Following a similar setting to Filter pruning [31], we keep 70% channels for sensitive residual blocks (res5 and blocks close to the position where the spatial size changes, e.g. res3a, res3d). As for other blocks, we keep 30% channels. With multi-branch enhancement, we prune branch2a more aggressively within each residual block. The remaining channel ratios for branch2a, branch2b, branch2c are 2:4:3 (e.g., given 30%, we keep 40%, 80%, 60% respectively). 
We evaluate the performance of multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which could broadcast to every layer after it. In addition, the large input feature map width at the entry of each residual block is well reduced by our feature map sampling. 

<<TABLE>>

Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better). 

<<TABLE>>

Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms scratch trained counterparts (smaller is better). 

<<TABLE>>

Table 5. Acceleration for Faster R-CNN detection. 

<<TABLE>>

Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with multi-branch enhancement (Sec. 3.3, smaller is better). 
 
<<TABLE>>

Table 7. Comparisons for Xception-50, under 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches. Most structured simplification methods are not effective on the Xception architecture (smaller is better). 


4.2.2 Xception Pruning 
Since computational complexity becomes important in model design, separable convolution has been paid much attention [49, 7]. Xception [7] is already spatially optimized and tensor factorization on the 1 × 1 convolutional layer is destructive. Thanks to our approach, it could still be accelerated with graceful degradation. For ease of comparison, we adopt Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and complexity of 4450 MFLOPs. 
We apply multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Perhaps because the Xception block is unstable, Batch Normalization layers must be maintained during pruning. Otherwise it becomes nontrivial to fine-tune the pruned model. 
As shown in Table 7, after fine-tuning, we only suffer a 1.0% increase of error under 2×. Filter pruning [31] could also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%. After training for 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as results for VGG-16, since modern networks tend to have less redundancy by design. 

<<TABLE>>

Table 8. 2× speed-up comparisons for ResNet-56 on CIFAR-10. The baseline accuracy is 92.8% (one view). We outperform previous approaches and the scratch-trained counterpart (smaller is better).


4.2.3 Experiments on CIFAR-10 
Even though our approach is designed for large datasets, it generalizes well to small datasets. We perform experiments on the CIFAR-10 dataset [25], which is favored by much acceleration research. It consists of 50k images for training and 10k for testing in 10 classes.
We reproduce ResNet-56, which has an accuracy of 92.8% (as a reference, the official ResNet-56 [18] has an accuracy of 93.0%). For 2× acceleration, we follow a setting similar to Sec. 4.2.1 (keeping the final stage unchanged, where the spatial size is 8×8). As shown in Table 8, under 2× speed-up our approach is competitive with the scratch-trained one even without fine-tuning. After fine-tuning, our result is significantly better than Filter pruning [31] and the scratch-trained one.

5. Conclusion 
To conclude, current deep CNNs are accurate but have high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep networks. The reduced CNNs are inference-efficient networks that maintain accuracy and require only off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on ImageNet, CIFAR-10 and PASCAL VOC.
In the future, we plan to incorporate our approach into training time, instead of inference time only, which may also accelerate the training procedure.

References 
[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262–2270, 2016. 1, 2, 3, 6
[2] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015. 2
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016. 1, 2 
[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016. 2 
[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995. 3
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. 6
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016. 1, 2, 3, 4, 6, 7
[8] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016. 1, 2
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014. 2
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. 4, 6
[12] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015. 2
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014. 2
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016. 2
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. Eie: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016. 2
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015. 2
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015. 1, 2, 3
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 1, 2, 3, 4, 6, 8
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016. 2

[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 6
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4 
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014. 1, 2, 5, 6, 7 
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 4, 6
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015. 2 
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. 4, 8 
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2, 3
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015. 2 
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014. 2
[29] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015. 2
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 2, 3
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016. 1, 2, 4, 5, 6, 7, 8
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015. 2
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. 6
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015. 2 
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013. 2 
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010. 4
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53, 2008. 6
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 4
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015. 2
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016. 2
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015. 6
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. 6
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3, 4, 5, 6 
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015. 2 
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 1, 3, 6
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. 3
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014. 1, 2
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances In Neural Information Processing Systems, pages 2074–2082, 2016. 1, 2, 3
[49] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016. 7
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365–2369, 2013. 2
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016. 2 
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016. 1, 2, 3, 5, 6, 7
<|endoftext|>


<|startoftext|>
                                    Convex Neural Networks

                      Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
                                         Dept. IRO, Université de Montréal
                             P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
                              {bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca

                                                Abstract
                           Convexity has recently received a lot of attention in the machine learning
                           community, and the lack of convexity has been seen as a major disad-
                           vantage of many learning algorithms, such as multi-layer artiﬁcial neural
                           networks. We show that training multi-layer neural networks in which the
                           number of hidden units is learned can be viewed as a convex optimization
                           problem. This problem involves an inﬁnite number of variables, but can be
                           solved by incrementally inserting a hidden unit at a time, each time ﬁnding
                           a linear classiﬁer that minimizes a weighted sum of errors.

                     1 Introduction
                     The objective of this paper is not to present yet another learning algorithm, but rather to point
                     to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting
                     (Freund and Schapire, 1997) and convex optimization. Its main contributions concern the
                     mathematical analysis of an algorithm that is similar to previously proposed incremental NNs, with
                     L1 regularization on the output weights. This analysis helps to understand the underlying
                     convex optimization problem that one is trying to solve.
                     This paper was motivated by the unproven conjecture (based on anecdotal experience) that
                     when the number of hidden units is “large”, the resulting average error is rather insensitive to
                     the random initialization of the NN parameters. One way to justify this assertion is that to re-
                     ally stay stuck in a local minimum, one must have second derivatives positive simultaneously
                     in all directions. When the number of hidden units is large, it seems implausible for none of
                     them to offer a descent direction. Although this paper does not prove or disprove the above
                     conjecture, in trying to do so we found an interesting characterization of the optimization
                     problem for NNs as a convex program if the output loss function is convex in the NN out-
                     put and if the output layer weights are regularized by a convex penalty. More speciﬁcally,
                     if the regularization is the L1 norm of the output layer weights, then we show that a “rea-
                     sonable” solution exists, involving a ﬁnite number of hidden units (no more than the number
                     of examples, and in practice typically much less). We present a theoretical algorithm that
                     is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted
                     one at a time. Each insertion requires solving a weighted classiﬁcation problem, very much
                     like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason
                     et al., 2000; Friedman, 2001).
                     Neural Networks, Gradient Boosting, and Column Generation
                     Denote <<\tilde{x} \in R^{d+1}>> the extension of vector <<x \in R^d>> with one element with value 1. What
                     we call “Neural Network” (NN) here is a predictor for supervised learning of the form
                     <<FORMULA>> where x is an input vector, <<h_i(x)>> is obtained from a linear discriminant
                     function <<h_i>> <<FORMULA>> with e.g. <<s(a) = sign(a)>>, or <<s(a) = tanh(a)>> or
                     <<s(a) = 1/(1 + e^{-a})>>. A learning algorithm must specify how to select m, the <<w_i>>’s
                     and the <<v_i>>’s.
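As a minimal sketch of such a predictor, the snippet below assumes the usual one-hidden-layer form, in which the prediction is a weighted sum of hidden-unit outputs and each hidden unit applies s to the dot product of its weight vector with the extended input. The function names `sign` and `predict`, the treatment of `sign(0)`, and the example weights are illustrative assumptions, not taken from the paper.

```python
import math


def sign(a):
    # Hard-threshold activation; mapping a == 0 to +1 is an arbitrary choice.
    return 1.0 if a >= 0 else -1.0


def predict(x, hidden_weights, output_weights, s=sign):
    """Return the weighted sum of hidden-unit outputs s(v_i . x_tilde)."""
    x_tilde = list(x) + [1.0]  # extend x with a constant element equal to 1
    y_hat = 0.0
    for v, w in zip(hidden_weights, output_weights):
        activation = s(sum(vj * xj for vj, xj in zip(v, x_tilde)))
        y_hat += w * activation
    return y_hat


# Two hidden units on a 2-dimensional input (hypothetical weights).
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w = [0.5, -1.0]
y_hard = predict([2.0, -3.0], V, w)             # sign units
y_soft = predict([2.0, -3.0], V, w, math.tanh)  # tanh units
```

Choosing m, the output weights and the hidden weight vectors is exactly what the learning algorithm has to do; the sketch only evaluates a network whose parameters are already given.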

                     The classical solution (Rumelhart, Hinton and Williams, 1986) involves (a) selecting a loss
                     function <<Q(\hat{y}, y)>> that specifies how to penalize for mismatches between <<\hat{y}(x)>> and the
                     observed y’s (target output or target class), (b) optionally selecting a regularization penalty that
                     favors “small” parameters, and (c) choosing a method to approximately minimize the sum of
                     the losses on the training data <<D = \{(x_1, y_1), \ldots, (x_n, y_n)\}>> plus the regularization penalty.
                     Note that in this formulation, an output non-linearity can still be used, by inserting it in the
                     loss function Q. Examples of such loss functions are the quadratic loss <<\|\hat{y} - y\|^2>>, the hinge
                     loss <<FORMULA>> (used in SVMs), the cross-entropy loss <<FORMULA>>
                     (used in logistic regression), and the exponential loss <<FORMULA>> (used in Boosting).
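For reference, the four losses named above can be sketched as follows. The exact expressions are elided in this copy of the text, so the margin-based forms below (with targets in {-1, +1} for the last three) are the standard textbook definitions and should be read as assumptions; the function names are illustrative.

```python
import math


def quadratic_loss(y_hat, y):
    return (y_hat - y) ** 2


def hinge_loss(y_hat, y):
    # y in {-1, +1}; zero once the margin y * y_hat reaches 1.
    return max(0.0, 1.0 - y * y_hat)


def logistic_loss(y_hat, y):
    # Cross-entropy of a logistic output, written in margin form.
    return math.log1p(math.exp(-y * y_hat))


def exponential_loss(y_hat, y):
    return math.exp(-y * y_hat)
```

Each of these is convex in its first argument `y_hat`, which is the property the analysis in this paper relies on.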
                     Gradient Boosting has been introduced in (Friedman, 2001) and (Mason et al., 2000) as a
                     non-parametric greedy-stagewise supervised learning algorithm in which one adds a function
                     at a time to the current solution <<\hat{y}(x)>>, in a steepest-descent fashion, to form an additive model
                     as above but with the functions <<h_i>> typically taken in other kinds of sets of functions, such as
                     those obtained with decision trees. In a stagewise approach, when the (m+1)-th basis <<FORMULA>> is added,
                     only <<w_{m+1}>> is optimized (by a line search), like in matching pursuit algorithms. Such
                     a greedy-stagewise approach is also at the basis of Boosting algorithms (Freund and Schapire,
                     1997), which are usually applied using decision trees as bases and Q the exponential loss.
                     It may be difficult to minimize exactly for <<w_{m+1}>> and <<h_{m+1}>> when the previous bases and
                     weights are fixed, so (Friedman, 2001) proposes to “follow the gradient” in function space,
                     i.e., look for a base learner <<h_{m+1}>> that is best correlated with the gradient of the average
                     loss on the <<FORMULA>> (that would be the residue <<FORMULA>> in the case of the square loss). The
                     algorithm analyzed here also involves maximizing the correlation between <<Q'>> (the derivative
                     of Q with respect to its first argument, evaluated on the training predictions) and the next
                     basis <<h_{m+1}>>. However, we follow a “stepwise”, less greedy, approach, in which all the output
                     weights are optimized at each step, in order to obtain convergence guarantees.
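The "follow the gradient" selection step can be illustrated with a small sketch. It assumes a finite pool of candidate base learners, each represented only by its outputs on the training points, and it scores candidates by their correlation with the negative gradient (the descent direction). The names `pick_next_basis` and `squared_loss_grad` and the toy data are hypothetical.

```python
def squared_loss_grad(y_hat, y):
    # Derivative of (y_hat - y)^2 with respect to y_hat.
    return 2.0 * (y_hat - y)


def pick_next_basis(candidate_outputs, predictions, targets, dloss=squared_loss_grad):
    """Return (index, score) of the candidate best correlated with the negative gradient."""
    neg_grad = [-dloss(p, y) for p, y in zip(predictions, targets)]
    best_index, best_score = None, float("-inf")
    for index, outputs in enumerate(candidate_outputs):
        score = sum(g * h for g, h in zip(neg_grad, outputs))
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


targets = [1.0, -1.0, 1.0]
predictions = [0.0, 0.0, 0.0]  # current additive model predicts 0 everywhere
candidates = [
    [1.0, -1.0, 1.0],   # agrees with the residue on every example
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0],
]
best_index, best_score = pick_next_basis(candidates, predictions, targets)
```

With the square loss the negative gradient is proportional to the residue, so the first candidate, which matches the sign of every residue, is selected.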
                     Our approach adapts the Column Generation principle (Chvátal, 1983), a decomposition
                     technique initially proposed for solving linear programs with many variables and few
                     constraints. In this framework, active variables, or “columns”, are only generated as they are
                     required to decrease the objective. In several implementations, the column-generation sub-
                     problem is frequently a combinatorial problem for which efﬁcient algorithms are available.
                     In our case, the subproblem corresponds to determining an “optimal” linear classiﬁer.

                     2 Core Ideas
                     Informally, consider the set H of all possible hidden unit functions (i.e., of all possible hidden
                     unit weight vectors <<v_i>>). Imagine a NN that has all the elements in this set as hidden units. We
                     might want to impose precision limitations on those weights to obtain either a countable or
                     even a ﬁnite set. For such a NN, we only need to learn the output weights. If we end up with
                     a ﬁnite number of non-zero output weights, we will have at the end an ordinary feedforward
                     NN. This can be achieved by using a regularization penalty on the output weights that yields
                     sparse solutions, such as the L1 penalty. If in addition the loss function is convex in the output
                     layer weights (which is the case of squared error, hinge loss, ε-tube regression loss, and
                     logistic or softmax cross-entropy), then it is easy to show that the overall training criterion
                     is convex in the parameters (which are now only the output weights). The only problem is
                     that there are as many variables in this convex program as there are elements in the set H,
                     which may be very large (possibly inﬁnite). However, we ﬁnd that with L1 regularization,
                     a ﬁnite solution is obtained, and that such a solution can be obtained by greedily inserting
                     one hidden unit at a time. Furthermore, it is theoretically possible to check that the global
                     optimum has been reached.

                     Definition 2.1. Let H be a set of functions from an input space X to R. Elements of H
                     can be understood as “hidden units” in a NN. Let W be the Hilbert space of functions from
                     H to R, with an inner product denoted by <<FORMULA>>. An element of W can be
                     understood as the output weights vector in a neural network. Let <<h(x): H -> R>> be the function
                     that maps any element <<h_i>> of H to <<h_i(x)>>. <<h(x)>> can be understood as the vector of activations
                     of hidden units when input x is observed. Let <<w \in W>> represent a parameter (the output
                     weights). The NN prediction is denoted <<FORMULA>>. Let <<Q: R \times R -> R>> be a
                     cost function convex in its first argument that takes a scalar prediction <<\hat{y}(x)>> and a scalar
                     target value y and returns a scalar cost. This is the cost to be minimized on example pair
                     (x, y). Let <<FORMULA>> be the training set. Let <<FORMULA>> be a convex
                     regularization functional that penalizes for the choice of more “complex” parameters (e.g.,
                     <<FORMULA>> according to a 1-norm in W, if H is countable). We define the convex NN
                     criterion <<C(H, Q, \Omega, D, w)>> with parameter w as follows:

                                  <<FORMULA>>          (1)
 
                     The following is a trivial lemma, but it is conceptually very important as it is the basis for the
                     rest of the analysis in this paper.

                     Lemma 2.2. The convex NN cost <<FORMULA>> is a convex function of w.
                     Proof. <<FORMULA>> is convex in w and the regularization functional <<\Omega>> is convex in w, by the
                     above construction. C is additive in <<FORMULA>> and additive in <<\Omega>>. Hence C is convex in w.
                     Note that there are no constraints in this convex optimization program, so that at the global
                     minimum all the partial derivatives of C with respect to elements of w cancel.
                     Let |H| be the cardinality of the set H. If it is not finite, it is not obvious that an optimal
                     solution can be achieved in ﬁnitely many iterations.
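The convexity claim can be probed numerically on a toy instance. The sketch below assumes a finite set of hidden units (given by a table of their outputs on the training points), the hinge loss, and an L1 penalty with a hypothetical coefficient `lam`; it samples random pairs of weight vectors and measures the largest violation of the convexity inequality. This is an illustration, not a proof, and all names and data are assumptions.

```python
import random


def hinge(y_hat, y):
    return max(0.0, 1.0 - y * y_hat)


def convex_nn_cost(w, hidden_outputs, targets, lam):
    """L1 penalty on the output weights plus the summed hinge loss."""
    penalty = lam * sum(abs(wi) for wi in w)
    loss = sum(
        hinge(sum(wi * hi for wi, hi in zip(w, row)), y)
        for row, y in zip(hidden_outputs, targets)
    )
    return penalty + loss


def max_convexity_violation(hidden_outputs, targets, lam, trials=200, seed=0):
    """Largest observed C(mix) - (theta*C(w1) + (1-theta)*C(w2)); 0.0 if none."""
    rng = random.Random(seed)
    m = len(hidden_outputs[0])
    worst = 0.0
    for _ in range(trials):
        w1 = [rng.uniform(-2.0, 2.0) for _ in range(m)]
        w2 = [rng.uniform(-2.0, 2.0) for _ in range(m)]
        theta = rng.random()
        mix = [theta * a + (1.0 - theta) * b for a, b in zip(w1, w2)]
        lhs = convex_nn_cost(mix, hidden_outputs, targets, lam)
        rhs = (theta * convex_nn_cost(w1, hidden_outputs, targets, lam)
               + (1.0 - theta) * convex_nn_cost(w2, hidden_outputs, targets, lam))
        worst = max(worst, lhs - rhs)
    return worst


# Row t holds the outputs of three fixed hidden units on training point t.
H = [[1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
violation = max_convexity_violation(H, y, lam=0.1)
```

Up to floating-point rounding, `violation` stays at zero, as expected for a sum of convex functions of w.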

                     Lemma 2.2 says that training NNs from a very large class (with one or more hidden layer)
                     can be seen as convex optimization problems, usually in a very high dimensional space, as
                     long as we allow the number of hidden units to be selected by the learning algorithm.
                     By choosing a regularizer that promotes sparse solutions, we obtain a solution that has a
                     ﬁnite number of “active” hidden units (non-zero entries in the output weights vector w).
                     This assertion is proven below, in Theorem 3.1, for the case of the hinge loss.
                     However, even if the solution involves a ﬁnite number of active hidden units, the convex
                     optimization problem could still be computationally intractable because of the large number
                     of variables involved. One approach to this problem is to apply the principles already suc-
                     cessfully embedded in Gradient Boosting, but more speciﬁcally in Column Generation (an
                     optimization technique for very large scale linear programs), i.e., add one hidden unit at a
                     time in an incremental fashion. The important ingredient here is a way to know that we
                     have reached the global optimum, so that we do not need to actually visit all the possible
                     hidden units. We show that this can be achieved as long as we can solve the sub-problem
                     of ﬁnding a linear classiﬁer that minimizes the weighted sum of classiﬁcation errors. This
                     can be done exactly only on low dimensional data sets but can be well approached using
                     weighted linear SVMs, weighted logistic regression, or Perceptron-type algorithms.
                     Another idea (not followed up here) would be to consider ﬁrst a smaller set H1 , for which
                     the convex problem can be solved in polynomial time, and whose solution can theoretically
                     be selected as initialization for minimizing the criterion <<FORMULA>>, with <<FORMULA>>,
                     and where H2 may have inﬁnite cardinality (countable or not). In this way we could show
                     that we can ﬁnd a solution whose cost satisﬁes <<FORMULA>>,
                     i.e., is at least as good as the solution of a more restricted convex optimization problem. The
                     second minimization can be performed with a local descent algorithm, without the necessity
                     to guarantee that the global optimum will be found.

                     3 Finite Number of Hidden Neurons
                     In this section we consider the special case with <<FORMULA>> the hinge loss,
                     and <<L1>> regularization, and we show that the global optimum of the convex cost involves at
                     most n+1 hidden neurons, using an approach already exploited in (Rätsch, Demiriz and
                     Bennett, 2002) for L1-loss regression Boosting with L1 regularization of output weights.
                     The training criterion is <<FORMULA>>. Let us rewrite this cost function as the
                     constrained optimization problem:
                     
                                          <<FORMULA>>      (C1)
                  
                                          <<FORMULA>>      (C2)

                     Using a standard technique, the above program can be recast as a linear program. Defining
                     <<FORMULA>> the vector of Lagrangian multipliers for the constraints C1, its dual
                     problem (P) takes the form (in the case of a finite number J of base learners):
                     
                                          <<FORMULA>>
                          
                     In the case of a finite number J of base learners, <<FORMULA>>. If
                     the number of hidden units is uncountable, then I is a closed bounded interval of R.
                     Such an optimization problem satisﬁes all the conditions needed for using Theorem 4.2
                     from (Hettich and Kortanek, 1993). Indeed:
                     I is compact (as a closed bounded interval of R);
                     <<FORMULA>> is a concave function (it is even a linear function);
                     <<FORMULA>> is convex in the multiplier vector <<\lambda>> (it is actually linear in <<\lambda>>);
                     <<FORMULA>> (therefore finite) (the value of (P) is the largest value of F satisfying the constraints);
                     for every set of n+1 points <<FORMULA>>, there exists <<\tilde{\lambda}>> such that <<FORMULA>> for
                     <<FORMULA>> (one can take <<FORMULA>> since K>0).

                     Then, from Theorem 4.2 from (Hettich and Kortanek, 1993), the following theorem holds:
                     Theorem 3.1. The solution of (P) can be attained with constraints <<C'_2>> and only n+1 constraints <<C'_1>>
                     (i.e., there exists a subset of n+1 constraints <<C'_1>> giving rise to the same maximum
                     as when using the whole set of constraints). Therefore, the primal problem associated is the
                     minimization of the cost function of a NN with n+1 hidden neurons.

                     4 Incremental Convex NN Algorithm
                     In this section we present a stepwise algorithm to optimize a NN, and show that there is a
                     criterion that allows one to verify whether the global optimum has been reached. This is a
                     specialization of minimizing <<FORMULA>>, with <<FORMULA>> and <<FORMULA>>
                     is the set of soft or hard linear classifiers (depending on choice of <<s(\cdot)>>).
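One standard way to see why a global-optimality test is possible with an L1 penalty is the following short derivation. It is a sketch under stated assumptions rather than the paper's own argument: Q is taken to be differentiable in its first argument (the hinge loss would need subgradients instead), \lambda > 0 denotes the L1 coefficient, q_t denotes Q' evaluated at the current prediction for example t, and h_j is a candidate hidden unit whose current output weight is zero. The symbols \lambda, q_t, \eta and e_j are introduced here for illustration.

```latex
\begin{align*}
C(w) &= \lambda \lVert w \rVert_1 + \sum_{t=1}^{n} Q\bigl(w \cdot h(x_t),\, y_t\bigr), \\
\left.\frac{d}{d\eta}\, C(w + \eta e_j)\right|_{\eta = 0^{+}}
     &= \lambda + \sum_{t=1}^{n} q_t\, h_j(x_t),
     \qquad q_t = Q'\bigl(w \cdot h(x_t),\, y_t\bigr). \\
\intertext{Giving $h_j$ a small positive weight therefore decreases $C$ only if}
\sum_{t=1}^{n} q_t\, h_j(x_t) &< -\lambda . \\
\intertext{When $H$ is closed under negation, no zero-weight unit can decrease $C$ exactly when}
\max_{h \in H} \Bigl|\sum_{t=1}^{n} q_t\, h(x_t)\Bigr| &\le \lambda .
\end{align*}
```

Under these assumptions, checking the last inequality only requires finding the single hidden unit that is most correlated with the gradient, which is why an exact solver for that sub-problem is enough to certify optimality without enumerating all of H.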

                                        Algorithm ConvexNN(D, Q, λ, s)

                                                <<ALGORITHM>>
                     
                     Theorem 4.1. Algorithm ConvexNN stops when it reaches the global optimum of

                                      <<FORMULA>>.

                     Proof. Let w be the output weights vector when the algorithm stops. Because the set of
                     hidden units H we consider is such that when h is in H, -h is also in H, we can assume
                     all weights to be non-negative. By contradiction, if <<w' \neq w>> is the global optimum, with
                     <<C(w') < C(w)>>, then, since C is convex in the output weights, for any <<\epsilon \in (0,1)>>, we have
                     <<FORMULA>>. For
                     <<\epsilon>> small enough, we can assume all weights in w that are strictly positive to be also strictly
                     positive in <<w_\epsilon = \epsilon w' + (1 - \epsilon) w>>. Let us denote by <<I_p>> the set of strictly positive
                     weights in w (and <<w_\epsilon>>), by <<I_z>> the set of weights set to zero in w but to a non-zero value
                     in <<w_\epsilon>>, and by <<\delta_{\epsilon,k}>> the difference <<w_{\epsilon,k} - w_k>> in the weight of hidden
                     unit <<h_k>> between w and <<w_\epsilon>>. We can assume <<\delta_{\epsilon,j} < 0>> for
                     <<j \in I_z>>, because instead of setting a small positive weight to <<h_j>>, one can decrease the weight
                     of <<-h_j>> by the same amount, which will give either the same cost, or possibly a lower one
                     when the weight of <<FORMULA>> is positive. With <<o(\epsilon)>> denoting a quantity such that
                     <<\epsilon^{-1} o(\epsilon) \to 0>> when <<\epsilon \to 0>>, the difference
                     <<\Delta_\epsilon(w) = C(w_\epsilon) - C(w)>> can now be written:

                                       <<FORMULA>>

                     since for <<i \in I_p>>, thanks to step (7) of the algorithm, we have <<\frac{\partial C}{\partial w_i}(w) = 0>>. Thus the
                     inequality <<FORMULA>> rewrites into <<FORMULA>>
                     which, when <<\epsilon \to 0>>, yields (note that <<FORMULA>> does not depend on <<\epsilon>> since <<\delta_{\epsilon,j}>> is linear in <<\epsilon>>):

                                      <<FORMULA>>             (2)

                     <<h_i>> being the optimal classifier chosen in step (5a) or (5c), all hidden units <<FORMULA>> verify <<FORMULA>>

                                       <<FORMULA>>

                     <<FORMULA>> , contradicting eq. 2.

                     (Mason et al., 2000) prove a related global convergence result for the AnyBoost algorithm,
                     a non-parametric Boosting algorithm that is also similar to Gradient Boosting (Friedman,
                     2001). Again, this requires solving as a sub-problem an exact minimization to ﬁnd a function
                     <<h_i \in H>> that is maximally correlated with the gradient <<Q'>> on the output. We now show a
                     simple procedure to select a hyperplane with the best weighted classiﬁcation error.
                     Exact Minimization                     
                     In step (5a) we are required to ﬁnd a linear classiﬁer that minimizes the weighted sum of
                     classification errors. Unfortunately, this is an NP-hard problem (w.r.t. d, see Theorem 4
                     in Marcotte and Savard (1992)). However, an exact solution can be easily found in O(n^3)
                     computations for d = 2 inputs.

                     Proposition 4.2. Finding a linear classifier that minimizes the weighted sum of classification
                     errors can be achieved in O(n^3) steps when the input dimension is d = 2.
                     Proof. We want to <<FORMULA>> with respect to u and b, the c's being
                     in <<FORMULA>>. Consider u fixed and sort the x_i's according to their dot product with u and denote r
                     the function which maps i to r(i) such that x_{r(i)} is in i-th position in the sort. Depending on
                     the value of b, we will have n+1 possible sums, respectively <<FORMULA>>,
                     <<FORMULA>>. It is obvious that those sums only depend on the order of the products <<FORMULA>>,
                     <<FORMULA>>. When u varies smoothly on the unit circle, as the dot product is a continuous
                     function of its arguments, the changes in the order of the dot products will occur only when
                     there is a pair (i, j) such that <<FORMULA>>. Therefore, there are at most as many order
                     changes as there are pairs of different points, i.e., <<FORMULA>>. In the case of d = 2, we
                     can enumerate all the different angles for which there is a change, namely a_1, ..., a_z with
                     <<FORMULA>>. We then need to test at least one <<FORMULA>> for each interval
                     <<FORMULA>>, and also one u for <<FORMULA>>, which makes a total of <<FORMULA>> possibilities. □
                     It is possible to generalize this result to higher dimensions, and as shown in Marcotte and
                     Savard (1992), one can achieve O(log(n) n^d) time.
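
The enumeration argument in the proof of Proposition 4.2 can be sketched in Python. This is only an illustrative sketch, not the paper's Algorithm 1: the function name `best_linear_classifier_2d` is hypothetical, labels are assumed to lie in {-1, +1}, costs are assumed non-negative, and the points are assumed distinct. It re-sorts the projections for every candidate direction, so it runs in O(n^3 log n) rather than the O(n^3) of the proposition, which would require updating the order incrementally.

```python
import numpy as np

def best_linear_classifier_2d(X, y, c):
    """Search for (u, b) minimizing sum_i c_i * [sign(u.x_i + b) != y_i] in 2-D."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    c = np.asarray(c, dtype=float)
    n = len(X)

    # Critical angles: directions u for which two points have equal dot products.
    critical = []
    for i in range(n):
        for j in range(i + 1, n):
            d = X[i] - X[j]
            a = np.arctan2(d[0], -d[1])  # u = (cos a, sin a) is orthogonal to d
            critical += [a % (2 * np.pi), (a + np.pi) % (2 * np.pi)]
    critical = np.unique(np.round(critical, 12))

    # One test direction inside each interval between consecutive critical angles.
    if len(critical) == 0:
        candidates = np.array([0.0])
    else:
        nxt = np.roll(critical, -1)
        nxt[-1] += 2 * np.pi
        candidates = (critical + nxt) / 2

    best_err, best_u, best_b = np.inf, None, None
    for theta in candidates:
        u = np.array([np.cos(theta), np.sin(theta)])
        proj = X @ u
        order = np.argsort(proj)
        ps, ys, cs = proj[order], y[order], c[order]
        # err[k]: the k smallest projections are predicted -1, the others +1.
        pos_cost = np.concatenate(([0.0], np.cumsum(cs * (ys == 1))))
        neg_cost = np.concatenate(([0.0], np.cumsum(cs * (ys == -1))))
        err = pos_cost + (neg_cost[-1] - neg_cost)
        k = int(np.argmin(err))
        if err[k] < best_err:
            if k == 0:
                t = ps[0] - 1.0
            elif k == n:
                t = ps[-1] + 1.0
            else:
                t = 0.5 * (ps[k - 1] + ps[k])
            best_err, best_u, best_b = float(err[k]), u, -t
    return best_u, best_b, best_err
```

The returned classifier predicts +1 when `u @ x + b > 0`. On an XOR-like configuration no linear classifier is error-free, so the search returns the cheapest single mistake.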

                     Algorithm 1 Optimal linear classiﬁer search
                    
                                       <<ALGORITHM>>

                     Approximate Minimization

                     For data in higher dimensions, the exact minimization scheme to ﬁnd the optimal linear
                     classiﬁer is not practical. Therefore it is interesting to consider approximate schemes for
                     obtaining a linear classiﬁer with weighted costs. Popular schemes for doing so are the linear
                     SVM (i.e., linear classiﬁer with hinge loss), the logistic regression classiﬁer, and variants of
                     the Perceptron algorithm. In that case, step (5c) of the algorithm is not an exact minimization,
                     and one cannot guarantee that the global optimum will be reached. However, it might be
                     reasonable to believe that ﬁnding a linear classiﬁer by minimizing a weighted hinge loss
                     should yield solutions close to the exact minimization. Unfortunately, this is not generally
                     true, as we have found out on a simple toy data set described below. On the other hand,
                     if in step (7) one performs an optimization not only of the output weights w_j (j ≤ i) but
                     also of the corresponding weight vectors v_j, then the algorithm finds a solution close to the
                     global optimum (we could only verify this on 2-D data sets, where the exact solution can be
                     computed easily). It means that at the end of each stage, one first performs a few training
                     iterations of the whole NN (for the hidden units j ≤ i) with an ordinary gradient descent
                     mechanism (we used conjugate gradients but stochastic gradient descent would work too),
                     optimizing the wj ’s and the vj ’s, and then one ﬁxes the vj ’s and obtains the optimal wj ’s for
                     these vj ’s (using a convex optimization procedure). In our experiments we used a quadratic                     
                     Q, for which the optimization of the output weights can be done with a neural network, using
                     the outputs of the hidden layer as inputs.

                     Let us consider now a bit more carefully what it means to tune the v_j's in step (7). Indeed,
                     changing the weight vector v_j of a selected hidden neuron to decrease the cost is equivalent
                     to a change in the output weights w's. More precisely, consider the step in which the
                     value of v_j becomes v′_j. This is equivalent to the following operation on the w's, when w_j
                     is the corresponding output weight value: the output weight associated with the value v_j of
                     a hidden neuron is set to 0, and the output weight associated with the value v′_j of a hidden
                     neuron is set to w_j. This corresponds to an exchange between two variables in the convex
                     program. We are justiﬁed to take any such step as long as it allows us to decrease the cost
                     C(w). The fact that we are simultaneously making such exchanges on all the hidden units
                     when we tune the vj ’s allows us to move faster towards the global optimum.
                     Extension to multiple outputs
                     The multiple outputs case is more involved than the single-output case because it is not
                     enough to check the condition <<FORMULA>>. Consider a new hidden neuron whose output is
                     h_i when the input is x_i. Let us also denote <<FORMULA>> the vector of output weights
                     between the new hidden neuron and the <<FORMULA>> output neurons. The gradient with respect to w_j
                     is <<FORMULA>> with <<FORMULA>> the value of the j-th output neuron with input <<FORMULA>>.
                     This means that if, for a given j, we have <<FORMULA>>, moving w_j away from 0 can
                     only increase the cost. Therefore, the right quantity to consider is <<FORMULA>>.
                     We must therefore find <<FORMULA>>. As before, this sub-problem is not convex, but it is not
                     as obvious how to approximate it by a convex problem. The stopping criterion becomes: if there is no j
                     such that <<FORMULA>>, then all weights must remain equal to 0 and a global minimum is reached.

                     Experimental Results
                     We performed experiments on the 2-D double moon toy dataset (as used in Delalleau, Bengio
                     and Le Roux (2005)), to be able to compare with the exact version of the algorithm. In
                     these experiments, <<FORMULA>>. The set-up is the following:

                     Select a new linear classifier, either (a) the optimal one or (b) an approximate one using logistic
                     regression.
                     Optimize the output weights using a convex optimizer.
                     In case (b), tune both input and output weights by conjugate gradient descent on C and
                     finally re-optimize the output weights using LASSO regression.
                     Optionally, remove neurons whose output weight has been set to 0.
                     Using the approximate algorithm yielded for 100 training examples an average penalized
                     (λ = 1) squared error of 17.11 (over 10 runs), an average test classification error of 3.68%
                     and an average number of neurons of 5.5. The exact algorithm yielded a penalized squared
                     error of 8.09, an average test classification error of 5.3%, and required 3 hidden neurons. A
                     penalty of λ = 1 was nearly optimal for the exact algorithm whereas a smaller penalty further
                     improved the test classification error of the approximate algorithm. Besides, when running
                     the approximate algorithm for a long time, it converges to a solution whose quadratic error is
                     extremely close to the one of the exact algorithm.

                     5 Conclusion
                     We have shown that training a NN can be seen as a convex optimization problem, and have
                     analyzed an algorithm that can exactly or approximately solve this problem. We have shown
                     that the solution with the hinge loss involved a number of non-zero weights bounded by
                     the number of examples, and much smaller in practice. We have shown that there exists a
                     stopping criterion to verify if the global optimum has been reached, but it involves solving a
                     sub-learning problem involving a linear classiﬁer with weighted errors, which can be computationally 
                     hard if the exact solution is sought, but can be easily implemented for toy data
                     sets (in low dimension), for comparing exact and approximate solutions.
                     The above experimental results are in agreement with our initial conjecture: when there are
                     many hidden units we are much less likely to stall in the optimization procedure, because
                     there are many more ways to descend on the convex cost C(w). They also suggest, based
                     on experiments in which we can compare with the exact sub-problem minimization, that
                     applying Algorithm ConvexNN with an approximate minimization for adding each hidden
                     unit while continuing to tune the previous hidden units tends to lead to fast convergence
                     to the global minimum. What can get us stuck in a “local minimum” (in the traditional sense,
                     i.e., of optimizing w’s and v’s together) is simply the inability to ﬁnd a new hidden unit
                     weight vector that can improve the total cost (ﬁt and regularization term) even if there
                     exists one.

                     Note that as a side-effect of the results presented here, we have a simple way to train neural
                     networks with hard-threshold hidden units, since increasing <<FORMULA>> can be either achieved
                     exactly (at great price) or approximately (e.g. by using a cross-entropy
                     or hinge loss on the corresponding linear classifier).

                     Acknowledgments

                     The authors thank the following for support: NSERC, MITACS, and the Canada Research
                     Chairs. They are also grateful for the feedback and stimulating exchanges with Sam Roweis,
                     Nathan Srebro, and Aaron Courville.

                     References

                     Chvátal, V. (1983). Linear Programming. W.H. Freeman.
                     Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction
                        in semi-supervised learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of
                        AISTATS'2005, pages 96–103.
                     Freund, Y. and Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an
                        application to boosting. Journal of Computer and System Science, 55(1):119–139.
                     Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of
                        Statistics, 29:1180.
                     Hettich, R. and Kortanek, K. (1993). Semi-infinite programming: theory, methods, and applications.
                        SIAM Review, 35(3):380–429.
                     Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für
                        Operations Research (Theory), 36:517–545.
                     Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Boosting algorithms as gradient descent.
                        In Advances in Neural Information Processing Systems 12, pages 512–518.
                     Rätsch, G., Demiriz, A., and Bennett, K. P. (2002). Sparse regression ensembles in infinite and finite
                        hypothesis spaces. Machine Learning.
                     Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating
                        errors. Nature, 323:533–536.
<|endoftext|>


<|startoftext|>                  
                  DEEP COMPRESSION: COMPRESSING DEEP NEURAL
                  NETWORKS WITH PRUNING, TRAINED QUANTIZATION
                  AND HUFFMAN CODING


                  Song Han
                  Stanford University, Stanford, CA 94305, USA
                  songhan@stanford.edu

                  Huizi Mao
                  Tsinghua University, Beijing, 100084, China
                  mhz12@mails.tsinghua.edu.cn

                  William J. Dally
                  Stanford University, Stanford, CA 94305, USA
                  NVIDIA, Santa Clara, CA 95050, USA
                  dally@stanford.edu


                                              ABSTRACT

                       Neural networks are both computationally intensive and memory intensive, making
                       them difficult to deploy on embedded systems with limited hardware resources. To
                       address this limitation, we introduce "deep compression", a three-stage pipeline
                       of pruning, trained quantization and Huffman coding that work together to reduce
                       the storage requirement of neural networks by 35× to 49× without affecting their
                       accuracy. Our method first prunes the network by learning only the important
                       connections. Next, we quantize the weights to enforce weight sharing. Finally, we
                       apply Huffman coding. After the first two steps we retrain the network to fine-tune
                       the remaining connections and the quantized centroids. Pruning reduces the
                       number of connections by 9× to 13×; quantization then reduces the number of
                       bits that represent each connection from 32 to 5. On the ImageNet dataset, our
                       method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB,
                       without loss of accuracy. Our method reduced the size of VGG-16 by 49×, from
                       552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model
                       into on-chip SRAM cache rather than off-chip DRAM memory. Our compression
                       method also facilitates the use of complex neural networks in mobile applications
                       where application size and download bandwidth are constrained. Benchmarked on
                       CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup
                       and 3× to 7× better energy efficiency.


                  1 INTRODUCTION

                 Deep neural networks have become the state-of-the-art technique for computer vision tasks
                 (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014). Though these neural networks are very
                 powerful, the large number of weights consumes considerable storage and memory bandwidth. For
                 example, the AlexNet Caffemodel is over 200MB, and the VGG-16 Caffemodel is over 500MB
                 (BVLC). This makes it difficult to deploy deep neural networks on mobile systems.
                 First, for many mobile-first companies such as Baidu and Facebook, various apps are updated via
                 different app stores, and they are very sensitive to the size of the binary files. For example, the App
                 Store has the restriction "apps above 100 MB will not download until you connect to Wi-Fi". As a
                 result, a feature that increases the binary size by 100MB will receive much more scrutiny than one
                 that increases it by 10MB. Although having deep neural networks running on mobile has many great
                 features such as better privacy, less network bandwidth and real-time processing, the large storage
                 overhead prevents deep neural networks from being incorporated into mobile apps.

                                                 <<FIGURE>>

                 Figure 1: The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning
                 reduces the number of weights by 10×, while quantization further improves the compression rate:
                 between 27× and 31×. Huffman coding gives more compression: between 35× and 49×. The
                 compression rate already includes the meta-data for sparse representation. The compression scheme
                 doesn't incur any accuracy loss.

                 The second issue is energy consumption. Running large neural networks requires a lot of memory
                 bandwidth to fetch the weights and a lot of computation to do dot products, which in turn consumes
                 considerable energy. Mobile devices are battery constrained, making power-hungry applications such
                 as deep neural networks hard to deploy.
                 Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32-bit
                 floating point add consumes 0.9pJ, a 32-bit SRAM cache access takes 5pJ, while a 32-bit DRAM
                 memory access takes 640pJ, which is 3 orders of magnitude more than an add operation. Large networks
                 do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion
                 connection neural network, for example, at 20fps would require (20Hz)(1G)(640pJ) = 12.8W just
                 for DRAM access, well beyond the power envelope of a typical mobile device.
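
The DRAM power estimate above is a one-line calculation. The short sketch below reproduces it from the quoted per-access energies; the variable names are illustrative, and it assumes one 32-bit DRAM access per connection per frame, as the text's formula does.

```python
# Energy per operation under 45nm CMOS, as quoted in the text (in joules).
PJ = 1e-12
add_energy = 0.9 * PJ      # 32-bit floating point add
sram_energy = 5 * PJ       # 32-bit SRAM cache access
dram_energy = 640 * PJ     # 32-bit DRAM memory access

connections = 1e9          # 1 billion connections, one weight fetch each
frames_per_second = 20     # 20 fps

# Power = accesses per second * energy per access.
dram_power_watts = frames_per_second * connections * dram_energy
dram_vs_add = dram_energy / add_energy   # roughly 700x, i.e. about 3 orders of magnitude

print(f"DRAM power: {dram_power_watts:.1f} W")
```

If the same weights were served from on-chip SRAM at 5pJ per access, the same calculation would give 0.1W, which is the motivation for making the model small enough to cache on chip.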
                 Our goal is to reduce the storage and energy required to run inference on such large networks so they
                 can be deployed on mobile devices. To achieve this goal, we present "deep compression": a
                 three-stage pipeline (Figure 1) to reduce the storage required by a neural network in a manner that
                 preserves the original accuracy. First, we prune the network by removing the redundant connections,
                 keeping only the most informative connections. Next, the weights are quantized so that multiple
                 connections share the same weight; thus only the codebook (effective weights) and the indices need
                 to be stored. Finally, we apply Huffman coding to take advantage of the biased distribution of
                 effective weights.
                 Our main insight is that pruning and trained quantization are able to compress the network without
                 interfering with each other, thus leading to a surprisingly high compression rate. It makes the required
                 storage so small (a few megabytes) that all weights can be cached on chip instead of going to off-chip
                 DRAM, which is energy consuming. Based on "deep compression", the EIE hardware accelerator
                 (Han et al., 2016) was later proposed; it works on the compressed model, achieving significant
                 speedup and energy efficiency improvement.

                  2 NETWORK PRUNING

                 Network pruning has been widely studied to compress CNN models. In early work, network pruning
                 proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989;
                 Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently, Han et al. (2015) pruned
                 state-of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on
                 the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we
                 prune the small-weight connections: all connections with weights below a threshold are removed
                 from the network. Finally, we retrain the network to learn the final weights for the remaining sparse
                 connections. Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16
                 models, respectively.
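
The prune-then-retrain step can be illustrated with a small NumPy sketch. It assumes that "below a threshold" refers to the absolute value of a weight and uses a plain SGD update as a stand-in for retraining; the function names are hypothetical and are not taken from the authors' Caffe implementation.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out connections whose absolute weight is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def masked_sgd_step(weights, grad, mask, lr):
    """One retraining step that leaves pruned connections at zero."""
    return (weights - lr * grad) * mask

# Toy 2x2 layer: two small weights get pruned, the others keep training.
W = np.array([[0.5, -0.01],
              [0.02, -0.8]])
W_pruned, mask = prune_by_magnitude(W, threshold=0.1)
W_next = masked_sgd_step(W_pruned, grad=np.ones_like(W), mask=mask, lr=0.1)
```

Keeping the mask around is what prevents pruned connections from being revived by later gradient updates.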

                                                <<FIGURE>>

                 Figure 2: Representing the matrix sparsity with relative index. Padding ﬁller zero to prevent overﬂow.

                                                <<FIGURE>>

                     Figure 3: Weight sharing by scalar quantization (top) and centroids ﬁne-tuning (bottom).


                 We store the sparse structure that results from pruning using compressed sparse row (CSR) or
                 compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number
                 of non-zero elements and n is the number of rows or columns.
                 To compress further, we store the index difference instead of the absolute position, and encode this
                 difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger
                 than the bound, we use the zero padding solution shown in Figure 2: when the difference exceeds
                 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
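
The relative-index scheme with filler zeros can be sketched as follows. This is a minimal illustration, not the authors' storage format: the function names are hypothetical, the bound is passed directly as `max_diff` (8 in the text's 3-bit example), and differences are measured from a virtual position -1 so that every stored difference is at least 1.

```python
def encode_relative(dense, max_diff):
    """Store non-zeros as (index difference, value) pairs, padding long gaps."""
    diffs, values = [], []
    prev = -1  # assumption: differences are counted from a virtual position -1
    for idx, v in enumerate(dense):
        if v == 0:
            continue
        # Gap too large for the bound: insert filler zeros until it fits.
        while idx - prev > max_diff:
            prev += max_diff
            diffs.append(max_diff)
            values.append(0.0)
        diffs.append(idx - prev)
        values.append(v)
        prev = idx
    return diffs, values

def decode_relative(diffs, values, length):
    """Rebuild the dense vector from the relative encoding."""
    dense = [0.0] * length
    pos = -1
    for d, v in zip(diffs, values):
        pos += d
        dense[pos] = v  # filler entries simply write a zero
    return dense

dense = [0.0, 3.4, 0.0, 0.0, 0.9] + [0.0] * 10 + [1.7]
diffs, values = encode_relative(dense, max_diff=8)
restored = decode_relative(diffs, values, len(dense))
```

In this example the gap of 11 between positions 4 and 15 exceeds the bound, so one filler zero is stored at position 12 before the final entry.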

                  3 TRAINED QUANTIZATION AND WEIGHT SHARING

                 Network quantization and weight sharing further compress the pruned network by reducing the
                 number of bits required to represent each weight. We limit the number of effective weights we need to
                 store by having multiple connections share the same weight, and then fine-tune those shared weights.
                 Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4
                 output neurons, so the weight is a 4×4 matrix. On the top left is the 4×4 weight matrix, and on the
                 bottom left is the 4×4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors),
                 and all the weights in the same bin share the same value; thus for each weight, we need to store only
                 a small index into a table of shared weights. During update, all the gradients are grouped by color
                 and summed together, multiplied by the learning rate and subtracted from the shared centroids from
                 the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each
                 CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
                 To calculate the compression rate, given k clusters, we only need log_2(k) bits to encode the index. In
                 general, for a network with n connections where each connection is represented with b bits, constraining
                 the connections to have only k shared weights will result in a compression rate of:

                                   r = \frac{n b}{n \log_2(k) + k b}                                   (1)

                                                    <<FIGURE>>

                 Figure 4: Left: Three different methods for centroids initialization. Right: Distribution of weights
                 (blue) and distribution of codebook before (green cross) and after fine-tuning (red dot).

                 For example, Figure 3 shows the weights of a single-layer neural network with four input units and
                 four output units. There are 4×4 = 16 weights originally but there are only 4 shared weights: similar
                 weights are grouped together to share the same value. Originally we need to store 16 weights, each
                 of which has 32 bits; now we need to store only 4 effective weights (blue, green, red and orange),
                 each of which has 32 bits, together with 16 2-bit indices, giving a compression rate of
                 16 × 32 / (4 × 32 + 2 × 16) = 3.2.
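
The compression rate described here (original bits divided by index bits plus codebook bits) is easy to check numerically. The helper below is a small sketch with a hypothetical name; it uses log_2(k) index bits exactly as stated in the text and ignores any sparse-index overhead.

```python
import math

def compression_rate(n, b, k):
    """Ratio of original storage to shared-weight storage.

    n: number of connections, b: bits per original weight, k: number of shared weights.
    """
    original_bits = n * b
    index_bits = n * math.log2(k)   # one index per connection
    codebook_bits = k * b           # k shared weights kept at full precision
    return original_bits / (index_bits + codebook_bits)

# The 4x4 layer of Figure 3: 16 weights of 32 bits, 4 shared weights.
rate = compression_rate(n=16, b=32, k=4)
```

For large n the codebook term becomes negligible and the rate approaches b / log_2(k), for example 32 / 5 = 6.4 for 32 shared weights.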

                  3.1 WEIGHT SHARING

                 We use k-means clustering to identify the shared weights for each layer of a trained network, so that
                 all the weights that fall into the same cluster will share the same weight. Weights are not shared across
                 layers. We partition n original weights <<FORMULA>> into k clusters <<FORMULA>>,
                 n ≫ k, so as to minimize the within-cluster sum of squares (WCSS):

                                               <<FORMULA>>                      (2)

                 Different from HashNet (Chen et al., 2015) where weight sharing is determined by a hash function
                 before the network sees any training data, our method determines weight sharing after a network is
                 fully trained, so that the shared weights approximate the original network.
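
A one-dimensional k-means pass over a layer's weights can be sketched in a few lines of NumPy. This is an illustrative Lloyd-style iteration under simplifying assumptions (fixed iteration count, empty clusters left unchanged); the function names are hypothetical and it is not the authors' implementation.

```python
import numpy as np

def kmeans_1d(weights, init_centroids, n_iter=20):
    """Cluster scalar weights; return (codebook, index of the centroid for each weight)."""
    w = np.asarray(weights, dtype=float).ravel()
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every weight.
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = w[idx == k]
            if members.size:
                centroids[k] = members.mean()
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, idx

def wcss(weights, centroids, idx):
    """Within-cluster sum of squares for a given assignment."""
    w = np.asarray(weights, dtype=float).ravel()
    return float(np.sum((w - centroids[idx]) ** 2))

w = np.array([-1.0, -0.9, 0.1, 0.0, 1.0, 1.2])
codebook, idx = kmeans_1d(w, init_centroids=[-1.0, 0.0, 1.0])
shared = codebook[idx]   # quantized weights that replace the originals
```

After clustering, only `codebook` and `idx` need to be stored for the layer.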

                  3.2 INITIALIZATION OF SHARED WEIGHTS

                 Centroid initialization impacts the quality of clustering and thus affects the network's prediction
                 accuracy. We examine three initialization methods: Forgy (random), density-based, and linear
                 initialization. In Figure 4 we plot the original weight distribution of the conv3 layer in AlexNet
                 (CDF in blue, PDF in red). The weights form a bimodal distribution after network pruning. The
                 bottom plot shows the effective weights (centroids) with 3 different initialization methods (shown in
                 blue, red and yellow). In this example, there are 13 clusters.
                 Forgy (random) initialization randomly chooses k observations from the data set and uses these as
                 the initial centroids. The initialized centroids are shown in yellow. Since there are two peaks in the
                 bimodal distribution, the Forgy centroids tend to concentrate around those two peaks.
                 Density-based initialization linearly spaces the CDF of the weights in the y-axis, then finds the
                 horizontal intersection with the CDF, and finally finds the vertical intersection on the x-axis, which
                 becomes a centroid, as shown in blue dots. This method makes the centroids denser around the two
                 peaks, but more scattered than the Forgy method.
                 Linear initialization linearly spaces the centroids between the [min, max] of the original weights.
                 This initialization method is invariant to the distribution of the weights and is the most scattered
                 compared with the former two methods.

                                                  <<FIGURE>>

                      Figure 5: Distribution for weight (Left) and index (Right). The distribution is biased.

                 Larger weights play a more important role than smaller weights (Han et al., 2015), but there are fewer
                 of these large weights. Thus for both Forgy initialization and density-based initialization, very few
                 centroids have large absolute value, which results in poor representation of these few large weights.
                 Linear initialization does not suffer from this problem. The experiment section compares the accuracy
                 of different initialization methods after clustering and fine-tuning, showing that linear initialization
                 works best.
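
The three initialization schemes can be written compactly in NumPy. The sketch below is an interpretation of the descriptions above rather than the authors' code: function names are hypothetical, and for the density-based variant the choice of equally spaced CDF levels at (i + 0.5) / k is an assumption, since the text does not give the exact levels.

```python
import numpy as np

def forgy_init(w, k, rng):
    """Pick k observed weights at random as initial centroids."""
    return np.sort(rng.choice(w, size=k, replace=False))

def density_init(w, k):
    """Equally spaced levels on the CDF, mapped back to weight values (quantiles)."""
    levels = (np.arange(k) + 0.5) / k   # assumed spacing of the CDF levels
    return np.quantile(w, levels)

def linear_init(w, k):
    """Centroids equally spaced between the smallest and largest weight."""
    return np.linspace(w.min(), w.max(), k)

rng = np.random.default_rng(0)
w = np.array([-1.0, -0.2, -0.1, 0.0, 0.1, 0.2, 1.0])
forgy = forgy_init(w, 3, rng)
density = density_init(w, 3)
linear = linear_init(w, 5)
```

Only `linear_init` is guaranteed to place centroids at the extremes of the weight range, which matches the argument that it represents the few large weights better.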

                  3.3 FEED-FORWARD AND BACK-PROPAGATION

                 The centroids of the one-dimensional k-means clustering are the shared weights. There is one level
                 of indirection during feed forward phase and back-propagation phase looking up the weight table.
                 An index into the shared weight table is stored for each connection. During back-propagation, the
                 gradient for each shared weight is calculated and used to update the shared weight. This procedure is
                 shown in Figure 3.
                 We denote the loss by L, the weight in the ith column and jth row by W_ij, the centroid index of
                 element W_ij by I_ij, and the kth centroid of the layer by C_k. By using the indicator function 1(·), the
                 gradient of the centroids is calculated as:

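                 \frac{\partial L}{\partial C_k} = \sum_{i,j} \frac{\partial L}{\partial W_{ij}} \frac{\partial W_{ij}}{\partial C_k} = \sum_{i,j} \frac{\partial L}{\partial W_{ij}} \mathbb{1}(I_{ij} = k)               (3)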
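
The centroid gradient is a group-by-index sum: every weight gradient is added to the bucket of the centroid that weight points to. A minimal NumPy sketch, with hypothetical function names and a plain SGD update assumed for the fine-tuning step:

```python
import numpy as np

def centroid_gradient(grad_W, index, k):
    """Sum dL/dW_ij over all positions whose centroid index I_ij equals each k."""
    return np.bincount(index.ravel(), weights=grad_W.ravel(), minlength=k)

def update_centroids(centroids, grad_W, index, lr):
    """One SGD step on the shared weights."""
    return centroids - lr * centroid_gradient(grad_W, index, len(centroids))

centroids = np.array([0.5, 0.0, -0.5, 1.0])
index = np.array([[0, 1],
                  [1, 2]])            # I_ij: which centroid each weight uses
W = centroids[index]                  # forward pass: one lookup per connection
grad_W = np.array([[1.0, 2.0],
                   [3.0, 4.0]])       # dL/dW_ij from back-propagation
grad_C = centroid_gradient(grad_W, index, k=4)
new_centroids = update_centroids(centroids, grad_W, index, lr=0.1)
```

Centroid 3 is not used by any connection in this toy layer, so its gradient is zero and it is left unchanged.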
 
                  4 HUFFMAN CODING

                 A Huffman code is an optimal prefix code commonly used for lossless data compression (Van Leeuwen,
                 1976). It uses variable-length codewords to encode source symbols. The table is derived from the
                 occurrence probability for each symbol. More common symbols are represented with fewer bits.
                 Figure 5 shows the probability distribution of quantized weights and the sparse matrix index of the
                 last fully connected layer in AlexNet. Both distributions are biased: most of the quantized weights are
                 distributed around the two peaks; the sparse matrix index difference is rarely above 20. Experiments
                 show that Huffman coding these non-uniformly distributed values saves 20% to 30% of network
                 storage.
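
To make the saving concrete, the sketch below builds Huffman code lengths for a small, made-up stream of quantized weight indices using only the Python standard library. The function name and the toy data are illustrative assumptions; it computes code lengths only, which is enough to compare storage against a fixed-length code.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth in the subtree}).
    heap = [(cnt, i, {sym: 0}) for i, (sym, cnt) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

# Toy index stream with a biased distribution over 4 shared weights.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)   # 2-bit fixed-length index per weight
```

On this toy stream the Huffman code needs 28 bits instead of 32; the more skewed the distribution, the larger the saving.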

                  5 EXPERIMENTS

                 We pruned, quantized, and Huffman encoded four networks: two on MNIST and two on ImageNet
                 data-sets. The network parameters and accuracy¹ before and after pruning are shown in Table 1. The
                 compression pipeline saves network storage by 35× to 49× across different networks without loss
                 of accuracy. The total size of AlexNet decreased from 240MB to 6.9MB, which is small enough to
                 be put into on-chip SRAM, eliminating the need to store the model in energy-consuming DRAM
                 memory.

                 Training is performed with the Caffe framework (Jia et al., 2014). Pruning is implemented by adding
                 a mask to the blobs to mask out the update of the pruned connections. Quantization and weight
                 sharing are implemented by maintaining a codebook structure that stores the shared weight, and
                 group-by-index after calculating the gradient of each layer. Each shared weight is updated with all
                 the gradients that fall into that bucket. Huffman coding doesn’t require training and is implemented
                 ofﬂine after all the ﬁne-tuning is ﬁnished.

                  5.1 LENET-300-100 AND LENET-5 ON MNIST

                 We ﬁrst experimented on MNIST dataset with LeNet-300-100 and LeNet-5 network (LeCun et al.,
                 1998). LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100
                 neurons each, which achieves a 1.6% error rate on MNIST. LeNet-5 is a convolutional network that
                 has two convolutional layers and two fully connected layers, which achieves a 0.8% error rate on
                 MNIST. Table 2 and Table 3 show the statistics of the compression pipeline. The compression rate
                 includes the overhead of the codebook and sparse indexes. Most of the saving comes from pruning
                 and quantization (compressed 32×), while Huffman coding gives a marginal gain (compressed 40×).

                    1 Reference model is from the Caffe model zoo; accuracy is measured without data augmentation.

                 Table 1: The compression pipeline can save 35× to 49× parameter storage with no loss of accuracy.

                                                <<TABLE>>

                 Table 2: Compression statistics for LeNet-300-100. P: pruning, Q: quantization, H: Huffman coding.

                                                <<TABLE>>

                 Table 3: Compression statistics for LeNet-5. P: pruning, Q: quantization, H: Huffman coding.

                                                <<TABLE>>

                  5.2 ALEXNET ON IMAGENET

                 We further examine the performance of Deep Compression on the ImageNet ILSVRC-2012 dataset,
                 which has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as
                 the reference model, which has 61 million parameters and achieved a top-1 accuracy of 57.2% and a
                 top-5 accuracy of 80.3%. Table 4 shows that AlexNet can be compressed to 2.88% of its original size
                 without impacting accuracy. There are 256 shared weights in each CONV layer, which are encoded
                 with 8 bits, and 32 shared weights in each FC layer, which are encoded with only 5 bits. The relative
                 sparse index is encoded with 4 bits. Huffman coding compressed an additional 22%, resulting in 35×
                 compression in total.
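
As a quick consistency check on the AlexNet numbers quoted in this paper (240MB down to 6.9MB, and 2.88% of the original size), the arithmetic can be sketched as follows. The helper names are hypothetical; the only inputs are figures stated in the text.

```python
def compression_rate(original_size, compressed_size):
    """Ratio of original to compressed size (a value of 35 means 35 times smaller)."""
    return original_size / compressed_size


def codebook_entries(bits):
    """Number of distinct shared weights addressable with the given index width."""
    return 2 ** bits


print(round(compression_rate(240.0, 6.9), 1))   # AlexNet size in MB, before and after
print(round(1.0 / 0.0288, 1))                   # rate implied by "2.88% of original size"
print(codebook_entries(8), codebook_entries(5)) # CONV and FC codebook sizes
```

Both size figures correspond to a compression rate of roughly 35, and the 8-bit and 5-bit index widths match the 256 and 32 shared weights per layer.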

                  5.3 VGG-16 ON IMAGENET

                 With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16
                 (Simonyan & Zisserman, 2014), on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional
                 layers but still only three fully-connected layers. Following a similar methodology, we aggressively
                 compressed both convolutional and fully-connected layers to realize a significant reduction in the
                 number of effective weights, shown in Table 5.
                 The VGG-16 network as a whole has been compressed by 49×. Weights in the CONV layers are
                 represented with 8 bits, and FC layers use 5 bits, which does not impact the accuracy. The two largest
                 fully-connected layers can each be pruned to less than 1.6% of their original size. This reduction
                 is critical for real-time image processing, where there is little reuse of these layers across images
                 (unlike batch processing). This is also critical for fast object detection algorithms where one CONV
                 pass is used by many FC passes. The reduced layers will fit in on-chip SRAM and have modest
                 bandwidth requirements. Without the reduction, the bandwidth requirements are prohibitive.

                    Table 4: Compression statistics for AlexNet. P: pruning, Q: quantization, H: Huffman coding.

                                                   <<TABLE>>

                    Table 5: Compression statistics for VGG-16. P: pruning, Q: quantization, H: Huffman coding.

                                                    <<TABLE>>

                  6 DISCUSSIONS

                  6.1 PRUNING AND QUANTIZATION WORKING TOGETHER

                 Figure 6 shows the accuracy at different compression rates for pruning and quantization together
                 or individually. When working individually, as shown in the purple and yellow lines, the accuracy of
                 the pruned network begins to drop significantly when compressed below 8% of its original size; the
                 accuracy of the quantized network also begins to drop significantly when compressed below 8% of its
                 original size. But when combined, as shown in the red line, the network can be compressed to 3% of
                 its original size with no loss of accuracy. The far right side shows the result of SVD for comparison,
                 which is inexpensive but has a poor compression rate.
                 The three plots in Figure 7 show how accuracy drops with fewer bits per connection for CONV layers
                 (left), FC layers (middle) and all layers (right). Each plot reports both top-1 and top-5 accuracy.
                 Dashed lines applied quantization only, without pruning; solid lines applied both quantization and
                 pruning. There is very little difference between the two. This shows that pruning works well with
                 quantization.
                 Quantization works well on the pruned network because unpruned AlexNet has 60 million weights to
                 quantize, while pruned AlexNet has only 6.7 million weights to quantize. Given the same number of
                 centroids, the latter has less error.

                                              <<FIGURE>>

                 Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and
                 quantization work best when combined.

                                              <<FIGURE>>

                 Figure 7: Pruning doesn't hurt quantization. Dashed: quantization on unpruned network. Solid:
                 quantization on pruned network. Accuracy begins to drop at the same number of quantization bits
                 whether or not the network has been pruned. Although pruning reduces the number of parameters,
                 quantization works as well as, or even better than (3-bit case in the left figure), in the unpruned
                 network.

                                                <<FIGURE>>

                 Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy.
                 Linear initialization gives the best result.

                 The first two plots in Figure 7 show that CONV layers require more bits of precision than FC layers.
                 For CONV layers, accuracy drops significantly below 4 bits, while FC layers are more robust: not until
                 2 bits did the accuracy drop significantly.


                  6.2 CENTROID INITIALIZATION

                 Figure 8 compares the accuracy of the three different initialization methods with respect to top-1
                 accuracy (left) and top-5 accuracy (right). The network is quantized to 2 to 8 bits, as shown on the
                 x-axis. Linear initialization outperforms the density initialization and random initialization in all
                 cases except at 3 bits.
                 The initial centroids of linear initialization are spread equally across the x-axis, from the min value
                 to the max value. That helps to maintain the large weights, as the large weights play a more important
                 role than smaller ones, which is also shown in network pruning (Han et al., 2015). Neither random nor
                 density-based initialization retains large centroids. With these initialization methods, large weights are
                 clustered to the small centroids because there are few large weights. In contrast, linear initialization
                 allows large weights a better chance to form a large centroid.
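
Linear initialization as described above can be sketched as follows. This is a minimal illustration, assuming a flat list of weights and nearest-centroid assignment; the names `linear_init` and `assign` are hypothetical, and the random and density-based variants are not shown.

```python
def linear_init(weights, bits):
    """Place 2**bits centroids at equal spacing between the min and max weight."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def assign(weights, centroids):
    """Map each weight to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: abs(w - centroids[j]))
        for w in weights
    ]


weights = [-1.0, 0.0, 0.1, 2.0]           # one large weight among small ones
centroids = linear_init(weights, bits=2)  # four equally spaced centroids
print(centroids)
print(assign(weights, centroids))
```

In this toy case the single large weight (2.0) starts with a centroid of its own at the top of the range, which is the property the paragraph above attributes to linear initialization.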

                                            <<FIGURE>>

                 Figure 9: Compared with the original network, the pruned network layer achieved 3× speedup on CPU,
                 3.5× on GPU and 4.2× on mobile GPU on average. Batch size = 1, targeting real-time processing.
                 Performance numbers are normalized to CPU.

                                            <<FIGURE>>

                 Figure 10: Compared with the original network, the pruned network layer takes 7× less energy on CPU,
                 3.3× less on GPU and 4.2× less on mobile GPU on average. Batch size = 1, targeting real-time
                 processing. Energy numbers are normalized to CPU.

                  6.3 SPEEDUP AND ENERGY EFFICIENCY

                 Deep Compression targets extremely latency-focused applications running on mobile, which
                 require real-time inference, such as pedestrian detection on an embedded processor inside an
                 autonomous vehicle. Waiting for a batch to assemble adds significant latency. So when benchmarking
                 the performance and energy efficiency, we consider the case when batch size = 1. The cases
                 of batching are given in Appendix A.
                 Fully connected layers dominate the model size (more than 90%) and are compressed the most by
                 Deep Compression (96% of weights pruned in VGG-16). In state-of-the-art object detection algorithms
                 such as fast R-CNN (Girshick, 2015), up to 38% of computation time is consumed by FC layers on the
                 uncompressed model. So it is interesting to benchmark on FC layers, to see the effect of Deep
                 Compression on performance and energy. Thus we set up our benchmark on the FC6, FC7 and FC8 layers
                 of AlexNet and VGG-16. In the non-batched case, the activation matrix is a vector with just one column,
                 so the computation boils down to dense / sparse matrix-vector multiplication for the original / pruned
                 model, respectively. Since the current BLAS libraries on CPU and GPU do not support indirect look-up
                 and relative indexing, we did not benchmark the quantized model.
                 We compare three different pieces of off-the-shelf hardware: the NVIDIA GeForce GTX Titan X and the
                 Intel Core i7 5930K as desktop processors (same package as the NVIDIA Digits Dev Box) and the NVIDIA
                 Tegra K1 as a mobile processor. To run the benchmark on GPU, we used cuBLAS GEMV for the original
                 dense layer. For the pruned sparse layer, we stored the sparse matrix in CSR format and used the
                 cuSPARSE CSRMV kernel, which is optimized for sparse matrix-vector multiplication on GPU. To
                 run the benchmark on CPU, we used MKL CBLAS GEMV for the original dense model and MKL
                 SPBLAS CSRMV for the pruned sparse model.
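
To make the dense / sparse distinction concrete, the following is a minimal pure-Python sketch of the CSR (compressed sparse row) layout and the matrix-vector product it supports. It is not the cuSPARSE or MKL kernel used in the benchmark, and it stores absolute column indices rather than the relative indexing used by Deep Compression; the function names are hypothetical.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product: touches only the stored non-zeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


def dense_matvec(dense, x):
    """Dense matrix-vector product: touches every entry, including zeros."""
    return [sum(a * b for a, b in zip(row, x)) for row in dense]


W = [[0.0, 2.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
x = [1.0, 2.0, 3.0]
print(to_csr(W))
print(csr_matvec(*to_csr(W), x), dense_matvec(W, x))
```

The sparse product reads only two stored values here instead of nine, which is the memory-traffic saving the benchmark measures at much larger scale.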

                 To compare power consumption between different systems, it is important to measure power in a
                 consistent manner (NVIDIA, b). For our analysis, we compare the pre-regulation power of the
                 entire application processor (AP) / SOC and DRAM combined. On CPU, the benchmark runs on a
                 single socket with a single Haswell-E class Core i7-5930K processor. CPU socket and DRAM power
                 are as reported by the pcm-power utility provided by Intel. For GPU, we used the nvidia-smi
                 utility to report the power of the Titan X. For mobile GPU, we used a Jetson TK1 development board and
                 measured the total power consumption with a power meter. We assume 15% AC to DC conversion
                 loss, 85% regulator efficiency and 15% power consumed by peripheral components (NVIDIA, a) to
                 report the AP+DRAM power for Tegra K1.

                 Table 6: Accuracy of AlexNet with different aggressiveness of weight sharing and quantization. 8/5
                 bit quantization has no loss of accuracy; 8/4 bit quantization, which is more hardware friendly, has
                 a negligible loss of accuracy of 0.01%; to be really aggressive, 4/2 bit quantization resulted in 1.99%
                 and 2.60% loss of accuracy.

                                                <<TABLE>>

                 The ratio of memory access to computation is different with and without batching.
                 When the input activations are batched into a matrix, the computation becomes matrix-matrix
                 multiplication, where locality can be improved by blocking. The matrix can be blocked to fit in caches
                 and reused efficiently. In this case, the amount of memory access is $O(n^2)$, and that of computation
                 is $O(n^3)$, so the ratio between memory access and computation is on the order of $1/n$.
                 In real-time processing, when batching is not allowed, the input activation is a single vector and the
                 computation is matrix-vector multiplication. In this case, the amount of memory access is $O(n^2)$, and
                 the computation is $O(n^2)$, so memory access and computation are of the same magnitude (as opposed
                 to $1/n$). That indicates MV is more memory-bound than MM. So reducing the memory footprint
                 is critical for the non-batching case.
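
The scaling argument above can be illustrated with a small counting sketch. It assumes an n × n weight matrix, ideal blocking (each weight fetched once per batch) and ignores activation traffic; `access_to_compute_ratio` is a hypothetical helper, not a measurement.

```python
def access_to_compute_ratio(n, batch):
    """Weight reads per multiply-add for an n x n layer applied to `batch` inputs."""
    weight_reads = n * n            # each weight fetched once under ideal blocking
    multiply_adds = n * n * batch   # one multiply-add per weight per input vector
    return weight_reads / multiply_adds


n = 1024
print(access_to_compute_ratio(n, batch=1))   # matrix-vector: ratio of order 1
print(access_to_compute_ratio(n, batch=n))   # matrix-matrix: ratio of order 1/n
```

With a single input every multiply-add needs a fresh weight from memory, while with a batch of n inputs each fetched weight is reused n times.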

                 Figure 9 illustrates the speedup of pruning on different hardware. There are 6 columns for each
                 benchmark, showing the computation time of CPU / GPU / TK1 on the dense / pruned network. Time is
                 normalized to CPU. When batch size = 1, the pruned network layer obtained a 3× to 4× speedup over the
                 dense network on average because it has a smaller memory footprint and alleviates the data transfer
                 overhead, especially for large matrices that are unable to fit into the caches. For example, VGG-16's
                 FC6 layer, the largest layer in our experiment, contains 400MB of data, which is far beyond the
                 capacity of the L3 cache.

                 In those latency-tolerating applications, batching improves memory locality, where weights could
                 be blocked and reused in matrix-matrix multiplication. In this scenario, pruned network no longer
                 shows its advantage. We give detailed timing results in Appendix A.

                 Figure 10 illustrates the energy efficiency of pruning on different hardware. We multiply power
                 consumption by computation time to get energy consumption, then normalize to CPU to get
                 energy efficiency. When batch size = 1, the pruned network layer consumes 3× to 7× less energy than
                 the dense network on average. As reported by nvidia-smi, GPU utilization is 99% for both the dense
                 and sparse cases.
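
The energy calculation described above (power multiplied by time, then normalized to a baseline) can be sketched as follows. The numbers in the example are made-up placeholders chosen for easy arithmetic, not measurements from this paper, and `normalized_energy` is a hypothetical helper.

```python
def normalized_energy(power_watts, time_seconds, baseline):
    """Energy = power * time per configuration, divided by the baseline's energy."""
    energy = {name: power_watts[name] * time_seconds[name] for name in power_watts}
    reference = energy[baseline]
    return {name: e / reference for name, e in energy.items()}


# Placeholder values only (NOT taken from Table 8 or Table 9).
power = {"cpu_dense": 100.0, "cpu_sparse": 50.0}
time = {"cpu_dense": 2.0, "cpu_sparse": 1.0}
print(normalized_energy(power, time, baseline="cpu_dense"))
```

A configuration that is both faster and lower-power gains on both factors, which is why energy savings can exceed the speedup alone.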

                  6.4 RATIO OF WEIGHTS, INDEX AND CODEBOOK

                 Pruning makes the weight matrix sparse, so extra space is needed to store the indexes of non-zero
                 elements. Quantization adds storage for a codebook. The experiment section has already included
                 these two factors. Figure 11 shows the breakdown of three different components when quantizing
                 four networks. Since on average both the weights and the sparse indexes are encoded with 5 bits,
                 their storage is roughly half and half. The overhead of codebook is very small and often negligible.
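
A rough storage breakdown along these lines can be sketched as below. It assumes every non-zero weight costs a fixed number of weight bits and index bits and that each codebook entry is a 32-bit float; it ignores Huffman coding and any padding entries needed by relative indexing. The helper name and the example layer size are hypothetical.

```python
def storage_breakdown(num_nonzero, weight_bits, index_bits, codebook_entries,
                      entry_bits=32):
    """Return the fraction of total storage used by weights, indexes and codebook."""
    weights = num_nonzero * weight_bits
    index = num_nonzero * index_bits
    codebook = codebook_entries * entry_bits
    total = weights + index + codebook
    return {
        "weights": weights / total,
        "index": index / total,
        "codebook": codebook / total,
    }


# Hypothetical layer: one million surviving weights, 5-bit codes and indexes,
# and a 32-entry codebook.
shares = storage_breakdown(1_000_000, weight_bits=5, index_bits=5, codebook_entries=32)
print(shares)
```

With equal bit widths the weights and indexes take equal shares, and the codebook share shrinks as the number of surviving weights grows.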

                                                <<FIGURE>>

                                Figure 11: Storage ratio of weight, index and codebook.

                 Table 7: Comparison with other compression methods on AlexNet. Collins & Kohli (2014) reduced
                 the parameters by 4× with inferior accuracy. Deep Fried Convnets (Yang et al., 2014) worked
                 on fully connected layers and reduced the parameters by less than 4×. SVD saves parameters but
                 suffers from an accuracy loss of as much as 2%. Network pruning (Han et al., 2015) reduced the
                 parameters by 9×, not including index overhead. On other networks similar to AlexNet, Denton
                 et al. (2014) exploited the linear structure of convnets and compressed the network by 2.4× to 13.4×
                 layer-wise, with 0.9% accuracy loss on compressing a single layer. Gong et al. (2014) experimented
                 with vector quantization and compressed the network by 16× to 24×, incurring 1% accuracy loss.

                                                     <<TABLE>>

                  7 RELATED WORK

                 Neural networks are typically over-parametrized, and there is significant redundancy in deep learning
                 models (Denil et al., 2013). This results in a waste of both computation and memory usage. There
                 have been various proposals to remove the redundancy: Vanhoucke et al. (2011) explored a fixed-point
                 implementation with 8-bit integer (vs 32-bit floating point) activations. Hwang & Sung
                 (2014) proposed an optimization method for the fixed-point network with ternary weights and 3-bit
                 activations. Anwar et al. (2015) quantized the neural network using L2 error minimization and
                 achieved better accuracy on MNIST and CIFAR-10 datasets. Denton et al. (2014) exploited the linear
                 structure of the neural network by finding an appropriate low-rank approximation of the parameters
                 and keeping the accuracy within 1% of the original model.
                 The empirical success in this paper is consistent with the theoretical study of random-like sparse
                 networks with +1/0/-1 weights (Arora et al., 2014), which have been proved to enjoy nice properties
                 (e.g. reversibility), and to allow a provably polynomial time algorithm for training.
                 Much work has focused on binning the network parameters into buckets, so that only the values in
                 the buckets need to be stored. HashedNets (Chen et al., 2015) reduce model sizes by using a hash
                 function to randomly group connection weights, so that all connections within the same hash bucket
                 share a single parameter value. In their method, the weight binning is pre-determined by the hash
                 function, instead of being learned through training, which doesn't capture the nature of images. Gong
                 et al. (2014) compressed deep convnets using vector quantization, which resulted in 1% accuracy
                 loss. Both methods studied only the fully connected layer, ignoring the convolutional layers.
                 There have been other attempts to reduce the number of parameters of neural networks by replacing
                 the fully connected layer with global average pooling. The Network in Network architecture (Lin et al.,
                 2013) and GoogLeNet (Szegedy et al., 2014) achieve state-of-the-art results on several benchmarks by
                 adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset
                 and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with
                 this approach. This problem is noted by Szegedy et al. (2014) and motivates them to add a linear
                 layer on the top of their networks to enable transfer learning.
                 Network pruning has been used both to reduce network complexity and to reduce over-fitting. An
                 early approach to pruning was biased weight decay (Hanson & Pratt, 1989). Optimal Brain Damage
                 (LeCun et al., 1989) and Optimal Brain Surgeon (Hassibi et al., 1993) prune networks to reduce
                 the number of connections based on the Hessian of the loss function and suggest that such pruning
                 is more accurate than magnitude-based pruning such as weight decay. A recent work (Han et al.,
                 2015) successfully pruned several state-of-the-art large-scale networks and showed that the number of
                 parameters could be reduced by an order of magnitude. There are also attempts to reduce the number
                 of activations for both compression and acceleration (Van Nguyen et al., 2015).

                  8 FUTURE WORK

                 While the pruned network has been benchmarked on various hardware, the quantized network with
                 weight sharing has not, because the off-the-shelf cuSPARSE and MKL SPBLAS libraries do not support
                 indirect matrix entry lookup, nor is the relative index in CSC or CSR format supported. So the full
                 advantage of Deep Compression, fitting the model in cache, has not been fully demonstrated. A software
                 solution is to write customized GPU kernels that support this. A hardware solution is to build a custom
                 ASIC architecture specialized to traverse the sparse and quantized network structure, which also
                 supports customized quantization bit width. We expect this architecture to have energy dominated by
                 on-chip SRAM access instead of off-chip DRAM access.

                  9 CONCLUSION

                 We have presented “Deep Compression”, which compresses neural networks without affecting accuracy.
                 Our method operates by pruning the unimportant connections, quantizing the network using weight
                 sharing, and then applying Huffman coding. We highlight our experiments on AlexNet, which
                 reduced the weight storage by 35× without loss of accuracy. We show similar results for the VGG-16
                 and LeNet networks, compressed by 49× and 39× without loss of accuracy. This leads to a smaller
                 storage requirement for putting convnets into mobile apps. After Deep Compression, these
                 networks fit into on-chip SRAM cache (5pJ/access) rather than requiring off-chip DRAM memory
                 (640pJ/access). This potentially makes deep neural networks more energy efficient to run on mobile.
                 Our compression method also facilitates the use of complex neural networks in mobile applications
                 where application size and download bandwidth are constrained.

                  REFERENCES
                 Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convolutional
                   neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP),
                   2015 IEEE International Conference on, pp. 1131–1135. IEEE, 2015.
                 Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. Provable bounds for learning some
                   deep representations. In Proceedings of the 31st International Conference on Machine Learning,
                   ICML 2014, pp. 584–592, 2014.
                 BVLC. Caffe model zoo. URL http://caffe.berkeleyvision.org/model_zoo.
                 Chen, Wenlin, Wilson, James T., Tyree, Stephen, Weinberger, Kilian Q., and Chen, Yixin. Compressing
                   neural networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.
                 Collins, Maxwell D and Kohli, Pushmeet. Memory bounded deep convolutional networks. arXiv
                   preprint arXiv:1412.1442, 2014.
                 Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep
                   learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
                 Denton, Emily L, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear
                   structure within convolutional networks for efficient evaluation. In Advances in Neural Information
                   Processing Systems, pp. 1269–1277, 2014.
                 Girshick, Ross. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
                 Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional
                   networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
                 Han, Song, Pool, Jeff, Tran, John, and Dally, William J. Learning both weights and connections for
                   efficient neural networks. In Advances in Neural Information Processing Systems, 2015.
                 Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A, and Dally,
                   William J. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint
                   arXiv:1602.01528, 2016.
                 Hanson, Stephen José and Pratt, Lorien Y. Comparing biases for minimal network construction with
                   back-propagation. In Advances in Neural Information Processing Systems, pp. 177–185, 1989.
                 Hassibi, Babak, Stork, David G, et al. Second order derivatives for network pruning: Optimal brain
                   surgeon. Advances in Neural Information Processing Systems, pp. 164–164, 1993.
                 Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using
                   weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6.
                   IEEE, 2014.
                 Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross,
                   Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature
                   embedding. arXiv preprint arXiv:1408.5093, 2014.
                 Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep
                   convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
                 LeCun, Yann, Denker, John S, Solla, Sara A, Howard, Richard E, and Jackel, Lawrence D. Optimal
                   brain damage. In NIPS, volume 89, 1989.
                 LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied
                   to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
                 Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv preprint arXiv:1312.4400, 2013.
                 NVIDIA. Technical brief: NVIDIA jetson TK1 development kit bringing GPU-accelerated computing
                   to embedded systems, a. URL http://www.nvidia.com.
                 NVIDIA. Whitepaper: GPU-based deep learning inference: A performance and power analysis, b.
                   URL http://www.nvidia.com/object/white-papers.html.
                 Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image
                   recognition. arXiv preprint arXiv:1409.1556, 2014.
                 Ström, Nikko. Phoneme probability estimation with dynamic sparsely connected artificial neural
                   networks. The Free Speech Journal, 1(5):1–41, 1997.
                 Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir,
                   Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions.
                   arXiv preprint arXiv:1409.4842, 2014.
                 Van Leeuwen, Jan. On the construction of huffman trees. In ICALP, pp. 382–410, 1976.
                 Van Nguyen, Hien, Zhou, Kevin, and Vemulapalli, Raviteja. Cross-domain synthesis of medical
                   images using efficient location-sensitive deep network. In Medical Image Computing and
                   Computer-Assisted Intervention–MICCAI 2015, pp. 677–684. Springer, 2015.
                 Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on
                   cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
                 Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and
                   Wang, Ziyu. Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.

                  A APPENDIX: DETAILED TIMING / POWER REPORTS OF DENSE & SPARSE
                     NETWORK LAYERS

                 Table 8: Average time on different layers. To avoid variance, we measured the time spent on each
                 layer for 4096 input samples, and averaged the time over the input samples. For GPU, the time
                 consumed by cudaMalloc and cudaMemcpy is not counted. For batch size = 1, gemv is used;
                 for batch size = 64, gemm is used. For the sparse case, csrmv and csrmm are used, respectively.

                                              <<TABLE>>

                 Table 9: Power consumption of different layers. We measured the Titan X GPU power with
                 nvidia-smi, Core i7-5930K CPU power with pcm-power and Tegra K1 mobile GPU power with
                 an external power meter (scaled to AP+DRAM, see paper discussion). During power measurement,
                 we repeated each computation multiple times in order to get stable numbers. On CPU, dense matrix
                 multiplications consume 2x the energy of sparse ones because they are accelerated with multi-threading.

                                              <<TABLE>>
<|endoftext|>


<|startoftext|>
                  DEEP DOUBLE DESCENT: WHERE BIGGER MODELS AND MORE DATA HURT

                  Preetum Nakkiran    Gal Kaplun†        Yamini Bansal†      Tristan Yang
                  Harvard University    Harvard University    Harvard University    Harvard University

                  Boaz Barak         Ilya Sutskever
                  Harvard University    OpenAI


                                         ABSTRACT

                       We show that a variety of modern deep learning tasks exhibit a “double-descent”
                       phenomenon where, as we increase model size, performance ﬁrst gets worse and
                       then gets better. Moreover, we show that double descent occurs not just as a
                       function of model size, but also as a function of the number of training epochs.
                       We unify the above phenomena by deﬁning a new complexity measure we call
                       the effective model complexity and conjecture a generalized double descent with
                       respect to this measure. Furthermore, our notion of model complexity allows us to
                       identify certain regimes where increasing (even quadrupling) the number of train
                       samples actually hurts test performance.


                  1 INTRODUCTION

                                        <<FIGURE>>

                  Figure 1: Left: Train and test error as a function of model size, for ResNet18s of varying width
                  on CIFAR-10 with 15% label noise. Right: Test error, shown for varying train epochs. All models
                  trained using Adam for 4K epochs. The largest model (width 64) corresponds to standard ResNet18.


                  The bias-variance trade-off is a fundamental concept in classical statistical learning theory (e.g.,
                  Hastie et al. (2005)). The idea is that models of higher complexity have lower bias but higher vari-
                  ance. According to this theory, once model complexity passes a certain threshold, models “overﬁt”
                  with the variance term dominating the test error, and hence from this point onward, increasing model
                  complexity will only decrease performance (i.e., increase test error). Hence conventional wisdom
                  in classical statistics is that, once we pass a certain threshold, “larger models are worse.”
                  However, modern neural networks exhibit no such phenomenon. Such networks have millions of
                  parameters, more than enough to fit even random labels (Zhang et al. (2016)), and yet they perform
                  much better on many tasks than smaller models. Indeed, conventional wisdom among practitioners
                  is that “larger models are better” (Krizhevsky et al. (2012), Huang et al. (2018), Szegedy et al.
                  (2015), Radford et al. (2019)). The effect of training time on test performance is also up for debate.
                  In some settings, “early stopping” improves test performance, while in other settings training neural
                  networks to zero training error only improves performance. Finally, if there is one thing both
                  classical statisticians and deep learning practitioners agree on, it is that “more data is always better”.

                              <<FIGURE>>

                  Figure 2: Left: Test error as a function of model size and train epochs. The horizontal line
                  corresponds to model-wise double descent, varying model size while training for as long as possible.
                  The vertical line corresponds to epoch-wise double descent, with test error undergoing double descent
                  as train time increases. Right: Train error of the corresponding models. All models are ResNet18s
                  trained on CIFAR-10 with 15% label noise, data augmentation, and Adam for up to 4K epochs.

                 In this paper, we present empirical evidence that both reconciles and challenges some of the above
                 “conventional wisdoms.” We show that many deep learning settings have two different regimes.
                 In the under-parameterized regime, where the model complexity is small compared to the number
                 of samples, the test error as a function of model complexity follows the U-like behavior predicted
                 by the classical bias/variance tradeoff. However, once model complexity is sufficiently large to
                 interpolate, i.e., achieve (close to) zero training error, then increasing complexity only decreases test
                 error, following the modern intuition of “bigger models are better”. Similar behavior was previously
                 observed in Opper (1995; 2001), Advani & Saxe (2017), Spigler et al. (2018), and Geiger et al.
                 (2019b). This phenomenon was first postulated in generality by Belkin et al. (2018), who named
                 it “double descent”, and demonstrated it for decision trees, random features, and 2-layer neural
                 networks with $\ell_2$ loss, on a variety of learning tasks including MNIST and CIFAR-10.


                  Main contributions. We show that double descent is a robust phenomenon that occurs in a variety
                  of tasks, architectures, and optimization methods (see Figure 1 and Section 5; our experiments are
                  summarized in Table A). Moreover, we propose a much more general notion of “double descent”
                  that goes beyond varying the number of parameters. We deﬁne the effective model complexity (EMC)
                 of a training procedure as the maximum number of samples on which it can achieve close to zero
                 training error. The EMC depends not just on the data distribution and the architecture of the classiﬁer
                 but also on the training procedure—and in particular increasing training time will increase the EMC.
                 We hypothesize that for many natural models and learning algorithms, double descent occurs as a
                 function of the EMC. Indeed we observe “epoch-wise double descent” when we keep the model ﬁxed
                 and increase the training time, with performance following a classical U-like curve in the underﬁtting
                 stage (when the EMC is smaller than the number of samples) and then improving with training time
                 once the EMC is sufﬁciently larger than the number of samples (see Figure 2). As a corollary, early
                 stopping only helps in the relatively narrow parameter regime of critically parameterized models.


                 Sample non-monotonicity. Finally, our results shed light on test performance as a function of
                 the number of train samples. Since the test error peaks around the point where EMC matches the
                 number of samples (the transition from the under- to over-parameterization), increasing the number
                 of samples has the effect of shifting this peak to the right. While in most settings increasing the
                 number of samples decreases error, this shifting effect can sometimes result in a setting where more
                 data is worse! For example, Figure 3 demonstrates cases in which increasing the number of samples
                 by a factor of 4.5 results in worse test performance.

                     <<FIGURE>>

                 Figure 3: Test loss (per-token perplexity) as a function of Transformer model size (embedding
                 dimension d_model) on language translation (IWSLT'14 German-to-English). The curve for 18k
                 samples is generally lower than the one for 4k samples, but also shifted to the right, since fitting
                 18k samples requires a larger model. Thus, for some models, the performance for 18k samples is
                 worse than for 4k samples.


                  2 OUR RESULTS

                  To state our hypothesis more precisely, we define the notion of effective model complexity. We define
                  a training procedure T to be any procedure that takes as input a set <<FORMULA>>
                  of labeled training samples and outputs a classifier T(S) mapping data to labels. We define the
                  effective model complexity of T (w.r.t. distribution D) to be the maximum number of samples n on
                  which T achieves on average <<FORMULA>> training error.

                  Definition 1 (Effective Model Complexity) The Effective Model Complexity (EMC) of a training
                  procedure T, with respect to distribution D and parameter <<FORMULA>>, is defined as:

                                <<FORMULA>>

                  where Error_S(M) is the mean error of model M on train samples S.
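As a minimal sketch of Definition 1, one reading of the verbal description is $\mathrm{EMC}_{D,\epsilon}(T) = \max\{n : \mathbb{E}_{S \sim D^n}[\mathrm{Error}_S(T(S))] \le \epsilon\}$; this form is reconstructed from the surrounding prose and should be treated as an assumption. The Python below estimates that quantity by Monte Carlo for a toy training procedure. The names `estimate_emc`, `draw_random_labels`, and `make_memorizer`, and the capacity-limited "memorizer" itself, are hypothetical illustrations and not the experimental code.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[int, int]        # (input id, label)
Model = Callable[[int], int]    # maps an input id to a predicted label
Trainer = Callable[[Sequence[Sample]], Model]


def train_error(model: Model, samples: Sequence[Sample]) -> float:
    """Mean 0/1 error of `model` on the train samples, i.e. Error_S(M)."""
    return sum(model(x) != y for x, y in samples) / len(samples)


def estimate_emc(train: Trainer,
                 draw: Callable[[int, random.Random], List[Sample]],
                 eps: float, max_n: int, trials: int = 20, seed: int = 0) -> int:
    """Largest n <= max_n whose average train error is <= eps (0 if none)."""
    rng = random.Random(seed)
    emc = 0
    for n in range(1, max_n + 1):
        total = 0.0
        for _ in range(trials):
            samples = draw(n, rng)
            total += train_error(train(samples), samples)
        if total / trials <= eps:
            emc = n
    return emc


def draw_random_labels(n: int, rng: random.Random) -> List[Sample]:
    """Toy distribution (hypothetical): distinct inputs, random binary labels."""
    return [(x, rng.randrange(2)) for x in rng.sample(range(10**6), n)]


def make_memorizer(capacity: int) -> Trainer:
    """Toy training procedure (hypothetical): memorizes at most `capacity` samples."""
    def train(samples: Sequence[Sample]) -> Model:
        table = dict(samples[:capacity])
        # -1 never matches a label, so every un-memorized sample is an error.
        return lambda x: table.get(x, -1)
    return train


print(estimate_emc(make_memorizer(50), draw_random_labels, eps=0.01, max_n=80))
```

In this toy setting the estimate tracks the memorizer's capacity, which mirrors the intuition that EMC depends on the training procedure and not only on the model family. A real estimate would replace `make_memorizer` with an actual training run and would be far more expensive.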

                  Our main hypothesis can be informally stated as follows:

                  Hypothesis 1 (Generalized Double Descent hypothesis, informal) For any natural data
                  distribution D, neural-network-based training procedure T, and small <<FORMULA>>, if we consider the task of
                  predicting labels based on n samples from D then:

                  Under-parameterized regime. If <<FORMULA>> is sufficiently smaller than n, any perturbation of T
                       that increases its effective complexity will decrease the test error.
                  Over-parameterized regime. If <<FORMULA>> is sufficiently larger than n, any perturbation of T
                       that increases its effective complexity will decrease the test error.
                  Critically parameterized regime. If <<FORMULA>>, then a perturbation of T that increases its
                       effective complexity might decrease or increase the test error.

                  Hypothesis 1 is informal in several ways. We do not have a principled way to choose the parameter
                  <<FORMULA>> (and currently use <<FORMULA>> as a heuristic). We also do not yet have a formal specification for
                  “sufficiently smaller” and “sufficiently larger”. Our experiments suggest that there is a critical
                  interval around the interpolation threshold when <<FORMULA>>: below and above this interval
                  increasing complexity helps performance, while within this interval it may hurt performance. The
                  width of the critical interval depends on both the distribution and the training procedure in ways we
                  do not yet completely understand.

                  We believe Hypothesis 1 sheds light on the interaction between optimization algorithms, model size,
                  and test performance and helps reconcile some of the competing intuitions about them. The main
                  result of this paper is an experimental validation of Hypothesis 1 under a variety of settings, where
                  we considered several natural choices of datasets, architectures, and optimization algorithms, and
                  we changed the “interpolation threshold” by varying the number of model parameters, the length of
                  training, the amount of label noise in the distribution, and the number of train samples.

                  Model-wise Double Descent. In Section 5, we study the test error of models of increasing size,
                  for a fixed large number of optimization steps. We show that “model-wise double descent” occurs
                  for various modern datasets (CIFAR-10, CIFAR-100, IWSLT'14 de-en, with varying amounts of
                  label noise), model architectures (CNNs, ResNets, Transformers), optimizers (SGD, Adam), number
                  of train samples, and training procedures (data augmentation and regularization). Moreover, the
                  peak in test error systematically occurs at the interpolation threshold. In particular, we demonstrate
                  realistic settings in which bigger models are worse.

                 Epoch-wise Double Descent. In Section 6, we study the test error of a fixed, large architecture over
                 the course of training. We demonstrate, in similar settings as above, a corresponding peak in test
                 error when models are trained just long enough to reach <<FORMULA>> train error. The test error of a
                 large model first decreases (at the beginning of training), then increases (around the critical regime),
                 then decreases once more (at the end of training); that is, training longer can correct overfitting.

                 Sample-wise Non-monotonicity. In Section 7, we study the test error of a fixed model and training
                 procedure, for varying number of train samples. Consistent with our generalized double-descent
                 hypothesis, we observe distinct test behavior in the “critical regime”, when the number of samples
                 is near the maximum that the model can fit. This often manifests as a long plateau region, in which
                 taking significantly more data might not help when training to completion (as is the case for CNNs on
                 CIFAR-10). Moreover, we show settings (Transformers on IWSLT'14 de-en) where this manifests
                 as a peak, and for a fixed architecture and training procedure, more data actually hurts.

                 Remarks on Label Noise. We observe all forms of double descent most strongly in settings with
                 label noise in the train set (as is often the case when collecting train data in the real world).
                 However, we also show several realistic settings with a test-error peak even without label noise: ResNets
                 (Figure 4a) and CNNs (Figure 20) on CIFAR-100; Transformers on IWSLT'14 (Figure 8).
                 Moreover, all our experiments demonstrate distinctly different test behavior in the critical regime, often
                 manifesting as a “plateau” in the test error in the noiseless case which develops into a peak with
                 added label noise. See Section 8 for further discussion.

                  3 RELATED WORK

                 Model-wise double descent was ﬁrst proposed as a general phenomenon by Belkin et al. (2018).
                 Similar behavior had been observed in Opper (1995; 2001), Advani & Saxe (2017), Spigler et al.
                 (2018), and Geiger et al. (2019b). Subsequently, there has been a large body of work studying the
                 double descent phenomenon. A growing list of papers that theoretically analyze it in the tractable
                 setting of linear least squares regression includes Belkin et al. (2019); Hastie et al. (2019); Bartlett
                 et al. (2019); Muthukumar et al. (2019); Bibas et al. (2019); Mitra (2019); Mei & Montanari (2019).
                 Moreover, Geiger et al. (2019a) provide preliminary results for model-wise double descent in con-
                 volutional networks trained on CIFAR-10. Our work differs from the above papers in two crucial
                 aspects: First, we extend the idea of double-descent beyond the number of parameters to incorpo-
                 rate the training procedure under a uniﬁed notion of “Effective Model Complexity”, leading to novel
                 insights like epoch-wise double descent and sample non-monotonicity. The notion that increasing
                 train time corresponds to increasing complexity was also presented in Nakkiran et al. (2019). Sec-
                 ond, we provide an extensive and rigorous demonstration of double-descent for modern practices
                 spanning a variety of architectures, datasets, and optimization procedures. An extended discussion of the
                 related work is provided in Appendix C.

                  4 EXPERIMENTAL SETUP

                 We briefly describe the experimental setup here; full details are in Appendix B. We consider three
                 families of architectures: ResNets, standard CNNs, and Transformers.

                 ResNets: We parameterize a family of ResNet18s (He et al. (2016)) by scaling the width (number of
                 filters) of convolutional layers. Specifically, we use layer widths [k, 2k, 4k, 8k] for varying k. The
                 standard ResNet18 corresponds to k = 64.

                 Standard CNNs: We consider a simple family of 5-layer CNNs, with 4 convolutional layers of widths
                 [k, 2k, 4k, 8k] for varying k, and a fully-connected layer. For context, the CNN with width k = 64
                 can reach over 90% test accuracy on CIFAR-10 with data augmentation.

                 Transformers: We consider the 6-layer encoder-decoder from Vaswani et al. (2017), as implemented
                 by Ott et al. (2019). We scale the size of the network by modifying the embedding dimension
                 d_model, and setting the width of the fully-connected layers proportionally (<<FORMULA>>).

                 The raw data from our experiments are available at: https://gitlab.com/harvard-machine-learning/double-descent/tree/master
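To make the width parameterization concrete, the following sketch counts the trainable parameters of a width-k member of the 5-layer CNN family. Only the layer widths [k, 2k, 4k, 8k] and the single fully-connected layer come from the text; the 3x3 kernels, 3 input channels, 10 output classes, bias terms, absence of batch normalization, and global pooling before the fully-connected layer are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List


def cnn_widths(k: int) -> List[int]:
    """Channel widths of the four convolutional layers for width parameter k."""
    return [k, 2 * k, 4 * k, 8 * k]


def cnn_param_count(k: int, in_channels: int = 3, kernel: int = 3,
                    num_classes: int = 10) -> int:
    """Parameter count under the assumptions above (weights plus biases)."""
    total = 0
    c_in = in_channels
    for c_out in cnn_widths(k):
        total += c_in * c_out * kernel * kernel + c_out   # conv weights + biases
        c_in = c_out
    total += c_in * num_classes + num_classes             # fully-connected layer
    return total


for k in (1, 4, 16, 64):
    print(k, cnn_param_count(k))
```

Because each convolution has roughly c_in * c_out weights, the count grows roughly quadratically in k, so sweeping k covers several orders of magnitude of model size.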

                 For ResNets and CNNs, we train with cross-entropy loss, and the following optimizers: (1) Adam
                 with learning rate 0.0001 for 4K epochs; (2) SGD with learning rate $\propto 1/\sqrt{T}$ for 500K gradient steps.
                 We train Transformers for 80K gradient steps, with 10% label smoothing and no drop-out.
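An inverse-square-root learning-rate schedule of the kind used here for SGD can be sketched as follows. The base rate `lr0` and the update interval `steps_per_update` are hypothetical parameters chosen for illustration; the text only states that the rate decays as the inverse square root of training time.

```python
import math


def inverse_sqrt_lr(step: int, lr0: float = 0.1, steps_per_update: int = 1) -> float:
    """Learning rate lr0 / sqrt(1 + t), where t counts completed update intervals."""
    if step < 0:
        raise ValueError("step must be non-negative")
    t = step // steps_per_update
    return lr0 / math.sqrt(1 + t)


# The rate is held constant within each interval and decays between intervals.
for step in (0, 511, 512, 2048):
    print(step, inverse_sqrt_lr(step, lr0=0.1, steps_per_update=512))
```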

                 Label Noise. In our experiments, label noise of probability p refers to training on samples which
                  have the correct label with probability ($1 - p$), and a uniformly random incorrect label otherwise
                  (label noise is sampled only once and not per epoch). Figure 1 plots test error on the noisy
                  distribution, while the remaining figures plot test error with respect to the clean distribution (the two curves
                  are just linear rescalings of one another).
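The label-noise procedure and the linear relation between the noisy and clean test-error curves can be sketched as below. The function names are hypothetical. The rescaling formula is a short derivation, not taken from the text: it assumes that test-label noise is independent of the classifier's prediction, so that a flipped test label coincides with a wrong prediction with probability 1/(K - 1) for K classes.

```python
import random
from typing import List, Sequence


def add_label_noise(labels: Sequence[int], p: float, num_classes: int,
                    seed: int = 0) -> List[int]:
    """Replace each label, with probability p, by a uniformly random incorrect class.

    Call this once before training so that the same noisy labels are reused
    in every epoch (noise is not resampled per epoch).
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            other = rng.randrange(num_classes - 1)      # index among the K-1 wrong classes
            noisy.append(other if other < y else other + 1)  # skip the true class y
        else:
            noisy.append(y)
    return noisy


def noisy_test_error(clean_error: float, p: float, num_classes: int) -> float:
    """Error measured against noisy labels, as a linear function of the clean error."""
    return p + clean_error * (1.0 - p - p / (num_classes - 1))


labels = [i % 10 for i in range(10000)]
noisy = add_label_noise(labels, p=0.15, num_classes=10)
print(sum(a != b for a, b in zip(labels, noisy)) / len(labels))
```

Since `noisy_test_error` is affine in `clean_error` with a positive slope for moderate p, peaks and troughs of the test-error curve sit at the same model sizes on either distribution.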

                  5 MODEL-WISE DOUBLE DESCENT

                                            <<FIGURE>>

                 Figure 4: Model-wise double descent for ResNet18s. Trained on CIFAR-100 and CIFAR-10, with
                 varying label noise. Optimized using Adam with LR 0.0001 for 4K epochs, and data augmentation.

                 In this section, we study the test error of models of increasing size, when training to completion
                 (for a ﬁxed large number of optimization steps). We demonstrate model-wise double descent across
                 different architectures, datasets, optimizers, and training procedures. The critical region exhibits
                 distinctly different test behavior around the interpolation point and there is often a peak in test error
                 that becomes more prominent in settings with label noise.
                 For the experiments in this section (Figures 4, 5, 6, 7, 8), notice that all modiﬁcations which increase
                 the interpolation threshold (such as adding label noise, using data augmentation, and increasing the
                 number of train samples) also correspondingly shift the peak in test error towards larger models.
                 Additional plots showing the early-stopping behavior of these models, and additional experiments
                 showing double descent in settings with no label noise (e.g. Figure 19) are in Appendix E.2. We
                 also observed model-wise double descent for adversarial training, with a prominent robust test error
                 peak even in settings without label noise. See Figure 26 in Appendix E.2.

                 Discussion. Fully understanding the mechanisms behind model-wise double descent in deep neu-
                 ral networks remains an important open question. However, an analog of model-wise double descent
                 occurs even for linear models. A recent stream of theoretical works analyzes this setting (Bartlett
                 et al. (2019); Muthukumar et al. (2019); Belkin et al. (2019); Mei & Montanari (2019); Hastie et al.
                 (2019)). We believe similar mechanisms may be at work in deep neural networks.
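The linear-model analog mentioned above can be reproduced in a few lines. The sketch below is a synthetic illustration under stated assumptions, not an experiment from this paper: Gaussian features, a linear ground truth over `total_features` coordinates, a model that only sees the first `n_features` of them (so it is mis-specified when `n_features < total_features`), and minimum-norm least squares computed with `numpy.linalg.pinv`. All names and constants are hypothetical.

```python
import numpy as np


def min_norm_test_mse(n_train: int, n_features: int, total_features: int = 200,
                      noise_std: float = 0.3, trials: int = 20, seed: int = 0) -> float:
    """Average test MSE of minimum-norm least squares on the first n_features features."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        beta = rng.standard_normal(total_features) / np.sqrt(total_features)
        x_train = rng.standard_normal((n_train, total_features))
        x_test = rng.standard_normal((2000, total_features))
        y_train = x_train @ beta + noise_std * rng.standard_normal(n_train)
        y_test = x_test @ beta + noise_std * rng.standard_normal(2000)
        # pinv gives the ordinary least-squares fit when n_features < n_train and
        # the minimum-norm interpolating solution when n_features > n_train.
        w = np.linalg.pinv(x_train[:, :n_features]) @ y_train
        errors.append(np.mean((x_test[:, :n_features] @ w - y_test) ** 2))
    return float(np.mean(errors))


n = 40
for d in (10, 20, 40, 80, 160):
    print(d, min_norm_test_mse(n, d))
```

With n = 40 train samples, the test error is expected to spike near d = 40, where the model first interpolates, and to fall again as d grows, which is the model-wise double descent shape.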
                 Informally, our intuition is that for model sizes at the interpolation threshold, there is effectively
                 only one model that fits the train data, and this interpolating model is very sensitive to noise in the
                 train set and/or model mis-specification. That is, since the model is just barely able to fit the train
                 data, forcing it to fit even slightly noisy or mis-specified labels will destroy its global structure and
                 result in high test error. (See Figure 28 in the Appendix for an experiment demonstrating this noise
                 sensitivity, by showing that ensembling helps significantly in the critically parameterized regime.)
                 However, for over-parameterized models, there are many interpolating models that fit the train set,
                 and SGD is able to find one that “memorizes” (or “absorbs”) the noise while still performing well
                 on the distribution.
                 The above intuition is theoretically justified for linear models. In general, this situation manifests
                 even without label noise for linear models (Mei & Montanari (2019)), and occurs whenever there
                 is model mis-specification between the structure of the true distribution and the model family. We
                 believe this intuition extends to deep learning as well, and it is consistent with our experiments.

                         <<FIGURE>>

                 Figure 5: Effect of Data Augmentation. 5-layer CNNs on CIFAR-10, with and without data
                 augmentation. Data augmentation shifts the interpolation threshold to the right, shifting the test
                 error peak accordingly. Optimized using SGD for 500K steps. See Figure 27 for larger models.

                         <<FIGURE>>

                 Figure 6: SGD vs. Adam. 5-layer CNNs on CIFAR-10 with no label noise, and no data augmentation.
                 Optimized using SGD for 500K gradient steps, and Adam for 4K epochs.

                         <<FIGURE>>

                 Figure 7: Noiseless settings. 5-layer CNNs on CIFAR-100 with no label noise; note the peak in
                 test error. Trained with SGD and no data augmentation. See Figure 20 for the early-stopping
                 behavior of these models.

                         <<FIGURE>>

                 Figure 8: Transformers on language translation tasks: Multi-head-attention encoder-decoder
                 Transformer model trained for 80k gradient steps with label-smoothed cross-entropy loss on
                 IWSLT'14 German-to-English (160K sentences) and WMT'14 English-to-French (subsampled to
                 200K sentences) datasets. Test loss is measured as per-token perplexity.



                  6 EPOCH-WISE DOUBLE DESCENT

                  In this section, we demonstrate a novel form of double-descent with respect to training epochs,
                  which is consistent with our uniﬁed view of effective model complexity (EMC) and the generalized
                  double descent hypothesis. Increasing the train time increases the EMC—and thus a sufﬁciently
                  large model transitions from under- to over-parameterized over the course of training.

                                      <<FIGURE>>

                 Figure 9: Left: Training dynamics for models in three regimes. Models are ResNet18s on CIFAR-10
                 with 20% label noise, trained using Adam with learning rate 0.0001, and data augmentation. Right:
                 Test error over (Model size × Epochs). Three slices of this plot are shown on the left.


                 As illustrated in Figure 9, sufﬁciently large models can undergo a “double descent” behavior where
                 test error ﬁrst decreases then increases near the interpolation threshold, and then decreases again. In
                 contrast, for “medium sized” models, for which training to completion will only barely reach ≈ 0
                 error, the test error as a function of training time will follow a classical U-like curve where it is
                 better to stop early. Models that are too small to reach the approximation threshold will remain in
                 the “under parameterized” regime where increasing train time monotonically decreases test error.
                 Our experiments (Figure 10) show that many settings of dataset and architecture exhibit epoch-wise
                 double descent, in the presence of label noise. Further, this phenomenon is robust across optimizer
                 variations and learning rate schedules (see additional experiments in Appendix E.1). As in model-
                 wise double descent, the test error peak is accentuated with label noise.
                 Conventional wisdom suggests that training is split into two phases: (1) in the first phase, the
                 network learns a function with a small generalization gap; (2) in the second phase, the network starts
                 to overfit the data, leading to an increase in test error. Our experiments suggest that this is not the
                 complete picture: in some regimes, the test error decreases again and may achieve a lower value at
                 the end of training as compared to the first minimum (see Figure 10 for 10% label noise).
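One simple way to check a logged test-error curve for the descent, ascent, descent pattern described above is sketched below. The function name, the `min_rise` threshold, and the example curves are hypothetical; the numbers are made up purely to exercise the code and are not experimental results.

```python
from typing import Optional, Sequence, Tuple


def find_double_descent(test_error: Sequence[float],
                        min_rise: float = 0.0) -> Optional[Tuple[int, int]]:
    """Return (first_min_epoch, peak_epoch) if the curve first descends, then rises
    to a peak, then descends again; return None otherwise.

    A peak qualifies when it exceeds both the best error before it and the best
    error after it by more than `min_rise`. The most prominent peak is returned.
    """
    best = None  # (prominence, first_min_epoch, peak_epoch)
    for peak in range(2, len(test_error) - 1):
        left = min(range(peak), key=lambda i: test_error[i])
        right_min = min(test_error[peak + 1:])
        prominence = test_error[peak] - max(test_error[left], right_min)
        descended_first = test_error[left] < test_error[0]
        if descended_first and prominence > min_rise:
            if best is None or prominence > best[0]:
                best = (prominence, left, peak)
    return None if best is None else (best[1], best[2])


# Made-up curves, one value per logged epoch.
u_shaped = [0.5, 0.3, 0.2, 0.25, 0.3, 0.35]
double_descent = [0.5, 0.3, 0.2, 0.3, 0.35, 0.25, 0.15]
print(find_double_descent(u_shaped), find_double_descent(double_descent))
```

On noisy logs, a positive `min_rise` (or smoothing the curve first) avoids reporting small fluctuations as a second descent.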

                               <<FIGURE>>

                 Figure 10: Epoch-wise double descent for ResNet18 and CNN (width=128). ResNets trained using
                 Adam with learning rate 0.0001, and CNNs trained with SGD with inverse-square root learning rate.


                  7 SAMPLE-WISE NON-MONOTONICITY

                  In this section, we investigate the effect of varying the number of train samples, for a ﬁxed model and
                  training procedure. Previously, in model-wise and epoch-wise double descent, we explored behavior
                  in the critical regime, where <<FORMULA>>, by varying the EMC. Here, we explore the critical
                  regime by varying the number of train samples n. By increasing n, the same training procedure T
                  can switch from being effectively over-parameterized to effectively under-parameterized.
                  We show that increasing the number of samples has two different effects on the test error vs. model
                  complexity graph. On the one hand, (as expected) increasing the number of samples shrinks the area
                  under the curve. On the other hand, increasing the number of samples also has the effect of “shifting
                  the curve to the right” and increasing the model complexity at which test error peaks.

                                          <<FIGURE>>

                                    Figure 11: Sample-wise non-monotonicity.


                  These twin effects are shown in Figure 11a. Note that there is a range of model sizes where the
                 effects “cancel out”, and having 4× more train samples does not help test performance when
                  training to completion. Outside the critically-parameterized regime, for sufﬁciently under- or over-
                  parameterized models, having more samples helps. This phenomenon is corroborated in Figure 12,
                  which shows test error as a function of both model and sample size, in the same setting as Figure 11a.

                                                  <<FIGURE>>

                 Figure 12: Left: Test error as a function of model size and number of train samples, for 5-layer
                 CNNs on CIFAR-10 + 20% noise. Note the ridge of high test error again lies along the interpolation
                 threshold. Right: Three slices of the left plot, showing the effect of more data for models of
                 different sizes. Note that, when training to completion, more data helps for small and large models,
                 but does not help for near-critically-parameterized models (green).

                 In some settings, these two effects combine to yield a regime of model sizes where more data actually
                 hurts test performance as in Figure 3 (see also Figure 11b). Note that this phenomenon is not unique
                 to DNNs: more data can hurt even for linear models (see Appendix D).
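The linear-model case can be illustrated with a small simulation. This is a synthetic sketch under assumptions that are not taken from the paper: Gaussian features, a well-specified linear ground truth with `n_features` coordinates, Gaussian label noise, and minimum-norm least squares via `numpy.linalg.pinv`. The function name and constants are hypothetical.

```python
import numpy as np


def mse_for_sample_size(n_train: int, n_features: int = 40, noise_std: float = 0.5,
                        trials: int = 20, seed: int = 0) -> float:
    """Average test MSE of minimum-norm least squares trained on n_train samples."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        beta = rng.standard_normal(n_features) / np.sqrt(n_features)
        x_train = rng.standard_normal((n_train, n_features))
        x_test = rng.standard_normal((2000, n_features))
        y_train = x_train @ beta + noise_std * rng.standard_normal(n_train)
        y_test = x_test @ beta + noise_std * rng.standard_normal(2000)
        w = np.linalg.pinv(x_train) @ y_train   # min-norm solution when n_train < n_features
        errors.append(np.mean((x_test @ w - y_test) ** 2))
    return float(np.mean(errors))


# The model (40 features) is fixed; only the number of train samples varies.
for n in (20, 40, 160):
    print(n, mse_for_sample_size(n))
```

With the model fixed at 40 features, going from 20 to 40 samples moves the procedure into the critical regime and is expected to raise test error, while 160 samples places it well into the under-parameterized regime where more data helps again.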

                  8 CONCLUSION AND DISCUSSION

                 We introduce a generalized double descent hypothesis: models and training procedures exhibit atyp-
                 ical behavior when their Effective Model Complexity is comparable to the number of train samples.
                 We provide extensive evidence for our hypothesis in modern deep learning settings, and show that
                 it is robust to choices of dataset, architecture, and training procedures. In particular, we demon-
                 strate “model-wise double descent” for modern deep networks and characterize the regime where
                 bigger models can perform worse. We also demonstrate “epoch-wise double descent,” which, to the
                 best of our knowledge, has not been previously proposed. Finally, we show that the double descent
                 phenomenon can lead to a regime where training on more data leads to worse test performance.
                 Preliminary results suggest that double descent also holds as we vary the amount of regularization
                 for a ﬁxed model (see Figure 22).
                 We also believe our characterization of the critical regime provides a useful way of thinking for
                 practitioners—if a model and training procedure are just barely able to ﬁt the train set, then small
                 changes to the model or training procedure may yield unexpected behavior (e.g. making the model
                 slightly larger or smaller, changing regularization, etc. may hurt test performance).

                 Early stopping. We note that many of the phenomena that we highlight often do not occur with
                 optimal early-stopping. However, this is consistent with our generalized double descent hypothesis:
                 if early stopping prevents models from reaching 0 train error then we would not expect to see double
                 descent, since the EMC does not reach the number of train samples. Further, we show at least one
                 setting where model-wise double descent can still occur even with optimal early stopping (ResNets
                 on CIFAR-100 with no label noise, see Figure 19). We have not observed settings where more data
                 hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude
                 this from occurring. We leave fully understanding the optimal early stopping behavior of double
                 descent as an important open question for future work.

                 Label Noise. In our experiments, we observe double descent most strongly in settings with label
                 noise. However, we believe this effect is not fundamentally about label noise, but rather about
                 model mis-speciﬁcation. For example, consider a setting where the label noise is not truly random,
                 but rather pseudorandom (with respect to the family of classiﬁers being trained). In this setting,
                 the performance of the Bayes optimal classiﬁer would not change (since the pseudorandom noise
                 is deterministic, and invertible), but we would observe an identical double descent as with truly
                 random label noise. Thus, we view adding label noise as merely a proxy for making distributions
                 “harder”— i.e. increasing the amount of model mis-speciﬁcation.

                 Other Notions of Model Complexity. Our notion of Effective Model Complexity is related to
                 classical complexity notions such as Rademacher complexity, but differs in several crucial ways:
                 (1) EMC depends on the true labels of the data distribution, and (2) EMC depends on the training
                 procedure, not just the model architecture.
                 Other notions of model complexity which do not incorporate features (1) and (2) would not sufﬁce
                 to characterize the location of the double-descent peak. Rademacher complexity, for example, is
                 determined by the ability of a model architecture to ﬁt a randomly-labeled train set. But Rademacher
                 complexity and VC dimension are both insufﬁcient to determine the model-wise double descent
                 peak location, since they do not depend on the distribution of labels— and our experiments show
                 that adding label noise shifts the location of the peak.
                 Moreover, both Rademacher complexity and VC dimension depend only on the model family and
                 data distribution, and not on the training procedure used to ﬁnd models. Thus, they are not capable
                 of capturing train-time double-descent effects, such as “epoch-wise” double descent, and the effect
                 of data-augmentation on the peak location.
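For reference, the standard textbook definition of empirical Rademacher complexity makes the label independence explicit. The definition is not stated in the text above, and the symbols $\mathcal{F}$ (the function class), $x_1, \dots, x_n$ (the train inputs), and $\sigma_i$ (independent uniform $\pm 1$ signs) are introduced here for illustration:

```latex
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{F}}
      \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, f(x_i) \right],
\qquad \sigma_i \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}\{-1, +1\}.
```

Neither the true labels $y_i$ nor the training procedure $T$ appears in this expression. That is consistent with the argument above: a quantity of this form cannot move when label noise is added or when training time changes, whereas the observed peak location does.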

                  ACKNOWLEDGMENTS
                 We thank Mikhail Belkin for extremely useful discussions in the early stages of this work. We
                 thank Christopher Olah for suggesting the Model Size × Epoch visualization, which led to the
                 investigation of epoch-wise double descent, as well as for useful discussion and feedback. We also
                 thank Alec Radford, Jacob Steinhardt, and Vaishaal Shankar for helpful discussion and suggestions.
                 P.N. thanks OpenAI, the Simons Institute, and the Harvard Theory Group for a research environment
                 that enabled this kind of work.
                 We thank Dimitris Kalimeris, Benjamin L. Edelman, Sharon Qian, and Aditya Ramesh for
                 comments on an early draft of this work.
                 This work was supported in part by NSF grant CAREER CCF 1452961, BSF grant 2014389, NSF US-
                 ICCS proposal 1540428, a Google Research award, a Facebook research award, a Simons
                 Investigator Award, a Simons Investigator Fellowship, and NSF Awards CCF 1715187, CCF 1565264, CCF
                 1301976, IIS 1409097, and CNS 1618026. Y.B. would like to thank the MIT-IBM Watson AI Lab
                 for contributing computational resources for experiments.


                  REFERENCES
                 Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural
                   networks. arXiv preprint arXiv:1710.03667, 2017.

                 Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear
                   regression. arXiv preprint arXiv:1906.11300, 2019.

                 Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning
                   and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

                 Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv
                   preprint arXiv:1903.07571, 2019.

                 Koby Bibas, Yaniv Fogel, and Meir Feder. A new look at an old problem: A universal learning
                   approach to linear regression. arXiv preprint arXiv:1905.04708, 2019.

                 Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web inventory of transcribed and
                   translated talks. In Proceedings of the 16th Conference of the European Association for Machine
                   Translation (EAMT), pp. 261–268, Trento, Italy, May 2012.

                 Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli,
                   Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with
                   number of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019a.

                 Mario Geiger, Stefano Spigler, Stéphane d’Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli,
                   and Matthieu Wyart. Jamming transition as a paradigm to understand the loss landscape of deep
                   neural networks. Physical Review E, 100(1):012115, 2019b.

                 Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
                   examples. arXiv preprint arXiv:1412.6572, 2014.

                 Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical
                   learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85,
                   2005.

                 Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-
                   dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

                 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
                   networks. In European conference on computer vision, pp. 630–645. Springer, 2016.

                 Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and
                   Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism.
                   CoRR, abs/1811.06965, 2018. URL http://arxiv.org/abs/1811.06965.

                 Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

                 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
                   convolutional neural networks. In Advances in neural information processing systems,
                   pp. 1097–1105, 2012.

                 Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
                   Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083,
                   2017.

                 Song Mei and Andrea Montanari. The generalization error of random features regression: Precise
                   asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

                 Partha P. Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for
                   l2 and l1 penalized interpolation. ArXiv, abs/1906.03667, 2019.

                 Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in
                   regression. arXiv preprint arXiv:1903.09139, 2019.

                 Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L Edelman, Fred
                   Zhang, and Boaz Barak. Sgd on neural networks learns functions of increasing complexity. arXiv
                   preprint arXiv:1905.11604, 2019.

                 Manfred Opper. Statistical mechanics of learning: Generalization. The Handbook of Brain Theory
                   and Neural Networks, pp. 922–925, 1995.

                 Manfred Opper. Learning to generalize. Frontiers of Life, 3(part 2), pp. 763–775, 2001.

                 Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier,
                   and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of
                   NAACL-HLT 2019: Demonstrations, 2019.

                 David Page. How to train your resnet. https://myrtle.ai/how-to-train-your-resnet-4-architecture/, 2018.

                 Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
                   Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
                   PyTorch. In NeurIPS Autodiff Workshop, 2017.

                 Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
                   models are unsupervised multitask learners. 2019.

                 Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in
                   neural information processing systems, pp. 1177–1184, 2008.

                 Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with
                   subword units. ArXiv, abs/1508.07909, 2015.

                 Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart.
                   A jamming transition from under- to over-parametrization affects loss landscape and
                   generalization. arXiv preprint arXiv:1810.09665, 2018.

                 Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
                   Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In
                   Computer Vision and Pattern Recognition (CVPR), 2015. URL http://arxiv.org/abs/1409.4842.

                 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
                   Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

                 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding
                   deep learning requires rethinking generalization. ICLR, abs/1611.03530, 2016.


                           A SUMMARY TABLE OF EXPERIMENTAL RESULTS

                                     <<TABLE>>


                  B APPENDIX: EXPERIMENTAL DETAILS

                  B.1 MODELS

                 We use the following families of architectures. The PyTorch (Paszke et al., 2017)
                 specification of our ResNets and CNNs is available at https://gitlab.com/harvard-machine-learning/double-descent/tree/master.

                 ResNets. We define a family of ResNet18s of increasing size as follows. We follow the
                 Preactivation ResNet18 architecture of He et al. (2016), using 4 ResNet blocks, each consisting of two
                 BatchNorm-ReLU-Convolution layers. The layer widths for the 4 blocks are [k, 2k, 4k, 8k] for
                 varying k ∈ N and the strides are [1, 2, 2, 2]. The standard ResNet18 corresponds to k = 64
                 convolutional channels in the first layer. The scaling of model size with k is shown in Figure 13b. Our
                 implementation is adapted from https://github.com/kuangliu/pytorch-cifar.

                 Standard CNNs. We consider a simple family of 5-layer CNNs, with four Conv-BatchNorm-
                 ReLU-MaxPool layers and a fully-connected output layer. We scale the four convolutional layer
                 widths as [k, 2k, 4k, 8k]. The MaxPool sizes are [1, 2, 2, 8]. For all the convolution layers, the kernel
                 size is 3, the stride is 1 and the padding is 1. This architecture is based on the “backbone” architecture from
                 Page (2018). For k = 64, this CNN has 1558026 parameters and can reach >90% test accuracy on
                 CIFAR-10 (Krizhevsky, 2009) with data-augmentation. The scaling of model size with k is shown
                 in Figure 13a.
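
                 As a sanity check on the parameter count quoted above, the following minimal sketch tallies
                 the trainable parameters of this CNN family as a function of k. It assumes details the text
                 does not state: 3 input channels, 10 output classes, convolutions with bias terms, and
                 BatchNorm layers with a learnable scale and shift. The function name is illustrative only.

```python
def cnn_param_count(k: int, in_channels: int = 3, num_classes: int = 10) -> int:
    """Count trainable parameters of the 5-layer CNN with widths [k, 2k, 4k, 8k]."""
    widths = [k, 2 * k, 4 * k, 8 * k]
    total = 0
    prev = in_channels
    for w in widths:
        total += 3 * 3 * prev * w + w  # 3x3 convolution weights plus bias (assumed)
        total += 2 * w                 # BatchNorm scale and shift (assumed affine)
        prev = w
    # MaxPool sizes [1, 2, 2, 8] shrink a 32x32 input to 1x1, so the
    # fully-connected layer sees 8k features.
    total += widths[-1] * num_classes + num_classes
    return total


print(cnn_param_count(64))  # 1558026
```

                 Under these assumptions the count for k = 64 agrees with the figure given in the text, and
                 the dominant term grows quadratically in k.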

                 Transformers. We consider the encoder-decoder Transformer model from Vaswani et al. (2017)
                 with 6 layers and 8 attention heads per layer, as implemented by fairseq (Ott et al., 2019). We scale
                 the size of the network by modifying the embedding dimension (d_model), and scale the width of the
                 fully-connected layers proportionally (d_ff = 4 · d_model). We train with 10% label smoothing and no
                 drop-out, for 80 gradient steps.

                                         <<FIGURE>>            

                   Figure 13: Scaling of model size with our parameterization of width & embedding dimension.


                  B.2 IMAGE CLASSIFICATION: EXPERIMENTAL SETUP

                  We describe the details of training for CNNs and ResNets below.
                 Loss function: Unless stated otherwise, we use the cross-entropy loss for all the experiments.
                 Data-augmentation: In experiments where data-augmentation was used, we apply
                 RandomCrop(32, padding=4) and RandomHorizontalFlip. In experiments with
                 added label noise, all augmentations of a given training sample are given the same
                 label.
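
                 One way to keep labels consistent across augmentations is to draw the noisy labels once,
                 per training index, before any augmentation is applied. The sketch below illustrates this
                 under an assumed noise model (with probability p a label is replaced by a uniformly random
                 different class); the helper names are hypothetical and not taken from the released code.

```python
import random


def add_label_noise(labels, p, num_classes, seed=0):
    """Return a noisy copy of `labels`, drawn once with a fixed seed."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            # Assumed noise model: pick a uniformly random *different* class.
            wrong = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(wrong))
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(1000)]
noisy = add_label_noise(clean, p=0.2, num_classes=10)


def sample(index, augment):
    """Return (augmented input placeholder, label); the label depends only on the index."""
    return augment(index), noisy[index]
```

                 Because the label is looked up by index, every crop or flip of the same training sample
                 carries the same (possibly corrupted) label.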
                 Regularization: No explicit regularization like weight decay or dropout was applied unless
                 explicitly stated.
                 Initialization: We use the default initialization provided by PyTorch for all the layers.
                  Optimization:

                      Adam: Unless specified otherwise, the learning rate was held constant at 1e-4 and all other
                       parameters were set to their default PyTorch values.
                      SGD: Unless specified otherwise, the inverse-square root learning rate schedule (defined
                       below) was used with initial learning rate <<FORMULA>> and updates every L=512 gradient steps.
                       No momentum was used.

                  We found our results are robust to various other natural choices of optimizers and learning rate
                  schedule. We used the above settings because (1) they optimize well, and (2) they do not require
                  experiment-speciﬁc hyperparameter tuning, and allow us to use the same optimization across many
                  experiments.
                 Batch size: All experiments use a batch size of 128.
                  Learning rate schedule descriptions:

                      Inverse-square root (<<FORMULA>>): At gradient step t, the learning rate is set to <<FORMULA>>. We set the
                       learning rate with respect to the number of gradient steps, and not epochs,
                       in order to allow comparison between experiments with varying train-set sizes.
                      Dynamic drop (<<FORMULA>>, drop, patience): Starts with an initial learning rate of γ_0 and drops by
                       a factor of ’drop’ if the training loss has remained constant or become worse for ’patience’
                       number of gradient steps.
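
                 The two schedules can be sketched as follows. The exact inverse-square root formula is
                 not recoverable from the text here, so the form gamma0 / sqrt(1 + floor(t / L)) is an
                 assumption chosen to match “updates every L=512 gradient steps”. Likewise, dividing the
                 learning rate by `drop` and resetting the patience counter after each drop are assumed
                 details; all names are illustrative.

```python
import math


def inverse_sqrt_lr(gamma0: float, t: int, L: int = 512) -> float:
    """Assumed form: gamma0 / sqrt(1 + floor(t / L)), constant within each block of L steps."""
    return gamma0 / math.sqrt(1 + t // L)


class DynamicDrop:
    """Divide the learning rate by `drop` after `patience` steps without improvement."""

    def __init__(self, gamma0: float, drop: float, patience: int):
        self.lr = gamma0
        self.drop = drop
        self.patience = patience
        self.best = math.inf
        self.stale = 0  # consecutive steps with no improvement in training loss

    def step(self, train_loss: float) -> float:
        if train_loss < self.best:
            self.best = train_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= self.drop
                self.stale = 0
        return self.lr
```

                 Indexing both schedules by gradient step rather than epoch keeps them comparable across
                 train sets of different sizes, as noted above.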

                  B.3 NEURAL MACHINE TRANSLATION: EXPERIMENTAL SETUP

                 Here we describe the experimental setup for the neural machine translation experiments.
                 Training procedure.

                 In this setting, the distribution D consists of triples

                                   <<FORMULA>>
                                   
                 where V_src and V_tgt are the source and target vocabularies, the string x is a sentence in the source
                 language, y is its translation in the target language, and i is the index of the token to be predicted by
                 the model. We assume that <<FORMULA>> is distributed uniformly on <<FORMULA>>.
                  A standard probabilistic model deﬁnes an autoregressive factorization of the likelihood:

                                                <<FORMULA>>

                 Given a set of training samples S, we define

                                             <<FORMULA>>

                 In practice, S is not constructed from independent samples from D, but rather by first sampling
                 <<(x,y)>> and then including all <<FORMULA>> in S.
                  For training transformers, we replicate the optimization procedure speciﬁed in Vaswani et al. (2017)
                  section 5.3, where the learning rate schedule consists of a “warmup” phase with linearly increasing
                  learning rate followed by a phase with inverse square-root decay. We preprocess the data using byte
                  pair encoding (BPE) as described in Sennrich et al. (2015). We use the implementation provided by
                  fairseq (https://github.com/pytorch/fairseq).
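
                 For reference, the schedule in Vaswani et al. (2017), section 5.3, can be sketched as
                 below: a linear warmup followed by inverse square-root decay. The defaults shown
                 (d_model = 512, 4000 warmup steps) are the values from that paper and are assumptions
                 here, since the embedding dimension is varied in our experiments; the fairseq
                 implementation may parameterize the same shape differently.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for `warmup_steps` steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The two branches of the min meet at step == warmup_steps, where the rate peaks.
peak = transformer_lr(4000)
```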

                  Datasets. The IWSLT’14 German to English dataset contains TED Talks as described in Cettolo
                  et al. (2012). The WMT’14 English to French dataset is taken from http://www.statmt.org/wmt14/translation-task.html.

                  B.4 PER-SECTION EXPERIMENTAL DETAILS

                  Here we provide full details for experiments in the body, when not otherwise provided.
                  Introduction: Experimental Details Figure 1: All models were trained using Adam with learning-
                  rate 0.0001 for 4K epochs. Plotting means and standard deviations for 5 trials, with random network
                  initialization.

                  Model-wise Double Descent: Experimental Details Figure 7: Plotting means and standard devia-
                  tions for 5 trials, with random network initialization.
                 Sample-wise Nonmonotonicity: Experimental Details Figure 11a: All models are trained with
                  SGD for 500K epochs, and data-augmentation. Bottom: Means and standard deviations from 5
                  trials with random initialization, and random subsampling of the train set.

                       C EXTENDED DISCUSSION OF RELATED WORK

                 Belkin et al. (2018): This paper proposed, in very general terms, that the apparent contradiction
                 between traditional notions of the bias-variance trade-off and empirically successful practices in
                 deep learning can be reconciled under a double-descent curve: as model complexity increases, the
                 test error follows the traditional “U-shaped curve”, but beyond the point of interpolation, the error
                 starts to decrease. This work provides empirical evidence for the double-descent curve with fully
                 connected networks trained on subsets of MNIST, CIFAR10, SVHN and TIMIT datasets. They use
                 the ℓ2 loss for their experiments. They demonstrate that neural networks are not an aberration in this
                 regard: double-descent is a general phenomenon observed also in linear regression with random
                 features and random forests.

                 Theoretical works on linear least squares regression: A variety of papers have attempted to
                 theoretically analyze this behavior in restricted settings, particularly the case of least squares regression
                 under various assumptions on the training data, feature spaces and regularization method.

                     1. Advani & Saxe (2017); Hastie et al. (2019) both consider the linear regression problem
                        stated above and analyze the generalization behavior in the asymptotic limit <<FORMULA>>
                        using random matrix theory. Hastie et al. (2019) highlight that when the model is
                        mis-specified, the minimum of training error can occur for over-parameterized models.
                     2. Belkin et al. (2019) study linear least squares regression for two data models, where the input
                        data is sampled from a Gaussian and a Fourier series model for functions on a circle. They
                        provide a finite-sample analysis for these two cases.
                     3. Bartlett et al. (2019) provide generalization bounds for the minimum ℓ2-norm interpolant
                        for Gaussian features.
                     4. Muthukumar et al. (2019) characterize the fundamental limit of any interpolating
                        solution in the presence of noise and provide some interesting Fourier-theoretic interpretations.
                     5. Mei & Montanari (2019): This work provides asymptotic analysis for ridge regression over
                        random features.

                 Similar double descent behavior was investigated in Opper (1995; 2001).
                 Geiger et al. (2019b) showed that deep fully connected networks trained on the MNIST dataset with
                 hinge loss exhibit a “jamming transition” when the number of parameters exceeds a threshold that
                 allows training to near-zero train loss. Geiger et al. (2019a) provide further experiments on CIFAR-
                 10 with a convolutional network. They also highlight interesting behavior with ensembling around
                 the critical regime, which is consistent with our informal intuitions in Section 5 and our experiments
                 in Figures 28, 29.
                 Advani & Saxe (2017); Geiger et al. (2019b;a) also point out that double-descent is not observed
                 when optimal early-stopping is used.

                     D RANDOM FEATURES: A CASE STUDY

                              <<FIGURE>>

                 Figure 14: Random Fourier Features on the Fashion MNIST dataset. The setting is equivalent
                 to a two-layer neural network with e^{ix} activation, with a randomly-initialized first layer that is fixed
                 throughout training. The second layer is trained using gradient flow.


                 In this section, for completeness, we show that both the model- and sample-wise double
                 descent phenomena are not unique to deep neural networks: they exist even in the setting of Random
                 Fourier Features of Rahimi & Recht (2008). This setting is equivalent to a two-layer neural network
                 with <<FORMULA>> activation. The first layer is initialized with a N(0, 1) Gaussian distribution and then
                 fixed throughout training. The width (or embedding dimension) d of the first layer parameterizes
                 the model size. The second layer is initialized with 0s and trained with MSE loss.
                 Figure 14 shows the grid of Test Error as a function of both number of samples n and model size d.
                 Note that in this setting EMC = d (the embedding dimension). As a result, as demonstrated in the
                 figure, the peak follows the path of n = d. Both model-wise and sample-wise (see Figure 15) double
                 descent phenomena are captured, by horizontally and vertically crossing the grid, respectively.
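
                 The model described above can be sketched in a few lines: a fixed Gaussian first layer
                 followed by the e^{ix} activation, with a zero-initialized linear second layer on top.
                 The sketch covers only the forward pass; fitting the second layer with MSE loss is
                 omitted, and the function names, input dimension and seed are illustrative assumptions.

```python
import cmath
import random


def make_first_layer(input_dim: int, d: int, seed: int = 0):
    """Draw a d x input_dim first layer with N(0, 1) entries; it stays fixed afterwards."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(d)]


def random_fourier_features(x, W):
    """Apply the e^{ix} activation to each pre-activation <w, x>."""
    return [cmath.exp(1j * sum(w_i * x_i for w_i, x_i in zip(w, x))) for w in W]


def predict(x, W, second_layer):
    """Linear second layer on top of the d random features."""
    feats = random_fourier_features(x, W)
    return sum(a * f for a, f in zip(second_layer, feats))


W = make_first_layer(input_dim=4, d=8)
second_layer = [0j] * 8  # second layer initialized with zeros
```

                 Only `second_layer` is trained, so the number of trainable parameters equals the
                 embedding dimension d.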

                                    <<FIGURE>>

                 Figure 15: Sample-wise double-descent slice for Random Fourier Features on the Fashion MNIST
                 dataset. In this ﬁgure the embedding dimension (number of random features) is 1000.

                            E APPENDIX: ADDITIONAL EXPERIMENTS


                  E.1 EPOCH-WISE DOUBLE DESCENT: ADDITIONAL RESULTS

                 Here, we provide a rigorous evaluation of epoch-wise double descent for a variety of optimizers and
                 learning rate schedules. We train ResNet18 on CIFAR-10 with data-augmentation and 20% label
                 noise with three different optimizers (Adam, SGD, and SGD + Momentum with momentum set to 0.9) and
                 three different learning rate schedules (constant, inverse-square root, and dynamic drop) for different
                 values of initial learning rate. We observe that double-descent occurs reliably for all optimizers and
                 learning rate schedules and the peak of the double descent curve shifts with the interpolation point.

                                              <<FIGURE>>

                 Figure 16: Epoch-wise double descent for ResNet18 trained with Adam and multiple learning rate
                 schedules.

                  A practical recommendation resulting from epoch-wise double descent is that stopping the training
                  when the test error starts to increase may not always be the best strategy. In some cases, the test error
                  may decrease again after reaching a maximum, and the ﬁnal value may be lower than the minimum
                  earlier in training.

                                      <<FIGURE>>

                 Figure 17: Epoch-wise double descent for ResNet18 trained with SGD and multiple learning rate
                 schedules.

                                                                <<FIGURE>>

                 Figure 18: Epoch-wise double descent for ResNet18 trained with SGD+Momentum and multiple
                 learning rate schedules.


                  E.2 MODEL-WISE DOUBLE DESCENT: ADDITIONAL RESULTS

                  E.2.1 CLEAN SETTINGS WITH MODEL-WISE DOUBLE DESCENT

                  <<FIGURE>>

                 Figure 19: Top: Train and test performance as a function of both model size and train epochs.
                 Bottom: Test error dynamics of the same model (ResNet18, on CIFAR-100 with no label noise,
                 data-augmentation and Adam optimizer trained for 4k epochs with learning rate 0.0001). Note that
                 even with optimal early stopping this setting exhibits double descent.

                                                  <<FIGURE>>

                 Figure 20: Top: Train and test performance as a function of both model size and train epochs.
                 Bottom: Test error dynamics of the same models. 5-Layer CNNs, CIFAR-100 with no label noise,
                 no data-augmentation, trained with SGD for 1e6 steps. Same experiment as Figure 7.


                  E.2.2 WEIGHT DECAY

                                          <<FIGURE>>

                 Figure 21: Left: Test error dynamics with weight decay of 5e-4 (bottom left) and without weight
                 decay (top left). Right: Test and train error and test loss for models with varying amounts of
                 weight decay. All models are 5-Layer CNNs on CIFAR-10 with 10% label noise, trained with
                 data-augmentation and SGD for 500K steps.

                 Here, we study the effect of varying the level of regularization on test error. We train ResNet18
                 on CIFAR-10 with data-augmentation and 20% label noise for weight decay coefficients <<FORMULA>>
                 ranging from 0 to 0.1. We train the networks using SGD + inverse-square root learning rate. The figure
                 below shows a picture qualitatively very similar to that observed for model-wise double descent,
                 wherein “model complexity” is now controlled by the regularization parameter. This confirms our
                 generalized double descent hypothesis along yet another axis of Effective Model Complexity.

                            <<FIGURE>>

                 Figure 22: Generalized double descent for weight decay. We found that using the same initial
                 learning rate for all weight decay values led to training instabilities. This resulted in some noise in
                 the Test Error (Weight Decay × Epochs) plot shown above.


                  E.2.3 EARLY STOPPING DOES NOT EXHIBIT DOUBLE DESCENT

                                      <<FIGURE>>

                 Figure 23: Model-wise test error dynamics for a subsampled IWSLT’14 dataset. Left: 4k samples,
                 Right: 18k samples. Note that with optimal early-stopping, more samples is always better.

                          <<FIGURE>>

                 Figure 24: Model-wise test error dynamics for the IWSLT’14 de-en and subsampled WMT’14 en-fr
                 datasets. Left: IWSLT’14, Right: subsampled (200k samples) WMT’14. Note that with optimal
                 early-stopping, the test error is much lower for this task.

                                          <<FIGURE>>

                 Figure 25: Top: Train and test performance as a function of both model size and train epochs.
                 Bottom: Test error dynamics of the same model (CNN, on CIFAR-10 with 10% label noise,
                 data-augmentation and SGD optimizer with learning rate ∝ 1/√T).


                   E.2.4 TRAINING PROCEDURE

                                  <<FIGURE>>

                 Figure 26: Model-wise double descent for adversarial training. ResNet18s on CIFAR-10
                 (subsampled to 25k train samples) with no label noise. We train for L2 robustness of radius <<FORMULA>> and
                 <<FORMULA>>, using 10-step PGD (Goodfellow et al. (2014); Madry et al. (2017)). Trained using SGD
                 (batch size 128) with learning rate 0.1 for 400 epochs, then 0.01 for 400 epochs.

                                              <<FIGURE>>

                                               Figure 27


                    E.3 ENSEMBLING

                                            <<FIGURE>> 

                 Figure 28: Effect of Ensembling (ResNets, 15% label noise). Test error of an ensemble of 5
                 models, compared to the base models. The ensembled classifier is determined by plurality vote over
                 the 5 base models. Note that ensembling helps most around the critical regime. All models are
                 ResNet18s trained on CIFAR-10 with 15% label noise, using Adam for 4K epochs (same setting
                 as Figure 1). Test error is measured against the original (not noisy) test set, and each model in the
                 ensemble is trained using a train set with independently-sampled 15% label noise.
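
                 The plurality-vote ensemble used in this figure can be sketched as follows. The
                 tie-breaking rule (smallest class index wins) is an assumption, since the text does not
                 specify one, and the function names are illustrative.

```python
from collections import Counter


def plurality_vote(predictions):
    """Return the most frequent label; ties go to the smallest label (assumed rule)."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def ensemble_predict(per_model_predictions):
    """per_model_predictions[m][i] is model m's predicted label for test example i."""
    return [plurality_vote(votes) for votes in zip(*per_model_predictions)]
```

                 With 5 base models, `per_model_predictions` holds 5 equal-length lists and the result
                 has one ensembled label per test example.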

                                                              <<FIGURE>>

                 Figure 29: Effect of Ensembling (CNNs, no label noise). Test error of an ensemble of 5 models,
                 compared to the base models. All models are 5-layer CNNs trained on CIFAR-10 with no label
                 noise, using SGD and no data augmentation (same setting as Figure 7).
<|endoftext|>


<|startoftext|>
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

1. Introduction 

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the levels of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit very deep [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also

            <<FIGURE>>

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer plain networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.

greatly benefited from very deep models. Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? 
An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation [22].
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example. 
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time). 

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as <<H(x)>>, we let the stacked nonlinear layers fit another mapping of <<FORMULA>>. The original mapping is recast into <<F(x)+x>>. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
The formulation of <<FORMULA>> can be realized by feedforward neural networks with shortcut connections (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity short.cut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers. 
We present comprehensive experiments on ImageNet 
[36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart plain nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks. 
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers. 
On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

2. Related Work 

Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an inception layer is composed of a shortcut branch and a few deeper branches. 
Concurrent with our work, highway networks [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is closed (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

3. Deep Residual Learning 

3.1. Residual Learning 
Let us consider <<H(x)>> as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., <<H(x) - x>> (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate <<H(x)>>, we explicitly let these layers approximate a residual function <<F(x) := H(x) - x>>. The original function thus becomes <<F(x)+x>>. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

3.2. Identity Mapping by Shortcuts 
We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as: 

<<y = F(x, {W_i}) + x>>. (1) 

Here x and y are the input and output vectors of the layers considered. The function <<F(x, {W_i})>> represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, <<F = W_2 \sigma(W_1 x)>> in which <<\sigma>> denotes ReLU [29] and the biases are omitted for simplifying notations. The operation <<F + x>> is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., <<\sigma(y)>>, see Fig. 2).
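As a minimal sketch of the building block just defined (plain Python lists instead of a tensor library, biases omitted as in the text; the helper names `relu`, `matvec`, and `building_block` are illustrative and not from the paper), the two-layer residual mapping, the shortcut addition, and the nonlinearity after the addition might look like this:

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def building_block(x, W1, W2):
    """Eqn.(1) followed by the second nonlinearity: relu(W2 relu(W1 x) + x)."""
    F = matvec(W2, relu(matvec(W1, x)))   # residual mapping F(x, {W_i})
    y = [f + a for f, a in zip(F, x)]     # identity shortcut, element-wise addition
    return relu(y)                        # nonlinearity applied after the addition


W1 = [[1.0, 0.0], [0.0, -1.0]]
W2 = [[2.0, 0.0], [0.0, 1.0]]
out = building_block([1.0, 3.0], W1, W2)

# With all weights at zero, F vanishes and a non-negative input passes through unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_out = building_block([1.0, 3.0], zeros, zeros)
```

The second call illustrates the motivation in Sec. 3.1: driving the weights toward zero is enough for the block to approach an identity mapping.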
The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition). 
The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection <<W_s>> by the shortcut connections to match the dimensions: 

<<y = F(x, {W_i}) + W_s x>>. (2) 

We can also use a square matrix <<W_s>> in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus <<W_s>> is only used when matching dimensions. 
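The projection shortcut of Eqn.(2) can be sketched in the same style. This is only an illustration under assumed shapes (a 2-dimensional input and a 3-dimensional output); the function name `projection_block` and the weight values are hypothetical:

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def projection_block(x, W1, W2, Ws):
    """Eqn.(2): y = F(x, {W_i}) + W_s x, for when F changes the dimension."""
    F = matvec(W2, relu(matvec(W1, x)))   # residual mapping, here 2 -> 3 dimensions
    shortcut = matvec(Ws, x)              # linear projection on the shortcut path
    if len(F) != len(shortcut):
        raise ValueError("W_s must project x to the dimension of F")
    return [f + s for f, s in zip(F, shortcut)]


x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                    # maps 2 -> 3
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]     # maps 3 -> 3
Ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                    # 3x2 projection
y = projection_block(x, W1, W2, Ws)
```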
The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: <<y = W_1 x + x>>, for which we have not observed advantages. 
We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function <<F(x, {W_i})>> can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel. 

3.3. Network Architectures 
We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows. 
Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle). 
It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs). 
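Design rule (ii) and the 18% figure can be checked with a short calculation. The sketch below counts multiply-adds for a single 3×3 convolution; the specific map sizes and filter counts (56 with 64 filters, 28 with 128 filters) are illustrative assumptions, and `conv_multiply_adds` is a hypothetical helper:

```python
def conv_multiply_adds(out_size, in_channels, out_channels, kernel=3):
    """Multiply-adds of one conv layer producing a square out_size x out_size map."""
    return out_size * out_size * kernel * kernel * in_channels * out_channels


# Rule (ii): halving the feature map side while doubling the filters
# leaves the per-layer cost unchanged.
stage_a = conv_multiply_adds(56, 64, 64)
stage_b = conv_multiply_adds(28, 128, 128)

# Ratio of the 34-layer baseline to VGG-19, using the FLOPs quoted in the text.
ratio_to_vgg19 = 3.6 / 19.6
```

Halving the side length cuts the number of output positions by four, while doubling both input and output channels multiplies the work per position by four, so the two effects cancel.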


Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2. 
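Option A can be illustrated with a small sketch. The text does not spell out how the stride-2 identity shortcut subsamples, so the version below assumes it keeps every second row and column; the function name `zero_pad_shortcut` and the channels-by-rows-by-columns list layout are likewise assumptions made for illustration:

```python
def zero_pad_shortcut(x, out_channels, stride=2):
    """Option A sketch: strided spatial subsampling, then zero-filled extra channels.

    x is a list of channels, each channel a list of rows of floats.
    """
    # Assumed subsampling: keep every `stride`-th row and column.
    sub = [[row[::stride] for row in channel[::stride]] for channel in x]
    height, width = len(sub[0]), len(sub[0][0])
    extra = out_channels - len(sub)
    padding = [[[0.0] * width for _ in range(height)] for _ in range(extra)]
    return sub + padding                  # no learned parameters anywhere


x = [[[1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0],
      [9.0, 10.0, 11.0, 12.0],
      [13.0, 14.0, 15.0, 16.0]]]          # 1 channel, 4x4 map
y = zero_pad_shortcut(x, out_channels=2)  # 2 channels, 2x2 map
```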

3.4. Implementation 
Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16]. 
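For concreteness, one SGD step with the hyperparameters above might be written as follows. The text does not give the exact update rule, so this sketch assumes one common formulation in which weight decay is added to the gradient before the momentum update; `sgd_step` is a hypothetical name:

```python
def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay (assumed formulation).

    v <- momentum * v - lr * (grad + weight_decay * w)
    w <- w + v
    """
    new_velocity = [momentum * v - lr * (g + weight_decay * wi)
                    for wi, g, v in zip(w, grad, velocity)]
    new_w = [wi + v for wi, v in zip(w, new_velocity)]
    return new_w, new_velocity


# Single weight of 1.0 with gradient 0.5 and zero initial velocity.
w, v = sgd_step([1.0], [0.5], [0.0])
```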
In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}). 

4. Experiments 

4.1. ImageNet Classification 
We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates. 
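As a reminder of how the two metrics are defined, the sketch below computes a top-k error from per-class scores: a sample counts as an error when its true label is not among the k highest-scoring classes. The function name and the toy scores are illustrative; with only three classes the example uses k = 2 in place of the k = 5 used on ImageNet.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k top-scoring classes."""
    wrong = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


scores = [[0.1, 0.7, 0.2],   # predicted class 1, true class 1
          [0.5, 0.3, 0.2],   # predicted class 0, true class 1 (second guess is right)
          [0.2, 0.3, 0.5]]   # predicted class 2, true class 0 (ranked last)
labels = [1, 1, 0]
top1 = top_k_error(scores, labels, 1)
top2 = top_k_error(scores, labels, 2)
```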
Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures. 
The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem: the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one. 

                        <<TABLE>>

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Down-sampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2. 

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts. 

                    <<FIGURE>>  

Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures. 
We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures that forward-propagated signals have non-zero variances. We also verify that the backward-propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reduction of the training error (footnote 3). The reason for such optimization difficulties will be studied in the future. 
Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts. 
We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth. 
Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems. 
Footnote 3: We have experimented with more training iterations (3×) and still observed the degradation problem, suggesting that this problem cannot be feasibly addressed by simply using more iterations. 

                    <<TABLE>>

Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions. 

                    <<TABLE>>

Table 4. Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set). 

                    <<TABLE>>

Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.

                    <<TABLE>>

Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is not overly deep (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage. 
Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections. 

Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet.

        <<FIGURE>>

Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below. 
Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design (footnote 4). For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity. 
The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs. 
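The two complexity claims above can be checked with a rough count of multiply-adds per output position. The channel widths are assumptions chosen for illustration (a 64-channel two-layer block, and a 256-channel bottleneck whose inner layers have 64 channels), and `conv_cost` is a hypothetical helper. Ignoring biases and BN, the same numbers also count the convolution weights.

```python
def conv_cost(kernel, c_in, c_out):
    """Multiply-adds per output position (equivalently, weights) of a k x k convolution."""
    return kernel * kernel * c_in * c_out


# Two-layer block: two 3x3 convolutions on 64 channels (assumed widths).
basic = 2 * conv_cost(3, 64, 64)

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 restore (64 -> 256).
bottleneck = conv_cost(1, 256, 64) + conv_cost(3, 64, 64) + conv_cost(1, 64, 256)

# Replacing the identity shortcut with a 1x1 projection between the 256-d ends.
with_projection = bottleneck + conv_cost(1, 256, 256)
growth = with_projection / bottleneck
```

Under these assumed widths the two block designs cost about the same, while a projection shortcut nearly doubles the bottleneck block, because it connects the two high-dimensional ends.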
50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs. 
Footnote 4: Deeper non-bottleneck ResNets (e.g., Fig. 5 left) also gain accuracy from increased depth (as shown on CIFAR-10), but are not as economical as the bottleneck ResNets. So the usage of bottleneck designs is mainly due to practical considerations. We further note that the degradation problem of plain nets is also witnessed for the bottleneck designs. 

101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs). 
The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4). 
Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015. 

4.2. CIFAR-10 and Analysis 

We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows. 
The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. In total there are 6n+2 stacked weighted layers. The following table summarizes the architecture: 

<<TABLE>>
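The layer count follows directly from this description: one first convolution, 6n stacked convolutions, and one fully-connected layer. A small sketch (with hypothetical helper names) tabulates the stages and the resulting depths for the values of n used in this section:

```python
def cifar_depth(n):
    """1 first conv + 6n stacked 3x3 conv layers + 1 fully-connected layer."""
    return 6 * n + 2


def cifar_stages(n):
    """(feature map size, number of filters, number of layers) for each stage."""
    return [(size, filters, 2 * n)
            for size, filters in zip((32, 16, 8), (16, 32, 64))]


depths = {n: cifar_depth(n) for n in (3, 5, 7, 9, 18, 200)}
stages = cifar_stages(3)
```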

When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally 3n shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts. 

<<TABLE>>

Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show best (mean ± std) as in [43]. 
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image. 
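The step schedule described above can be written as a small function. This is a sketch: the function name is hypothetical, and whether the drop takes effect exactly at iteration 32k or one step later is a boundary convention assumed here.

```python
def cifar_learning_rate(iteration, base_lr=0.1, milestones=(32000, 48000)):
    """Step schedule: divide the learning rate by 10 at each milestone reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr / (10 ** drops)


TOTAL_ITERATIONS = 64000   # training terminates here
schedule = [cifar_learning_rate(i) for i in (0, 31999, 32000, 48000, 63999)]
```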
We compare n = {3, 5, 7, 9}, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [42]), suggesting that such an optimization difficulty is a fundamental problem. 
Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases. 
We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging (footnote 5). So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [35] and Highway [42] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6). 
Footnote 5: With an initial learning rate of 0.1, it starts converging (<90% error) after several epochs, but still reaches similar accuracy. 

<<FIGURE>>

Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order. 
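The warm-up used for the 110-layer network can be sketched as a small state update. The text gives only the two learning rates and the 80% threshold, so the function name, the explicit `warmed_up` flag, and the choice never to fall back to the warm-up rate are assumptions of this sketch:

```python
def warmup_learning_rate(training_error, warmed_up,
                         warmup_lr=0.01, base_lr=0.1, threshold=0.80):
    """Return (learning_rate, warmed_up).

    Use warmup_lr until the training error first drops below the threshold,
    then stay at base_lr (from which the regular schedule continues).
    """
    if warmed_up or training_error < threshold:
        return base_lr, True
    return warmup_lr, False


lr_start, state = warmup_learning_rate(0.90, False)   # still warming up
lr_after, state = warmup_learning_rate(0.79, state)   # threshold crossed
lr_later, state = warmup_learning_rate(0.85, state)   # stays at the base rate
```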
Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec. 3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less. 
Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200 that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this 10^3-layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6). 
But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout [10] or dropout [14] is applied to obtain the best results ([10, 25, 24, 35]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future. 

<<TABLE>>

Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also Tables 10 and 11 for better results. 

<<TABLE>>

Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also Table 9 for better results. 

4.3. Object Detection on PASCAL and MS COCO 
Our method has good generalization performance on other recognition tasks. Tables 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [41] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations. 
Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix. 

References 
[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994. 
[2] C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995. 
[3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. Siam, 2000. 
[4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011. 
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303-338, 2010. 
[6] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware cnn model. In ICCV, 2015. 
[7] R. Girshick. Fast R-CNN. In ICCV, 2015. 
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 
[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010. 
[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013. 
[11] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015. 
[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015. 
[14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012. 
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997. 
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 
[17] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011. 
[18] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012. 
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014. 
[20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009. 
[21] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989. 
[23] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-50. Springer, 1998. 
[24] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014. 
[25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013. 
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 
[28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014. 
[29] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010. 
[30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007. 
[31] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012. 
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 
[33] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015. 
[34] B. D. Ripley. Pattern recognition and neural networks. Cambridge university press, 1996. 
[35] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015. 
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014. 
[37] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013. 
[38] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998. 
[39] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207–226. Springer, 1998.
[40] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 
[42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015. 
[43] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[45] R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990.
[46] R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006. 
[47] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods – backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
[48] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008. 
[49] W. Venables and B. Ripley. Modern applied statistics with S-PLUS. 1999.
[50] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.

A. Object Detection Baselines 

In this section we introduce our detection method based on the baseline Faster R-CNN [32] system. The models are initialized by the ImageNet classification models, and then fine-tuned on the object detection data. We have experimented with ResNet-50/101 at the time of the ILSVRC & COCO 2015 detection competitions.
Unlike VGG-16 used in [32], our ResNet has no hidden fc layers. We adopt the idea of "Networks on Conv feature maps" (NoC) [33] to address this issue. We compute the full-image shared conv feature maps using those layers whose strides on the image are no greater than 16 pixels (i.e., conv1, conv2_x, conv3_x, and conv4_x, 91 conv layers in total in ResNet-101; Table 1). We consider these layers as analogous to the 13 conv layers in VGG-16, and by doing so, both ResNet and VGG-16 have conv feature maps of the same total stride (16 pixels). These layers are shared by a region proposal network (RPN, generating 300 proposals) [32] and a Fast R-CNN detection network [7]. RoI pooling [7] is performed before conv5_1. On this RoI-pooled feature, all layers of conv5_x and up are adopted for each region, playing the roles of VGG-16's fc layers. The final classification layer is replaced by two sibling layers (classification and box regression [7]).
For the usage of BN layers, after pre-training, we compute the BN statistics (means and variances) for each layer on the ImageNet training set. Then the BN layers are fixed during fine-tuning for object detection. As such, the BN layers become linear activations with constant offsets and scales, and BN statistics are not updated by fine-tuning. We fix the BN layers mainly for reducing memory consumption in Faster R-CNN training. 
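Once the statistics are frozen, each BN layer reduces to a per-channel affine map. The following minimal sketch illustrates that folding for a single channel in plain Python. The function names and the epsilon value are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Batch normalization with fixed (pre-computed) statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_frozen_bn(gamma, beta, mean, var, eps=1e-5):
    """Collapse a frozen BN layer into y = scale * x + offset.

    eps is an assumed numerical-stability constant.
    """
    scale = gamma / math.sqrt(var + eps)
    offset = beta - scale * mean
    return scale, offset


# Hypothetical per-channel parameters, for illustration only.
scale, offset = fold_frozen_bn(gamma=1.5, beta=0.2, mean=0.7, var=4.0)
folded_output = scale * 3.0 + offset
```

Since the scale and offset are constants, no batch statistics need to be computed or stored during fine-tuning, which is in line with the memory motivation stated above.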
PASCAL VOC 
Following [7, 32], for the PASCAL VOC 2007 test set, we use the 5k trainval images in VOC 2007 and 16k trainval images in VOC 2012 for training ("07+12"). For the PASCAL VOC 2012 test set, we use the 10k trainval+test images in VOC 2007 and 16k trainval images in VOC 2012 for training ("07++12"). The hyper-parameters for training Faster R-CNN are the same as in [32]. Table 7 shows the results. ResNet-101 improves the mAP by >3% over VGG-16. This gain is solely because of the improved features learned by ResNet.
MS COCO 
The MS COCO dataset [26] involves 80 object categories. We evaluate the PASCAL VOC metric (mAP @ IoU = 0.5) and the standard COCO metric (mAP @ IoU = .5:.05:.95). We use the 80k images on the train set for training and the 40k images on the val set for evaluation. Our detection system for COCO is similar to that for PASCAL VOC. We train the COCO models with an 8-GPU implementation, and thus the RPN step has a mini-batch size of 8 images (i.e., 1 per GPU) and the Fast R-CNN step has a mini-batch size of 16 images. The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.001 and then for 80k iterations with 0.0001.
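The two metrics differ only in the IoU thresholds at which a detection counts as correct. As a rough illustration, the sketch below computes the IoU of two boxes and enumerates the thresholds implied by the notation .5:.05:.95. The (x1, y1, x2, y2) box format and the helper names are assumptions, and the official COCO evaluation code handles many details that are omitted here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the COCO-style metric.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def coco_style_map(ap_per_threshold):
    """Average the per-threshold AP values (one per COCO threshold)."""
    if len(ap_per_threshold) != len(COCO_IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

The PASCAL VOC metric corresponds to using only the first threshold (0.5), which is why the COCO-style metric puts more weight on accurate localization.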
Table 8 shows the results on the MS COCO validation set. ResNet-101 has a 6% increase of mAP@[.5, .95] over VGG-16, which is a 28% relative improvement, solely contributed by the features learned by the better network. Remarkably, the mAP@[.5, .95]'s absolute increase (6.0%) is nearly as big as mAP@.5's (6.9%). This suggests that a deeper network can improve both recognition and localization.
B. Object Detection Improvements 
For completeness, we report the improvements made for the competitions. These improvements are based on deep features and thus should benefit from residual learning. 
MS COCO 
Box refinement. Our box refinement partially follows the iterative localization in [6]. In Faster R-CNN, the final output is a regressed box that is different from its proposal box. So for inference, we pool a new feature from the regressed box and obtain a new classification score and a new regressed box. We combine these 300 new predictions with the original 300 predictions. Non-maximum suppression (NMS) is applied on the union set of predicted boxes using an IoU threshold of 0.3 [8], followed by box voting [6]. Box refinement improves mAP by about 2 points (Table 9).
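The NMS step on the union of original and refined predictions can be sketched as follows. This is a generic greedy NMS in plain Python using the 0.3 IoU threshold from the text. The box format (x1, y1, x2, y2) and function names are assumptions, and the box voting step of [6] is not implemented here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def refine_and_suppress(orig_boxes, orig_scores, new_boxes, new_scores):
    """Union the original and re-pooled predictions, then apply NMS."""
    boxes = list(orig_boxes) + list(new_boxes)
    scores = list(orig_scores) + list(new_scores)
    return [(boxes[i], scores[i]) for i in nms(boxes, scores)]
```

In a per-class detector this would typically be run once per object category, which is an assumption about usage rather than something stated in the text.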
Global context. We combine global context in the Fast R-CNN step. Given the full-image conv feature map, we pool a feature by global Spatial Pyramid Pooling [12] (with a "single-level" pyramid) which can be implemented as "RoI" pooling using the entire image's bounding box as the RoI. This pooled feature is fed into the post-RoI layers to obtain a global context feature. This global feature is concatenated with the original per-region feature, followed by the sibling classification and box regression layers. This new structure is trained end-to-end. Global context improves mAP@.5 by about 1 point (Table 9).
Multi-scale testing. In the above, all results are obtained by single-scale training/testing as in [32], where the image's shorter side is s = 600 pixels. Multi-scale training/testing has been developed in [12, 7] by selecting a scale from a feature pyramid, and in [33] by using maxout layers. In our current implementation, we have performed multi-scale testing following [33]; we have not performed multi-scale training because of limited time. In addition, we have performed multi-scale testing only for the Fast R-CNN step (but not yet for the RPN step). With a trained model, we compute conv feature maps on an image pyramid, where the image's shorter sides are s ∈ {200, 400, 600, 800, 1000}.

                <<TABLE>>

Table 9. Object detection improvements on MS COCO using Faster R-CNN and ResNet-101. 

    <<TABLE>>

Table 10. Detection results on the PASCAL VOC 2007 test set. The baseline is the Faster R-CNN system. The system "baseline+++" includes box refinement, context, and multi-scale testing in Table 9.


system       net         data         mAP   aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  dog   horse  mbike  person  plant  sheep  sofa  train  tv
baseline     VGG-16      07++12       70.4  84.9  79.8  74.3  53.9  49.8    77.5  75.9  88.5  45.6   77.1  55.3   86.9  81.7   80.9   79.6    40.1   72.6   60.9  81.2   61.5
baseline     ResNet-101  07++12       73.8  86.5  81.6  77.2  58.0  51.0    78.6  76.6  93.2  48.6   80.4  59.0   92.1  85.3   84.8   80.7    48.1   77.3   66.5  84.7   65.6
baseline+++  ResNet-101  COCO+07++12  83.8  92.1  88.4  84.8  75.9  71.4    86.3  87.8  94.2  66.8   89.4  69.2   93.9  91.9   90.9   89.6    67.9   88.2   76.8  90.3   80.0
Table 11. Detection results on the PASCAL VOC 2012 test set (http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=4). The baseline is the Faster R-CNN system. The system "baseline+++" includes box refinement, context, and multi-scale testing in Table 9.


We select two adjacent scales from the pyramid following [33]. RoI pooling and subsequent layers are performed on the feature maps of these two scales [33], which are merged by maxout as in [33]. Multi-scale testing improves the mAP by over 2 points (Table 9). 
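A simplified sketch of the two ingredients described above: resizing so that the shorter side matches each pyramid scale, and merging the RoI-pooled features of two scales by an element-wise maximum. The function names are hypothetical, features are represented as flat lists for brevity, and the rule used in [33] to pick the two adjacent scales is not reproduced here.

```python
PYRAMID_SCALES = (200, 400, 600, 800, 1000)  # shorter-side lengths from the text


def pyramid_sizes(height, width, scales=PYRAMID_SCALES):
    """Image sizes whose shorter side equals each scale, keeping aspect ratio."""
    short = min(height, width)
    return [(round(height * s / short), round(width * s / short)) for s in scales]


def maxout_merge(features_a, features_b):
    """Element-wise maximum of two equally sized RoI-pooled feature vectors."""
    if len(features_a) != len(features_b):
        raise ValueError("features from both scales must have the same size")
    return [max(a, b) for a, b in zip(features_a, features_b)]
```

Because RoI pooling produces a fixed-size output regardless of the input resolution, features pooled at two different scales can be merged position by position in this way.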
Using validation data. Next we use the 80k+40k trainval set for training and the 20k test-dev set for evaluation. The test-dev set has no publicly available ground truth and the result is reported by the evaluation server. Under this setting, the results are an mAP@.5 of 55.7% and an mAP@[.5, .95] of 34.9% (Table 9). This is our single-model result.
Ensemble. In Faster R-CNN, the system is designed to learn region proposals and also object classifiers, so an ensemble can be used to boost both tasks. We use an ensemble for proposing regions, and the union set of proposals is processed by an ensemble of per-region classifiers. Table 9 shows our result based on an ensemble of 3 networks. The mAP@.5 is 59.0% and the mAP@[.5, .95] is 37.4% on the test-dev set. This result won 1st place in the detection task in COCO 2015.

We revisit the PASCAL VOC dataset based on the above model. With the single model on the COCO dataset (55.7% mAP@.5 in Table 9), we fine-tune this model on the PASCAL VOC sets. The improvements of box refinement, context, and multi-scale testing are also adopted. By doing so we achieve 85.6% mAP on PASCAL VOC 2007 (Table 10) and 83.8% on PASCAL VOC 2012 (Table 11). The result on PASCAL VOC 2012 is 10 points higher than the previous state-of-the-art result [6].

<<TABLE>>

Table 12. Our results (mAP, %) on the ImageNet detection dataset. Our detection system is Faster R-CNN [32] with the improvements in Table 9, using ResNet-101. 


ImageNet Detection 
The ImageNet Detection (DET) task involves 200 object categories. The accuracy is evaluated by mAP@.5. Our object detection algorithm for ImageNet DET is the same as that for MS COCO in Table 9. The networks are pre-trained on the 1000-class ImageNet classification set, and are fine-tuned on the DET data. We split the validation set into two parts (val1/val2) following [8]. We fine-tune the detection models using the DET training set and the val1 set. The val2 set is used for validation. We do not use other ILSVRC 2015 data. Our single model with ResNet-101 has

<<TABLE>>

Table 13. Localization error (%) on the ImageNet validation. In the column of "LOC error on GT class" ([41]), the ground truth class is used. In the "testing" column, "1-crop" denotes testing on a center crop of 224×224 pixels, "dense" denotes dense (fully convolutional) and multi-scale testing.

<<TABLE>> 

Table 14. Comparisons of localization error (%) on the ImageNet dataset with state-of-the-art methods. 


58.8% mAP and our ensemble of 3 models has 62.1% mAP on the DET test set (Table 12). This result won the 1st place in the ImageNet detection task in ILSVRC 2015, surpassing the second place by 8.5 points (absolute). 

C. ImageNet Localization 

The ImageNet Localization (LOC) task [36] requires classifying and localizing the objects. Following [40, 41], we assume that the image-level classifiers are first adopted for predicting the class labels of an image, and the localization algorithm only accounts for predicting bounding boxes based on the predicted classes. We adopt the "per-class regression" (PCR) strategy [40, 41], learning a bounding box regressor for each class. We pre-train the networks for ImageNet classification and then fine-tune them for localization. We train networks on the provided 1000-class ImageNet training set.
Our localization algorithm is based on the RPN framework of [32] with a few modifications. Unlike the way in [32] that is category-agnostic, our RPN for localization is designed in a per-class form. This RPN ends with two sibling 1×1 convolutional layers for binary classification (cls) and box regression (reg), as in [32]. The cls and reg layers are both in a per-class form, in contrast to [32]. Specifically, the cls layer has a 1000-d output, and each dimension is binary logistic regression for predicting being or not being an object class; the reg layer has a 1000×4-d output consisting of box regressors for 1000 classes. As in [32], our bounding box regression is with reference to multiple translation-invariant "anchor" boxes at each position.
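To make the per-class output layout concrete, the sketch below reads off the objectness probability and the four box-regression values for one class from the 1000-d cls output and the 1000×4-d reg output at a single anchor. The class-major memory layout of the reg output and the function names are assumptions made for illustration, since the text does not specify them.

```python
import math

NUM_CLASSES = 1000  # from the text: 1000-d cls output, 1000x4-d reg output


def sigmoid(z):
    """Logistic function used for per-class binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


def decode_per_class(cls_logits, reg_outputs, class_index):
    """Return (object probability, 4 box-regression values) for one class.

    Assumes a class-major layout for reg_outputs:
    [c0_v0, c0_v1, c0_v2, c0_v3, c1_v0, ...] (hypothetical).
    """
    if len(cls_logits) != NUM_CLASSES or len(reg_outputs) != NUM_CLASSES * 4:
        raise ValueError("unexpected output size")
    prob = sigmoid(cls_logits[class_index])
    box_values = reg_outputs[4 * class_index: 4 * class_index + 4]
    return prob, box_values
```

At test time the class index would come from the image-level classifier, so only the regressor of the predicted class is used.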
As in our ImageNet classification training (Sec. 3.4), we randomly sample 224×224 crops for data augmentation. We use a mini-batch size of 256 images for fine-tuning. To avoid negative samples dominating, 8 anchors are randomly sampled for each image, where the sampled positive and negative anchors have a ratio of 1:1 [32]. For testing, the network is applied on the image fully-convolutionally.
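The balanced anchor sampling can be sketched as below. Labels are assumed to be 1 for positive and 0 for negative anchors, and padding with extra negatives when fewer than four positives exist is an assumption borrowed from common Faster R-CNN practice rather than something stated here.

```python
import random


def sample_anchors(labels, num_samples=8, rng=None):
    """Sample anchor indices with a 1:1 positive-to-negative ratio.

    labels: sequence with 1 for positive anchors and 0 for negative anchors.
    Returns (positive_indices, negative_indices).
    """
    rng = rng or random.Random()
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    num_pos = min(len(positives), num_samples // 2)
    # Assumed behaviour: fill the remaining slots with negatives.
    num_neg = min(len(negatives), num_samples - num_pos)
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)
```

With 8 samples per image and a mini-batch of 256 images, each update would see roughly 2048 anchors, about half of them positive.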
Table 13 compares the localization results. Following [41], we first perform "oracle" testing using the ground truth class as the classification prediction. VGG's paper [41] reports a center-crop error of 33.1% (Table 13) using ground truth classes. Under the same setting, our RPN method using ResNet-101 net significantly reduces the center-crop error to 13.3%. This comparison demonstrates the excellent performance of our framework. With dense (fully convolutional) and multi-scale testing, our ResNet-101 has an error of 11.7% using ground truth classes. Using ResNet-101 for predicting classes (4.6% top-5 classification error, Table 4), the top-5 localization error is 14.4%.
The above results are only based on the proposal network (RPN) in Faster R-CNN [32]. One may use the detection network (Fast R-CNN [7]) in Faster R-CNN to improve the results. But we notice that on this dataset, one image usually contains a single dominant object, and the proposal regions highly overlap with each other and thus have very similar RoI-pooled features. As a result, the image-centric training of Fast R-CNN [7] generates samples of small variations, which may not be desired for stochastic training. Motivated by this, in our current experiment we use the original R-CNN [8] that is RoI-centric, in place of Fast R-CNN.
Our R-CNN implementation is as follows. We apply the per-class RPN trained as above on the training images to predict bounding boxes for the ground truth class. These predicted boxes play the role of class-dependent proposals. For each training image, the highest scored 200 proposals are extracted as training samples to train an R-CNN classifier. The image region is cropped from a proposal, warped to 224×224 pixels, and fed into the classification network as in R-CNN [8]. The outputs of this network consist of two sibling fc layers for cls and reg, also in a per-class form. This R-CNN network is fine-tuned on the training set using a mini-batch size of 256 in the RoI-centric fashion. For testing, the RPN generates the highest scored 200 proposals for each predicted class, and the R-CNN network is used to update these proposals' scores and box positions.
This method reduces the top-5 localization error to 10.6% (Table 13). This is our single-model result on the validation set. Using an ensemble of networks for both classification and localization, we achieve a top-5 localization error of 9.0% on the test set. This number significantly outperforms the ILSVRC 2014 results (Table 14), showing a 64% relative reduction of error. This result won 1st place in the ImageNet localization task in ILSVRC 2015.
<|endoftext|>


<|startoftext|>
                    Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures

                      Julien Launay 1,2  Iacopo Poli 1  François Boniface 1  Florent Krzakala 1,2

                                     1 LightOn   2 École Normale Supérieure

                                               Abstract

                       Despite being the workhorse of deep learning, the backpropagation algorithm is
                       no panacea. It enforces sequential layer updates, thus preventing efficient
                       parallelization of the training process. Furthermore, its biological plausibility is
                       being challenged. Alternative schemes have been devised; yet, under the constraint
                       of synaptic asymmetry, none have scaled to modern deep learning tasks and
                       architectures. Here, we challenge this perspective, and study the applicability of
                       Direct Feedback Alignment to neural view synthesis, recommender systems, geometric
                       learning, and natural language processing. In contrast with previous studies
                       limited to computer vision tasks, our findings show that it successfully trains a large
                       range of state-of-the-art deep learning architectures, with performance close to
                       fine-tuned backpropagation. At variance with common beliefs, our work supports
                       that challenging tasks can be tackled in the absence of weight transport.


                 1 Introduction

                 While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements,
                 it is not without pitfalls. For one, its weight updates are non-local and rely on upstream layers. Thus,
                 they cannot be easily parallelized [3], incurring important memory and compute costs. Moreover,
                 its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the
                 weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward
                 path: this is implausible in biological brains, and known as the weight transport problem [6].
                 Consequently, alternative training algorithms have been developed. Some of these algorithms are
                 explicitly biologically inspired [7–13], while others focus on making better use of available compute
                 resources [3,14–19]. Despite these enticing characteristics, none has been widely adopted, as they
                 are often demonstrated on a limited set of tasks. Moreover, as assessed in [20], their performance on
                 challenging datasets under the constraint of synaptic asymmetry is disappointing.
                 We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment
                 (DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural
                 view synthesis and recommender systems, to geometric learning with graph convolutions, and natural
                 language processing with Transformers. Our results define new standards for learning without weight
                 transport and show that challenging tasks can indeed be tackled under synaptic asymmetry.
                 All code needed to reproduce our experiments is available at https://github.com/lightonai/dfa-scales-to-modern-deep-learning.

                                           1.1 Related work

                 Training a neural network is a credit assignment problem: an update is derived for each parameter
                 from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21].

                 Biologically motivated methods Finding a training method applicable under the constraints of
                 biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur
                 [22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic
                 asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10–12] train distinct
                 feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the
                 transpose of the forward weights used in the backward pass by a random matrix. Throughout training,
                 the forward weights learn to align with the arbitrary backward weights, eventually approximating BP.

                 Beyond biological considerations As deep learning models grow bigger, large-scale distributed
                 training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer
                 by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass,
                 updates must only depend on local quantities. Unsupervised learning is naturally suited for this,
                 as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly,
                 synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES)
                 [16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA
                 and directly projects a global error to each layer. A shared feedback path is still needed, but it only
                 depends on a simple random projection operation.

                 Performance of alternative methods Local training methods are successful in unsupervised learn-
                 ing [18]. Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet
                 [14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these
                 tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment
                 techniques to perform well on challenging datasets, some form of weight transport is necessary:
                 either by explicitly sharing sign information [24–26], or by introducing dedicated phases of alignment
                 for the forward and backward weights where some information is shared [27]. To the best of our
                 knowledge, no method compatible with the weight transport problem has ever been demonstrated on
                 challenging tasks.

                 1.2 Motivations and contributions

                 We focus on DFA, a compromise between biological and computational considerations. Notably,
                 DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly
                 preventing learning in demanding settings. Moreover, it allows for asynchronous weight updates,
                 and puts a single operation at the center of the training stage. This enables new classes of training
                 co-processors [28, 29], leveraging dedicated hardware to perform the random projection.

                 Extensive survey We apply DFA in a large variety of settings matching current trends in machine
                 learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but
                 computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly
                 different domains, across eight tasks, and with eleven different architectures. This constitutes a survey
                 of unprecedented scale for an alternative training method, and makes a strong case for the possibility
                 of learning without weight transport in demanding scenarios.

                 Challenging settings We demonstrate the ability of DFA to tackle challenging tasks. We success-
                 fully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale
                 (section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language
                 modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, that have
                 only been recently successfully tackled with deep learning.

                 Modern architectures We prove that the previously established failure of DFA to train convolutions
                 does not generalize. By evaluating performance metrics, comparing against a shallow baseline,
                 measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in
                 layers involving graph convolutions and attention. This significantly broadens the applicability of
                 DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.

                                                  2 Methods

                 Forward pass In a fully connected network, at layer i out of N, neglecting its biases, with W_i its
                 weight matrix, f_i its non-linearity, and h_i its activations, the forward pass is:

                                   <<FORMULA>>               (1)

                 <<FORMULA>> is the input data, and <<FORMULA>> are the predictions. A task-specific cost function
                 <<FORMULA>> is computed to quantify the quality of the predictions with respect to the targets y.

                  Backward pass with BP The weight updates are computed by backpropagation of the error vector.
                 Using the chain-rule of derivatives, each neuron is updated based on its contribution to the cost
                 function. Leaving aside the specifics of the optimizer used, the equation for the weight updates is:

                                     <<FORMULA>>                              (2)
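                 Under the notation defined above, Equations (1) and (2) can be written out as in the sketch below.
                 It assumes that a_i denotes the pre-activations of layer i, that h_0 = X is the input, that the
                 cost is written \mathcal{L}, and that \odot is the element-wise product. These symbols are
                 assumptions chosen to agree with the surrounding definitions.

```latex
\begin{align}
  a_i &= W_i h_{i-1}, \qquad h_i = f_i(a_i), \qquad i = 1, \dots, N \tag{1} \\
  \delta W_i &= -\frac{\partial \mathcal{L}}{\partial W_i}
              = -\left[ \left( W_{i+1}^{\top} \, \delta a_{i+1} \right) \odot f_i'(a_i) \right] h_{i-1}^{\top},
  \qquad \delta a_i = \frac{\partial \mathcal{L}}{\partial a_i} \tag{2}
\end{align}
```

                 The term W_{i+1}^{\top} \delta a_{i+1} is where the transpose of the forward weights enters the
                 update of layer i, which is the source of the weight transport problem.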

                  Backward pass with DFA The gradient signal <<FORMULA>> of the (i+1)-th layer violates synaptic
                 asymmetry. DFA replaces it with a random projection of the topmost derivative of the loss, <<FORMULA>>.
                 For common classification and regression losses such as the mean squared error or the negative log
                 likelihood, this corresponds to a random projection of the global error <<FORMULA>>. With B_i, a fixed
                 random matrix of appropriate shape drawn at initialization for each layer:

                                                                <<FORMULA>>                  (3)
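                 As a rough illustration of the DFA backward pass, the following dependency-free Python sketch
                 computes the weight updates of a small fully connected network. It assumes tanh hidden layers,
                 an identity output layer, and a squared-error loss, so that the topmost derivative equals the
                 global error (prediction minus target). All function names are hypothetical and this is not
                 the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.5):
    """Gaussian random matrix; used for both weights W_i and feedback B_i."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def forward(weights, x):
    """Return pre-activations a_i and activations h_i (hs[0] is the input)."""
    pre, hs = [], [x]
    for i, W in enumerate(weights):
        a = matvec(W, hs[-1])
        pre.append(a)
        is_output = i == len(weights) - 1
        hs.append(list(a) if is_output else [math.tanh(z) for z in a])
    return pre, hs


def dfa_updates(weights, feedback, x, y):
    """DFA updates: each hidden layer receives a fixed random projection
    B_i of the global error instead of the backpropagated signal."""
    pre, hs = forward(weights, x)
    error = [p - t for p, t in zip(hs[-1], y)]  # dL/da at the output layer
    updates = []
    for i in range(len(weights)):
        if i == len(weights) - 1:
            delta = error  # the top layer is updated exactly as in BP
        else:
            projected = matvec(feedback[i], error)
            # tanh'(a) = 1 - tanh(a)^2
            delta = [p * (1.0 - math.tanh(a) ** 2)
                     for p, a in zip(projected, pre[i])]
        # Outer product with the layer input, negated for descent.
        updates.append([[-d * h for h in hs[i]] for d in delta])
    return updates
```

                 A training step would then add a learning rate times each update to the corresponding W_i.
                 Since every hidden-layer update depends only on the global error and local quantities, the
                 updates can in principle be computed in parallel across layers.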

                 3 Experiments

                 We study the applicability of DFA to a diverse set of applications requiring state-of-the-art architec-
                 tures. We start with fully connected networks, where DFA has already been demonstrated, and address
                 new challenging settings. We then investigate geometric learning: we apply DFA to graph neural
                 networks in classification tasks on citation networks, as well as graph autoencoders. These architectures
                 feature graph convolutions and attention layers. Finally, we use DFA to train a transformer-based
                 Natural Language Processing (NLP) model on a dataset of more than 100 million tokens.

                 3.1 Fully connected architectures

                 DFA has been successful at training fully connected architectures, with performance on-par with
                 backpropagation [19,20]. However, only computer vision tasks have been considered, where fully
                 connected networks considerably underperform their convolutional counterpart. Here, we focus on
                 tasks where fully connected architectures are state-of-the-art. Moreover, the architectures considered
                 are deeper and more complex than those necessary to solve a simple task like MNIST.

                 3.1.1 Neural view synthesis with Neural Radiance Fields
                 The most recent state-of-the-art neural view synthesis methods are based on large fully connected
                 networks: this is an ideal setting for a first evaluation of DFA on a challenging task.

                 Background There has been growing interest in methods capable of synthesizing novel renders of
                 a 3D scene using a dataset of past renders. The network is trained to learn an inner representation of
                 the scene, and a classical rendering system can then query the model to generate novel views. With
                 robust enough methods, real-world scenes can also be learned from a set of pictures.
                 Until recently, most successful neural view synthesis methods were based on sampled volumetric
                 representations [30–32]. In this context, Convolutional Neural Networks (CNNs) can be used to
                 smooth out the discrete sampling of 3D space [33,34]. However, these methods scale poorly to
                 higher resolutions, as they still require finer and finer sampling. Conversely, alternative schemes
                 based on a continuous volume representation have succeeded in generating high-quality renders [35],
                 even featuring complex phenomena such as view-dependent scattering [36]. These schemes make
                 point-wise predictions, and use fully connected neural networks to encode the scene.

                                            <<FIGURE>>

                 Figure 1: Comparisons of NeRF-DFA with state-of-the-art methods trained with BP on the most
                 challenging synthetic and real-world scenes. While NeRF-DFA generates renders of lower quality,
                 they maintain multi-view consistency and exhibit no geometric artifacts. BP results from [36].


                 Setting We employ Neural Radiance Fields (NeRF) [36], the state-of-the-art for neural view
                 synthesis. NeRF represents scenes as a continuous 5D function of space (three spatial coordinates
                 and two viewing angles) and outputs a point-wise RGB radiance and opacity. A ray-casting renderer can
                 then query the network to generate arbitrary views of the scene. The network modeling the continuous
                 function is 10 layers deep. Two identical networks are trained: the coarse network predictions inform
                 the renderer about the spatial coordinates that the fine network should preferentially evaluate to avoid
                 empty space and occluded regions.

                 Results We report quantitative results of training NeRF with DFA in Table 1. Neural view synthesis
                 methods are often better evaluated qualitatively: we showcase some renders in Figure 1.
                 On a dataset of renders featuring complex scenes with non-Lambertian materials (NeRF-Synthetic
                 [36]), NeRF-DFA outperforms two previous fine-tuned state-of-the-art methods, Scene Representation
                 Networks (SRN) [35] and Local Light Field Fusion (LLFF) [32], and nearly matches the performance
                 of Neural Volumes (NV) [34]. While DFA underperforms alternative methods trained with BP on
                 the real world view dataset (LLFF-Real [32]), its performance remains significant: real world view
                 synthesis is a challenging task, and this level of PSNR indicates that learning is indeed happening.
                 In particular, we find that NeRF-DFA retains the key characteristics of NeRF-BP: it can render
                 view-dependent effects, and is multi-view consistent. The last point is an especially important achievement,
                 and most visible in videos, as it is a challenge for most algorithms [30–32,35]. The main drawback
                 of NeRF-DFA appears to be a seemingly lower render definition. The NeRF architecture has not


                 Table 1: Peak Signal to Noise Ratio (PSNR, higher is better) of neural view synthesis methods
                 trained with backpropagation against NeRF trained with DFA. Even when trained with DFA, NeRF
                 outperforms two state-of-the-art methods on a synthetic dataset (NeRF-Synthetic), and achieves fair
                 performance on a challenging real world views datasets (LLFF-Real). BP results from [36].

                                            <<TABLE>>

                 been ﬁne-tuned to achieve these results: DFA works out-of-the-box on this advanced method. Future
                 research focusing on architectural changes to NeRF could improve performance with DFA; some
                 preliminary results are included in the supplementary material.
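
                 For reference, the PSNR values reported in Table 1 follow the standard definition
                 PSNR = 10 log10(MAX^2 / MSE). The snippet below is a minimal sketch of that definition, not the
                 evaluation code used for the experiments; the function name and the flat-list image format are
                 assumptions made for illustration.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel lists.

    `max_value` is the largest possible pixel value (1.0 for images in [0, 1]).
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixel values.
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

                 Because the scale is logarithmic, a gap of a few dB between two methods corresponds to a large
                 ratio between their mean squared errors.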

                 3.1.2 Click-through rate prediction with recommender systems
                 We have demonstrated that DFA can train large fully connected networks on the difﬁcult task of neural
                 view synthesis. We now seek to use DFA in more complex heterogeneous architectures, combining
                 the use of fully connected networks with other machine learning methods. Recommender systems are
                 an ideal application for such considerations.

                 Background Recommender systems are used to model the behavior of users and predict future
                 interactions. In particular, in the context of click-through rate (CTR) prediction, these systems model
                 the probability of a user clicking on a given item. Building recommender systems is hard [37]: their
                 input is high-dimensional and sparse, and the model must learn to extract high-order combinatorial
                 features from the data. Moreover, they need to do so efﬁciently, as they are used to make millions of
                 predictions and the training data may contain billions of examples.
                 Factorization Machines (FM) [38] use inner-products of latent vectors between features to extract
                 pairwise feature interactions. They constitute an excellent baseline for shallow recommender systems,
                 but fail to efﬁciently transcribe higher-level features. To avoid extensive feature engineering, it has
                 been suggested that deep learning can be used in conjunction with wide shallow models to extract
                 these higher-level features [39]. In production, these systems are regularly retrained on massive
                 datasets: the speedup allowed by backward unlocking in DFA is thus of particular interest.
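
                 The pairwise interaction term of an FM can be sketched as follows. This is an illustrative
                 sketch under the standard FM formulation [38], not the code used in our experiments; the names
                 fm_pairwise and fm_pairwise_naive are hypothetical. The first function uses the well-known
                 reformulation that avoids the explicit double loop over feature pairs.

```python
def fm_pairwise_naive(x, V):
    """Sum over i < j of <V[i], V[j]> * x[i] * x[j] (quadratic in the number of features)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(a * b for a, b in zip(V[i], V[j]))
            total += inner * x[i] * x[j]
    return total


def fm_pairwise(x, V):
    """Same quantity in O(k * n), with k the latent dimension.

    Uses 0.5 * sum_f [ (sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2 ].
    """
    k = len(V[0])
    total = 0.0
    for f in range(k):
        terms = [V[i][f] * x[i] for i in range(len(x))]
        s = sum(terms)
        s_sq = sum(t * t for t in terms)
        total += 0.5 * (s * s - s_sq)
    return total
```

                 With sparse inputs, only the non-zero entries of x contribute, which is what makes FMs practical
                 on high-dimensional data.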

                 Setting Deep Factorization Machines (DeepFM) [40] combine FM and a deep fully connected
                 neural network, which we train with DFA. The input embedding is still trained directly via gradient
                 descent, as weight transport is not necessary to backpropagate through the FM. Deep & Cross
                 Networks (DCN) [41] replace the FM with a Cross Network, a deep architecture without non-
                 linearities capable of extracting high-degree interactions across features. We train the fully connected
                 network, the deep cross network, and the embeddings with DFA. Finally, Adaptive Factorization
                 Network (AFN) [42] uses Logarithmic Neural Networks [43] to enhance the representational power
                 of its deep component. We evaluate these methods on the Criteo dataset [44], which features nearly
                 46 million samples of one million sparse features. This is a difﬁcult task, where performance
                 improvements of the AUC on the 0.001-level can enhance CTR signiﬁcantly [39].
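
                 To make concrete what training the fully connected component with DFA involves, the sketch
                 below computes the per-layer learning signals: the output error is projected to each hidden layer
                 through a fixed random matrix and multiplied element-wise by the derivative of the activation.
                 This is a simplified illustration, not our implementation: it assumes tanh activations, and
                 all function names are hypothetical.

```python
import math
import random


def random_feedback(n_hidden, n_out, seed=0):
    """Fixed random feedback matrix B of shape (n_hidden, n_out); never trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_out)] for _ in range(n_hidden)]


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dfa_layer_signals(error, feedback_matrices, pre_activations):
    """delta_i = (B_i e) * tanh'(a_i) for every hidden layer i.

    Each layer only needs the global error e, not the signal of the layer above,
    so all signals can be computed independently (backward unlocking).
    """
    signals = []
    for B, a in zip(feedback_matrices, pre_activations):
        projected = matvec(B, error)
        signals.append([p * (1.0 - math.tanh(z) ** 2) for p, z in zip(projected, a)])
    return signals


def weight_update(signal, layer_input, lr):
    """Outer-product update dW = -lr * delta_i h_{i-1}^T."""
    return [[-lr * d * h for h in layer_input] for d in signal]
```

                 No transpose of the forward weights appears anywhere in these functions, which is the sense in
                 which weight transport is avoided.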

                 Results Performance metrics are reported in Table 2. To obtain these results, a simple hyperpa-
                 rameter grid search over optimization and regularization parameters was performed for BP and DFA
                 independently. DFA successfully trains all methods above the FM baseline, and in fact matches BP
                 performance in both DeepFM and AFN. Because of their complexity, recommender systems require
                 intensive tuning and feature engineering to perform at the state-of-the-art level–and reproducing
                 existing results can be challenging [45]. Hence, it is not surprising that a performance gap exists with
                 Deep&Cross–further ﬁne-tuning may be necessary for DFA to reach BP performance.
                 Alignment measurements corroborate that learning is indeed occurring in the special layers of
                 Deep&Cross and AFN–see supplementary for details. Our results on recommender systems support
                 that DFA can learn in a large variety of settings, and that weight transport is not necessary to solve a
                 difﬁcult recommendation task.
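
                 The two metrics reported in Table 2 can be sketched as follows. This is a minimal illustration
                 using their standard definitions, with hypothetical function names; the quadratic-time AUC
                 below is only suitable for small examples, not for a dataset of the size of Criteo.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count for one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

                 AUC depends only on the ranking of the predictions, whereas log loss also penalizes poorly
                 calibrated probabilities, which is why both are reported.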


                 Table 2: AUC (higher is better) and log loss (lower is better) of recommender systems trained on the
                 Criteo dataset [44]. Even in complex heterogeneous architectures, DFA performance is in line with
                 BP. Values in bold indicate DFA AUC within 0.001 from the BP AUC or better.

                                <<TABLE>>


                                         3.2 Geometric Learning with Graph Convolutional Networks

                 The use of sophisticated architectures beyond fully connected layers is necessary for certain tasks,
                 such as geometric learning [46], where information lies in a complex structured domain. To address
                 geometric learning tasks, methods capable of handling graph-based data are commonly needed.
                 Graph convolutional neural networks (GCNNs) [47–50] have demonstrated the ability to process
                 large-scale graph data efﬁciently. We study the applicability of DFA to these methods, including
                 recent architectures based on an attention mechanism. Overall, this is an especially interesting setting,
                 as DFA fails to train more classic 2D image convolutional layers [23].

                 Background Complex data like social networks or brain connections lie on irregular or non-
                 Euclidean domains. They can be represented as graphs, and efﬁcient processing in the spectral
                 domain is possible. Non-spectral techniques to apply neural networks to graphs have also been
                 developed [51–53], but they exhibit unfavorable scaling properties. The success of CNNs in deep
                 learning can be attributed to their ability to efﬁciently process structured high-dimensional data
                 by sharing local ﬁlters. Thus, a generalization of the convolution operator to the graph domain is
                 desirable: [47] ﬁrst proposed a spectral convolution operation for graphs, and [48] introduced a form
                 of regularization to enforce spatial locality of the filters. We use DFA to train several such GCNN
                 implementations. We study both spectral and non-spectral convolutions, as well as methods inspired
                 by the attention mechanism. We consider the task of semi-supervised node classiﬁcation: nodes from
                 a graph are classiﬁed using their relationship to other nodes as well as node-wise features.

                 Setting Fast Localized Convolutions (ChebConv) [49] approximate the graph convolution kernel
                 with Chebyshev polynomials, and are one of the first scalable convolution methods on graphs. Graph
                 Convolutions (GraphConv) [50] remove the need for an explicit parametrization of the kernel by
                 enforcing linearity of the convolution operation on the graph Laplacian spectrum. They are often considered
                 the canonical graph convolution. More recent methods do not operate in the spectral domain. Spline
                 Convolutions (SplineConv) [54] use a spline-based kernel, enabling the inclusion of information
                 about the relative positioning of nodes, enhancing their representational power–for instance in the
                 context of 3D meshes. Graph Attention Networks (GATConv) [55] use self-attention [56] layers to
                 enable predictions at a given node to attend more speciﬁcally to certain parts of its neighborhood.
                 Finally, building upon Jumping Knowledge Network [57], Just Jump (DNAConv) [58] uses multi-
                 head attention [59] to enhance the aggregation process in graph convolutions and enable deeper
                 architectures. We use PyTorch Geometric [60] for reference implementation of all of these methods.
                 We evaluate performance on three citation network datasets: Cora, CiteSeer, and PubMed [61].
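
                 As an illustration of the simplest of these layers, the sketch below implements one graph
                 convolution step, assuming GraphConv refers to the widely used first-order propagation rule
                 H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where D is the degree matrix of A + I. It uses dense
                 Python lists for readability and hypothetical function names; it is not the PyTorch Geometric
                 implementation used in our experiments.

```python
import math


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix."""
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def graph_conv(adj, features, weight):
    """One propagation step: aggregate neighbors, apply weights, then ReLU."""
    a_hat = normalized_adjacency(adj)
    out = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in out]
```

                 Stacking two such layers gives the kind of shallow architecture used for semi-supervised node
                 classification on citation networks.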

                 Results We report classiﬁcation accuracy in Table 3. BP and DFA regularization and optimiza-
                 tion hyperparameters are ﬁne-tuned separately on the Cora dataset. In general, we ﬁnd that less
                 regularization and lower learning rates are needed with DFA. DFA successfully trains all graph
                 methods, independent of whether they use the spectral domain or not, and even if they use attention.
                 Furthermore, for GraphConv, SplineConv, and GATConv, DFA performance nearly matches BP.
                 As GCNNs struggle with learning meaningful representations when stacking many layers [62], all
                 architectures but DNAConv are quite shallow (two layers). However, DFA performance is still
                 signiﬁcantly higher than that of a shallow training method–see supplementary for details. The lower
                 performance on DNAConv is not a failure to learn: alignment measurements show that learning is
                 indeed occurring. It may be explained instead by a need for more in-depth ﬁne-tuning, as this is a
                 deep architecture with 5 successive attention layers.
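
                 The alignment measurements themselves are detailed in the supplementary material. As a rough
                 illustration only, and assuming alignment is measured as the cosine similarity between the DFA
                 learning signal and the corresponding BP gradient (a common choice in the feedback alignment
                 literature), the quantity can be sketched as follows; the function name is hypothetical.

```python
import math


def cosine_alignment(dfa_signal, bp_gradient):
    """Cosine of the angle between a DFA learning signal and the BP gradient.

    Values close to 1 mean the DFA update points in nearly the same direction
    as the true gradient; values around 0 mean the update is uninformative.
    """
    dot = sum(a * b for a, b in zip(dfa_signal, bp_gradient))
    norm = math.sqrt(sum(a * a for a in dfa_signal)) * math.sqrt(sum(b * b for b in bp_gradient))
    if norm == 0.0:
        raise ValueError("alignment is undefined for a zero vector")
    return dot / norm
```

                 Under this assumption, an alignment consistently above zero during training is the evidence
                 that learning is occurring even when the final accuracy lags behind BP.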

                 Table 3: Classiﬁcation accuracy (%, higher is better) of graph convolution methods trained with BP
                 and DFA, on citation networks [61]. Except for ChebConv and DNAConv, DFA performance nearly
                 matches BP performance. Values are in bold when DFA is within 2.5% of BP.

                                            <<TABLE>>

                      Table 4: AUC and Average Precision (AP, higher is better) for a GraphConv GAE trained with
                      BP or DFA on citation networks. DFA reproduces BP performance.

                      Figure 2: t-SNE visualization of the hidden layer activations of a two-layer GraphConv trained
                      on Cora with DFA. Classes form clear clusters, indicating that a useful intermediary
                      representation is learned. Colors represent different classes.


                 We further demonstrate that DFA helps graph convolutions learn meaningful representations by
                 applying t-SNE [63,64] to the hidden layer activations in GraphConv (Figure 2). Clusters of classes
                 are well separated, indicating that a useful intermediary representation is derived by the first layer.

                 Graph autoencoders We consider one last application of graph convolutions, in the context of
                 graph autoencoders (GAE). We train a non-probabilistic GAE [65] based on GraphConv with DFA,
                 and report results in Table 4. DFA performance is always in line with BP.

                 3.3 Natural Language Processing with Transformers

                 We complete our study by training a Transformer [59] on a language modelling task. Transformers
                 have proved successful in text, image, music generation, machine translation, and many supervised
                 NLP tasks [59,66–69]. Here, we demonstrate that DFA can train them, and we show the inﬂuence of
                 tuning the optimizer hyperparameters in narrowing the gap with BP.

                 Background NLP has largely beneﬁted from advances in deep learning. Recurrent Neural Net-
                 works were responsible for early breakthroughs, but their sequential nature prevented efﬁcient
                 parallelization of data processing. Transformers are attention-based models that do not rely on
                 recurrence or convolution. Their ability to scale massively has allowed the training of models with
                 several billion parameters [70,71], obtaining state-of-the-art results on all NLP tasks: Transformers
                 now top the prominent SQuAD 2.0 [72,73] and SuperGLUE [74] benchmarks. In parallel, transfer
                 learning in NLP has leaped forward thanks to language modelling, the unsupervised task of predicting
                 the next word. It can leverage virtually unlimited data from web scraping [75]. This enabled the
                 training of universal language models [76] on extremely large and diversified text corpora. These
                 models are useful across a wide range of domains, and can solve most NLP tasks after ﬁne-tuning.

                 Setting The prominence of both language modelling and Transformers gives us the ideal candidate
                 for our NLP experiments: we train a Transformer to predict the next word on the WikiText-103
                 dataset [77], a large collection of good and featured Wikipedia articles. We use byte-pair-encoding
                 [78] with 32,000 tokens. Our setup is similar to GPT [66]: we adapt the Transformer, originally an
                 encoder-decoder model designed for machine translation, to language modelling. We keep only the
                 encoder and mask the tokens to predict. Our architecture consists of 6 layers, 8 attention heads, a
                 model dimension of 512, and a hidden size of 2048 in the feed-forward blocks. The text is sliced
                 in chunks of 128 tokens and batches of 64 such chunks, resulting in 8192 tokens per batch. Our
                 baseline is trained with BP using the optimization setup of [59]. We found perplexity after 20 epochs
                 to be an excellent indicator of perplexity at convergence; to maximize the number of experiments
                 we could perform, we report the best validation perplexity after 20 epochs. We study two ways of
                 implementing DFA: applying the feedback after every encoder block (macro) or after every layer in
                 those blocks (micro). The input embedding layer receives gradients from the next feedback point
                 through BP. This leaves some amount of weight transport even in the micro-case.
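
                 The batching arithmetic and the evaluation metric of this setup can be sketched as follows.
                 This is an illustrative sketch with hypothetical function names, not our data pipeline: it
                 assumes the corpus is already tokenized into a flat list of token ids, and that perplexity is
                 the exponential of the mean negative log-likelihood per token (natural logarithm).

```python
import math


def make_batches(token_ids, chunk_len=128, batch_size=64):
    """Slice a token stream into chunks of `chunk_len`, then group them in batches.

    A trailing partial chunk is dropped; the last batch may hold fewer chunks.
    """
    chunks = [
        token_ids[i:i + chunk_len]
        for i in range(0, len(token_ids) - chunk_len + 1, chunk_len)
    ]
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


def perplexity(token_log_probs):
    """exp(-mean log p) over the log-probabilities assigned to the true next tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

                 With the default values, a full batch holds 64 x 128 = 8192 tokens. A model that spreads its
                 probability uniformly over the 32,000 tokens of the vocabulary would have a perplexity of 32,000.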

                 Table 5: Best validation perplexity after 20 epochs of a Transformer trained on WikiText-103 (lower
                 is better). The BP and DFA baselines share all hyper-parameters. In Macro the feedback is applied
                 after every transformer layer, while in Micro the feedback is applied after every sub-layer. The
                 learning rate of Adam without the learning rate scheduler is <<FORMULA>>. With the scheduler, the initial
                 learning rate is <<FORMULA>> and it is multiplied by 0.2 when performance plateaus, with a patience of 1.
                 * score after 22 epochs to let the learning rate scheduler take effect

                                          <<TABLE>>

                 Results Our results are summarized in Table 5. Hyper-parameters ﬁne-tuned for BP did not fare
                 well with DFA, but changes in the optimizer narrowed the gap between BP and DFA considerably.
                 The learning rate schedule used on top of Adam [79] in [59] proved detrimental. Using Adam alone
                 required reducing the learning rate when moving from BP to DFA. Increasing β2 from 0.98 [59] to 0.999
                 improved performance significantly. Finally, a simple scheduler that reduces the learning rate when
                 the validation perplexity plateaus helped reduce perplexity further. Considering that the perplexity of the
                 shallow baseline is over 400, DFA is clearly able to train Transformers. However, our results are not
                 on par with BP, especially in the micro setting. A substantial amount of work remains to make DFA
                 competitive with BP, even more so in a minimal weight transport scenario. The large performance
                 improvements brought by small changes in the optimizer indicate that intensive ﬁne-tuning, common
                 in publications introducing state-of-the-art results, could close the gap between BP and DFA.
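
                 The plateau-based scheduler mentioned above can be sketched as follows, using the factor of 0.2
                 and patience of 1 given in the caption of Table 5. This is a simplified stand-in written for
                 illustration, with a hypothetical class name; it is not the scheduler implementation used in
                 our experiments, and the learning rate in any usage of it is a placeholder.

```python
import math


class PlateauScheduler:
    """Multiply the learning rate by `factor` when validation perplexity stops improving."""

    def __init__(self, lr, factor=0.2, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_perplexity):
        """Call once per epoch with the validation perplexity; returns the current lr."""
        if val_perplexity < self.best:
            self.best = val_perplexity
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            # Tolerate `patience` epochs without improvement before decaying.
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

                 With a patience of 1, the learning rate is only reduced after two consecutive epochs without
                 improvement, which is consistent with the need to run slightly longer than 20 epochs for the
                 scheduler to take effect.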

                 4 Conclusion and outlooks

                 We conducted an extensive study demonstrating the ability of DFA to train modern architectures. We
                 considered a broad selection of domains and tasks, with complex models featuring graph convolutions
                 and attention. Our results on large networks like NeRF and Transformers are encouraging, suggesting
                 that with further tuning, such leading architectures can be effectively trained with DFA. Future work
                 on principled training with DFA–in particular regarding the inﬂuence of common practices and
                 whether new procedures are required–will help close the gap with BP.
                 More broadly, we veriﬁed for the ﬁrst time that learning under synaptic asymmetry is possible beyond
                 fully-connected layers, and in tasks signiﬁcantly more difﬁcult than previously considered. This
                 addresses a notable concern in biologically-plausible architectures. DFA still requires an implausible
                 global feedback pathway; however, local training has already been demonstrated at scale. The next
                 step towards biologically-compatible learning is a local method without weight transport.
                 While the tasks and architectures we have considered are not biologically inspired, they constitute
                 a good benchmark for behavioral realism [20]. Any learning algorithm claiming to approximate
                 the brain should reproduce its ability to solve complex and unseen tasks. Furthermore, even though
                 the current implementations of mechanisms like attention are devoid of biological considerations, they
                 represent broader concepts applicable to human brains [80]. Understanding how our brain learns is a
                 gradual process, and future research could incorporate further realistic elements, like spiking neurons.
                 Finally, unlocking the backward pass in large architectures like Transformers is promising. More optimized
                 implementations of DFA–built at a lower level of existing ML libraries–could unlock significant
                 speed-ups. Leveraging the use of a single random projection as the cornerstone of training, dedicated
                 accelerators may employ more exotic hardware architectures. This will open new possibilities in the
                 asynchronous training of massive models.

                                                  Broader Impact

                 Of our survey This study is the ﬁrst experimental validation of DFA as an effective training method
                 in a wide range of challenging tasks and neural network architectures. This significantly broadens the
                 applications of DFA, and more generally brings new insight on training techniques alternative to back-
                 propagation. From neural rendering and recommender systems, to natural language processing or
                 geometric learning, each of these applications has its own potential impact. Our task selection process
                 was motivated by current trends in deep learning, as well as by technically appealing mechanisms
                 (graph convolutions, attention). A limit of our survey is that our–arguably biased–selection of tasks
                 cannot be exhaustive. Our experiments required substantial cloud compute resources, with state-of-
                 the-art GPU hardware. Nevertheless, as this study provides new perspectives for hardware accelerator
                 technologies, it may favor the application of neural networks in ﬁelds previously inaccessible because
                 of computational limits. Future research on DFA should continue to demonstrate its use in novel
                 contexts of interest as they are discovered.

                 Of the considered applications Each of the applications considered in our study has a wide
                 potential impact; consider for example the impact of textual bias in pretrained word embeddings [81].
                 We refer to [82] and references therein for a discussion of ethical concerns of AI applications.

                 Of DFA as a training method DFA enables parallelization of the backward pass and places a
                 single operation at the center of the training process, opening the prospect of reducing the power
                 consumption of training chips by an order of magnitude [28]. Not only is more efﬁcient training a
                 path to more environmentally responsible machine learning [83], but it may lower the barrier of entry,
                 supporting equality and sustainable development goals. A signiﬁcant downside of moving from BP to
                 DFA is a far more limited understanding of how to train models and how the trained models behave.
                 There is a clear empirical understanding of the impact of techniques such as batch normalization
                 or skip connections on the performance of BP; new insights need to be obtained for DFA. BP also
                 enjoys decades of work on topics like adversarial attacks, interpretability, and fairness. Much of
                 this work has to be cross-checked for alternative training methods, something we encourage further
                 research to consider as the next step towards safely and responsibly scaling up DFA.

                 Of biologically motivated methods Finally, a key motivation for this study was to demonstrate that
                 learning challenging tasks was possible without weight transport. Biologically motivated methods
                 are a more foundational research direction, and as such the possible long-term impact of our ﬁndings
                 is harder to estimate under this light. However, fundamental research of this kind is important to open
                 new pathways for ML and neuroscience.

                 Acknowledgments and Disclosure of Funding

                 We thank Igor Carron and Laurent Daudet for the general guidance on the subject of this investigation
                 and the insightful comments, as well as the larger LightOn team for their support.

                 References
                  [1] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
                      Sciences. PhD thesis, Harvard University, 1974.
                  [2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
                      propagation. In Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
                  [3] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves,
                      David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients.
                      In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
                      1627–1635, 2017.
                  [4] Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.
                  [5] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep
                      learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.
                  [6] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance.
                      Cognitive science, 11(1):23–63, 1987.
                  [7] Javier R Movellan. Contrastive hebbian learning in the continuous hopfield model. In
                      Connectionist models, pages 10–17. Elsevier, 1991.
                  [8] Randall C O’Reilly. Biologically plausible error-driven learning using local activation
                      differences: The generalized recirculation algorithm. Neural computation, 8(5):895–938, 1996.
                  [9] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In Artificial intelligence
                      and statistics, pages 448–455, 2009.
                 [10] Yann Le Cun. Learning process in an asymmetric threshold network. In Disordered systems
                      and biological organization, pages 233–240. Springer, 1986.
                 [11] Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target
                      propagation. arXiv preprint arXiv:1407.7906, 2014.
                 [12] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target
                      propagation. In Joint european conference on machine learning and knowledge discovery in
                      databases, pages 498–515. Springer, 2015.
                 [13] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random
                      synaptic feedback weights support error backpropagation for deep learning. Nature
                      communications, 7(1):1–10, 2016.
                 [14] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can
                      scale to imagenet. In International Conference on Machine Learning, pages 583–593, 2019.
                 [15] Wojciech M Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan
                      Pascanu. Sobolev training for neural networks. In Advances in Neural Information Processing
                      Systems, pages 4278–4287, 2017.
                 [16] Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In
                      International Conference on Machine Learning, pages 4839–4850, 2019.
                 [17] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
                      Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information
                      estimation and maximization. In International Conference on Learning Representations, 2019.
                      URL https://openreview.net/forum?id=Bklr3j0cKX.
                 [18] Sindy Löwe, Peter O’Connor, and Bastiaan Veeling. Putting an end to end-to-end: Gradient-
                      isolated learning of representations. In Advances in Neural Information Processing Systems,
                      pages 3033–3045, 2019.
                 [19] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In
                      Advances in neural information processing systems, pages 1037–1045, 2016.
                 [20] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy
                      Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and
                      architectures. In Advances in Neural Information Processing Systems, pages 9368–9378, 2018.
                 [21] Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton.
                      Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12, 2020.
                 [22] Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a hebbian learning rule.
                      Annu. Rev. Neurosci., 31:25–46, 2008.
                 [23] Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with
                      direct feedback alignment. arXiv preprint arXiv:1906.04554, 2019.
                 [24] Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in
                      backpropagation? In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
                 [25] Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep
                      convolutional networks. arXiv preprint arXiv:1812.06488, 2018.
                 [26] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning
                      algorithms can scale to large datasets. In International Conference on Learning Representations,
                      2019. URL https://openreview.net/forum?id=SygvZ209F7.
                 [27] Mohamed Akrout, Collin Wilson, Peter C Humphreys, Timothy Lillicrap, and Douglas Tweed.
                      Using weight mirrors to improve feedback alignment. arXiv preprint arXiv:1904.05391, 2019.
                 [28] Julien Launay, Iacopo Poli, Kilian Müller, Igor Carron, Laurent Daudet, Florent Krzakala, and
                      Sylvain Gigan. Light-in-the-loop: using a photonics co-processor for scalable training of neural
                      networks, 2020.
                 [29] Charlotte Frenkel. Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling
                      Roads to Embedded Cognition. PhD thesis, UCL-Université Catholique de Louvain, 2020.
                 [30] Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis. ACM Transactions on
                      Graphics (TOG), 36(6):1–11, 2017.
                 [31] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck,
                      Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent.
                      In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
                      2367–2376, 2019.
                 [32] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi
                      Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis
                      with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14,
                      2019.
                 [33] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael
                      Zollhöfer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE
                      Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
                 [34] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and
                      Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM
                      Transactions on Graphics (TOG), 38(4):65, 2019.
                 [35] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks:
                      Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information
                      Processing Systems, pages 1119–1130, 2019.
                 [36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi,
                      and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv
                      preprint arXiv:2003.08934, 2020.
                 [37] H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady,
                     Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view
                     from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on
                     Knowledge discovery and data mining, pages 1222–1230, 2013.

                 [38] Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data
                     Mining, pages 995–1000. IEEE, 2010.

                 [39] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye,
                     Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for
                     recommender systems. In Proceedings of the 1st workshop on deep learning for recommender
                     systems, pages 7–10, 2016.

                 [40] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a
                     factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247,
                     2017.

                 [41] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click
                     predictions. In Proceedings of the ADKDD’17, ADKDD’17, New York, NY, USA, 2017.
                     Association for Computing Machinery. ISBN 9781450351942. doi: 10.1145/3124749.3124754.
                     URL https://doi.org/10.1145/3124749.3124754.
                 [42] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning
                     adaptive-order feature interactions. In Thirty-Fourth AAAI Conference on Artificial Intelligence,
                     2020.
                 [43] J Wesley Hines. A logarithmic neural network architecture for unbounded non-linear function
                     approximation. In Proceedings of International Conference on Neural Networks (ICNN’96),
                     volume 2, pages 1245–1250. IEEE, 1996.
                 [44] Criteo. Kaggle contest dataset is now available for academic use!
                     http://labs.criteo.com/2014/09/kaggle-contest-dataset-now-available-academic-use/, 2014.
                     Accessed on 2020-05-20.
                 [45] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much
                     progress? a worrying analysis of recent neural recommendation approaches. In Proceedings of
                     the 13th ACM Conference on Recommender Systems, pages 101–109, 2019.
                 [46] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
                     Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine,
                     34(4):18–42, 2017.
                 [47] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally
                     connected networks on graphs. In International Conference on Learning Representations,
                     2014.
                 [48] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured
                     data. arXiv preprint arXiv:1506.05163, 2015.
                 [49] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks
                     on graphs with fast localized spectral filtering. In Advances in neural information processing
                     systems, pages 3844–3852, 2016.
                 [50] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
                     networks. In International Conference on Learning Representations (ICLR), 2017.
                 [51] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph
                     domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.,
                     volume 2, pages 729–734. IEEE, 2005.
                 [52] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
                     The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
                 [53] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural
                     networks. In International Conference on Learning Representations, 2016.
                 [54] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric
                     deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on
                     Computer Vision and Pattern Recognition, pages 869–877, 2018.
                 [55] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
                     Bengio. Graph attention networks. In International Conference on Learning Representations,
                     2018. URL https://openreview.net/forum?id=rJXMpikCZ.
                 [56] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
                     learning to align and translate. In 3rd International Conference on Learning Representations,
                     ICLR 2015, 2015.
                 [57] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
                     networks? In International Conference on Machine Learning, 2018.

                 [58] Matthias Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. In
                     ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

                 [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
                     Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
                     processing systems, pages 5998–6008, 2017.

                 [60] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric.
                     In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

                 [61] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-
                     Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.

                 [62] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
                     networks? In International Conference on Learning Representations, 2019.
                     URL https://openreview.net/forum?id=ryGs6iA5Km.

                 [63] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine
                     learning research, 9(Nov):2579–2605, 2008.

                 [64] David M Chan, Roshan Rao, Forrest Huang, and John F Canny. Gpu accelerated t-distributed
                     stochastic neighbor embedding. Journal of Parallel and Distributed Computing, 131:1–13,
                     2019.

                 [65] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian
                     Deep Learning, 2016.

                 [66] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
                     understanding with unsupervised learning. Technical report, OpenAI, 2018.

                 [67] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku,
                     and Dustin Tran. Image transformer. ArXiv, abs/1802.05751, 2018.

                 [68] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya
                     Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.

                 [69] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
                     deep bidirectional transformers for language understanding. In Proceedings of the 2019
                     Conference of the North American Chapter of the Association for Computational Linguistics: Human
                     Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis,
                     Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
                     URL https://www.aclweb.org/anthology/N19-1423.

                 [70] Mohammad Shoeybi, Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and
                     Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model
                     parallelism. ArXiv, abs/1909.08053, 2019.

                 [71] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
                     Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
                     few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

                 [72] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+
                     questions for machine comprehension of text. In Proceedings of the 2016 Conference on
                     Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas,
                     November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264.
                     URL https://www.aclweb.org/anthology/D16-1264.

                 [73] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable
                     questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for
                     Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia,
                     July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124.
                     URL https://www.aclweb.org/anthology/P18-2124.

                 [74] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix
                     Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose
                     language understanding systems. In Advances in Neural Information Processing Systems, pages
                     3261–3275, 2019.
                 [75] The Common Crawl Team. Common Crawl. https://commoncrawl.org, 2020.
                 [76] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text
                     classification. In ACL. Association for Computational Linguistics, 2018.
                     URL http://arxiv.org/abs/1801.06146.
                 [77] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
                     models. ArXiv, abs/1609.07843, 2017.
                 [78] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
                     words with subword units. In Proceedings of the 54th Annual Meeting of the Association
                     for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany,
                     August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162.
                     URL https://www.aclweb.org/anthology/P16-1162.
                 [79] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International
                     Conference on Learning Representations, 12 2014.
                 [80] Grace W Lindsay. Attention in psychology, neuroscience, and machine learning. Frontiers in
                     Computational Neuroscience, 14:29, 2020.
                 [81] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.
                     Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In
                     Advances in neural information processing systems, pages 4349–4357, 2016.
                 [82] Alexandra Luccioni and Yoshua Bengio. On the morality of artificial intelligence. arXiv preprint
                     arXiv:1912.11945, 2019.
                 [83] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for
                     deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
                 [84] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer:
                     Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743, 2020.
                 [85] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao,
                     and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint
                     arXiv:1908.03265, 2019.
                 [86] Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns
                     in transformer-based machine translation. arXiv preprint arXiv:2002.10260, 2020.
                 [87] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
                     Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas
                     Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
                     Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style,
                     high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer,
                     F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing
                     Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
                     URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.


                                                  Appendix


                 We first provide additional elements to corroborate our findings: alignment measurement (Section
                 A), and shallow baselines (Section B). We then discuss the process of adapting the considered
                 architectures for DFA (Section C), and the issue of weight transport in attention layers (Section D).
                 We provide some supplementary results for NeRF (Section E), including details of performance on
                 each scene of each dataset, and a discussion on possible mitigation of DFA shortcomings. Finally,
                 we outline steps necessary for reproduction of this work (Section F).

                 A Alignment

                 Alignment measurement In feedback alignment methods, the forward weights learn to align with
                 the random backward weights, making the delivered updates useful. This alignment can be quantified
                 by measuring the cosine similarity between the gradient signal delivered by DFA, $B_i \delta a_y$, and the
                 gradient signal BP would have delivered, $W^T$ <<FORMULA>>. For learning to occur and DFA to work as
                 a training method, there must be alignment. This can be measured numerically [23]. Measuring
                 alignment lets us check whether the layers are effectively being trained by DFA, regardless
                 of performance metrics. We note that any alignment value greater than 0 signifies that learning is
                 occurring. Values closer to 1 indicate a better match with BP, but small alignment values are sufficient
                 to enable learning. We report values measured at the deepest DFA layer.
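
                 A minimal sketch of such a measurement, in plain Python. It assumes the layer directly below the
                 output, so that both signals derive from the same output error. The names `cosine_similarity`,
                 `matvec`, and `alignment`, the toy dimensions, and the random stand-in weights are illustrative
                 assumptions, not the measurement code behind the reported values.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def alignment(dfa_signals, bp_signals):
    """Mean and standard deviation of per-sample cosine similarities."""
    sims = [cosine_similarity(d, b) for d, b in zip(dfa_signals, bp_signals)]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    return mean, std


# Hypothetical toy setup: hidden width 8, output width 3, batch of 4 errors.
rng = random.Random(0)
B = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]    # fixed random feedback
W_T = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]  # stand-in for transposed forward weights
errors = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]

dfa = [matvec(B, e) for e in errors]   # signal delivered by DFA
bp = [matvec(W_T, e) for e in errors]  # signal BP would have delivered
mean, std = alignment(dfa, bp)
print(f"alignment: {mean:.3f} (std {std:.3f})")
```

                 With untrained random weights as above, the value carries no meaning; the point is only the shape
                 of the computation, which yields one cosine similarity per sample and then a mean and a standard
                 deviation over the batch.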

                 Recommender systems We measure alignment on the Criteo dataset, in the two architectures
                 featuring non-conventional fully-connected layers: Deep & Cross and AFN. Alignment is measured
                 after 15 epochs of training, and averaged over a random batch of 512 samples. Results are reported in
                 Table A.1. These alignment measurements indicate that learning is indeed occurring in the cross and
                 logarithmic layers. The high variance of alignment in the cross layers is unique: it may be explained by
                 the absence of non-linearity, and may account for the gap in performance between BP and DFA on
                 this architecture, which is larger than on the others.

                 Table A.1: Alignment cosine similarity (higher is better, standard deviation in parentheses) of
                 recommender systems as measured on the Criteo dataset. Learning occurs in both architectures, and
                 high variance may explain the larger performance gap on Deep & Cross compared to other methods.

                                              <<TABLE>>

                 Graph convolutions We measure alignment on the Cora dataset, after 250 epochs of training,
                 averaging values over every available sample (train, validation, and test splits included). Results are
                 reported in Table A.2. We observe high alignment values in all architectures, indicating that learning
                 is indeed occurring. Slightly lower values in SplineConv and GATConv may be explained by the use
                 of the Exponential Linear Unit (ELU) instead of the Rectified Linear Unit (ReLU) used as activation
                 in other architectures.

                 Table A.2: Alignment cosine similarity (standard deviation in parentheses) of various graph
                 convolution architectures as measured on the Cora dataset. These values corroborate that DFA
                 successfully trains all architectures considered.

                                <<TABLE>>

                 B Shallow baselines

                 Shallow learning We compare DFA to BP, but also to shallow learning, where only the topmost
                 layer is trained. While DFA may not reach the performance level of BP, it should still vastly
                 outperform shallow learning: failure to do so would mean that the weight updates delivered by DFA
                 are useless. On a simple task like MNIST, a shallow baseline may reach an accuracy as high as 90%.
                 However, given the difficulty of the tasks we consider, the shallow baseline is usually much lower here.

                 Figure A.1: Comparisons of NeRF-Tiny trained with BP, DFA, and a shallow approach. Shallow
                 training is insufficient to learn scene geometry. Lego scene from the NeRF-Synthetic dataset.

                                            <<FIGURE>>

                 NeRF Because NeRF models are expensive to train (up to 15 hours on a V100), we consider a
                 simplified setup for the shallow baseline, NeRF-Tiny. This setup operates at half the full resolution
                 of the training images available, runs for 5000 iterations only, and does away with view-dependent
                 characteristics. Furthermore, the network is cut down to 3 layers of half the width of NeRF, and
                 no coarse network is used to inform the sampling. We train this network on the Lego scene of the
                 NeRF-Synthetic dataset, and compare results.

                 Figure A.1 presents renders generated by NeRF-Tiny trained with BP, DFA, and a shallow approach.
                 While BP and DFA deliver similar renders, shallow training fails to reproduce even basic scene
                 geometry, instead outputting a diffuse cloud of colors. This highlights that while DFA may not reach
                 a level of performance on par with BP on NeRF, it nonetheless delivers meaningful updates enabling
                 the learning of complex features.

                 Recommender systems Because recommender systems require fine-tuning, we perform the same
                 hyperparameter search for shallow learning as for DFA and BP. Results are detailed in Table A.3.
                 Performance of shallow training is always well under BP and DFA; recall that 0.001-level differences
                 matter in recommender systems. In particular, in Deep & Cross, where there was the biggest gap between
                 BP and DFA, the performance of the shallow method is extremely poor, well below the FM baseline.
                 Finally, as expected, DeepFM roughly recovers the performance of FM even when trained in the
                 shallow regime.

                 Table A.3: Shallow baseline for recommender system models on the Criteo dataset. Performance is
                 always well below BP and DFA, as expected.

                                         <<TABLE>>

                 Graph convolutions We use the same hyperparameters as for DFA to produce the shallow baseline
                 on graph datasets. Results are reported in Table A.4. Performance is always much worse than BP
                 and DFA. GATConv recovers the best performance: random attention layers may still deliver useful
                 features [84], as do random convolutions.

                 Table A.4: Shallow baseline for GCNNs on Cora, CiteSeer, and PubMed [61]. Performance is always
                 well below BP and DFA.

                               <<TABLE>>

                 Transformers In the baseline setting (optimizer and hyper-parameters of [59]), a Transformer
                 trained in the shallow regime yields a perplexity of 428 on WikiText-103. We do not consider
                 other settings, as the cost of training a Transformer is high and we do not expect any meaningful
                 improvements, as with NeRF above.

                 C Adapting architectures to DFA

                 NeRF We use an architecture identical to the one used in [36], but based on the actual code
                 implementation rather than the description in the paper$^1$. During our tests, we found that
                 lowering the learning rate to <<FORMULA>> rather than <<FORMULA>> works best with DFA.


                 Recommender systems For all training methods (BP, DFA, and shallow), we have conducted
                 independent hyperparameter searches. We performed a grid search over the learning rate, from
                 <<FORMULA>> to <<FORMULA>> in <<FORMULA>> steps, as well as over the dropout probability, from <<FORMULA>> to
                 <<FORMULA>> in <<FORMULA>> steps (where applicable). On DeepFM, this search leads us to reduce the learning
                 rate from <<FORMULA>> with BP to <<FORMULA>> with DFA, but to keep the 0.5 dropout rate. On Deep & Cross,
                 we reduce the learning rate from <<FORMULA>> to <<FORMULA>>, with no dropout in both cases. In AFN, we reduce
                 the learning rate from <<FORMULA>> to <<FORMULA>> and dropout from 0.3 to 0.
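
                 The independent searches described above can be pictured as one exhaustive grid search per
                 training method. The sketch below is only an illustration: the grid values, the `grid_search` and
                 `toy_evaluate` helpers, and the toy scoring function are hypothetical placeholders, not the
                 ranges or code used for the reported results.

```python
from itertools import product


def grid_search(evaluate, learning_rates, dropouts):
    """Return (best_score, best_lr, best_dropout) over the full grid.

    `evaluate(lr, dropout)` must return a score where higher is better.
    """
    best = None
    for lr, dropout in product(learning_rates, dropouts):
        score = evaluate(lr, dropout)
        if best is None or score > best[0]:
            best = (score, lr, dropout)
    return best


# Hypothetical grid, for illustration only.
LEARNING_RATES = [1e-5, 1e-4, 1e-3]
DROPOUTS = [0.0, 0.25, 0.5]


def toy_evaluate(method):
    """Build a stand-in scoring function; a real one would train a model."""
    # Assumption for the demo: each method prefers a different learning rate.
    preferred_lr = {"BP": 1e-3, "DFA": 1e-4, "shallow": 1e-5}[method]

    def evaluate(lr, dropout):
        return -abs(lr - preferred_lr) - 0.01 * dropout

    return evaluate


# One independent search per training method.
results = {
    method: grid_search(toy_evaluate(method), LEARNING_RATES, DROPOUTS)
    for method in ("BP", "DFA", "shallow")
}
for method, (score, lr, dropout) in results.items():
    print(f"{method}: lr={lr:g}, dropout={dropout:g}")
```

                 Running the search separately for each method lets every method settle on its own optimum, so
                 that no method is penalized by hyperparameters tuned for another.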

                 Graph convolutions We manually test a few hyperparameter configurations on the Cora dataset,
                 focusing on learning rate, weight decay, and dropout. We do not consider architectural changes, such
                 as changing the number of filters or of attention heads. For ChebConv and GraphConv, we reduce
                 weight decay to <<FORMULA>> instead of <<FORMULA>>, and set the dropout rate to 0 and 0.1 respectively, instead
                 of 0.5 with BP. For SplineConv, we find that no change in the hyperparameters is necessary. For
                 GATConv, we reduce weight decay to <<FORMULA>> instead of <<FORMULA>> and reduce the rate of the dedicated
                 dropout layer to 0.1 instead of 0.6, but keep the 0.6 dropout rate within the GAT layer. Finally, on
                 DNAConv we disable weight decay entirely, instead of an original value of <<FORMULA>>, double the
                 learning rate from <<FORMULA>> to <<FORMULA>>, and disable dropout entirely. In all cases, we share the
                 backward random matrix across all nodes in a graph.

                 Transformers The model hyper-parameters were fixed across all of our experiments, except for
                 dropout and, in one case specified below, the number of attention heads. We tested several
                 values of dropout probability between 0 and 0.5, but found the original value of 0.1 to perform
                 best. We manually tested a number of optimizers, optimizer parameters, and attention mechanisms.
                 We tested four combinations of optimizers and schedulers: Adam with the scheduler used in [59],
                 Adam alone, RAdam [85] alone, and Adam with a scheduler that reduces the learning rate when
                 the validation perplexity plateaus. We found it necessary to reduce the initial learning rate of Adam
                 from <<FORMULA>> to <<FORMULA>>, although it could be set back to <<FORMULA>> with a scheduler. We tried two values
                 of $\beta_2$, 0.98 and 0.999. We also tried to change <<FORMULA>> and observed some small differences that were
                 not significant enough for the main text. Finally, we tried three attention mechanisms in addition to
                 the standard multihead scaled dot-product attention: the dense and random (learnable) Synthesizers
                 of [84], as well as the fixed attention patterns of [86]. The latter needed to be adapted to language
                 modelling to prevent attending to future tokens, which led us to reduce the number of attention
                 heads to 4. The backward random matrix is always shared across all tokens and batches.
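
                 Sharing the backward random matrix across tokens and batches means that one fixed matrix
                 projects every per-token error vector. A minimal sketch in plain Python follows; the helper names
                 `make_feedback_matrix` and `project_errors`, the uniform initialization, and the toy sizes are
                 assumptions made for illustration.

```python
import random


def make_feedback_matrix(hidden_dim, error_dim, seed=0):
    """Fixed random feedback matrix B, drawn once and never trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(error_dim)]
            for _ in range(hidden_dim)]


def project_errors(feedback, errors):
    """Apply the same B to every token of every sequence in the batch.

    errors:  [batch][token][error_dim]
    returns: [batch][token][hidden_dim]
    """
    return [
        [[sum(b * e for b, e in zip(row, error)) for row in feedback]
         for error in sequence]
        for sequence in errors
    ]


# Hypothetical sizes: 2 sequences, 3 tokens each, error_dim 4, hidden_dim 5.
B = make_feedback_matrix(hidden_dim=5, error_dim=4)
rng = random.Random(1)
errors = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
          for _ in range(2)]
errors[1][2] = list(errors[0][0])  # same error at two different positions

feedback_signal = project_errors(B, errors)
# Identical errors receive identical feedback, whatever their position.
print(feedback_signal[0][0] == feedback_signal[1][2])
```

                 Because a single matrix is stored, the memory cost of the feedback path does not grow with
                 sequence length or batch size. The same pattern applies to sharing the matrix across the nodes
                 of a graph.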

                 D Weight transport and attention

                 We consider an attention layer operating on input $x$. The queries, keys, and values are respectively
                 <<FORMULA>>, and <<FORMULA>> is the dimension of the queries and keys. The layer
                 performs:
                                     <<FORMULA>>                (4)

                 When using DFA on attention, we deliver the random feedback to the top of the layer. Accordingly,
                 to obtain updates to $W_Q$, $W_K$, and $W_V$, we still have to backpropagate through the attention
                 mechanism itself. This involves weight transport on $W_V$, sacrificing some biological realism for
                 simplicity. Overall weight transport between layers still does not occur, and updating the layers in
                 parallel remains possible.
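
                 To make the role of $W_V$ explicit, the sketch below assumes that Eq. (4) is the standard
                 single-head scaled dot-product attention of [59], with $Q = xW_Q$, $K = xW_K$, and $V = xW_V$.
                 The symbols $A$, $y$, $d_k$, and $\mathcal{L}$ are introduced here for illustration only, and
                 $\partial \mathcal{L} / \partial y$ stands for whatever signal is delivered to the top of the layer.

```latex
\begin{align*}
A &= \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right), \qquad y = AV = A\,x\,W_V, \\
\frac{\partial \mathcal{L}}{\partial W_V} &= (A\,x)^T \frac{\partial \mathcal{L}}{\partial y}, \\
\frac{\partial \mathcal{L}}{\partial A} &= \frac{\partial \mathcal{L}}{\partial y}\, V^T
  = \frac{\partial \mathcal{L}}{\partial y}\, W_V^T x^T .
\end{align*}
```

                 Under these assumptions, the update to $W_V$ only needs the signal delivered at the top of the
                 layer and forward activations, whereas the signal reaching $A$, and hence $W_Q$ and $W_K$, passes
                 through $W_V^T$.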

                 Besides using FA or DFA within the attention layer, alternative mechanisms like the synthesizer
                 [84] (which uses random attention in place of the query and key system) or fixed attention [86] can
                 remove the need for weight transport. Implementing these mechanisms in DFA-trained Transformers,
                 or other attention-powered architectures, will require further research.


                 E Supplementary NeRF results

                 Quantitative results We report per-scene scores for each dataset in Table A.5. BP values are taken
                 from [36]. On three scenes of the synthetic dataset, NeRF-DFA even outperforms past state-of-the-art
                 methods trained with BP. Note that Neural Volumes (NV) is not applicable to forward-facing view
                 synthesis, as is required in LLFF-Real, and thus no results are reported.
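
                 For reference, the PSNR metric used in these comparisons can be computed as in the sketch below.
                 It uses the usual definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and assumes pixel
                 intensities in [0, 1]; the function names and toy values are illustrative and do not reproduce the
                 evaluation code behind Table A.5.

```python
import math


def mean_squared_error(prediction, target):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(prediction) != len(target):
        raise ValueError("images must have the same number of values")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    mse = mean_squared_error(prediction, target)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-value "images" with intensities in [0, 1].
target = [0.0, 0.5, 1.0, 0.25]
render = [0.1, 0.4, 0.9, 0.35]
print(f"PSNR: {psnr(render, target):.2f} dB")
```

                 Because the scale is logarithmic, a gap of 1 dB corresponds to a change of about 21% in mean
                 squared error.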

                 Qualitative results We report sample renders from the NeRF-Synthetic dataset (Figure A.2) and
                 the LLFF-Real dataset (Figure A.3), for every scene available. However, we recommend that readers
                 consult the supplementary video to make better sense of characteristics like multi-view consistency
                 and view-dependent effects (most visible on the LLFF-Real Room scene).


                 Table A.5: Per-scene PSNR for NeRF DFA and BP against other state-of-the-art methods on the
                 NeRF-Synthetic and LLFF-Real datasets. DFA performance is fairly homogeneous across each dataset
                 and in line with the differences in other methods.

                                            <<TABLE>>

                 Possible future directions Despite reproducing scene geometry in a multi-view consistent way,
                 NeRF produces renders of a lower quality when trained with DFA instead of BP. In particular, it
                 struggles to transcribe small-scale details, resulting in "blurry" renders. Moreover, it displays
                 high-frequency artefacts: not in the scene geometry, but in individual pixels taking values very distant
                 from their neighborhood. Interestingly, this noise phenomenon is unique to NeRF-DFA: it is not observed
                 on NeRF-BP with similar PSNR values (achieved during training) or on other methods with similar
                 or lower PSNR. This leads us to hypothesize this is an aspect unique to DFA, possibly due to the
                 alignment process. Indeed, DFA creates a bias on the weights, by encouraging them to be "aligned"
                 with arbitrary values dependent on the random matrix used. It is possible this could introduce
                 random noise in the final renders, though we leave a more principled experiment to future research.

                 To attempt to alleviate this issue, we first consider NeRF-Dual. In NeRF-Dual, we average the
                 pixel-wise prediction between the fine and coarse network, to attempt to remove some of the noise.
                 To do so, we first still use the coarse network to create a probability distribution for the hierarchical
                 sampling. Then, we evaluate again both the coarse and fine networks at the locations informed by
                 this probability distribution. Compared to vanilla NeRF, this requires an extra batch of evaluation of
                 the coarse network for all rays. Roughly speaking, this increases inference time by 30–50% depending
                 on the coarse network architecture considered. We note that this is not applied during training, so that
                 training times remain identical.
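
                 The averaging step of NeRF-Dual can be sketched as follows, assuming both networks have already
                 been rendered along the same rays and that the two predictions are weighted equally. The function
                 names and toy pixel values are hypothetical; volume rendering and hierarchical sampling are
                 deliberately left out.

```python
def average_pixel(coarse_rgb, fine_rgb):
    """Average two RGB predictions for the same ray, channel by channel."""
    return tuple((c + f) / 2.0 for c, f in zip(coarse_rgb, fine_rgb))


def nerf_dual_render(coarse_image, fine_image):
    """Combine two renders of the same rays, pixel by pixel."""
    if len(coarse_image) != len(fine_image):
        raise ValueError("both renders must cover the same rays")
    return [average_pixel(c, f) for c, f in zip(coarse_image, fine_image)]


# Hypothetical 3-pixel renders (RGB in [0, 1]); the fine render has one noisy pixel.
coarse = [(0.2, 0.2, 0.2), (0.2, 0.2, 0.2), (0.2, 0.2, 0.2)]
fine = [(0.2, 0.2, 0.2), (1.0, 1.0, 1.0), (0.2, 0.2, 0.2)]

dual = nerf_dual_render(coarse, fine)
print(dual[1])  # the outlier is pulled halfway back toward its neighbors
```

                 The toy example also shows the trade-off: an isolated outlier is damped, but any genuine detail
                 present in only one of the two renders is attenuated by the same factor.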

                 Figure A.2 and Figure A.3 showcase comparisons between NeRF and NeRF-Dual trained with DFA
                 on all scenes. When viewed at high resolution, such as in our supplementary video, the NeRF-Dual
                 renders are more pleasing, especially for the full scenes. They remove most of the high-frequency
                 noise, leading to smoother renders. However, this averaging process further blurs small-scale details in
                 the render. This is especially visible in the NeRF-Synthetic dataset, on scenes like Ficus. Furthermore,
                 NeRF-Dual introduces novel artefacts in the Mic and Ship scenes, with areas improperly colored
                 with a violet tint. The cause for these artefacts is unknown, but they show that NeRF-Dual is far from
                 a silver bullet. The PSNR also increases only minimally, by less than 0.5 per scene. Nevertheless, this
                 shows some promise for alleviating the shortcomings of NeRF-DFA. It is possible that
                 changes to the overall rendering process, or the use of classic image processing techniques, may help
                 enhance the NeRF-DFA images.

                 Finally, we also experimented with increasing the capacity of the fine network, by widening its layers
                 to 512 neurons. We call this architecture NeRF-XL. However, we have not succeeded in getting
                 PSNR values higher than with vanilla NeRF on DFA. In particular, the training process becomes
                 much more cumbersome, as multi-GPU parallelism is needed to fit the model. It is possible that
                 higher network capacity may help the network learn the task at hand and align at the same time, but
                 further work is required.


                 F Reproducibility

                 Hardware used All main experiments require at most a single NVIDIA V100 GPU with 16GB
                 of memory to reproduce. Alignment measurements on large architectures (NeRF and Transformers)
                 require a second identical GPU to keep a copy of the network to evaluate BP gradients.
                 We estimate that a total of around 10,000 GPU-hours on V100s were necessary for this paper.
                 Accordingly, we estimate the cloud-computing carbon impact of this paper to be 1700 kgCO2eq (footnote 2).
                 However, without hyperparameter searches, our results can be reproduced with less than 500 GPU-
                 hours on V100s, with most of that budget going to NeRF and Transformers.

                 Implementation We use the shared random matrix trick from [23] to reduce memory use in DFA
                 and enable its scaling to large networks. We use PyTorch [87] for all experiments. For reference
                 implementation of the methods considered, we relied on various sources. Our NeRF implementation
                 is based on the PyTorch implementation by Krishna Murthy 3 , with modiﬁcations to allow for proper
                 test and validation, as well as DFA and multi-GPU support. For recommender systems, we use
                 PyTorch Geometric [60] for all graph operations. Our Transformer implementation is our own. 
                 Our code is available as supplementary material.

                 NeRF We provide training, testing, and rendering code along with the conﬁgurations used to obtain
                 our results. An example to reproduce our results is given in the supplementary code repository. Given
                 the computing cost associated with training a NeRF, we also provide our trained models.

                 Recommender systems We provide bash scripts to reproduce the results in Table 2 and A.3, with
                 the results of our hyperparameter search. We provide code to reproduce the results in Table A.1.

                 Graph convolutions We provide the code to reproduce all of our results. Note that the t-SNE
                 results are not exactly reproducible, as the CUDA implementation used is non-deterministic.

                 Transformers We provide bash scripts to reproduce Table 5 and the shallow results.

                                    <<FIGURE>>

                 Figure A.2: Sample renders for every scene of the NeRF-Synthetic dataset, for NeRF and NeRF-Dual
                 trained with DFA.

                                            <<FIGURE>>

                Figure A.3: Sample renders for every scene of the LLFF-Real dataset, for NeRF and NeRF-Dual
                 trained with DFA.
<|endoftext|>


<|startoftext|>
Efficient Behavior of Small-World Networks 
 
We introduce the concept of efficiency of a network, measuring how efficiently it exchanges information. By using this simple measure, small-world networks are seen as systems that are both globally and locally efficient. This allows us to give a clear physical meaning to the concept of small-world, and also to perform a precise quantitative analysis of both weighted and unweighted networks. We study neural networks and man-made communication and transportation systems and we show that the underlying general principle of their construction is in fact a small-world principle of high efficiency. PACS numbers: 89.70.+c, 05.90.+m, 87.18.Sn, 89.40.+k
We live in a world of networks. In fact any complex system in nature can be modeled as a network, where vertices are the elements of the system and edges represent the interactions between them. Coupled biological and chemical systems, neural networks, social interacting species, computer networks or the Internet are only a few such examples [1]. Characterizing the structural properties of the networks is then of fundamental importance to understand the complex dynamics of these systems. A recent paper [2] has shown that the connection topology of some biological and social networks is neither completely regular nor completely random. These networks, there named small-worlds in analogy with the small-world phenomenon developed 30 years ago in social psychology [3], are in fact highly clustered like regular lattices, yet have small characteristic path lengths like random graphs. The original paper has triggered a large interest in the study of the properties of small-worlds (see ref. [4] for a recent review). Researchers have focused their attention on different aspects: study of the onset mechanism [5,7], dynamics [8] and spreading of diseases on small-worlds [9], applications to social networks [10,11] and to the Internet [12,13]. In this letter we introduce the concept of efficiency of a network, measuring how efficiently information is exchanged over the network. Using efficiency, small-world networks are seen as systems that are both globally and locally efficient. This formalization gives a clear physical meaning to the concept of small-world, and also allows a precise quantitative analysis of unweighted and weighted networks. We study several systems, like brains, communication and transportation networks, and show that the underlying general principle of their construction is in fact a small-world principle, provided care is taken not to ignore an important observational property (closure).
We start by reexamining the original formulation proposed in ref. [2]. There, a generic graph G with N vertices and K edges is considered. G is assumed to be unweighted, i.e. edges are all equal, sparse ($K \ll N(N-1)/2$), and connected, i.e. there exists at least one path connecting any couple of vertices with a finite number of steps. G is therefore represented by simply giving the adjacency (or connection) matrix, i.e. the $N \times N$ matrix whose entry $a_{ij}$ is 1 if there is an edge joining vertex i to vertex j, and 0 otherwise. An important quantity of G is the degree of vertex i, i.e. the number $k_i$ of edges incident with vertex i (the number of neighbors of i). The average value of $k_i$ is $k = 2K/N$. Once $\{a_{ij}\}$ is given it can be used to calculate the matrix of the shortest path lengths $d_{ij}$ between two generic vertices i and j. The fact that G is assumed to be connected implies that $d_{ij}$ is positive and finite $\forall i \neq j$. In order to quantify the structural properties of G, [2] proposes to evaluate two different quantities: the characteristic path length L and the clustering coefficient C. L is the average distance between two generic vertices, $L = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}$, and C is a local property defined as $C = \frac{1}{N} \sum_{i} C_i$. Here $C_i$ is the number of edges existing in $G_i$, the subgraph of the neighbors of i, divided by the maximum possible number $k_i(k_i-1)/2$. In [2] a simple method is considered to produce a class of graphs with increasing randomness. The initial graph G is taken to be a one-dimensional lattice with each vertex connected to its k neighbors and with periodic boundary conditions. Rewiring each edge at random with probability p, G can be tuned in a continuous way from a regular lattice ($p = 0$) into a random graph ($p = 1$). For the regular lattice we expect <<FORMULA>> and a high clustering coefficient <<FORMULA>>, while for a random graph <<FORMULA>> and <<FORMULA>> [14,5]. Although in the two limit cases a large C is associated with a large L and, vice versa, a small C with a small L, the numerical experiment reveals an intermediate regime at small p where the system is highly clustered like regular lattices, yet has small characteristic path lengths like random graphs. This behavior is there called small-world and it is found to be a property of some social and biological networks analyzed [2].
Now we propose a more general set-up to investigate real networks. We will show that:
- the definition of small-world behavior can be given in terms of a single variable with a physical meaning, the efficiency E of the network;
- $1/L$ and C can be seen as first approximations of E evaluated respectively on a global and on a local scale;
- we can drop all the restrictions on the system, like unweightedness, connectedness and sparseness.
We represent a real network as a generic weighted (and possibly even non-sparse and non-connected) graph G. Such a graph needs two matrices to be described: the adjacency matrix $\{a_{ij}\}$ defined as for the unweighted graph, and the matrix $\{\ell_{ij}\}$ of physical distances. The number $\ell_{ij}$ can be the space distance between the two vertices or the strength of their possible interaction: we suppose $\ell_{ij}$ to be known even if in the graph there is no edge between i and j. To give some examples, $\ell_{ij}$ can be the geographical distance between stations in transportation systems (in such a case $\ell_{ij}$ respects the triangle inequality, though this is not a necessary assumption), the time taken to exchange a packet of information between routers in the Internet, or the inverse velocity of chemical reactions along a direct connection in a biological system. Of course, in the particular case of an unweighted graph $\ell_{ij} = 1$ $\forall i \neq j$. The shortest path length $d_{ij}$ between two generic points i and j is the smallest sum of the physical distances throughout all the possible paths in the graph from i to j. The matrix $\{d_{ij}\}$ is therefore calculated by using the information contained both in matrix $\{a_{ij}\}$ and in matrix $\{\ell_{ij}\}$. We have $d_{ij} \geq \ell_{ij}$ $\forall i, j$, the equality being valid when there is an edge between i and j. Let us now suppose that the system is parallel, i.e. every vertex sends information concurrently along the network, through its edges.
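The computation of the shortest path matrix from the adjacency matrix and the matrix of physical distances can be sketched as follows. This is only a minimal illustration using the Floyd-Warshall recurrence; the function name `shortest_path_matrix` and the four-station example on a unit square are assumptions made for the example, not data from the paper.

```python
import math

def shortest_path_matrix(a, ell):
    """Floyd-Warshall restricted to existing edges.

    a[i][j] is 1 if an edge joins i and j, else 0; ell[i][j] is the physical
    distance, known for every pair. d[i][j] is math.inf when no path exists.
    """
    n = len(a)
    d = [[0.0 if i == j else (ell[i][j] if a[i][j] else math.inf)
          for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

# Hypothetical example: four stations on the corners of a unit square,
# joined in a chain 0-1-2-3 (no direct edge between 0 and 3).
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
ell = [[math.hypot(px - qx, py - qy) for (qx, qy) in points]
       for (px, py) in points]
a = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
d = shortest_path_matrix(a, ell)

# Stations 0 and 3 are 1.0 apart, but the only path runs through 1 and 2.
print(ell[0][3], d[0][3])  # 1.0 3.0
```

In this example the shortest path length equals the physical distance exactly for the pairs joined by an edge, and exceeds it for the others, because Euclidean distances satisfy the triangle inequality.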
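The efficiency $\epsilon_{ij}$ in the communication between vertex i and j can then be defined to be inversely proportional to the shortest distance: $\epsilon_{ij} = 1/d_{ij}$ $\forall i, j$. When there is no path in the graph between i and j, $d_{ij} = +\infty$ and consistently $\epsilon_{ij} = 0$. The average efficiency of G can be defined as:

$$E(G) = \frac{\sum_{i \neq j \in G} \epsilon_{ij}}{N(N-1)} = \frac{1}{N(N-1)} \sum_{i \neq j \in G} \frac{1}{d_{ij}} \qquad (1)$$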

To normalize E we consider the ideal case $G_{id}$ in which the graph G has all the $N(N-1)/2$ possible edges. In such a case the information is propagated in the most efficient way, since every shortest path length equals the physical distance, $d_{ij} = \ell_{ij}$ $\forall i, j$, and E assumes its maximum value $E(G_{id}) = \frac{1}{N(N-1)} \sum_{i \neq j \in G} \frac{1}{\ell_{ij}}$. The efficiency $E(G)$ considered in the rest of the paper is always divided by $E(G_{id})$ and therefore $0 \leq E(G) \leq 1$. Though the equality E = 1 is valid when there is an edge between each couple of vertices, real networks can reach a high value of E.
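A minimal sketch of the normalized efficiency, assuming the graph is given as an adjacency matrix `a` and a matrix `ell` of strictly positive physical distances, is shown below. The helper names and the unit-square example are hypothetical. Dijkstra's algorithm is used here for the shortest paths, and the common factor 1/(N(N-1)) cancels in the ratio between the efficiency of G and that of the ideal graph.

```python
import heapq
import math

def dijkstra(neighbors, source):
    """Shortest weighted distances from source; unreachable nodes are omitted."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue  # stale heap entry
        for v, w in neighbors[u]:
            dv = du + w
            if dv < dist.get(v, math.inf):
                dist[v] = dv
                heapq.heappush(heap, (dv, v))
    return dist

def normalized_efficiency(a, ell):
    """Sum of 1/d_ij over i != j, divided by the same sum with d_ij = ell_ij."""
    n = len(a)
    neighbors = {i: [(j, ell[i][j]) for j in range(n) if j != i and a[i][j]]
                 for i in range(n)}
    raw = 0.0
    for i in range(n):
        dist = dijkstra(neighbors, i)
        # Pairs with no connecting path are absent from dist and contribute 0.
        raw += sum(1.0 / dist[j] for j in dist if j != i)
    ideal = sum(1.0 / ell[i][j] for i in range(n) for j in range(n) if i != j)
    return raw / ideal

# Hypothetical example: four stations on a unit square, chained 0-1-2-3.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
ell = [[math.hypot(px - qx, py - qy) for (qx, qy) in points]
       for (px, py) in points]
chain = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
complete = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

e_chain = normalized_efficiency(chain, ell)        # about 0.80
e_complete = normalized_efficiency(complete, ell)  # 1.0 up to rounding
```

For an unweighted graph the same function can be used with every off-diagonal entry of `ell` set to 1.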
In our formalism, we can define the small-world behaviour by using the single measure E to analyze both the local and global behavior, rather than two different variables L and C. The quantity in eq. (1) is the global efficiency of G and we therefore name it $E_{glob}$. Since E is defined also for a disconnected graph we can characterize the local properties of G by evaluating for each vertex i the efficiency of $G_i$, the subgraph of the neighbors of i. We define the local efficiency as the average efficiency of the local subgraphs, $E_{loc} = \frac{1}{N} \sum_{i \in G} E(G_i)$.
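For an unweighted graph, the local efficiency defined above can be sketched as follows. The graph is assumed to be a dictionary mapping each vertex to the set of its neighbors (symmetric, no self-loops), and a subgraph with fewer than two vertices is given efficiency 0; both are conventions chosen for this illustration rather than prescriptions of the paper.

```python
from collections import deque

def efficiency(adj):
    """Average of 1/d_ij over ordered pairs i != j, with hop-count distances."""
    n = len(adj)
    if n < 2:
        return 0.0  # convention assumed here for degenerate subgraphs
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable vertices never enter dist, so they contribute 0.
        total += sum(1.0 / h for node, h in dist.items() if node != source)
    return total / (n * (n - 1))

def local_efficiency(adj):
    """Average over i of the efficiency of G_i, the subgraph of i's neighbors."""
    total = 0.0
    for i, nbrs in adj.items():
        # Induced subgraph on the neighbors of i; i itself is excluded.
        sub = {u: adj[u] & nbrs for u in nbrs}
        total += efficiency(sub)
    return total / len(adj)

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

print(local_efficiency(triangle))  # 1.0: neighbors stay connected without i
print(local_efficiency(star))      # 0.0: removing the hub isolates the leaves
```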

This quantity plays a role similar to the clustering coefficient C. Since $i \notin G_i$, the local efficiency $E_{loc}$ tells how fault tolerant the system is, that is, how efficient the communication between the first neighbors of i is when i is removed [15]. The definition of small-world can now be rephrased and generalized in terms of the information flow: small-world networks have high $E_{glob}$ and $E_{loc}$, i.e. are very efficient in global and local communication. This definition is valid both for unweighted and weighted graphs, and can also be applied to disconnected and/or non-sparse graphs.
It is interesting to see the correspondence between our measure and the quantities L and C of [2] (or, correspondingly, $1/L$ and C). The fundamental difference is that $E_{glob}$ is the efficiency of a parallel system, where all the nodes in the network concurrently exchange packets of information (such are all the systems in [2], for example), while $1/L$ measures the efficiency of a sequential system (i.e. only one packet of information goes along the network). $1/L$ is a reasonable approximation of $E_{glob}$ when there are not huge differences among the distances in the graph, and this can explain why L works reasonably well in the unweighted examples of [2]. But, in general, $1/L$ can significantly depart from $E_{glob}$. For instance, in the Internet, having a few computers with an extremely slow connection does not mean that the whole Internet diminishes by far its efficiency: in practice, the presence of such very slow computers goes unnoticed, because the other thousands of computers are exchanging packets among them in a very efficient way. Here $1/L$ would give a number very close to zero (strictly 0 in the particular case when a computer is disconnected from the others and $L = +\infty$), while $E_{glob}$ gives the correct efficiency measure of the Internet. We turn now our attention to the local properties of a network. C is only one among the many possible intuitive measures [10] of how well connected a cluster is. It can be shown that when in a graph most of its local subgraphs $G_i$ are not sparse, then C is a good approximation of $E_{loc}$. Summing up, there are not two different kinds of analysis to be done for the global and local scales, but just one with a very precise physical meaning: the efficiency in transporting information.
We now illustrate the onset of the small-world in an unweighted graph by means of the same example used in [2]. A regular lattice with $N = 1000$ and $k = 20$ is rewired with probability p and $E_{glob}$ and $E_{loc}$ are reported in Fig. 1 as functions of p [16]. For $p = 0$ we expect the system to be inefficient on a global scale ($E_{glob} \sim k/N \log(N/K)$) but locally efficient. The situation is inverted for the random graph. In fact at $p = 1$ $E_{glob}$ assumes a maximum value of 0.4, meaning 40% of the efficiency of the ideal graph with an edge between each couple of vertices. This comes at the expense of the fault tolerance (<<FORMULA>>).
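The difference between the inverse characteristic path length and the global efficiency in the disconnected case can be checked on a toy graph. The sketch below is an illustration only, with an assumed example of nine fully connected computers plus one disconnected computer; it uses hop-count distances and treats missing paths as infinite distances.

```python
import math
from collections import deque

def hop_distances(adj):
    """Distances for all ordered pairs i != j; math.inf when no path exists."""
    d = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t in adj:
            if t != s:
                d[(s, t)] = dist.get(t, math.inf)
    return d

def inverse_path_length_and_efficiency(adj):
    """Return (1/L, E_glob) for an unweighted graph with at least two vertices."""
    d = hop_distances(adj)
    pairs = len(d)
    path_length = sum(d.values()) / pairs           # L, infinite if disconnected
    e_glob = sum(1.0 / x for x in d.values()) / pairs
    return 1.0 / path_length, e_glob

# Assumed toy example: nine fully connected computers and one isolated one.
adj = {i: {j for j in range(9) if j != i} for i in range(9)}
adj[9] = set()

inv_l, e_glob = inverse_path_length_and_efficiency(adj)
print(inv_l, e_glob)  # 0.0 for 1/L, 0.8 for E_glob
```

A single disconnected vertex drives the sequential measure to zero, while the parallel measure only loses the 18 ordered pairs that involve that vertex out of 90.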

<<FIGURE>>

FIG. 1. Global and local efficiency for the graph example considered in [2]. A regular lattice with $N = 1000$ and $k = 20$ is rewired with probability p. The small-world behavior results from the increase of $E_{glob}$ caused by the introduction of only a few rewired edges (short cuts), which on the other hand do not affect $E_{loc}$. At $p \sim 0.1$, $E_{glob}$ has almost reached the value of the random graph, though $E_{loc}$ has only diminished by very little from the value of 0.82 of the regular lattice. Small worlds have high $E_{glob}$ and $E_{loc}$.
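The rewiring experiment summarized in the caption above can be sketched in a few functions. This is a minimal, unoptimized illustration: the rewiring rule (keep one endpoint, redirect the other to a uniformly chosen vertex that is not already a neighbor) is one common reading of the procedure in [2], and the function names, the seed, and the small sizes used in the usage note are assumptions made for the example.

```python
import random
from collections import deque

def efficiency(adj):
    """Average of 1/d_ij over ordered pairs i != j (hop-count distances)."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / h for node, h in dist.items() if node != source)
    return total / (n * (n - 1))

def local_efficiency(adj):
    """Average efficiency of the subgraphs induced by each vertex's neighbors."""
    return sum(efficiency({u: adj[u] & nbrs for u in nbrs})
               for nbrs in adj.values()) / len(adj)

def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbors (k even, k < n)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def rewire(adj, p, rng):
    """Rewire each original edge with probability p, keeping the edge count fixed."""
    n = len(adj)
    new = {i: set(nbrs) for i, nbrs in adj.items()}
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    for i, j in edges:
        if rng.random() < p:
            candidates = [m for m in range(n) if m != i and m not in new[i]]
            if candidates:
                m = rng.choice(candidates)
                new[i].discard(j)
                new[j].discard(i)
                new[i].add(m)
                new[m].add(i)
    return new

def sweep(n, k, probabilities, seed=0):
    """Return (p, E_glob, E_loc) for one rewired lattice per probability."""
    rng = random.Random(seed)
    lattice = ring_lattice(n, k)
    rows = []
    for p in probabilities:
        g = rewire(lattice, p, rng)
        rows.append((p, efficiency(g), local_efficiency(g)))
    return rows
```

Calling, for instance, `sweep(200, 8, [0.0, 0.01, 0.1, 1.0])` runs quickly in pure Python; the sizes $N = 1000$ and $k = 20$ of the figure work with the same code but take noticeably longer.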
The small-world behavior appears for intermediate values of p. It results from the fast increase of $E_{glob}$ (for small p we find a linear increase of $E_{glob}$ in the logarithmic horizontal scale) caused by the introduction of only a few rewired edges (short cuts), which on the other hand do not affect $E_{loc}$. At $p \sim 0.1$, $E_{glob}$ has almost reached the maximum value of 0.4, though $E_{loc}$ has only diminished by very little from the maximum value of 0.82. For an unweighted case the description in terms of network efficiency resembles the approximation given in [2]. In particular we have checked that a good agreement with curves $L(p)$ and $C(p)$ [2] can be obtained by reporting <<FORMULA>> and <<FORMULA>>. Of course in such an example the short cuts connect at almost no cost vertices that would otherwise be much farther apart (because <<FORMULA>>). On the other hand this is not true when we consider a weighted network. As real networks we consider first different examples of natural systems (neural networks), and then we turn our attention to man-made communication and transportation systems.
1) Neural Networks. Thanks to recent experiments neural structures can be studied at several levels of scale. Here we focus first on the analysis of the neuro-anatomical structure of cerebral cortex, and then on a simple nervous system at the level of wiring between neurons. The anatomical connections between cortical areas are of particular importance for their intricate relationship with the functional connectivity of the cerebral cortex [18]. We analyze two databases of cortico-cortical connections in the macaque and in the cat [19]. Tab. 1 indicates the two networks are small-worlds [16]: they have high $E_{glob}$, respectively 52% and 69% of the efficiency of the ideal graph with an edge between each couple of vertices (just slightly smaller than the best possible values of 57% and 70% obtained in random graphs) and high $E_{loc}$, respectively 70% and 83%, i.e. high fault tolerance [22]. These results indicate that in neural cortex each region is intermingled with the others and has grown following a perfect balance between local necessities (fault tolerance) and wide-scope interactions. Next we consider the neural network of C. elegans, the only case of a nervous system completely mapped at the level of neurons and chemical synapses [23]. Tab. 1 shows that this is also a small-world network: C. elegans achieves 50% of both global and local efficiency. Moreover the value of $E_{glob}$ is similar to $E_{loc}$. This differs from the cortex databases, where fault tolerance is slightly privileged with respect to global communication.
2) Communication Networks. We have considered two of the most important large-scale communication networks present nowadays: the World Wide Web and the Internet. Tab. 2 shows that they have relatively high values of $E_{glob}$ (slightly smaller than the best possible values obtained for random graphs) and $E_{loc}$. Although the WWW is a virtual network and the Internet is a physical network, at a global scale they transport information essentially in the same way (as their $E_{glob}$ values are almost equal). At a local scale, the bigger $E_{loc}$ in the WWW case can be explained both by the tendency in the WWW to create Web communities (where pages talking about the same subject tend to link to each other), and by the fact that many pages within the same site are often quickly connected to each other by some root or menu page.
3) Transport Networks. Unlike the previous databases, the Boston subway transportation system (MBTA) can be better described by a weighted graph, the matrix $\{\ell_{ij}\}$ being given by the geographical distances between stations. If we consider the MBTA as an unweighted graph we obtain that it is apparently neither locally nor globally efficient (see Tab. 3). On the other hand, when we take into account the geographical distances, we have $E_{glob} = 0.63$: this shows the MBTA is a very efficient transportation system on a global scale, only 37% less efficient than the ideal subway with a direct tunnel from each station to the others. Even in the weighted case $E_{loc}$ stays low (0.03), indicating a poor local behavior: unlike a neural network, the MBTA is not fault tolerant and damage to a station will dramatically affect the connection between the previous and the next station. The difference with respect to neural networks comes from different needs and priorities in the construction and evolution mechanism: when we build a subway system, the priority is given to the achievement of global efficiency, and not to fault tolerance. In fact a temporary problem in a station can be solved by other means: for example, walking, or taking a bus from the previous to the next station. That is to say, the MBTA is not a closed system: it can be considered, after all, as a subgraph of a wider transportation network. This property is of fundamental importance when we analyze a system: while global efficiency is without doubt the major characteristic, it is closure that somehow leads a system to have high local efficiency (without alternatives, there should be high fault-tolerance). The MBTA is not a closed system, and this explains why, unlike in the case of the brain, fault tolerance is not a critical issue. Indeed, if we increase the precision of the analysis and change the MBTA subway network by taking into account, for example, the Boston Bus System, this extended transportation system comes back to be a small-world network (<<FORMULA>>). Qualitatively similar results, confirming the similarity of construction principles, have been obtained for other undergrounds and for a wider transportation system consisting of all the main airplane and highway connections throughout the world [25]. Considering all the transportation alternatives available at that scale makes again the system closed (there are no other reasonable routing alternatives), and so fault-tolerance comes back as a leading construction principle.
Summing up, the introduction of the efficiency measure allows us to give a definition of small-world with a clear physical meaning, and provides important hints on why the original formulas of [2] work reasonably well in some cases, and where they fail. The efficiency measure allows a precise quantitative analysis of the information flow, and works both in the unweighted abstraction, and in the more realistic assumption of weighted networks. Finally, analysis of real data indicates that various existing (neural, communication and transport) networks exhibit the small-world behavior (even, in some cases, when their unweighted abstractions do not), substantiating the idea that the diffusion of small-world networks can be interpreted as the need to create networks that are both globally and locally efficient.
[1] Y. Bar-Yam, Dynamics of Complex Systems (Addison-Wesley, Reading Mass, 1997). 
[2] D.J. Watts and S.H. Strogatz, Nature 393, 440 (1998). 
[3] S. Milgram, Psychol. Today 2, 60 (1967).
[4] M.E.J. Newman, cond-mat/0001118. 
[5] A. Barrat, M. Weigt, Europ. Phys. J. B 13, 547 (2000) 
[6] M. Marchiori and V. Latora, Physica A285, 539 (2000). 
[7] M. Barthelemy, L. Amaral, Phys. Rev. Lett. 82, 3180 (1999). 
[8] L. F. Lago-Fernandez et al, Phys. Rev. Lett. 84, 2758 (2000). 
[9] C. Moore and M.E.J. Newman, Phys. Rev. E61, 5678 (2000). 
[10] M.E.J. Newman, cond-mat/0011144. 
[11] L. A. N. Amaral, A. Scala, M. Barthelemy, and H. E. Stanley, Proc. Natl. Acad. Sci. 97, 11149 (2000). 
[12] R. Albert, H. Jeong, and A.-L. Barabasi, Nature 401, 130 (1999). 
[13] A.-L. Barabasi and R. Albert, Science 286, 509 (1999). 
[14] B. Bollobas, Random Graphs (Academic, London, 1985). 
[15] Our concept of fault tolerance is different from the one adopted in R. Albert, H. Jeong, and A.-L. Barabasi, Nature 406, 378 (2000); R. Cohen et al., Phys. Rev. Lett. 85, 2758 (2000), where the authors consider the response of the entire network to the removal of i.
[16] Here and in the following the matrix $\{d_{ij}\}_{i,j \in G}$ has been computed by using two different methods: the Floyd-Warshall ($O(N^3)$) [17] and the Dijkstra algorithm ($O(N^2 \log N)$) [10].
[17] G. Gallo and S. Pallottino, Ann. Oper. Res. 13, 3 (1988). 
[18] O. Sporns, G. Tononi, G.M. Edelman, Cerebral Cortex 10, 127 (2000).
[19] J.W. Scannell, Nature 386, 452 (1997).
[20] M.P. Young, Phil. Trans. R. Soc. B252, 13 (1993).
[21] J.W. Scannell, M.P. Young and C. Blakemore, J. Neurosci. 15, 1463 (1995).
[22] E. Sivan, H. Parnas and D. Dolev, Biol. Cybern. 81, 11-23 (1999).
[23] J.G. White et al., Phil. Trans. R. Soc. London B314, 1 (1986).
[24] T.B. Achacoso and W.S. Yamamoto, AY's Neuroanatomy of C. elegans for Computation (CRC Press, FL, 1992).
[25] M. Marchiori and V. Latora, in preparation. 

TABLE I. Macaque and cat cortico-cortical connections [19]. The macaque database contains N = 69 cortical areas and K = 413 connections [20]. The cat database has N = 55 cortical areas (including hippocampus, amygdala, entorhinal cortex and subiculum) and K = 564 (revised database and cortical parcellation from [21]). The nervous system of C. elegans consists of N = 282 neurons and K = 2462 links which can be either synaptic connections or gap junctions [24]. 

<<TABLE>>

TABLE II. Communication networks. Data on the World Wide Web from http://www.nd.edu/networks contains N = 325729 documents and K = 1090108 links [12], while the Internet database is taken from http://moat.nlanr.net and has N = 6474 nodes and K = 12572 links. 

<<TABLE>> 

TABLE III. The Boston underground transportation system (MBTA) consists of N = 124 stations and K = 124 tunnels. The matrix $\{\ell_{ij}\}$ of the spatial distances between stations, used for the weighted case, has been calculated using databases from http://www.mbta.com/ and the U.S. National Mapping Division.
 
<<TABLE>>

<|endoftext|>


<|startoftext|>
            Efﬁcient Processing of Deep Neural Networks: A Tutorial and Survey

             Vivienne Sze, Senior Member, IEEE, Yu-Hsin Chen, Student Member, IEEE, Tien-Ju Yang, Student
                                   Member, IEEE, Joel Emer, Fellow, IEEE

                                                     Abstract
         
        Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI)
        applications. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the
        cost of high computational complexity. Accordingly, techniques that enable efficient processing
        of DNNs to improve energy efficiency and throughput without sacrificing application accuracy
        or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.
         This article aims to provide a comprehensive tutorial and survey about the recent advances
        towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an
        overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and
        highlight key trends in reducing the computation cost of DNNs either solely via hardware
        design changes or via joint hardware design and DNN algorithm changes. It will also summarize
        various development resources that enable researchers and practitioners to quickly get started
        in this field, and highlight important benchmarking metrics and design considerations that
        should be used for evaluating the rapidly growing number of DNN hardware designs, optionally
        including algorithmic co-designs, being proposed in academia and industry.
         The reader will take away the following concepts from this article: understand the key design
        considerations for DNNs; be able to evaluate different DNN hardware implementations with
        benchmarks and comparison metrics; understand the trade-offs between various hardware
        architectures and platforms; be able to evaluate the utility of various DNN design techniques
        for efficient processing; and understand recent implementation trends and opportunities.

         V. Sze, Y.-H. Chen and T.-J. Yang are with the Department of Electrical Engineering and
        Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA. (e-mail:
        sze@mit.edu; yhchen@mit.edu, tjy@mit.edu)
         J. S. Emer is with the Department of Electrical Engineering and Computer Science,
        Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and also with Nvidia
        Corporation, Westford, MA 01886 USA. (e-mail: jsemer@mit.edu)

                                                 I. INTRODUCTION
         Deep neural networks (DNNs) are currently the foundation for many modern artificial
        intelligence (AI) applications [1]. Since the breakthrough application of DNNs to speech
        recognition [2] and image recognition [3], the number of applications that use DNNs has
        exploded. These DNNs are employed in a myriad of applications from self-driving cars [4], to
        detecting cancer [5] to playing complex games [6]. In many of these domains, DNNs are now
        able to exceed human accuracy. The superior performance of DNNs comes from their ability to
        extract high-level features from raw sensory data after using statistical learning over a
        large amount of data to obtain an effective representation of an input space. This is
        different from earlier approaches that use hand-crafted features or rules designed by experts.
         The superior accuracy of DNNs, however, comes at the cost of high computational complexity.
        While general-purpose compute engines, especially graphics processing units (GPUs), have been
        the mainstay for much DNN processing, increasingly there is interest in providing more
        specialized acceleration of the DNN computation. This article aims to provide an overview of
        DNNs, the various tools for understanding their behavior, and the techniques being explored
        to efficiently accelerate their computation.
         This paper is organized as follows:
           Section II provides background on the context of why DNNs are important, their history
           and applications.
           Section III gives an overview of the basic components of DNNs and popular DNN models
           currently in use.
           Section IV describes the various resources used for DNN research and development.
           Section V describes the various hardware platforms used to process DNNs and the various
           optimizations used to improve throughput and energy efficiency without impacting
           application accuracy (i.e., produce bit-wise identical results).
           Section VI discusses how mixed-signal circuits and new memory technologies can be used
           for near-data processing to address the expensive data movement that dominates throughput
           and energy consumption of DNNs.
           Section VII describes various joint algorithm and hardware optimizations that can be
           performed on DNNs to improve both throughput and energy efficiency while trying to
           minimize impact on accuracy.
           Section VIII describes the key metrics that should be considered when comparing various
           DNN designs.

                                 II. BACKGROUND ON DEEP NEURAL NETWORKS (DNN)
         In this section, we describe the position of DNNs in the context of AI in general and some
        of the concepts that motivated its development. We will also present a brief chronology of
        the major steps in its history, and some current domains to which it is being applied.

        A. Artificial Intelligence and DNNs
         DNNs, also referred to as deep learning, are a part of the broad field of AI, which is the
        science and engineering of creating intelligent machines that have the ability to

                                                                <<FIGURE>>

        Fig. 1. Deep Learning in the context of Artificial Intelligence.

        Fig. 2. Connections to a neuron in the brain. $x_i$, $w_i$, $f(\cdot)$, and $b$ are the
        activations, weights, non-linear function and bias, respectively. (Figure adopted from [7].)
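        The quantities named in the caption of Fig. 2 are conventionally combined as a weighted sum
        of the activations plus the bias, passed through the non-linear function. The sketch below
        assumes that standard form; the function names, the choice of a rectified linear unit as
        the non-linearity, and the numerical values are illustrative assumptions, not taken from
        the figure.

```python
def relu(z):
    """Rectified linear unit, used here as an assumed example of f."""
    return max(0.0, z)

def neuron_output(x, w, b, f):
    """Apply f to the weighted sum of activations x with weights w, plus bias b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical values: three input activations, three weights, one bias.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.25]
b = 0.5

y = neuron_output(x, w, b, relu)
print(y)  # 1.25
```

        Changing a single weight changes the response to the same input, which is the sense in
        which learning can be viewed as an adjustment of weights.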


        achieve goals like humans do, according to John McCarthy, the computer scientist who coined
        the term in the 1950s. The relationship of deep learning to the whole of artificial
        intelligence is illustrated in Fig. 1.
         Within artificial intelligence is a large sub-field called machine learning, which was
        defined in 1959 by Arthur Samuel as the field of study that gives computers the ability to
        learn without being explicitly programmed. That means a single program, once created, will
        be able to learn how to do some intelligent activities outside the notion of programming.
        This is in contrast to purpose-built programs whose behavior is defined by hand-crafted
        heuristics that explicitly and statically define their behavior.
         The advantage of an effective machine learning algorithm is clear. Instead of the laborious
        and hit-or-miss approach of creating a distinct, custom program to solve each individual
        problem in a domain, the single machine learning algorithm simply needs to learn, via a
        process called training, to handle each new problem.

        to be $10^{14}$ to $10^{15}$ synapses in the average human brain.
         A key characteristic of the synapse is that it can scale the signal ($x_i$) crossing it as
        shown in Fig. 2. That scaling factor can be referred to as a weight ($w_i$), and the way the
        brain is believed to learn is through changes to the weights associated with the synapses.
        Thus, different weights result in different responses to an input. Note that learning is the
        adjustment of the weights in response to a learning stimulus, while the organization (what
        might be thought of as the program) of the brain does not change. This characteristic makes
        the brain an excellent inspiration for a machine-learning-style algorithm.
         Within the brain-inspired computing paradigm there is a subarea called spiking computing.
        In this subarea, inspiration is taken from the fact that the communication on the dendrites
        and axons are spike-like pulses and that the information being conveyed is not just based on
        a spike's amplitude. Instead, it also depends on the time the pulse arrives and that the
        computation that happens in the neuron is a function of not just a single value but the
        width of pulse and the timing relationship between different pulses. An example of a project
        that was
         Within the machine learning ﬁeld, there is an area that is inspired by the spiking of the brain is the IBM TrueNorth [8].
        often referred to as brain-inspired computation. Since the brain In contrast to spiking computing, another subarea of brain-
        is currently the best ‘machine’ we know for learning and inspired computing is called neural networks, which is the
        solving problems, it is a natural place to look for a machine focus of this article. 1
        learning approach. Therefore, a brain-inspired computation is
        a program or algorithm that takes some aspects of its basic B. Neural Networks and Deep Neural Networks (DNNs)
        form or functionality from the way the brain works. This is in   Neural networks take their inspiration from the notion that
        contrast to attempts to create a brain, but rather the program a neuron’s computation involves a weighted sum of the input
        aims to emulate some aspects of how we understand the brain values. These weighted sums correspond to the value scaling
        to operate.                                    performed by the synapses and the combining of those values
         Although scientists are still exploring the details of how the in the neuron. Furthermore, the neuron doesn’t just output that
        brain works, it is generally believed that the main computational weighted sum, since the computation associated with a cascade
        element of the brain is the neuron. There are approximately of neurons would then be a simple linear algebra operation.
        86 billion neurons in the average human brain. The neurons Instead there is a functional operation within the neuron that
        themselves are connected together with a number of elements is performed on the combined inputs. This operation appears
        entering them called dendrites and an element leaving them to be a non-linear function that causes a neuron to generate
        called an axon as shown in Fig. 2. The neuron accepts the an output only if the inputs cross some threshold. Thus by
        signals entering it via the dendrites, performs a computation on analogy, neural networks apply a non-linear function to the
        those signals, and generates a signal on the axon. These input weighted sum of the input values. We look at what some of
        and output signals are referred to as activations. The axon of those non-linear functions are in Section III-A1.
        one neuron branches out and is connected to the dendrites of
        many other neurons. The connections between a branch of the   1 Note: Recent work using TrueNorth in a stylized fashion allows it to be
                                                   used to compute reduced precision neural networks [9]. These types of neural axon and a dendrite is called asynapse. There are estimated networks are discussed in Section VII-A.                                                                                             3
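        The weighted-sum-then-threshold behavior described above can be illustrated with a minimal
        sketch. The function name, the `threshold` parameter and the choice to pass the sum through
        unchanged once it crosses the threshold are assumptions made for illustration; the text only
        states that a non-linear function is applied to the weighted sum of the inputs.

```python
def neuron_output(inputs, weights, bias=0.0, threshold=0.0):
    """Weighted sum of the inputs followed by a simple threshold non-linearity.

    Hypothetical helper: the names and the exact non-linearity are illustrative.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Each synapse scales its input by a weight; the neuron combines the results.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The neuron only generates an output if the combined input crosses the threshold.
    return weighted_sum if weighted_sum > threshold else 0.0
```

        Without the final thresholding step, a cascade of such neurons would reduce to a chain of
        linear operations, which is the point made in the paragraph above.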

                                                <<FIGURE>>
               
        Fig. 3. Simple neural network example and terminology (Figure adopted from [7]).

                            <<FIGURE>>

           (a) Compute the gradient of the loss relative to the filter inputs
           (b) Compute the gradient of the loss relative to the weights

           Fig. 4. An example of backpropagation through a neural network.

                                                    <<FIGURE>>

         Fig. 3(a) shows a diagrammatic picture of a computational neural network. The neurons in the
        input layer receive some values and propagate them to the neurons in the middle layer of the
        network, which is also frequently called a ‘hidden layer’. The weighted sums from one or more
        hidden layers are ultimately propagated to the output layer, which presents the final outputs
        of the network to the user. To align brain-inspired terminology with neural networks, the
        outputs of the neurons are often referred to as activations, and the synapses are often
        referred to as weights as shown in Fig. 3(a). We will use the activation/weight nomenclature
        in this article.

         Fig. 3(b) shows an example of the computation at each layer: <<FORMULA>>, where W_ij, x_i and
        y_j are the weights, input activations and output activations, respectively, and <<FORMULA>>
        is a non-linear function described in Section III-A1. The bias term b is omitted from
        Fig. 3(b) for simplicity.

         Within the domain of neural networks, there is an area called deep learning, in which the
        neural networks have more than three layers, i.e., more than one hidden layer. Today, the
        typical numbers of network layers used in deep learning range from five to more than a
        thousand. In this article, we will generally use the terminology deep neural networks (DNNs)
        to refer to the neural networks used in deep learning.

         DNNs are capable of learning high-level features with more complexity and abstraction than
        shallower neural networks. An example that demonstrates this point is using DNNs to process
        visual data. In these applications, pixels of an image are fed into the first layer of a DNN,
        and the outputs of that layer can be interpreted as representing the presence of different
        low-level features in the image, such as lines and edges. At subsequent layers, these features
        are then combined into a measure of the likely presence of higher level features, e.g., lines
        are combined into shapes, which are further combined into sets of shapes. And finally, given
        all this information, the network provides a probability that these high-level features
        comprise a particular object or scene. This deep feature hierarchy enables DNNs to achieve
        superior performance in many tasks.

        C. Inference versus Training

         Since DNNs are an instance of a machine learning algorithm, the basic program does not change
        as it learns to perform its given tasks. In the specific case of DNNs, this learning involves
        determining the value of the weights (and bias) in the network, and is referred to as training
        the network. Once trained, the program can perform its task by computing the output of the
        network using the weights determined during the training process. Running the program with
        these weights is referred to as inference.

         In this section, we will use image classification, as shown in Fig. 6, as a driving example
        for training and using a DNN. When we perform inference using a DNN, we give an input image
        and the output of the DNN is a vector of scores, one for each object class; the class with the
        highest score indicates the most likely class of object in the image. The overarching goal for
        training a DNN is to determine the weights that maximize the score of the correct class and
        minimize the scores of the incorrect classes. When training the network the correct class is
        often known because it is given for the images used for training (i.e., the training set of
        the network). The gap between the ideal correct scores and the scores computed by the DNN
        based on its current weights is referred to as the loss (L). Thus the goal of training DNNs is
        to find a set of weights to minimize the average loss over a large training set.

         When training a network, the weights (w_ij) are usually updated using a hill-climbing
        optimization process called gradient descent. A multiple of the gradient of the loss relative
        to each weight, which is the partial derivative of the loss with respect to the weight, is
        used to update the weight (i.e., updated <<FORMULA>>, where <<FORMULA>> is called the learning
        rate). Note that this gradient indicates how the weights should change in order to reduce the
        loss. The process is repeated iteratively to reduce the overall loss.

         An efficient way to compute the partial derivatives of the gradient is through a process
        called backpropagation. Backpropagation, which is a computation derived from the chain rule of
        calculus, operates by passing values backwards through the network to compute how the loss is
        affected by each weight.

         This backpropagation computation is, in fact, very similar in form to the computation used
        for inference as shown in Fig. 4 [10]. 2 Thus, techniques for efficiently performing

          2 To backpropagate through each filter: (1) compute the gradient of the loss relative to the
        weights from the filter inputs (i.e., the forward activations) and the gradients of the loss
        relative to the filter outputs; (2) compute the gradient of the loss relative to the filter
        inputs from the filter weights and the gradients of the loss relative to the filter outputs.
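        The two steps in footnote 2 and the gradient descent update described in this section can be
        sketched for a single layer of weighted sums. This is a minimal illustration under stated
        assumptions, not the authors' implementation: the function names are hypothetical, the
        non-linearity and bias are omitted, and `W[i][j]` is taken to connect input i to output j.

```python
def forward(W, x):
    """Weighted sums for one layer: y[j] = sum over i of W[i][j] * x[i]."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]


def backward(W, x, grad_y):
    """Given the loss gradient at the outputs, return (grad_W, grad_x)."""
    n_in, n_out = len(x), len(grad_y)
    # Step (1): weight gradients from the forward activations and the output gradients.
    grad_W = [[x[i] * grad_y[j] for j in range(n_out)] for i in range(n_in)]
    # Step (2): input gradients from the weights and the output gradients.
    grad_x = [sum(W[i][j] * grad_y[j] for j in range(n_out)) for i in range(n_in)]
    return grad_W, grad_x


def gradient_descent_step(W, grad_W, learning_rate):
    """Move every weight against its gradient, scaled by the learning rate."""
    return [[W[i][j] - learning_rate * grad_W[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

        Step (2) has the same weighted-sum structure as `forward`, which is one way to read the
        remark that backpropagation is similar in form to inference. Note also that `backward` needs
        the forward activations `x`, so they must be kept until the backward pass runs.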


        inference can sometimes be useful for performing training. It is, however, important to note a
        couple of points. First, backpropagation requires intermediate outputs of the network to be
        preserved for the backwards computation, thus training has increased storage requirements.
        Second, because the gradients are used for hill-climbing, the precision requirement for
        training is generally higher than for inference. Thus many of the reduced precision techniques
        discussed in Section VII are limited to inference only.

         A variety of techniques are used to improve the efficiency and robustness of training. For
        example, often the loss from multiple sets of input data, i.e., a batch, is collected before a
        single pass of weight update is performed; this helps to speed up and stabilize the training
        process.

         There are multiple ways to train the weights. The most common approach, as described above,
        is called supervised learning, where all the training samples are labeled (e.g., with the
        correct class). Unsupervised learning is another approach where none of the training samples
        are labeled and essentially the goal is to find the structure or clusters in the data.
        Semi-supervised learning falls in between the two approaches where only a small subset of the
        training data is labeled (e.g., use unlabeled data to define the cluster boundaries, and use
        the small amount of labeled data to label the clusters). Finally, reinforcement learning can
        be used to train the weights such that given the state of the current environment, the DNN can
        output what action the agent should take next to maximize expected rewards; however, the
        rewards might not be available immediately after an action, but instead only after a series of
        actions.

         Another commonly used approach to determine weights is fine-tuning, where previously-trained
        weights are available and are used as a starting point and then those weights are adjusted for
        a new dataset (e.g., transfer learning) or for a new constraint (e.g., reduced precision).
        This results in faster training than starting from a random starting point, and can sometimes
        result in better accuracy.

         This article will focus on the efficient processing of DNN inference rather than training,
        since DNN inference is often performed on embedded devices (rather than the cloud) where
        resources are limited, as discussed in more detail later.

        D. Development History

         Although neural nets were proposed in the 1940s, the first practical application employing
        multiple digital neurons didn’t appear until the late 1980s with the LeNet network for
        hand-written digit recognition [11] 3. Such systems are widely used by ATMs for digit
        recognition on checks. However, the early 2010s have seen a blossoming of DNN-based
        applications with highlights such as Microsoft’s speech recognition system in 2011 [2] and the
        AlexNet system for image recognition in 2012 [3]. A brief chronology of deep learning is shown
        in Fig. 5.

          3 In the early 1960s, single analog neuron systems were used for adaptive filtering [12, 13].

                                  DNN Timeline

            1940s - Neural networks were proposed
            1960s - Deep neural networks were proposed
            1989  - Neural networks for recognizing digits (LeNet)
            1990s - Hardware for shallow neural nets (Intel ETANN)
            2011  - Breakthrough DNN-based speech recognition (Microsoft)
            2012  - DNNs for vision start supplanting hand-crafted approaches (AlexNet)
            2014+ - Rise of DNN accelerator research (Neuflow, DianNao...)

        Fig. 5. A concise history of neural networks. ’Deep’ refers to the number of layers in the
        network.

         The deep learning successes of the early 2010s are believed to be a confluence of three
        factors. The first factor is the amount of available information to train the networks. To
        learn a powerful representation (rather than using a hand-crafted approach) requires a large
        amount of training data. For example, Facebook receives over 350 million images per day,
        Walmart creates 2.5 Petabytes of customer data hourly and YouTube has 300 hours of video
        uploaded every minute. As a result, the cloud providers and many businesses have a huge amount
        of data to train their algorithms.

         The second factor is the amount of compute capacity available. Semiconductor device and
        computer architecture advances have continued to provide increased computing capability, and
        we appear to have crossed a threshold where the large amount of weighted sum computation in
        DNNs, which is required for both inference and training, can be performed in a reasonable
        amount of time.

         The successes of these early DNN applications opened the floodgates of algorithmic
        development. It has also inspired the development of several (largely open source) frameworks
        that make it even easier for researchers and practitioners to explore and use DNNs. Combining
        these efforts contributes to the third factor, which is the evolution of the algorithmic
        techniques that have improved application accuracy significantly and broadened the domains to
        which DNNs are being applied.

         An excellent example of the successes in deep learning can be illustrated with the ImageNet
        Challenge [14]. This challenge is a contest involving several different components. One of the
        components is an image classification task where algorithms are given an image and they must
        identify what is in the image, as shown in Fig. 6. The training set consists of 1.2 million
        images, each of which is labeled with one of 1000 object categories that the image contains.
        For the evaluation phase, the algorithm must accurately identify objects in a test set of
        images, which it hasn’t previously seen.

         Fig. 7 shows the performance of the best entrants in the ImageNet contest over a number of
        years. One sees that the accuracy of the algorithms initially had an error rate of 25% or
        more. In 2012, a group from the University of Toronto used graphics processing units (GPUs)
        for their high compute capability and a deep neural network approach, named AlexNet, and
        dropped the error rate by approximately 10% [3]. Their accomplishment inspired an outpouring
        of deep learning style algorithms that have resulted in a steady stream of

        improvements.

         In conjunction with the trend to deep learning approaches for the ImageNet Challenge, there
        has been a corresponding increase in the number of entrants using GPUs. The number grew from
        only 4 entrants using GPUs in 2012 to almost all the entrants (110) using them in 2014. This
        reflects the almost complete switch from traditional computer vision approaches to deep
        learning-based approaches for the competition.

         In 2015, the ImageNet winning entry, ResNet [15], exceeded human-level accuracy with a top-5
        error rate 4 below 5%. Since then, the error rate has dropped below 3% and more focus is now
        being placed on more challenging components of the competition, such as object detection and
        localization. These successes are clearly a contributing factor to the wide range of
        applications to which DNNs are being applied.

                   <<FIGURE>>

        Fig. 6. Example of an image classification task. The machine learning platform takes in an
        image and outputs the confidence scores for a predefined set of classes.

                            <<FIGURE>>

        Fig. 7. Results from the ImageNet Challenge [14].

        E. Applications of DNN

         Many applications can benefit from DNNs, ranging from multimedia to the medical space. In this
        section, we will provide examples of areas where DNNs are currently making an impact and
        highlight emerging areas where DNNs may make an impact in the future.

          Image and Video: Video is arguably the biggest of the big data. It accounts for over 70% of
           today’s Internet traffic [16]. For instance, over 800 million hours of video is collected
           daily worldwide for video surveillance [17]. Computer vision is necessary to extract
           meaningful information from video. DNNs have significantly improved the accuracy of many
           computer vision tasks such as image classification [14], object localization and
           detection [18], image segmentation [19], and action recognition [20].
          Speech and Language: DNNs have significantly improved the accuracy of speech
           recognition [21] as well as many related tasks such as machine translation [2], natural
           language processing [22], and audio generation [23].
          Medical: DNNs have played an important role in genomics to gain insight into the genetics of
           diseases such as autism, cancers, and spinal muscular atrophy [24–27]. They have also been
           used in medical imaging to detect skin cancer [5], brain cancer [28] and breast cancer [29].
          Game Play: Recently, many of the grand AI challenges involving game play have been overcome
           using DNNs. These successes also required innovations in training techniques and many rely
           on reinforcement learning [30]. DNNs have surpassed human level accuracy in playing
           Atari [31] as well as Go [6], where an exhaustive search of all possibilities is not
           feasible due to the unimaginably huge number of possible moves.
          Robotics: DNNs have been successful in the domain of robotic tasks such as grasping with a
           robotic arm [32], motion planning for ground robots [33], visual navigation [4,34], control
           to stabilize a quadcopter [35] and driving strategies for autonomous vehicles [36].

         DNNs are already widely used in multimedia applications today (e.g., computer vision, speech
        recognition). Looking forward, we expect that DNNs will likely play an increasingly important
        role in the medical and robotics fields, as discussed above, as well as finance (e.g., for
        trading, energy forecasting, and risk assessment), infrastructure (e.g., structural safety,
        and traffic control), weather forecasting and event detection [37]. The myriad application
        domains pose new challenges to the efficient processing of DNNs; the solutions then have to be
        adaptive and scalable in order to handle the new and varied forms of DNNs that these
        applications may employ.

        F. Embedded versus Cloud

         The various applications and aspects of DNN processing (i.e., training versus inference) have
        different computational needs. Specifically, training often requires a large dataset 5 and
        significant computational resources for multiple weight-update iterations. In many cases,
        training a DNN model still takes several hours to multiple days and thus is typically
        performed in the cloud. Inference, on the other hand, can happen either in the cloud or at the
        edge (e.g., IoT or mobile).

         In many applications, it is desirable to have the DNN inference processing near the sensor.
        For instance, in computer vision applications, such as measuring wait times in stores or
        predicting traffic patterns, it would be desirable to extract meaningful information from the
        video right at the image sensor rather than in the cloud to reduce the communication cost. For
        other applications such as autonomous vehicles, drone navigation and robotics, local
        processing is desired since the latency and security risks of relying on the cloud are too
        high. However, video involves a large amount of data, which is computationally complex to
        process; thus, low cost hardware to analyze video is challenging yet critical to enabling

          4 The top-5 error rate is measured based on whether the correct answer appears in one of the
        top 5 categories selected by the algorithm.
          5 One of the major drawbacks of DNNs is their need for large datasets to prevent over-fitting
        during training.
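        The top-5 error rate defined in footnote 4 can be computed as in the following sketch. The
        function name and the data layout (one list of per-class scores per image, plus one integer
        label per image) are assumptions chosen for illustration.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose correct label is not among the k highest scores."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    misses = 0
    for class_scores, label in zip(scores, labels):
        # Class indices ordered from the highest score to the lowest, truncated to k.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)
```

        With `k=1` this reduces to the ordinary classification error, where only the single
        highest-scoring class counts as the prediction.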


        these applications. Speech recognition enables us to seamlessly interact with electronic
        devices, such as smartphones. While currently most of the processing for applications such as
        Apple Siri and Amazon Alexa voice services is in the cloud, it is still desirable to perform
        the recognition on the device itself to reduce latency and dependency on connectivity, and to
        improve privacy and security.

         Many of the embedded platforms that perform DNN inference have stringent energy consumption,
        compute and memory cost limitations; efficient processing of DNNs has thus become of prime
        importance under these constraints. Therefore, in this article, we will focus on the compute
        requirements for inference rather than training.

                                            III. OVERVIEW OF DNNS

         DNNs come in a wide variety of shapes and sizes depending on the application. The popular
        shapes and sizes are also evolving rapidly to improve accuracy and efficiency. In all cases,
        the input to a DNN is a set of values representing the information to be analyzed by the
        network. For instance, these values can be pixels of an image, sampled amplitudes of an audio
        wave or the numerical representation of the state of some system or game.

         The networks that process the input come in two major forms: feed forward and recurrent as
        shown in Fig. 8(a). In feed-forward networks all of the computation is performed as a sequence
        of operations on the outputs of a previous layer. The final set of operations generates the
        output of the network, for example a probability that an image contains a particular object,
        the probability that an audio sequence contains a particular word, a bounding box in an image
        around an object or the proposed action that should be taken. In such DNNs, the network has no
        memory and the output for an input is always the same irrespective of the sequence of inputs
        previously given to the network.

         In contrast, recurrent neural networks (RNNs), of which Long Short-Term Memory networks
        (LSTMs) [38] are a popular variant, have internal memory to allow long-term dependencies to
        affect the output. In these networks, some intermediate operations generate values that are
        stored internally in the network and used as inputs to other operations in conjunction with
        the processing of a later input. In this article, we will focus on feed-forward networks since
        (1) the major computation in RNNs is still the weighted sum, which is covered by the
        feed-forward networks, and (2) to-date little attention has been given to hardware
        acceleration specifically for RNNs.

        (Fig. 8 panel labels: Feed Forward, Recurrent, Fully-Connected, Sparsely-Connected.)

         DNNs can be composed solely of fully-connected (FC) layers (also referred to as multi-layer
        perceptrons, or MLP) as shown in the leftmost layer of Fig. 8(b). In a FC layer, all output
        activations are composed of a weighted sum of all input activations (i.e., all outputs are
        connected to all inputs). This requires a significant amount of storage and computation.
        Thankfully, in many applications, we can remove some connections between the activations by
        setting the weights to zero without affecting accuracy. This results in a sparsely connected
        layer. A sparsely connected layer is illustrated in the rightmost layer of Fig. 8(b).

         We can also make the computation more efficient by limiting the number of weights that
        contribute to an output. This sort of structured sparsity can arise if each output is only a
        function of a fixed-size window of inputs. Even further efficiency can be gained if the same
        set of weights is used in the calculation of every output. This repeated use of the same
        weight values is called weight sharing and can significantly reduce the storage requirements
        for weights.

         An extremely popular windowed and weight-shared DNN layer arises by structuring the
        computation as a convolution, as shown in Fig. 9(a), where the weighted sum for each output
        activation is computed using only a small neighborhood of input activations (i.e., all weights
        beyond the neighborhood are set to zero), and where the same set of weights is shared for
        every output (i.e., the filter is space invariant). Such convolution-based layers are referred
        to as convolutional (CONV) layers.

        A. Convolutional Neural Networks (CNNs)

         A common form of DNNs is Convolutional Neural Nets (CNNs), which are composed of multiple
        CONV layers as shown in Fig. 10. In such networks, each layer generates a successively
        higher-level abstraction of the input data, called a feature map (fmap), which preserves
        essential yet unique information. Modern CNNs are able to achieve superior performance by
        employing a very deep hierarchy of layers. CNNs are widely used in a variety of applications
        including image understanding [3], speech recognition [39], game play [6], robotics [32], etc.
        This paper will focus on their use in image processing, specifically for the task of image
        classification [3].

         Each of the CONV layers in a CNN is primarily composed of high-dimensional convolutions as
        shown in Fig. 9(b). In this computation, the input activations of a layer are structured as a
        set of 2-D input feature maps (ifmaps), each of which is called a channel. Each channel is
        convolved with a distinct 2-D filter from the stack of filters, one for each channel; this
        stack of 2-D filters is often referred to as a single 3-D filter. The results of the
        convolution at each point are summed across all the channels. In addition, a 1-D bias can be
        added to the filtering results, but some recent networks [15] remove its usage from parts of
        the layers. The result of this computation is the output activations that comprise one channel
        of output feature map (ofmap). Additional 3-D filters can be used on

          6 Note: the structured sparsity in CONV layers is orthogonal to the sparsity that occurs from
        network pruning as described in Section VII-B2.
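        The windowed, weight-shared computation of a CONV layer described in this section can be
        sketched for a single 3-D filter. This is an illustrative sketch only: the function and
        variable names are hypothetical, a stride of 1 with no padding is assumed, and the filter is
        applied without flipping (cross-correlation), as is common in DNN practice.

```python
def conv_single_filter(ifmaps, filt, bias=0.0):
    """Apply one 3-D filter to a stack of 2-D input feature maps.

    ifmaps[c][row][col] holds channel c of the input; filt[c][i][j] holds the
    2-D filter for channel c. Returns one output channel as a 2-D list.
    """
    channels = len(ifmaps)
    in_h, in_w = len(ifmaps[0]), len(ifmaps[0][0])
    filt_h, filt_w = len(filt[0]), len(filt[0][0])
    out_h, out_w = in_h - filt_h + 1, in_w - filt_w + 1
    ofmap = [[bias] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            # Each output only sees a small window of inputs (structured sparsity),
            # and the same filter weights are reused at every position (weight sharing).
            for c in range(channels):
                for i in range(filt_h):
                    for j in range(filt_w):
                        ofmap[y][x] += ifmaps[c][y + i][x + j] * filt[c][i][j]
    return ofmap
```

        Calling the function once per 3-D filter would yield one output channel per filter. The
        storage for weights depends only on the filter size and channel count, not on the size of
        the input feature maps.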

<<FIGURE>>
Fig. 10. Convolutional Neural Networks.

the same input to create additional output channels. Finally, multiple input feature maps may be processed together as a batch to potentially improve reuse of the filter weights.

Given the shape parameters in Table I, the computation of a CONV layer is defined as

<<FORMULA>>
<<FORMULA>>
<<FORMULA>>
<<FORMULA>>;
<<FORMULA>>;                (1)

O, I, W and B are the matrices of the ofmaps, ifmaps, filters and biases, respectively. U is a given stride size. Fig. 9(b) shows a visualization of this computation (ignoring biases).

To align the terminology of CNNs with the generic DNN,
 * filters are composed of weights (i.e., synapses)
 * input and output feature maps (ifmaps, ofmaps) are composed of activations (i.e., input and output neurons)

Fully-connected (FC) layers are typically applied after the CONV layers for classification purposes. A FC layer also applies filters on the ifmaps as in the CONV layers, but the filters are of the same size as the ifmaps. Therefore, it does not have the weight sharing property of CONV layers. Eq. (1) still holds for the computation of FC layers with a few additional constraints on the shape parameters: <<FORMULA>>, <<FORMULA>>, <<FORMULA>>, and <<FORMULA>>.

In addition to CONV and FC layers, various optional layers can be found in a DNN such as the non-linearity, pooling, and normalization. The function and computations for each of these layers are discussed next.

1) Non-Linearity: A non-linear activation function is typically applied after each CONV or FC layer. Various non-linear functions are used to introduce non-linearity into the DNN as shown in Fig. 11. These include historically conventional non-linear functions such as sigmoid or hyperbolic tangent as well as rectified linear unit (ReLU) [40], which has become popular in recent years due to its simplicity and its ability to enable fast training. Variations of ReLU, such as leaky ReLU [41], parametric ReLU [42], and exponential LU [43] have also been explored for improved accuracy. Finally, a non-linearity called maxout, which takes the max value of two intersecting linear functions, has been shown to be effective in speech recognition tasks [44, 45].

2) Pooling: A variety of computations that reduce the dimensionality of a feature map are referred to as pooling. Pooling, which is applied to each channel separately, enables
<<FIGURE>>
Fig. 12. Various forms of pooling (Figure adopted from Caffe Tutorial [46]).

the network to be robust and invariant to small shifts and distortions. Pooling combines, or pools, a set of values in its receptive field into a smaller number of values. It can be configured based on the size of its receptive field (e.g., 2x2) and pooling operation (e.g., max or average), as shown in Fig. 12. Typically pooling occurs on non-overlapping blocks (i.e., the stride is equal to the size of the pooling). Usually a stride of greater than one is used such that there is a reduction in the dimension of the representation (i.e., feature map).

3) Normalization: Controlling the input distribution across layers can help to significantly speed up training and improve accuracy. Accordingly, the distribution of the layer input activations <<FORMULA>> is normalized such that it has a zero mean and a unit standard deviation. In batch normalization (BN), the normalized value is further scaled and shifted, as shown in Eq. (2), where the parameters <<FORMULA>> are learned from training [47]. ε is a small constant to avoid numerical problems. Prior to this, local response normalization (LRN) [3] was used, which was inspired by lateral inhibition in neurobiology where excited neurons (i.e., high value activations) should subdue their neighbors (i.e., cause low value activations); however, BN is now considered standard practice in the design of CNNs while LRN is mostly deprecated. Note that while LRN usually is performed after the non-linear function, BN is mostly performed between the CONV or FC layer and the non-linear function.

<<FORMULA>> <<FORMULA>>           (2)

B. Popular DNN Models

Many DNN models have been developed over the past two decades. Each of these models has a different 'network architecture' in terms of number of layers, layer types, layer shapes (i.e., filter size, number of channels and filters), and connections between layers. Understanding these variations and trends is important for incorporating the right flexibility in any efficient DNN engine.

In this section, we will give an overview of various popular DNNs such as LeNet [48] as well as those that competed in and/or won the ImageNet Challenge [14] as shown in Fig. 7, most of whose models with pre-trained weights are publicly available for download; the DNN models are summarized in Table II. Two top-5 error results are reported. In the first row, the accuracy is boosted by using multiple crops from the image and an ensemble of multiple trained models (i.e., the DNN needs to be run several times); these results were used to compete in the ImageNet Challenge. The second row reports the accuracy if only a single crop was used (i.e., the DNN is run only once), which is more consistent with what would likely be deployed in real-time and/or energy-constrained applications.

LeNet [11] was one of the first CNN approaches introduced in 1989. It was designed for the task of digit classification in grayscale images of size 28x28. The most well known version, LeNet-5, contains two CONV layers and two FC layers [48]. Each CONV layer uses filters of size 5x5 (1 channel per filter) with 6 filters in the first layer and 16 filters in the second layer. Average pooling of 2x2 is used after each convolution and a sigmoid is used for the non-linearity. In total, LeNet requires 60k weights and 341k multiply-and-accumulates (MACs) per image. LeNet led to CNNs' first commercial success, as it was deployed in ATMs to recognize digits for check deposits.

AlexNet [3] was the first CNN to win the ImageNet Challenge in 2012. It consists of five CONV layers and three FC layers. Within each CONV layer, there are 96 to 384 filters and the filter size ranges from 3x3 to 11x11, with 3 to 256 channels each. In the first layer, the 3 channels of the filter correspond to the red, green and blue components of the input image. A ReLU non-linearity is used in each layer. Max pooling of 3x3 is applied to the outputs of layers 1, 2 and 5. To reduce computation, a stride of 4 is used at the first layer of the network. AlexNet introduced the use of LRN in layers 1 and 2 before the max pooling, though LRN is no longer popular in later CNN models. One important factor that differentiates AlexNet from LeNet is that the number of weights is much larger and the shapes vary from layer to layer. To reduce the amount of weights and computation in the second CONV layer, the 96 output channels of the first layer are split into two groups of 48 input channels for the second layer, such that the filters in the second layer only have 48 channels. Similarly, the weights in the fourth and fifth layers are also split into two groups. In total, AlexNet requires 61M weights and 724M MACs to process one 227x227 input image.

Overfeat [49] has a very similar architecture to AlexNet with five CONV layers and three FC layers. The main differences are that the number of filters is increased for layers 3 (384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first fully connected layer only has 3072 channels rather than 4096, and the input size is 231x231 rather than 227x227. As a result, the number of weights grows to 146M and the number of MACs grows to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The accurate model used in the ImageNet Challenge gives a 0.65% lower top-5 error rate than the fast model at the cost of 1.9x more MACs.

VGG-16 [50] goes deeper to 16 layers consisting of 13 CONV layers and 3 FC layers. In order to balance out the cost of going deeper, larger filters (e.g., 5x5) are built from multiple smaller filters (e.g., 3x3), which have fewer weights, to achieve the same receptive fields as shown in Fig. 13(a). As a result, all CONV layers have the same filter size of 3x3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224x224 input image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives a 0.1% lower top-5 error rate than VGG-16 at the cost of 1.27x more MACs.
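The trade-off behind building larger filters from stacked 3x3 filters can be checked with a little arithmetic. The sketch below is an illustration only, not code from the VGG authors: the helper names `stacked_receptive_field` and `weights_per_channel_pair` are hypothetical, and it assumes stride-1 convolutions, a single input and output channel, and no biases.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied back to back."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each stride-1 layer widens the field by k - 1
    return field


def weights_per_channel_pair(kernel_sizes):
    """Weights needed per (input channel, output channel) pair, ignoring biases."""
    return sum(k * k for k in kernel_sizes)


single = [5]      # one 5x5 filter
stacked = [3, 3]  # two 3x3 filters in sequence

print(stacked_receptive_field(single), weights_per_channel_pair(single))    # 5 25
print(stacked_receptive_field(stacked), weights_per_channel_pair(stacked))  # 5 18
```

Under these assumptions both options see a 5x5 window of the input, while the stacked version needs 18 weights instead of 25.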


<<FIGURE>>
Fig. 13. Decomposing larger filters into smaller filters.

<<FIGURE>>
Fig. 14. Inception module from GoogLeNet [51] with example channel lengths.


GoogLeNet [51] goes even deeper with 22 layers. It introduced an inception module, shown in Fig. 14, which is composed of parallel connections, whereas previously there was only a single serial connection. Different sized filters (i.e., 1x1, 3x3, 5x5), along with 3x3 max-pooling, are used for each parallel connection and their outputs are concatenated for the module output. Using multiple filter sizes has the effect of processing the input at multiple scales. For improved training speed, GoogLeNet is designed such that the weights and the activations, which are stored for backpropagation during training, could all fit into the GPU memory. In order to reduce the number of weights, 1x1 filters are applied as a 'bottleneck' to reduce the number of channels for each filter [52]. The 22 layers consist of three CONV layers, followed by 9 inception layers (each of which is two CONV layers deep), and one FC layer. Since its introduction in 2014, GoogLeNet (also referred to as Inception) has multiple versions: v1 (described here), v3 (footnote 7) and v4. Inception-v3 decomposes the convolutions by using smaller 1-D filters as shown in Fig. 13(b) to reduce number of MACs and weights in order to go deeper to 42 layers. In conjunction with batch normalization [47], v3 achieves over 3% lower top-5 error than v1 with 2.5x increase in computation [53]. Inception-v4 uses residual connections [54], described in the next section, for a 0.4% reduction in error.

7 v2 is very similar to v3.

<<FIGURE>>
Fig. 15. Shortcut module from ResNet [15]. Note that ReLU following last CONV layer in short cut is after the addition.

ResNet [15], also known as Residual Net, uses residual connections to go even deeper (34 layers or more). It was the first DNN entry in the ImageNet Challenge that exceeded human-level accuracy with a top-5 error rate below 5%. One of the challenges with deep networks is the vanishing gradient during training: as the error backpropagates through the network the gradient shrinks, which affects the ability to update the weights in the earlier layers for very deep networks. Residual net introduces a 'shortcut' module which contains an identity connection such that the weight layers (i.e., CONV layers) can be skipped as shown in Fig. 15. Rather than learning the function for the weight layers F(x), the shortcut module learns the residual mapping <<FORMULA>>. Initially, <<FORMULA>> is zero and the identity connection is taken; then gradually during training, the actual forward connection through the weight layer is used. This is similar to the LSTM networks that are used for sequential data. ResNet also uses the 'bottleneck' approach of using 1x1 filters to reduce the number of weight parameters. As a result, the two layers in the shortcut module are replaced by three layers (1x1, 3x3, 1x1) where the 1x1 reduces and then increases (restores) the number of weights. ResNet-50 consists of one CONV layer, followed by 16 shortcut layers (each of which is three CONV layers deep), and one FC layer; it requires 25.5M weights and 3.9G MACs per image. There are various versions of ResNet with multiple depths (e.g., without bottleneck: 18, 34; with bottleneck: 50, 101, 152). The ResNet with 152 layers was the winner of the ImageNet Challenge requiring 11.3G MACs and 60M weights. Compared to ResNet-50, it reduces the top-5 error by around 1% at the cost of 2.9x more MACs and 2.5x more weights.

Several trends can be observed in the popular DNNs shown in Table II. Increasing the depth of the network tends to provide higher accuracy. Controlling for number of weights, a deeper network can support a wider range of non-linear functions


that are more discriminative and also provides more levels of hierarchy in the learned representation [15,50,51,55]. The number of filter shapes continues to vary across layers, thus flexibility is still important. Furthermore, most of the computation has been placed on CONV layers rather than FC layers. In addition, the number of weights in the FC layers is reduced and in most recent networks (since GoogLeNet) the CONV layers also dominate in terms of weights. Thus, the focus of hardware implementations should be on addressing the efficiency of the CONV layers, which in many domains are increasingly important.

IV. DNN DEVELOPMENT RESOURCES

One of the key factors that has enabled the rapid development of DNNs is the set of development resources that have been made available by the research community and industry. These resources are also key to the development of DNN accelerators by providing characterizations of the workloads and facilitating the exploration of trade-offs in model complexity and accuracy. This section will describe these resources such that those who are interested in this field can quickly get started.

A. Frameworks

For ease of DNN development and to enable sharing of trained networks, several deep learning frameworks have been developed from various sources. These open source libraries contain software libraries for DNNs. Caffe was made available in 2014 from UC Berkeley [46]. It supports C, C++, Python and MATLAB. TensorFlow was released by Google in 2015, and supports C++ and Python; it also supports multiple CPUs and GPUs and has more flexibility than Caffe, with the computation expressed as dataflow graphs to manage the tensors (multidimensional arrays). Another popular framework is Torch, which was developed by Facebook and NYU and supports C, C++ and Lua. There are several other frameworks such as Theano, MXNet, CNTK, which are described in [60]. There are also higher-level libraries that can run on top of the aforementioned frameworks to provide a more universal experience and faster development. One example of such libraries is Keras, which is written in Python and supports TensorFlow, CNTK and Theano.

The existence of such frameworks is not only a convenient aid for DNN researchers and application designers, but they are also invaluable for engineering high performance or more efficient DNN computation engines. In particular, because the frameworks make heavy use of a set of primitive operations, such as processing of a CONV layer, they can incorporate use of optimized software or hardware accelerators. This acceleration is transparent to the user of the framework. Thus, for example, most frameworks can use Nvidia's cuDNN library for rapid execution on Nvidia GPUs. Similarly, transparent incorporation of dedicated hardware accelerators can be achieved as was done with the Eyeriss chip [61].

Finally, these frameworks are a valuable source of workloads for hardware researchers. They can be used to drive experimental designs for different workloads, for profiling different workloads and for exploring hardware-software trade-offs.

B. Models

Pretrained DNN models can be downloaded from various websites [56–59] for the various different frameworks. It should be noted that even for the same DNN (e.g., AlexNet) the accuracy of these models can vary by around 1% to 2% depending on how the model was trained, and thus the results do not always exactly match the original publication.

C. Popular Datasets for Classification

It is important to factor in the difficulty of the task when comparing different DNN models. For instance, the task of classifying handwritten digits from the MNIST dataset [62] is much simpler than classifying an object into one of 1000 classes as is required for the ImageNet dataset [14] (Fig. 16). It is expected that the size of the DNNs (i.e., number of weights) and the number of MACs will be larger for the more difficult task than the simpler task and thus require more energy and have lower throughput. For instance, LeNet-5 [48] is designed for digit classification, while AlexNet [3], VGG-16 [50], GoogLeNet [51], and ResNet [15] are designed for the 1000-class ImageNet task.

There are many AI tasks that come with publicly available datasets. Public datasets are important for comparing the accuracy of different approaches. The simplest and most common task is image classification, which involves being given an entire image, and selecting 1 of N classes that the image most likely belongs to. There is no localization or detection.

MNIST is a widely used dataset for digit classification that was introduced in 1998 [62]. It consists of 28x28 pixel grayscale images of handwritten digits. There are 10 classes (for 10 digits) and 60,000 training images and 10,000 test images. LeNet-5 was able to achieve an accuracy of 99.05% when MNIST was first introduced. Since then the accuracy has increased to 99.79% using regularization of neural networks with dropconnect [63]. Thus, MNIST is now considered a fairly easy dataset.

CIFAR is a dataset that consists of 32x32 pixel colored images of various objects, which was released in 2009 [64]. CIFAR is a subset of the 80 million Tiny Image dataset [65]. CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network was able to achieve 64.84% accuracy on CIFAR-10 when it was first introduced [66]. Since then the accuracy has increased to 96.53% using fractional max pooling [67].

ImageNet is a large scale image dataset that was first introduced in 2010; the dataset stabilized in 2012 [14]. It contains images of 256x256 pixel in color with 1000 classes. The classes are defined using the WordNet as a backbone to handle ambiguous word meanings and to combine together synonyms into the same object category. In other words, there is a hierarchy for the ImageNet categories. The 1000 classes were selected such that there is no overlap in the ImageNet hierarchy. The ImageNet dataset contains many fine-grained categories including 120 different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing

<<TABLE>>
TABLE II
Summary of popular DNNs [3,15,48,50,51]. † Accuracy is measured based on top-5 error on ImageNet [14]. ‡ This version of LeNet-5 has 431k weights for the filters and requires 2.3M MACs per image, and uses ReLU rather than sigmoid.

<<FIGURE>>
Fig. 16. MNIST (10 classes, 60k training, 10k testing) [62] vs. ImageNet (1000 classes, 1.3M training, 100k testing) [14] dataset.

images (100 per class) and 50,000 validation images (50 per class).

Accuracy in the ImageNet Challenge is reported using two metrics: Top-5 and Top-1 error. Top-5 error means that if any of the top five scoring categories is the correct category, it is counted as a correct classification. The Top-1 requires that the top scoring category be correct. In 2012, the winner of the ImageNet Challenge (AlexNet) was able to achieve an accuracy of 83.6% for the top-5 (which is substantially better than the 73.8% of the second-place entry that year, which did not use DNNs); it achieved 61.9% on the top-1 of the validation set. In 2017, the highest accuracy was 97.7% for the top-5.

In summary of the various image classification datasets, it is clear that MNIST is a fairly easy dataset, while ImageNet is a challenging one with a wider coverage of classes. Thus in terms of evaluating the accuracy of a given DNN, it is important to consider the dataset upon which the accuracy is measured.

D. Datasets for Other Tasks

Since state-of-the-art DNNs now perform better than human-level accuracy on image classification tasks, the ImageNet Challenge has started to focus on more difficult tasks such as single-object localization and object detection. For single-object localization, the target object must be localized and classified (out of 1000 classes). The DNN outputs the top five categories and top five bounding box locations. There is no penalty for identifying an object that is in the image but not included in the ground truth. For object detection, all objects in the image must be localized and classified (out of 200 classes). The bounding box for all objects in these categories must be labeled. Objects that are not labeled are penalized as are duplicated detections.

Beyond ImageNet, there are also other popular image datasets for computer vision tasks. For object detection, there is the PASCAL VOC (2005-2012) dataset that contains 11k images representing 20 classes (27k object instances, 7k of which have detailed segmentation) [68]. For object detection, segmentation and recognition in context, there is the MS COCO dataset with 2.5M labeled instances in 328k images (91 object categories) [69]; compared to ImageNet, COCO has fewer categories but more instances per category, which is useful for precise 2-D localization. COCO also has more labeled instances per image to potentially help with contextual information.

Most recently even larger scale datasets have been made available. For instance, Google has an Open Images dataset with over 9M images [70], spanning 6000 categories. There is also a YouTube dataset with 8M videos (0.5M hours of video) covering 4800 classes [71]. Google also released an audio dataset comprised of 632 audio event classes and a collection of 2M human-labeled 10-second sound clips [72]. These large datasets will be ever more important as DNNs become deeper with more weight parameters to train.

Undoubtedly, both larger datasets and datasets for new domains will serve as important resources for profiling and exploring the efficiency of future DNN engines.

V. HARDWARE FOR DNN PROCESSING

Due to the popularity of DNNs, many recent hardware platforms have special features that target DNN processing. For


instance, the Intel Knights Landing CPU features special vector instructions for deep learning; the Nvidia PASCAL GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single precision core for faster deep learning computation. Systems have also been built specifically for DNN processing such as Nvidia DGX-1 and Facebook's Big Basin custom DNN server [73]. DNN inference has also been demonstrated on various embedded System-on-Chips (SoC) such as Nvidia Tegra and Samsung Exynos as well as FPGAs. Accordingly, it's important to have a good understanding of how the processing is being performed on these platforms, and how application-specific accelerators can be designed for DNNs for further improvement in throughput and energy efficiency.

<<FIGURE>>
Fig. 17. Highly-parallel compute paradigms.

The fundamental components of both the CONV and FC layers are the multiply-and-accumulate (MAC) operations, which can be easily parallelized. In order to achieve high performance, highly-parallel compute paradigms are very commonly used, including both temporal and spatial architectures as shown in Fig. 17. The temporal architectures appear mostly in CPUs and GPUs, and exploit parallelism such as vectors (SIMD) or parallel threads (SIMT). Such temporal architectures use a centralized control for a large number of ALUs. These ALUs can only fetch data from the memory hierarchy and cannot communicate directly with each other. In contrast, spatial architectures use dataflow processing, i.e., the ALUs form a processing chain so that they can pass data from one to another directly. Sometimes each ALU can have its own control logic and local memory, called a scratchpad or register file. We refer to the ALU with its own local memory as a processing engine (PE). Spatial architectures are commonly used for DNNs in ASIC and FPGA-based designs. In this section, we will discuss the different design strategies for efficient processing on these different platforms, without any impact on accuracy (i.e., all approaches in this section produce bit-wise identical results); specifically,
 * For temporal architectures such as CPUs and GPUs, we will discuss how computational transforms on the kernel can reduce the number of multiplications to increase throughput.
 * For spatial architectures used in accelerators, we will discuss how dataflows can increase data reuse from low cost memories in the memory hierarchy to reduce energy consumption.

A. Accelerate Kernel Computation on CPU and GPU Platforms

<<FIGURE>>
Fig. 18. Mapping to matrix multiplication for fully connected layers

CPUs and GPUs use parallelization techniques such as SIMD or SIMT to perform the MACs in parallel. All the ALUs share the same control and memory (register file). On these platforms, both the FC and CONV layers are often mapped to a matrix multiplication (i.e., the kernel computation). Fig. 18 shows how a matrix multiplication is used for the FC layer. The height of the filter matrix is the number of filters and the width is the number of weights per filter (input channels (C) x width (W) x height (H), since R = W and S = H in the FC layer); the height of the input feature maps matrix is the number of activations per input feature map <<FORMULA>>, and the width is the number of input feature maps (one in Fig. 18(a) and N in Fig. 18(b)); finally, the height of the output feature map matrix is the number of channels in the output feature maps (M), and the width is the number of output feature maps (N), where each output feature map of the FC layer has the dimension of 1 x 1 x number of output channels (M).

The CONV layer in a DNN can also be mapped to a matrix multiplication using a relaxed form of the Toeplitz matrix as shown in Fig. 19. The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix as highlighted in Fig. 19(a). This can lead to either inefficiency in storage, or a complex memory access pattern.

There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) that optimize for matrix multiplications. The matrix multiplication is tiled to the storage hierarchy of these platforms, which is on the order of a few megabytes at the higher levels.
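The FC mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than how any BLAS library implements it: `matmul` and `fc_as_matmul` are hypothetical helper names, each filter and each input feature map is assumed to be already flattened to C x H x W values, and biases are ignored.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def fc_as_matmul(filters, ifmaps):
    """filters: M filters, each flattened to C*H*W weights (the filter matrix rows).
    ifmaps: N input feature maps, each flattened to C*H*W activations.
    Returns an M x N output matrix with one column per input feature map."""
    chw = len(filters[0])
    assert all(len(f) == chw for f in filters)
    assert all(len(x) == chw for x in ifmaps)
    # The input matrix has height C*H*W and width N: each ifmap becomes a column.
    input_matrix = [[x[p] for x in ifmaps] for p in range(chw)]
    return matmul(filters, input_matrix)


# M = 2 filters with C*H*W = 3 weights each, and a batch of N = 2 input feature maps.
filters = [[1, 0, 2], [0, 1, -1]]
ifmaps = [[1, 2, 3], [4, 5, 6]]
print(fc_as_matmul(filters, ifmaps))  # [[7, 16], [-1, -1]]
```

With N = 1 the input matrix degenerates to a single column, which corresponds to the matrix-vector case; batching widens it so that each filter row is reused across N columns.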
 
                    <<FIGURE>>

            Fig. 21. Read and write access per MAC.
 
                    <<FIGURE>>

        Fig. 19. Mapping to matrix multiplication for convolutional layers.     
        
The matrix multiplications on these platforms can be further sped up by
applying computational transforms to the data to reduce the number of
multiplications, while still giving the same bit-wise result. Often this
comes at the cost of an increased number of additions and a more
irregular data access pattern.

Fast Fourier Transform (FFT) [10, 74] is a well-known approach, shown in
Fig. 20, that reduces the number of multiplications from
$O(N_o^2 N_f^2)$ to $O(N_o^2 \log_2 N_o)$, where the output size is
$N_o \times N_o$ and the filter size is $N_f \times N_f$. To perform the
convolution, we take the FFT of the filter and input feature map, and
then perform the multiplication in the frequency domain; we then apply
an inverse FFT to the resulting product to recover the output feature
map in the spatial domain. However, there are several drawbacks to using
FFT: (1) the benefits of FFTs decrease with filter size; (2) the size of
the FFT is dictated by the output feature map size, which is often much
larger than the filter; (3) the coefficients in the frequency domain are
complex. As a result, while FFT reduces computation, it requires larger
storage capacity and bandwidth. Finally, a popular approach for reducing
complexity is to make the weights sparse, which will be discussed in
Section VII-B2; using FFTs makes it difficult for this sparsity to be
exploited.

Several optimizations can be performed on FFT to make it more effective
for DNNs. To reduce the number of operations, the FFT of the filter can
be precomputed and stored. In addition, the FFT of the input feature map
can be computed once and used to generate multiple channels in the
output feature map. Finally, since an image contains only real values,
its Fourier Transform is symmetric, and this can be exploited to reduce
storage and computation cost.

Other approaches include Strassen [75] and Winograd [76], which
rearrange the computation such that the number of multiplications is
reduced from $O(N^3)$ to $O(N^{2.807})$ and by 2.25× for a 3x3 filter,
respectively, at the cost of reduced numerical stability, increased
storage requirements, and specialized processing depending on the size
of the filter.

In practice, different algorithms might be used for different layer
shapes and sizes (e.g., FFT for filters greater than 5x5, and Winograd
for filters 3x3 and below). Existing platform libraries, such as MKL and
cuDNN, dynamically choose the appropriate algorithm for a given shape
and size [77, 78].

B. Energy-Efficient Dataflow for Accelerators

For DNNs, the bottleneck for processing is in the memory access. Each
MAC requires three memory reads (for filter weight, fmap activation, and
partial sum) and one memory write (for the updated partial sum) as shown
in Fig. 21. In the worst case, all of the memory accesses have to go
through the off-chip DRAM, which will severely impact both throughput
and energy efficiency. For example, in AlexNet, to support its 724M
MACs, nearly 3000M DRAM accesses will be required. Furthermore, DRAM
accesses require up to several orders of magnitude higher energy than
computation [79].

Accelerators, such as spatial architectures as shown in Fig. 17, provide
an opportunity to reduce the energy cost of data movement by introducing
several levels of local memory hierarchy with different energy cost as
shown in Fig. 22. This includes a large global buffer with a size of
several hundred kilobytes that connects to DRAM, an inter-PE network
that can pass data directly between the ALUs, and a register file (RF)
within each processing element (PE) with a size of a few kilobytes or
less. The multiple levels of memory hierarchy help to improve energy
efficiency by providing low-cost data accesses. For example, fetching
the data from the RF or neighbor PEs costs 1 or 2 orders of magnitude
less energy than fetching it from DRAM.

Accelerators can be designed to support specialized processing dataflows
that leverage this memory hierarchy. The dataflow

                                                        <<FORMULA>>

                                                   Fig. 23. Data reuse opportunities in DNNs [80].

                        <<FORMULA>>

        Fig. 22. Memory hierarchy and data movement energy [80].

                                  
decides what data gets read into which level of the memory hierarchy and
when it gets processed. Since there is no randomness in the processing
of DNNs, it is possible to design a fixed dataflow that can adapt to the
DNN shapes and sizes and optimize for the best energy efficiency. The
optimized dataflow minimizes access from the more energy-consuming
levels of the memory hierarchy. Large memories that can store a
significant amount of data consume more energy than smaller memories.
For instance, DRAM can store gigabytes of data, but consumes two orders
of magnitude higher energy per access than a small on-chip memory of a
few kilobytes. Thus, every time a piece of data is moved from an
expensive level to a lower cost level in terms of energy, we want to
reuse that piece of data as much as possible to minimize subsequent
accesses to the expensive levels. The challenge, however, is that the
storage capacity of these low cost memories is limited. Thus we need to
explore different dataflows that maximize reuse under these constraints.

For DNNs, we investigate dataflows that exploit three forms of input
data reuse (convolutional, feature map and filter) as shown in Fig. 23.
For convolutional reuse, the same input feature map activations and
filter weights are used within a given channel, just in different
combinations for different weighted sums. For feature map reuse,
multiple filters are applied to the same feature map, so the input
feature map activations are used multiple times across filters. Finally,
for filter reuse, when multiple input feature maps are processed at once
(referred to as a batch), the same filter weights are used multiple
times across input feature maps.

If we can harness the three types of data reuse by storing the data in
the local memory hierarchy and accessing them multiple times without
going back to the DRAM, it can save a significant amount of DRAM
accesses. For example, in AlexNet, the number of DRAM reads can be
reduced by up to 500× in the CONV layers. The local memory can also be
used for partial sum accumulation, so they do not have to reach DRAM. In
the best case, if all data reuse and accumulation can be achieved by the
local memory hierarchy, the 3000M DRAM accesses in AlexNet can be
reduced to only 61M.

The operation of DNN accelerators is analogous to that of
general-purpose processors as illustrated in Fig. 24 [81]. In
conventional computer systems, the compiler translates the program into
machine-readable binary codes for execution given the hardware
architecture (e.g., x86 or ARM); in the processing of DNNs, the mapper
translates the DNN shape and size into a hardware-compatible computation
mapping for execution given the dataflow. While the compiler usually
optimizes for performance, the mapper optimizes for energy efficiency.

<<FORMULA>>

Fig. 24. An analogy between the operation of DNN accelerators (texts in
black) and that of general-purpose processors (texts in red). Figure
adopted from [81].

The following taxonomy (Fig. 25) can be used to classify the DNN
dataflows in recent works [82–93] based on their data handling
characteristics [80]:

1) Weight stationary (WS): The weight stationary dataflow is designed to
minimize the energy consumption of reading weights by maximizing the
accesses of weights from the register file (RF) at the PE (Fig. 25(a)).
Each weight is read from DRAM into the RF of each PE and stays
stationary for further accesses. The processing runs as many MACs that
use the same weight as possible while the weight is present in the RF;
it maximizes convolutional and filter reuse of weights. The inputs and
partial sums must move through the spatial array and global buffer. The
input fmap activations are broadcast to all PEs and then the partial
sums are spatially accumulated across the PE array.

One example of previous work that implements the weight stationary
dataflow is nn-X, or neuFlow [85], which uses eight 2-D convolution
engines for processing a 10×10 filter. There are a total of 100 MAC
units, i.e. PEs, per engine with each PE having a weight that stays
stationary for processing. The
        
input fmap activations are broadcast to all MAC units and the partial
sums are accumulated across the MAC units. In order to accumulate the
partial sums correctly, additional delay storage elements are required,
which are counted into the required size of local storage. Other weight
stationary examples are found in [82–84, 86, 87].

<<FORMULA>>

Fig. 25. Dataflows for DNNs [80].

<<FIGURE>>

Fig. 26. Variations of output stationary [80].

2) Output stationary (OS): The output stationary dataflow is designed to
minimize the energy consumption of reading and writing the partial sums
(Fig. 25(b)). It keeps the accumulation of partial sums for the same
output activation value local in the RF. In order to keep the
accumulation of partial sums stationary in the RF, one common
implementation is to stream the input activations across the PE array
and broadcast the weight to all PEs in the array.

One example that implements the output stationary dataflow is
ShiDianNao [89], where each PE handles the processing for each output
activation value by fetching the corresponding input activations from
neighboring PEs. The PE array implements dedicated networks to pass data
horizontally and vertically. Each PE also has data delay registers to
keep data around for the required number of cycles. At the system level,
the global buffer streams the input activations and broadcasts the
weights into the PE array. The partial sums are accumulated inside each
PE and then get streamed out back to the global buffer. Other examples
of output stationary are found in [88, 90].

There are multiple possible variants of output stationary as shown in
Fig. 26, since the output activations that get processed at the same
time can come from different dimensions. For example, the variant OS_A
targets the processing of CONV layers, and therefore focuses on the
processing of output activations from the same channel at a time in
order to maximize data reuse opportunities. The variant OS_C targets the
processing of FC layers, and focuses on generating output activations
from all different channels, since each channel only has one output
activation. The variant OS_B is something in between OS_A and OS_C.
Examples of variants OS_A, OS_B, and OS_C are [89], [88], and [90],
respectively.

3) No local reuse (NLR): While small register files are efficient in
terms of energy (pJ/bit), they are inefficient in terms of area
(µm²/bit). In order to maximize the storage capacity and minimize the
off-chip memory bandwidth, no local storage is allocated to the PE and
instead all that area is allocated to the global buffer to increase its
capacity (Fig. 25(c)). The no local reuse dataflow differs from the
previous dataflows in that nothing stays stationary inside the PE array.
As a result, there will be increased traffic on the spatial array and to
the global buffer for all data types. Specifically, it has to multicast
the activations, single-cast the filter weights, and then spatially
accumulate the partial sums across the PE array.

In an example of the no local reuse dataflow from UCLA [91], the filter
weights and input activations are read from the global buffer, processed
by the MAC units with custom adder trees that can complete the
accumulation in a single cycle, and the resulting partial sums or output
activations are then put back to the global buffer. Another example is
DianNao [92], which also reads input activations and filter weights from
the buffer, and processes them through the MAC units with custom adder
trees. However, DianNao implements specialized registers to keep the
partial sums in the PE array, which helps to further reduce the energy
consumption of accessing partial sums. Another example of no local reuse
dataflow is found in [93].

4) Row stationary (RS): A row stationary dataflow is proposed in [80],
which aims to maximize the reuse and accumulation at the RF level for
all types of data (weights, pixels, partial sums) for the overall energy
efficiency. This differs from WS or OS dataflows, which optimize for
only weights and partial sums, respectively.

The row stationary dataflow assigns the processing of a 1-D row
convolution to each PE as shown in Fig. 27. It keeps the row of filter
weights stationary inside the RF of the PE and then streams the input
activations into the PE. The PE does the MACs for one sliding window at
a time, which uses just one memory space for the accumulation of partial
sums. Since there are overlaps of input activations between different
sliding windows, the input activations can then be kept in the RF and
get reused. By going through all the sliding windows in the row, it
completes the 1-D convolution and maximizes the data reuse and local
accumulation of data in this row.
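
The 1-D row convolution performed inside a single PE can be sketched in a few lines of Python. This is a simplified software model, not a description of the hardware in [80]: the function name `pe_row_conv`, the list-based register file, the stride of 1, and the fetch counter are all assumptions introduced for illustration.

```python
def pe_row_conv(filter_row, input_row):
    """Model one PE computing a 1-D row convolution (stride 1, no padding).

    Returns the output row and the number of input activations that had
    to be streamed into the PE.
    """
    S = len(filter_row)
    rf_weights = list(filter_row)   # filter row stays stationary in the RF
    rf_window = []                  # input activations currently held in the RF
    outputs = []
    activation_fetches = 0
    for x in input_row:
        rf_window.append(x)         # stream one new activation into the PE
        activation_fetches += 1
        if len(rf_window) > S:
            rf_window.pop(0)        # drop the activation no window needs anymore
        if len(rf_window) == S:
            psum = 0                # a single accumulator serves every window
            for w, a in zip(rf_weights, rf_window):
                psum += w * a
            outputs.append(psum)
    return outputs, activation_fetches
```

With a three-weight filter row and a five-element input row there are three sliding windows. Fetching every window separately would stream 3 × 3 = 9 activations into the PE, whereas keeping the overlapping activations in the RF needs only 5.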

                                        <<FIGURE>>

Fig. 27. 1-D convolutional reuse within PE for Row Stationary Dataflow [80].

With each PE processing a 1-D convolution, multiple PEs can be
aggregated to complete the 2-D convolution as shown in Fig. 28. For
example, to generate the first row of output activations with a filter
having three rows, three 1-D convolutions are required. Therefore, we
can use three PEs in a column, each running one of the three 1-D
convolutions. The partial sums are further accumulated vertically across
the three PEs to generate the first output row. To generate the second
row of output, we use another column of PEs, where three rows of input
activations are shifted down by one row, and use the same rows of
filters to perform the three 1-D convolutions. Additional columns of PEs
are added until all rows of the output are completed (i.e., the number
of PE columns equals the number of output rows).

<<FIGURE>>

Fig. 28. 2-D convolutional reuse within spatial array for Row Stationary
Dataflow [80].

This 2-D array of PEs enables other forms of reuse to reduce accesses to
the more expensive global buffer. For example, each filter row is reused
across multiple PEs horizontally. Each row of input activations is
reused across multiple PEs diagonally. And each row of partial sums is
further accumulated across the PEs vertically. Therefore, 2-D
convolutional data reuse and accumulation are maximized inside the 2-D
PE array.

To address the high-dimensional convolution of the CONV layer (i.e.,
multiple fmaps, filters, and channels), multiple rows can be mapped onto
the same PE as shown in Fig. 29. The 2-D convolution is mapped to a set
of PEs, and the additional dimensions are handled by interleaving or
concatenating the additional data. For filter reuse within the PE,
different rows of fmaps are concatenated and run through the same PE as
a 1-D convolution. For input fmap reuse within the PE, different filter
rows are interleaved and run through the same PE as a 1-D convolution.
Finally, to increase local partial sum accumulation within the PE,
filter rows and fmap rows from different channels are interleaved, and
run through the same PE as a 1-D convolution. The partial sums from
different channels then naturally get accumulated inside the PE.

<<FIGURE>>

Fig. 29. Multiple rows of different input feature maps, filters and
channels are mapped to same PE within array for additional reuse in the
Row Stationary Dataflow [80].

The number of filters, channels, and fmaps that can be processed at the
same time is programmable, and there exists an optimal mapping for the
best energy efficiency, which depends on the shape configuration of the
DNN as well as the hardware resources provided, e.g., the number of PEs
and the size of the memory in the hierarchy. Since all of the variables
are known before runtime, it is possible to build a compiler (i.e.,
mapper) to perform this optimization off-line to configure the hardware
for different mappings of the RS dataflow for different DNNs as shown in
Fig. 30.

Fig. 30. Mapping optimization takes in hardware and DNNs shape
constraints to determine optimal energy dataflow [80].

One example that implements the row stationary dataflow is Eyeriss [94].
It consists of a 14x12 PE array, a 108KB global buffer, ReLU and fmap
compression units as shown in Fig. 31. The chip communicates with the
off-chip DRAM using a 64-bit bidirectional data bus to fetch data into
the global buffer. The global buffer then streams the data into the PE
array for processing.

In order to support the RS dataflow, two problems need to be solved in
the hardware design. First, how can the fixed-size PE array accommodate
different layer shapes? Second, although the data will be passed in a
very specific pattern, it still changes with different shape
configurations. How can the fixed design


pass data in different patterns?

<<FIGURE>>

Fig. 32. Mapping uses replication and folding to maximize utilization of
PE array [94].

Two mapping strategies can be used to solve the first problem as shown
in Fig. 32. First, replication can be used to map shapes that do not use
up the entire PE array. For example, in the third to fifth layers of
AlexNet, each 2-D convolution only uses a 13x3 PE array. This structure
is then replicated four times, and runs different channels and filters
in each replication. The second strategy is called folding. For example,
in the second layer of AlexNet, it requires a 27x5 PE array to complete
the 2-D convolution. In order to fit it into the 14x12 physical PE
array, it is folded into two parts, 14x5 and 13x5, and each is
vertically mapped into the physical PE array. Since not all PEs are used
by the mapping, the unused PEs can be clock gated to save energy
consumption.

A custom multicast network is used to solve the second problem about
flexible data delivery. The simplest way to pass data to multiple
destinations is to broadcast the data to all PEs and let each PE decide
if it has to process the data or not. However, it is not very energy
efficient, especially when the size of the PE array is large. Instead, a
multicast network is used to send data to only the places where it is
needed.

5) Energy comparison of different dataflows: To evaluate and compare
different dataflows, the same total hardware area and number of PEs
(256) are used in the simulation of a spatial architecture for all
dataflows. The local memory (register file) at each processing element
(PE) is on the order of 0.5 – 1.0kB and a shared memory (global buffer)
is on the order of 100 – 500kB. The sizes of these memories are selected
to be comparable to a typical accelerator for multimedia processing,
such as video coding [95]. The memory sizes are further adjusted for the
needs of each dataflow under the same area constraint. For example,
since the no local reuse dataflow does not require any RF in PE, it is
allocated a much larger global buffer. The simulation uses the layer
configurations from AlexNet with a batch size of 16. The simulation also
takes into account the fact that accessing different levels of the
memory hierarchy requires different energy cost.

Fig. 33(b) shows the energy consumption, broken down by data type, of
each dataflow for the CONV layers of AlexNet with a batch size of 16.
The WS and OS dataflows have the lowest energy consumption for accessing
weights and partial sums, respectively. However, the RS dataflow has the
lowest total energy consumption since it optimizes for the overall
energy efficiency instead of only for a certain data type.

Fig. 33(a) shows the same results with breakdown in terms of memory
hierarchy. The RS dataflow consumes the most energy in the RF, since by
design most of the accesses have been moved to the lowest level of the
memory hierarchy. This helps to achieve the lowest total energy
consumption since RF has the lowest energy per access. The NLR dataflow
has the lowest energy consumption at the DRAM level, since it has a much
larger global buffer and thus higher on-chip storage capacity compared
to others. However, most of the data accesses in the NLR dataflow are
from the global buffer, which still has a relatively large energy
consumption per access compared to accessing data from RF or inside the
PE array. As a result, the overall energy consumption of the NLR
dataflow is still fairly high. Overall, RS dataflow uses 1.4× to 2.5×
lower energy than other dataflows.

Fig. 34 shows the energy efficiency between different dataflows in the
FC layers of AlexNet with a batch size of 16. Since there is not as much
data reuse in the FC layers as in the CONV layers, all dataflows spend a
significant amount of energy on reading weights. However, RS dataflow
still has the lowest energy consumption because it optimizes for the
energy of accessing input activations and partial sums. For the OS
dataflows, OS_C now consumes lower energy than OS_A since it is designed
for the FC layers. Overall, RS still consumes 1.3× lower energy compared
to other dataflows at the batch size of 16.

Fig. 35 shows the RS dataflow design with energy breakdown in terms of
different layers of AlexNet. In the CONV layers, the energy is mostly
consumed by the RF, while in the FC layers, the energy is mostly
consumed by DRAM. However, most of the energy is consumed by the CONV
layers, which take around 80% of the energy. As recent DNN models go
deeper with more CONV layers, the ratio between number of CONV and FC
layers only gets larger. Therefore, moving forward, significant effort
should be placed on energy optimizations for CONV layers.

Finally, up until now, we have been looking at architectures with
relatively limited storage on the order of a few hundred kilobytes. With
much larger storage on the order of a few megabytes, additional
dataflows can be considered. For example, Fused-Layer looks at dataflow
optimizations across layers [96].
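
The replication and folding strategies described earlier in this section reduce to simple arithmetic on the array shapes. The sketch below reproduces the two AlexNet examples from the text under a simplified model: the function name `map_to_pe_array`, the returned dictionary layout, and the rules that folding is only attempted along the first dimension and that folded parts sit side by side are assumptions for illustration, not the mapper used by Eyeriss [94].

```python
from math import ceil

def map_to_pe_array(logical, physical=(14, 12)):
    """Map a logical PE set onto a physical array by replication or folding.

    Shapes are (a, b) pairs written in the same orientation as the
    text's 14x12 physical array.
    """
    la, lb = logical
    pa, pb = physical
    total = pa * pb
    if la <= pa and lb <= pb:
        # Replication: tile as many copies as fit in both dimensions.
        copies = (pa // la) * (pb // lb)
        return {"strategy": "replicate", "copies": copies,
                "parts": [(la, lb)] * copies,
                "idle_pes": total - copies * la * lb}
    if lb > pb:
        raise ValueError("second dimension does not fit; not handled in this sketch")
    # Folding: cut the first dimension into chunks of at most `pa`.
    n_parts = ceil(la / pa)
    if n_parts * lb > pb:
        raise ValueError("folded parts do not fit side by side")
    parts = [(min(pa, la - i * pa), lb) for i in range(n_parts)]
    return {"strategy": "fold", "copies": 1, "parts": parts,
            "idle_pes": total - la * lb}
```

Calling `map_to_pe_array((13, 3))` yields four replicas, and `map_to_pe_array((27, 5))` yields the two folded parts 14x5 and 13x5, matching the examples above. The `idle_pes` field counts the PEs that could be clock gated under this model.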

<<FORMULA>>

Fig. 35. Energy breakdown across layers of the AlexNet [80]. RF energy
dominates in convolutional layers. DRAM energy dominates in the fully
connected layer. Convolutional layers dominate energy consumption.

Fig. 33. Comparison of energy efficiency between different dataflows in
the CONV layers of AlexNet with a batch size of 16 [3]: (a) breakdown in
terms of storage levels and ALU, (b) breakdown in terms of data types.
OS_A, OS_B and OS_C are three variants of the OS dataflow that are
commonly seen in different implementations [80].

VI. NEAR-DATA PROCESSING

The previous section highlighted that data movement dominates energy
consumption. While spatial architectures distribute the on-chip memory
such that it is closer to the computation (e.g., into the PE), there
have also been efforts to bring the off-chip high density memory closer
to the computation or to integrate the computation into the memory
itself; the latter is often referred to as processing-in-memory or
logic-in-memory. In embedded systems, there have also been efforts to
bring the computation into the sensor where the data is first collected.
In this section, we will discuss how moving compute and data closer to
reduce data movement (i.e., near-data processing) can be achieved using
mixed-signal circuit design and advanced memory technologies.

Many of these works use analog processing, which has the drawback of
increased sensitivity to circuit and device non-idealities.
Consequently, the computation is often performed at reduced precision,
which can be accounted for during the training of the DNNs using the
techniques discussed in Section VII. Another factor to take into
consideration is that DNNs are often trained in the digital domain; thus
for analog processing, there is an additional overhead cost for
analog-to-digital conversion (ADC) and digital-to-analog conversion
(DAC).

A. DRAM

Advanced memory technology can reduce the access energy for high density
memories such as DRAMs. For instance, embedded DRAM (eDRAM) brings high
density memory on-chip to avoid the high energy cost of switching
off-chip capacitance [97]; eDRAM is 2.85× higher density than SRAM and
321× more energy efficient than DRAM (DDR3) [93]. eDRAM also offers
higher bandwidth and lower latency compared to DRAM. In DNN processing,
eDRAM can be used to store tens of megabytes of weights and activations
on-chip to avoid off-chip access, as demonstrated in DaDianNao [93]. The
downside of eDRAM is that it has a lower density compared to off-chip
DRAM and can increase the cost of the chip.

Rather than integrating DRAM into the chip itself, the DRAM can also be
stacked on top of the chip using through silicon vias (TSV). This
technology is often referred to as 3-D memory, and has been
commercialized in the form of Hybrid Memory Cube (HMC) [98] and High
Bandwidth Memory (HBM) [99]. 3-D memory delivers an order of magnitude
higher bandwidth and reduces access energy by up to 5× relative to
existing 2-D DRAMs, as TSVs have lower capacitance than typical off-chip
interconnects. Recent works have explored the use of HMC for efficient
DNN processing in a variety of ways. For instance, Neurocube [100]
integrates SIMD processors into the logic die of the HMC to bring the
memory and computation
voltage as the input, and the current as the output as shown in

<<FIGURE>>

Fig. 36. Analog computation by (a) SRAM bit-cell and (b) non-volatile
resistive memory.
        
        
closer together. Tetris [101] explores the use of HMC with the Eyeriss
spatial architecture and row stationary dataflow. It proposes allocating
more area to computation than on-chip memory (i.e., larger PE array and
smaller global buffer) in order to exploit the low energy and high
throughput properties of the HMC. It also adapts the dataflow to account
for the HMC memory and smaller on-chip memory. Tetris achieves a 1.5×
reduction in energy consumption and 4.1× increase in throughput over a
baseline system with conventional 2-D DRAM.

Processing with non-volatile resistive memories has several drawbacks as
described in [108]. First, it suffers from the reduced precision and
ADC/DAC overhead of analog processing described earlier. Second, the
array size is limited by the wires that connect the resistive devices;
specifically, wire energy dominates for large arrays (e.g., 1k×1k), and
the IR drop along the wire can degrade the read accuracy. Third, the
write energy to program the resistive devices can be costly, in some
cases requiring multiple pulses. Finally, the resistive devices can also
suffer from device-to-device and cycle-to-cycle variations with
non-linear conductance across the conductance range.

There have been several recent works that explore the use of memristors
for DNNs. ISAAC [104] replaces the eDRAM in DaDianNao with memristors.
To address the limited precision
                                                   support, ISAAC computes a 16-bit dot product operation with
        B. SRAM                                     8 memristors each storing 2-bits; a 1-bit2-bit multiplication
         Rather than bringing the memory near the compute, recent is performed at each memristor, where a 16-bit input requires
        work has also investigated bringing the compute into the 16 cycles to complete. In other words, the ISAAC architecture
        memory. For instance, the multiply and accumulate operation trades off area and time for increased precision. Finally, ISAAC
        can be directly integrated into the bit-cells of an SRAM arranges its 25.1M memristors in a hierarchical structure to
        array [102], as shown in Fig. 36(a). In this work, a 5-bit avoid issues with large arrays. PRIME [109] also replaces the
        DAC is used to drive the word line (WL) to an analog voltage DRAM main memory with memristors; speciﬁcally, it uses
        that represents the feature vector, while the bit-cells store the 256256 memristor arrays that can be conﬁgured for 4-bit
        binary weights1. The bit-cell current (I              multi-level cell computation or 1-bit single level cell storage. BC ) is effectively
        a product of the value of the feature vector and the value of It should be noted that results from ISAAC and PRIME are
        the weight stored in the bit-cell; the currents from the bit- obtained from simulations. The task of actually fabricating
        cells within a column add together to discharge the bitline large memristors arrays is still very much a research challenge;
        (V                                          for instance, [110] uses a fabricated 1212 memristor array BL ). This approach gives 12energy savings compared to
        reading the 1-bit weights from the SRAM and performing the to demonstrate a linear classiﬁer.
        computation separately. To counter circuit non-idealities, the
        DAC accounts for the non-linear bit-line discharge with respect D. Sensors
        to the WL voltage, and boosting is used to combine the weak   In certain applications, such as image processing, the dataclassiﬁers that are susceptible to device variations to form a movement from the sensor itself can account for a signiﬁcantstrong classiﬁer [103].                            portion of the system energy consumption. Thus there has
                                                   also been research on performing the computation as close
        C. Non-volatile Resistive Memories                   as possible to the sensor. In particular, much of the work
                                                   focuses on moving the computation into the analog domain toThe multiply and accumulate operation can also be directly avoid using the ADC within the sensor, which accounts for aintegrated into advancednon-volatilehigh density memories signiﬁcant portion of the sensor power. However, as mentionedby using them as programmable resistive elements, commonly
        referred to asmemristors[105]. Speciﬁcally, a multiplication   8 The resistive devices can be inserted between the cross-point of two wires is performed with the resistor’s conductance as the weight, the and in certain cases can avoid the need for an access transistor.                                                                                            20


        earlier, lower precision is required for analog computation due
        to circuit non-idealities.
         In [111], the matrix multiplication is integrated into the
        ADC, where the most signiﬁcant bits of the multiplications
        are performed using switched capacitors in an 8-bit successive
        approximation format. This is extended in [112] to not only
        perform the multiplications, but also the accumulations in the
        analog domain. In this work, it is assumed that 3-bits and
        6-bits are sufﬁcient to represent the weights and activations,     
        respectively. This reduces the number of ADC conversions in
        the sensor by 21×. RedEye [113] takes this approach even
        further by performing the entire convolution layer (including
        convolution, max pooling and quantization) in the analog
        domain at the sensor. It should be noted that [111] and [112]
        report measured results from fabricated test chips, while results
        in [113] are from simulations.
         It is also feasible to embed the computation not just before the ADC, but into the sensor
        itself. For instance, in [114] an Angle Sensitive Pixels sensor is used to compute the
        gradient of the input, which along with compression, reduces the data movement from the
        sensor by 10×. In addition, since the first layer of the DNN often outputs a gradient-like
        feature map, it may be possible to skip the computations in the first layer, which further
        reduces energy consumption as discussed in [115, 116].

                   VII. CO-DESIGN OF DNN MODELS AND HARDWARE
         In earlier work, the DNN models were designed to maximize accuracy without much
        consideration of the implementation complexity. However, this can lead to designs that are
        challenging to implement and deploy. To address this, recent work has shown that DNN models
        and hardware can be co-designed to jointly maximize accuracy and throughput, while
        minimizing energy and cost, which increases the likelihood of adoption. In this section, we
        will highlight various efforts that have been made towards the co-design of DNN models and
        hardware. Note that unlike Section V, the techniques discussed in this section can affect
        the accuracy; thus, the goal is to not only substantially reduce energy consumption and
        increase throughput, but also to minimize any degradation in accuracy. The co-design
        approaches can be loosely grouped into the following categories:
          • Reduce precision of operations and operands. This includes going from floating point to
            fixed point, reducing the bitwidth, non-linear quantization and weight sharing.
          • Reduce number of operations and model size. This includes techniques such as
            compression, pruning and compact network architectures.

        A. Reduce Precision
         Quantization involves mapping data to a smaller set of quantization levels. The ultimate
        goal is to minimize the error between the reconstructed data from the quantization levels
        and the original data. The number of quantization levels reflects the precision and
        ultimately the number of bits required to represent the data (usually log2 of the number of
        levels); thus, reduced precision refers to reducing the number of levels, and thus the
        number of bits. The benefits of reduced precision include reduced storage cost and/or
        reduced computation requirements.
         There are several ways to map the data to quantization levels. The simplest method is a
        linear mapping with uniform distance between each quantization level (Fig. 37(a)). Another
        approach is to use a simple mapping function such as a log function (Fig. 37(b)) where the
        distance between the levels varies; this mapping can often be implemented with simple logic
        such as a shift. Alternatively, a more complex mapping function can be used where the
        quantization levels are determined or learned from the data (Fig. 37(c)), e.g., using
        k-means clustering; for this approach, the mapping is usually implemented with a look up
        table.

                        <<FIGURE>>

        Fig. 37. Various methods of quantization (Figures from [117, 118]).

         Finally, the quantization can be fixed (i.e., the same method of quantization is used for
        all data types and layers, filters, and channels in the network); or it can be variable
        (i.e., different methods of quantization can be used for weights and activations, and
        different layers, filters, and channels in the network).
         Reduced precision research initially focused on reducing the precision of the weights
        rather than the activations, since weights directly increase the storage capacity
        requirement, while the impact of activations on storage capacity depends on the network
        architecture and dataflow. However, more recent works have also started to look at the
        impact of quantization on activations. Most reduced precision research also focuses on
        reducing the precision for inference rather than training (with some exceptions
        [88,119,120]) due to the sensitivity of the gradients to quantization.
         The key techniques used in recent work to reduce precision are summarized in Table III;
        both linear and non-linear quantization applied to weights and activations are explored.
        The impact on accuracy is reported relative to a baseline precision of 32-bit floating
        point, which is the default precision used on platforms such as GPUs and CPUs.
         1) Linear quantization: The first step of reducing precision is usually to convert values
        and operations from floating point to fixed point. A 32-bit floating point number, as shown
        in Fig. 38(a), is represented by $(-1)^s \times m \times 2^{e-127}$, where s
        is the sign bit, e is the 8-bit exponent, and m is the 23-bit mantissa, and covers the
        range of $10^{-38}$ to $10^{38}$.

        Fig. 38. Various methods of number representations. Panel (b): 8-bit dynamic fixed point
        examples.

         An N-bit fixed point number is represented by $(-1)^s \times m \times 2^{-f}$, where s is
        the sign bit, m is the (N-1)-bit mantissa, and f determines the location of the decimal
        point and acts as a scale factor. For instance, for an 8-bit integer, when f = 0, the
        dynamic range is -128 to 127, whereas when f = 10, the dynamic range is -0.125 to
        0.124023438. Dynamic fixed point representation allows f to vary based on the desired
        dynamic range as shown in Fig. 38(b). This is useful for DNNs, since the dynamic range of
        the weights and activations can be quite different. In addition, the dynamic range can also
        vary across layers and layer types (e.g., convolutional vs. fully connected). Using dynamic
        fixed point, the bitwidth can be reduced to 8 bits for the weights and 10 bits for the
        activations without any fine-tuning of the weights [121]; with fine-tuning, both weights
        and activations can reach 8-bits [122].
         Using 8-bit fixed point has the following impact on energy and area [79]:
          • An 8-bit fixed point add consumes 3.3× less energy (3.8× less area) than a 32-bit fixed
            point add, and 30× less energy (116× less area) than a 32-bit floating point add. The
            energy and area of a fixed-point add scales approximately linearly with the number of
            bits.
          • An 8-bit fixed point multiply consumes 15.5× less energy (12.4× less area) than a
            32-bit fixed point multiply, and 18.5× less energy (27.5× less area) than a 32-bit
            floating point multiply. The energy and area of a fixed-point multiply scales
            approximately quadratically with the number of bits.
         Reducing the precision also reduces the energy and area cost for storage, which is
        important since memory access and data movement dominate energy consumption as described
        earlier. The energy and area of the memory scale approximately linearly with number of
        bits. It should be noted, however, that changing from floating point to fixed point,
        without reducing bit-width, does not reduce the energy or area cost of the memory.
         For completeness, it should be noted that the precision of the internal values of a
        fixed-point multiply and accumulate (MAC) operation are typically higher than the weights
        and activations. To guarantee no precision loss, weights and input activations with N-bit
        fixed-point precision would require an N-bit×N-bit multiplication which generates a 2N-bit
        output product; that output would need to be accumulated with (2N+M)-bit precision, where M
        is determined based on the largest filter size, $\log_2(C \times R \times S)$ (C, R and S
        from Fig. 9(b)), which is in the range of 10 to 16 bits for the popular DNNs described in
        Section III-B. After accumulation, the precision of the final output activation is
        typically reduced to N-bits [88,121], as shown in Fig. 39. The reduced output precision
        does not have a significant impact on accuracy if the distribution of the weights and
        activations are centered near zero such that the accumulation would not move only in one
        direction; this is particularly true when batch normalization is used.
         The reduced precision is not only explored in research, but has been used in recent
        commercial platforms for DNN processing. For instance, Google's Tensor Processing Unit
        (TPU), which was announced in May 2016, was designed for 8-bit integer arithmetic [123].
        Similarly, Nvidia's PASCAL GPU, which was announced in April 2016, also has 8-bit integer
        instructions for deep learning inference [124]. In general purpose platforms such as CPUs
        and GPUs, the main benefit of using 8-bit computation is an increase in throughput, as four
        8-bit operations rather than one 32-bit operation can be performed for a given clock cycle.
         While general purpose platforms usually support 8-bit, 16-bit and/or 32-bit operations, it
        has been shown that the minimum bit precision for DNNs can actually vary in a more fine
        grained manner. For instance, the weight and activation precision can vary between 4 and 9
        bits for AlexNet across different layers without significant impact on accuracy (i.e., a
        change of less than 1%) [125,126]. This fine-grained variation can be exploited for
        increased throughput or reduced energy consumption with specialized hardware. For instance,
        if bit-serial processing is used, where the number of clock cycles to complete an operation
        is proportional to the bitwidth, adapting to fine-grain variations in bit precision can
        result in a 2.24× speed up versus 16-bits [125]. Alternatively, a multiplier can be
        designed such that its critical path reduces based on the bit precision as fewer adders are
        needed to resolve the product; this can be combined with voltage scaling for a 2.56× energy
        savings versus 16-bits [126]. While these bit scaling results are reported relative to
        16-bit, it would be interesting to see their impact relative to the maximum precision
        required across layers (i.e., 9-bits for [125, 126]).
         The precision can be reduced even more aggressively to a single bit; this area of research
        is often referred to as binary nets. BinaryConnect (BC) [127] introduced the concept of
        binary weights (i.e., -1 and 1), where using a binary weight reduced the multiplication in
        the MAC to addition and subtraction only. This was later extended in Binarized Neural
        Networks (BNN) [128], which uses binary weights and activations.
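
         As a rough illustration of why binary weights remove the multiplier from the MAC, the
        sketch below accumulates a dot product using only additions and subtractions and compares
        it with an ordinary multiply-accumulate. It is a minimal software analogy under stated
        assumptions, not the training procedure of [127] or any hardware design; the function names
        `binary_weight_mac` and `reference_mac` and the sample values are hypothetical and chosen
        only for this example.

```python
def binary_weight_mac(activations, weight_signs):
    """Dot product with weights restricted to -1 or +1, using only add/subtract."""
    if len(activations) != len(weight_signs):
        raise ValueError("activations and weight_signs must have the same length")
    acc = 0
    for a, s in zip(activations, weight_signs):
        if s == 1:
            acc += a      # weight +1: add the activation
        elif s == -1:
            acc -= a      # weight -1: subtract the activation
        else:
            raise ValueError("binary weights must be -1 or +1")
    return acc


def reference_mac(activations, weights):
    """Conventional multiply-accumulate, used here only as a cross-check."""
    return sum(a * w for a, w in zip(activations, weights))


if __name__ == "__main__":
    acts = [3, -2, 5, 7]          # illustrative integer activations
    signs = [1, -1, -1, 1]        # illustrative binary weights
    print(binary_weight_mac(acts, signs), reference_mac(acts, signs))
```

         Both functions return the same value for any ±1 weight vector, which is the sense in which
        the multiplication is reduced to addition and subtraction.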
        
                <<FIGURE>>

         Fig. 40. Weight sharing hardware.

        w, where w is the average of the absolute values of the weights in the filter)⁹, keeping
        the first and last layers at 32-bit floating point precision, and performing normalization
        before convolution to reduce the dynamic range of the activations. With these changes, BWN
        reduced the accuracy loss to 0.8%, while XNOR-Nets reduced the loss to 11%. The loss of
        XNOR-Net can be further reduced by increasing the precision of the activations to be
        slightly larger than one bit. For instance, Quantized Neural Networks (QNN) [119],
        DoReFa-Net [120], and HWGQ-Net [130] allow the activations to have 2-bits, while the
        weights remain at 1-bit; in HWGQ-Net, this reduces the accuracy loss to 5.2%.

        ⁹ This can also be thought of as a form of weight sharing, where only two weights are used
        per filter.

         All the previously described binary nets limit the weights to two values (-w and w);
        however, there may be benefits for allowing weights to be zero (i.e., -w, 0, w). Although
        this requires an additional bit per weight compared to binary weights, the sparsity of the
        weights can be exploited to reduce computation and storage cost, which can potentially
        cancel out the cost of the additional bit. This is explored in Ternary Weight Nets (TWN)
        [131] and then extended in Trained Ternary Quantization (TTQ) where a different scale is
        trained for each weight (i.e., -w1, 0, w2) for an accuracy loss of 0.6% [132], assuming
        32-bit floating point for the activations.
         Hardware implementations for binary/ternary nets have been explored in recent
        publications. YodaNN [133] uses binary weights, while BRein [134] uses binary weights and
        activations. Binary weights are also used in the compute in SRAM work [102] described in
        Section VI. Finally, the nominally spike-inspired TrueNorth chip can implement a reduced
        precision neural network with binary activations and ternary weights using TrueNorth's
        quantized weight table [9]. These works tend not to support state-of-the-art DNN models
        (with the exception of YodaNN).
         2) Non-linear quantization: The previous works described involve linear quantization where
        the levels are uniformly spaced out. It has been shown that the distributions of the
        weights and activations are not uniform [118,135], and thus a non-linear quantization can
        potentially improve accuracy. Specifically, there have been two popular approaches taken in
        recent works: (1) log domain quantization; (2) learned quantization or weight sharing.
         Log domain quantization: If the quantization levels are assigned based on a logarithmic
        distribution as shown in Fig. 37(b), the weights and activations are more equally
        distributed across the different levels and each level is used more efficiently resulting
        in less quantization error. For instance, using 4 bits in linear quantization results in a
        27.8% loss in accuracy versus a 5% loss for log base-2 quantization for VGG-16 [117].
        Furthermore, when weights are quantized to powers of two, the multiplication can be
        replaced with a bit-shift [122,135].¹⁰ Incremental Network Quantization (INQ) can be used
        to further reduce the loss in accuracy by dividing the large and small weights into
        different groups, and then iteratively quantizing and re-training the weights [136].

        ¹⁰ Note however that multiplications do not account for a significant portion of the total
        energy.

         Weight Sharing forces several weights to share a single value. This reduces the number of
        unique weights in a filter or a layer. One example is to group the weights by using a
        hashing function and use one value for each group [137]. Alternatively, the weights can be
        grouped by the k-means algorithm [118]. Both the shared weights and the indexes indicating
        which weight to use at each position of the filter are stored. This leads to a two step
        process to fetch the weight: (1) read the weight index; (2) using the weight index, read
        the shared weights. This approach can reduce the cost of reading and storing the weights if
        the weight index (log2 of the number of unique weights) is less than the bitwidth of the
        weight itself.
         For instance, in Deep Compression [118], the number of unique weights per layer is reduced
        to 256 for convolutional layers and 16 for fully-connected layers in AlexNet, requiring
        8-bit and 4-bit weight indexes, respectively. Assuming there are U unique weights and the
        size of the filters in the layer is C×R×S×M from Fig. 9(b), there will be energy savings if
        reading from a CRSM × log2(U)-bit memory plus a U × 16-bit memory (as shown in Fig. 40)
        cost less than reading from a CRSM × 16-bit memory. Note that unlike the previous
        quantization methods, the weight sharing approach does not reduce the precision of the MAC
        computation itself and only reduces the weight storage requirement.

        B. Reduce Number of Operations and Model Size
         In addition to reducing the size of each operation or operand (weight/activation), there
        is also a significant amount of research on methods to reduce the number of operations and
        model size. These techniques can be loosely classified as exploiting activation statistics,
        network pruning, network architecture design and knowledge distillation.
         1) Exploiting Activation Statistics: As discussed in Section III-A1, ReLU is a popular
        form of non-linearity used in DNNs that sets all negative values to zero as shown in
        Fig. 41(a). As a result, the output activations of the feature maps after the ReLU are
        sparse; for instance, the feature maps in AlexNet have sparsity between 19% to 63% as shown
        in Fig. 41(b). This sparsity gives ReLU an implementation advantage over other
        non-linearities such as sigmoid, etc.
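
         To make the notion of activation sparsity concrete, the sketch below applies ReLU to a
        short list of values and reports the fraction of outputs that are exactly zero. It is only
        a toy illustration: the helper names `relu` and `sparsity` and the sample values are
        assumptions made for this example and are not taken from AlexNet or from any of the cited
        works.

```python
def relu(values):
    """Set every negative value to zero and keep non-negative values unchanged."""
    return [max(0.0, v) for v in values]


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    if not values:
        raise ValueError("sparsity is undefined for an empty list")
    return sum(1 for v in values if v == 0) / len(values)


if __name__ == "__main__":
    # Hypothetical pre-activation values for one small feature map.
    pre_activation = [-1.5, 0.7, -0.2, 2.0, -3.1, 0.0, 1.2, -0.4]
    post_activation = relu(pre_activation)
    print("sparsity before ReLU:", sparsity(pre_activation))
    print("sparsity after ReLU: ", sparsity(post_activation))
```

         In this toy input one of eight values is zero before ReLU and five of eight are zero
        afterwards; zero-valued entries of this kind are what compression and zero-skipping
        hardware can exploit.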

                                                     <<FORMULA>>      

                                                TABLE III
        METHODS TO REDUCE NUMERICAL PRECISION FOR ALEXNET. ACCURACY MEASURED FOR TOP-5 ERROR ON IMAGENET.

                    <<FORMULA>>

        Fig. 41. Sparsity in activations due to ReLU.

         The sparsity can be exploited for energy and area savings using compression, particularly
        for off-chip DRAM access which is expensive. For instance, a simple run length coding that
        involves signaling non-zero values of 16-bits and then runs of zeros up to 31 can reduce
        the external memory bandwidth of the activations by 2.1× and the overall external bandwidth
        (including weights) by 1.5× [61].¹¹ In addition to compression, the hardware can also be
        modified such that it skips reading the weights and performing the MAC for zero-valued
        activations to reduce energy cost by 45% [94]. Rather than just gating the read and MAC
        computation, the hardware could also skip the cycle to increase the throughput by 1.37×
        [138].

        ¹¹ This simple run length compression is within 5-10% of the theoretical entropy limit.

         The activations can be made to be even more sparse by pruning the low-valued activations.
        For instance, if all activations with small values are pruned, this can be translated into
        an additional 11% speed up [138] or 2× power reduction [139] with little impact on
        accuracy. Aggressively pruning more activations can provide additional throughput
        improvement at a cost of reduced accuracy.
         2) Network Pruning: To make network training easier, the networks are usually
        over-parameterized. Therefore, a large amount of the weights in a network are redundant and
        can be removed (i.e., set to zero). This process is called network pruning. Aggressive
        network pruning often requires some fine-tuning of the weights to maintain the original
        accuracy. This was first proposed in 1989 through a technique called Optimal Brain Damage
        [140]. The idea was to compute the impact of each weight on the training loss (discussed in
        Section II-C), referred to as the weight saliency. The low-saliency weights were removed
        and the remaining weights were fine-tuned; this process was repeated until the desired
        weight reduction and accuracy were reached.
         In 2015, a similar idea was applied to modern DNNs in [141]. Rather than using the
        saliency as a metric, which is too difficult to compute for the large-scaled DNNs, the
        pruning was simply based on the magnitude of the weights. Small weights were pruned and the
        model was fine-tuned to restore the accuracy. Without fine-tuning the weights, about 50% of
        the weights could be pruned. With fine-tuning, over 80% of the weights were pruned. Overall
        this approach can reduce the number of weights in AlexNet by 9× and the number of MACs by
        3×. Most of the weight reduction comes from the fully-connected layers (9.9× for
        fully-connected layers versus 2.7× for convolutional layers).
         However, the number of weights alone is not a good metric for energy. For instance, in
        AlexNet, the number of weights in the fully-connected layers is much larger than in the
        convolutional layers; however, the energy of the convolutional layers is much higher than
        the fully-connected layers as shown in Fig. 35 [80]. Rather than using the number of
        weights and MAC operations as proxies for energy, the pruning of the weights can be
        directly driven by energy itself [142]. An energy evaluation method can be used to estimate
        the DNN energy that accounts for the data movement from different levels of the memory
        hierarchy, the number of MACs, and the data sparsity as shown in Fig. 42; this energy
        estimation tool is available at [143]. The resulting energy values for popular DNN models
        are shown in Fig. 43(a). Energy-aware pruning


                                <<FIGURE>> 

        Fig. 42. Energy estimation methodology from [142], which estimates the energy based on
        data movement from different levels of the memory hierarchy, number of MACs, and data
        sparsity.

            <<FIGURE>>

        Fig. 43. Energy values estimated with methodology in [142].

        can then be used to prune weights based on energy to reduce the overall energy across all
        layers by 3.7× for AlexNet, which is 1.74× more efficient than magnitude-based approaches
        [141] as shown in Fig. 43(b). As mentioned previously, it is well known that AlexNet is
        over-parameterized. The energy-aware pruning can also be applied to GoogleNet, which is
        already a small DNN model, for a 1.6× energy reduction.
         Recent works have examined how to efficiently support processing of sparse weights in
        hardware. One area of interest is how to best store the sparse weights after pruning.
        Similar to compressing the sparse activations discussed in Section VII-B1, the sparse
        weights can be compressed to reduce memory access bandwidth by 20 to 30% [118].
         When DNN processing is performed as a matrix-vector

        a time [144]. The CSC format will provide an overall lower memory bandwidth than CSR if the
        output is smaller than the input, or in the case of DNN, if the number of filters is not
        significantly larger than the number of weights in the filter (C×R×S from Fig. 9(b)). Since
        this is often true, CSC can be an effective format for sparse DNN processing.
         Custom hardware has been explored to efficiently support pruned DNN models. Many works aim
        to perform the processing without decompressing the weights or activations. EIE [145]
        performs the sparse matrix-vector multiplication specifically for fully connected layers.
        It stores the weights in a CSC format along with the start location of each column, which
        needs to be stored since the compressed weights have variable length. When the input is not
        zero, the compressed weight column is read and the output is updated. To handle the
        sparsity, additional logic is used to keep track of the location of the output that should
        be updated. SCNN [146] supports processing of convolutional


        layers in a compressed format. It uses an input stationary weights [154]. It proposes a ﬁre module that ﬁrst ‘squeezes’
        dataﬂow to deliver the compressed weights and activations to the network with 1x1 convolution ﬁlters and then expands
        a multiplier array followed by a scatter network to add the it with multiple 1x1 and 3x3 convolution ﬁlters. It achieves
        scattered partial sums.                            an overall 50% reduction in number of weights compared to
         Recent works have also explored the use of structured AlexNet, while maintaining the same accuracy. It should be
        pruning to avoid the need for custom hardware [147,148]. noted, however, that reducing the number of weights does not
        Rather than pruning individual weights (also referred to as ﬁne- necessarily reduce energy; for instance, SqueezeNet consumes
        grained pruning), structured pruning involves pruning groups more energy than AlexNet, as shown in Fig. 43(a).
        of weights (also referred to as coarse-grained pruning). The     b) After Training:Tensor decomposition can be used to
        beneﬁts of structured pruning are (1) the resulting weights can decompose ﬁlters in a trained network without impacting the
        better align with the data-parallel architecture (e.g., SIMD) accuracy. It treats weights in a layer as a 4-D tensor and breaks
        found in existing general purpose hardware, which results in it into a combination of smaller tensors (i.e., several layers).
        more efﬁcient processing [149]; (2) it amortizes the overhead Low-rank approximation can then be applied to further increase
        cost required to signal the location of the non-zero weights the compression rate at the cost of accuracy degradation, which
        across a group of weights, which improves compression and can be restored by ﬁne-tuning the weights.
        thus reduces storage cost. These groups of weights can include   This approach is demonstrated using Canonical Polyadic (CP)
        a pair of neighboring weights, an entire row or column of a decomposition, a high-order extension of singular value decom-
        ﬁlter, an entire channel of a ﬁlter or the entire ﬁlter itself; using position that can be solved by various methods, such as a greedy
        larger groups tends to result in higher loss in accuracy [150]. algorithm [155] or a non-linear least-square method [156].
         3) Compact Network Architectures:The number of weights Combining CP-decomposition with low-rank approximation
        and operations can also be reduced by improving the network achieves a 4.5% speed-up on CPUs [156]. However, CP-
        architecture itself. The trend is to replace a large ﬁlter with a decomposition cannot be computed in a numerically stable
        series of smaller ﬁlters, which have fewer weights in total; when way when the dimension of the tensor, which represents the
        the ﬁlters are applied sequentially, they achieve the same overall weights, is larger than two [156]. To alleviate this problem,
        effective receptive ﬁeld (i.e., the region the ﬁlter uses from input Tucker decomposition is adopted instead in [157].
        image to compute an output). This approach can be applied   4) Knowledge Distillation:Using a deep network or av-
        during the network architecture design (before training) or by eraging the predictions of different models (i.e., ensemble)
        decomposing the ﬁlters of a trained network (after training). gives a better accuracy than using a single shallower network.
        The latter one avoids the hassle of training networks from However, the computational complexity is also higher. To get
        scratch. However, it is less ﬂexible than the former one. For the best of both worlds, knowledge distillation transfers the
        example, existing methods can only decompose a ﬁlter in a knowledge learned by the complex model (teacher) to the
        trained network into a series of ﬁlters without non-linearity simpler model (student). The student network can therefore
        between them.                                 achieve an accuracy that would be unachievable if it was
           a) Before Training:In recent DNN models, ﬁlters with directly trained with the same dataset [158,159]. For example,
        a smaller width and height are used more frequently because [160] shows how using knowledge distillation can improve the
        concatenating several of them can emulate a larger ﬁlter as speech recognition accuracy of a student net by 2%, which is
        shown in Fig. 13. For example, one 5x5 convolution can be similar to the accuracy of a teacher net that is composed of
        replaced with two 3x3 convolutions. Alternatively, one NxN an ensemble of 10 networks.
        convolution can be decomposed into two 1-D convolutions, one   Fig. 45 shows the simplest knowledge distillation
        1xN and one Nx1 convolution [53]; this basically imposes method [158]. The softmax layer is commonly used as the
        a restriction that the 2-D ﬁlter must be separable, which is output layer in the image classiﬁcation networks to generate
        a common constraint in image processing [151]. Similarly, a the class probabilities from the class scores 12 ; it squashes the
        3-D convolution can be replaced by a set of 2-D convolutions class scores into values between 0 and 1 that sum up to 1.
        (i.e., applied only on one of the input channels) followed by For this knowledge distillation method, soft targets (values
        1x1 3-D convolutions as demonstrated in Xception [152] and between 0 and 1) such as the class scores of the teacher DNN
        MobileNets [153]. The order of the 2-D convolutions and 1x1 (or an ensemble of teacher DNNs) are used instead of the
        3-D convolutions can be switched.                    hard targets (values of either 0 or 1) such as the labels in the
         1x1 convolutional layers can also be used to reduce the dataset. The objective is to minimize the squared difference
        number of channels in the output feature map for a given between the soft targets and the class scores of the student DNN.
        layer, which reduces the number of ﬁlter channels and thus Class scores are used as the soft targets instead of the class
        computation cost for the ﬁlters in the next layer as demonstrated probabilities because small values in the class scores contain
        in [15,51,52]; this is often referred to as a ‘bottleneck’ as important information that may be eliminated by the softmax.
        discussed in SectionIII-B. For this purpose, the number of 1x1 Alternatively, class probabilities after the softmax layer can be
        ﬁlters has to be less than the number of channels in the 1x1 used as soft targets if the softmax is conﬁgured to generate
        ﬁlter. For example, 32 ﬁlters of 1x164 can transform an input softer class probabilities where the smaller values retain more
        with 64 channels to an output of 32 channels and reduce the information [160]. Finally, the intermediate representations of
        number of ﬁlter channels in the next layer to 32. SqueezeNet
        uses many 1x1 ﬁlters to aggressively reduce the number of   12 Also commonly referred to as logits.                                                                                   


the teacher DNN can also be incorporated as the extra hints to train the student DNN [161].

Fig. 45. Knowledge distillation matches the class scores of a small DNN to an ensemble of large
DNNs.

VIII. BENCHMARKING METRICS FOR DNN EVALUATION AND COMPARISON

As we have seen in this article, there has been a significant amount of research on efficient
processing of DNNs. We should consider several key metrics to compare the various strengths and
weaknesses of different designs and proposed techniques. These metrics should cover important
attributes such as accuracy/robustness, power/energy consumption, throughput/latency and cost.
Reporting all these metrics is important in order to provide a complete picture of the trade-offs
made by a proposed design or technique. We have prepared a website to collect these metrics from
various publications [162].

In terms of accuracy and robustness, it is important that the accuracy be reported on
widely-accepted datasets as discussed in Section IV. The difficulty of the dataset and/or task
should be considered when measuring the accuracy. For instance, the MNIST dataset for digit
recognition is significantly easier than the ImageNet dataset. As a result, a DNN that performs
well on MNIST may not necessarily perform well on ImageNet. Thus it is important that the same
dataset and task are used when comparing the accuracy of different DNN models; currently ImageNet
is preferred since it presents a challenge for DNNs, as opposed to MNIST, which can also be
addressed with simple non-DNN techniques. To demonstrate primarily hardware innovations, it would
be desirable to report results for widely-used DNN models (e.g., AlexNet, GoogLeNet) whose
accuracy and robustness have been well studied and tested.

Energy and power are important when processing DNNs at the edge in embedded devices with limited
battery capacity (e.g., smart phones, smart sensors, UAVs, and wearables), or in the cloud in data
centers with stringent power ceilings due to cooling costs, respectively. Edge processing is
preferred over the cloud for certain applications due to latency, privacy or communication
bandwidth limitations. When evaluating the power and energy consumption, it is important to
account for all aspects of the system including the chip and external memory accesses.

High throughput is necessary to deliver real-time performance for interactive applications such as
navigation and robotics. For data analytics, high throughput means that more data can be analyzed
in a given amount of time. As the amount of visual data is growing exponentially, high-throughput
big data analytics becomes important, particularly if an action needs to be taken based on the
analysis (e.g., security or terrorism prevention; medical diagnosis).

Low latency is necessary for real-time interactive applications. Latency measures the time between
when the pixel arrives at a system and when the result is generated. Latency is measured in terms
of seconds, while throughput is measured in operations/second. Often high throughput is obtained
by batching multiple images/frames together for processing; this results in multiple frame latency
(e.g., at 30 frames per second, a batch of 100 frames results in a delay of roughly 3 seconds).
This delay is not acceptable for real-time applications, such as high-speed navigation where it
would reduce the time available for course correction. Thus achieving low latency and high
throughput simultaneously can be a challenge.

Hardware cost is in large part dictated by the amount of on-chip storage and the number of cores.
Typical embedded processors have limited on-chip storage on the order of a few hundred kilobytes.
Since there is a trade-off between the amount of on-chip memory and the external memory bandwidth,
both metrics should be reported. Similarly, there is a correlation between the number of cores and
the throughput. In addition, while many cores can be built on a chip, the number of cores that can
actually be used at a given time should be reported. It is often unrealistic to assume peak
utilization and performance due to limitations of mapping and memory bandwidth. Accordingly, the
power and throughput should be reported for running actual DNNs as opposed to only reporting
theoretical limits.

A. Metrics for DNN Models

To evaluate the properties of a given DNN model, we should consider the following metrics:

• The accuracy of the model in terms of the top-5 error on datasets such as ImageNet. Also, the
  type of data augmentation used (e.g., multiple crops, ensemble models) should be reported.
• The network architecture of the model should be reported, including number of layers, filter
  sizes, number of filters and number of channels.
• The number of weights impacts the storage requirement of the model and should be reported. If
  possible, the number of non-zero weights should be reported since this reflects the theoretical
  minimum storage requirements.
• The number of MACs that need to be performed should be reported as it is somewhat indicative of
  the number of operations and potential throughput of the given DNN. If possible, the number of
  non-zero MACs should also be reported since this reflects the theoretical minimum compute
  requirements.

Table IV shows how these metrics are reported for various well known DNNs. The accuracy is
reported for the case where only a single crop for a single model is used for classification, such
that the number of weights and MACs in the table are
consistent (footnote 13). Accounting for non-zero (NZ) operations significantly reduces the number
of MACs and weights. Since the number of NZ MACs depends on the input data, we propose using the
publicly available 50,000 validation images from ImageNet for the computation. Finally, there are
various methods to reduce the weights in a DNN (e.g., network pruning in Section VII-B2). Table IV
shows another example of these DNN model metrics, by comparing sparse DNNs pruned using [142] to
dense DNNs.

<<TABLE>>
TABLE IV
METRICS FOR POPULAR DNN MODELS. SPARSITY IS ACCOUNTED FOR BY REPORTING NON-ZERO (NZ) WEIGHTS AND
MACS.

B. Metrics for DNN Hardware

To measure the efficiency of the DNN hardware, we should consider the following additional
metrics:

• The power and energy consumption of the design should be reported for various DNN models; the
  DNN model specifications should be provided including which layers and bit precision are
  supported by the hardware during measurement. In addition, the amount of off-chip accesses
  (e.g., DRAM accesses) should be included since it accounts for a significant portion of the
  system power; it can be reported in terms of the total amount of data that is read and written
  off-chip per inference.
• The latency and throughput should be reported in terms of the batch size and the actual run time
  for various DNN models, which accounts for mapping and memory bandwidth effects. This provides a
  more useful and informative metric than peak throughput.
• The cost of the chip depends on the area efficiency, which accounts for the size and type of
  memory (e.g., registers or SRAM) and the amount of control logic. It should be reported in terms
  of the core area in squared millimeters per multiplier along with process technology.

In terms of cost, different platforms will have different implementation-specific metrics. For
instance, for an FPGA, the specific device should be reported, along with the utilization of
resources such as DSP, BRAM, LUT and FF; performance density such as GOPs/slice can also be
reported.

Each processor should report various specifications for each metric as shown in Table V, using the
Eyeriss chip as an example. It is important that all metrics and specifications are accounted for
in order to fairly evaluate all the design trade-offs. For instance, without the accuracy given
for a specific dataset and task, one could run a simple DNN and easily claim low power, high
throughput, and low cost; however, the processor might not be usable for a meaningful task.
Alternatively, without reporting the off-chip bandwidth, one could build a processor with only
multipliers and easily claim low cost, high throughput, high accuracy, and low chip power;
however, when evaluating system power, the off-chip memory access would be substantial. Finally,
the test setup should also be reported, including whether the results are measured or obtained
from simulation.

In summary, the evaluation process for whether a DNN system is a viable solution for a given
application might go as follows: (1) the accuracy determines if it can perform the given task; (2)
the latency and throughput determine if it can run fast enough and in real-time; (3) the energy
and power consumption will primarily dictate the form factor of the device where the processing
can operate; (4) the cost, which is primarily dictated by the chip area, determines how much one
would pay for this solution.

IX. SUMMARY

The use of deep neural networks (DNNs) has seen explosive growth in the past few years. They are
currently widely used for many artificial intelligence (AI) applications including computer
vision, speech recognition and robotics and are often delivering better than human accuracy.
However, while DNNs can deliver this outstanding accuracy, it comes at the cost of high
computational complexity. Consequently, techniques that enable efficient processing of deep neural
networks to improve energy efficiency and throughput without sacrificing accuracy, using
cost-effective hardware, are critical to expanding the deployment of DNNs in both existing and new
domains.

Creating a system for efficient DNN processing should begin with understanding the current and
future applications and the specific computations required now, as well as the potential evolution
of those computations. This article surveys a number of the current applications, focusing on
computer vision applications, the associated algorithms, and the data being used to drive the
algorithms. These applications, algorithms and input data are experiencing rapid change.
Extrapolating these trends to determine the degree of flexibility desired to handle
next-generation computations therefore becomes an important ingredient of any design project.

13 Data augmentation is often used to increase accuracy. This includes using multiple crops of an
image to account for misalignment; in addition, an ensemble of multiple models can be used where
each model has different weights due to different training settings, such as using different
initializations or datasets, or even different network architectures. If multiple crops and models
are used, then the number of MACs and weights required would increase.
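The model metrics of Section VIII-A (number of weights, number of MACs) and the effect described in footnote 13 can be illustrated with a small counting sketch. It is not taken from the article: the `ConvLayer` class, the `totals` helper and all layer shapes are hypothetical, biases are ignored, every model in an ensemble is assumed to have the same shape, and the counts are dense (non-zero counts would need the actual weights and input data).

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ConvLayer:
    """Shape of one CONV layer; biases are ignored (assumption of this sketch)."""
    in_channels: int
    num_filters: int
    filter_h: int
    filter_w: int
    out_h: int  # output feature map height
    out_w: int  # output feature map width

    def weights(self) -> int:
        # Each filter spans all input channels.
        return self.num_filters * self.in_channels * self.filter_h * self.filter_w

    def macs(self) -> int:
        # One MAC per filter weight for every output position.
        return self.weights() * self.out_h * self.out_w


def totals(layers: Iterable[ConvLayer], crops: int = 1, models: int = 1) -> Tuple[int, int]:
    """Return (weights, MACs) for `models` same-shaped networks, each run on `crops` crops."""
    layers = list(layers)
    weights = models * sum(layer.weights() for layer in layers)
    macs = models * crops * sum(layer.macs() for layer in layers)
    return weights, macs


if __name__ == "__main__":
    one_5x5 = [ConvLayer(64, 64, 5, 5, 56, 56)]
    two_3x3 = [ConvLayer(64, 64, 3, 3, 56, 56)] * 2  # same 5x5 receptive field
    bottleneck = ConvLayer(64, 32, 1, 1, 56, 56)      # 32 filters of 1x1x64

    print("one 5x5:", totals(one_5x5))
    print("two 3x3:", totals(two_3x3))
    print("1x1 bottleneck weights:", bottleneck.weights())
    print("two 3x3, 10 crops, 2 models:", totals(two_3x3, crops=10, models=2))
```

Under these assumed shapes, the two 3x3 layers need 73,728 weights against 102,400 for the single 5x5 layer, which matches the compact-architecture argument of Section VII-B3. Passing `crops` and `models` greater than one shows why single-crop, single-model numbers are needed for a consistent comparison.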
                                               
                                               <<TABLE>>
                                                TABLE V
                                    EXAMPLE BENCHMARK METRICS FOR EYERISS [94].
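The hardware metrics of Section VIII-B can be derived from a handful of measured quantities. The sketch below is only an illustration: the `HardwareReport` record, its field names and the numbers in the example are hypothetical and are not Eyeriss measurements. It derives throughput from actual run time rather than a peak rate, and reports energy and off-chip traffic per inference and area per multiplier, as the bullets in Section VIII-B suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareReport:
    """Hypothetical record of measured hardware metrics for one DNN model."""
    dnn_model: str
    batch_size: int
    runtime_s: float       # measured wall-clock time to process one batch
    power_w: float         # average system power during that run
    offchip_bytes: int     # data read and written off-chip for the batch
    core_area_mm2: float
    num_multipliers: int

    def throughput(self) -> float:
        """Inferences per second, from actual run time rather than peak rate."""
        return self.batch_size / self.runtime_s

    def energy_per_inference_j(self) -> float:
        return self.power_w * self.runtime_s / self.batch_size

    def offchip_bytes_per_inference(self) -> float:
        return self.offchip_bytes / self.batch_size

    def area_per_multiplier_mm2(self) -> float:
        return self.core_area_mm2 / self.num_multipliers


def batching_delay_s(batch_size: int, frames_per_second: float) -> float:
    """Time spent waiting for a live stream to fill one batch."""
    return batch_size / frames_per_second


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    report = HardwareReport("example-net", batch_size=4, runtime_s=0.1, power_w=0.5,
                            offchip_bytes=8_000_000, core_area_mm2=12.0, num_multipliers=160)
    print("throughput (inf/s):", report.throughput())
    print("energy per inference (J):", report.energy_per_inference_j())
    print("off-chip bytes per inference:", report.offchip_bytes_per_inference())
    print("area per multiplier (mm^2):", report.area_per_multiplier_mm2())
    print("delay to fill 100 frames at 30 fps (s):", batching_delay_s(100, 30.0))
```

The last line reproduces the batching example from Section VIII: filling a batch of 100 frames from a 30 frames-per-second stream takes a little over 3 seconds before any result is available.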

During the design-space exploration process, it is critical to understand and balance the
important system metrics. For DNN computation these include the accuracy, energy, throughput and
hardware cost. Evaluating these metrics is, of course, key, so this article surveys the important
components of a DNN workload. Specifically, a DNN workload has two major components. First, the
workload is the form of each DNN network including the ‘shape’ of each layer and the
interconnections between layers. These can vary both within and between applications. Second, the
workload consists of the specific data input to the DNN. This data will vary with the input set
used for training or the data input during operation for inference.

This article also surveys a number of avenues that prior work has taken to optimize DNN
processing. Since data movement dominates energy consumption, a primary focus of some recent
research has been to reduce data movement while maintaining accuracy, throughput and cost. This
means selecting architectures with favorable memory hierarchies like a spatial array, and
developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. We
have included a taxonomy of dataflows and an analysis of their characteristics. Other work is
presented that aims to save space and energy by changing the representation of data values in the
DNN. Still other work saves energy and sometimes increases throughput by exploiting the sparsity
of weights and/or activations.

The DNN domain also affords an excellent opportunity for joint hardware/software co-design. For
example, various efforts have noted that efficiency can be improved by increasing sparsity
(increasing the number of zero values) or optimizing the representation of data by reducing the
precision of values or using more complex mappings of the stored value to the actual value used
for computation. However, to avoid losing accuracy it is often useful to modify the network or
fine-tune the network’s weights to accommodate these changes. Thus, this article both reviews a
variety of these techniques and discusses the frameworks that are available for describing,
running and training networks.

Finally, DNNs afford the opportunity to use mixed-signal circuit design and advanced technologies
to improve efficiency. These include using memristors for analog computation and 3-D stacked
memory. Advanced technologies can also facilitate moving computation closer to the source by
embedding computation near or within the sensor and the memories. Of course, all of these
techniques should also be considered in combination, while being careful to understand their
interactions and looking for opportunities for joint hardware/algorithm co-optimization.

In conclusion, although much work has been done, deep neural networks remain an important area of
research with many promising applications and opportunities for innovation at various levels of
hardware design.

ACKNOWLEDGMENTS

Funding provided by DARPA YFA, MIT CICS, and gifts from Nvidia and Intel. The authors thank the
anonymous reviewers as well as James Noraky, Mehul Tikekar and Zhengdong Zhang for providing
valuable feedback on this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444,
    May 2015.
[2] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He,
    J. Williams et al., “Recent advances in deep learning for speech research at Microsoft,” in
    ICASSP, 2013.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
    Convolutional Neural Networks,” in NIPS, 2012.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct
    perception in autonomous driving,” in ICCV, 2015.
[5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun,
    “Dermatologist-level classification of skin
    cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche,
    J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe,
    J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel,
    and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,”
    Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[7] F.-F. Li, A. Karpathy, and J. Johnson, “Stanford CS class CS231n: Convolutional Neural
    Networks for Visual Recognition,” http://cs231n.stanford.edu/.
[8] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan,
    B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated
    circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197,
    pp. 668–673, 2014.
[9] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos,
    D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch et al., “Convolutional networks for fast,
    energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences,
    2016.
[10] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,”
    in ICLR, 2014.
[11] Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson,
    R. E. Howard, and W. Hubbard, “Handwritten digit recognition: applications of neural network
    chips and automatic learning,” IEEE Commun. Mag., vol. 27, no. 11, pp. 41–46, Nov 1989.
[12] B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in 1960 IRE WESCON Convention
    Record, 1960.
[13] B. Widrow, “Thinking about thinking: the discovery of the LMS algorithm,” IEEE Signal
    Process. Mag., 2005.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
    A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition
    Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252,
    2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in
    CVPR, 2016.
[16] “Complete Visual Networking Index (VNI) Forecast,” Cisco, June 2016.
[17] J. Woodhouse, “Big, big, big data: higher and higher resolution video surveillance,”
    technology.ihs.com, January 2016.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate
    Object Detection and Semantic Segmentation,” in CVPR, 2014.
[19] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic
    Segmentation,” in CVPR, 2015.
[20] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in
    videos,” in NIPS, 2014.
[21] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke,
    P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech
    recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29,
    no. 6, pp. 82–97, 2012.
[25] J. Zhou and O. G. Troyanskaya, “Predicting effects of noncoding variants with deep
    learning-based sequence model,” Nature methods, vol. 12, no. 10, pp. 931–934, 2015.
[26] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence
    specificities of dna-and rna-binding proteins by deep learning,” Nature biotechnology,
    vol. 33, no. 8, pp. 831–838, 2015.
[27] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, “Convolutional neural network
    architectures for predicting dna–protein binding,” Bioinformatics, vol. 32, no. 12,
    pp. i121–i127, 2016.
[28] M. Jermyn, J. Desroches, J. Mercier, M.-A. Tremblay, K. St-Arnaud, M.-C. Guiot, K. Petrecca,
    and F. Leblond, “Neural networks improve brain cancer detection with raman spectroscopy in
    the presence of operating room light artifacts,” Journal of Biomedical Optics, vol. 21,
    no. 9, pp. 094002–094002, 2016.
[29] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for identifying
    metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.
[30] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal
    of artificial intelligence research, vol. 4, pp. 237–285, 1996.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
    M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” in NIPS Deep Learning
    Workshop, 2013.
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor
    policies,” Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016.
[33] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From Perception to
    Decision: A Data-driven Approach to End-to-end Motion Planning for Autonomous Ground Robots,”
    in ICRA, 2017.
[34] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and
    planning for visual navigation,” in CVPR, 2017.
[35] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep control policies for autonomous
    aerial vehicles with mpc-guided policy search,” in ICRA, 2016.
[36] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning
    for autonomous driving,” in NIPS Workshop on Learning, Inference and Control of Multi-Agent
    Systems, 2016.
[37] N. Hemsoth, “The Next Wave of Deep Learning Applications,” Next Platform, September 2016.
[38] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9,
    no. 8, pp. 1735–1780, 1997.
[39] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural
    networks for LVCSR,” in ICASSP, 2013.
[40] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in
    ICML, 2010.
[41] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network
    acoustic models,” in ICML, 2013.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level
    performance on imagenet classification,” in ICCV, 2015.
[22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu,
             and P. Kuksa, “Natural language processing (almost) from   [43]D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and
             scratch,”Journal of Machine Learning Research, vol. 12, no.      Accurate Deep Network Learning by Exponential Linear Units
             Aug, pp. 2493–2537, 2011.                             (ELUs),”ICLR, 2016.
         [23]A. van den Oord, S. Dieleman, H. Zen, K. Simonyan,  [44]X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving
             O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and      deep neural network acoustic models using generalized maxout
             K. Kavukcuoglu, “Wavenet: A generative model for raw audio,”      networks,” inICASSP, 2014.
             CoRR abs/1609.03499, 2016.                         [45]Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, , C. Laurent,
         [24]H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider,      Y. Bengio, and A. Courville, “Towards End-to-End Speech
             D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi,      Recognition with Deep Convolutional Neural Networks,” in
             T. R. Hugheset al., “The human splicing code reveals new      Interspeech, 2016.
             insights into the genetic determinants of disease,”Science, vol.  [46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
             347, no. 6218, p. 1254806, 2015.                         shick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional                                                                                                     
             architecture for fast feature embedding,” inACM International   [75]J. Cong and B. Xiao, “Minimizing computation in convolutional
             Conference on Multimedia, 2014.                         neural networks,” inICANN, 2014.
         [47]S. Ioffe and C. Szegedy, “Batch normalization: Accelerating   [76]A. Lavin and S. Gray, “Fast algorithms for convolutional neural
             deep network training by reducing internal covariate shift,” in      networks,” inCVPR, 2016.
             ICML, 2015.                                    [77]“Intel Math Kernel Library,” https://software.intel.com/en-us/
         [48]Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-      mkl.
             based learning applied to document recognition,”Proc. IEEE,  [78]S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran,
             vol. 86, no. 11, pp. 2278–2324, Nov 1998.                   B. Catanzaro, and E. Shelhamer, “cuDNN: Efﬁcient Primitives
         [49]P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and      for Deep Learning,”arXiv preprint arXiv:1410.0759, 2014.
             Y. LeCun, “OverFeat: Integrated Recognition, Localization and   [79]M. Horowitz, “Computing’s energy problem (and what we can
             Detection using Convolutional Networks,” inICLR, 2014.         do about it),” inISSCC, 2014.
         [50]K. Simonyan and A. Zisserman, “Very Deep Convolutional   [80]Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Archi-
             Networks for Large-Scale Image Recognition,” inICLR, 2015.      tecture for Energy-Efﬁcient Dataﬂow for Convolutional Neural
         [51]C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,      Networks,” inISCA, 2016.
             D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper   [81]——, “Using Dataﬂow to Optimize Energy Efﬁciency of Deep
             With Convolutions,” inCVPR, 2015.                       Neural Network Accelerators,”IEEE Micro’s Top Picks from the
         [52]M. Lin, Q. Chen, and S. Yan, “Network in Network,” inICLR,      Computer Architecture Conferences, vol. 37, no. 3, May-June
             2014.                                            2017.
         [53]C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,  [82]M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Dur-
             “Rethinking the inception architecture for computer vision,” in      danovic, E. Cosatto, and H. P. Graf, “A Massively Parallel
             CVPR, 2016.                                       Coprocessor for Convolutional Neural Networks,” inASAP,
         [54]C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-      2009.
             v4, Inception-ResNet and the Impact of Residual Connections   [83]V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, “Towards an
             on Learning,” inAAAI, 2017.                            embedded biologically-inspired machine vision processor,” in
         [55]G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang,      FPT, 2010.
             R. Caruana, A. Mohamed, M. Philipose, and M. Richardson,  [84]S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi,
             “Do Deep Convolutional Nets Really Need to be Deep and      “A Dynamically Conﬁgurable Coprocessor for Convolutional
             Convolutional?”ICLR, 2017.                            Neural Networks,” inISCA, 2010.
         [56]“Caffe LeNet MNIST,” http://caffe.berkeleyvision.org/gathered/  [85]V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello,
             examples/mnist.html.                                  “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks,”
         [57]“Caffe Model Zoo,” http://caffe.berkeleyvision.org/modelzoo.      inCVPR Workshop, 2014.
             html.                                         [86]S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, “A
         [58]“Matconvnet Pretrained Models,” http://www.vlfeat.org/      1.93TOPS/W scalable deep learning/inference processor with
             matconvnet/pretrained/.                                tetra-parallel MIMD architecture for big-data applications,” in
         [59]“TensorFlow-Slim image classiﬁcation library,” https://github.      ISSCC, 2015.
             com/tensorﬂow/models/tree/master/slim.                 [87]L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and
         [60]“Deep Learning Frameworks,” https://developer.nvidia.com/      L. Benini, “Origami: A Convolutional Network Accelerator,”
             deep-learning-frameworks.                              inGLVLSI, 2015.
         [61]Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An   [88]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan,
             Energy-Efﬁcient Reconﬁgurable Accelerator for Deep Convolu-      “Deep Learning with Limited Numerical Precision,” inICML,
             tional Neural Networks,”IEEE J. Solid-State Circuits, vol. 51,      2015.
             no. 1, 2017.                                    [89]Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo,
         [62]C. J. B. Yann LeCun, Corinna Cortes, “THE MNIST      X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting
             DATABASE of handwritten digits,” http://yann.lecun.com/exdb/      Vision Processing Closer to the Sensor,” inISCA, 2015.
             mnist/.                                        [90]M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal,
         [63]L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus,      “Memory-centric accelerator design for Convolutional Neural
             “Regularization of neural networks using dropconnect,” inICML,      Networks,” inICCD, 2013.
             2013.                                         [91]C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Opti-
         [64]A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,”      mizing FPGA-based Accelerator Design for Deep Convolutional
             https://www.cs.toronto.edu/  kriz/cifar.html.                   Neural Networks,” inFPGA, 2015.
         [65]A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny  [92]T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and
             images: A large data set for nonparametric object and scene      O. Temam, “DianNao: A Small-footprint High-throughput
             recognition,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 30,      Accelerator for Ubiquitous Machine-learning,” inASPLOS,
             no. 11, pp. 1958–1970, 2008.                            2014.
         [66]A. Krizhevsky and G. Hinton, “Convolutional deep belief   [93]Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li,
             networks on cifar-10,”Unpublished manuscript, vol. 40, 2010.      T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A
         [67]B. Graham, “Fractional max-pooling,” arXiv preprint      Machine-Learning Supercomputer,” inMICRO, 2014.
             arXiv:1412.6071, 2014.                            [94]Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An
         [68]“Pascal VOC data sets,” http://host.robots.ox.ac.uk/pascal/      Energy-Efﬁcient Reconﬁgurable Accelerator for Deep Convo-
             VOC/.                                            lutional Neural Networks,” inISSCC, 2016.
         [69]“Microsoft Common Objects in Context (COCO) dataset,” http:  [95]V. Sze, M. Budagavi, and G. J. Sullivan, “High Efﬁciency Video
             //mscoco.org/.                                       Coding (HEVC): Algorithms and Architectures,” inIntegrated
         [70]“Google Open Images,” https://github.com/openimages/dataset.      Circuit and Systems. Springer, 2014, pp. 1–375.
         [71]“YouTube-8M,” https://research.google.com/youtube8m/.      [96]M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer
         [72]“AudioSet,” https://research.google.com/audioset/index.html.       CNN accelerators,” inMICRO, 2016.
         [73]S. Condon, “Facebook unveils Big Basin, new server geared   [97]D. Keitel-Schulz and N. Wehn, “Embedded DRAM develop-
             for deep learning,” ZDNet, March 2017.                     ment: Technology, physical design, and application issues,”
         [74] C. Dubout and F. Fleuret, “Exact acceleration of linear object      IEEE Des. Test. Comput., vol. 18, no. 3, pp. 7–15, 2001.
             detectors,” inECCV, 2012.                          [98]J. Jeddeloh and B. Keeth, “Hybrid memory cube new DRAM                                                                                                     
             architecture increases density and performance,” inSymp. on      and modularized RTL compilation of Convolutional Neural
             VLSI, 2012.                                        Networks onto FPGA,” inFPL, 2016.
         [99]J. Standard, “High bandwidth memory (HBM) DRAM,” [122]P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented
             JESD235, 2013.                                     Approximation of Convolutional Neural Networks,” inICLR,
         [100]D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopad-      2016.
             hyay, “Neurocube: A programmable digital neuromorphic  [123]S. Higginbotham, “Google Takes Unconventional Route with
             architecture with high-density 3D memory,” inISCA, 2016.        Homegrown Machine Learning Chips,” Next Platform, May
         [101]M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis,      2016.
             “TETRIS: Scalable and Efﬁcient Neural Network Acceleration  [124]T. P. Morgan, “Nvidia Pushes Deep Learning Inference With
             with 3D Memory,” inASPLOS, 2017.                      New Pascal GPUs,” Next Platform, September 2016.
         [102]J. Zhang, Z. Wang, and N. Verma, “A machine-learning  [125]P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and
             classiﬁer implemented in a standard 6T SRAM array,” inSymp.      A. Moshovos, “Stripes: Bit-serial deep neural network comput-
             on VLSI, 2016.                                      ing,” inMICRO, 2016.
         [103]Z. Wang, R. Schapire, and N. Verma, “Error-adaptive classiﬁer  [126]B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-
             boosting (EACB): Exploiting data-driven training for highly      scalable processor for real-time large-scale ConvNets,” inSymp.
             fault-tolerant hardware,” inICASSP, 2014.                   on VLSI, 2016.
         [104]A. Shaﬁee, A. Nag, N. Muralimanohar, R. Balasubramonian,  [127]M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect:
             J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC:      Training deep neural networks with binary weights during
             A Convolutional Neural Network Accelerator with In-Situ      propagations,” inNIPS, 2015.
             Analog Arithmetic in Crossbars,” inISCA, 2016.          [128]M. Courbariaux and Y. Bengio, “Binarynet: Training deep
         [105]L. Chua, “Memristor-the missing circuit element,”IEEE Trans.      neural networks with weights and activations constrained to+
             Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.              1 or-1,”arXiv preprint arXiv:1602.02830, 2016.
         [106]L. Wilson, “International technology roadmap for semiconduc- [129]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-
             tors (ITRS),”Semiconductor Industry Association, 2013.          Net: ImageNet Classiﬁcation Using Binary Convolutional
         [107]Lu, Darsen, “Tutorial on Emerging Memory Devices,” 2016.       Neural Networks,” inECCV, 2016.
         [108]S. B. Eryilmaz, S. Joshi, E. Neftci, W. Wan, G. Cauwenberghs,  [130]Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with
             and H.-S. P. Wong, “Neuromorphic architectures with electronic      low precision by half-wave gaussian quantization,” inCVPR,
             synapses,” inISQED, 2016.                             2017.
         [109]P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu,  [131]F. Li and B. Liu, “Ternary weight networks,” inNIPS Workshop
             Y. Wang, and Y. Xie, “PRIME: A Novel Processing-In-Memory      on Efﬁcient Methods for Deep Neural Networks, 2016.
             Architecture for Neural Network Computation in ReRAM-based  [132]C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained Ternary
             Main Memory,” inISCA, 2016.                           Quantization,”ICLR, 2017.
         [110]M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. [133]R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An
             Likharev, and D. B. Strukov, “Training and operation of      Ultra-Low Power Convolutional Neural Network Accelerator
             an integrated neuromorphic network based on metal-oxide      Based on Binary Weights,” inISVLSI, 2016.
             memristors,”Nature, vol. 521, no. 7550, pp. 61–64, 2015.    [134]K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato,
         [111]J. Zhang, Z. Wang, and N. Verma, “A matrix-multiplying ADC      H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, and
             implementing a machine-learning classiﬁer directly with data      M. Kuroda, T.and Motomura, “BRein Memory: A 13-Layer
             conversion,” inISSCC, 2015.                            4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconﬁgurable
         [112]E. H. Lee and S. S. Wong, “A 2.5 GHz 7.7 TOPS/W switched-      In-Memory Deep Neural Network Accelerator in 65nm CMOS,”
             capacitor matrix multiplier with co-designed local memory in      inSymp. on VLSI, 2017.
             40nm,” inISSCC, 2016.                           [135]D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional
         [113]R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong,      Neural Networks using Logarithmic Data Representation,”
             “RedEye: analog ConvNet image sensor architecture for contin-      arXiv preprint arXiv:1603.01025, 2016.
             uous mobile vision,” inISCA, 2016.                   [136]A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental
         [114]A. Wang, S. Sivaramakrishnan, and A. Molnar, “A 180nm      Network Quantization: Towards Lossless CNNs with Low-
             CMOS image sensor with on-chip optoelectronic image com-      precision Weights,” inICLR, 2017.
             pression,” inCICC, 2012.                          [137]W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen,
         [115]H. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrish-      “Compressing Neural Networks with the Hashing Trick,” in
             nan, A. Veeraraghavan, and A. Molnar, “ASP Vision: Optically      ICML, 2015.
             Computing the First Layer of Convolutional Neural Networks  [138]J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger,
             using Angle Sensitive Pixels,” inCVPR, 2016.                and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep
         [116]A. Suleiman and V. Sze, “Energy-efﬁcient HOG-based object      neural network computing,” inISCA, 2016.
             detection at 1080HD 60 fps with multi-scale support,” inSiPS,  [139]B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K.
             2014.                                            Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks,´
         [117]E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong,      “Minerva: Enabling low-power, highly-accurate deep neural
             “Lognet: Energy-Efﬁcient Neural Networks Using Logrithmic      network accelerators,” inISCA, 2016.
             Computations,” inICASSP, 2017.                     [140]Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain
         [118]S. Han, H. Mao, and W. J. Dally, “Deep Compression:      Damage,” inNIPS, 1990.
             Compressing Deep Neural Networks with Pruning, Trained  [141]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights
             Quantization and Huffman Coding,” inICLR, 2016.             and connections for efﬁcient neural networks,” inNIPS, 2015.
         [119] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Ben- [142]T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efﬁcient
             gio, “Quantized neural networks: Training neural networks      Convolutional Neural Networks using Energy-Aware Pruning,”
             with low precision weights and activations,”arXiv preprint      inCVPR, 2017.
             arXiv:1609.07061, 2016.                           [143]“DNN Energy Estimation,” http://eyeriss.mit.edu/energy.html.
         [120]S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa- [144]R. Dorrance, F. Ren, and D. Markovic, “A scalable sparse´
             Net: Training low bitwidth convolutional neural networks with      matrix-vector multiplication kernel for energy-efﬁcient sparse-
             low bitwidth gradients,”arXiv preprint arXiv:1606.06160, 2016.      blas on FPGAs,” inISFPGA, 2014.
         [121]Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, “Scalable  [145]S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz,                                                                                                     
             and W. J. Dally, “EIE: efﬁcient inference engine on compressed
             deep neural network,” inISCA, 2016.
         [146]A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
             B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn:
             An accelerator for compressed-sparse convolutional neural
             networks,” inISCA, 2017.
         [147]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning
             structured sparsity in deep neural networks,” inNIPS, 2016.
         [148]S. Anwar, K. Hwang, and W. Sung, “Structured pruning of
             deep convolutional neural networks,”ACM Journal of Emerging
             Technologies in Computing Systems, vol. 13, no. 3, p. 32, 2017.
         [149]J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and
             S. Mahlke, “Scalpel: Customizing dnn pruning to the underlying
             hardware parallelism,” inISCA, 2017.
         [150]H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally,
             “Exploring the regularity of sparse structure in convolutional
             neural networks,” inCVPR Workshop on Tensor Methods In
             Computer Vision, 2017.
         [151]J. S. Lim, “Two-dimensional signal and image processing,”
             Englewood Cliffs, NJ, Prentice Hall, 1990, 710 p., 1990.
         [152]F. Chollet, “Xception: Deep Learning With Depthwise Separa-
             ble Convolutions,”CVPR, 2017.
         [153]A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
             T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efﬁcient
             convolutional neural networks for mobile vision applications,”
             arXiv preprint arXiv:1704.04861, 2017.
         [154]F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J.
             Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy
             with 50x fewer parameters and<1MB model size,”ICLR,
             2017.
         [155]E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus,
             “Exploiting Linear Structure Within Convolutional Networks
             for Efﬁcient Evaluation,” inNIPS, 2014.
         [156]V. Lebedev, Y. Ganin, M. Rakhuba1, I. Oseledets, and V. Lem-
             pitsky, “Speeding-Up Convolutional Neural Networks Using
             Fine-tuned CP-Decomposition,”ICLR, 2015.
         [157]Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin,
             “Compression of Deep Convolutional Neural Networks for Fast
             and Low Power Mobile Applications,” inICLR, 2016.
         [158]C. Bucilu, R. Caruana, and A. Niculescu-Mizil, “Model
             Compression,” inSIGKDD, 2006.
         [159]L. Ba and R. Caurana, “Do Deep Nets Really Need to be
             Deep?”NIPS, 2014.
         [160]G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge
             in a Neural Network,” inNIPS Deep Learning Workshop, 2014.
         [161]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
             Y. Bengio, “Fitnets: Hints for Thin Deep Nets,”ICLR, 2015.
         [162]“Benchmarking DNN Processors,” http://eyeriss.mit.edu/benchmarking.html.
<<END> <<END>> <<END>>


<|startoftext|>
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks 

Abstract 

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. 
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

1. Introduction

Scaling up ConvNets is widely used to achieve better accuracy. For example, ResNet (He et al., 2016) can be scaled up from ResNet-18 to ResNet-200 by using more layers; recently, GPipe (Huang et al., 2018) achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline model four times larger. However, the process of scaling up ConvNets has never been well understood and there are currently many ways to do it. The most common way is to scale up ConvNets by their depth (He et al., 2016) or width (Zagoruyko & Komodakis, 2016). Another less common, but increasingly popular, method is to scale up models by image resolution (Huang et al., 2018). In previous work, it is common to scale only one of the three dimensions: depth, width, and image size. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling requires tedious manual tuning and still often yields sub-optimal accuracy and efficiency.

<<FIGURE>>

Figure 1. Model Size vs. ImageNet Accuracy. All numbers are for single-crop, single-model. Our EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 accuracy while being 8.4x smaller and 6.1x faster than GPipe. EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152. Details are in Tables 2 and 4.
In this paper, we want to study and rethink the process of scaling up ConvNets. In particular, we investigate the central question: is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency? Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with a constant ratio. Based on this observation, we propose a simple yet effective compound scaling method. Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha^N$, width by $\beta^N$, and image size by $\gamma^N$, where $\alpha$, $\beta$, $\gamma$ are constant coefficients determined by a small grid search on the original small model. Figure 2 illustrates the difference between our scaling method and conventional methods.

<<FIGURE>>

Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling methods that increase only one dimension of network width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.
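The scaling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `alpha`, `beta`, `gamma`, and `n`, as well as the example coefficient values, are assumptions chosen only to show the arithmetic of raising three fixed per-dimension bases to a shared exponent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScalingCoefficients:
    """Per-dimension base multipliers (hypothetical; found by grid search in practice)."""
    alpha: float  # base of the depth multiplier
    beta: float   # base of the width multiplier
    gamma: float  # base of the image-size multiplier


def compound_scale(coeffs: ScalingCoefficients, n: float) -> Tuple[float, float, float]:
    """Return (depth, width, image-size) multipliers for compound coefficient n."""
    return coeffs.alpha ** n, coeffs.beta ** n, coeffs.gamma ** n


# Illustrative coefficients only: assumed values, not taken from this section.
coeffs = ScalingCoefficients(alpha=1.2, beta=1.1, gamma=1.15)
for n in range(4):
    d, w, r = compound_scale(coeffs, n)
    print(f"n={n}: depth x{d:.3f}, width x{w:.3f}, image size x{r:.3f}")
```

With `n = 0` all three multipliers are 1, so the baseline network is recovered; larger `n` grows every dimension together instead of one at a time.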
Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image. In fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018) and empirical results (Zagoruyko & Komodakis, 2016) both show that there exists a certain relationship between network width and depth, but to the best of our knowledge, we are the first to empirically quantify the relationship among all three dimensions of network width, depth, and resolution.
We demonstrate that our scaling method works well on existing MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Notably, the effectiveness of model scaling heavily depends on the baseline network; to go even further, we use neural architecture search (Zoph & Le, 2017; Tan et al., 2019) to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets. Figure 1 summarizes the ImageNet performance, where our EfficientNets significantly outperform other ConvNets. In particular, our EfficientNet-B7 surpasses the best existing GPipe accuracy (Huang et al., 2018), while using 8.4x fewer parameters and running 6.1x faster on inference. Compared to the widely used ResNet-50 (He et al., 2016), our EfficientNet-B4 improves the top-1 accuracy from 76.3% to 83.0% (+6.7%) with similar FLOPS. Besides ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on 5 out of 8 widely used datasets, while reducing parameters by up to 21x compared with existing ConvNets.

2. Related Work 

ConvNet Accuracy: Since AlexNet (Krizhevsky et al., 2012) won the 2012 ImageNet competition, ConvNets have become increasingly more accurate by going bigger: while the 2014 ImageNet winner GoogleNet (Szegedy et al., 2015) achieves 74.8% top-1 accuracy with about 6.8M parameters, the 2017 ImageNet winner SENet (Hu et al., 2018) achieves 82.7% top-1 accuracy with 145M parameters. Recently, GPipe (Huang et al., 2018) further pushed the state-of-the-art ImageNet top-1 validation accuracy to 84.3% using 557M parameters: it is so big that it can only be trained with a specialized pipeline parallelism library by partitioning the network and spreading each part to a different accelerator. While these models are mainly designed for ImageNet, recent studies have shown that better ImageNet models also perform better across a variety of transfer learning datasets (Kornblith et al., 2019), and other computer vision tasks such as object detection (He et al., 2016; Tan et al., 2019). Although higher accuracy is critical for many applications, we have already hit the hardware memory limit, and thus further accuracy gain needs better efficiency.
ConvNet Efficiency: Deep ConvNets are often over-parameterized. Model compression (Han et al., 2016; He et al., 2018; Yang et al., 2018) is a common way to reduce model size by trading accuracy for efficiency. As mobile phones become ubiquitous, it is also common to handcraft efficient mobile-size ConvNets, such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018), and ShuffleNets (Zhang et al., 2018; Ma et al., 2018). Recently, neural architecture search has become increasingly popular in designing efficient mobile-size ConvNets (Tan et al., 2019; Cai et al., 2019), and achieves even better efficiency than hand-crafted mobile ConvNets by extensively tuning the network width, depth, convolution kernel types and sizes. However, it is unclear how to apply these techniques to larger models that have a much larger design space and much more expensive tuning cost. In this paper, we aim to study model efficiency for super large ConvNets that surpass state-of-the-art accuracy. To achieve this goal, we resort to model scaling.

Model Scaling: There are many ways to scale a ConvNet for different resource constraints: ResNet (He et al., 2016) can be scaled down (e.g., ResNet-18) or up (e.g., ResNet-200) by adjusting network depth (#layers), while WideResNet (Zagoruyko & Komodakis, 2016) and MobileNets (Howard et al., 2017) can be scaled by network width (#channels). It is also well-recognized that bigger input image size will help accuracy with the overhead of more FLOPS. Although prior studies (Raghu et al., 2017; Lin & Jegelka, 2018; Sharir & Shashua, 2018; Lu et al., 2018) have shown that network depth and width are both important for ConvNets' expressive power, it still remains an open question of how to effectively scale a ConvNet to achieve better efficiency and accuracy. Our work systematically and empirically studies ConvNet scaling for all three dimensions of network width, depth, and resolution. 

3. Compound Model Scaling

In this section, we will formulate the scaling problem, study different approaches, and propose our new scaling method. 

3.1. Problem Formulation 
A ConvNet layer i can be defined as a function: <<FORMULA>>, where F_i is the operator, Y_i is the output tensor, and X_i is the input tensor, with tensor shape <<FORMULA>>¹, where H_i and W_i are the spatial dimensions and C_i is the channel dimension. A ConvNet N can be represented by a list of composed layers:

<<FORMULA>>

In practice, ConvNet layers are often partitioned into multiple stages, and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type except that the first layer performs down-sampling. Therefore, we can define a ConvNet as: 

<<FORMULA>>

<<FORMULA>> where <<FORMULA>> denotes that layer F_i is repeated L_i times in stage i, and <<FORMULA>> denotes the shape of input tensor X of layer i. Figure 2(a) illustrates a representative ConvNet, where the spatial dimension is gradually shrunk but the channel dimension is expanded over layers, for example, from initial input shape ⟨224, 224, 3⟩ to final output shape ⟨7, 7, 512⟩. 

¹ For the sake of simplicity, we omit batch dimension. 
Unlike regular ConvNet designs that mostly focus on finding the best layer architecture F_i, model scaling tries to expand the network length (L_i), width (C_i), and/or resolution (H_i, W_i) without changing F_i predefined in the baseline network. By fixing F_i, model scaling simplifies the design problem for new resource constraints, but there still remains a large design space to explore different <<FORMULA>> for each layer. In order to further reduce the design space, we restrict that all layers must be scaled uniformly with a constant ratio. Our target is to maximize the model accuracy for any given resource constraints, which can be formulated as an optimization problem: 

<<FORMULA>>         (2) 

where <<FORMULA>> are coefficients for scaling network width, depth, and resolution; <<FORMULA>> are predefined parameters in baseline network (see Table 1 as an example). 
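The uniform scaling described above can be illustrated with a minimal Python sketch. It assumes that the depth coefficient d multiplies each stage's layer count L_i, the width coefficient w multiplies its channel count C_i, and the resolution coefficient r multiplies its spatial size. The `Stage` class, the rounding rules, and the two-stage baseline are hypothetical and are not taken from Table 1.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    layers: int    # L_i: number of repeated layers in the stage
    channels: int  # C_i: channel dimension
    height: int    # H_i: spatial height of the stage input
    width: int     # W_i: spatial width of the stage input


def scale_stage(stage: Stage, d: float, w: float, r: float) -> Stage:
    """Scale one stage uniformly; the layer operator F_i itself is untouched."""
    return Stage(
        layers=math.ceil(d * stage.layers),   # assumption: round depth up
        channels=round(w * stage.channels),   # assumption: round to nearest
        height=round(r * stage.height),
        width=round(r * stage.width),
    )


# Hypothetical two-stage baseline, for illustration only.
baseline = [Stage(1, 16, 112, 112), Stage(2, 24, 112, 112)]
scaled = [scale_stage(s, d=2.0, w=1.5, r=1.25) for s in baseline]
for before, after in zip(baseline, scaled):
    print(before, "->", after)
```

Because every stage shares the same (d, w, r), the search space collapses from per-layer choices to three numbers, which is the simplification the formulation relies on.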
3.2. Scaling Dimensions 
The main difficulty of problem 2 is that the optimal d, w, r depend on each other and the values change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in one of these dimensions: 
Depth (d): Scaling network depth is the most common way used by many ConvNets (He et al., 2016; Huang et al., 2017; Szegedy et al., 2015; 2016). The intuition is that a deeper ConvNet can capture richer and more complex features, and generalize well on new tasks. However, deeper networks are also more difficult to train due to the vanishing gradient problem (Zagoruyko & Komodakis, 2016). Although several techniques, such as skip connections (He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015), alleviate the training problem, the accuracy gain of very deep networks diminishes: for example, ResNet-1000 has similar accuracy to ResNet-101 even though it has many more layers. Figure 3 (middle) shows our empirical study on scaling a baseline model with different depth coefficient d, further suggesting the diminishing accuracy return for very deep ConvNets. 
Width (w): Scaling network width is commonly used for small size models (Howard et al., 2017; Sandler et al., 2018; Tan et al., 2019)². As discussed in (Zagoruyko & Komodakis, 2016), wider networks tend to be able to capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have difficulties in capturing higher level features. Our empirical results in Figure 3 (left) show that the accuracy quickly saturates when networks become much wider with larger w. 

<<FIGURE>>

Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients. Bigger networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching 80%, demonstrating the limitation of single dimension scaling. Baseline network is described in Table 1. 
Resolution (r): With higher resolution input images, ConvNets can potentially capture more fine-grained patterns. Starting from 224x224 in early ConvNets, modern ConvNets tend to use 299x299 (Szegedy et al., 2016) or 331x331 (Zoph et al., 2018) for better accuracy. Recently, GPipe (Huang et al., 2018) achieves state-of-the-art ImageNet accuracy with 480x480 resolution. Higher resolutions, such as 600x600, are also widely used in object detection ConvNets (He et al., 2017; Lin et al., 2017). Figure 3 (right) shows the results of scaling network resolutions, where indeed higher resolutions improve accuracy, but the accuracy gain diminishes for very high resolutions (r=1.0 denotes resolution 224x224 and r=2.5 denotes resolution 560x560). 
The above analyses lead us to the first observation: 
Observation 1  Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models. 
3.3. Compound Scaling 
We empirically observe that different scaling dimensions are not independent. Intuitively, for higher resolution images, we should increase network depth, such that the larger receptive fields can help capture similar features that include more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, in order to capture more fine-grained patterns with more pixels in high resolution images. These intuitions suggest that we need to coordinate and balance different scaling dimensions rather than conventional single-dimension scaling. 

² In some literature, scaling number of channels is called "depth multiplier", which means the same as our width coefficient w. 

<<FIGURE>>

Figure 4. Scaling Network Width for Different Baseline Networks. Each dot in a line denotes a model with different width coefficient (w). All baseline networks are from Table 1. The first baseline network <<FORMULA>> has 18 convolutional layers with resolution 224x224, while the last baseline <<FORMULA>> has 36 layers with resolution 299x299. 
To validate our intuitions, we compare width scaling under different network depths and resolutions, as shown in Figure 4. If we only scale network width w without changing depth (d=1.0) and resolution (r=1.0), the accuracy saturates quickly. With deeper (d=2.0) and higher resolution (r=2.0) networks, width scaling achieves much better accuracy under the same FLOPS cost. These results lead us to the second observation: 
Observation 2 In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling. 

In fact, a few prior works (Zoph et al., 2018; Real et al., 2019) have already tried to arbitrarily balance network width and depth, but they all require tedious manual tuning. 
In this paper, we propose a new compound scaling method, which uses a compound coefficient φ to uniformly scale network width, depth, and resolution in a principled way: 

<<FORMULA>>      (3) 

where <<FORMULA>> are constants that can be determined by a small grid search. Intuitively, φ is a user-specified coefficient that controls how many more resources are available for model scaling, while <<FORMULA>> specify how to assign these extra resources to network width, depth, and resolution respectively. Notably, the FLOPS of a regular convolution op is proportional to <<FORMULA>>, i.e., doubling network depth will double FLOPS, but doubling network width or resolution will increase FLOPS by four times. Since convolution ops usually dominate the computation cost in ConvNets, scaling a ConvNet with equation 3 will approximately increase total FLOPS by <<FORMULA>>. In this paper, we constrain <<FORMULA>> such that for any new <<FORMULA>>, the total FLOPS will approximately³ increase by 2^φ.
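The FLOPS argument above can be checked numerically with a minimal sketch. It assumes Equation 3 takes the exponential form d = α^φ, w = β^φ, r = γ^φ and that convolution FLOPS scale as d · w² · r², which matches the doubling behavior described in the text. The function name and the sample constants (1.2, 1.1, 1.15) are illustrative assumptions, chosen only so that α · β² · γ² is close to 2.

```python
def flops_multiplier(alpha: float, beta: float, gamma: float, phi: float) -> float:
    """Approximate FLOPS growth under compound scaling.

    Assumes d = alpha**phi, w = beta**phi, r = gamma**phi and that the
    FLOPS of a regular convolution are proportional to d * w**2 * r**2.
    """
    d = alpha ** phi   # depth multiplier
    w = beta ** phi    # width multiplier
    r = gamma ** phi   # resolution multiplier
    return d * w ** 2 * r ** 2


# Doubling depth alone doubles FLOPS; doubling width or resolution quadruples it.
print(flops_multiplier(2.0, 1.0, 1.0, phi=1))
print(flops_multiplier(1.0, 2.0, 1.0, phi=1))
print(flops_multiplier(1.0, 1.0, 2.0, phi=1))

# Illustrative constants with alpha * beta**2 * gamma**2 close to 2, so each
# unit increase of phi roughly doubles the FLOPS budget.
for phi in range(4):
    print(phi, flops_multiplier(1.2, 1.1, 1.15, phi))
```

Under these assumptions the multiplier for coefficient φ equals the φ-th power of the multiplier for φ = 1, which is why fixing the per-unit cost near 2 yields a total growth of about 2^φ.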
4. EfficientNet Architecture 
Since model scaling does not change layer operators F_i in baseline network, having a good baseline network is also critical. We will evaluate our scaling method using existing ConvNets, but in order to better demonstrate the effectiveness of our scaling method, we have also developed a new mobile-size baseline, called EfficientNet. 
Inspired by (Tan et al., 2019), we develop our baseline network by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same search space as (Tan et al., 2019), and use <<FORMULA>> as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS and w=-0.07 is a hyperparameter for controlling the trade-off between accuracy and FLOPS. Unlike (Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not targeting any specific hardware device. Our search produces an efficient network, which we name EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is similar to MnasNet, except our EfficientNet-B0 is slightly bigger due to the larger FLOPS target (our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building block is mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we also add squeeze-and-excitation optimization (Hu et al., 2018). 

³ FLOPS may differ from theoretical value due to rounding. 

Table 1. EfficientNet-B0 baseline network <<FORMULA>> Each row describes a stage i with L_i layers, with input resolution <<FORMULA>> and output channels C_i. Notations are adopted from equation 2. 

<<FORMULA>>
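The search objective described above can be sketched as a small Python function. The exact expression is not reproduced in this text, so the form ACC(m) × (FLOPS(m) / T)^w is an assumption, based on the stated roles of ACC(m), FLOPS(m), T, and w=-0.07; the function name and the sample accuracy are hypothetical.

```python
def search_objective(acc: float, flops: float, target_flops: float,
                     w: float = -0.07) -> float:
    """Multi-objective reward trading accuracy against FLOPS.

    Assumed form: ACC(m) * (FLOPS(m) / T) ** w. With a negative w, models
    above the FLOPS target are penalized and models below it are rewarded.
    """
    return acc * (flops / target_flops) ** w


target = 400e6  # FLOPS target stated in the text (400M)
for flops in (200e6, 400e6, 800e6):
    # 0.76 is a hypothetical accuracy used only to show the trend.
    print(flops, search_objective(0.76, flops, target))
```

With w=-0.07 the penalty is soft: doubling FLOPS at fixed accuracy lowers the objective by only a few percent, so the search can still accept a somewhat larger model when it is clearly more accurate.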
Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up with two steps: 

STEP 1: we first <<FORMULA>> assuming twice more resources available, and do a small grid search of <<FORMULA>> based on Equations 2 and 3. In particular, we find the best values for EfficientNet-B0 are <<FORMULA>>, under constraint of <<FORMULA>>. 

STEP 2: we then <<FORMULA>> as constants and scale up baseline network with different φ using Equation 3, to obtain EfficientNet-B1 to B7 (Details in Table 2). 

Notably, it is possible to achieve even better performance by searching for <<FORMULA>> directly around a large model, but the search cost becomes prohibitively more expensive on larger models. Our method solves this issue by only doing the search once on the small baseline network (step 1), and then using the same scaling coefficients for all other models (step 2). 
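The candidate enumeration behind the small grid search in step 1 might look like the following sketch. It assumes the constraint is that the per-unit FLOPS cost α · β² · γ² stays close to 2 with all three constants at least 1; the grid spacing, the tolerance, and the function name are hypothetical, and the accuracy evaluation of each candidate (training the scaled baseline) is deliberately left out.

```python
import itertools


def feasible_coefficients(grid, target=2.0, tol=0.1):
    """Return (alpha, beta, gamma) triples whose FLOPS cost is near the target.

    Assumes cost = alpha * beta**2 * gamma**2, i.e. depth scales FLOPS
    linearly while width and resolution scale it quadratically.
    """
    candidates = []
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        cost = alpha * beta ** 2 * gamma ** 2
        if abs(cost - target) <= tol:
            candidates.append((alpha, beta, gamma))
    return candidates


# Hypothetical grid: 1.00, 1.05, ..., 1.40 for each coefficient.
grid = [round(1.0 + 0.05 * k, 2) for k in range(9)]
candidates = feasible_coefficients(grid)
print(len(candidates), "feasible triples, e.g.", candidates[:3])
```

Each surviving triple would then be used to scale the baseline once and be ranked by validation accuracy; because the search runs only on the small baseline, its cost stays modest compared with searching around a large model.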

5. Experiments 

In this section, we will first evaluate our scaling method on existing ConvNets and the new proposed EfficientNets. 

5.1. Scaling Up MobileNets and ResNets 
As a proof of concept, we first apply our scaling method to the widely-used MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Table 3 shows the ImageNet results of scaling them in different ways. Compared to other single-dimension scaling methods, our compound scaling method improves the accuracy on all these models, suggesting the effectiveness of our proposed scaling method for general existing ConvNets. 

Table 2. EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our baseline EfficientNet-B0 using different compound coefficient φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped together for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x parameter reduction and up to 16x FLOPS reduction) compared to existing ConvNets. 

<<TABLE>>

We omit ensemble and multi-crop models (Hu et al., 2018), or models pretrained on 3.5B Instagram images (Mahajan et al., 2018). 

Table 3. Scaling Up MobileNets and ResNet. 

<<TABLE>>

Table 5. EfficientNet Performance Results on Transfer Learning Datasets. Our scaled EfficientNet models achieve new state-of-the-art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average. 
(Column groups: comparison to best publicly available results; comparison to best reported results. Each group lists model and accuracy.) 

<<TABLE>>

Figure 6. Model Parameters vs. Transfer Learning Accuracy. 

<<FIGURE>>

… weight decay 1e-5; initial learning rate 0.256 that decays by 0.97 every 2.4 epochs. We also use swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), a fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with survival probability 0.8. As it is commonly known that bigger models need more regularization, we linearly increase the dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7. 
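The learning-rate decay and dropout schedule mentioned above can be written out as a short sketch. Two details are assumptions, not statements from the text: that the decay is applied as a staircase (one multiplication by 0.97 per completed 2.4-epoch interval), and that the dropout ratio grows linearly in the model index from B0 to B7. The function names are hypothetical.

```python
import math


def learning_rate(epoch: float, base_lr: float = 0.256, decay: float = 0.97,
                  decay_epochs: float = 2.4, staircase: bool = True) -> float:
    """Exponential decay: multiply the rate by `decay` every `decay_epochs`."""
    steps = epoch / decay_epochs
    if staircase:
        steps = math.floor(steps)  # assumption: decay in discrete steps
    return base_lr * decay ** steps


def dropout_rate(model_index: int, first: float = 0.2, last: float = 0.5,
                 num_models: int = 8) -> float:
    """Linear interpolation of the dropout ratio from B0 (index 0) to B7 (index 7)."""
    return first + (last - first) * model_index / (num_models - 1)


for epoch in (0, 2.4, 24, 240):
    print(f"epoch {epoch}: lr = {learning_rate(epoch):.6f}")
for i in range(8):
    print(f"EfficientNet-B{i}: dropout = {dropout_rate(i):.3f}")
```

Setting `staircase=False` gives the continuous variant of the same schedule, which differs only within each 2.4-epoch interval.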
Table 2 shows the performance of all EfficientNet models that are scaled from the same baseline EfficientNet-B0. Our EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, our EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4x smaller than the previous best GPipe (Huang et al., 2018). 
All models are pretrained on ImageNet and fine-tuned on new datasets. Figure 1 and Figure 5 illustrate the parameters-accuracy and FLOPS-accuracy curves for representative ConvNets, where our scaled EfficientNet models achieve better accuracy with much fewer parameters and FLOPS than other ConvNets. Notably, our EfficientNet models are not only small, but also computationally cheaper. For example, our EfficientNet-B3 achieves higher accuracy than ResNeXt-101 (Xie et al., 2017) using 18x fewer FLOPS. 
To validate the computational cost, we have also measured the inference latency for a few representative ConvNets on a real CPU as shown in Table 4, where we report average latency over 20 runs. Our EfficientNet-B1 runs 5.7x faster than the widely used ResNet-152 (He et al., 2016), while EfficientNet-B7 runs about 6.1x faster than GPipe (Huang et al., 2018), suggesting our EfficientNets are indeed fast on real hardware. 

Figure 7. Class Activation Map (CAM) (Zhou et al., 2016) for Models with Different Scaling Methods. Our compound scaling method allows the scaled model (last column) to focus on more relevant regions with more object details. Model details are in Table 7. 

<<FIGURE>>

Table 6. Transfer Learning Datasets. 
  
<<TABLE>>

5.3. Transfer Learning Results for EfficientNet 
We have also evaluated our EfficientNet on a list of commonly used transfer learning datasets, as shown in Table 6. We borrow the same training settings from (Kornblith et al., 2019) and (Huang et al., 2018), which take ImageNet pretrained checkpoints and fine tune on new datasets. 
Table 5 shows the transfer learning performance: (1) Compared to publicly available models, such as NASNet-A (Zoph et al., 2018) and Inception-v4 (Szegedy et al., 2017), our EfficientNet models achieve better accuracy with 4.7x average (up to 21x) parameter reduction. (2) Compared to state-of-the-art models, including DAT (Ngiam et al., 2018) that dynamically synthesizes training data and GPipe (Huang et al., 2018) that is trained with specialized pipeline parallelism, our EfficientNet models still surpass their accuracy in 5 out of 8 datasets, while using 9.6x fewer parameters. 
Figure 6 compares the accuracy-parameters curve for a variety of models. In general, our EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet (He et al., 2016), DenseNet (Huang et al., 2017), Inception (Szegedy et al., 2017), and NASNet (Zoph et al., 2018). 

6. Discussion 

Table 7. Scaled Models Used in Figure 7. 

Figure 8. Scaling Up EfficientNet-B0 with Different Methods. 

<<FIGURE>>

To disentangle the contribution of our proposed scaling method from the EfficientNet architecture, Figure 8 compares the ImageNet performance of different scaling methods for the same EfficientNet-B0 baseline network. In general, all scaling methods improve accuracy at the cost of more FLOPS, but our compound scaling method improves accuracy by up to 2.5% more than other single-dimension scaling methods, suggesting the importance of our proposed compound scaling. 
In order to further understand why our compound scaling method is better than others, Figure 7 compares the class activation map (Zhou et al., 2016) for a few representative models with different scaling methods. All these models are scaled from the same baseline, and their statistics are shown in Table 7. Images are randomly picked from the ImageNet validation set. As shown in the figure, the model with compound scaling tends to focus on more relevant regions with more object details, while other models either lack object details or are unable to capture all objects in the images. 

7. Conclusion 

In this paper, we systematically study ConvNet scaling and identify that carefully balancing network width, depth, and resolution is an important but missing piece, preventing us from better accuracy and efficiency. To address this issue, we propose a simple and highly effective compound scaling method, which enables us to easily scale up a baseline ConvNet to any target resource constraints in a more principled way, while maintaining model efficiency. Powered by this compound scaling method, we demonstrate that a mobile-size EfficientNet model can be scaled up very effectively, surpassing state-of-the-art accuracy with an order of magnitude fewer parameters and FLOPS, on both ImageNet and five commonly used transfer learning datasets. 

Acknowledgements 

We thank Ruoming Pang, Vijay Vasudevan, Alok Aggarwal, Barret Zoph, Hongkun Yu, Xiaodan Song, Samy Bengio, Jeff Dean, and Google Brain team for their help. 

References 

Berg, T., Liu, J., Woo Lee, S., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. CVPR, pp. 2011-2018, 2014. 
Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 - mining discriminative components with random forests. ECCV, pp. 446-461, 2014. 
Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. ICLR, 2019. 
Chollet, F. Xception: Deep learning with depthwise separable convolutions. CVPR, pp. 1610-02357, 2017. 
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. CVPR, 2019. 
Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3-11, 2018. 
Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., and Keutzer, K. Squeezenext: Hardware-aware neural network design. ECV Workshop at CVPR'18, 2018. 
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016. 
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, pp. 770-778, 2016. 
He, K., Gkioxari, G., Dollar, P., and Girshick, R. Mask r-cnn. ICCV, pp. 2980-2988, 2017. 
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. Amc: Automl for model compression and acceleration on mobile devices. ECCV, 2018. 
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 
Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. CVPR, 2018. 
Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. ECCV, pp. 646-661, 2016. 
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. CVPR, 2017. 
Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1808.07233, 2018. 
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016. 
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, pp. 448-456, 2015. 
Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? CVPR, 2019. 
Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorization, 2013. 
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009. 
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097-1105, 2012. 
Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172-6181, 2018. 
Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. CVPR, 2017. 
Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. ECCV, 2018. 
Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. NeurIPS, 2018. 
Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. ECCV, 2018. 
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018. 
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 
Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018. 
Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. ICVGIP, pp. 722-729, 2008. 
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. CVPR, pp. 3498-3505, 2012. 
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. ICML, 2017. 
Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018. 
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. AAAI, 2019. 
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252, 2015. 
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. CVPR, 2018. 
Sharir, O. and Shashua, A. On the expressive power of overlapping architectures of deep learning. ICLR, 2018. 
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014. 
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CVPR, pp. 1-9, 2015. 
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CVPR, pp. 2818-2826, 2016. 
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 4:12, 2017. 
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019. 
Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. CVPR, pp. 5987-5995, 2017. 
Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. Netadapt: Platform-aware neural network adaptation for mobile applications. ECCV, 2018. 
Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016. 
Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit of structural diversity in very deep networks. CVPR, pp. 3900-3908, 2017. 
Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CVPR, 2018. 
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. CVPR, pp. 2921-2929, 2016. 
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. 
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018. 
<|endoftext|>


<|startoftext|>
Energy and Policy Considerations for Deep Learning in NLP 

Emma Strubell Ananya Ganesh Andrew McCallum College of Information and Computer Sciences University of Massachusetts Amherst 
{strubell, aganesh, 

Abstract 

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice. 

1 Introduction 

Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances. 

<<TABLE>>

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.¹
Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited to the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers. 

¹ Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1. 
To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatt-hours of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware. 
2 Methods 
To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation. 
We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.

<<TABLE>>

Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,⁴ China⁵ and Germany (Burger, 2019). 

We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total power p_t required at a given instance during training is given by: 

<<FORMULA>> (1) 

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions: 

<<FORMULA>> (2) 

This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt hour of compute energy used.
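The calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the multiplication by training hours and the division by 1000 are assumed from the statement that the result is in kWh, and the emission factor is left as an explicit argument because its value is not given in the text. All wattages in the usage lines are made-up placeholders.

```python
def total_energy_kwh(hours, cpu_watts, dram_watts, gpu_watts, num_gpus, pue=1.58):
    """Estimated energy in kWh: PUE times combined CPU, DRAM and GPU draw over time.

    The form (watts * hours / 1000, scaled by PUE) is an assumption based on
    the description in the text; 1.58 is the PUE coefficient quoted there.
    """
    combined_watts = cpu_watts + dram_watts + num_gpus * gpu_watts
    return pue * hours * combined_watts / 1000.0


def co2_lbs(energy_kwh, lbs_co2_per_kwh):
    """Convert energy to estimated CO2 using a caller-supplied emission factor."""
    return lbs_co2_per_kwh * energy_kwh


# Hypothetical placeholder readings, not measurements from the paper.
energy = total_energy_kwh(hours=24, cpu_watts=100, dram_watts=20,
                          gpu_watts=250, num_gpus=1)
# Placeholder factor of 1.0 lbs/kWh, used only to show the call.
emissions = co2_lbs(energy, lbs_co2_per_kwh=1.0)
```

Keeping the emission factor as an argument makes it straightforward to substitute a regional value when the energy mix differs from the U.S. average.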

2.1 Models 

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers. 
Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in Section 4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.
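As a rough check of the NAS figures, the conversion from training steps to hardware hours can be reproduced from the rounded numbers quoted in this paragraph. The sketch below is only illustrative: the variable names are hypothetical, and the GPU conversion assumes (this is not stated in the text) that 300k steps take 84 hours on 8 P100 GPUs versus 10 hours on one TPUv2 core.

```python
total_steps = 979_000_000        # full architecture search (So et al., 2019)
steps_per_run = 300_000          # steps in one reference training run
tpu_hours_per_run = 10           # hours for 300k steps on one TPUv2 core
gpu_hours_per_run = 84           # assumed: hours for 300k steps on 8 P100 GPUs

runs = total_steps / steps_per_run
tpu_hours = runs * tpu_hours_per_run      # about 32,633
gpu_hours = runs * gpu_hours_per_run      # about 274,120
```

With these rounded inputs the TPU estimate comes to roughly 32,633 hours, within ten hours of the 32,623 quoted above, and the GPU estimate matches the quoted 274,120 hours.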
ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pretrained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).
BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).
GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.6

3 Related work 

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.
Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP. 
6 Via the authors on Reddit.
7 GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at <<FORMULA>>, upper bound uses on-demand U.S. resources priced at <<FORMULA>>. We similarly use pre-emptible (<<FORMULA>>) and on-demand (<<FORMULA>>) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.

<<TABLE>>

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).7 Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware. 

4 Experimental results 

4.1 Cost of training 
Table 3 lists CO2 emissions and estimated cost of training the models described in Section 2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.

4.2 Cost of development: Case study 
To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP. 
Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs.8 
The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model.9 We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.

<<TABLE>>

Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model (2) a single tune and (3) all models trained during R&D.
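The averages quoted for the case study follow directly from the totals. The short sketch below only restates that arithmetic; the variable names are hypothetical and a 365-day year is assumed.

```python
total_gpu_days = 9998   # summed GPU time over all jobs
project_days = 172      # wall-clock span of the project

avg_gpus_in_use = total_gpu_days / project_days   # about 58.1 GPUs
gpu_years = total_gpu_days / 365                  # about 27.4 years
```

The result, roughly 58 GPUs busy at all times, is what the text rounds to "about 60".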

5 Conclusions 

Authors should report training time and sensitivity to hyperparameters. 
Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as retraining on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.

8 We approximate cloud compute cost using P100 pricing.
9 Based on average U.S. cost of electricity of $0.12/kWh.
Academic researchers need equitable access to computation resources. 

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute. 
Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.
While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for nonprofit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers. 
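The hardware estimate in this paragraph can be reproduced from the quoted figures. The sketch below uses hypothetical variable names; it assumes cost scales with fractional servers, which is what reproduces the $145,000 figure, and also shows the cost when rounding up to whole servers.

```python
import math

gpus_needed = 58          # approximate GPUs in constant use during the project
gpus_per_server = 8       # NVIDIA 1080 Ti GPUs per off-the-shelf server
server_cost_usd = 20_000  # approximate price per server

fractional_cost = gpus_needed / gpus_per_server * server_cost_usd
whole_server_cost = math.ceil(gpus_needed / gpus_per_server) * server_cost_usd
```

Here `fractional_cost` evaluates to 145,000 USD, matching the text, while buying whole servers (8 of them) would come to 160,000 USD before electricity.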
Researchers should prioritize computationally efficient hardware and algorithms. 
We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,10 they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP.
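To make the contrast with grid search concrete, the following is a minimal sketch of random hyperparameter search using only the Python standard library. It is not the API of any package mentioned above: the function names, the search space, and the toy objective (a stand-in for a validation loss) are all hypothetical.

```python
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample each hyperparameter uniformly from its range and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    # Hypothetical stand-in for training a model and returning validation loss.
    return (params["log10_lr"] + 3.0) ** 2 + (params["dropout"] - 0.2) ** 2


space = {"log10_lr": (-5.0, -1.0), "dropout": (0.0, 0.5)}
best_params, best_score = random_search(toy_loss, space, n_trials=24)
```

Unlike a grid, the number of trials is chosen independently of the number of hyperparameters, so the compute budget can be fixed in advance.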

Acknowledgements 

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan-Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
10 For example, the Hyperopt Python library.

References 

Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute. 
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
James S Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554.
Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE. 
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.
Gary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace. 
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.
Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.
Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 477–484.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. 
Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML). 
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS). 
<|endoftext|>


<|startoftext|>
Finite-Element Neural Networks for Solving Differential Equations 
Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE 

Abstract

The solution of partial differential equations (PDE) arises in a wide variety of engineering problems. Solutions to most practical problems use numerical analysis techniques such as finite-element or finite-difference methods. The drawbacks of these approaches include computational costs associated with the modeling of complex geometries. This paper proposes a finite-element neural network (FENN) obtained by embedding a finite-element model in a neural network architecture that enables fast and accurate solution of the forward problem. Results of applying the FENN to several simple electromagnetic forward and inverse problems are presented. Initial results indicate that the FENN performance as a forward model is comparable to that of the conventional finite-element method (FEM). The FENN can also be used in an iterative approach to solve inverse problems associated with the PDE. Results showing the ability of the FENN to solve the inverse problem given the measured signal are also presented. The parallel nature of the FENN also makes it an attractive solution for parallel implementation in hardware and software.

I. INTRODUCTION 

Solutions of differential equations arise in a wide variety of engineering applications in electromagnetics, signal processing, computational fluid dynamics, etc. These equations are typically solved using either analytical or numerical methods. Analytical solution methods are however feasible only for simple geometries, which limits their applicability. In most practical problems with complex boundary conditions, numerical analysis methods are required in order to obtain a reasonable solution. An example is the solution of Maxwell's equations in electromagnetics. Solutions to Maxwell's equations are used in a variety of applications for calculating the interaction of electromagnetic (EM) fields with different types of media. 
<<FIGURE>>

Fig. 1. Iterative inversion method for solving inverse problems.

Manuscript received January 17, 2004; revised April 2, 2005.

Very often, the solution to differential equations is necessary for solving the corresponding inverse problems. Inverse problems in general are ill-posed, lacking continuous dependence of the measurements on the input. This has resulted in the development of a variety of solution techniques ranging from simple calibration procedures to other direct (analytical) and iterative approaches [1]. Iterative methods typically employ a forward model that simulates the underlying physical process (Fig. 1) [2]. An initial estimate of the solution of the inverse problem (shown in Fig. 1) is applied to the forward model, resulting in the corresponding solution to the forward problem

<<ALGORITHM>>
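The iterative inversion loop of Fig. 1 can be sketched as follows. This is only an illustration under stated assumptions: the scalar forward model is a hypothetical stand-in for a physical simulation, and the gradient-descent update with a finite-difference slope is one possible update rule, not necessarily the one used in this paper.

```python
def forward_model(alpha):
    # Hypothetical forward model mapping a material parameter to a signal.
    return alpha ** 2 + alpha


def invert(measured, initial_estimate, step=0.01, tol=1e-10, max_iter=1000, eps=1e-6):
    """Refine the estimate until the forward-model output matches the measurement."""
    estimate = initial_estimate
    for _ in range(max_iter):
        residual = forward_model(estimate) - measured
        if abs(residual) < tol:
            break
        # Central-difference approximation of the forward model's slope.
        slope = (forward_model(estimate + eps) - forward_model(estimate - eps)) / (2 * eps)
        # Gradient step on the squared error 0.5 * residual**2.
        estimate -= step * residual * slope
    return estimate


measurement = forward_model(2.0)            # synthetic measurement, true value 2.0
recovered = invert(measurement, initial_estimate=1.0)
```

Each pass through the loop calls the forward model several times, which is why the cost of the forward model dominates the cost of iterative inversion.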

Although finite-element methods (FEMs) [3], [4] are extremely popular for solving differential equations, their major drawback is computational complexity. This problem becomes more acute when three-dimensional (3-D) finite-element models are used in an iterative algorithm for solving the inverse problem. Recently, several authors have suggested the use of neural networks (MLP or RBF networks [5]) for solving differential equations [6]–[9].
In these techniques, a neural network is trained using a large database containing the input data and the solution of the differential equation. The neural network during generalization learns the mapping corresponding to the PDE. Alternatively, in [10], the solution to a differential equation is written as a constant term, and an adjustable term with parameters that need to be determined. A neural network is used to determine the optimal values of the parameters. This approach is applicable only to problems with regular boundaries. An extension of the approach to problems with irregular boundaries is given in [11]. Other neural network based differential equation solvers use multilayer perceptron networks or variations on the MLP to approximate the unknown function in a PDE [12]–[14]. A combination of the PDE and boundary conditions is used to construct an objective function that is minimized during the training process.
A major limitation of these approaches is that the network architecture is selected somewhat arbitrarily. A second drawback is that the performance of the neural networks depends on the data used in training and testing. As long as the test data is similar to the training data, the network can interpolate between the training data points to obtain a reasonable prediction. However, when the test signal is no longer similar to the training data, the network is forced to extrapolate and the performance degrades. One way around this difficulty is to ensure that the training database has a diverse set of signals. However, this is difficult to ensure in practice. Alternatively, we have to design neural networks that are capable of extrapolation. Extrapolation methods are discussed extensively in literature [15]–[18], but the design of an extrapolation neural network involves several issues, particularly for ensuring that the error in the network prediction stays within reasonable bounds during the extrapolation procedure.
An ideal solution to this problem would be to combine the power of numerical models with the computational speed of neural networks, i.e., to embed a numerical model in a neural network structure. One such finite-element neural network (FENN) formulation has been reported by Takeuchi and Kosugi [19]. This approach, based on error minimization, derives the neural network using the energy functional resulting from the finite-element formulation. Other reports of FENN combinations are either similar to the Takeuchi method [20], [21] or use Hopfield neural networks to solve the forward problem [22], [23]. Kalkkuhl et al. [24] provide a description of a FEM-based approach to NARX modeling that may be interpreted both as a local model network, as well as a single layer feedforward network. A slightly different approach to merging numerical methods and neural networks is given in [25], where the finite-difference time domain (FDTD) method is cast in a neural network framework for the purpose of solving electromagnetic forward problems. The related problem of mesh generation in finite-element models has also been tackled using neural networks (for instance, [26]). Generally, these networks are designed to solve the forward problem, and must be modified to solve inverse problems. 
This paper proposes a new approach that embeds a finite-element model commonly used in the solution of differential equations in a neural network. The network, called the FENN, can solve the forward problem and can also be used in an iterative algorithm to solve inverse problems. The primary advantage of this approach is that the FEM is represented in a parallel form. Thus, it has the potential to alleviate the computational cost associated with using the FEM in an iterative algorithm for solving inverse problems. More importantly, the FENN does not need any training, and the computation of the weights is a one-time process. The proposed approach is also different in that the neural network architecture developed can be used to solve the forward and inverse problems. The structure of the neural network is also simpler than those reported in the literature, making it easier to implement in parallel in both hardware and software. 
The rest of this paper is organized as follows. Section II briefly describes the FEM, and derives the proposed FENN. In this paper, we focus on the problem of solving typical equations encountered in electromagnetic nondestructive evaluation (NDE). However, the same concepts can be easily applied to solve differential equations encountered in other fields. Sections III, IV and V present the application of the FENN to solving forward and inverse problems, along with initial results. A discussion of the advantages and disadvantages of the proposed FENN architecture is given in Section VI. Finally, Section VII draws conclusions from the results and presents ideas for future work.

II. THE FENN 

This section briefly describes the FEM and proposes its reformulation into a parallel neural network structure. Details about the FEM can be found in [3] and [4]. 

A. The FEM 

Consider a typical boundary value problem with the governing differential equation 

<<FORMULA>>        (1) 

where <<FORMULA>> is a differential operator, <<FORMULA>> is the applied source or forcing function, and the remaining quantity in (1) is the unknown. This differential equation can be solved in conjunction with boundary conditions on the boundary enclosing the domain. The variational formulation used in finite-element analysis determines the unknown by minimizing the functional in (2) [3], [4] with respect to the trial function.

The minimization procedure starts by dividing the domain into small subdomains called elements (Fig. 2) and representing the unknown in each element by means of basis functions defined over the element

<<FORMULA>>     (3)

where the left-hand side is the unknown solution in a given element, each basis function is associated with one node of that element, each coefficient is the value of the unknown quantity at that node, and the sum runs over all nodes associated with the element. In general, the basis functions (also referred to as interpolation functions or shape functions) can be linear, quadratic, or of higher order. Typically, finite-element models use either linear or polynomial spline basis functions.
The functional within an element is expressed as 

<<FORMULA>>         (4) 

By substituting (3) in (4), we obtain the discrete version of the functional within each element

<<FORMULA>>         (5)

where the vector of nodal values appears together with its transpose, and the elemental matrix has elements

<<FORMULA>>         (6)

and the accompanying elemental vector has elements

<<FORMULA>>             (7)

Combining the values in (5) for each of the elements gives the global functional (8), in which the global matrix is derived from the terms of the elemental matrices for the different elements and its dimension is set by the total number of nodes. The global matrix, also called the stiffness matrix, is a sparse, banded matrix. Equation (8) is the discrete version of the functional and can be minimized with respect to the nodal parameters by taking the derivative of the functional with respect to <<FORMULA>> and setting it equal to zero, which results in the matrix equation

<<FORMULA>>         (9) 

Boundary conditions for these problems are usually of two types: natural boundary conditions and essential boundary conditions. Essential boundary conditions (also referred to as Dirichlet boundary conditions) impose constraints on the value of the unknown at several nodes. Natural boundary conditions (of which Neumann boundary conditions are a special case) impose constraints on the change in the unknown across a boundary. Dirichlet boundary conditions are imposed on the functional minimization (9) by deleting the rows and columns of the matrix corresponding to the nodes on the Dirichlet boundary and modifying the right-hand-side vector in (9).
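The assembly and solution steps described in this subsection can be illustrated on a small one-dimensional problem. The sketch below is an assumption-laden example rather than the paper's formulation: it takes the operator to be the negative second derivative with a constant source, uses linear basis functions on a uniform mesh, imposes zero Dirichlet values by deleting the boundary rows and columns, and solves the reduced system with plain Gaussian elimination. All function names are hypothetical.

```python
def assemble_1d_poisson(num_elements, length=1.0, source=1.0):
    """Assemble the global matrix and source vector for -u'' = source."""
    n = num_elements + 1
    h = length / num_elements
    K = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    k_local = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]   # elemental matrix
    b_local = [source * h / 2.0, source * h / 2.0]         # elemental vector
    for e in range(num_elements):
        nodes = (e, e + 1)
        for a, i in enumerate(nodes):
            b[i] += b_local[a]
            for c, j in enumerate(nodes):
                K[i][j] += k_local[a][c]
    return K, b


def apply_zero_dirichlet(K, b, boundary_nodes):
    """Delete rows and columns of Dirichlet nodes (zero values need no change to b)."""
    keep = [i for i in range(len(b)) if i not in boundary_nodes]
    return [[K[i][j] for j in keep] for i in keep], [b[i] for i in keep], keep


def solve_dense(A, rhs):
    """Gaussian elimination without pivoting; adequate for this small SPD system."""
    n = len(rhs)
    A = [row[:] for row in A]
    x = rhs[:]
    for p in range(n):
        for r in range(p + 1, n):
            factor = A[r][p] / A[p][p]
            for c in range(p, n):
                A[r][c] -= factor * A[p][c]
            x[r] -= factor * x[p]
    for p in reversed(range(n)):
        x[p] = (x[p] - sum(A[p][c] * x[c] for c in range(p + 1, n))) / A[p][p]
    return x


K, b = assemble_1d_poisson(num_elements=4)
K_red, b_red, interior = apply_zero_dirichlet(K, b, boundary_nodes={0, 4})
phi = solve_dense(K_red, b_red)   # nodal values at x = 0.25, 0.5, 0.75
```

For this problem the nodal values agree with the analytical solution x(1 - x)/2, giving 0.09375, 0.125 and 0.09375 at the three interior nodes.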


Natural boundary conditions are applied in the FEM by adding an additional term to the functional. These boundary conditions are then incorporated into the functional and are satisfied automatically during the solution procedure. As an example, consider the natural boundary condition represented by the following equation [3]

<<FORMULA>>             (10)

where <<FORMULA>> represents the Neumann boundary, and the remaining symbols in (10) are its outward normal unit vector, a constant, and known parameters associated with the boundary (including <<FORMULA>>). Assuming that the boundary is made up of a number of segments, we can define boundary matrices with elements

<<FORMULA>>         (11)

where <<FORMULA>> are basis functions defined over a segment, whose length also enters (11). The elements of <<FORMULA>> are added to the elements of the global matrix that correspond to the nodes on the boundary. Similarly, the elements of <<FORMULA>> are added to the corresponding elements of <<FORMULA>>. The global matrix equation (9) is thus modified as follows before solving for the unknown nodal values


<<FORMULA>>         (12) 


                        <<FIGURE>>

Fig. 3. FEM domain discretization using two elements and four nodes. 

This process ensures that natural boundary conditions are implicitly and automatically satisfied during the FEM solution procedure.

B. The FENN 

This section describes how the finite-element model can be converted into a parallel network form. We focus on solving typical inverse problems arising in electromagnetic NDE, but the basic idea is applicable to other areas as well. NDE inverse problems can be formulated as the problem of finding the material properties (such as the conductivity or the permeability) within the domain of the problem. Since the domain is discretized in the FEM by a large number of elements, the problem can be posed as one of finding the material properties in each of these elements. These properties are usually embedded in the differential operator <<FORMULA>> or, equivalently, in the global matrix <<FORMULA>>. Thus, in order to be able to iteratively estimate these properties from the measurements, the material properties need to be separated out from <<FORMULA>>. This separation is easier to achieve at the element matrix level. For a given pair of nodes <<FORMULA>> in an element,

<<FORMULA>>          (13)

where <<FORMULA>> is the parameter representing the material property in element <<FORMULA>> and <<FORMULA>> represents the differential operator at the element level without the material property embedded in it. Substituting (13) into the functional, we get

<<FORMULA>>         (14)

<<FIGURE>>

Fig. 4. FENN.

If we define

<<FORMULA>>          (15) 

where 

<<FORMULA>>         (16) 

<<FORMULA>>             (17) 


Equation (17) expresses the functional explicitly in terms of <<FORMULA>>. The assumption that the material property is constant within each element is implicit in this expression. This assumption is usually satisfied in problems in NDE where each element in the FEM mesh is defined within the confines of a domain, and at no time does a single element cross domain boundaries. Furthermore, each element is small enough that minor variations in the material property within an element may be ignored.

Equation (17) can be easily converted into a parallel network form. The neural network comprises an input, output and hidden layer. In the general case with <<FORMULA>> elements and <<FORMULA>> nodes in the FEM mesh, the input layer has one network input per element and takes the material property value in each element as input. The hidden layer has one neuron for each member of the global <<FORMULA>> matrix, arranged in groups with one group per row of that matrix. The output of each group of hidden layer neurons is the corresponding row vector of the global matrix. The weights from the input to the hidden layer are set to the appropriate values defined in (16). Each neuron in the hidden layer acts as a summation unit (equivalent to a summation followed by a linear activation function [5]). The outputs of the hidden layer neurons are the elements of the global matrix as given in (15). Each group of hidden neurons is connected to one output neuron (giving one output neuron per node) by a set of weights, with each weight representing a nodal value. Note that the set of weights between the first group of hidden neurons and the first output neuron are the same as the set of weights between the second group of hidden neurons and the second output neuron (as well as between successive groups of hidden neurons and the corresponding output neuron). Each output neuron is also a summation unit followed by a linear activation function, and the output of each neuron is equal to

<<FORMULA>>             (18) 

where the second part of (18) is obtained by using (15). As an example, the FENN architecture for a two-element, four-node FEM mesh (Fig. 3) is shown in Fig. 4. In this case, the FENN has two input neurons, 16 hidden layer neurons, and four output neurons. The figure illustrates the grouping of the hidden layer neurons, as well as the similarity inherent in the weights that connect each group of hidden layer neurons to the corresponding output neuron. To simplify the figure, the weights between the network inputs and the hidden layer neurons are depicted by means of vectors (one for each of the four groups of hidden neurons and each of the two inputs), where the individual weight values <<FORMULA>> are defined as in (16). 
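The layer structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `alpha` (per-element inputs), `w` (input-to-hidden weights indexed by element, row, and column), `phi` (hidden-to-output weights, i.e., nodal values), and `fenn_forward` are introduced here for illustration, and the weight values in the example are arbitrary placeholders rather than the values defined in (16).

```python
def fenn_forward(alpha, w, phi):
    """Sketch of a FENN forward pass with linear (summation) neurons.

    alpha: per-element material inputs (hypothetical name)
    w:     w[e][i][j], weight from input e to hidden neuron (i, j)
    phi:   hidden-to-output weights, one per node, shared by all groups
    """
    n = len(phi)
    # Hidden layer: neuron (i, j) sums its weighted inputs; its output is K[i][j].
    K = [[sum(alpha[e] * w[e][i][j] for e in range(len(alpha)))
          for j in range(n)] for i in range(n)]
    # Output layer: neuron i sums hidden group i weighted by the nodal values.
    b = [sum(K[i][j] * phi[j] for j in range(n)) for i in range(n)]
    return K, b


# Two inputs and three nodes, with placeholder weights.
w = [
    [[1, -1, 0], [-1, 1, 0], [0, 0, 0]],   # contribution of the first input
    [[0, 0, 0], [0, 1, -1], [0, -1, 1]],   # contribution of the second input
]
K, b = fenn_forward(alpha=[2.0, 3.0], w=w, phi=[0.0, 1.0, 2.0])
print(K)
print(b)
```

Because every neuron is linear, the whole network is a pair of nested weighted sums, and the same `phi` is reused for every hidden group, which is the weight sharing noted in the text.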
1) Boundary Conditions in the FENN: Note that the elements of <<FORMULA>> and in (11) do not depend on the material properties <<FORMULA>> and need to be added appropriately to the global matrix 
and the source vector as shown in (12). 

<<FIGURE>>

Fig. 5. Geometry of mesh for 1-D FEM. 

<<FIGURE>>

Fig. 6. Flowchart (with example) for designing the FENN for a general PDE. 

Equation (12) thus implies that natural boundary conditions can be applied in the FENN as bias inputs to the hidden layer neurons that are a part of the boundary, and the corresponding output neurons. Dirichlet boundary conditions are applied by clamping the corresponding weights between the hidden layer and output layer neurons. These weights will be referred to as the clamped weights, while the remaining weights will be referred to as the free weights. An example of these weights is presented later. 
The FENN architecture was derived without consideration of the dimensionality of the problem at hand, and thus can be used for 1-, 2-, 3-, or higher dimensional problems. The number of nodes and elements in the FEM mesh dictates the number of neurons in the different layers. The weights between the input and hidden layer change depending on node-element connectivity information. 
The major drawback of the FENN is the number of neurons and weights necessary. However, the memory requirements can be reduced considerably, since most of the weights between the input and hidden layer are zero. These weights, and the corresponding connections, can be discarded. Similarly, most of the elements of the global matrix are also zero (it is a banded matrix). The corresponding neurons in the hidden layer can also be discarded, reducing memory and computation requirements considerably. Furthermore, the weights between each group of hidden layer neurons and the output layer are the same. Weight-sharing approaches can be used here to further reduce the storage requirements. 

C. A 1-D Example 

Consider the 1-D equation 

<<FORMULA>>         (19) 

on the boundary <<FORMULA>> defined by <<FORMULA>> and 
are constants depending on the material and 
is the applied source. Laplace's equation and Poisson's equation are special cases of this equation. The FENN formulation for this problem starts by discretizing the domain of interest with <<FORMULA>> elements and 
nodes. In one dimension, each element is defined by two nodes (Fig. 5). Define basis functions <<FORMULA>> and <<FORMULA>> over each element <<FORMULA>> and let the nodal unknown be the value of <<FORMULA>> on node <<FORMULA>> in element <<FORMULA>>. An example of the basis functions is shown in Fig. 5. For these basis functions, i.e., 

<<FORMULA>>         (20) 

the element matrices are given by [3] 

<<FORMULA>>         (21) 

<<FORMULA>>             (22) 

Here, <<FORMULA>> is the length of element <<FORMULA>>. The global matrix is then constructed by selectively adding the element matrices based on the nodes that form an element. Specifically, it is a sparse tridiagonal matrix, and its nonzero elements are given by 

<<FORMULA>>         (23) 

Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b) Problem description using symmetry considerations. 
The network implementation of (23) can be derived as follows. If <<FORMULA>> and <<FORMULA>> values at each element are the inputs to the network, <<FORMULA>> and <<FORMULA>> form the weights between the input and hidden layers. The network thus uses input neurons and hidden neurons. The values of <<FORMULA>> at each of the nodes are assigned as weights between the hidden and output layers, and the source is the desired output of this network (corresponding to the output neurons). Dirichlet boundary conditions are applied as explained earlier. 
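The assembly of a tridiagonal global matrix from two-node elements can be illustrated with a short sketch. Since (19) to (23) are not legible in this copy, the sketch assumes the common 1-D form with one diffusion-like coefficient `alpha` and one reaction-like coefficient `beta` per element, and the standard linear-element matrices; the function name `assemble_1d` and these element matrices are assumptions and may differ from the expressions in (21) to (23).

```python
def assemble_1d(alpha, beta, lengths):
    """Assemble a tridiagonal global matrix for linear two-node elements.

    Assumed element matrix (not taken from the source):
        alpha_e / l_e * [[1, -1], [-1, 1]] + beta_e * l_e / 6 * [[2, 1], [1, 2]]
    """
    n_nodes = len(lengths) + 1
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e, (a, b, l) in enumerate(zip(alpha, beta, lengths)):
        ke = [[a / l + b * l / 3, -a / l + b * l / 6],
              [-a / l + b * l / 6, a / l + b * l / 3]]
        for p in range(2):              # local node p is global node e + p
            for q in range(2):
                K[e + p][e + q] += ke[p][q]
    return K


K = assemble_1d(alpha=[1.0, 1.0], beta=[0.0, 0.0], lengths=[0.5, 0.5])
for row in K:
    print(row)
```

Each element only touches the rows and columns of its two nodes, which is why the assembled matrix is tridiagonal and why most input-to-hidden weights in the corresponding network are zero.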

D. General Case 

Fig. 6 shows a flowchart of the general scheme for converting a differential equation into the FENN structure. An example in two dimensions is also provided next to the flowchart. We start with the differential equation and the boundary conditions and formulate the FEM using the variational method. This involves discretizing the domain of interest into elements and nodes, selecting basis functions, writing the functional for each element, and obtaining the element matrices and the source vector. The example presented uses the FEM mesh shown in Fig. 3, with two elements, <<FORMULA>> nodes, and linear basis functions. The unknown solution to the differential equation is represented by its values at each of the nodes in the finite-element mesh <<FORMULA>>. The element matrices are then separated into two parts, with one part dependent on the material properties while the other is independent of them. The FENN is then designed to have input neurons, hidden neurons, and output neurons, where <<FORMULA>> is the number of material property parameters. In the example under consideration, <<FORMULA>>, since we have two material property parameters. The first group of input neurons takes in the values of the first parameter, while the second group takes in the values of the second parameter in each element. The weights from the input to the hidden layer are set to the appropriate values of <<FORMULA>>. In the example, since nodes 1, 2, and 3 are part of element 1 (see Fig. 3), the weights from the first input node to the first group of four neurons in the hidden layer are given by 

<<FORMULA>>         (24) 

The last weight is zero since node 4 is not a part of element 1. Each group of hidden neurons is connected to one output neuron by a set of weights <<FORMULA>>, with each element of the set representing a nodal value. The output of each neuron in the output layer is equal to 

<<FIGURE>>

Fig. 8. Forward problem solutions for shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error between (a) and (b). The x-and y-axes show the nodes in the FEM discretization of the domain, and the z-axis in (c) shows the error at each of these nodes in volts. 

III. FORWARD AND INVERSE PROBLEM FORMULATION USING FENN

The FENN architecture and algorithm lends itself to solving both the forward and inverse problems. The forward problem involves determining the weights given the material parameters and the applied source, while the inverse problem involves determining the material parameters (input of the FENN). A gradient-based approach can be used to solve both these problems. Suppose we define the error at the output of the FENN in terms of the output of the FENN. Then, for a gradient-based approach, the gradients of the error with respect to the free hidden layer weights are given by 

<<FORMULA>>             (27)

Equation (27) can be used to solve the forward problem. Similarly, to solve the inverse problem, the gradients of the error with respect to the material parameters (input of the FENN) are necessary. 

            <<TABLE>>

TABLE I SUMMARY OF PERFORMANCE OF THE FENN ALGORITHM FOR VARIOUS PDES 

For the forward problem, such an approach is equivalent to the iterative approaches used to solve for the unknown nodal values in the FEM [4]. 
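A gradient-based forward solve of this kind might look as follows. This is a hedged sketch and not the paper's algorithm: the squared-error definition, the restriction of the error to the rows of the free nodes, the step size, and the names `solve_forward`, `free`, and `rate` are all assumptions made for illustration.

```python
def solve_forward(K, b, phi, free, rate=0.05, iterations=200):
    """Gradient descent on E = 0.5 * sum_i (sum_j K[i][j] * phi[j] - b[i])**2.

    Only the entries of phi listed in `free` are updated; all other entries
    stay clamped (Dirichlet values). The error is summed over the rows of the
    free nodes only, which is an assumption of this sketch.
    """
    phi = list(phi)
    n = len(phi)
    for _ in range(iterations):
        residual = {i: sum(K[i][j] * phi[j] for j in range(n)) - b[i]
                    for i in free}
        grad = {j: sum(residual[i] * K[i][j] for i in free) for j in free}
        for j in free:
            phi[j] -= rate * grad[j]
    return phi


# Three-node example: both end nodes are clamped, the middle node is free.
K = [[2.0, -2.0, 0.0], [-2.0, 4.0, -2.0], [0.0, -2.0, 2.0]]
phi = solve_forward(K, b=[0.0, 0.0, 0.0], phi=[0.0, 0.0, 1.0], free=[1])
print(phi)
```

No matrix inversion is needed: each update uses only products of residuals and entries of the global matrix, which is what makes the iteration easy to parallelize.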

IV. RESULTS 

A. Forward Model Results 
The FENN was tested using both 1- and 2-D versions of Poisson's equation 
<<FORMULA>>     (30) 
where represents the material property, and is the applied source. For instance, in electromagnetics may represent the permittivity while represents the charge density. 
As the  first example, consider the following 2-D equation 
<<FORMULA>>         (31)  
with boundary conditions and <<FORMULA>> on <<FORMULA>> (32)  
on <<FORMULA>> (33)  

This is the governing equation for the shielded microstrip transmission line problem shown in Fig. 7. The forward problem computes the electric potential due to the shielded microstrip shown in Fig. 7(a). The potentials are zero on the shielding conductor. Since the geometry is symmetric, we can solve the equivalent problem shown in Fig. 7(b), by applying the homogeneous Neumann condition on the plane of symmetry. The inner conductor (microstrip) is held at a constant potential of volts. Finally, we also assume that the material inside the shielding conductor has a permittivity , where K is a constant. The permittivity in this case corresponds to the material property . Specifically, and . The homogeneous Neumann boundary condition is equivalent to setting . The microstrip and the shielding conductor correspond to the Dirichlet boundary, with <<FORMULA>> on the microstrip and 
on the outer boundary [Fig. 7(b)]. Finally, there is no source term in this example (the source term would correspond to a charge distribution in the domain of interest), i.e., <<FORMULA>> In this example, we assume that volts. Further, we assume that the domain of interest is 

The solution to the forward problem is presented in Fig. 8, with the FEM solution using 11 nodes in each direction shown in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b). These figures show contours of constant potential. The error between the FEM and FENN solutions is presented in Fig. 8(c). As seen from the figure, the FENN matches the FEM solution accurately, with the peak error at any node on the order of 
Several other examples were also used to test the FENN and the results are summarized in Table I. Column 1 shows the PDE used to evaluate the FENN performance, while Column 2 shows the boundary conditions used. The analytic solution to the problem is indicated in Column 3. The FENN structure and the number of iterations for convergence using a gradient descent approach are indicated in Columns 4 and 5, respectively. The FENN structure, as explained earlier, has an 
are the number of elements and nodes in the FEM mesh, respectively, and 
is the number of hidden neurons, and corresponds to the number of nonzero elements in the FEM global matrix 
Finally, Columns 6 and 7 present the sum-squared error (SSE) and the maximum error in the solution, respectively, where the errors are computed with respect to the analytical solution. These results indicate that the FENN is capable of accurately determining the potential 
One advantage of the FENN approach is that the computation of the input-hidden layer weights is a one-time process, as long as the differential equation does not change. The only changes necessary to solve the different problems are changes in the input and the desired output.

B. Inverse Model Results 

The FENN was also used to solve several simple inverse problems based on (30). In all cases, the objective was to determine 

<<FIGURE>>

Fig. 9. FENN inversion results for Poisson's equation with initial solutions (a) 
the value of <<FORMULA>> and <<FORMULA>> for given values of <<FORMULA>> and <<FORMULA>>. 
The first example is a 1-D problem that involves determining the unknown material property from the given quantities (including <<FORMULA>>) for the differential equation 
<<FORMULA>>     (34) 

with boundary conditions <<FORMULA>> and <<FORMULA>>. The analytical solution to this inverse problem is 
<<FORMULA>> and 

<<FORMULA>>             (35) 

As seen from (35), the problem has an infinite number of solutions and we expect the solution procedure to converge to one of these solutions depending on the initial value. 

Fig. 9(a) and (b) shows two solutions to this inverse problem for two different initializations (shown using triangles). In both cases, the FENN solution (in stars) is seen to match the analytical solution (squares). The SSE in both cases was on the order of 

<<FORMULA>>

In order to obtain a unique solution, we need to constrain the value of the unknown at the boundary as well. Consider the same differential equation as (34), but with the given quantities specified as follows: 
(36)  

This equation has an analytical solution. To solve this problem, we clamp the value of the unknown at the boundary nodes. The results of the constrained inversion obtained using 11 nodes and 10 elements in the corresponding finite-element mesh are shown in Fig. 10. Fig. 10(a) shows the comparison between the analytical solution (solid line with squares) and the FENN result (solid line with stars). The initial value of the unknown is shown in the figure as a dashed line. Fig. 10(b) shows the comparison between the actual and desired forcing function at the FENN output. This result indicates that the SSE in the forcing function, as well as the SSE in the inversion result, is fairly large (0.0148 and 0.0197, respectively). The reason for this was traced back to the mesh discretization. Fig. 11 shows the SSE in the output of the FENN and the SSE in the inverse problem solution as a function of FEM discretization. It is seen that increasing the discretization significantly improves the solution. Similar results were observed for other problems. 
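The inverse direction, in which the network inputs are adjusted while all weights stay fixed, can be sketched in the same style. This toy example is not one of the paper's test problems: the weights, the step size, and the names `invert_alpha`, `w`, `phi`, and `rate` are assumptions for illustration, and the example is deliberately chosen so that the answer is unique.

```python
def invert_alpha(w, phi, b, alpha, rate=0.2, iterations=200):
    """Gradient descent on the network inputs alpha with all weights fixed.

    Model output: b_hat[i] = sum_e alpha[e] * sum_j w[e][i][j] * phi[j]
    Error:        E = 0.5 * sum_i (b_hat[i] - b[i])**2
    """
    alpha = list(alpha)
    n = len(phi)
    # s[e][i] = sum_j w[e][i][j] * phi[j] stays constant during the iteration.
    s = [[sum(we[i][j] * phi[j] for j in range(n)) for i in range(n)]
         for we in w]
    for _ in range(iterations):
        residual = [sum(alpha[e] * s[e][i] for e in range(len(alpha))) - b[i]
                    for i in range(n)]
        grad = [sum(residual[i] * s[e][i] for i in range(n))
                for e in range(len(alpha))]
        alpha = [a - rate * g for a, g in zip(alpha, grad)]
    return alpha


w = [
    [[2, -2, 0], [-2, 2, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 2, -2], [0, -2, 2]],
]
alpha = invert_alpha(w, phi=[0.0, 0.5, 1.0], b=[-1.0, 0.0, 1.0],
                     alpha=[3.0, 0.2])
print(alpha)
```

The gradient with respect to each input is again a plain sum of products, so the global matrix never has to be inverted during the iteration.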

V. DISCUSSION AND CONCLUSION 

The FENN is closely related to the finite-element model used to solve differential equations. The FENN architecture has a weight structure that allows both the forward and inverse problems to be solved using simple gradient-based algorithms. Initial results indicate that the proposed FENN algorithm is capable of accurately solving both the forward and inverse problems. In addition, the forward problem solution from the FENN is seen to exactly match the FEM solution, indicating that the FENN represents the finite-element model exactly in a parallel configuration. 
The major advantage of the FENN is that it represents the finite-element model in a parallel form, enabling parallel implementation in either hardware or software. Further, computing gradients in the FENN is very simple. This is an advantage in solving both forward and inverse problems using gradient-based methods. The gradients can also be computed in parallel and the lack of nonlinearities in the neuron activation functions makes the computation of gradients simpler. A major advantage of this approach for solving inverse problems is that it avoids inverting the global matrix in each iteration. The FENN also does not require any training, since most of its weights can be computed in advance and stored. The weights depend on the governing differential equation and its associated boundary conditions, and as long as these two factors do not change, the weights do not change. This is especially an advantage in solving inverse problems in electromagnetic NDE. This approach also reduces the computational effort associated with the network. 

Future work will concentrate on applying the FENN to 3-D electromagnetic NDE problems. The robustness of the approach will also be tested, since the ability of these approaches to invert practical noisy measurements is important. Furthermore, the use of better optimization algorithms, like conjugate gradient methods, is expected to improve the solution speed. In addition, parallel implementation of the FENN in both hardware and software is under investigation. The approach described in this paper is very general in that it can be applied to a variety of inverse problems in fields other than electromagnetic NDE. Some of these other applications will also be investigated to show the general nature of the proposed method. 

REFERENCES 

[1] L. Udpa and S. S. Udpa, Application of signal processing and pattern recognition techniques to inverse problems in NDE, Int. J. Appl. Electromagn. Mechan., vol. 8, pp. 99-117, 1997. 
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. Sacks, Iterative algorithms for electromagnetic NDE signal inversion, in ENDE 97, Reggio Calabria, Italy, Sep. 14-16, 1997. 
[3] J. Jin, The Finite Element Method in Electromagnetics. New York: Wiley, 1993. 
[4] P. Zhou, Numerical Analysis of Electromagnetic Fields. Berlin, Germany: Springer-Verlag, 1993. 
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1994. 
[6] C. A. Jensen et al., Inversion of feedforward neural networks: algorithms and applications, Proc. IEEE, vol. 87, no. 9, pp. 1536-1549, 1999. 
[7] P. Ramuhalli, L. Udpa, and S. Udpa, Neural network algorithm for electromagnetic NDE signal inversion, in ENDE 2000, Budapest, Hungary, Jun. 2000. 
[8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr., and A. P. Ewing, Automation of SQUID nondestructive evaluation of steel plates by neural networks, IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3475-3478, 1999. 
[9] W. Qing, S. Xueqin, Y. Qingxin, and Y. Weili, Using wavelet neural networks for the optimal design of electromagnetic devices, IEEE Trans. Magn., vol. 33, no. 2, pp. 1928-1930, 1997. 
[10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 987-1000, 1998. 
[11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, Neural-network methods for boundary value problems with irregular boundaries, IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1041-1049, 2000. 
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez, Neural network differential equation and plasma equilibrium solver, Phys. Rev. Lett., vol. 75, no. 20, pp. 3594-3597, 1995. 
[13] M. W. M. G. Dissanayake and N. Phan-Thien, Neural-network-based approximations for solving partial differential equations, Commun. Numer. Meth. Eng., vol. 10, pp. 195-201, 1994. 
[14] R. Masuoka, Neural networks learning differential data, IEICE Trans. Inform. Syst., vol. E83-D, no. 6, pp. 1291-1300, 2000. 
[15] D. C. Youla, Generalized image restoration by the method of alternating orthogonal projections, IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 694-702, 1978. 
[16] D. C. Youla and H. Webb, Image restoration by the method of convex projections: part I-theory, IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 81-94, 1982. 
[17] A. Lent and H. Tuy, An iterative method for the extrapolation of band-limited functions, J. Math. Analysis and Applicat., vol. 83, pp. 554-565, 1981. 
[18] W. Chen, A new extrapolation algorithm for band-limited signals using the regularization method, IEEE Trans. Signal Process., vol. 41, no. 3, pp. 1048-1060, 1993. 
[19] J. Takeuchi and Y. Kosugi, Neural network representation of the finite element method, Neural Netw., vol. 7, no. 2, pp. 389-395, 1994. 
[20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady, Artificial neural network application for material evaluation by electromagnetic methods, in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4027-4032. 
[21] G. Xu, G. Littlefair, R. Penson, and R. Callan, Application of FE-based neural networks to dynamic problems, in Proc. Int. Conf. Neural Information Processing, vol. 3, 1999, pp. 1039-1044. 
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu, Finite element analysis-based Hopfield neural network model for solving nonlinear electromagnetic field problems, in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4399-4403. 
[23] H. Lee and I. S. Kang, Neural algorithm for solving differential equations, J. Computat. Phys., vol. 91, pp. 110-131, 1990. 
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control, IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885-897, 1999. 
[25] R. K. Mishra and P. S. Hall, NFDTD concept, IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484-490, 2005. 
[26] D. G. Triantafyllidis and D. P. Labridis, A finite-element mesh generator based on growing neural networks, IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482-1496, 2002. 
<|endoftext|>


<|startoftext|>
Floating Point Operations in Matrix-Vector Calculus 
(Version 1.3) 
Raphael Hunger 
Technical Report 2007 

Technische Universität München, Associate Institute for Signal Processing 
Univ.-Prof. Dr.-Ing. Wolfgang Utschick 


History 
Version 1.00: October 2005 
- Initial version 
Version 1.01: 2006 
- Rewrite of sesquilinear form with a reduced amount of FLOPs 
- Several typos fixed concerning the number of FLOPs required for the Cholesky decomposition 
Version 1.2: November 2006 
- Conditions for the existence of the standard <<FORMULA>> Cholesky decomposition specified (positive definiteness) 
- Outer product version of <<FORMULA>> Cholesky decomposition removed 
- FLOPs required in Gaxpy version of <<FORMULA>> Cholesky decomposition updated 
- <<FORMULA>> Cholesky decomposition added 
- Matrix-matrix product LC added with L triangular 
- Matrix-matrix product <<FORMULA>>C added with L triangular and <<FORMULA>> not known a priori 
- Inverse $L_1^{-1}$ of a lower triangular matrix with ones on the main diagonal added 
Version 1.3: September 2007 
- First globally accessible document version 
ToDo: (unknown when) 
- QR-Decomposition 
- LR-Decomposition 
Please report any bug and suggestion to hunger@tum.de 

Contents 
1. Introduction 
2. Flop Counting 
2.1 Matrix Products 
2.1.1 Scalar-Vector Multiplication $\alpha a$ 
2.1.2 Scalar-Matrix Multiplication $\alpha A$ 
2.1.3 Inner Product $a^H b$ of Two Vectors 
2.1.4 Outer Product $a c^H$ of Two Vectors 
2.1.5 Matrix-Vector Product Ab 
2.1.6 Matrix-Matrix Product AC 
2.1.7 Matrix Diagonal Matrix Product AD 
2.1.8 Matrix-Matrix Product LD 
2.1.9 Matrix-Matrix Product $L_1 D$ 
2.1.10 Matrix-Matrix Product LC with L Lower Triangular 
2.1.11 Gram $A^H A$ of A 
2.1.12 Squared Frobenius Norm $\|A\|_F^2 = \mathrm{tr}(A^H A)$ 
2.1.13 Sesquilinear Form $c^H A b$ 
2.1.14 Hermitian Form $a^H R a$ 
2.1.15 Gram $L^H L$ of a Lower Triangular Matrix L 
2.2 Decompositions 
2.2.1 Cholesky Decomposition R = <<FORMULA>> (Gaxpy Version) 
2.2.2 Cholesky Decomposition $R = L_1 D L_1^H$ 
2.3 Inverses of Matrices 
2.3.1 Inverse <<FORMULA>> of a Lower Triangular Matrix L 
2.3.2 Inverse $L_1^{-1}$ of a Lower Triangular Matrix $L_1$ with Ones on the Main Diagonal 
2.3.3 Inverse $R^{-1}$ of a Positive Definite Matrix R 
2.4 Solving Systems of Equations 
2.4.1 Product <<FORMULA>>C with <<FORMULA>> not known a priori 
3. Overview 
Appendix 
Bibliography 

1. Introduction 
For the design of efficient and low-complexity algorithms in many signal-processing tasks, a detailed analysis of the required number of floating-point operations (FLOPs) is often inevitable. Most frequently, matrix operations are involved, such as matrix-matrix products and inverses of matrices. Structures like Hermiteness or triangularity, for example, can be exploited to reduce the number of needed FLOPs and will be discussed here. In this technical report, we derive expressions for the number of multiplications and summations that a majority of signal processing algorithms in mobile communications bring with them. 
Acknowledgments: 
The author would like to thank Dipl.-Ing. David A. Schmidt and Dipl.-Ing. Guido Dietl for the fruitful discussions on this topic. 

2. Flop Counting 
In this chapter, we offer expressions for the number of complex multiplications and summations required for several matrix-vector operations. A floating-point operation (FLOP) is assumed to be either a complex multiplication or a complex summation here, despite the fact that a complex multiplication requires 4 real multiplications and 2 real summations whereas a complex summation consists of only 2 real summations, making a multiplication more expensive than a summation. However, we count each operation as one FLOP. 
Throughout this report, we assume <<FORMULA>> to be a scalar, the vectors <<FORMULA>>, and <<FORMULA>> to have dimension N, N, and M, respectively. The matrices <<FORMULA>>, and <<FORMULA>> are assumed to have no special structure, whereas <<FORMULA>> is Hermitian and <<FORMULA>> is diagonal. L is a lower triangular <<FORMULA>> matrix, and $e_n$ denotes the unit vector with a 1 in the n-th row and zeros elsewhere. Its dimensionality is chosen such that the respective matrix-vector product exists. Finally, $[A]_{a,b}$ denotes the element in the a-th row and b-th column of A, <<FORMULA>> selects the submatrix of A consisting of rows a to b and columns c to d. $0_{a \times b}$ is the $a \times b$ zero matrix. Transposition, Hermitian transposition, conjugate, and real-part operator are denoted by <<FORMULA>>, and <<FORMULA>>, respectively, and require no FLOP. 
2.1 Matrix Products 
Frequently arising matrix products and the amount of FLOPs required for their computation will be discussed in this section. 
2.1.1 Scalar-Vector Multiplication <<FORMULA>> 
A simple multiplication $\alpha a$ of a vector a with a scalar <<FORMULA>> requires N multiplications and no summation. 

2.1.2 Scalar-Matrix Multiplication <<FORMULA>>
Extending the result from Subsection 2.1.1 to a scalar matrix multiplication <<FORMULA>> requires NM multiplications and again no summation. 

2.1.3 Inner Product $a^H b$ of Two Vectors 
An inner product $a^H b$ requires N multiplications and <<FORMULA>> summations, i.e., <<FORMULA>> FLOPs. 
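The count can be checked with a small instrumented routine. The sketch below is an illustration only; the function name `inner_product_counted` is introduced here. It treats the first product as the start of the accumulator, so a length-N inner product uses N multiplications and N - 1 summations, which is the usual counting argument.

```python
def inner_product_counted(a, b):
    """Return (a^H b, multiplications, summations) for equal-length vectors."""
    assert len(a) == len(b) and len(a) > 0
    acc = a[0].conjugate() * b[0]       # the first product starts the accumulator
    mults, sums = 1, 0
    for x, y in zip(a[1:], b[1:]):
        acc += x.conjugate() * y        # one multiplication and one summation
        mults += 1
        sums += 1
    return acc, mults, sums


value, mults, sums = inner_product_counted([1 + 2j, 3j, 2], [2, 1 - 1j, 1j])
print(value, mults, sums, mults + sums)
```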

2.1.4 Outer Product <<FORMULA>> of Two Vectors 
An outer product $a c^H$ requires NM multiplications and no summation. 

2.1.5 Matrix-Vector Product <<FORMULA>> 
Computing Ab corresponds to applying the inner product rule <<FORMULA>> from Subsection 2.1.3 M times. Obviously, <<FORMULA>> and <<FORMULA>> represents the i-th row of A. Hence, its computation costs MN multiplications and <<FORMULA>> summations, i.e., <<FORMULA>> FLOPs. 
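As a cross-check of this rule, the following sketch counts operations while forming the product row by row. The name `matvec_counted` is introduced here for illustration; the count of M(N - 1) summations follows from applying N - 1 summations in each of the M rows.

```python
def matvec_counted(A, b):
    """Return (A b, multiplications, summations) for an M x N matrix A."""
    mults = sums = 0
    out = []
    for row in A:                       # one inner-product step per row
        acc = row[0] * b[0]
        mults += 1
        for j in range(1, len(b)):
            acc += row[j] * b[j]
            mults += 1
            sums += 1
        out.append(acc)
    return out, mults, sums


A = [[1, 2, 3], [4, 5, 6]]              # M = 2, N = 3
result, mults, sums = matvec_counted(A, [1, 0, -1])
print(result, mults, sums)
```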

2.1.6 Matrix-Matrix Product <<FORMULA>> 
Repeated application of the matrix-vector rule $A c_i$ from Subsection 2.1.5 with $c_i$ being the i-th column of C yields the overall matrix-matrix product AC. Since <<FORMULA>>, the matrix-matrix product has the L-fold complexity of the matrix-vector product. Thus, it needs MNL multiplications and <<FORMULA>> summations, altogether <<FORMULA>> FLOPs. 
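The L-fold scaling can be written as a closed-form helper. This is a sketch with a hypothetical function name, `matmul_flops`; the summation count ML(N - 1) is obtained by multiplying the matrix-vector count M(N - 1) by the number of columns L.

```python
def matmul_flops(M, N, L):
    """FLOPs for an (M x N) times (N x L) product built from L matrix-vector products."""
    mults = M * N * L
    sums = M * L * (N - 1)
    return mults, sums, mults + sums


# Square case: the total grows roughly as 2 * n**3.
for n in (2, 4, 8):
    print(n, matmul_flops(n, n, n))
```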

2.1.7 Matrix Diagonal Matrix Product AD 
If the right hand side matrix D of the matrix product AD is diagonal, the computational load reduces to M multiplications for each of the N columns of A, since the n-th column of A is scaled by the n-th main diagonal element of D. Thus, MN multiplications in total are required for the computation of AD, no summations are needed. 

2.1.8 Matrix-Matrix Product LD 
When multiplying a lower triangular matrix L by a diagonal matrix D, column n of the matrix product requires <<FORMULA>> multiplications and no summations. With $n = 1, \ldots, N$, we get <<FORMULA>> multiplications.

2.1.9 Matrix-Matrix Product L1D 
When multiplying a lower triangular matrix $L_1$ with ones on the main diagonal by a diagonal matrix D, column n of the matrix product requires <<FORMULA>> multiplications and no summations. With $n = 1, \ldots, N$, we get <<FORMULA>> multiplications. 
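The two triangular cases of Subsections 2.1.8 and 2.1.9 can be compared with a short counting loop. The sketch assumes that only the potentially nonzero entries of each column are multiplied (rows n to N of column n) and that a unit diagonal entry needs no multiplication; the function name is introduced here for illustration.

```python
def triangular_times_diagonal_mults(N, unit_diagonal=False):
    """Count multiplications for L D with L lower triangular (N x N), D diagonal."""
    mults = 0
    for n in range(1, N + 1):           # column n, 1-based as in the text
        nonzero_rows = N - n + 1        # rows n..N of column n can be nonzero
        # With ones on the main diagonal, the diagonal entry is simply D[n][n].
        mults += nonzero_rows - 1 if unit_diagonal else nonzero_rows
    return mults


N = 5
print(triangular_times_diagonal_mults(N), N * (N + 1) // 2)
print(triangular_times_diagonal_mults(N, unit_diagonal=True), N * (N - 1) // 2)
```

Under these assumptions the totals are N(N + 1)/2 for a general lower triangular matrix and N(N - 1)/2 when the main diagonal consists of ones.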

2.1.10 Matrix-Matrix Product LC with L Lower Triangular 
Computing the product of a lower triangular matrix <<FORMULA>> and <<FORMULA>> is done column-wise. The n-th element in each column of LC requires n multiplications and <<FORMULA>> summations, so the complete column needs <<FORMULA>> multiplications and <<FORMULA>> summations. The complete matrix-matrix product is obtained from computing L columns. We have <<FORMULA>> multiplications and <<FORMULA>> summations, yielding a total amount of <<FORMULA>> FLOPs.
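A brute-force count illustrates the column-wise argument. The sketch assumes n - 1 summations for the n-th element (one fewer than its n multiplications); the function name and the argument name `num_columns` are introduced here for illustration.

```python
def lower_times_matrix_flops(N, num_columns):
    """Count FLOPs for L C with L lower triangular (N x N) and C of size N x num_columns."""
    mults = sums = 0
    for _ in range(num_columns):
        for n in range(1, N + 1):       # row n of L has n potentially nonzero entries
            mults += n
            sums += n - 1
    return mults, sums, mults + sums


N, num_columns = 4, 3
mults, sums, total = lower_times_matrix_flops(N, num_columns)
print(mults, sums, total, N * N * num_columns)
```

Under this assumption the multiplications and summations per column add up to N squared, so the total is N squared times the number of columns.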

2.1.11 Gram <<FORMULA>> of A 
In contrast to the general matrix product from Subsection 2.1.6, we can make use of the Hermitian structure of the product <<FORMULA>>. Hence, the strictly lower triangular part of <<FORMULA>> need not be computed, since it corresponds to the Hermitian of the strictly upper triangular part. For this reason, we have to compute only the N main diagonal entries of $A^H A$ and the $\frac{N^2 - N}{2}$ upper off-diagonal elements, so only <<FORMULA>> different entries have to be evaluated. Each element requires an inner product step from Subsection 2.1.3 costing M multiplications and <<FORMULA>> summations. Therefore, <<FORMULA>> multiplications and <<FORMULA>> summations are needed, making up a total amount of <<FORMULA>> FLOPs. 
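The saving from the Hermitian structure can be made concrete with a routine that evaluates only the upper triangle and mirrors it. This is a sketch with a hypothetical function name, `gram`; it counts M multiplications and M - 1 summations per evaluated entry and no FLOP for the conjugated copy.

```python
def gram(A):
    """Return (G, multiplications, summations) with G = A^H A for an M x N matrix A."""
    M, N = len(A), len(A[0])
    G = [[0j] * N for _ in range(N)]
    mults = sums = 0
    for i in range(N):
        for j in range(i, N):           # main diagonal and upper triangle only
            acc = A[0][i].conjugate() * A[0][j]
            mults += 1
            for m in range(1, M):
                acc += A[m][i].conjugate() * A[m][j]
                mults += 1
                sums += 1
            G[i][j] = acc
            if j > i:
                G[j][i] = acc.conjugate()   # mirrored entry, no FLOP counted
    return G, mults, sums


A = [[1, 1j], [2, 1], [0, 3]]           # M = 3, N = 2
G, mults, sums = gram(A)
print(G, mults, sums)
```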


2.1.12 Squared Frobenius Norm <<FORMULA>> 
The squared Hilbert-Schmidt norm <<FORMULA>> follows from summing up the MN squared entries from A. We therefore have MN multiplications and <<FORMULA>> summations, yielding a total of <<FORMULA>> FLOPs. 

2.1.13 Sesquilinear Form <<FORMULA>> 
The sesquilinear form $c^H A b$ should be evaluated by computing the matrix-vector product Ab in a first step and then multiplying with the row vector $c^H$ from the left hand side. The matrix vector product requires MN multiplications and <<FORMULA>> summations, whereas the inner product needs M multiplications and <<FORMULA>> summations. Altogether, <<FORMULA>> multiplications and <<FORMULA>> summations have to be computed for the sesquilinear form <<FORMULA>>, yielding a total number of <<FORMULA>> FLOPs. 
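The two-step evaluation can be tallied by adding the counts of its parts. The helper below is an illustrative sketch with a hypothetical name; it combines the matrix-vector count (MN multiplications, M(N - 1) summations) with an inner product of length M (M multiplications, M - 1 summations).

```python
def sesquilinear_flops(M, N):
    """FLOPs for c^H (A b) with A of size M x N: matrix-vector product, then inner product."""
    matvec_mults, matvec_sums = M * N, M * (N - 1)
    inner_mults, inner_sums = M, M - 1
    mults = matvec_mults + inner_mults
    sums = matvec_sums + inner_sums
    return mults, sums, mults + sums


print(sesquilinear_flops(3, 4))
```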

2.1.14 Hermitian Form $a^H R a$ 
With the Hermitian matrix <<FORMULA>>, the product <<FORMULA>> can be expressed as 

<<FORMULA>>

with <<FORMULA>>, and <<FORMULA>>. The first sum accumulates the weighted main diagonal entries and requires 2N multiplications and <<FORMULA>> summations. The second part of (2.1) accumulates all weighted off-diagonal entries from A. The last two summations sum up 2 terms. Consequently, the second part of (2.1) requires <<FORMULA>> summations and <<FORMULA>> products. Finally, the two parts have to be added accounting for an additional summation and yielding an overall amount of <<FORMULA>> products and <<FORMULA>> summations, corresponding to <<FORMULA>> FLOPs. 

Notes to Subsection 2.1.14: 
- We made use of (A1) in the Appendix for the computation of the last sum accumulating subsequent integers. 
- We do not exploit the fact that only real-valued summands are accumulated, as we only account for complex FLOPs. 
- The scaling with the factor 2 does not require a FLOP, as it can be implemented by a simple bit shift. 
- Clearly, if <<FORMULA>>, we have to subtract one summation from the calculation since no off-diagonal entries exist. 

2.1.15 Gram <<FORMULA>> of a Lower Triangular Matrix L 
During the computation of the inverse of a positive definite matrix, the Gram matrix of a lower triangular matrix occurs when Cholesky decomposition is applied. Again, we make use of the Hermitian structure of the Gram <<FORMULA>>, so only the main diagonal entries and the upper right off-diagonal entries of the product have to be evaluated. The a-th main-diagonal entry can be expressed as 

<<FORMULA>>      (2.2) 

with <<FORMULA>>, requiring <<FORMULA>> multiplications and <<FORMULA>> summations. Hence, all main diagonal elements need <<FORMULA>> multiplications and 
<<FORMULA>> summations. The upper right off-diagonal entry <<FORMULA>> in row a and column b with <<FORMULA>> reads as 

<<FORMULA>>,    (2.3) 
 
again accounting for <<FORMULA>> multiplications and <<FORMULA>> summations. These two expressions have to be summed up over all <<FORMULA>> and <<FORMULA>>, and for the number of multiplications, we find 

<<FORMULA>>     (2.4) 

Again, we made use of (A1) for the sum of subsequent integers and (A2) for the sum of subsequent squared integers. For the number of summations, we evaluate 

<<FORMULA>>

Computing all necessary elements of the Gram L^H L thereby requires <<FORMULA>> multiplications and <<FORMULA>> summations. Altogether, <<FORMULA>> FLOPs result. The same result of course holds for the Gram of two upper triangular matrices. 
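The counting argument can be reproduced by brute force. The sketch below loops over the main diagonal and upper right entries of L^H L and counts only the products of entries that are not structurally zero. The function name is hypothetical, and the closed forms mentioned afterwards are our own derivation from (A1) and (A2), since the original formulas are not available here.

```python
def gram_lower_triangular_flops(N):
    """Count operations for the upper triangle (incl. diagonal) of L^H L."""
    mul = add = 0
    for a in range(1, N + 1):          # row index of the Gram matrix
        for b in range(a, N + 1):      # column index, b >= a (Hermitian structure)
            # Entry (a, b) is the sum over n = b..N of conj(l[n][a]) * l[n][b],
            # because l[n][b] vanishes for n < b in a lower triangular L.
            terms = N - b + 1
            mul += terms
            add += terms - 1
    return mul, add


print(gram_lower_triangular_flops(3))  # (10, 4)
```

For every N we tried, the loop agrees with N(N+1)(N+2)/6 multiplications and (N-1)N(N+1)/6 summations.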

2.2 Decompositions 

2.2.1 Cholesky Decomposition <<FORMULA>> (Gaxpy Version) 
Instead of computing the inverse of a positive definite matrix R directly, it is more efficient to start with the Cholesky decomposition <<FORMULA>> and then invert the lower triangular matrix L and compute its Gram. In this section, we count the number of FLOPs necessary for the Cholesky decomposition. 

The implementation of the Generalized Ax plus y (Gaxpy) version of the Cholesky decomposition, which overwrites the lower triangular part of the positive definite matrix R, is listed in Algorithm 2.1, see [1]. Note that R needs to be positive definite for the <<FORMULA>> decomposition! 

Algorithm 2.1 Algorithm for the Gaxpy version of the Cholesky decomposition. 

<<ALGORITHM>>

The computation of the first column of L in Line 1 of Algorithm 2.1 requires <<FORMULA>> multiplications, a single square-root operation, and no summations. Column <<FORMULA>> takes a matrix-vector product of dimension <<FORMULA>> which is subtracted from another <<FORMULA>> dimensional vector involving <<FORMULA>> summations, see Line 3. Finally, <<FORMULA>> multiplications and a single square-root operation are necessary in Line 4. In short, row n with <<FORMULA>> needs <<FORMULA>> multiplications, <<FORMULA>> summations (see Subsection 2.1.5), and one square-root operation, which we classify as an additional FLOP. Summing up the multiplications for rows <<FORMULA>>, we obtain 
<<FORMULA>> The number of summations for rows <<FORMULA>> reads as  

<<FORMULA>>         (2.6)  

<<FORMULA>>              (2.7)  

The first element need not be computed twice, since the result of the division is the square root of the denominator. 
Again, the first element need not be computed twice, since the result of the division is the square root of the denominator. 

Algorithm 2.2 Algorithm for the Cholesky decomposition <<FORMULA>>

<<ALGORITHM>>

and finally, <<FORMULA>> square-root operations are needed for the <<FORMULA>> rows. Including the <<FORMULA>> multiplications for column <<FORMULA>> and the additional square root operation, <<FORMULA>> multiplications, <<FORMULA>> summations, and N square-root operations occur, 
<<FORMULA>> FLOPs in total. 
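For orientation, a column-oriented Cholesky factorization can be sketched as follows. This is a textbook-style illustration in pure Python, not a transcription of Algorithm 2.1: it returns a new matrix instead of overwriting R, and the function name is hypothetical. Each column is obtained from a matrix-vector product with the previously computed columns, followed by a scaling with the square root of the pivot.

```python
import math


def cholesky_lower(R):
    """Return lower triangular L with L L^H = R for a Hermitian positive definite R."""
    N = len(R)
    L = [[0j] * N for _ in range(N)]
    for n in range(N):
        # Column n: subtract the contribution of the already computed columns
        # (a matrix-vector product), then scale by the square root of the pivot.
        v = [R[i][n] - sum(L[i][k] * L[n][k].conjugate() for k in range(n))
             for i in range(n, N)]
        pivot = v[0].real
        if pivot <= 0:
            raise ValueError("matrix is not positive definite")
        root = math.sqrt(pivot)
        L[n][n] = complex(root)
        for offset in range(1, N - n):
            L[n + offset][n] = v[offset] / root
    return L


L = cholesky_lower([[4, 2j], [-2j, 5]])
print(L)
```

For the 2 x 2 example, the factor has the diagonal entries 2 and 2 and the off-diagonal entry -j.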

2.2.2 Cholesky Decomposition <<FORMULA>> 
The main advantage of the <<FORMULA>> decomposition compared to the standard <<FORMULA>> decomposition is that no square root operations are needed, which may require more than one FLOP depending on the given hardware platform. Another benefit of the <<FORMULA>> decomposition is that it does not require a positive definite matrix R; the only two conditions for the unique existence are that R is Hermitian and that all but the last principal minor (i.e., the determinant) of R are different from zero [2]. Hence, R may also be rank deficient to a certain degree. If R is not positive semidefinite, then D may contain negative main diagonal entries. 
The outcome of the decomposition is a lower triangular matrix L1 with ones on the main diagonal and a diagonal matrix D. 
Algorithm 2.2 overwrites the strictly lower left part of the matrix R with the strictly lower part of L1 (i.e., without the ones on the main diagonal) and overwrites the main diagonal of R with the main diagonal of D. It is taken from [1] and slightly modified, such that it is also applicable to complex matrices (see the conjugate in Line 4) and no existing scalar has to be re-computed (see the case distinction in Line 4 for i = 1). 
Line 1 needs <<FORMULA>> multiplications. Lines 3 to 5 require <<FORMULA>> multiplications and are executed for <<FORMULA>>, yielding <<FORMULA>> multiplications. Line 6 takes <<FORMULA>> multiplications and <<FORMULA>> summations, again with n = 2,...,N, yielding sum_{n=2}^{N} (<<FORMULA>>) = <<FORMULA>> multiplications and the same amount of summations. Line 7 does not require any FLOP. In Line 8, the matrix-vector product needs <<FORMULA>> multiplications, and additional <<FORMULA>> multiplications arise when the complete numerator is divided by the denominator. Hence, we have <<FORMULA>> multiplications. For <<FORMULA>> we get <<FORMULA>> multiplications.
The number of summations in Line 8 is <<FORMULA>> for the matrix vector product and <<FORMULA>> for the subtraction in the numerator. Together, we have <<FORMULA>> summations. With 
<<FORMULA>> summations. Summing up, this algorithm requires <<FORMULA>> multiplications, and <<FORMULA>> summations, yielding a total amount of <<FORMULA>> FLOPs. (Note that this formula is also valid for N =1.) 
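To make the structure of this decomposition concrete, the following sketch factors a Hermitian matrix into a unit lower triangular factor and a real diagonal without any square root. It is a textbook-style illustration in pure Python and not a transcription of Algorithm 2.2: it does not overwrite R, and the names `ldl_decomposition`, `L1`, and `d` are chosen here for illustration only.

```python
def ldl_decomposition(R):
    """Return (L1, d) with R = L1 diag(d) L1^H and L1 unit lower triangular."""
    N = len(R)
    L1 = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
    d = [0] * N
    for n in range(N):
        # v[k] = d[k] * conj(L1[n][k]) is reused for every entry of column n.
        v = [d[k] * complex(L1[n][k]).conjugate() for k in range(n)]
        d[n] = (R[n][n] - sum(L1[n][k] * v[k] for k in range(n))).real
        if d[n] == 0:
            raise ZeroDivisionError("a leading principal minor vanishes")
        for i in range(n + 1, N):
            L1[i][n] = (R[i][n] - sum(L1[i][k] * v[k] for k in range(n))) / d[n]
    return L1, d


# An indefinite Hermitian matrix: the diagonal factor contains a negative entry.
L1, d = ldl_decomposition([[1, 2], [2, 1]])
print(d)
```

The indefinite example illustrates the remark that D may contain negative main diagonal entries when R is not positive semidefinite.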

2.3 Inverses of Matrices 

2.3.1 Inverse <<FORMULA>> of a Lower Triangular Matrix L 
Let <<FORMULA>> denote the inverse of a lower triangular matrix L. Then, X is again lower triangular which means that <<FORMULA>> for <<FORMULA>>. The following equation holds: 

<<FORMULA>>.        (2.8) 

Via forward substitution, above system can easily be solved. Row <<FORMULA>> from (2.8) can be expressed as 

<<FORMULA>>,    (2.9) 

with <<FORMULA>> denoting the Kronecker delta which vanishes for <<FORMULA>>, and <<FORMULA>>. Starting from <<FORMULA>>, the xb,n are computed successively, and we find 

<<FORMULA>>          (2.10)

with all <<FORMULA>> having been computed in previous steps. Hence, if <<FORMULA>> and a single multiplication is required, no summations are needed. For <<FORMULA>> multiplications and <<FORMULA>> summations are required, as the Kronecker-delta vanishes. All main diagonal entries can be computed by means of N multiplications. The lower left off-diagonal entries 
Actually, it is a division rather than a multiplication. 

require 

<<FORMULA>>           (2.11)  
 
multiplications, and  

<<FORMULA>>             (2.12)  

summations. Including the N multiplications for the main-diagonal entries, <<FORMULA>> multiplications and <<FORMULA>> summations have to be implemented, yielding a total amount 
<<FORMULA>> FLOPs. 
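The forward substitution described above can be sketched in a few lines of pure Python. The function name is hypothetical, and the code uses zero-based indices, whereas the text counts rows and columns from one. Each column of X starts with a division for the diagonal entry and then proceeds downwards, reusing the entries computed in previous steps.

```python
def invert_lower_triangular(L):
    """Return X = L^{-1} for a lower triangular L with nonzero diagonal."""
    N = len(L)
    X = [[0] * N for _ in range(N)]
    for n in range(N):                 # column n of X
        X[n][n] = 1 / L[n][n]          # diagonal entry: a single division
        for b in range(n + 1, N):      # entries below the diagonal
            # Row b of L X = I gives sum_{a=n}^{b} l[b][a] x[a][n] = 0 for b > n.
            acc = sum(L[b][a] * X[a][n] for a in range(n, b))
            X[b][n] = -acc / L[b][b]
    return X


print(invert_lower_triangular([[2, 0], [1, 4]]))  # [[0.5, 0], [-0.125, 0.25]]
```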

2.3.2 Inverse <<FORMULA>> of a Lower Triangular Matrix L1 with Ones on the Main Diagonal 
The inverse of a lower triangular matrix L1 turns out to require N^2 FLOPs fewer than the inverse of L with arbitrary nonzero diagonal elements. Let X denote the inverse of L1. Clearly, X is again a lower triangular matrix with ones on the main diagonal. We can exploit this fact in order to compute only the unknown entries. 
The mth row and nth column of the system of equations <<FORMULA>> with <<FORMULA>> reads as

<<FORMULA>>

or, equivalently, 

<<FORMULA>>

Hence, X is computed via forward substitution. To compute <<FORMULA>>, we need <<FORMULA>> multiplications and <<FORMULA>> summations. Remember that <<FORMULA>>. The total number of multiplications/summations is obtained from 

<<FORMULA>>         (2.13)

We only have to consider <<FORMULA>>, since the equations resulting from m<n +1 are automatically fulfilled due to the structure of L1 and X. 

Summing up, <<FORMULA>> FLOPs are needed. 

2.3.3 Inverse R^{-1} of a Positive Definite Matrix R 
The inverse of a matrix can for example be computed via Gaussian elimination [1]. However, this approach is computationally expensive and does not exploit the Hermitian structure of R. Instead, it is more efficient to start with the Cholesky decomposition of <<FORMULA>> (see Subsection 2.2.1), invert the lower triangular matrix L (see Subsection 2.3.1), and then build the Gram <<FORMULA>> of <<FORMULA>> (see Subsection 2.1.15). Summing up the respective number of operations, this procedure requires <<FORMULA>> multiplications, <<FORMULA>> summations, and N square-root operations, which yields a total amount of <<FORMULA>> FLOPs. 

2.4 Solving Systems of Equations

2.4.1 Product <<FORMULA>> with <<FORMULA>> not known a priori.
A naive way of computing the solution <<FORMULA>> of the equation <<FORMULA>> is to find <<FORMULA>> first and afterwards multiply it by C. This approach needs <<FORMULA>> FLOPs as shown in Sections 2.3.1 and 2.1.10. However, doing so is very expensive since we are not interested in the inverse of L in general. Hence, there must be a computationally cheaper variant. Again, forward substitution plays a key role. 
It is easy to see that X can be computed column-wise. Let <<FORMULA>> and <<FORMULA>>. Then, from <<FORMULA>>, we get for the element xb,a in row b and column a of X: 

<<FORMULA>>

Its computation requires b multiplications and <<FORMULA>> summations. A complete column of X can therefore be computed with <<FORMULA>> multiplications and <<FORMULA>> summations. The complete matrix X with L columns thus needs <<FORMULA>> FLOPs, so the forward substitution saves <<FORMULA>> FLOPs compared to the direct inversion of L and a subsequent matrix-matrix product. Interestingly, computing <<FORMULA>> with <<FORMULA>> unknown is as expensive as computing LC, see Section 2.1.10. 
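A minimal sketch of this column-wise forward substitution is given here in pure Python. The function name is hypothetical and the indices are zero-based, so the entry in (one-based) row b uses b - 1 products inside the sum plus one division, which matches the b multiplications stated above.

```python
def forward_substitution(L, C):
    """Solve L X = C for X, where L is lower triangular with nonzero diagonal."""
    N = len(L)
    cols = len(C[0])
    X = [[0] * cols for _ in range(N)]
    for a in range(cols):              # the columns of X are independent
        for b in range(N):             # rows, from top to bottom
            acc = sum(L[b][k] * X[k][a] for k in range(b))
            X[b][a] = (C[b][a] - acc) / L[b][b]
    return X


print(forward_substitution([[2, 0], [1, 4]], [[2], [9]]))  # [[1.0], [2.0]]
```

The inverse of L never has to be formed explicitly, which is where the saving over the naive approach comes from.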

3. Overview 

<<FORMULA>> and <<FORMULA>> are arbitrary matrices. <<FORMULA>> is a diagonal matrix, <<FORMULA>> is lower triangular, <<FORMULA>> is lower triangular with ones on the main diagonal, <<FORMULA>>, and <<FORMULA>> is positive definite. 

<<TABLE>>

Appendix 

A frequently occurring summation in FLOP counting is the sum of subsequent integers. By complete induction, we find  
  
<<FORMULA>>             (A1)

Above result can easily be verified by recognizing that the sum of the n-th and the <<FORMULA>> summand is equal to <<FORMULA>>, and we have <<FORMULA>> such pairs. 
Another sum of relevance is the sum of subsequent squared integers. Again, via complete induction, we find 

<<FORMULA>>          (A2)

Bibliography 
[1] G. H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1991. 
[2] Kh.D. Ikramov and N.V. Saveleva, Conditionally definite Matrices, Journal of Mathematical Sciences, vol. 98, no. 1, pp. 150, 2000. 


<|startoftext|>
                                              Green AI 

                     Roy Schwartz   Jesse Dodge  Noah A. Smith  Oren Etzioni 


                               Allen Institute for AI, Seattle, Washington, USA
                          Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
                            University of Washington, Seattle, Washington, USA


                                                Abstract
                  The computations required for deep learning research have been doubling every few months, resulting in an
                estimated 300,000x increase from 2012 to 2018 [2]. These computations have a surprisingly large carbon footprint
                [40]. Ironically, deep learning was inspired by the human brain, which is remarkably energy efficient. Moreover, the
                financial cost of the computations can make it difficult for academics, students, and researchers, in particular those
                from emerging economies, to engage in deep learning research.
                  This position paper advocates a practical solution by making efficiency an evaluation criterion for research alongside
                accuracy and related measures. In addition, we propose reporting the financial cost or “price tag” of developing,
                training, and running models to provide baselines for the investigation of increasingly efficient methods. Our goal is
                to make AI both greener and more inclusive, enabling any inspired undergraduate with a laptop to write high-quality
                research papers. Green AI is an emerging focus at the Allen Institute for AI.


           1 Introduction and Motivation

           Since 2012, the field of artificial intelligence has reported remarkable progress on a broad range of capabilities
           including object recognition, game playing, machine translation, and more [36]. This progress has been achieved by
           increasingly large and computationally-intensive deep learning models. 1 Figure 1, reproduced from [2], plots training
           cost increase over time for state-of-the-art deep learning models starting with AlexNet in 2012 [20] to AlphaZero in
           2017 [38]. The chart shows an overall increase of 300,000x, with training cost doubling every few months. An even
           sharper trend can be observed in NLP word embedding approaches by looking at ELMo [29] followed by BERT [8],
           openGPT-2 [30], and XLNet [48]. An important paper [40] has estimated the carbon footprint of several NLP models
           and argued that this trend is both environmentally unfriendly (which we refer to as Red AI ) and expensive, raising
           barriers to participation in NLP research.
              This trend is driven by the strong focus of the AI community on obtaining “state-of-the-art” results, 2 as exemplified
           by the rising popularity of leaderboards [46, 45], which typically report accuracy measures but omit any mention of
           cost or efficiency (see, for example, leaderboards.allenai.org). Despite the clear benefits of improving
           model accuracy in AI, the focus on this single metric ignores the economic, environmental, or social cost of reaching
           the reported accuracy.
              We advocate increasing research activity in  Green AI —AI research that is more environmentally friendly and
           inclusive. We emphasize that Red AI  research has been yielding valuable contributions to the ﬁeld of AI, but it’s been
           overly dominant. We want to shift the balance towards the  Green AI  option—to ensure that any inspired undergraduate
           with a laptop has the opportunity to write high-quality papers that could be accepted at premier research conferences.

             1 For brevity, we refer to AI throughout this paper, but our focus is on AI research that relies on deep learning methods.
             2 Meaning, in practice, that a system’s accuracy on some benchmark is greater than any previously reported system’s accuracy.

                                                         <<FIGURE>>

           Figure 1: The amount of compute used to train deep learning models has increased 300,000x in 6 years. Figure taken
           from [2].


           Speciﬁcally, we propose making efﬁciency a more common evaluation criterion for AI papers alongside accuracy and
           related measures.
              AI research can be computationally expensive in a number of ways, but each provides opportunities for efﬁcient
           improvements; for example, papers could be required to plot accuracy as a function of computational cost and of
           training set size, providing a baseline for more data-efﬁcient research in the future. Reporting the computational price
           tag of ﬁnding, training, and running models is a key Green AI  practice (see Equation 1). In addition to providing
           transparency, price tags are baselines that other researchers could improve on.
              Our empirical analysis in Figure 2 suggests that the AI research community has paid relatively little attention to
           computational efﬁciency. In fact, as Figure 1 illustrates, the computational cost of research is increasing exponentially,
           at a pace that far exceeds Moore’s Law [28]. Red AI is on the rise despite the well-known diminishing returns of
           increased cost (e.g., Figure 3). This paper identiﬁes key factors that contribute to  Red AI  and advocates the introduction
           of a simple, easy-to-compute efﬁciency metric that could help make some AI research greener, more inclusive, and
           perhaps more cognitively plausible. Green AI is part of a broader, long-standing interest in environmentally-friendly
           scientific research (e.g., see the journal Green Chemistry). Computer science, in particular, has a long history of
           investigating sustainable and energy-efficient computing (e.g., see the journal Sustainable Computing: Informatics
           and Systems).
              The remainder of this paper is organized as follows. Section 2 analyzes practices that move deep-learning research
           into the realm of  Red AI . Section 3 discusses our proposals for  Green AI. Section 4 considers related work, and we
           conclude with a discussion of directions for future research.


           2 Red AI 

            Red AI refers to AI research that seeks to obtain state-of-the-art results in accuracy (or related measures) through
           the use of massive computational power—essentially “buying” stronger results. Yet the relationship between model
           performance and model complexity (measured as number of parameters or inference time) has long been understood
           to be at best logarithmic; for a linear gain in performance, an exponentially larger model is required [18]. Similar
           trends exist with increasing the quantity of training data [41, 13] and the number of experiments [9]. In each of these
           cases, diminishing returns come at increased computational cost.
              This section analyzes the factors contributing to Red AI and shows how it is resulting in diminishing returns over
           time (see Figure 3). We note again that Red AI work is valuable, and in fact, much of it contributes to what we know
           by pushing the boundaries of AI. Our exposition here is meant to highlight areas where computational expense is high,
           and to present each as an opportunity for developing more efficient techniques.

                                                         <<FIGURE>>

           Figure 2: AI papers tend to target accuracy rather than efficiency. The figure shows the proportion of papers that
           target accuracy, efficiency, both or other from a sample of 60 papers from top AI conferences.
              To demonstrate the prevalence of Red AI , we sampled 60 papers from top AI conferences (ACL, 3 NeurIPS, 4 and
           CVPR 5 ). For each paper we noted whether the authors claim their main contribution to be (a) an improvement to
           accuracy or some related measure, (b) an improvement to efﬁciency, (c) both, or (d) other. As shown in Figure 2, in all
           conferences we considered, a large majority of the papers target accuracy (90% of ACL papers, 80% of NeurIPS papers
           and 75% of CVPR papers). Moreover, for both empirical AI conferences (ACL and CVPR) only a small portion (10%
           and 20% respectively) argue for a new efﬁciency result. 6 This highlights the focus of the AI community on measures
            of performance such as accuracy, at the expense of measures of efﬁciency such as speed or model size. In this paper
            we argue that a larger weight should be given to the latter.
              To better understand the different ways in which AI research can be red, consider an AI result reported in a scientiﬁc
            paper. This result typically includes a model trained on a training dataset and evaluated on a test dataset. The process
            of developing that model often involves multiple experiments to tune its hyperparameters. When considering the
            different factors that increase the computational and environmental cost of producing such a result, three factors come
            to mind: the cost of executing the model on a single (E)xample (either during training or at inference time); the size
           of the training (D)ataset, which controls the number of times the model is executed during training, and the number of
           (H)yperparameter experiments, which controls how many times the model is trained during model development. The
           total cost of producing a (R)esult in machine learning increases linearly with each of these quantities. This cost can
           be estimated as follows:

                                                <<FORMULA>>

           Equation 1: The equation of Red AI : The cost of an AI (R)esult grows linearly with the cost of processing a single
           (E)xample, the size of the training (D)ataset and the number of (H)yperparameter experiments.
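As a small illustration of Equation 1, the following sketch treats the cost of a result as proportional to the product of the three quantities named in the caption. The product form is an assumption consistent with the stated linear growth in each factor; the function name and the numbers are made up for illustration and carry no units.

```python
def relative_result_cost(example_cost, dataset_size, num_experiments):
    """Proportional cost of a result, assuming Cost(R) scales as E * D * H."""
    return example_cost * dataset_size * num_experiments


baseline = relative_result_cost(1.0, 1_000_000, 10)
# Doubling any single factor doubles the estimate; doubling all three gives 8x.
print(relative_result_cost(2.0, 1_000_000, 10) / baseline)   # 2.0
print(relative_result_cost(2.0, 2_000_000, 20) / baseline)   # 8.0
```

Because the estimate is only a proportionality, ratios between two settings are more meaningful than the absolute values.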

              Equation 1 is a simplification (e.g., different hyperparameter assignments can lead to different costs for processing
           a single example). It also ignores other factors such as the number of training epochs. Nonetheless, it illustrates three
           quantities that are each an important factor in the total cost of generating a result. Below, we consider each quantity
           separately.

             6 Interestingly, many NeurIPS papers included convergence rates or regret bounds which describe performance as a function of examples or
           iterations, thus targeting efficiency (55%). This indicates an increased awareness of the importance of this concept, at least in theoretical analyses.

           Expensive processing of one example Our focus is on neural models, where it is common for each training step
           to require inference, so we discuss training and inference cost together as “processing” an example. Some works
           have used increasingly expensive models that require large amounts of resources; as a result, performing inference
           with these models can require a lot of computation, and training even more so. For instance, Google’s BERT-large
           [8] contains roughly 350 million parameters. OpenAI’s openGPT2-XL model [30] contains 1.5 billion parameters.
           AI2, our home organization, recently released Grover [49], also containing 1.5 billion parameters. In the computer
           vision community, a similar trend is observed (Figure 1).
              Such large models have high costs for processing each example, which leads to large training costs. BERT-large
           was trained on 64 TPU chips for 4 days. Grover was trained on 256 TPU chips for two weeks, at an estimated cost of
           $25,000. XLNet had a similar architecture to BERT-large, but used a more expensive objective function (in addition
           to an order of magnitude more data), and was trained on 512 TPU chips for 2.5 days. 7 It is impossible to reproduce
           the best BERT-large results 8 or XLNet results 9 using a single GPU. Specialized models can have even more extreme
           costs, such as AlphaGo, the best version of which required 1,920 CPUs and 280 GPUs to play a single game of Go
           [37] at a cost of over $1,000 per hour. 10
              When examining variants of a single model (e.g., BERT-small and BERT-large) we see that larger models can have
           stronger performance, which is a valuable scientiﬁc contribution. However, this implies the ﬁnancial and environmental
           cost of increasingly large AI models will not decrease soon, as the pace of model growth far exceeds the resulting
           increase in model performance [16]. As a result, more and more resources are going to be required to keep improving
           AI models by simply making them larger.

           Processing many examples Another way state-of-the-art performance has recently been progressing in AI is by
           successively increasing the amount of training data models are trained on. BERT-large had top performance in 2018
           across many NLP tasks after training on 3 billion word-pieces. XLNet recently outperformed BERT after training
           on 32 billion word-pieces, including part of Common Crawl; openGPT-2-XL trained on 40 billion words; FAIR’s
           RoBERTa [23] was trained on 160GB of text, roughly 40 billion word-pieces, requiring around 25,000 GPU hours
           to train. In computer vision, researchers from Facebook [25] pretrained an image classiﬁcation model on 3.5 billion
           images from Instagram, three orders of magnitude larger than existing labelled image datasets such as Open Images. 11
              The use of massive data creates barriers for many researchers for reproducing the results of these models, or
           training their own models on the same setup (especially as training for multiple epochs is standard). For example, the
           June 2019 Common Crawl contains 242 TB of uncompressed data, 12 so even storing the data is expensive. Finally,
           as in the case of model size, relying on more data to improve performance is notoriously expensive because of the
           diminishing return of adding more data [41]. For instance, Figure 3, taken from [25], shows a logarithmic relation
           between the object recognition top-1 accuracy and the number of training examples.

           Massive number of experiments Some projects have poured large amounts of computation into tuning hyperparameters 
           or searching over neural architectures, well beyond the reach of most researchers. For instance, researchers
           from Google [51] trained over 12,800 neural networks in their neural architecture search to improve performance on
           object detection and language modeling. With a ﬁxed architecture, researchers from DeepMind [26] evaluated 1,500
           hyperparameter assignments to demonstrate that an LSTM language model [15] can reach state-of-the-art perplexity
           results. Despite the value of this result in showing that the performance of an LSTM does not plateau after only a few
           hyperparameter trials, fully exploring the potential of other competitive models for a fair comparison is prohibitively
           expensive.
             7 Some estimates for the cost of this process reach $250,000 (twitter.com/eturner303/status/1143174828804857856).
             8 See https://github.com/google-research/bert
             9 See https://github.com/zihangdai/xlnet
             10 Recent versions of AlphaGo are far more efﬁcient [39].
             11 https://opensource.google.com/projects/open-images-dataset
             12 http://commoncrawl.org/2019/07/

                                                      <<FIGURE>>

           Figure 3: Diminishing returns of training on more data: object detection accuracy increases linearly as the number of
           training examples increases exponentially [25].

              The topic of massive numbers of experiments is not as well studied as the first two discussed above. In fact, the
           number of experiments performed during model construction is often underreported. Nonetheless, evidence for a
           logarithmic relation exists here as well, between the number of experiments and performance gains [9].

           Discussion The beneﬁts of pouring more resources into models are certainly of interest to the AI community. Indeed,
           there is value in pushing the limits of model size, dataset size, and the hyperparameter search space. Currently, despite
           the massive amount of resources put into recent AI models, such investment still pays off in terms of downstream
           performance (albeit at an increasingly lower rate). Finding the point of saturation (if such exists) is an important
           question for the future of AI.
              Our goal in this paper is to raise awareness of the cost of Red AI , as well as encourage the AI community to
           recognize the value of work by researchers that take a different path, optimizing efﬁciency rather than accuracy. Next
           we turn to discuss concrete measures for making AI more green.


           3 Green AI 

           The term Green AI refers to AI research that yields novel results without increasing computational cost, and ideally
           reducing it. Whereas Red AI has resulted in rapidly escalating computational (and thus carbon) costs, Green AI has the
           opposite effect. If measures of efﬁciency are widely accepted as important evaluation metrics for research alongside
           accuracy, then researchers will have the option of focusing on the efﬁciency of their models with positive impact on
           both the environment and inclusiveness. This section reviews several measures of efﬁciency that could be reported
           and optimized, and advocates one particular measure—FPO—which we argue should be reported when AI research
           ﬁndings are published.

           3.1 Measures of Efﬁciency
           To measure efficiency, we suggest reporting the amount of work required to generate a result in AI, that is, the amount
           of work required to train a model and, if applicable, the total work for all hyperparameter tuning experiments. As
           the cost of an experiment decomposes into the cost of processing a single example, the size of the dataset, and the
           number of experiments (Equation 1), reducing the amount of work in each of these steps will result in AI that is more
           green.
              When reporting the amount of work done by a model, we want to measure a quantity that allows for a fair comparison
           between different models. As a result, this measure should ideally be stable across different labs, at different
           times, and using different hardware.

           Carbon emission Carbon emission is appealing as it is a quantity we want to directly minimize. Nonetheless, it
           is impractical to measure the exact amount of carbon released by training or executing a model (and, accordingly,
           by generating an AI result), as this amount depends highly on the local electricity infrastructure. As a result, it is not
           comparable between researchers in different locations or even the same location at different times.

           Electricity usage Electricity usage is correlated with carbon emission while being time- and location-agnostic.
           Moreover, GPUs often report the amount of electricity each of their cores consumes at each time point, which facilitates
           the estimation of the total amount of electricity consumed by generating an AI result. Nonetheless, this measure is
           hardware dependent, and as a result does not allow for a fair comparison between different models.

           Elapsed real time The total running time for generating an AI result is a natural measure for efﬁciency, as all other
           things being equal, a faster model is doing less computational work. Nonetheless, this measure is highly inﬂuenced
           by factors such as the underlying hardware, other jobs running on the same machine, and the number of cores used.
           These factors hinder the comparison between different models, as well as the decoupling of modeling contributions
           from hardware improvements.

           Number of parameters Another common measure of efﬁciency is the number of parameters (learnable or total)
            used by the model. As with run time, this measure is correlated with the amount of work. Unlike the other measures
            described above, it does not depend on the underlying hardware. Moreover, this measure also highly correlates with the
            amount of memory consumed by the model. Nonetheless, different algorithms make different use of their parameters,
            for instance by making the model deeper vs. wider. As a result, different models with a similar number of parameters
            often perform different amounts of work.

           FPO As a concrete measure, we suggest reporting the total number of floating point operations (FPO) required to
           generate a result. 13 FPO provides an estimate of the amount of work performed by a computational process. It is
           computed analytically by assigning a cost to two base operations, ADD and MUL. Based on these operations, the FPO
           cost of any machine learning abstract operation (e.g., a tanh operation, a matrix multiplication, a convolution operation,
           or the BERT model) can be computed as a recursive function of these two operations. FPO has been used in the past
           to quantify the energy footprint of a model [27, 43, 12, 42], but is not widely adopted in AI.
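
The recursive definition can be illustrated with a small sketch. The unit costs for ADD and MUL and all function names below are assumptions made for illustration; the text does not prescribe specific costs, and real counters must also handle activations, normalization layers, and other operations.

```python
ADD = 1  # assumed cost of one floating point addition
MUL = 1  # assumed cost of one floating point multiplication


def dot_fpo(n):
    """Dot product of two length-n vectors: n multiplications, n - 1 additions."""
    return n * MUL + (n - 1) * ADD


def matmul_fpo(m, k, n):
    """(m x k) times (k x n) matrix product: one length-k dot product per output entry."""
    return m * n * dot_fpo(k)


def dense_layer_fpo(batch, d_in, d_out):
    """Affine layer: a matrix product plus one bias addition per output entry."""
    return matmul_fpo(batch, d_in, d_out) + batch * d_out * ADD


def conv2d_fpo(h_out, w_out, c_in, c_out, k_h, k_w):
    """Bias-free 2D convolution: one dot product over the receptive field per output value."""
    return h_out * w_out * c_out * dot_fpo(c_in * k_h * k_w)


def mlp_fpo(batch, layer_sizes):
    """Stack of affine layers (activations ignored): the sum of the per-layer costs."""
    return sum(dense_layer_fpo(batch, a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))


print(mlp_fpo(1, [784, 256, 10]))  # FPO for one example through a small MLP
```

Each composite operation is expressed through simpler ones until only ADD and MUL remain, which is what makes the count independent of the hardware.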
              FPO has several appealing properties. First, it directly computes the amount of work done by the running machine
           when executing a speciﬁc instance of a model, and is thus tied to the amount of energy consumed. Second, FPO is
           agnostic to the hardware on which the model is run. This facilitates fair comparisons between different approaches,
           unlike the measures described above. Third, FPO is strongly correlated with the running time of the model [4]. Unlike
           asymptotic runtime, FPO also considers the amount of work done at each time step.
              Several packages exist for computing FPO in various neural network libraries, 14 though none of them contains all
           the building blocks required to construct all modern AI models. We encourage the builders of neural network libraries
           to implement such functionality directly.

             13 Floating point operations are often referred to as FLOP(s), though this term is not uniquely deﬁned [12]. To avoid confusion, we use the term FPO.
             14 E.g., https://github.com/Swall0w/torchstat; https://github.com/Lyken17/pytorch-OpCounter

                                                   <<FIGURE>>

           Figure 4: Increase in FPO results in diminishing return for object detection top-1 accuracy. Plots (bottom to top):
           model parameters (in millions), FPO (in billions), top-1 accuracy on ImageNet. (4a): Different models: AlexNet
           [20], ResNet [14], ResNext [47], DPN107 [5], SENet154 [17]. (4b): Comparison of different sizes (measured by the
           number of layers) of the ResNet model [14].


           Discussion Efﬁcient machine learning approaches have received attention in the research community, but are generally
           not motivated by being green. For example, a signiﬁcant amount of work in the computer vision community has
           addressed efﬁcient inference, which is necessary for real-time processing of images for applications like self-driving
           cars [24, 31, 22], or for placing models on devices such as mobile phones [16, 34]. Most of these approaches target efficient
           model inference [32, 50, 12], 15 and thus only minimize the cost of processing a single example, while ignoring
           the other two red practices discussed in Section 2. 16
              The above examples indicate that the path to making AI green depends on how it is used. When developing a new
           model, much of the research process involves training many model variants on a training set and performing inference
           on a small development set. In such a setting, more efﬁcient training procedures can lead to greater savings, while in
           a production setting more efﬁcient inference can be more important. We advocate for a holistic view of computational
           savings which doesn’t sacriﬁce in some areas to make advances in others.
              FPO has some limitations. First, it targets the electricity consumption of a model, while ignoring other potential
           limiting factors for researchers such as the memory consumption by the model, which can often lead to additional
           energy and monetary costs [24]. Second, the amount of work done by a model largely depends on the model implementation,
           as two different implementations of the same model could result in very different amounts of processing
           work. Due to the focus on the modeling contribution, the AI community has traditionally ignored the quality or efficiency
           of models’ implementation. We argue that the time to reverse this norm has come, and that exceptionally
           good implementations that lead to efﬁcient models should be credited by the AI community.

           3.2 FPO Cost of Existing Models
           To demonstrate the importance of reporting the amount of work, we present FPO costs for several existing models.
           A few trends are observable. First, as discussed in Section 2, models get more expensive with time, but the increase 
           in FPO does not lead to similar performance gains. For instance, an increase of almost 35% in FPO between ResNet and 
           ResNext (second and third points in graph) resulted in a 0.5% top-1 accuracy improvement. Similar patterns are observed 
           when considering the effect of other increases in model work. Second, the number of model parameters does not tell 
           the whole story: AlexNet (ﬁrst point in the graph) actually has more parameters than ResNet (second point), but 
           dramatically less FPO, and also much lower accuracy.
              Figure 4b shows the same analysis for a single object recognition model, ResNet [14], while comparing different
           versions of the model with different numbers of layers. This creates a controlled comparison between the different
           models, as they are identical in architecture, except for their size (and accordingly, their FPO cost). Once again, we
           notice the same trend: the large increase in FPO cost does not translate to a large increase in performance.

             14 Figure 4a shows the number of parameters and FPO of several leading object recognition models, as well as their performance on the ImageNet dataset [6].
             15 Some very recent work also targeted efﬁcient training [7].
             16 In fact, creating smaller models often results in longer running time, so mitigating the different trends might be at odds [44].
             17 We consider this exclusive focus on the final prediction another symptom of Red AI.
             18 These numbers represent FPO per inference, i.e., the work required to process a single example.

           3.3 Additional Ways to Promote Green AI 
           In addition to reporting the FPO cost of the final reported number, we encourage researchers to report the
           budget/accuracy curve observed during training. In a recent paper [9], we observed that selecting the better performing
           model on a given task depends highly on the amount of compute available during model development. We introduced
           a method for computing the expected best validation performance of a model as a function of the given budget. We
           argue that reporting this curve will allow users to make wiser decisions about their selection of models and highlight
           the stability of different approaches.
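
One way such a curve could be estimated is sketched below. It is written in the spirit of [9] but is not taken from it: it assumes the observed validation scores come from independent hyperparameter trials drawn from the same distribution, and computes the expected maximum of a given number of draws from their empirical distribution. The function name and the example scores are hypothetical.

```python
def expected_best_score(scores, budget):
    """Expected maximum of `budget` i.i.d. draws from the empirical distribution.

    scores: validation scores from independent trials (assumed i.i.d.)
    budget: number of trials a researcher can afford (positive integer)
    """
    if budget < 1 or not scores:
        raise ValueError("need at least one score and a positive budget")
    ordered = sorted(scores)
    n = len(ordered)
    total = 0.0
    for i, s in enumerate(ordered, start=1):
        # P(max <= i-th smallest score) minus P(max <= (i-1)-th smallest score).
        total += s * ((i / n) ** budget - ((i - 1) / n) ** budget)
    return total


# Hypothetical validation accuracies from five trials of one model.
trial_scores = [0.71, 0.64, 0.80, 0.77, 0.69]
curve = [expected_best_score(trial_scores, b) for b in range(1, 6)]
print(curve)
```

With a budget of one trial the estimate equals the mean score, and it rises toward the best observed score as the budget grows, so two models can be compared at every budget rather than only at their best reported number.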
              We further advocate for making efﬁciency an ofﬁcial contribution in major AI conferences, by advising reviewers
           to recognize and value contributions that do not strictly improve state of the art, but have other beneﬁts such as
           efﬁciency. Finally, we note that the trend of releasing pretrained models publicly is a green success, and we would like
           to encourage organizations to continue to release their models in order to save others the costs of retraining them.


           4 Related Work

           Recent work has analyzed the carbon emissions of training deep NLP models [40] and concluded that computationally
           expensive experiments can have a large environmental and economic impact. With modern experiments using such
           large budgets, many researchers (especially those in academia) lack the resources to work in many high-proﬁle areas;
           increased value placed on computationally efﬁcient approaches will allow research contributions from more diverse
           groups. We emphasize that the conclusions of [40] are the result of long-term trends, and are not isolated within NLP,
           but hold true across machine learning.
              While some companies offset electricity usage by purchasing carbon credits, it is not clear that buying credits is
           as effective as using less energy. In addition, purchasing carbon credits is voluntary; Google Cloud 20 and Microsoft
           Azure 21 purchase carbon credits to offset their spent energy, but Amazon’s AWS 22 (the largest cloud computing
           platform 23) only covered fifty percent of its power usage with renewable energy.
              The push to improve state-of-the-art performance has focused the research community’s attention on reporting the
           single best result after running many experiments for model development and hyperparameter tuning. Failure to fully
           report these experiments prevents future researchers from understanding how much effort is required to reproduce a
           result or extend it [9].
              Our focus is on improving efﬁciency in the machine learning community, but machine learning can also be used
           as a tool for work in areas like climate change. For example, machine learning has been used for reducing emissions
           of cement plants [1] and tracking animal conservation outcomes [11], and is predicted to be useful for forest ﬁre
           management [33]. Undoubtedly these are important applications of machine learning; we recognize that they are
           orthogonal to the content of this paper.

             19 Numbers taken from https://github.com/sovrasov/flops-counter.pytorch
             20 https://cloud.google.com/sustainability/
             21 https://www.microsoft.com/en-us/environment/carbon
             22 https://aws.amazon.com/about-aws/sustainability/
             23 https://tinyurl.com/y2kob969


           5 Conclusion

           The vision of Green AI raises many exciting research directions that help to overcome the inclusiveness challenges of
           Red AI. Progress will reduce the computational expense with a minimal reduction in performance, or even improve
           performance as more efﬁcient methods are discovered. Also, it would seem that Green AI could be moving us in a
           more cognitively plausible direction as the brain is highly efﬁcient.
              It’s important to reiterate that we see Green AI as a valuable option, not an exclusive mandate; of course, both
           Green AI and Red AI have contributions to make. We want to increase the prevalence of Green AI by highlighting its
           benefits and advocating a standard measure of efficiency. Below, we point to a few important green research directions,
           and highlight a few open questions.
              Research on building space or time efﬁcient models is often motivated by ﬁtting a model on a small device (such
           as a phone) or fast enough to process examples in real time, such as image captioning for the blind (see Section 3.1).
           Some modern models don’t even ﬁt on a single GPU (see Section 2). Here we argue for a far broader approach.
              Data efﬁciency has received signiﬁcant attention over the years [35, 19]. Modern research in vision and NLP often
           involves ﬁrst pretraining a model on large “raw” (unannotated) data then ﬁne-tuning it to a task of interest through
           supervised learning. A strong result in this area often involves achieving similar performance to a baseline with
           fewer training examples or fewer gradient steps. Most recent work has addressed ﬁne-tuning data [29], but pretraining
           efﬁciency is also important. In either case, one simple technique to improve in this area is to simply report performance
           with different amounts of training data. For example, reporting performance of contextual embedding models trained
           on 10 million, 100 million, 1 billion, and 10 billion tokens would facilitate faster development of new models, as they
           can ﬁrst be compared at the smallest data sizes. Research here is of value not just to make training less expensive, but
           because in areas such as low resource languages or historical domains it is extremely hard to generate more data, so to
           progress we must make more efﬁcient use of what is available.
              Finally, the total number of experiments run to get a ﬁnal result is often underreported and underdiscussed [9]. The
           few instances researchers have of full reporting of the hyperparameter search, architecture evaluations, and ablations
           that went into a reported experimental result have surprised the community [40]. While many hyperparameter optimization
           algorithms exist which can reduce the computational expense required to reach a given level of performance
           [3, 10], simple improvements here can have a large impact. For example, stopping training early for models which are
           clearly underperforming can lead to great savings [21].


           References

           [1] Prabal Acharyya, Sean D Rosario, Roey Flor, Ritvik Joshi, Dian Li, Roberto Linares, and Hongbao Zhang.
                Autopilot of cement plants for reduction of fuel consumption and emissions, 2019. ICML Workshop on Climate
                Change.
           [2] Dario Amodei and Danny Hernandez. AI and compute, 2018. Blog post.
           [3] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter
                optimization. In Proc. of NeurIPS, 2011.
           [4] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for
                practical applications. In Proc. of ISCAS, 2017.
           [5] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In
                Proc. of NeurIPS, 2017.
           [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical
                image database. In Proc. of CVPR, 2009.
           [7] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance,
                2019. arXiv:1907.04840.
           [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional
                transformers for language understanding. In Proc. of NAACL, 2019.
           [9] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved
                reporting of experimental results. In Proc. of EMNLP, 2019.
           [10] Jesse Dodge, Kevin Jamieson, and Noah A. Smith. Open loop hyperparameter optimization and determinantal
                point processes. In Proc. of AutoML, 2017.
           [11] Clement Duhart, Gershon Dublon, Brian Mayton, Glorianna Davenport, and Joseph A. Paradiso. Deep learning
                for wildlife conservation and restoration efforts, 2019. ICML Workshop on Climate Change.
           [12] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast &
                simple resource-constrained structure learning of deep networks. In Proc. of CVPR, 2018.
           [13] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent
                Systems, 24:8–12, 2009.
           [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
                Proc. of CVPR, 2016.
           [15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780,
                1997.
           [16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco
                Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision
                applications, 2017. arXiv:1704.04861.
           [17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proc. of CVPR, 2018.
           [18] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer,
                Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern
                convolutional object detectors. In Proc. of CVPR, 2017.
           [19] Sanket Kamthe and Marc Peter Deisenroth. Data-efficient reinforcement learning with probabilistic model
                predictive control. In Proc. of AISTATS, 2018.
           [20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural
                networks. In Proc. of NeurIPS, 2012.
           [21] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband:
                Bandit-based configuration evaluation for hyperparameter optimization. In Proc. of ICLR, 2017.
           [22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C.
                Berg. SSD: Single shot multibox detector. In Proc. of ECCV, 2016.
           [23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
                Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
                arXiv:1907.11692.
           [24] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient
                CNN architecture design. In Proc. of ECCV, 2018.
           [25] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin
                Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proc. of
                ECCV, 2018.
           [26] Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In
                Proc. of EMNLP, 2018.
           [27] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks
                for resource efficient inference. In Proc. of ICLR, 2017.
           [28] Gordon E. Moore. Cramming more components onto integrated circuits, 1965.
           [29] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
                Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.
           [30] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are
                unsupervised multitask learners, 2019. OpenAI Blog.
           [31] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification
                using binary convolutional neural networks. In Proc. of ECCV, 2016.
           [32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object
                detection. In Proc. of CVPR, 2016.
           [33] David Rolnick, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran,
                Andrew Slavin Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan
                Maharaj, Evan D. Sherwin, S. Karthik Mukkavilli, Konrad P. Körding, Carla Gomes, Andrew Y. Ng, Demis
                Hassabis, John C. Platt, Felix Creutzig, Jennifer Chayes, and Yoshua Bengio. Tackling climate change with
                machine learning, 2019. arXiv:1906.05433.
           [34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2:
                Inverted residuals and linear bottlenecks. In Proc. of CVPR, 2018.
           [35] Roy Schwartz, Sam Thomson, and Noah A. Smith. SoPa: Bridging CNNs, RNNs, and weighted finite-state
                machines. In Proc. of ACL, 2018.
           [36] Yoav Shoham, Raymond Perrault, Erik Brynjolfsson, Jack Clark, James Manyika, Juan Carlos Niebles, Terah
                Lyons, John Etchemendy, and Z Bauer. The AI index 2018 annual report. AI Index Steering Committee,
                Human-Centered AI Initiative, Stanford University. Available at
                http://cdn.aiindex.org/2018/AI%20Index%202018%20Annual%20Report.pdf, 2018.
           [37] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian
                Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe,
                John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore
                Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature,
                529(7587):484, 2016.
           [38] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc
                Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis
                Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.
                arXiv:1712.01815.
           [39] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas
                Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre,
                George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human
                knowledge. Nature, 550(7676):354, 2017.
           [40] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in
                NLP. In Proc. of ACL, 2019.
           [41] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of
                data in deep learning era. In Proc. of ICCV, 2017.
           [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
                and Illia Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
           [43] Tom Veniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super
                networks. In Proc. of CVPR, 2018.
           [44] Aaron Walsman, Yonatan Bisk, Saadia Gabriel, Dipendra Misra, Yoav Artzi, Yejin Choi, and Dieter Fox. Early
                fusion for goal directed robotic vision. In Proc. of IROS, 2019.
           [45] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and
                Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems,
                2019. arXiv:1905.00537.
           [46] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A
                multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR, 2019.
           [47] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations
                for deep neural networks. In Proc. of CVPR, 2017.
           [48] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet:
                Generalized autoregressive pretraining for language understanding, 2019. arXiv:1906.08237.
           [49] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi.
                Defending against neural fake news, 2019. arXiv:1905.12616.
           [50] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional
                neural network for mobile devices. In Proc. of CVPR, 2018.
           [51] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proc. of ICLR, 2017.
<|endoftext|>


<|startoftext|>
Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication 

Herbert Jaeger* and Harald Haas

International University Bremen, Bremen D-28759, Germany.

We present a method for learning nonlinear systems, echo state networks (ESNs). ESNs employ artificial recurrent neural networks in a way that has recently been proposed independently as a learning mechanism in biological brains. The learning method is computationally efficient and easy to use. On a benchmark task of predicting a chaotic time series, accuracy is improved by a factor of 2400 over previous techniques. The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.
Nonlinear dynamical systems abound in the sciences and in engineering. If one wishes to simulate, predict, filter, classify, or control such a system, one needs an executable system model. However, it is often infeasible to obtain analytical models. In such cases, one has to resort to black-box models, which ignore the internal physical mechanisms and instead reproduce only the outwardly observable input-output behavior of the target system.
If the target system is linear, efficient methods for black-box modeling are available. Most technical systems, however, become nonlinear if operated at higher operational points (that is, closer to saturation). Although this might lead to cheaper and more energy-efficient designs, it is not done because the resulting nonlinearities cannot be harnessed. Many biomechanical systems use their full dynamic range (up to saturation) and thereby become lightweight, energy efficient, and thoroughly nonlinear.
Here, we present an approach to learning black-box models of nonlinear systems, echo state networks (ESNs). An ESN is an artificial recurrent neural network (RNN). RNNs are characterized by feedback (recurrent) loops in their synaptic connection pathways. They can maintain an ongoing activation even in the absence of input and thus exhibit dynamic memory. Biological neural networks are typically recurrent. Like biological neural networks, an artificial RNN can learn to mimic a target system, in principle with arbitrary accuracy (1). Several learning algorithms are known (2–4) that incrementally adapt the synaptic weights of an RNN in order to tune it toward the target system. These algorithms have not been widely employed in technical applications because of slow convergence and suboptimal solutions (5, 6). The ESN approach differs from these methods in that a large RNN is used (on the order of 50 to 1000 neurons; previous techniques typically use 5 to 30 neurons) and in that only the synaptic connections from the RNN to the output readout neurons are modified by learning; previous techniques tune all synaptic connections (Fig. 1). Because there are no cyclic dependencies between the trained readout connections, training an ESN becomes a simple linear regression task.
We illustrate the ESN approach on a task of chaotic time series prediction (Fig. 2) (7). The Mackey-Glass system (MGS) (8) is a standard benchmark system for time series prediction studies. It generates a subtly irregular time series (Fig. 2A). The prediction task has two steps: (i) using an initial teacher sequence generated by the original MGS to learn a black-box model M of the generating system, and (ii) using M to predict the value of the sequence some steps ahead.
First, we created a random RNN with 1000 neurons (called the reservoir) and one output neuron. The output neuron was equipped with random connections that project back into the reservoir (Fig. 2B). A 3000-step teacher sequence <<FORMULA>> was generated from the MGS equation and fed into the output neuron. This excited the internal neurons through the output feedback connections. After an initial transient period, they started to exhibit systematic individual variations of the teacher sequence (Fig. 2B). 
The fact that the internal neurons display systematic variants of the exciting external signal is constitutional for ESNs: The internal neurons must work as echo functions for the driving signal. Not every randomly generated RNN has this property, but it can effectively be built into a reservoir (supporting online text).
It is important that the echo signals be richly varied. This was ensured by a sparse interconnectivity of 1% within the reservoir. This condition lets the reservoir decompose into many loosely coupled subsystems, establishing a richly structured reservoir of excitable dynamics. 
After time <<n=3000>>, output connection weights wi (i = 1, ..., 1000) were computed (dashed arrows in Fig. 2B) from the last 2000 steps n = 1001, ..., 3000 of the training run such that the training error

<<FORMULA>> 

was minimized [<<xi(n)>>, activation of the ith internal neuron at time n]. This is a simple linear regression. 
With the new wi in place, the ESN was disconnected from the teacher after step 3000 and left running freely. A bidirectional dynamical interplay of the network-generated output signal with the internal signals <<FORMULA>> unfolded. The output signal <<FORMULA>> was created from the internal neuron activation signals <<FORMULA>> through the trained connections wi, by <<FORMULA>>. Conversely, the internal signals were echoed from that output signal through the fixed output feedback connections (supporting online text).
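
As a rough illustration of the procedure described above (a teacher-forced reservoir, a readout obtained by linear regression, then free running), the following Python sketch may help. It is not the authors' implementation: the reservoir size, connection density, spectral-radius rescaling, washout length, and the sine-wave teacher standing in for the Mackey-Glass series are all assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import numpy as np


def train_esn_readout(teacher, n_reservoir=200, density=0.05,
                      spectral_radius=0.8, washout=100, seed=0):
    """Drive a fixed random reservoir with the teacher and fit the readout."""
    rng = np.random.default_rng(seed)
    # Sparse random reservoir weights; these stay fixed during learning.
    W = rng.uniform(-1.0, 1.0, (n_reservoir, n_reservoir))
    W *= rng.random((n_reservoir, n_reservoir)) < density
    # Assumption: rescale so the largest eigenvalue magnitude is below 1.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    # Fixed random feedback weights from the output neuron into the reservoir.
    w_back = rng.uniform(-1.0, 1.0, n_reservoir)

    x = np.zeros(n_reservoir)
    states = np.zeros((len(teacher), n_reservoir))
    for n in range(1, len(teacher)):
        # Teacher forcing: the previous teacher value plays the role of the output.
        x = np.tanh(W @ x + w_back * teacher[n - 1])
        states[n] = x

    # Discard the initial transient, then solve the linear regression for w_out.
    X, d = states[washout:], teacher[washout:]
    w_out, *_ = np.linalg.lstsq(X, d, rcond=None)
    train_mse = float(np.mean((X @ w_out - d) ** 2))
    return W, w_back, w_out, x, train_mse


def run_freely(W, w_back, w_out, x, last_output, steps):
    """Disconnect the teacher: the network's own output is fed back instead."""
    y, outputs = last_output, []
    for _ in range(steps):
        x = np.tanh(W @ x + w_back * y)
        y = float(w_out @ x)
        outputs.append(y)
    return np.array(outputs)


# Stand-in teacher signal (an assumption; the text uses the Mackey-Glass series).
teacher = 0.5 * np.sin(0.2 * np.arange(1000))
W, w_back, w_out, x, train_mse = train_esn_readout(teacher)
prediction = run_freely(W, w_back, w_out, x, teacher[-1], steps=84)
print(train_mse, prediction.shape)
```

Only `w_out` is learned; the reservoir and feedback weights are drawn once and never updated, which is why training reduces to a single least-squares solve.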
For testing, an 84-step continuation <<d(3001), ... , d(3084)>> of the original signal was computed for reference. The network output y(3084) was compared with the correct continuation d(3084). Averaged over 100 independent trials, a normalized root mean square error

<<FORMULA>>

was obtained <<FORMULA>> and <<FORMULA>> teacher and network output in trial j, σ² variance of MGS signal), improving the best previous techniques (9–15), which used training sequences of length 500 to 10,000, by a factor of 700. If the prediction run was continued, deviations typically became visible after about 1300 steps (Fig. 2A). With a refined variant of the learning method (7), the improvement factor rises to 2400. Models of similar accuracy were also obtained for other chaotic systems (supporting online text).

                                                        <<FIGURE>>

Fig. 1. (A) Schema of previous approaches to RNN learning. (B) Schema of ESN approach. Solid arrows, fixed synaptic connections; dotted arrows, adjustable connections. Both approaches aim at minimizing the error <<FORMULA>>, where <<FORMULA>> is the network output and d(n) is the teacher time series observed from the target system.
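
The error measure used for the 84-step prediction test can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes the common definition of a normalized root mean square error: the mean squared difference between network output and teacher over the trials, divided by the signal variance, under a square root. The function name and the sample numbers are hypothetical.

```python
import math


def nrmse(teacher_values, network_values, signal_variance):
    """Normalized root mean square error over a set of trials.

    teacher_values:  correct continuation d_j in each trial j
    network_values:  network output y_j in each trial j
    signal_variance: variance of the teacher signal (assumed known)
    """
    if len(teacher_values) != len(network_values) or not teacher_values:
        raise ValueError("need equally many, and at least one, value per series")
    mse = sum((y - d) ** 2
              for d, y in zip(teacher_values, network_values)) / len(teacher_values)
    return math.sqrt(mse / signal_variance)


# Hypothetical values from three trials.
print(nrmse([0.90, 1.10, 0.70], [0.91, 1.08, 0.70], signal_variance=0.05))
```

Dividing by the signal variance makes the measure independent of the scale of the signal, so a value of 1 corresponds to predicting no better than the signal mean.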
The main reason for the jump in modeling accuracy is that ESNs capitalize on a massive short-term memory. We showed analytically (16) that under certain conditions an ESN of size N may be able to "remember" a number of previous inputs that is of the same order of magnitude as N. This information is more massive than the information used in other techniques (supporting online text).
We now illustrate the approach in a task of practical relevance, namely, the equalization of a wireless communication channel (7). The essentials of equalization are as fol.lows: A sender wants to communicate a sym.bol sequence s(n). This sequence is first transformed into an analog envelope signal d(n), then modulated on a high-frequency carrier signal and transmitted, then received and demodulated into an analog signal u(n), which is a corrupted version of d(n). Major sources of corruption are noise (thermal or due to interfering signals), multipath propagation, which leads to a superposition of adjacent symbols (intersymbol interference), and nonlinear distortion induced by operating the senders power amplifier in the high-gain region. To avoid the latter, the actual power amplification is run well below the maximum amplification possible, thereby incurring a substantial loss in energy efficiency, which is clearly undesirable in cell-phone and satellite 

Fig. 2. (A) Prediction output of the trained ESN (dotted) overlaid with the correct continuation (solid). (B) Learning the MG attractor. Three sample activation traces of internal neurons are shown. They echo the teacher signal d(n). After training, the desired output is recreated from the echo signals through output connections (dotted arrows) whose weights wi are the result of the training procedure. 
communications. The corrupted signal u(n) is then passed through an equalizing filter whose output y(n) should restore u(n) as closely as possible to d(n). Finally, the equalized signal y(n) is converted back into a symbol sequence. The quality measure for the entire process is the fraction of incorrect symbols finally obtained (symbol error rate). 
To compare the performance of an ESN equalizer with standard techniques, we took a channel model for a nonlinear wireless transmission system from a study (17) that compared three customary nonlinear equalization methods: a linear decision feedback equalizer (DFE), which is actually a nonlinear method; a Volterra DFE; and a bilinear DFE. The model equation featured intersymbol interference across 10 consecutive symbols, a second-order and a third-order nonlinear distortion, and additive white Gaussian noise. All methods investigated in that study had 47 adjustable parameters and used sequences of 5000 symbols for training. To make the ESN equalizer comparable with the equalizers studied in (17), we took ESNs with a reservoir of 46 neurons (which is small for the ESN approach), which yielded 47 adjustable parameters. (The 47th comes from a direct connection from the input to the output neuron.) 
We carried out numerous learning trials (7) to obtain ESN equalizers, using an online learning method (a version of the recursive least square algorithm known from linear adaptive filters) to train the output weights on 5000-step training sequences. We chose an online adaptation scheme here because the methods in (17) were online adaptive, too, and because wireless communication channels mostly are time-varying, such that an equalizer must adapt to changing system characteristics. The entire learning-testing procedure was repeated for signal-to-noise 

<<FIGURE>>

Fig. 3. Results of using an ESN for nonlinear channel equalization. Plot shows signal error rate (SER) versus signal-to-noise ratio (SNR). 
(a) Linear DFE. (b) Volterra DFE. (c) Bilinear DFE. [(a) to (c) taken from (20)]. (d) Blue line represents average ESN performance with randomly generated reservoirs. Error bars, variation across networks. (e) Green line indicates performance of best network chosen from the networks averaged in (d). Error bars, variation across learning trials. 
REPORTS 
ratios ranging from 12 to 32 dB. Figure 3 compares the average symbol error rates obtained with the results reported in (17), showing an improvement of two orders of magnitude for high signal-to-noise ratios. 
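The equalizer setup described above (a fixed random reservoir, a direct input-to-output connection, and output weights adapted online by recursive least squares) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the supporting material: the two-tap toy channel, the 10-neuron reservoir, the row-sum scaling of the recurrent weights, and every function name here are hypothetical choices made to keep the example small and fast.

```python
import math
import random


def make_reservoir(n, rng, density=0.3, row_bound=0.8):
    """Random sparse recurrent weights (hypothetical construction).

    Each row is rescaled so its absolute sum is at most row_bound < 1.
    With tanh units this makes the state update a contraction, so the
    reservoir state gradually forgets old inputs."""
    w = [[rng.uniform(-1.0, 1.0) if rng.random() < density else 0.0
          for _ in range(n)] for _ in range(n)]
    for row in w:
        s = sum(abs(v) for v in row)
        if s > 0.0:
            for j in range(n):
                row[j] *= row_bound / s
    w_in = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return w, w_in


def rls_step(w_out, p, z, target):
    """One recursive-least-squares update of the linear readout.

    No forgetting factor is used. Returns the output computed BEFORE
    the weights are updated (the a-priori output)."""
    m = len(z)
    pz = [sum(p[i][j] * z[j] for j in range(m)) for i in range(m)]
    denom = 1.0 + sum(z[i] * pz[i] for i in range(m))
    y = sum(w_out[i] * z[i] for i in range(m))
    err = target - y
    for i in range(m):
        gain = pz[i] / denom
        w_out[i] += gain * err
        for j in range(m):
            p[i][j] -= gain * pz[j]   # P stays symmetric
    return y


def run_toy_equalizer(n_neurons=10, steps=400, seed=0):
    rng = random.Random(seed)
    w, w_in = make_reservoir(n_neurons, rng)
    m = n_neurons + 1                 # reservoir units + direct input link
    w_out = [0.0] * m
    p = [[1e3 if i == j else 0.0 for j in range(m)] for i in range(m)]
    x = [0.0] * n_neurons
    prev_symbol, errors, counted = 0.0, 0, 0
    for n in range(steps):
        d = rng.choice((-1.0, 1.0))                       # transmitted symbol
        u = d + 0.4 * prev_symbol + rng.gauss(0.0, 0.05)  # assumed toy channel
        x = [math.tanh(sum(w[i][j] * x[j] for j in range(n_neurons))
                       + w_in[i] * u) for i in range(n_neurons)]
        y = rls_step(w_out, p, x + [u], d)
        if n >= steps // 2:           # score the second half only
            counted += 1
            errors += (y >= 0.0) != (d > 0.0)
        prev_symbol = d
    return errors / counted, w_out
```

Calling `run_toy_equalizer()` returns the symbol error rate over the second half of the run together with the learned output weights. Only the readout is adapted; the reservoir weights stay fixed after random initialization, which is the property the text relies on.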
For tasks with multichannel input and/or output, the ESN approach can be accommodated simply by adding more input or output neurons (16, 18). 
ESNs can be applied to all basic tasks of signal processing and control, including time series prediction, inverse modeling, pattern generation, event detection and classification, modeling distributions of stochastic processes, filtering, and nonlinear control (16, 18, 19, 20). Because a single learning run takes only a few seconds (or minutes, for very large data sets and networks), engineers can test out variants at a high turnover rate, a crucial factor for practical usability. 
ESNs have been developed from a mathematical and engineering perspective, but exhibit typical features of biological RNNs: a large number of neurons, recurrent pathways, sparse random connectivity, and local modification of synaptic weights. The idea of using randomly connected RNNs to represent and memorize dynamic input in network states has frequently been explored in specific contexts, for instance, in artificial intelligence models of associative memory (21), models of prefrontal cortex function in sensory-motor sequencing tasks (22), models of birdsong (23), models of the cerebellum (24), and general computational models of neural oscillators (25). Many different learning mechanisms were considered, mostly within the RNN itself. The contribution of the ESN is to elucidate the mathematical properties of large RNNs such that they can be used with a linear, trainable readout mechanism for general black-box modeling. An approach essentially equivalent to ESNs, liquid state networks (26, 27), has been developed independently to model computations in cortical microcircuits. Recent findings in neurophysiology suggest that the basic ESN/liquid state network principle seems not uncommon in biological networks (28-30) and could eventually be exploited to control prosthetic devices by signals collected from a collective of neurons (31). 

References and Notes 
1. K.-I. Funahashi, Y. Nakamura, Neural Netw. 6, 801 (1993). 
2. D. Zipser, R. J. Williams, Neural Comput. 1, 270 (1989). 
3. P. J. Werbos, Proc. IEEE 78, 1550 (1990). 
4. L. A. Feldkamp, D. V. Prokhorov, C. F. Eagen, F. Yuan, in Nonlinear Modeling: Advanced Black-Box Techniques, J. A. K. Suykens, J. Vandewalle, Eds. (Kluwer, Dordrecht, Netherlands, 1998), pp. 29-54. 
5. K. Doya, in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. (MIT Press, Cambridge, MA, 1995), pp. 796-800. 
6. H. Jaeger, Tutorial on training recurrent neural networks (GMD-Report 159, German National Research Institute for Computer Science, 2002); ftp://borneo.gmd.de/pub/indy/publications_herbert/CompleteTutorialTechrep.pdf. 

7. Materials and methods are available as supporting material on Science Online. 
8. M. C. Mackey, L. Glass, Science 197, 287 (1977). 
9. J. Vesanto, in Proc. WSOM 97 (1997); www.cis.hut.fi/projects/monitor/publications/papers/wsom97.ps. 
10. L. Chudy, I. Farkas, Neural Network World 8, 481 (1998). 
11. H. Bersini, M. Birattari, G. Bontempi, in Proc. IEEE World Congr. on Computational Intelligence (IJCNN 98) (1997), pp. 2102-2106; ftp://iridia.ulb.ac.be/pub/lazy/papers/IridiaTr1997-13_2.ps.gz. 
12. T. M. Martinetz, S. G. Berkovich, K. J. Schulten, IEEE Trans. Neural Netw. 4, 558 (1993). 
13. X. Yao, Y. Liu, IEEE Trans. Neural Netw. 8, 694 (1997). 
14. F. Gers, D. Eck, J. F. Schmidhuber, Applying LSTM to time series predictable through time-window approaches (IDSIA-IDSIA-22-00, 2000); www.idsia.ch/ felix/Publications.html. 
15. J. McNames, J. A. K. Suykens, J. Vandewalle, Int. J. Bifurcat. Chaos 9, 1485 (1999). 
16. H. Jaeger, Short term memory in echo state networks (GMD-Report 152, German National Research Institute for Computer Science, 2002); ftp://borneo.gmd.de/pub/indy/publications_herbert/STMEchoStatesTechRep.pdf. 
17. V. J. Mathews, J. Lee, in Advanced Signal Processing: Algorithms, Architectures, and Implementations V (Proc. SPIE Vol. 2296), (SPIE, San Diego, CA, 1994), pp. 317-327. 
18. J. Hertzberg, H. Jaeger, F. Schonherr, in Proc. 15th Europ. Conf. on Art. Int. (ECAI 02), F. van Harmelen, Ed. (IOS Press, Amsterdam, 2002), pp. 708-712; www.ais.fhg.de/schoenhe/papers/ECAI02.pdf. 
19. H. Jaeger, The echo state approach to analysing and training recurrent neural networks (GMD-Report 148, German National Research Institute for Computer Science, 2001); ftp://borneo.gmd.de/pub/indy/publications_herbert/EchoStatesTechRep.pdf. 
20. H. Jaeger, in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, K. Obermayer, Eds. (MIT Press, Cambridge, MA, 2003) pp. 593-600. 
21. G. E. Hinton, in Parallel Models of Associative Memory, G. E. Hinton, J. A. Anderson, Eds. (Erlbaum, Hillsdale, NJ, 1981), pp. 161-187. 
22. D. G. Beiser, J. C. Houk, J. Neurophysiol. 79, 3168 (1998). 
23. S. Dehaene, J.-P. Changeux, J.-P. Nadal, Proc. Natl. Acad. Sci. U.S.A. 84, 2727 (1987). 
24. M. Kawato, in The Handbook of Brain Theory and Neural Networks, M. Arbib, Ed. (MIT Press, Cambridge, MA, 1995), pp. 172-178. 
25. K. Doya, S. Yoshizawa, Neural Netw. 2, 375 (1989). 

Ultrafast Electron Crystallography of Interfacial Water 
Chong-Yu Ruan, Vladimir A. Lobastov, Franco Vigliotti, Songye Chen, Ahmed H. Zewail* 
We report direct determination of the structures and dynamics of interfacial water on a hydrophilic surface with atomic-scale resolution using ultrafast electron crystallography. On the nanometer scale, we observed the coexistence of ordered surface water and crystallite-like ice structures, evident in the superposition of Bragg spots and Debye-Scherrer rings. The structures were determined to be dominantly cubic, but each undergoes different dynamics after the ultrafast substrate temperature jump. From changes in local bond distances (OH···O and O···O) with time, we elucidated the structural changes in the far-from-equilibrium regime at short times and near-equilibration at long times. 

The nature of interfacial molecular assemblies of nanometer scale is of fundamental importance to chemical and biological phenomena (1-4). For water, the directional molecular features of hydrogen bonding (5, 6) and the different structures possible, from amorphous (7) to crystalline (8), make the interfacial (9) collective assembly on the mesoscopic (10) scale much less understood. Structurally, the nature of water on a substrate is determined by forces of orientation at the interface and by the net charge density, which establishes the hydrophilic or hydrophobic character of the substrate. However, the transformation from ordered to disordered structure and their coexistence critically depends on the time scales for the movements of atoms locally and at long range. Therefore, it is essential to elucidate the nature of these structures and the time scales for their equilibration. 
Laboratory for Molecular Sciences, Arthur Amos Noyes Laboratory of Chemical Physics, California Institute of Technology, Pasadena, CA 91125, USA. 
*To whom correspondence should be addressed. E-mail: zewail@caltech.edu 
Here, we report direct determination of the structures of interfacial water with atomic-scale resolution, using diffraction and the dynamics following ultrafast infrared (IR) laser-initiated 
26. W. Maass, T. Natschlager, H. Markram, Neural Comput. 14, 2531 (2002). 
27. W. Maass, T. Natschlager, H. Markram, in Computational Neuroscience: A Comprehensive Approach, J. Feng, Ed. (Chapman & Hall/CRC, 2003), pp. 575-605. 
28. G. B. Stanley, F. F. Li, Y. Dan, J. Neurosci. 19, 8036 (1999). 
29. G. B. Stanley, Neurocomputing 38-40, 1703 (2001). 
30. W. M. Kistler, Ch. I. de Zeeuw, Neural Comput. 14, 2597 (2002). 
31. S. Mussa-Ivaldi, Nature 408, 361 (2000). 
32. The first author thanks T. Christaller for unfaltering support and W. Maass for friendly cooperation. International patents are claimed by Fraunhofer AIS (PCT/EP01/11490). 

Supporting Online Material 
www.sciencemag.org/cgi/content/full/304/5667/78/DC1 Materials and Methods SOM Text Figs. S1 to S4 References 

temperature jump. Interfacial water is formed on a hydrophilic surface (silicon, chlorine-terminated) under controlled ultrahigh vacuum (UHV) conditions (Fig. 1). With these atomic-scale spatial, temporal, and energy resolutions, the evolution of nonequilibrium structures was monitored, their ordered or disordered nature was established, and the time scale for the breakage of long-range bonding and formation of new structures was determined. We identified the structured and ordered interfacial water from the Bragg diffraction and the layered crystallite structure from the Debye-Scherrer rings. The temporal evolution of interfacial water and layered ice after the temperature jump was studied with submonolayer sensitivity. We compared these results with those obtained on hydrophobic surfaces, such as hydrogen-terminated silicon or silver substrate. 
Spectroscopic techniques, such as internal reflection (11) and nonlinear [second-harmonic generation (12) and sum-frequency generation 

            <<FIGURE>>

Fig. 1. Structured water at the hydrophilic interface. The chlorine termination on a <<FORMULA>> substrate forms a hydrophilic layer that orients the water bilayer. The closest packing distance (4.43 Å) between oxygen atoms in the bottom layer of water is similar to the distance (4.50 Å) between the on-top and interstitial sites of the chlorine layer, resulting in specific bilayer orientations (30°) with respect to the silicon substrate. This ordered stacking persists for three to four bilayers (1 nm) before disorientation takes place and results in crystallite islands, forming the layered structure. The size of atoms is not to scale for the van der Waals radii. 
<|endoftext|>


<|startoftext|>
       Identity Mappings in Deep Residual Networks

       Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

                            Microsoft Research

                           Abstract
          
          Deep residual networks [1] have emerged as a family of ex-
          tremely deep architectures showing compelling accuracy and nice con-
          vergence behaviors. In this paper, we analyze the propagation formu-
          lations behind the residual building blocks, which suggest that the for-
          ward and backward signals can be directly propagated from one block
          to any other block, when using identity mappings as the skip connec-
          tions and after-addition activation. A series of ablation experiments sup-
          port the importance of these identity mappings. This motivates us to
          propose a new residual unit, which makes training easier and improves
          generalization. We report improved results using a 1001-layer ResNet
          on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet
          on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.


     1 Introduction

     Deep residual networks (ResNets) [1] consist of many stacked "Residual Units".
     Each unit (Fig.1(a)) can be expressed in a general form:

                         y_l = h(x_l) + F(x_l, W_l),
                         x_{l+1} = f(y_l),

     where x_l and x_{l+1} are input and output of the l-th unit, and F is a residual
     function. In [1], h(x_l) = x_l is an identity mapping and f is a ReLU [2] function.
        ResNets that are over 100-layer deep have shown state-of-the-art accuracy for
     several challenging recognition tasks on ImageNet [3] and MS COCO [4]
     competitions. The central idea of ResNets is to learn the additive residual function F
     with respect to h(x_l), with a key choice of using an identity mapping h(x_l) = x_l.
     This is realized by attaching an identity skip connection ("shortcut").
        In this paper, we analyze deep residual networks by focusing on creating a
     direct path for propagating information not only within a residual unit,
     but through the entire network. Our derivations reveal that if both h(x_l) and
     f(y_l) are identity mappings, the signal could be directly propagated from one
     unit to any other units, in both forward and backward passes. Our experiments
     empirically show that training in general becomes easier when the architecture
     is closer to the above two conditions.
        To understand the role of skip connections, we analyze and compare various
     types of h(x_l). We find that the identity mapping h(x_l) = x_l chosen in [1]

                                          <<FIGURE>>

     Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey
     arrows indicate the easiest paths for the information to propagate, corresponding to
     the additive term "x_l" in Eqn.(4) (forward propagation) and the additive term "1" in
     Eqn.(5) (backward propagation). Right: training curves on CIFAR-10 of 1001-layer
     ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote
     training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train.


     achieves the fastest error reduction and lowest training loss among all variants
     we investigated, whereas skip connections of scaling, gating [5,6,7], and 1x1
     convolutions all lead to higher training loss and error. These experiments suggest
     that keeping a clean information path (indicated by the grey arrows in Figs. 1, 2,
     and 4) is helpful for easing optimization.
        To construct an identity mapping f(y_l) = y_l, we view the activation functions
     (ReLU and BN [8]) as pre-activation of the weight layers, in contrast
     to conventional wisdom of post-activation. This point of view leads to a new
     residual unit design, shown in (Fig.1(b)). Based on this unit, we present com-
     petitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier
     to train and generalizes better than the original ResNet in [1]. We further report
     improved results on ImageNet using a 200-layer ResNet, for which the counter-
     part of [1] starts to overfit. These results suggest that there is much room to
     exploit the dimension of network depth, a key to the success of modern deep
     learning.


     2 Analysis of Deep Residual Networks


     The ResNets developed in [1] are modularized architectures that stack building
     blocks of the same connecting shape. In this paper we call these blocks "Residual
     Units". The original Residual Unit in [1] performs the following computation:

                         y_l = h(x_l) + F(x_l, W_l),                  (1)
                         x_{l+1} = f(y_l).                            (2)

     Here x_l is the input feature to the l-th Residual Unit. W_l = \{W_{l,k} | 1 \le k \le K\} is a
     set of weights (and biases) associated with the l-th Residual Unit, and K is the
     number of layers in a Residual Unit (K is 2 or 3 in [1]). F denotes the residual
     function, e.g., a stack of two 3x3 convolutional layers in [1]. The function f is
     the operation after element-wise addition, and in [1] f is ReLU. The function h
     is set as an identity mapping: h(x_l) = x_l (see footnote 1). If f is also an identity
     mapping: x_{l+1} = y_l, we can put Eqn.(2) into Eqn.(1) and obtain:

                         x_{l+1} = x_l + F(x_l, W_l).                 (3)

     Recursively (x_{l+2} = x_{l+1} + F(x_{l+1}, W_{l+1}) = x_l + F(x_l, W_l) + F(x_{l+1}, W_{l+1}),
     etc.) we will have:

                         x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),    (4)
                               
     for any deeper unit L and any shallower unit l. Eqn.(4) exhibits some nice
     properties.

     (i) The feature x_L of any deeper unit L can be represented as the
     feature x_l of any shallower unit l plus a residual function in a form of \sum_{i=l}^{L-1} F,
     indicating that the model is in a residual fashion between any units L and l.
     (ii) The feature x_L = x_0 + \sum_{i=0}^{L-1} F(x_i, W_i), of any deep unit L, is the summation
     of the outputs of all preceding residual functions (plus x_0). This is in contrast to
     a "plain network" where a feature x_L is a series of matrix-vector products, say,
     \prod_{i=0}^{L-1} W_i x_0 (ignoring BN and ReLU).
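     The additive structure of Eqn.(4), read here as x_L = x_l plus the sum of all intermediate residual outputs, can be checked numerically on a toy model. The sketch below is only an illustration: it assumes scalar features and a hypothetical residual function F(x, w) = tanh(w x), neither of which comes from the paper.

```python
import math


def residual_fn(x, w):
    # Hypothetical scalar stand-in for the residual function F(x, W).
    return math.tanh(w * x)


def forward_identity_shortcut(x_l, weights):
    """Apply x_{i+1} = x_i + F(x_i, w_i) and record every residual output."""
    x, residuals = x_l, []
    for w in weights:
        f = residual_fn(x, w)
        residuals.append(f)
        x = x + f
    return x, residuals


weights = [0.3, -0.7, 0.5, 0.2, -0.4]
x_L, residuals = forward_identity_shortcut(1.5, weights)

# Unrolled form: the deep feature is the shallow feature plus a plain sum.
unrolled = 1.5 + sum(residuals)
```

     Up to floating-point rounding, `x_L` and `unrolled` agree, which is the forward-propagation property (i) stated above for identity shortcuts with identity f.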
        
        Eqn.(4) also leads to nice backward propagation properties. Denoting the
     loss function as E, from the chain rule of backpropagation [9] we have:
     
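          \frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right).       (5)

     Eqn.(5) indicates that the gradient \partial E / \partial x_l can be decomposed into two additive
     terms: a term of \partial E / \partial x_L that propagates information directly without concerning
     any weight layers, and another term of (\partial E / \partial x_L)(\partial / \partial x_l \sum_{i=l}^{L-1} F) that propagates
     through the weight layers. The additive term of \partial E / \partial x_L ensures that information is
     directly propagated back to any shallower unit l. Eqn.(5) also suggests that it is unlikely
     for the gradient \partial E / \partial x_l to be canceled out for a mini-batch, because in general the
     term \partial / \partial x_l \sum_{i=l}^{L-1} F cannot be always -1 for all samples in a mini-batch.
     This implies that the gradient of a layer does not vanish even when the weights are arbitrarily small.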
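     The backward property can also be illustrated numerically. With identity shortcuts, the derivative of x_L with respect to x_l equals 1 plus the derivative of the summed residual outputs, so it stays close to 1 even when the weights are tiny, whereas a plain stack of layers loses the signal. The sketch below assumes scalar features, a hypothetical residual function tanh(w x), and central finite differences in place of backpropagation.

```python
import math


def residual_forward(x, weights):
    """Identity-shortcut stack; returns (x_L, sum of residual outputs)."""
    total_f = 0.0
    for w in weights:
        f = math.tanh(w * x)      # hypothetical scalar residual function
        total_f += f
        x = x + f
    return x, total_f


def plain_forward(x, weights):
    """Plain stack without shortcuts: x_{i+1} = tanh(w_i * x_i)."""
    for w in weights:
        x = math.tanh(w * x)
    return x


def central_diff(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


def gradient_terms(x_l, weights):
    """Finite-difference estimates of dx_L/dx_l and d(sum F)/dx_l."""
    d_xL = central_diff(lambda v: residual_forward(v, weights)[0], x_l)
    d_sum = central_diff(lambda v: residual_forward(v, weights)[1], x_l)
    return d_xL, d_sum


tiny_weights = [1e-6] * 50
residual_grad = gradient_terms(0.7, tiny_weights)[0]
plain_grad = central_diff(lambda v: plain_forward(v, tiny_weights), 0.7)
```

     Here `residual_grad` stays near 1 because of the additive term, while `plain_grad` is vanishingly small, in line with the remark that the gradient does not vanish when the weights are arbitrarily small.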

           1 It is noteworthy that there are Residual Units for increasing dimensions and reducing
       feature map sizes [1] in which h is not identity. In this case the following derivations
       do not hold strictly. But as there are only a very few such units (two on CIFAR and
       three on ImageNet, depending on image sizes [1]), we expect that they do not have
       the exponential impact as we present in Sec.3. One may also think of our derivations
       as applied to all Residual Units within the same feature map size. 

     Discussions

     Eqn.(4) and Eqn.(5) suggest that the signal can be directly propagated from
     any unit to another, both forward and backward. The foundation of Eqn.(4) is
     two identity mappings: (i) the identity skip connection h(x_l) = x_l, and (ii) the
     condition that f is an identity mapping.

        These directly propagated information flows are represented by the grey
     arrows in Figs. 1, 2, and 4. And the above two conditions are true when these grey
     arrows cover no operations (except addition) and thus are clean. In the
     following two sections we separately investigate the impacts of the two conditions.

     3 On the Importance of Identity Skip Connections

     Let’s consider a simple modification, h(x_l) = \lambda_l x_l, to break the identity shortcut:

                         x_{l+1} = \lambda_l x_l + F(x_l, W_l),                  (6)

     where \lambda_l is a modulating scalar (for simplicity we still assume f is identity).
     Recursively applying this formulation we obtain an equation similar to Eqn. (4):
     x_L = (\prod_{i=l}^{L-1} \lambda_i) x_l + \sum_{i=l}^{L-1} (\prod_{j=i+1}^{L-1} \lambda_j) F(x_i, W_i), or simply:

                      x_L = (\prod_{i=l}^{L-1} \lambda_i) x_l + \sum_{i=l}^{L-1} \hat{F}(x_i, W_i),               (7)

     where the notation \hat{F} absorbs the scalars into the residual functions. Similar to
     Eqn.(5), we have backpropagation of the following form:

          \frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( \left( \prod_{i=l}^{L-1} \lambda_i \right) + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}(x_i, W_i) \right).          (8)

     Unlike Eqn.(5), in Eqn.(8) the first additive term is modulated by a factor \prod_{i=l}^{L-1} \lambda_i.
     For an extremely deep network (L is large), if \lambda_i > 1 for all i, this factor can be
     exponentially large; if \lambda_i < 1 for all i, this factor can be exponentially small
     and vanish, which blocks the backpropagated signal from the shortcut and forces it to
     flow through the weight layers. This results in optimization difficulties as we show
     by experiments.
        In the above analysis, the original identity skip connection in Eqn.(3) is
     replaced with a simple scaling h(x_l) = \lambda_l x_l. If the skip connection h(x_l) represents
     more complicated transforms (such as gating and 1x1 convolutions), in Eqn.(8) the first
     term becomes \prod_{i=l}^{L-1} h'_i where h' is the derivative of h. This product may
     also impede information propagation and hamper the training procedure
     as witnessed in the following experiments.
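     To give a sense of scale for the factor multiplying the shortcut term in Eqn.(8), i.e. the product of the shortcut scalars over all units between l and L, the following sketch evaluates it for a stack of 54 units (the number of Residual Units in ResNet-110, as stated in Sec. 3.1). The function name is hypothetical and the constant values are illustrative choices.

```python
import math


def shortcut_factor(lambdas):
    """Product of the shortcut scalars, i.e. the first additive term of Eqn.(8)."""
    return math.prod(lambdas)


n_units = 54  # Residual Units in ResNet-110

identity = shortcut_factor([1.0] * n_units)   # identity shortcuts: factor is 1
halved = shortcut_factor([0.5] * n_units)     # constant scaling by 0.5
amplified = shortcut_factor([1.1] * n_units)  # scalars slightly above 1
```

     With identity shortcuts the factor is exactly 1; with a constant scale of 0.5 it equals 0.5 ** 54, which is below 1e-16, and with scalars above 1 it grows exponentially with depth.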


                                             <<FIGURE>>

     Figure 2. Various types of shortcut connections used in Table 1. The grey arrows
     indicate the easiest paths for the information to propagate. The shortcut connections
     in (b-f) are impeded by different components. For simplifying illustrations we do not
     display the BN layers, which are adopted right after the weight layers for all units here.


     3.1 Experiments on Skip Connections

     We experiment with the 110-layer ResNet as presented in [1] on CIFAR-10 [10].
     This extremely deep ResNet-110 has 54 two-layer Residual Units (consisting of
     3x3 convolutional layers) and is challenging for optimization. Our implementation
     details (see appendix) are the same as [1]. Throughout this paper we report
     the median accuracy of 5 runs for each architecture on CIFAR, reducing the
     impacts of random variations.
        Though our above analysis is driven by identity f, the experiments in this
     section are all based on f = ReLU as in [1]; we address identity f in the next
     section. Our baseline ResNet-110 has 6.61% error on the test set. The comparisons
     of other variants (Fig.2 and Table 1) are summarized as follows:
        Constant scaling. We set \lambda = 0.5 for all shortcuts (Fig.2(b)). We further
     study two cases of scaling F: (i) F is not scaled; or (ii) F is scaled by a constant
     scalar of 1 - \lambda = 0.5, which is similar to the highway gating [6,7] but with frozen
     gates. The former case does not converge well; the latter is able to converge,
     but the test error (Table 1, 12.35%) is substantially higher than the original
     ResNet-110. Fig.3(a) shows that the training error is higher than that of the
     original ResNet-110, suggesting that the optimization has difficulties when the
     shortcut signal is scaled down.

     Table 1. Classification error on the CIFAR-10 test set using ResNet-110 [1], with
     different types of shortcut connections applied to all Residual Units. We report "fail"
     when the test error is higher than 20%.

                     <<TABLE>>

        Exclusive gating. Following the Highway Networks [6,7] that adopt a gating
     mechanism [5], we consider a gating function g(x) = \sigma(W_g x + b_g) where a
     transform is represented by weights W_g and biases b_g followed by the sigmoid
     function \sigma(x) = 1 / (1 + e^{-x}). In a convolutional network g(x) is realized by a 1x1
     convolutional layer. The gating function modulates the signal by element-wise
     multiplication.
        We investigate the exclusive gates as used in [6,7]: the F path is scaled
     by g(x) and the shortcut path is scaled by 1 - g(x). See Fig.2(c). We find that the
     initialization of the biases b_g is critical for training gated models, and following
     the guidelines (footnote 2) in [6,7], we conduct hyper-parameter search on the initial value of
     b_g in the range of 0 to -10 with a decrement step of -1 on the training set by cross-
     validation. The best value (-6 here) is then used for training on the training
     set, leading to a test result of 8.70% (Table 1), which still lags far behind the
     ResNet-110 baseline. Fig.3(b) shows the training curves. Table 1 also reports the
     results of using other initialized values, noting that the exclusive gating network
     does not converge to a good solution when b_g is not appropriately initialized.
        The impact of the exclusive gating mechanism is two-fold. When 1 - g(x)
     approaches 1, the gated shortcut connections are closer to identity which helps
     information propagation; but in this case g(x) approaches 0 and suppresses the
     function F. To isolate the effects of the gating functions on the shortcut path
     alone, we investigate a non-exclusive gating mechanism next.
        Shortcut-only gating. In this case the function F is not scaled; only the
     shortcut path is gated by 1 - g(x). See Fig.2(d). The initialized value of b_g is still
     essential in this case. When the initialized b_g is 0 (so initially the expectation
     of 1 - g(x) is 0.5), the network converges to a poor result of 12.86% (Table 1).
     This is also caused by higher training error (Fig.3(c)).

                                             <<FIGURE>>

     Figure 3. Training curves on CIFAR-10 of various shortcuts. Solid lines denote test
     error (y-axis on the right), and dashed lines denote training loss (y-axis on the left).


        When the initialized b_g is very negatively biased (e.g., -6), the value of
     1 - g(x) is closer to 1 and the shortcut connection is nearly an identity mapping.
     Therefore, the result (6.91%, Table 1) is much closer to the ResNet-110 baseline.
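     The effect of the bias initialization on a gated shortcut can be made concrete with a small calculation. The sketch below evaluates the shortcut scale 1 - g(x) for a sigmoid gate, under the simplifying assumption that the weighted input W_g x contributes 0 at initialization so that only the bias b_g matters; the function names and the 54-unit stack are illustrative choices, not the authors' code.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shortcut_gate(b_g, pre_activation=0.0):
    """Shortcut scale 1 - g(x), with g(x) = sigmoid(W_g x + b_g).

    pre_activation stands in for W_g x; it is assumed to be 0 by default."""
    return 1.0 - sigmoid(pre_activation + b_g)


def stacked_gate(b_g, n_units=54):
    """Scale accumulated along the shortcut path of n_units gated units."""
    return shortcut_gate(b_g) ** n_units


gate_at_zero = shortcut_gate(0.0)     # bias 0: shortcut is halved
gate_at_minus6 = shortcut_gate(-6.0)  # bias -6: shortcut is nearly identity
```

     Under this assumption a bias of 0 halves the shortcut signal in every unit, while a bias of -6 leaves it almost untouched, so the accumulated scale over 54 units stays of order 1 instead of collapsing toward zero.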
        1x1 convolutional shortcut. Next we experiment with 1x1 convolutional
     shortcut connections that replace the identity. This option has been investigated
     in [1] (known as option C) on a 34-layer ResNet (16 Residual Units) and shows
     good results, suggesting that 1x1 shortcut connections could be useful. But we
     find that this is not the case when there are many Residual Units. The 110-layer
     ResNet has a poorer result (12.22%, Table 1) when using 1x1 convolutional
     shortcuts. Again, the training error becomes higher (Fig.3(d)). When stacking
     so many Residual Units (54 for ResNet-110), even the shortest path may still
     impede signal propagation. We witnessed similar phenomena on ImageNet with
     ResNet-101 when using 1x1 convolutional shortcuts.
        Dropout shortcut. Last we experiment with dropout [11] (at a ratio of 0.5)
     which we adopt on the output of the identity shortcut (Fig.2(f)). The network
     fails to converge to a good solution. Dropout statistically imposes a scale of \lambda
     with an expectation of 0.5 on the shortcut, and similar to constant scaling by
     0.5, it impedes signal propagation. 

     Table 2. Classification error (%) on the CIFAR-10 test set using different activation
     functions.

                                  <<TABLE>>

                                 <<FIGURE>>

     Figure 4. Various usages of activation in Table 2. All these units consist of the same
     components; only the orders are different.


     3.2 Discussions
     As indicated by the grey arrows in Fig.2, the shortcut connections are the
     most direct paths for the information to propagate. Multiplicative manipulations
     (scaling, gating, 1x1 convolutions, and dropout) on the shortcuts can hamper
     information propagation and lead to optimization problems.
        It is noteworthy that the gating and 1x1 convolutional shortcuts introduce
     more parameters, and should have stronger representational abilities than
     identity shortcuts. In fact, the shortcut-only gating and 1x1 convolution cover the
     solution space of identity shortcuts (i.e., they could be optimized as identity
     shortcuts). However, their training error is higher than that of identity short-
     cuts, indicating that the degradation of these models is caused by optimization
     issues, instead of representational abilities.


     4 On the Usage of Activation Functions

     Experiments in the above section support the analysis in Eqn.(5) and Eqn.(8),
     both being derived under the assumption that the after-addition activation f
     is the identity mapping. But in the above experiments f is ReLU as designed
     in [1], so Eqn.(5) and (8) are approximate in the above experiments. Next we
     investigate the impact of f.
        We want to make f an identity mapping, which is done by re-arranging
     the activation functions (ReLU and/or BN). The original Residual Unit in [1]
     has a shape in Fig.4(a): BN is used after each weight layer, and ReLU is
     adopted after BN except that the last ReLU in a Residual Unit is after
     element-wise addition (f = ReLU). Fig.4(b-e) show the alternatives we investigated,
     explained as follows.

     4.1 Experiments on Activation
     In this section we experiment with ResNet-110 and a 164-layer Bottleneck [1]
     architecture (denoted as ResNet-164). A bottleneck Residual Unit consists of a
     1x1 layer for reducing dimension, a 3x3 layer, and a 1x1 layer for restoring
     dimension. As designed in [1], its computational complexity is similar to the
     two-3x3 Residual Unit. More details are in the appendix. The baseline ResNet-
     164 has a competitive result of 5.93% on CIFAR-10 (Table2).
        BN after addition. Before turning f into an identity mapping, we go the
     opposite way by adopting BN after addition (Fig.4(b)). In this case f involves
     BN and ReLU. The results become considerably worse than the baseline (Table 2).
     Unlike the original design, now the BN layer alters the signal that passes
     through the shortcut and impedes information propagation, as reflected by the
     difficulties on reducing training loss at the beginning of training (Fig.6 left).
        ReLU before addition. A naive choice of making f into an identity
     mapping is to move the ReLU before addition (Fig.4(c)). However, this leads to
     a non-negative output from the transform F, while intuitively a residual
     function should take values in (-∞, +∞). As a result, the forward propagated
     signal is monotonically increasing. This may impact the representational
     ability, and the result is worse (7.84%, Table 2) than the baseline. We expect
     to have a residual function taking values in (-∞, +∞). This condition is
     satisfied by other Residual Units including the following ones.
        Post-activation or pre-activation? In the original design (Eqn.(1) and
     Eqn.(2)), the activation <<FORMULA>> affects both paths in the next Residual
     Unit: <<FORMULA>>. Next we develop an asymmetric form
     where an activation f only affects the F path: <<FORMULA>>, for
     any l (Fig.5(a) to (b)). By renaming the notations, we have the following form:

                        <<FORMULA>>,                (9)

     It is easy to see that Eqn.(9) is similar to Eqn.(4), and can enable a backward
     formulation similar to Eqn.(5). For this new Residual Unit as in Eqn.(9), the new
     after-addition activation becomes an identity mapping. This design means that
     if a new after-addition activation f is asymmetrically adopted, it is equivalent
     to recasting f as the pre-activation of the next Residual Unit. This is illustrated
     in Fig.5. 
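
     The difference between the original unit and the full pre-activation unit can
     be sketched in a few lines of NumPy. This is only an illustration under stated
     assumptions: dense matrices stand in for the convolutional weight layers, BN
     has no learned scale or shift, and all function names are hypothetical. It
     shows that, when the after-addition activation is the identity, the output of
     a stack equals the input plus the sum of all residual-branch outputs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Simplified BN: normalize each feature over the batch axis (no scale/shift).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_branch(x, w1, w2, pre_activation):
    if pre_activation:
        # Full pre-activation ordering: BN -> ReLU -> weight, applied twice.
        h = relu(batch_norm(x)) @ w1
        return relu(batch_norm(h)) @ w2
    # Original ordering: weight -> BN -> ReLU -> weight -> BN.
    h = relu(batch_norm(x @ w1))
    return batch_norm(h @ w2)

def original_unit(x, w1, w2):
    # After-addition activation f = ReLU.
    return relu(x + residual_branch(x, w1, w2, pre_activation=False))

def preact_unit(x, w1, w2):
    # After-addition activation is the identity.
    return x + residual_branch(x, w1, w2, pre_activation=True)

def run_preact_stack(x0, weights):
    # Track the running sum of branch outputs alongside the hidden state.
    x, branch_sum = x0, np.zeros_like(x0)
    for w1, w2 in weights:
        f = residual_branch(x, w1, w2, pre_activation=True)
        branch_sum += f
        x = x + f
    return x, branch_sum

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 4))
weights = [(0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 4)))
           for _ in range(5)]
xL, branch_sum = run_preact_stack(x0, weights)
print(np.allclose(xL, x0 + branch_sum))
```

     With original_unit the same additive decomposition does not hold in general,
     because the ReLU after each addition can truncate the merged signal.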

               <<FIGURE>>

     Figure 5. Using asymmetric after-addition activation is equivalent to constructing a
     pre-activation Residual Unit.

     Table 3. Classification error (%) on the CIFAR-10/100 test set using the original
     Residual Units and our pre-activation Residual Units.

            <<TABLE>>

        The distinction between post-activation/pre-activation is caused by the presence
     of the element-wise addition. For a plain network that has N layers, there
     are N-1 activations (BN/ReLU), and it does not matter whether we think of
     them as post- or pre-activations. But for branched layers merged by addition,
     the position of activation matters.
        We experiment with two such designs: (i) ReLU-only pre-activation (Fig.4(d)),
     and (ii) full pre-activation (Fig.4(e)) where BN and ReLU are both adopted
     before weight layers. Table 2 shows that the ReLU-only pre-activation performs
     very similarly to the baseline on ResNet-110/164. This ReLU layer is not used in
     conjunction with a BN layer, and may not enjoy the benefits of BN [8].
        Somewhat surprisingly, when BN and ReLU are both used as pre-activation,
     the results are improved by healthy margins (Table 2 and Table 3). In Table 3 we
     report results using various architectures: (i) ResNet-110, (ii) ResNet-164, (iii)
     a 110-layer ResNet architecture in which each shortcut skips only 1 layer (i.e.,

       <<FIGURE>>

     Figure 6. Training curves on CIFAR-10. Left: BN after addition (Fig.4(b)) using
     ResNet-110. Right: pre-activation unit (Fig.4(e)) on ResNet-164. Solid lines denote
     test error, and dashed lines denote training loss.


     a Residual Unit has only 1 layer, denoted as “ResNet-110(1layer)”), and (iv)
     a 1001-layer bottleneck architecture that has 333 Residual Units (111 on each
     feature map size), denoted as “ResNet-1001”. We also experiment on
     CIFAR-100. Table 3 shows that our pre-activation models are consistently better
     than the baseline counterparts. We analyze these results in the following.
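
     The layer counts quoted above can be checked with simple arithmetic. The
     sketch below is illustrative: the helper name is hypothetical, and it assumes
     three feature map sizes plus two extra weight layers (a first convolution and
     a final classifier). The 111 units per feature map size for the 1001-layer
     network comes from the text; the 18 units per size for ResNet-110 and
     ResNet-164 are inferred from their depths under the same assumption.

```python
def resnet_depth(units_per_stage, layers_per_unit, num_stages=3, extra_layers=2):
    # Total weight layers = stages * units per stage * layers per unit
    # + extra layers (assumed: first conv and final classifier).
    return num_stages * units_per_stage * layers_per_unit + extra_layers

depth_1001 = resnet_depth(units_per_stage=111, layers_per_unit=3)  # bottleneck units
depth_164 = resnet_depth(units_per_stage=18, layers_per_unit=3)    # bottleneck units
depth_110 = resnet_depth(units_per_stage=18, layers_per_unit=2)    # two-3x3 units
print(depth_1001, depth_164, depth_110)
```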


     4.2 Analysis

     We find the impact of pre-activation is twofold. First, the optimization is further
     eased (compared with the baseline ResNet) because f is an identity mapping.
     Second, using BN as pre-activation improves regularization of the models.
        Ease of optimization. This effect is particularly obvious when training
     the 1001-layer ResNet. Fig.1 shows the curves. Using the original design in
     [1], the training error is reduced very slowly at the beginning of training. For
     f = ReLU, the signal is impacted if it is negative, and when there are many
     Residual Units, this effect becomes prominent and Eqn.(3) (so Eqn.(5)) is not
     a good approximation. On the other hand, when f is an identity mapping, the
     signal can be propagated directly between any two units. Our 1001-layer network
     reduces the training loss very quickly (Fig.1). It also achieves the lowest loss
     among all models we investigated, suggesting the success of optimization.
        We also find that the impact of f = ReLU is not severe when the ResNet
     has fewer layers (e.g., 164 in Fig.6(right)). The training curve seems to suffer
     a little bit at the beginning of training, but goes into a healthy status soon. By
     monitoring the responses we observe that this is because after some training,
     the weights are adjusted into a status such that y_l in Eqn.(1) is more frequently
     above zero and f does not truncate it (x_l is always non-negative due to the previous
     ReLU, so y_l is below zero only when the output of F is very negative).
     The truncation, however, is more frequent when there are 1000 layers.

     Table 4. Comparisons with state-of-the-art methods on CIFAR-10 and CIFAR-100
     using “moderate data augmentation” (flip/translation), except for ELU [12] with no
     augmentation. Better results of [13,14] have been reported using stronger data
     augmentation and ensembling. For the ResNets we also report the number of
     parameters. Our results are the median of 5 runs with mean±std in the brackets.
     All ResNets results are obtained with a mini-batch size of 128 except † with a
     mini-batch size of 64 (code available at
     https://github.com/KaimingHe/resnet-1k-layers).

         <<TABLE>>

        Reducing overfitting. Another impact of using the proposed pre-activation
     unit is on regularization, as shown in Fig.6(right). The pre-activation
     version reaches slightly higher training loss at convergence, but produces lower
     test error. This phenomenon is observed on ResNet-110, ResNet-110(1-layer), and
     ResNet-164 on both CIFAR-10 and 100. This is presumably caused by BN’s
     regularization effect [8]. In the original Residual Unit (Fig.4(a)), although the BN
     normalizes the signal, this is soon added to the shortcut and thus the merged
     signal is not normalized. This unnormalized signal is then used as the input of
     the next weight layer. In contrast, in our pre-activation version, the inputs
     to all weight layers have been normalized.


     5 Results

     Comparisons on CIFAR-10/100. Table 4 compares the state-of-the-art
     methods on CIFAR-10/100, where we achieve competitive results. We note that we
     do not specially tailor the network width or filter sizes, nor use regularization
     techniques (such as dropout) which are very effective for these small datasets.
     We obtain these results via a simple but essential concept: going deeper. These
     results demonstrate the potential of pushing the limits of depth.

     Comparisons on ImageNet. Next we report experimental results on the
     1000-class ImageNet dataset [3]. We have done preliminary experiments using the
     skip connections studied in Fig.2 & 3 on ImageNet with ResNet-101 [1], and observed
     similar optimization difficulties. The training error of these non-identity shortcut
     networks is obviously higher than the original ResNet at the first learning rate

     Table 5. Comparisons of single-crop error on the ILSVRC 2012 validation set. All
     ResNets are trained using the same hyper-parameters and implementations as [1].
     Our Residual Units are the full pre-activation version (Fig.4(e)). †: code/model
     available at https://github.com/facebook/fb.resnet.torch/tree/master/pretrained,
     using scale and aspect ratio augmentation in [20].

       <<TABLE>>

     (similar to Fig.3), and we decided to halt training due to limited resources.
     But we did finish a BN after addition version (Fig.4(b)) of ResNet-101 on
     ImageNet and observed higher training loss and validation error. This model’s
     single-crop (224x224) validation error is 24.6%/7.5%, vs. the original
     ResNet-101’s 23.6%/7.1%. This is in line with the results on CIFAR in Fig.6(left).
        Table 5 shows the results of ResNet-152 [1] and ResNet-200³, all trained from
     scratch. We notice that the original ResNet paper [1] trained the models using
     scale jittering with shorter side s ∈ [256, 480], and so the test of a 224x224 crop
     on s = 256 (as done in [1]) is negatively biased. Instead, we test a single 320x320
     crop from s = 320, for all original and our ResNets. Even though the ResNets
     are trained on smaller crops, they can be easily tested on larger crops because
     the ResNets are fully convolutional by design. This size is also close to 299x299
     used by Inception v3 [19], allowing a fairer comparison.
        The original ResNet-152 [1] has top-1 error of 21.3% on a 320x320 crop, and
     our pre-activation counterpart has 21.1%. The gain is not big on ResNet-152
     because this model has not shown severe generalization difficulties. However,
     the original ResNet-200 has an error rate of 21.8%, higher than the baseline
     ResNet-152. But we find that the original ResNet-200 has lower training error
     than ResNet-152, suggesting that it suffers from overfitting.
        Our pre-activation ResNet-200 has an error rate of 20.7%, which is 1.1%
     lower than the baseline ResNet-200 and also lower than the two versions of
     ResNet-152. When using the scale and aspect ratio augmentation of [20,19], our
     ResNet-200 has a result better than Inception v3 [19] (Table 5). Concurrent
     with our work, an Inception-ResNet-v2 model [21] achieves a single-crop result
     of 19.9%/4.9%. We expect our observations and the proposed Residual Unit will
     help this type and generally other types of ResNets.

     Computational Cost. Our models’ computational complexity is linear in
     depth (so a 1001-layer net is about 10x as complex as a 100-layer net). On CIFAR,
     ResNet-1001 takes about 27 hours to train on 2 GPUs; on ImageNet,
     ResNet-200 takes about 3 weeks to train on 8 GPUs (on par with VGG nets [22]).

      ³ The ResNet-200 has 16 more 3-layer bottleneck Residual Units than ResNet-152,
        which are added on the feature map of 28x28.


     6 Conclusions


     This paper investigates the propagation formulations behind the connection
     mechanisms of deep residual networks. Our derivations imply that identity short-
     cut connections and identity after-addition activation are essential for making
     information propagation smooth. Ablation experiments demonstrate phenom-
     ena that are consistent with our derivations. We also present 1000-layer deep
     networks that can be easily trained and achieve improved accuracy.


     Appendix: Implementation Details. The implementation details and
     hyper-parameters are the same as those in [1]. On CIFAR we use only the
     translation and flipping augmentation in [1] for training. The learning rate
     starts from 0.1, and is divided by 10 at 32k and 48k iterations. Following [1],
     for all CIFAR experiments we warm up the training by using a smaller learning
     rate of 0.01 for the first 400 iterations and go back to 0.1 after that,
     although we remark that this is not necessary for our proposed Residual Unit.
     The mini-batch size is 128 on 2 GPUs (64 each), the weight decay is 0.0001,
     the momentum is 0.9, and the weights are initialized as in [23].
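
     The CIFAR learning-rate schedule described in this paragraph can be written
     as a small function. This is a sketch rather than the authors' code: the
     function name is hypothetical, and the exact boundary handling (whether the
     change takes effect at iteration 400, 32k and 48k or one step later) is an
     assumption.

```python
def cifar_learning_rate(iteration, base_lr=0.1, warmup_lr=0.01,
                        warmup_iters=400, milestones=(32_000, 48_000),
                        factor=0.1):
    # Warm-up: a smaller constant rate for the first `warmup_iters` iterations.
    if iteration < warmup_iters:
        return warmup_lr
    # Afterwards: start from base_lr and divide by 10 at each milestone passed.
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= factor
    return lr

for it in (0, 399, 400, 31_999, 32_000, 48_000):
    print(it, cifar_learning_rate(it))
```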
        On ImageNet, we train the models using the same data augmentation as in
     [1]. The learning rate starts from 0.1 (no warming up), and is divided by 10 at
     30 and 60 epochs. The mini-batch size is 256 on 8 GPUs (32 each). The weight
     decay, momentum, and weight initialization are the same as above.
        When using the pre-activation Residual Units (Fig.4(d)(e) and Fig.5), we
     pay special attention to the first and the last Residual Units of the entire
     network. For the first Residual Unit (that follows a stand-alone convolutional
     layer, conv1), we adopt the first activation right after conv1 and before splitting into
     two paths; for the last Residual Unit (followed by average pooling and a fully-
     connected classifier), we adopt an extra activation right after its element-wise
     addition. These two special cases are the natural outcome when we obtain the
     pre-activation network via the modification procedure as shown in Fig.5.
        The bottleneck Residual Units (for ResNet-164/1001 on CIFAR) are
     constructed following [1]. For example, a [3x3, 16; 3x3, 16] unit in ResNet-110
     is replaced with a [1x1, 16; 3x3, 16; 1x1, 64] unit in ResNet-164, both of which
     have roughly the same number of parameters. For the bottleneck ResNets, when
     reducing the feature map size we use projection shortcuts [1] for increasing
     dimensions, and when pre-activation is used, these projection shortcuts are
     also with pre-activation.
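
     The claim that the two unit types have roughly the same number of parameters
     can be checked by counting convolution weights. The sketch below is
     illustrative and rests on assumptions not stated in the text: the basic unit
     (two 3x3 layers with 16 filters) maps 16 channels to 16 channels, the
     bottleneck unit (1x1 with 16, 3x3 with 16, 1x1 with 64 filters) receives 64
     input channels from the preceding bottleneck unit, and biases and BN
     parameters are ignored. The helper name is hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    # Number of weights in a kernel x kernel convolution (no bias).
    return kernel * kernel * c_in * c_out

# Basic unit: two 3x3 convolutions, 16 -> 16 -> 16 channels.
basic_unit = conv_params(3, 16, 16) + conv_params(3, 16, 16)

# Bottleneck unit: 1x1 (64 -> 16), 3x3 (16 -> 16), 1x1 (16 -> 64).
bottleneck_unit = (conv_params(1, 64, 16)
                   + conv_params(3, 16, 16)
                   + conv_params(1, 16, 64))

print(basic_unit, bottleneck_unit)
```

     Under these assumptions the counts are 4608 and 4352 weights per unit, which
     is consistent with "roughly the same".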

      References

       1.He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
         In: CVPR. (2016)
       2.Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann
         machines. In: ICML. (2010)
       3.Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
         Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
         Scale Visual Recognition Challenge. IJCV (2015)
       4.Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P.,
         Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
       5.Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
         (1997)
       6.Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. In: ICML
         workshop. (2015)
       7.Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In:
         NIPS. (2015)
       8.Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
         reducing internal covariate shift. In: ICML. (2015)
       9.LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
         Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
         computation (1989)
      10.Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech Report
         (2009)
      11.Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
         Improving neural networks by preventing co-adaptation of feature detectors.
         arXiv:1207.0580 (2012)
      12.Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
         learning by exponential linear units (ELUs). In: ICLR. (2016)
      13.Graham, B.: Fractional max-pooling. arXiv:1412.6071 (2014)
      14.Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplic-
         ity: The all convolutional net. arXiv:1412.6806 (2014)
      15.Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR. (2014)
      16.Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In:
         AISTATS. (2015)
      17.Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
         Hints for thin deep nets. In: ICLR. (2015)
      18.Mishkin, D., Matas, J.: All you need is a good init. In: ICLR. (2016)
      19.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the
         inception architecture for computer vision. In: CVPR. (2016)
      20.Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
         Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
      21.Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact
         of residual connections on learning. arXiv:1602.07261 (2016)
      22.Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
         image recognition. In: ICLR. (2015)
      23.He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing
         human-level performance on imagenet classification. In: ICCV. (2015)
<|endoftext|>


<|startoftext|>
                         Language Models are Few-Shot Learners

                  Tom B. Brown       Benjamin Mann       Nick Ryder       Melanie Subbiah 


              Jared Kaplan†   Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry

              Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan

                Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter

                Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray

                      Benjamin Chess Jack Clark Christopher Berner

                  Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei


                                               OpenAI


                                               Abstract

                 Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training
                 on a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic
                 in architecture, this method still requires task-speciﬁc ﬁne-tuning datasets of thousands or tens of
                 thousands of examples. By contrast, humans can generally perform a new language task from only
                 a few examples or from simple instructions – something which current NLP systems still largely
                 struggle to do. Here we show that scaling up language models greatly improves task-agnostic,
                 few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art ﬁne-
                 tuning approaches. Speciﬁcally, we train GPT-3, an autoregressive language model with 175 billion
                 parameters, 10x more than any previous non-sparse language model, and test its performance in
                 the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or ﬁne-tuning,
                 with tasks and few-shot demonstrations speciﬁed purely via text interaction with the model. GPT-3
                 achieves strong performance on many NLP datasets, including translation, question-answering, and
                 cloze tasks, as well as several tasks that require on-the-ﬂy reasoning or domain adaptation, such as
                 unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same
                 time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some
                 datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally,
                 we ﬁnd that GPT-3 can generate samples of news articles which human evaluators have difﬁculty
                 distinguishing from articles written by humans. We discuss broader societal impacts of this ﬁnding
                 and of GPT-3 in general.


              Equal contribution
              † Johns Hopkins University, OpenAI

                      Contents

           1 Introduction
           2 Approach
              2.1 Model and Architectures
              2.2 Training Dataset
              2.3 Training Process
              2.4 Evaluation
           3 Results
              3.1 Language Modeling, Cloze, and Completion Tasks
              3.2 Closed Book Question Answering
              3.3 Translation
              3.4 Winograd-Style Tasks
              3.5 Common Sense Reasoning
              3.6 Reading Comprehension
              3.7 SuperGLUE
              3.8 NLI
              3.9 Synthetic and Qualitative Tasks
           4 Measuring and Preventing Memorization Of Benchmarks
           5 Limitations
           6 Broader Impacts
              6.1 Misuse of Language Models
              6.2 Fairness, Bias, and Representation
              6.3 Energy Usage
           7 Related Work
           8 Conclusion
           A Details of Common Crawl Filtering
           B Details of Model Training
           C Details of Test Set Contamination Studies
           D Total Compute Used to Train Language Models
           E Human Quality Assessment of Synthetic News Articles
           F Additional Samples from GPT-3
           G Details of Task Phrasing and Speciﬁcations
           H Results on All Tasks for All Model Sizes


                                                            1 Introduction

           Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly
           ﬂexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word
           vectors [MCCD13, PSM14] and fed to task-speciﬁc architectures, then RNNs with multiple layers of representations
           and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to
           task-speciﬁc architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have
           been directly ﬁne-tuned, entirely removing the need for task-speciﬁc architectures [RNSS18, DCLT18, HR18].

           This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension,
           question answering, textual entailment, and many others, and has continued to advance based on new architectures
           and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while
           the architecture is task-agnostic, there is still a need for task-speciﬁc datasets and task-speciﬁc ﬁne-tuning: to achieve
           strong performance on a desired task typically requires ﬁne-tuning on a dataset of thousands to hundreds of thousands
           of examples speciﬁc to that task. Removing this limitation would be desirable, for several reasons.

           First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the
           applicability of language models. There exists a very wide range of possible useful language tasks, encompassing
           anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many
           of these tasks it is difﬁcult to collect a large supervised training dataset, especially when the process must be repeated
           for every new task.

           Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness
           of the model and the narrowness of the training distribution. This can create problems for the pre-training plus
           ﬁne-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then
           ﬁne-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily
           generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm
           can be poor because the model is overly speciﬁc to the training distribution and does not generalize well outside it
           [YdC+19, MPL19]. Thus, the performance of ﬁne-tuned models on speciﬁc benchmarks, even when it is nominally at
           human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].

           Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural
           language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number
           of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often

                                                        <<FIGURE>>

           Figure 1.1: Language model meta-learning. During unsupervised pre-training, a language model develops a broad
           set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize
           the desired task. We use the term “in-context learning” to describe the inner loop of this process, which occurs within
           the forward-pass upon each sequence. The sequences in this diagram are not intended to be representative of the data a
           model would see during pre-training, but are intended to show that there are sometimes repeated sub-tasks embedded
           within a single sequence.

                                                <<FIGURE>>

           Figure 1.2: Larger models make increasingly efﬁcient use of in-context information. We show in-context learning
           performance on a simple task requiring the model to remove random symbols from a word, both with and without a
           natural language task description (see Sec.3.9.2). The steeper “in-context learning curves” for large models demonstrate
           improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range
           of tasks.


           sufﬁcient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing
           to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans
           to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy
           dialogue. To be broadly useful, we would someday like our NLP systems to have this same ﬂuidity and generality.

           One potential route towards addressing these issues is meta-learning¹ – which in the context of language models means
           the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities
           at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19]
           attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form
           of task speciﬁcation: the model is conditioned on a natural language instruction and/or a few demonstrations of the task
           and is then expected to complete further instances of the task simply by predicting what comes next.

           While it has shown some initial promise, this approach still achieves results far inferior to ﬁne-tuning – for example
           [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind
           the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of
           solving language tasks.

           Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer
           language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters
           [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19],
           and ﬁnally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream
           NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a
           smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and
           tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong
           gains with scale.

              ¹ In the context of language models this has sometimes been called “zero-shot transfer”, but this term is potentially ambiguous:
           the method is “zero-shot” in the sense that no gradient updates are performed, but it often involves providing inference-time
           demonstrations to the model, so is not truly learning from zero examples. To avoid this confusion, we use the term “meta-learning”
           to capture the inner-loop / outer-loop structure of the general method, and the term “in-context learning” to refer to the inner
           loop of meta-learning. We further specialize the description to “zero-shot”, “one-shot”, or “few-shot” depending on how many
           demonstrations are provided at inference time. These terms are intended to remain agnostic on the question of whether the model
           learns new tasks from scratch at inference time or simply recognizes patterns seen during training – this is an important issue which
           we discuss later in the paper, but “meta-learning” is intended to encompass both possibilities, and simply describes the inner-outer
           loop structure.

                                                        <<FIGURE>>

           Figure 1.3: Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance
           improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are
           more proﬁcient at in-context learning. See Figure 3.8 for a more detailed analysis on SuperGLUE, a standard NLP
           benchmark suite.


           In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call
           GPT-3, and measuring its in-context learning abilities. Speciﬁcally, we evaluate GPT-3 on over two dozen NLP datasets,
           as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training
           set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we
           allow as many demonstrations as will ﬁt into the model’s context window (typically 10 to 100), (b) “one-shot learning”,
           where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only
           an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional
           ﬁne-tuning setting, but we leave this to future work.
           Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to
           remove extraneous symbols from a word. Model performance improves with the addition of a natural language task
           description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically
           with model size. Though the results in this case are particularly striking, the general trends with both model size and
           number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no
           gradient updates or ﬁne-tuning, just increasing numbers of demonstrations given as conditioning.
           Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot
           setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held
           by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in
           the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the
           zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art
           relative to ﬁne-tuned models operating in the same closed-book setting.
           GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning,
           which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them
           deﬁned only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human
           evaluators have difﬁculty distinguishing from human-generated articles.
           At the same time, we also ﬁnd some tasks on which few-shot performance struggles, even at the scale of GPT-3. This
           includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE
           or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we
           hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.
           A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should
           not be seen as a rigorous or meaningful benchmark in itself).

           We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models
           on datasets such as Common Crawl, which can potentially include content from test datasets simply because such
           content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify
           its distorting effects. Although we ﬁnd that data contamination has a minimal effect on GPT-3’s performance on most
           datasets, we do identify a few datasets where it could be inﬂating results, and we either do not report results on these
           datasets or we note them with an asterisk, depending on the severity.
           In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion
           parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most
           tasks we ﬁnd relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap
           between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models
           are more proﬁcient meta-learners.
           Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and
           broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.
           The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training
           GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings.
           Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3.
           Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.


           2 Approach

           Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC + 19],
           with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use
           of in-context learning is also similar to [RWC + 19], but in this work we systematically explore different settings for
           learning within the context. Therefore, we start this section by explicitly deﬁning and contrasting the different settings
           that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a
           spectrum of how much task-speciﬁc data they tend to rely on. Speciﬁcally, we can identify at least four points on this
           spectrum (see Figure 2.1 for an illustration):

                • Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of
                 a pre-trained model by training on a supervised dataset speciﬁc to the desired task. Typically thousands to
                 hundreds of thousands of labeled examples are used. The main advantage of ﬁne-tuning is strong performance
                 on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential
                 for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the
                 training data [GSL + 18,NK19], potentially resulting in an unfair comparison with human performance. In
                 this work we do not ﬁne-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be
                 ﬁne-tuned in principle and this is a promising direction for future work.
                • Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few
                  demonstrations of the task at inference time as conditioning [RWC + 19], but no weight updates are allowed.
                  As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example
                  an English sentence and the French translation), and few-shot works by giving K examples of context and
                  completion, and then one final example of context, with the model expected to provide the completion. We
                  typically set K in the range of 10 to 100 as this is how many examples can fit in the model’s context window
                  ($n_{\text{ctx}} = 2048$). The main advantages of few-shot are a major reduction in the need for task-specific data and
                 reduced potential to learn an overly narrow distribution from a large but narrow ﬁne-tuning dataset. The main
                 disadvantage is that results from this method have so far been much worse than state-of-the-art ﬁne-tuned
                  models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot
                 learning as described here for language models is related to few-shot learning as used in other contexts in
                 ML [HYC01,VBL + 16] – both involve learning based on a broad distribution of tasks (in this case implicit in
                 the pre-training data) and then rapidly adapting to a new task.
                • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural
                  language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and
                 zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans.
                 For example, when asking humans to generate a dataset on a human worker service (for example Mechanical
                 Turk), it is common to give one demonstration of the task. By contrast it is sometimes difﬁcult to communicate
                 the content or format of a task if no examples are given.

                 <<FIGURE>>

            Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional ﬁne-tuning. The panels above show
           four methods for performing a task with a language model – ﬁne-tuning is the traditional method, whereas zero-, one-,
           and few-shot, which we study in this work, require the model to perform the task with only forward passes at test
           time. We typically present the model with a few dozen examples in the few-shot setting. Exact phrasings for all task
           descriptions, examples and prompts can be found in Appendix G.


                • Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given
                 a natural language instruction describing the task. This method provides maximum convenience, potential for
                 robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of
                 pre-training data), but is also the most challenging setting. In some cases it may even be difﬁcult for humans
                 to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”.
                 For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be
                 ambiguous, as it may not be clear exactly what format the table should have or what should be included (and
                 even with careful clariﬁcation, understanding precisely what is desired can be difﬁcult). Nevertheless, for at
                 least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example
                  in Figure 2.1, a human would likely know what to do from just the text instruction.
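
The three in-context settings listed above differ only in how many demonstrations precede the final context. A minimal sketch of that idea follows. The helper name `build_prompt`, the `=>` separator, and the single-newline delimiter are illustrative assumptions, not the exact prompt format used for GPT-3 (those phrasings are given in Appendix G).

```python
def build_prompt(task_description, demonstrations, query, k, delimiter="\n"):
    """Assemble a zero-shot (k=0), one-shot (k=1), or few-shot (k>1) prompt.

    demonstrations is a list of (context, completion) pairs. The "=>" separator
    is an assumption made for illustration only.
    """
    if k > len(demonstrations):
        raise ValueError("not enough demonstrations for the requested k")
    parts = [task_description] if task_description else []
    for context, completion in demonstrations[:k]:
        parts.append(f"{context} => {completion}")
    # The final context is left open so that the model supplies the completion.
    parts.append(f"{query} =>")
    return delimiter.join(parts)


demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
description = "Translate English to French:"
zero_shot = build_prompt(description, demos, "cheese", k=0)
one_shot = build_prompt(description, demos, "cheese", k=1)
few_shot = build_prompt(description, demos, "cheese", k=2)
```

No weights are updated in any of the three cases; the only thing that changes is the text the model is conditioned on.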

           Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on
           zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different
           problem settings which offer a varying trade-off between performance on speciﬁc benchmarks and sample efﬁciency.
           We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art ﬁne-tuned models.
           Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance,
           and are important targets for future work.
           Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses
           the details of how we do few-shot, one-shot, and zero-shot evaluations.

                                                  <<TABLE>>

           Table 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models
           which we trained. All models were trained for a total of 300 billion tokens.


           2.1 Model and Architectures

           We use the same model and architecture as GPT-2 [RWC + 19], including the modiﬁed initialization, pre-normalization,
           and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse
           attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence
           of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125
           million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH + 20]
           suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a
           function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for
           downstream language tasks.
           Table 2.1 shows the sizes and architectures of our 8 models. Here $n_{\text{params}}$ is the total number of trainable parameters,
           $n_{\text{layers}}$ is the total number of layers, $d_{\text{model}}$ is the number of units in each bottleneck layer (we always have the
           feedforward layer four times the size of the bottleneck layer, $d_{\text{ff}} = 4 \cdot d_{\text{model}}$), and $d_{\text{head}}$ is the dimension of each
           attention head. All models use a context window of $n_{\text{ctx}} = 2048$ tokens. We partition the model across GPUs along
           both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural
           parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models
           across GPUs. Previous work [KMH + 20] suggests that validation loss is not strongly sensitive to these parameters
           within a reasonably broad range.
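
As a rough sanity check on how these quantities relate, the sketch below estimates the non-embedding parameter count from the number of layers and the bottleneck width. It is an approximation under stated assumptions: four square projection matrices per attention block, two feedforward matrices with the four-times expansion described above, and no biases, layer-norm, or embedding parameters. The values 96 layers and a width of 12288 are taken to be the published settings of the largest model; Table 2.1 is not reproduced in this text, so treat them as assumptions here.

```python
def approx_non_embedding_params(n_layers, d_model, ff_multiplier=4):
    """Approximate transformer parameter count, ignoring embeddings and biases."""
    attention = 4 * d_model * d_model  # query, key, value and output projections
    feed_forward = 2 * d_model * (ff_multiplier * d_model)  # up and down projections
    return n_layers * (attention + feed_forward)


# Assumed settings for the largest model (see the caveat in the lead-in).
approx_175b = approx_non_embedding_params(n_layers=96, d_model=12288)
print(f"{approx_175b / 1e9:.1f}B non-embedding parameters (approximate)")
```

The estimate lands close to 174 billion, which is consistent with the 175 billion total once the omitted parameters are accounted for.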

           2.2 Training Dataset

           Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset 2 [RSR + 19] constituting
           nearly a trillion words. This size of dataset is sufﬁcient to train our largest models without ever updating on the same
           sequence twice. However, we have found that unﬁltered or lightly ﬁltered versions of Common Crawl tend to have
           lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets:
           (1) we downloaded and ﬁltered a version of CommonCrawl based on similarity to a range of high-quality reference
           corpora, (2) we performed fuzzy de-duplication at the document level, within and across datasets, to prevent redundancy
           and preserve the integrity of our held-out validation set as an accurate measure of overﬁtting, and (3) we also added
           known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.
           Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added
           several curated high-quality datasets, including an expanded version of the WebText dataset [RWC + 19], collected
           by scraping links over a longer period of time, and ﬁrst described in [KMH + 20], two internet-based books corpora
           (Books1 and Books2) and English-language Wikipedia.
           Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from
           41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before ﬁltering
           and 570GB after ﬁltering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets
           are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently,
           such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are
           sampled 2-3 times. This essentially accepts a small amount of overﬁtting in exchange for higher quality training data.
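
The relationship between sampling weight and the number of passes over a dataset can be made concrete with a small calculation. The sketch below is illustrative only: the dataset names, weights, and sizes are hypothetical placeholders rather than the values in Table 2.2, and the 300 billion token budget is the one stated in the caption of Table 2.1.

```python
def epochs_seen(total_training_tokens, mix_weight, dataset_tokens):
    """Expected number of passes over a dataset given its share of the training mix."""
    return total_training_tokens * mix_weight / dataset_tokens


TOTAL_TRAINING_TOKENS = 300e9

# Hypothetical mixture: (weight in training mix, dataset size in tokens).
hypothetical_mix = {
    "large_web_crawl": (0.60, 400e9),
    "small_curated_corpus": (0.03, 3e9),
}

epochs = {
    name: epochs_seen(TOTAL_TRAINING_TOKENS, weight, size)
    for name, (weight, size) in hypothetical_mix.items()
}
```

With these placeholder numbers the large crawl is seen less than once while the small curated corpus is seen about three times, which is the qualitative pattern the paragraph describes.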

                                                <<FIGURE>>
              
            Figure 2.2: Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models
           [KMH + 20] we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B
           is almost 10x larger than RoBERTa-Large (355M params), both models took roughly 50 petaﬂop/s-days of compute
           during pre-training. Methodology for these calculations can be found in Appendix D.

                                           <<TABLE>>

           Table 2.2: Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training
           that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a
           result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets
           are seen less than once.


           A major methodological concern with language models pretrained on a broad swath of internet data, particularly large
           models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by
           having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched
           for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper.
           Unfortunately, a bug in the ﬁltering caused us to ignore some overlaps, and due to the cost of training it was not feasible
           to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will
           more aggressively remove data contamination.

           2.3 Training Process

           As found in [KMH + 20,MKAT18], larger models can typically use a larger batch size, but require a smaller learning
           rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1
           shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture
           of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models
           were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process
           and hyperparameter settings are described in Appendix B.

           2.4 Evaluation

           For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that
           task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Story cloze
           there is no supervised training set available so we draw conditioning examples from the development set and evaluate
           on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning
           examples directly from it.
           K can be any value from 0 to the maximum amount allowed by the model’s context window, which is $n_{\text{ctx}} = 2048$
           for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a
           separate development and test set are available, we experiment with a few values of K on the development set and then
           run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to
           (or for K = 0, instead of) demonstrations.
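
The procedure in the two paragraphs above (draw K demonstrations at random, then choose K on the development set) can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the development-set accuracies are made-up numbers, and the sketch does not check that the sampled demonstrations actually fit in the context window.

```python
import random


def sample_demonstrations(train_set, k, rng):
    """Draw k distinct (context, completion) pairs to use as conditioning."""
    return rng.sample(train_set, k)


def pick_best_k(candidate_ks, dev_score):
    """Return the candidate K with the highest development-set score."""
    return max(candidate_ks, key=dev_score)


rng = random.Random(0)
train_set = [(f"context {i}", f"completion {i}") for i in range(200)]
demos = sample_demonstrations(train_set, k=32, rng=rng)

# Hypothetical development-set accuracies for a few values of K.
toy_dev_accuracy = {0: 0.41, 8: 0.55, 32: 0.61, 64: 0.60}
best_k = pick_best_k(toy_dev_accuracy, toy_dev_accuracy.get)
```

In this toy example the largest K is not the best one, which mirrors the observation that larger values of K are usually but not always better.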
           On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples
           of context plus correct completion, followed by one example of context only, and compare the LM likelihood of
           each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small
           number of datasets (ARC, OpenBookQA, and RACE) we gain additional beneﬁt as measured on the development set
           by normalizing by the unconditional probability of each completion, by computing $\frac{P(\text{completion} \mid \text{context})}{P(\text{completion} \mid \text{answer\_context})}$, where answer_context
           is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer
           but is otherwise generic.
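
The two scoring rules described above can be sketched on top of per-token log-probabilities. The sketch assumes those log-probabilities have already been obtained from a language model; the numbers below are hypothetical and chosen only to show that length normalization can change which completion wins.

```python
def summed_logprob(token_logprobs):
    """Unnormalized log-likelihood of a completion."""
    return sum(token_logprobs)


def per_token_score(token_logprobs):
    """Average log-likelihood per token, which normalizes for completion length."""
    return sum(token_logprobs) / len(token_logprobs)


def unconditional_normalized_score(logp_given_context, logp_given_answer_context):
    """log of P(completion | context) / P(completion | answer_context)."""
    return sum(logp_given_context) - sum(logp_given_answer_context)


def choose(scores):
    """Index of the highest-scoring completion."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical token log-probs for two candidate completions.
given_context = [[-1.2], [-0.5, -0.5, -0.5]]
given_answer_context = [[-1.0], [-1.0, -1.0, -1.5]]

raw_choice = choose([summed_logprob(c) for c in given_context])
per_token_choice = choose([per_token_score(c) for c in given_context])
normalized_choice = choose([
    unconditional_normalized_score(c, u)
    for c, u in zip(given_context, given_answer_context)
])
```

Here the raw sum favors the short completion, while both normalized scores favor the longer one.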
           On tasks that involve binary classiﬁcation, we give the options more semantically meaningful names (e.g. “True” or
           “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what
           is done by [RSR + 19] (see Appendix G for details).
           On tasks with free-form completion, we use beam search with the same parameters as [RSR + 19]: a beam width of 4
           and a length penalty of α = 0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on
           what is standard for the dataset at hand.
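
For readers unfamiliar with the free-form metrics, the sketch below shows one common way to compute exact match and a token-overlap F1. It assumes whitespace tokenization and lowercasing only; the normalization rules of each benchmark's official scorer may differ, so this is not the evaluation code used for the reported results.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True when the two strings agree after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "the cat sat" against a reference of "the cat" has precision 2/3 and recall 1, so its F1 is 0.8.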
           Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-,
           and few-shot). When the test set is private, our model is often too large to ﬁt on the test server, so we report results on
           the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PIQA)
           where we were able to make submission work, and we submit only the 200B few-shot results, and report development
           set results for everything else.


           3 Results


           In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6
           additional extra-small models with as few as 100,000 parameters. As observed in [KMH + 20], language modeling
           performance follows a power-law when making efﬁcient use of training compute. After extending this trend by two
           more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these
           improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will
           see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a
           broad spectrum of natural language tasks.
           Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller
           models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.
           In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling,
           such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question
           answering tasks: tasks which require using the information stored in the model’s parameters to answer general
           knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot
           and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we
           evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading
           comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in Section 3.8 we briefly explore
           NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities –
           these tasks focus on on-the-ﬂy reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the
           few-shot, one-shot, and zero-shot settings.

                                        <<FIGURE>>

           Figure 3.1: Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy
           validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior
           observed in [KMH + 20] continues for an additional two orders of magnitude with only small deviations from the
           predicted curve. For this ﬁgure, we exclude embedding parameters from compute and parameter counts.
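
A power-law trend of the kind shown in Figure 3.1 appears as a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below assumes the simplest form, loss ≈ a · compute^b with no irreducible-loss offset, and uses synthetic data generated from a known power law; it is not a reproduction of the fit in [KMH + 20].

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**b by linear regression on log-transformed values."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic points that lie exactly on a power law (hypothetical values).
compute = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
loss = [2.5 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
```

Departures from the trend, such as the slight deviation discussed above, would show up as residuals around the fitted line at the largest compute values.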

                                          <<TABLE>>

           Table 3.1: Zero-shot results on PTB language modeling dataset. Many other common language modeling datasets
           are omitted because they are derived from Wikipedia or other sources which are included in GPT-3’s training data.
           a [RWC + 19]


           3.1 Language Modeling, Cloze, and Completion Tasks

           In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks
           that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible
           completions of a piece of text.

           3.1.1 Language Modeling
           We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM + 94] dataset measured in [RWC + 19]. We omit
           the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the
           one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these
           issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15
           points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have
           a clear separation of examples to deﬁne one-shot or few-shot evaluation around, so we measure only zero-shot.
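
Perplexity is the exponential of the average negative log-likelihood. The sketch below shows the computation from a list of per-unit log-probabilities (natural log). Whether the average is taken per model token or per dataset word is an assumption left to the caller; reported PTB numbers depend on that normalization.

```python
import math


def perplexity(logprobs):
    """exp of the mean negative log-likelihood over the scored units."""
    return math.exp(-sum(logprobs) / len(logprobs))


# A model that assigns probability 1/20 to every unit has perplexity 20.
uniform_logprobs = [math.log(1 / 20)] * 50
uniform_perplexity = perplexity(uniform_logprobs)
```

Lower values are better: a perplexity near 20 means the model is, on average, about as uncertain as a uniform choice among 20 options.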

           3.1.2 LAMBADA
           The LAMBADA dataset [PKL + 16] tests the modeling of long-range dependencies in text – the model is asked to
           predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the
           continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT + 20] reflect on
           the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP + 19]
           and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path
           forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of
           8% over the previous state of the art.

                                                  <<TABLE>>

           Table 3.2: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while
           achieving respectable performance on two difficult completion prediction datasets.

                                <<FIGURE>>

           Figure 3.2: On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3
           2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of
           the art by 18%. Note zero-shot uses a different format from one-shot and few-shot as described in the text.

           LAMBADA is also a demonstration of the ﬂexibility of few-shot learning as it provides a way to address a problem that
           classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a
           standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but
           also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word
           ﬁlters [RWC + 19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a
           cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We
           use the following ﬁll-in-the-blank format:
                          Alice was friends with Bob. Alice went to visit her friend ____. → Bob
                          George bought some baseball equipment, a ball, a glove, and a ____. →
           When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase
           of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model
           size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy
           by 10%. Finally, the ﬁll-in-blank method is not effective one-shot, where it always performs worse than the zero-shot
           setting. Perhaps this is because all models still require several examples to recognize the pattern.
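
The fill-in-the-blank framing can be generated mechanically from solved passages plus one open query. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the blank marker, arrow, and newline delimiter are stand-ins for the exact formatting used in the experiments.

```python
BLANK = "____"
ARROW = "→"


def format_cloze_example(passage, answer=None):
    """Render one passage with its final word blanked out.

    passage is the text up to, but not including, the final word. When answer
    is None the line is left open for the model to complete.
    """
    line = f"{passage} {BLANK}. {ARROW}"
    return line if answer is None else f"{line} {answer}"


def build_cloze_prompt(solved_examples, query_passage):
    """Join solved demonstrations and the open query into a single prompt."""
    lines = [format_cloze_example(p, a) for p, a in solved_examples]
    lines.append(format_cloze_example(query_passage))
    return "\n".join(lines)


cloze_prompt = build_cloze_prompt(
    [("Alice was friends with Bob. Alice went to visit her friend", "Bob")],
    "George bought some baseball equipment, a ball, a glove, and a",
)
```

Because every demonstration ends with a single word after the arrow, the model can infer from the examples that exactly one word is expected.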

                                                  <<TABLE>>

           Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as
           compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the
           wiki split test server.

           One note of caution is that an analysis of test set contamination identiﬁed that a signiﬁcant minority of the LAMBADA
           dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact
           on performance.

           3.1.3 HellaSwag
           The HellaSwag dataset [ZHB + 19] involves picking the best ending to a story or set of instructions. The examples were
           adversarially mined to be difﬁcult for language models while remaining easy for humans (who achieve 95.6% accuracy).
           GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the
           75.4% accuracy of a ﬁne-tuned 1.5B parameter language model [ZHR + 19] but still a fair amount lower than the overall
           SOTA of 85.6% achieved by the ﬁne-tuned multi-task model ALUM.

           3.1.4 StoryCloze
           We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH + 16], which involves selecting the correct ending
           sentence for ﬁve-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot
           setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but
           improves over previous zero-shot results by roughly 10%.

           3.2 Closed Book Question Answering

           In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense
           amount of possible queries, this task has normally been approached by using an information retrieval system to ﬁnd
           relevant text in combination with a model which learns to generate an answer given the question and the retrieved
           text. Since this setting allows a system to search for and condition on text which potentially contains the answer it
           is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well
           directly answering the questions without conditioning on auxiliary information. They denote this more restrictive
           evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better
           and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR + 19],
           WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in
           the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than
           previous closed-book QA work: in addition to external content not being allowed, ﬁne-tuning on the Q&A dataset itself
           is also not permitted.
           The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the
           one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the ﬁne-tuned T5-11B by
           14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot
           result improves by 3.7% and matches the SOTA for an open-domain QA system which not only ﬁne-tunes but also
           makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP + 20].
           GPT-3’s few-shot result further improves performance another 3.2% beyond this.
           On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5%
           in the few-shot setting. This compares to 37.4% for ﬁne-tuned T5-11B, and 44.7% for ﬁne-tuned T5-11B+SSM,
           which uses a Q&A-speciﬁc pre-training procedure. GPT-3 in the few-shot setting approaches the performance of
           state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to
           few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions
           
           <<FIGURE>>
           
           Figure 3.3: On TriviaQA GPT-3’s performance grows smoothly with model size, suggesting that language models
           continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains
           over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG
           [LPP + 20].


           and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this
           distribution, recovering strong performance in the few-shot setting.
           On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in
           the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. Similar to WebQs, the large gain from zero-shot
           to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to
           TriviaQA and WebQs. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia
           specifically, which could be testing the limits of GPT-3’s capacity and broad pretraining distribution.
           Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain ﬁne-tuning SOTA. On the other two
           datasets it approaches the performance of the closed-book SOTA despite not using ﬁne-tuning. On all 3 datasets, we
           find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting
           the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model.

           3.3 Translation

           For GPT-2 a ﬁlter was used on a multilingual collection of documents to produce an English only dataset due to capacity
           concerns. Even with this ﬁltering GPT-2 showed some evidence of multilingual capability and performed non-trivially
           when translating between French and English despite only training on 10 megabytes of remaining French text. Since we
           increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training
           dataset to include more representation of other languages, though this remains an area for further improvement. As
           discussed in Section 2.2, the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although
           GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages.
           These languages are documented in the supplemental material. In order to better understand translation capability, we
           also expand our analysis to include two additional commonly studied languages, German and Romanian.
           Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets
           with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a
           blend of training data that mixes many languages together in a natural way, combining them on a word, sentence,
           and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in
           particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make
           use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.
           Results are shown in Table 3.4. Zero-shot GPT-3, which receives only a natural language description of the task,
           still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for

                                                  <<TABLE>>

           Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating
           into English, reflecting its strength as an English LM. We report BLEU scores on the WMT’14 Fr↔En,
           WMT’16 De↔En, and WMT’16 Ro↔En datasets as measured by multi-bleu.perl with XLM’s tokenization
           in order to compare most closely with prior unsupervised NMT work. SacreBLEU f [Pos18] results are
           reported in Appendix H. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA
           with relative confidence. a [EOAG18] b [DHKH14] c [WXH + 18] d [oR16] e [LGG + 20] f [SacreBLEU signature:
           BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.2.20]

                                                <<FIGURE>>

           Figure 3.4: Few-shot translation performance on 6 language pairs as model capacity increases. There is a consistent
           trend of improvement across all datasets as the model scales, as well as a tendency for translation into English to be
           stronger than translation from English.

                                <<TABLE>>

           Table 3.5: Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section
           4 for details on potential contamination of the Winograd test set. a [SBBC19] b [LYN + 20]

                                        <<FIGURE>>

           Figure 3.5: Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales.
           Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B
           is competitive with a fine-tuned RoBERTa-large.


           each translation task improves performance by over 7 BLEU and nears competitive performance with prior work.
           GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior
           unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the
           three input languages studied, GPT-3 signiﬁcantly outperforms prior unsupervised NMT work when translating into
           English but under-performs when translating in the other direction. Performance on En-Ro is a noticeable outlier at
           over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE
           tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En,
           few shot GPT-3 outperforms the best supervised result we could ﬁnd but due to our unfamiliarity with the literature and
           the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art.
           For Ro-En, few shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of
           unsupervised pretraining, supervised ﬁnetuning on 608K labeled examples, and backtranslation [LHCG19b].
           Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of
           improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three
           settings is shown in Appendix H.

           3.4 Winograd-Style Tasks

           The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun
           refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently ﬁne-tuned
           language models have achieved near-human performance on the original Winograd dataset, but more difﬁcult versions

                                                  <<TABLE>>

           Table 3.6: GPT-3 results on three commonsense reasoning tasks, PIQA, ARC, and OpenBookQA. GPT-3 Few-Shot
           PIQA result is evaluated on the test server. See Section 4 for details on potential contamination issues on the PIQA test
           set.
                                                        <<FIGURE>>

           Figure 3.6: GPT-3 results on PIQA in the zero-shot, one-shot, and few-shot settings. The largest model achieves a
           score on the development set in all three conditions that exceeds the best recorded score on the task.


           such as the adversarially-mined Winogrande dataset [SBBC19] still signiﬁcantly lag human performance. We test
           GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.
           On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method
           described in [RWC + 19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which
           is presented as binary classiﬁcation and requires entity extraction to convert to the form described in this section. On
           Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear
           in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human
           performance. We note that contamination analysis found some Winograd schemas in the training data but this appears
           to have only a small effect on results (see Section 4).
           On the more difﬁcult Winogrande dataset, we do ﬁnd gains to in-context learning: GPT-3 achieves 70.2% in the
           zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a ﬁne-tuned
           RoBERTa model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and
           human performance on the task as reported by [SBBC19] is 94.0%.

           3.5 Common Sense Reasoning

           Next we consider three datasets which attempt to capture physical or scientiﬁc reasoning, as distinct from sentence
           completion, reading comprehension, or broad knowledge question answering. The ﬁrst, PhysicalQA (PIQA) [BZB + 19],
           asks common sense questions about how the physical world works and is intended as a probe of grounded understanding
           of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot
           (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a

                                                  <<TABLE>>

           Table 3.7: Results on reading comprehension tasks. All scores are F1 except results for RACE which report accuracy.
           a [JZC + 19] b [JN20] c [AI19] d [QIA20] e [SPP + 19]

           ﬁne-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human
           performance, but GPT-3’s few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis
           ﬂagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark
           the result with an asterisk. See Section 4 for details.
           ARC [CCE + 18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the
           “Challenge” version of the dataset which has been ﬁltered to questions which simple statistical or information retrieval
           methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot
           setting, and 51.5% in the few-shot setting. This is approaching the performance of a ﬁne-tuned RoBERTa baseline
           (55.9%) from UniﬁedQA [KKS + 20]. On the “Easy” version of the dataset (questions which either of the mentioned
           baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a ﬁne-tuned
           RoBERTa baseline from [KKS + 20]. However, both of these results are still much worse than the overall SOTAs
           achieved by the UniﬁedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy
           set.
           On OpenBookQA [MCKS18], GPT-3 improves signiﬁcantly from zero to few shot settings but is still over 20 points
           short of the overall SOTA. GPT-3’s few-shot performance is similar to a ﬁne-tuned BERT Large baseline on the
           leaderboard.
           Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and
           inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a signiﬁcant
           improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.

           3.6 Reading Comprehension

           Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive,
           multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread
           in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general
           we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each
           respective dataset.
           GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset
           and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI + 18] a dataset which requires modeling structured
           dialog acts and answer span selections of teacher-student interactions. On DROP [DWD + 19], a dataset testing discrete
           reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the ﬁne-tuned
           BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches
           which augment neural networks with symbolic systems [RLL + 19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its
           few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to
           slightly outperform the best ﬁne-tuned result in the original paper. On RACE [LXL + 17], a multiple choice dataset of
           middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with
           the earliest work utilizing contextual representations and is still 45% behind SOTA.

           3.7 SuperGLUE

           In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a
           more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark
           [WPN + 19]. GPT-3’s test-set performance on the SuperGLUE dataset [WPN + 19] is shown in Table 3.8. In the few-shot
           setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and
           
                                                <<FIGURE>>
           
           Figure 3.7: GPT-3 results on CoQA reading comprehension task. GPT-3 175B achieves 85 F1 in the few-shot setting,
           only a few points behind measured human performance and state-of-the-art ﬁne-tuned models. Zero-shot and one-shot
           performance is a few points behind, with the gains to few-shot being largest for bigger models.

                                <<TABLE>>

           Table 3.8: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported
           on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient
           updates.

                                <<FIGURE>>

           Figure 3.8: Performance on SuperGLUE increases with model size and number of examples in context. A value
           of K = 32 means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in
           SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference
           lines (our test set results are in Table 3.8). The BERT-Large reference model was fine-tuned on the SuperGLUE training
           set (125K examples), whereas BERT++ was ﬁrst ﬁne-tuned on MultiNLI (392K examples) and SWAG (113K examples)
           before further ﬁne-tuning on the SuperGLUE training set (for a total of 630K ﬁne-tuning examples). We ﬁnd the
           difference in performance between the BERT-Large and BERT++ to be roughly equivalent to the difference between
           GPT-3 with one example per context versus eight examples per context.

           MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used
           the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.
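
The two context-sampling policies described above (a fresh draw per problem for most tasks, one fixed draw
reused for WSC and MultiRC) can be illustrated with a minimal sketch. The function name `build_contexts`, its
arguments, and the use of a seeded generator are assumptions made for illustration; the paper does not
describe its sampling code.

```python
import random


def build_contexts(train, problems, k, resample, seed=0):
    """Return one list of k in-context examples for every evaluation problem.

    resample=True  -> draw a new random set of k training examples per problem
    resample=False -> draw k training examples once and reuse them for all problems
    (hypothetical helper; the interface is an assumption, not the paper's code)
    """
    rng = random.Random(seed)
    fixed = rng.sample(train, k)  # used only when resample is False
    contexts = []
    for _ in problems:
        contexts.append(rng.sample(train, k) if resample else fixed)
    return contexts


if __name__ == "__main__":
    train = [f"example-{i}" for i in range(1000)]
    problems = [f"problem-{i}" for i in range(5)]
    per_problem = build_contexts(train, problems, k=32, resample=True)
    shared = build_contexts(train, problems, k=32, resample=False)
    print(len(per_problem[0]), shared[0] == shared[-1])
```

Sampling without replacement keeps the 32 in-context examples distinct within a single prompt.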
           We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA
           performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving
           second place on the leaderboard, where ﬁrst place is held by a ﬁne-tuned 11 billion parameter model (T5). On WSC,
           performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the
           original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable,
           roughly matching that of a ﬁne-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting.
           WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different
           phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two
           sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer
           in the next section (which discusses the ANLI benchmark) – GPT-3 appears to be weak in the few-shot or one-shot
           setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same
           way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another.
           This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these
           weaknesses, GPT-3 still outperforms a ﬁne-tuned BERT-large on four of eight tasks and on two tasks GPT-3 is close to
           the state-of-the-art held by a ﬁne-tuned 11 billion parameter model.
           Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of
           examples in the context showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32
           examples per task, after which point additional examples will not reliably fit into our context. When sweeping over
           values of K, we find that GPT-3 requires fewer than eight total examples per task to outperform a fine-tuned BERT-Large
           on overall SuperGLUE score.

           3.8 NLI

           Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences.
           In practice, this task is usually structured as a two or three class classiﬁcation problem where the model classiﬁes

                        <<FIGURE>>

           Figure 3.9: Performance of GPT-3 on ANLI Round 3. Results are on the dev-set, which has only 1500 examples
           and therefore has high variance (we estimate a standard deviation of 1.2%). We ﬁnd that smaller models hover around
           random chance, while few-shot GPT-3 175B closes almost half the gap from random chance to SOTA. Results for
           ANLI rounds 1 and 2 are shown in the appendix.


           whether the second sentence logically follows from the ﬁrst, contradicts the ﬁrst sentence, or is possibly true (neutral).
           SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest
           version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting
           GPT-3 performs similarly to a single-task ﬁne-tuned BERT Large. We also evaluate on the recently introduced
           Adversarial Natural Language Inference (ANLI) dataset [NWD + 19]. ANLI is a difﬁcult dataset employing a series of
           adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our
           models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (33%),
           whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results
           for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult
           task for language models and they are only just beginning to show signs of progress.
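
The caption of Figure 3.9 quotes a standard deviation of 1.2% for accuracy on the 1500-example ANLI R3 dev
set. The paper does not say how that figure was obtained; one plausible reading, shown here only as a sketch
under that assumption, is the standard error of a binomial proportion evaluated near the three-way chance
level of 1/3.

```python
import math


def accuracy_std(p, n):
    """Standard deviation of an accuracy estimate from n independent examples,
    assuming each is answered correctly with probability p (binomial model)."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Assumed inputs: chance-level accuracy for 3 classes, 1500 dev examples.
    std = accuracy_std(1.0 / 3.0, 1500)
    print(f"{100 * std:.1f}%")
```

Under this model the value changes only slowly with p, so accuracies somewhat above chance give a similar
estimate.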

           3.9 Synthetic and Qualitative Tasks

           One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which
           require it to perform simple on-the-ﬂy computational reasoning, recognize a novel pattern that is unlikely to have
           occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we
           test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the
           letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to
           solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new
           words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets
           with the hope of stimulating further study of test-time behavior of language models.

           3.9.1 Arithmetic
           To test GPT-3’s ability to perform simple arithmetic operations without task-speciﬁc training, we developed a small
           battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

                • 2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in
                  the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”
                • 2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the
                  answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.
                • 3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).

                                <<FIGURE>>

           Figure 3.10: Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a
           significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175B), with the latter being
           able to perform reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction
           of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot
           are shown in the appendix.


                • 3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).
                • 4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).
                • 4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).
                • 5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).
                • 5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).
                • 2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100),
                  e.g. “Q: What is 24 times 42? A: 1008”.
                • One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers,
                  with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers
                  are selected uniformly on [0, 10) and the operations are selected uniformly from {+, -, *}.

           In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random
           instances of the task and evaluate all models on those instances.
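
The task definitions above can be made concrete with a small generator. This is a minimal sketch, not the
code used for the paper: the helper names (`make_problem`, `make_composite`, `make_dataset`, `exact_match`),
the seeded generator, and the choice to strip surrounding whitespace before comparing answers are all
assumptions. The question wording follows the examples quoted in the list.

```python
import random

# Operator symbol -> (word used in the question, function computing the answer).
OPS = {
    "+": ("plus", lambda a, b: a + b),
    "-": ("minus", lambda a, b: a - b),
    "*": ("times", lambda a, b: a * b),
}


def make_problem(rng, digits, op):
    """One problem with both operands sampled uniformly from [0, 10**digits)."""
    a = rng.randrange(10 ** digits)
    b = rng.randrange(10 ** digits)
    word, fn = OPS[op]
    return f"Q: What is {a} {word} {b}? A:", str(fn(a, b))


def make_composite(rng):
    """One-digit composite (1DC): a op1 (b op2 c), parentheses around the last two."""
    a, b, c = (rng.randrange(10) for _ in range(3))
    op1, op2 = rng.choice("+-*"), rng.choice("+-*")
    inner = OPS[op2][1](b, c)
    return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(OPS[op1][1](a, inner))


def make_dataset(digits, op, n=2000, seed=0):
    """n random instances of one task (2,000 per task in the text above)."""
    rng = random.Random(seed)
    return [make_problem(rng, digits, op) for _ in range(n)]


def exact_match(prediction, answer):
    """Assumed scoring rule: the generated answer must equal the target exactly."""
    return prediction.strip() == answer


if __name__ == "__main__":
    for question, answer in make_dataset(3, "-", n=3):
        print(question, answer)
```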
           First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction,
           GPT-3 displays strong proﬁciency when the number of digits is small, achieving 100% accuracy on 2 digit addition,
           98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the
           number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on
           ﬁve digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves
           29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves
           21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness
           beyond just single operations.
           As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the
           second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all
           other operations less than 10% of the time.
           One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation
           to the task (or at the very least recognition of the task) is important to performing these computations correctly.
           Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 signiﬁcantly

                                                  <<TABLE>>

           Table 3.9: Results on basic arithmetic tasks for GPT-3 175B. {2,3,4,5}D{+,-} is 2, 3, 4, and 5 digit addition or
           subtraction, 2Dx is 2 digit multiplication. 1DC is 1 digit composite operations. Results become progressively stronger
           moving from the zero-shot to one-shot to few-shot setting, but even the zero-shot shows signiﬁcant arithmetic abilities.


                                 <<TABLE>>

           Table 3.10: GPT-3 175B performance on various word unscrambling and word manipulation tasks, in zero-, one-, and
           few-shot settings. CL is “cycle letters in word”, A1 is anagrams of all but the first and last letters, A2 is anagrams of all but
           the ﬁrst and last two letters, RI is “Random insertion in word”, RW is “reversed words”.


           outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and
           model capacity scaling for all three settings is shown in Appendix H.
           To spot-check whether the model is simply memorizing speciﬁc arithmetic problems, we took the 3-digit arithmetic
           problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and
           "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000
           subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers
           could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes
           such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than
           memorizing a table.
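
The spot-check above amounts to counting test problems whose surface form occurs in the training text. The
sketch below illustrates the idea on a toy in-memory string; `count_memorized` and its arguments are
hypothetical, and a real training corpus would need an indexed or streaming search rather than a single
Python string. The digit-boundary guards are an added assumption so that, for example, "23 + 456 =" is not
counted inside "123 + 456 =".

```python
import re


def count_memorized(problems, corpus, op_symbol, op_word):
    """Count (a, b) pairs appearing in corpus as 'a <symbol> b =' or 'a <word> b'."""
    hits = 0
    for a, b in problems:
        patterns = (
            rf"(?<!\d){a} {re.escape(op_symbol)} {b} =",  # "<NUM1> + <NUM2> ="
            rf"(?<!\d){a} {op_word} {b}(?!\d)",           # "<NUM1> plus <NUM2>"
        )
        if any(re.search(p, corpus) for p in patterns):
            hits += 1
    return hits


if __name__ == "__main__":
    toy_corpus = "we know 123 + 456 = 579 and that 7 plus 8 is 15"
    toy_problems = [(123, 456), (7, 8), (23, 456), (1, 2)]
    print(count_memorized(toy_problems, toy_corpus, "+", "plus"))
```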
           Overall, GPT-3 displays reasonable proﬁciency at moderately complex arithmetic in few-shot, one-shot, and even
           zero-shot settings.

           3.9.2 Word Scrambling and Manipulation Tasks
           To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of
           5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of
           scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:

                • Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and
                 is expected to generate the original word. For example, it might be given “lyinevitab” and should output
                 “inevitably”.
                • Anagrams of all but first and last characters (A1) – The model is given a word where every letter except
                 the first and last has been scrambled randomly, and must output the original word. Example: criroptuon =
                 corruption.
                • Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except
                 the first 2 and last 2 has been scrambled randomly, and must recover the original word. Example: opoepnnt
                 → opponent.
                • Random insertion in word (RI) – A random punctuation or space character is inserted between each letter
                 of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
                • Reversed words (RW) – The model is given a word spelled backwards, and must output the original word.
                 Example: stcejbo → objects.
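The five transformations listed above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' generation code: the function names, the separator set (taken from the characters visible in the RI example), and the seeded random generator are all hypothetical choices.

```python
import random

# Assumed separator set; the text only says "a random punctuation or space character".
SEPARATORS = ".!/ "


def cycle_letters(word, k):
    """CL: rotate the word left by k positions."""
    k %= len(word)
    return word[k:] + word[:k]


def anagram_keep_ends(word, keep, rng):
    """A1 (keep=1) or A2 (keep=2): shuffle every letter except the first and last `keep`."""
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]


def random_insertion(word, rng):
    """RI: put one random separator between each pair of adjacent letters."""
    pieces = [word[0]]
    for letter in word[1:]:
        pieces.append(rng.choice(SEPARATORS))
        pieces.append(letter)
    return "".join(pieces)


def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]


rng = random.Random(0)
cycled = cycle_letters("inevitably", 8)            # the CL example from the text
a1 = anagram_keep_ends("corruption", 1, rng)       # first and last letters fixed
a2 = anagram_keep_ends("opponent", 2, rng)         # first two and last two letters fixed
inserted = random_insertion("succession", rng)
reversed_word = reverse_word("objects")            # the RW example from the text
```

Each task instance would then be presented as the distorted word followed by the original, and the model is scored on recovering the original.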

           For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by
           [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11.
           Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing

                                <<FIGURE>>

           Figure 3.11: Few-shot performance on the five word scrambling tasks for different sizes of model. There is generally
           smooth improvement with model size although the random insertion task shows an upward slope of improvement with
           the 175B model solving the task the majority of the time. Scaling of one-shot and zero-shot performance is shown in
           the appendix. All tasks are done with K=100.


           random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difﬁcult anagram
           task (where only the ﬁrst and last letters are held ﬁxed). None of the models can reverse the letters in a word.
           In the one-shot setting, performance is signiﬁcantly weaker (dropping by half or more), and in the zero-shot setting the
           model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does learn these
           tasks at test time, as the model cannot perform them zero-shot and their artiﬁcial nature makes them unlikely to appear
           in the pre-training data (although we cannot conﬁrm this with certainty).
           We can further quantify performance by plotting “in-context learning curves”, which show task performance as a
           function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task
           in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information,
           including both task examples and natural language task descriptions.
           Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding
           operates on signiﬁcant fractions of a word (on average 0.7 words per token), so from the LM’s perspective succeeding
           at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also,
           CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word),
           requiring the model to perform some search to ﬁnd the correct unscrambling. Thus, the skills involved appear to require
           non-trivial pattern-matching and computation.


           3.9.3 SAT Analogies

           To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of
           374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of
           the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to
           hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to
           temptation”. The student is expected to choose which of the ﬁve word pairs has the same relationship as the original
           word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the
           few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among
           college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with
           scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.

                                <<FIGURE>>

           Figure 3.12: Zero-, one-, and few-shot performance on SAT analogy tasks, for different sizes of model. The largest
           model achieves 65% accuracy in the few-shot setting, and also demonstrates signiﬁcant gains to in-context learning
           which are not present in smaller models.


           3.9.4 News Article Generation
           Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by
           conditional sampling from the model given a human-written prompt consisting of a plausible ﬁrst sentence for a news
           story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles,
           so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets
           the proposed ﬁrst sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To
           solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the
           model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably
           generate short articles in the “news” genre.
           To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional
           sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles
           from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative
           language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to
           distinguish the two is a potentially important measure of quality. 

           In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles
           from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles
           from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each
           model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed
           by either the human written article or the article generated by the model.⁴ Participants were asked to select whether the
           article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by
           a machine”, or “very likely written by a machine”.
           The articles we selected were not in the models’ training data and the model outputs were formatted and selected
           programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were
           pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model.
           However, we also ran an experiment to control for participant effort and attention that followed the same format but
           involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a
           160M parameter model with no context and increased output randomness.

              3 This task is also relevant to the potential misuse of language models discussed in Section 6.1.
              4 We wanted to identify how good an average person on the internet is at detecting language model outputs, so we focused on
           participants drawn from the general US population. See Appendix E for details.

                                                  <<TABLE>>

           Table 3.11: Human accuracy in identifying whether short (200 word) news articles are model generated. We
           ﬁnd that human accuracy (measured by the ratio of correct assignments to non-neutral assignments) ranges from 86%
           on the control model to 52% on GPT-3 175B. This table compares mean accuracy between ﬁve different models, and
           shows the results of a two-sample T-Test for the difference in mean accuracy between each model and the control model
           (an unconditional GPT-3 Small model with increased output randomness).


           Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that
           the intentionally bad articles were model generated was 86% where 50% is chance level performance. By contrast,
           mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance
           at 52% (see Table 3.11).⁵ Human abilities to detect model generated text appear to decrease as model size increases:
           there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance.⁶
           This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
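The accuracy measure and significance test used here can be sketched in a few lines. This is a hedged illustration: collapsing the five-point response scale to "human", "machine", or neutral, the pooled-variance form of Student's t statistic, and all toy numbers are assumptions made for the example, not details confirmed by the text.

```python
from math import sqrt
from statistics import mean, variance

NEUTRAL = "I don't know"


def participant_accuracy(responses):
    """Correct assignments divided by non-neutral assignments for one participant.

    `responses` is a list of (guess, truth) pairs, where guess is "human",
    "machine", or NEUTRAL. Returns None if every response was neutral.
    """
    non_neutral = [(guess, truth) for guess, truth in responses if guess != NEUTRAL]
    if not non_neutral:
        return None
    return sum(guess == truth for guess, truth in non_neutral) / len(non_neutral)


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic using the pooled sample variance."""
    na, nb = len(sample_a), len(sample_b)
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / na + 1 / nb))


# Toy data, invented for illustration only.
one_participant = [
    ("machine", "machine"),
    ("human", "machine"),
    (NEUTRAL, "machine"),
    ("machine", "machine"),
]
accuracy = participant_accuracy(one_participant)   # 2 correct out of 3 non-neutral

control_accuracies = [0.90, 0.80, 0.85, 0.95]      # hypothetical per-participant values
model_accuracies = [0.50, 0.55, 0.45, 0.60]
t_stat = student_t(control_accuracies, model_accuracies)
```

Excluding neutral answers from the denominator means a participant who often answers "I don't know" is scored only on the articles where they committed to a guess.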
           Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15.⁷ Much of the text is, as indicated by the
           evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator
           that an article is model generated since, unlike human authors, the models have no access to the speciﬁc facts that the
           article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual
           phrasings, though these are often subtle enough that they are not noticed.
           Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like
           GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human
           evaluators. Automatic detection of these models may be a promising area of future research.
           Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe
           more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated
           by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated
           completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial
           experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to
           compare human abilities to detect the articles generated by GPT-3 and a control model.
           We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was
           88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely
           above chance at 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3
           continues to produce articles that humans ﬁnd difﬁcult to distinguish from human written news articles.

           3.9.5 Learning and Using Novel Words
           A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a
           word in a sentence after seeing it deﬁned only once, or conversely inferring a word’s meaning from only one usage. Here
           we qualitatively test GPT-3’s ability to do the former. Speciﬁcally, we give GPT-3 the deﬁnition of a nonexistent word,
           such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to ﬁve previous examples of a (separate)

              5 We use a two-sample Student’s T-Test to test for signiﬁcant difference between the means of the participant accuracies of each
           model and the control model and report the normalized difference in the means (as the t-statistic) and the p-value.
              6 If a model consistently produces texts that are more impressive than human articles, it is possible that human performance on
           this task would drop below 50%. Indeed, many individual participants scored below 50% on this task.
              7 Additional non-news samples can be found in Appendix F.

                                                                        <<FIGURE>>

           Figure 3.13: People’s ability to identify whether news articles are model-generated (measured by the ratio of correct
           assignments to non-neutral assignments) decreases as model size increases. Accuracy on the outputs of the
           deliberately bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated with the dashed
           line at the top, and the random chance (50%) is indicated with the dashed line at the bottom. Line of best ﬁt is a power
           law with 95% conﬁdence intervals.

                                        <<TABLE>>

           Table 3.12: People’s ability to identify whether 500 word articles are model generated (as measured by the ratio of
           correct assignments to non-neutral assignments) was 88% on the control model and 52% on GPT-3 175B. This table
           shows the results of a two-sample T-Test for the difference in mean accuracy between GPT-3 175B and the control
           model (an unconditional GPT-3 Small model with increased output randomness).

                <<FIGURE>>

           Figure 3.14: The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human
           written article (accuracy: 12%).

                                                        <<FIGURE>>

           Figure 3.15: The GPT-3 generated news article that humans found the easiest to distinguish from a human written
           article (accuracy: 61%).

                                <<FIGURE>>

           Figure 3.16: Representative GPT-3 completions for the few-shot task of using a new word in a sentence. Boldface is
           GPT-3’s completions, plain text is human prompts. In the ﬁrst example both the prompt and the completion are provided
           by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional
           prompts and provides the completions. Nothing task-speciﬁc is provided to GPT-3 other than the conditioning shown
           here.

           nonexistent word being deﬁned and used in a sentence, so the task is few-shot in terms of previous examples of the
           broad task and one-shot in terms of the specific word. Figure 3.16 shows the 6 examples we generated; all definitions
           were human-generated, and the ﬁrst answer was human-generated as conditioning while the subsequent answers were
           generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try
           any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the ﬁnal
           sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of
           the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy
           sword ﬁght. Overall, GPT-3 appears to be at least proﬁcient at the task of using novel words in a sentence.

           3.9.6 Correcting English Grammar
           Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot
           setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output:
           <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any
           omissions or repeats). Results are shown in Figure 3.17.
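The prompt framing for this task can be sketched as a small builder. This is a minimal illustration assuming that the two fields of each example sit on consecutive lines and that examples are separated by a blank line; the function name and the sample sentences are hypothetical and are not taken from the paper's figure.

```python
def grammar_prompt(solved_examples, query):
    """Build a few-shot prompt from (poor, good) pairs, leaving the final output open."""
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in solved_examples
    ]
    # The last block stops after the label so the model supplies the correction.
    blocks.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n\n".join(blocks)


# Invented sentences, for illustration only.
prompt = grammar_prompt(
    [("I eated the apple yesterday.", "I ate the apple yesterday.")],
    "She no like the rain.",
)
```

With one solved pair as conditioning, each further sentence is appended in the same layout and the model's continuation after the final "Good English Output:" is read off as its correction.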

           4 Measuring and Preventing Memorization Of Benchmarks

           Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our
           benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research
           without established best practices. While it is common practice to train large models without investigating contamination,
           given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.
           This concern is not just hypothetical. One of the ﬁrst papers to train a language model on Common Crawl data [TL18]
           detected and removed a training document which overlapped with one of their evaluation datasets. Other work such
           as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that

                                                  <<FIGURE>>

             Figure 3.17: Representative GPT-3 completions for the few-shot task of correcting English grammar. Boldface
             is GPT-3’s completions, plain text is human prompts. In the first few examples both the prompt and the
             completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives
             successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 aside from
             the first few examples as conditioning and the “Poor English input/Good English output” framing. We note that the
             distinction between “poor” and “good” English (and the terms themselves) is complex, contextual, and contested. As
             the example mentioning the rental of a house shows, assumptions that the model makes about what “good” is can even
             lead it to make errors (here, the model not only adjusts grammar, but also removes the word “cheap” in a way that alters
             meaning).

                                                                                                <<FIGURE>>

           Figure 4.1: GPT-3 Training Curves. We measure model performance during training on a deduplicated validation
           split of our training distribution. Though there is some gap between training and validation performance, the gap grows
           only minimally with model size and training time, suggesting that most of the gap comes from a difference in difﬁculty
           rather than overﬁtting.


           although models did perform moderately better on data that overlapped between training and testing, this did not
           signiﬁcantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
           GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of
           magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential
           for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B
           does not overﬁt its training set by a signiﬁcant amount, measured relative to a held-out validation set with which it was
           deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as
           large as feared.
           We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap
           between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a
           bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t
           feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts
           results.
           For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, deﬁned roughly as
           examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when
           it is shorter than 13-grams). The goal is to very conservatively ﬂag anything that could potentially be contamination,
           so as to produce a clean subset that is free of contamination with high conﬁdence. The exact procedure is detailed in
           Appendix C.
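The flagging rule described above can be sketched as a set intersection over token n-grams. This is a rough illustration only: the tokenizer (lowercased alphanumeric runs), the function names, and the in-memory scan over training documents are assumptions, and the exact procedure is the one given in the appendix.

```python
import re


def tokens(text):
    """Assumed tokenizer: lowercase alphanumeric runs, punctuation ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(toks, n):
    """All contiguous n-token windows, as a set of tuples."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def is_potentially_leaked(example, training_docs, n=13):
    """Flag an example that shares an n-gram with any training document.

    If the example has fewer than n tokens, the window shrinks to the whole
    example, so it is flagged only when it appears in full.
    """
    toks = tokens(example)
    if not toks:
        return False
    size = min(n, len(toks))
    example_grams = ngrams(toks, size)
    return any(example_grams & ngrams(tokens(doc), size) for doc in training_docs)


# Toy data: a 20-token "training document" and four candidate examples.
words = [f"w{i}" for i in range(20)]
training_docs = [" ".join(words)]
leaked = " ".join(words[3:16]) + " extra"        # shares 13 consecutive tokens
clean = " ".join(words[3:15]) + " extra tail"    # shares only 12 consecutive tokens
short_leaked = "w5 w6 w7"                        # short, appears in full
short_clean = "w5 w7"                            # short, never contiguous in training
```

At internet scale the training-side n-grams would be precomputed and stored in a compact index rather than recomputed per example; the sketch keeps everything in memory for clarity.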
           We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean
           subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a
           signiﬁcant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be
           inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a
           quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence
           that contamination level and performance difference are correlated. We conclude that either our conservative method
           substantially overestimated contamination or that contamination has little effect on performance.
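The comparison between the full benchmark and its clean subset can be sketched as below. The function name, the dictionary keys, and the toy labels are hypothetical; the sketch assumes an accuracy-style metric and reports both the absolute and the relative change, since the text quotes both kinds of figure.

```python
def contamination_report(correct, flagged):
    """Compare accuracy on the full benchmark with accuracy on the clean subset.

    correct[i] is 1 if example i was answered correctly, else 0.
    flagged[i] is True if example i was marked as potentially leaked.
    """
    if len(correct) != len(flagged) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    clean = [c for c, f in zip(correct, flagged) if not f]
    if not clean:
        raise ValueError("no clean examples left to evaluate")
    full_acc = sum(correct) / len(correct)
    clean_acc = sum(clean) / len(clean)
    return {
        "fraction_flagged": sum(flagged) / len(flagged),
        "full_accuracy": full_acc,
        "clean_accuracy": clean_acc,
        # Negative values mean the model does worse once flagged examples are removed.
        "absolute_change": clean_acc - full_acc,
        "relative_change": (clean_acc - full_acc) / full_acc if full_acc else None,
    }


# Toy labels, invented for illustration only.
correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
flagged = [True, True, False, False, False, False, False, False, False, False]
report = contamination_report(correct, flagged)
```

A clean-subset score well below the full score would point to inflation from contamination, while a change near zero is consistent with either little real contamination or little effect from it.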
           Below, we review in more detail the few speciﬁc cases where either (1) the model performs signiﬁcantly worse on
           the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference
           difﬁcult.
           Our analysis ﬂagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension
           (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English

                                <<FIGURE>>

           Figure 4.2: Benchmark contamination analysis. We constructed cleaned versions of each of our benchmarks to
           check for potential contamination in our training set. The x-axis is a conservative lower bound for how much of the
           dataset is known with high conﬁdence to be clean, and the y-axis shows the difference in performance when evaluating
           only on the veriﬁed clean subset. Performance on most benchmarks changed negligibly, but some were ﬂagged for
           further review. On inspection we ﬁnd some evidence for contamination of the PIQA and Winograd results, and we mark
           the corresponding results in Section 3 with an asterisk. We find no evidence that other benchmarks are affected.


           translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false
           positives. We summarize the results for each group of tasks below:

                • Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and
                 DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difﬁcult.
                 Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source
                 text was present in our training data but the question/answer pairs were not, meaning the model gains only
                 background information and cannot memorize the answer to a speciﬁc question.
                • German translation: We found 25% of the examples in the WMT16 German-English test set were marked
                 as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the
                 ﬂagged examples contain paired sentences resembling NMT training data and collisions were monolingual
                 matches mostly of snippets of events discussed in the news.
                • Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the
                 short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged
                 overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set,
                 but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small,
                 but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the
                 symbol insertion task shows high overlap but no effect on performance – this is because that task involves
                 removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to
                 many spurious matches.
                • PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point
                 absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was
                 released after our training set was created and its labels are hidden, some of the web pages used by the
                 crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller
                 model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias
                 rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot
                 rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential
                 contamination.
                • Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the
                 clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in
                 fact present in our training set, though presented in a different format than we present the task to the model.
                 Although the decrease in performance is small, we mark our Winograd results in the main paper with an
                 asterisk.

                • Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the
                 Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably
                 extract a clean subset here, we do not report results on these datasets, even though we intended to when starting
                 this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language
                 modeling benchmark.
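Two of the false-positive patterns noted in the list above (trivial word-scrambling items and symbol-insertion items that collapse once non-letters are ignored) can be illustrated with a short sketch. The helper names are hypothetical, and treating spaces as ignorable alongside punctuation is an assumption about the filter.

```python
import re


def letters_only(text):
    """Drop every non-letter character, as a punctuation-insensitive filter would."""
    return re.sub(r"[^a-z]", "", text.lower())


def is_trivial(scrambled, original):
    """True when the distorted side already equals the answer, e.g. a reversed palindrome."""
    return scrambled == original


def collapses_under_filter(scrambled, original):
    """True when both sides become identical once non-letter characters are ignored."""
    return letters_only(scrambled) == letters_only(original)


palindrome_case = is_trivial("kayak"[::-1], "kayak")                          # reversed palindrome
ordinary_case = is_trivial("alaok", "koala")                                  # a genuine reversal
insertion_case = collapses_under_filter("s.u!c/c!e.s s i/o/n", "succession")  # RI example from the text
```

Under such a filter every symbol-insertion item reduces to the plain word repeated, which is why that task can show high measured overlap without any effect on performance.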

           We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply
           to verify how much actual contamination existed. These appeared to often contain false positives. They had either
           no actual contamination, or had contamination that did not give away the answer to the task. One notable exception
           was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very
           small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our ﬁll-in-the-blank format
           precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this
           paper, the potential contamination is noted in the results section.
           An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the
           same distribution as the original dataset. It remains possible that memorization inﬂates results but at the same time
           is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number
           of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small
           models, which are unlikely to be memorizing.
           Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright
           remove problematic results, depending on the severity. Much work remains to be done to address this important and
           subtle issue for the ﬁeld in general, both when designing benchmarks and when training models. For a more detailed
           explanation of our analysis, we refer the reader to Appendix C.


           5 Limitations

           GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for
           future work.
           First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct
           predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although
           the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to
           lose coherence over sufﬁciently long passages, contradict themselves, and occasionally contain non-sequitur sentences
           or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of
           GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed
           informally that GPT-3 seems to have special difﬁculty with “common sense physics”, despite doing well on some
           datasets (such as PIQA [BZB+19]) that test this domain. Specifically, GPT-3 has difficulty with questions of the type
           “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable
           gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when
           evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same
           way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading
           comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.
           GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused
           on exploring in-context learning behavior in autoregressive language models because it is straightforward to both
           sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional
           architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent
           literature, which has documented improved ﬁne-tuning performance when using these approaches over standard
           language models [RSR + 19]. Thus our design decision comes at the cost of potentially worse performance on tasks
           which empirically beneﬁt from bidirectionality. This may include ﬁll-in-the-blank tasks, tasks that involve looking back
           and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then
           generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a
           few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves
           comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and
           RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at ﬁne-tuning
           than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with
           few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.
           A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether
           autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the
           pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to
           predict and what is less important. [RRS20] demonstrate beneﬁts of customizing prediction to entities of interest. Also,
           with self-supervised objectives, task speciﬁcation relies on forcing the desired task into a prediction problem, whereas
           ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed
           actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains
           of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world
           [BHT + 20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a
           different approach is likely to be necessary. Promising future directions in this vein might include learning the objective
           function from humans [ZSW + 19a], ﬁne-tuning with reinforcement learning, or adding additional modalities such as
           images to provide grounding and a better model of the world [CLY + 19].
           Another limitation broadly shared by language models is poor sample efﬁciency during pre-training. While GPT-3
           takes a step towards test-time sample efﬁciency closer to that of humans (one-shot or zero-shot), it still sees much more
           text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is
           an important direction for future work, and might come from grounding in the physical world to provide additional
           information, or from algorithmic improvements.
           A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot
           learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identiﬁes tasks that it
           has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that
           are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format,
           to adapting to a speciﬁc style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on
           this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words
           seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although
           possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what
           humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training
           and identifying them at test time would be an advance for language models, but nevertheless understanding precisely
           how few-shot learning works is an important unexplored direction for future research.
           A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are
           both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of
           models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large
           models down to a manageable size for speciﬁc tasks. Large models such as GPT-3 contain a very wide range of skills,
           most of which are not needed for a speciﬁc task, suggesting that in principle aggressive distillation may be possible.
           Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters;
           new challenges and opportunities may be associated with applying it to models of this size.
           Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable,
           it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in
           performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This
           last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special
           concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts
           (Section 6).

           6 Broader Impacts

           Language models have a wide range of beneﬁcial applications for society, including code and writing auto-completion,
           grammar assistance, game narrative generation, improving search engine responses, and answering questions. But
           they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over
           smaller models and increases the difﬁculty of distinguishing synthetic text from human-written text. It therefore has the
           potential to advance both the beneﬁcial and harmful applications of language models.
           Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily
           greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this
           are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in
           Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly
           discuss issues of energy efficiency (Section 6.3).

           6.1 Misuse of Language Models

           Malicious uses of language models can be somewhat difﬁcult to anticipate because they often involve repurposing
           language models in a very different environment or for a different purpose than researchers intended. To help with this,
           we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying
           threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact
           [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

           6.1.1 Potential Misuse Applications

           Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples
           include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing
           and social engineering pretexting. Many of these applications bottleneck on human beings to write sufﬁciently high
           quality text. Language models that produce high quality text generation could lower existing barriers to carrying out
           these activities and increase their efﬁcacy.
           The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to
           generate several paragraphs of synthetic content that people ﬁnd difﬁcult to distinguish from human-written text in
           Section 3.9.4 represents a concerning milestone in this regard.

           6.1.2 Threat Actor Analysis

           Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors
           who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced
           (e.g. state-sponsored) groups with long-term agendas [SBC + 19].
           To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat
           groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did
           ﬁnd signiﬁcant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances
           of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated
           with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is
           not immediate, but signiﬁcant improvements in reliability could change this.
           Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about
           possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible
           difference in operations that may see potential gains by using language models. The assessment was that language
           models may not be worth investing signiﬁcant resources in because there has been no convincing demonstration that
           current language models are signiﬁcantly better than current methods for generating text, and because methods for
           “targeting” or “controlling” the content of language models are still at a very early stage.

           6.1.3 External Incentive Structures

           Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their
           agenda. TTPs are inﬂuenced by economic factors like scalability and ease of deployment; phishing is extremely popular
           among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login
           credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.
           Ease of use is another signiﬁcant incentive. Having stable infrastructure has a large impact on the adoption of TTPs.
           The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k
           truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot
           produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the
           amount of human labor required in operating this bot. But a human is still needed to ﬁlter the outputs, which restricts
           how scalable the operation can be.
           Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will
           eventually develop language models that are sufﬁciently consistent and steerable that they will be of greater interest to
           malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on
           this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

           6.2 Fairness, Bias, and Representation

           Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning,
           since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and
           producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in
           the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation. 8

           Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and
           behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely
           present and could be studied in follow-up work. This is a preliminary analysis and does not reﬂect all of the model’s
           biases even within the studied categories.
           Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reﬂect stereotypes
           present in their training data. Below we discuss our preliminary ﬁndings of bias along the dimensions of gender, race,
           and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how
           they are different in this dimension.

           6.2.1 Gender
           In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found
           that occupations in general have a higher probability of being followed by a male gender identiﬁer than a female one
           (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant).
           83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured
           this by feeding the model a context such as "The detective was a" and then looking at the probability of the
           model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.).
           In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus
           were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and
           sheriff. Occupations that were more likely to be followed by female identiﬁers include midwife, nurse, receptionist,
           housekeeper etc.
           We also tested how these probabilities changed when we shifted the context to be "The competent {occupation}
           was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a"
           (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent
           {occupation} was a," the majority of occupations had an even higher probability of being followed by a
           male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was
           a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male
           with a probability similar to that for our original neutral prompt. The average occupation bias - measured as
           <<FORMULA>> - was <<FORMULA>> for the Neutral Variant, <<FORMULA>> for the Competent Variant and <<FORMULA>>
           for the Incompetent Variant.
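
As a rough illustration of the measurement described above, the sketch below computes a per-occupation log ratio of female to male identifier probability and averages it over occupations. The formula itself is not reproduced in this text, so the mean log ratio, the sign convention, the function names, and the toy probabilities are all assumptions made for illustration; this is not the authors' code or data.

```python
import math

def occupation_log_ratio(p_female, p_male):
    """Log ratio of female to male identifier probability for one prompt.
    Negative values mean the prompt leans male (assumed sign convention)."""
    return math.log(p_female / p_male)

def average_bias(probs):
    """Mean log ratio over occupations.
    probs maps occupation -> (p_female, p_male); hypothetical toy inputs."""
    ratios = [occupation_log_ratio(pf, pm) for pf, pm in probs.values()]
    return sum(ratios) / len(ratios)

def fraction_male_leaning(probs):
    """Share of occupations whose male probability exceeds the female one."""
    male = sum(1 for pf, pm in probs.values() if pm > pf)
    return male / len(probs)

# Made-up probabilities for a context like "The {occupation} was a".
toy_probs = {
    "detective": (0.02, 0.08),
    "nurse": (0.09, 0.03),
    "banker": (0.01, 0.04),
}

bias = average_bias(toy_probs)
share = fraction_male_leaning(toy_probs)
print(f"average log-ratio bias: {bias:.3f}")
print(f"male-leaning share: {share:.2f}")
```

In a real measurement the two probabilities would come from the model's next-token distribution, summed over the chosen male and female indicator words.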

           We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further
           corroborated the model’s tendency to associate most occupations with males. One method measured the
           models’ ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model
           a context such as "The advisor met with the advisee because she wanted to get advice about job
           applications. ‘She’ refers to the" and found the option with the lowest probability between the two
           possible options (Choices between Occupation Option: advisor; Participant Option: advisee).
           Occupation and participant words often have societal biases associated with them such as the assumption that most
           occupants are by default male. We found that the language models learnt some of these biases such as a tendency to
           associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of
           all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences
           where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All
           other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns
           with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers
           some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger
           models are more robust than smaller models.
           We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other
           pre-selected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature

              8 Evaluating fairness, bias, and representation in language models is a rapidly-developing area with a large body of prior work.
           See, for example, [HZJ + 19,NBR20,SCNP19].

           Table 6.1: Most Biased Descriptive Words in 175B Model

            <<TABLE>>

           of 1 and top-p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She
           was very", "He would be described as", "She would be described as" 9. We looked at the adjectives and
           adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more
           often described using appearance oriented words such as “beautiful” and “gorgeous” as compared to men who were
           more often described using adjectives that span a greater spectrum.
           Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each
           word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a
           category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective,
           we have also included the average for the number of co-occurrences across all qualifying words for each gender.
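
The notion of “most favored” words can be sketched in a few lines. The example below counts words per category and ranks them by how much more often they occur with one category than with the other. The rate-difference score, the function names, and the toy samples are assumptions for illustration; the sketch also omits the POS-tagging step that restricts the analysis to adjectives and adverbs.

```python
from collections import Counter

def cooccurrence_counts(samples):
    """Count words per category. samples is a list of (category, text) pairs."""
    counts = {}
    for category, text in samples:
        words = [w.strip(".,!?\"'").lower() for w in text.split()]
        counts.setdefault(category, Counter()).update(w for w in words if w)
    return counts

def most_favored(counts, category, other, top_n=10):
    """Rank words by how much more often (per token) they occur with
    `category` than with `other`. The rate difference is an assumed score."""
    total_a = sum(counts[category].values())
    total_b = sum(counts[other].values())

    def skew(word):
        return counts[category][word] / total_a - counts[other][word] / total_b

    # Sort by descending skew, breaking ties alphabetically.
    ranked = sorted(counts[category], key=lambda w: (-skew(w), w))
    return [(w, counts[category][w]) for w in ranked[:top_n]]

# Hypothetical generated continuations, keyed by the pronoun in the prompt.
toy_samples = [
    ("she", "beautiful and kind"),
    ("she", "beautiful and tall"),
    ("he", "tall and lazy"),
    ("he", "kind and large"),
]
counts = cooccurrence_counts(toy_samples)
top_she = most_favored(counts, "she", "he", top_n=1)
print(top_she)
```

Each returned pair holds a word and its raw co-occurrence count, mirroring the layout of a table that lists favored words with counts.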


           6.2.2 Race

           To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very",
           "The {race} woman was very" and "People would describe the {race} person as" and generated 800
           samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White
           or Asian. We then measured word co-occurrences in the generated samples. Given prior research demonstrating that
           language models produce text of differing sentiment when varying features such as occupation [HZJ + 19], we explored
           how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred
           disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive
           words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid:
           -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).
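
A minimal sketch of the scoring step follows, assuming that a category's sentiment is the plain average of the scores of its co-occurring words. The lexicon contains only the example scores quoted above, and the averaging rule, the handling of unknown words, and the word lists are assumptions rather than the procedure actually used.

```python
# Toy lexicon using only the example scores quoted in the text above;
# a real run would look words up in SentiWordNet instead.
TOY_LEXICON = {
    "wonderfulness": 100.0,
    "amicable": 87.5,
    "wretched": -87.5,
    "horrid": -87.5,
    "sloping": 0.0,
    "chalet": 0.0,
}

def mean_sentiment(words, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon.
    Words missing from the lexicon are skipped (an assumed policy)."""
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

# Hypothetical co-occurring word lists for two placeholder categories.
group_a = ["amicable", "chalet", "wonderfulness", "unknownword"]
group_b = ["horrid", "sloping", "wretched"]

score_a = mean_sentiment(group_a)
score_b = mean_sentiment(group_b)
print(score_a, score_b)
```

Because the score depends only on which words co-occur, it carries no information about context, which is the limitation the following paragraph discusses.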
           It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that
           focused on racial features; these results are not from the models talking about race in the wild but talking about race in
           an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply
           looking at word co-occurrences, the resulting sentiment can reﬂect socio-historical factors - for instance, text relating to
           a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated
           with a negative sentiment under this testing methodology.
           Across the models we analyzed, ‘Asian’ had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the
           other hand, ‘Black’ had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences
           narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and
           highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.


              9 We only used male and female pronouns. This simplifying assumption makes it easier to study co-occurrence since it does not
           require the isolation of instances in which ‘they’ refers to a singular noun from those where it didn’t, but other forms of gender bias
           are likely present and could be studied using different approaches.

                                <<FIGURE>>

            Figure 6.1: Racial Sentiment Across Models

                                                                     <<TABLE>>

                    Table 6.2: Shows the ten most favored words about each religion in the GPT-3 175B model.


           6.2.3 Religion

           We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam,
           and Judaism, by generating 800 model outputs of length 50 with a temperature of 1 and a top-p of 0.9 for every
           prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each
           of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a
           corpus of such completions for studying co-occurrence of words.
           The following is an example output from the model:
            "Buddhists are divided into two main branches - Theravada and Mahayana. Theravada
            is the more conservative branch, centering on monastic life and the earliest sutras
            and refusing to recognize the later Mahayana sutras as authentic."
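
The sampling settings used to produce completions like the one above (a temperature of 1 and a top-p of 0.9) can be sketched as follows. This is a generic illustration of temperature scaling and nucleus (top-p) truncation over a toy next-token distribution; the function names and the distribution are hypothetical and this is not the sampling code used for the experiments.

```python
import math
import random

def apply_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature 1 leaves them unscaled."""
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens whose total
    probability reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}

def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token after temperature scaling and top-p truncation."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution, given directly as probabilities.
toy_probs = {"are": 0.5, "is": 0.3, "were": 0.15, "zebra": 0.05}
filtered = top_p_filter(toy_probs, top_p=0.9)
print(filtered)
```

With these toy numbers the least likely token falls outside the 0.9 nucleus and is never sampled, while the remaining three are renormalized to sum to one.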
           Similar to race, we found that the models make associations with religious terms that indicate some propensity to reﬂect
           how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such
           as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such
           as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in
           the top 40 most favored words for Islam in GPT-3.

           6.2.4 Future Bias and Fairness Challenges
           We have presented this preliminary analysis to share some of the biases we found in order to motivate further research,
           and to highlight the inherent difﬁculties in characterizing biases in large-scale generative models; we expect this to be an
           area of continuous research for us and are excited to discuss different methodological approaches with the community.
           We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but
           we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model
           attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ + 18].
           Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this
           is also extensive [QMZH19,HZJ + 19], so we offer only a few brief comments on future directions speciﬁc to large
           language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for
           building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for
           these models. There is room for more research that engages with the literature outside NLP, better articulates normative
           statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20].
           Thus, mitigation work should not be approached purely with a metric driven objective to ‘remove’ bias as this has been
           shown to have blind spots [GG19,NvNvdG19] but in a holistic manner.

           6.3 Energy Usage

           Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3
           175B consumed several thousand petaﬂop/s-days of compute during pre-training, compared to tens of petaﬂop/s-days
           for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such
           models, as advocated by [SDSE19].
           The use of large-scale pre-training also gives another lens through which to view the efﬁciency of large models - we
           should consider not only the resources that go into training them, but how these resources are amortized over the
           lifetime of a model, which will subsequently be used for a variety of purposes and ﬁne-tuned for speciﬁc tasks. Though
           models like GPT-3 consume signiﬁcant resources during training, they can be surprisingly efﬁcient once trained: even
           with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or
           only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down
           the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efﬁcient
           versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efﬁciency
           of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].
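
The units in this section can be made concrete with a little arithmetic. The sketch below converts petaflop/s-days into total floating-point operations and prices the quoted 0.4 kW-hr per 100 pages. The electricity price of 0.10 USD per kW-hr and the value 3000 standing in for “several thousand” are assumptions chosen only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pfs_days_to_flops(pfs_days):
    """Convert petaflop/s-days to total floating-point operations."""
    return pfs_days * 1e15 * SECONDS_PER_DAY

def energy_cost_usd(kwh, usd_per_kwh):
    """Energy cost for a given consumption and electricity price."""
    return kwh * usd_per_kwh

# One petaflop/s-day is 1e15 operations per second sustained for 86400 seconds.
one_pfs_day = pfs_days_to_flops(1)

# Placeholder for "several thousand" petaflop/s-days; not a reported figure.
illustrative_training = pfs_days_to_flops(3000)

# 0.4 kWh per 100 pages is the figure quoted in the text;
# the price of 0.10 USD per kWh is an assumption.
cost_100_pages = energy_cost_usd(0.4, 0.10)

print(f"{one_pfs_day:.3e} FLOPs per petaflop/s-day")
print(f"{illustrative_training:.3e} FLOPs for the illustrative training run")
print(f"{cost_100_pages * 100:.0f} cents per 100 pages")
```

Under the assumed price, 100 pages cost about four cents, which is consistent with the “few cents” stated above.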

           7 Related Work

           Several lines of work have focused on increasing parameter count and/or computation in language models as a
           means to improve generative or task performance. An early work scaled LSTM based language models to over a
           billion parameters [JVS + 16]. One line of work straightforwardly increases the size of transformer models, scaling
           up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size:
           213 million parameters [VSP + 17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters
           [RWC + 19], 8 billion parameters [SPP + 19], 11 billion parameters [RSR + 19], and most recently 17 billion parameters
           [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of
           increasing models’ capacity to store information without increased computational cost. These approaches rely on the
           conditional computation framework [BLC13] and speciﬁcally, the mixture-of-experts method [SMM + 17] has been
           used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19],
           though only a small fraction of the parameters are actually used on each forward pass. A third approach increases
           computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and
           the universal transformer [DGV + 18]. Our work focuses on the ﬁrst approach (scaling compute and parameters together,
           by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ
           this strategy.
           Several efforts have also systematically studied the effect of scale on language model performance. [KMH + 20,
           RRBS19,LWS + 20,HNA + 17] find a smooth power-law trend in loss as autoregressive language models are scaled up.
           This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the
           curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all)
           downstream tasks across 3 orders of magnitude of scaling.
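
A power-law trend of the kind described above is a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below does this for synthetic data spanning three orders of magnitude. The functional form L = a * C^b, the variable names, and the data are assumptions for illustration and are not measured results.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a line in log-log space.
    Returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic data generated from L = 10 * C**(-0.05); not measured results.
compute = [1e0, 1e1, 1e2, 1e3]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A bending of the curve would show up as residuals that grow systematically at the largest scales rather than scattering around zero.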
           Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language
           models that are as small as possible. This approach includes ALBERT [LCG + 19] as well as general [HVD15] and
           task-speciﬁc [SDCW19,JYS + 19,KR16] approaches to distillation of language models. These architectures and
           techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint
           of giant models.
           As ﬁne-tuned language models have neared human performance on many standard benchmark tasks, considerable
           effort has been devoted to constructing more difﬁcult or open-ended tasks, including question answering [KPR + 19,
           IBGC + 14,CCE + 18,MCKS18], reading comprehension [CHI + 18,RCM19], and adversarially constructed datasets
           designed to be difﬁcult for existing language models [SBBC19,NWD + 19]. In this work we test our models on many
           of these datasets.
           Many previous efforts have focused speciﬁcally on question-answering, which constitutes a signiﬁcant fraction of the
           tasks we tested on. Recent efforts include [RSR + 19,RRS20], which ﬁne-tuned an 11 billion parameter language model,
           and [GLT + 20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on
           in-context learning but could be combined in the future with those of [GLT + 20,LPP + 20].
           Metalearning in language models has been utilized in [RWC + 19], though with much more limited results and no
           systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it
           structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including
           matching networks [VBL + 16], RL2 [DSC + 16], learning to optimize [RL16,ADG + 16,LM17] and MAML [FAL17].
           Our approach of stufﬁng the model’s context with previous examples is most structurally similar to RL2 and also
           resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model’s activations
           across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training)
           updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks deﬁned at inference-time.
           Few-shot auto-regressive density estimation was explored in [RCP + 17] and [GWC + 18] studied low-resource NMT as
           a few-shot learning problem.
           While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained
           language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-ﬁeld with
           similar goals is semi-supervised learning where approaches such as UDA [XDH + 19] also explore methods of ﬁne-tuning
           when very little labeled data is available.
           Giving multi-task models instructions in natural language was ﬁrst formalized in a supervised setting with [MKXS18]
           and utilized for some tasks (such as summarizing) in a language model with [RWC + 19]. The notion of presenting
           tasks in natural language was also explored in the text-to-text transformer [RSR + 19], although there it was applied for
           multi-task ﬁne-tuning rather than for in-context learning without weight updates.
           Another approach to increasing generality and transfer-learning capability in language models is multi-task learning
           [Car97], which ﬁne-tunes on a mixture of downstream tasks together, rather than separately updating the weights for
           each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the
           weights (similar to our in-context learning approach), or alternatively could improve sample efﬁciency when updating
           the weights for a new task. Multi-task learning has shown some promising initial results [LGH + 15,LSP + 18] and
           multi-stage ﬁne-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed
           the boundaries on certain tasks [KKS + 20], but is still limited by the need to manually curate collections of datasets and
           set up training curricula. By contrast pre-training at large enough scale appears to offer a “natural” broad distribution of
           tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate
           a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR + 17], human
           interaction [ZSW + 19b], or active learning [Mac92].
           Algorithmic innovation in language models over the last two years has been enormous, including denoising-based
           bidirectionality [DCLT18], preﬁxLM [DL15] and encoder-decoder architectures [LLG + 19,RSR + 19], random permu-
           tations during training [YDY + 19], architectures that improve the efﬁciency of sampling [DYY + 19], improvements in
           data and training procedures [LOG + 19], and efﬁciency increases in the embedding parameters [LCG + 19]. Many of
           these techniques provide signiﬁcant gains on downstream tasks. In this work we continue to focus on pure autoregressive
           language models, both in order to focus on in-context learning performance and to reduce the complexity of our large
           model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3’s
           performance on downstream tasks, especially in the ﬁne-tuning setting, and combining GPT-3’s scale with these
           algorithmic techniques is a promising direction for future work.


           8 Conclusion

           We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and
           benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of
           state-of-the-art ﬁne-tuned systems, as well as generating high-quality samples and strong qualitative performance at
           tasks deﬁned on-the-ﬂy. We documented roughly predictable trends of scaling in performance without using ﬁne-tuning.
           We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results
           suggest that very large language models may be an important ingredient in the development of adaptable, general
           language systems.

           Acknowledgements

           The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub
           Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea
           Voss for helping run evaluations on OpenAI’s infrastructure. Thanks to David Luan for initial support in scaling up
           this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura
           Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early
           discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments,
           Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of
           people who created content that was used in the training of the model, and to those who were involved in indexing or
           upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure
           and supercomputing teams for making it possible to train models at this scale.

                                                  Contributions

           Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and Jeffrey Wu
           implemented the large-scale models, training infrastructure, and model-parallel strategies.
           Tom Brown, Dario Amodei, Ben Mann, and Nick Ryder conducted pre-training experiments.
           Ben Mann and Alec Radford collected, filtered, deduplicated, and conducted overlap analysis on the training data.
           Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Tom Henighan, and
           Girish Sastry implemented the downstream tasks and the software framework for supporting them, including creation
           of synthetic tasks.
           Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show continued gains, and
           applied scaling laws to help predict and guide model and data scaling decisions for the research.
           Ben Mann implemented sampling without replacement during training.
           Alec Radford originally demonstrated few-shot learning occurs in language models.
           Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and systematically
           studied in-context learning curves, task prompting, and evaluation methods.
           Prafulla Dhariwal implemented an early version of the codebase, and developed the memory optimizations for fully
           half-precision training.
           Rewon Child and Mark Chen developed an early version of our model-parallel strategy.
           Rewon Child and Scott Gray contributed the sparse transformer.
           Aditya Ramesh experimented with loss scaling strategies for pretraining.
           Melanie Subbiah and Arvind Neelakantan implemented, experimented with, and tested beam search.
           Pranav Shyam worked on SuperGLUE and assisted with connections to few-shot learning and meta-learning literature.
           Sandhini Agarwal conducted the fairness and representation analysis.
           Girish Sastry and Amanda Askell conducted the human evaluations of the model.
           Ariel Herbert-Voss conducted the threat analysis of malicious use.
           Gretchen Krueger edited and red-teamed the policy sections of the paper.
           Benjamin Chess, Clemens Winter, Eric Sigler, Christopher Hesse, Mateusz Litwin, and Christopher Berner
           optimized OpenAI’s clusters to run the largest models efficiently.
           Scott Gray developed fast GPU kernels used during training.
           Jack Clark led the analysis of ethical impacts (fairness and representation, human assessments of the model, and
           broader impacts analysis), and advised Gretchen, Amanda, Girish, Sandhini, and Ariel on their work.
           Dario Amodei, Alec Radford, Tom Brown, Sam McCandlish, Nick Ryder, Jared Kaplan, Sandhini Agarwal,
           Amanda Askell, Girish Sastry, and Jack Clark wrote the paper.
           Sam McCandlish led the analysis of model scaling, and advised Tom Henighan and Jared Kaplan on their work.
           Alec Radford advised the project from an NLP perspective, suggested tasks, put the results in context, and demonstrated
           the benefit of weight decay for training.
           Ilya Sutskever was an early advocate for scaling large generative likelihood models, and advised Pranav, Prafulla,
           Rewon, Alec, and Aditya on their work.
           Dario Amodei designed and led the research.

                                                A Details of Common Crawl Filtering

           As mentioned in Section 2.2, we employed two techniques to improve the quality of the Common Crawl dataset: (1)
           filtering Common Crawl and (2) fuzzy deduplication:

               1. In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low
                  quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier
                  to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by
                  prioritizing documents which were predicted by the classifier to be higher quality. The classifier is a
                  logistic regression trained on features from Spark’s standard tokenizer and HashingTF. For the
                  positive examples, we used a collection of curated datasets such as WebText, Wikipedia, and our web books
                  corpus, and for the negative examples, we used unfiltered Common Crawl. We used
                  this classifier to score Common Crawl documents. We kept each document in our dataset iff

                                     <<FORMULA>>

                  We chose <<FORMULA>> in order to take mostly documents the classifier scored highly, but still include some documents
                  that were out of distribution. <<FORMULA>> was chosen to match the distribution of scores from our classifier on WebText.
                  We found this re-weighting increased quality as measured by loss on a range of out-of-distribution generative
                  text samples.
               2. To further improve model quality and prevent overfitting (which becomes increasingly important as model
                 capacity increases), we fuzzily deduplicated documents (i.e. removed documents with high overlap with
                 other documents) within each dataset using Spark’s MinHashLSH implementation with 10 hashes, using the
                 same features as were used for classiﬁcation above. We also fuzzily removed WebText from Common Crawl.
                 Overall this decreased dataset size by an average of 10%.
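
The fuzzy deduplication in item 2 can be illustrated with a small MinHash sketch. This is not Spark's MinHashLSH implementation: the word-shingle features, the SHA-1 based hash family, and the function names are assumptions chosen for a self-contained example, whereas the text reuses the classifier's tokenizer and HashingTF features. Only the count of 10 hashes comes from the text.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a lowercased document (assumed features)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=10):
    """Keep the minimum hash value per seeded hash function (10 hashes, as in the text)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Documents whose estimated similarity exceeds some threshold would be treated as near-duplicates; the text does not state the threshold, so none is fixed here.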

           After ﬁltering for duplicates and quality, we also partially removed text occurring in benchmark datasets, described in
           Appendix C.

           B Details of Model Training

           To train all versions of GPT-3, we use Adam with <<FORMULA>>, we clip the global norm of the
           gradient at 1.0, and we use cosine decay for learning rate down to 10% of its value, over 260 billion tokens (after 260
           billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the ﬁrst 375
           million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over
           the ﬁrst 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during
           training (until an epoch boundary is reached) to minimize overﬁtting. All models use weight decay of 0.1 to provide a
           small amount of regularization [LH17].
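
The learning rate schedule described above can be sketched as a function of tokens seen. The token counts and the 10% floor come from the text; the assumption that the cosine phase begins where warmup ends (rather than at token 0), and the function and constant names, are illustrative choices only.

```python
import math

WARMUP_TOKENS = 375e6   # linear warmup length, from the text
DECAY_TOKENS = 260e9    # cosine decay horizon, from the text
FINAL_FRACTION = 0.1    # decay floor: 10% of the peak learning rate


def learning_rate(tokens_seen, peak_lr):
    """Learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        return peak_lr * tokens_seen / WARMUP_TOKENS
    if tokens_seen >= DECAY_TOKENS:
        # Training continues at 10% of the original learning rate.
        return peak_lr * FINAL_FRACTION
    # Assumption: cosine decay spans the interval from the end of warmup to 260B tokens.
    progress = (tokens_seen - WARMUP_TOKENS) / (DECAY_TOKENS - WARMUP_TOKENS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (FINAL_FRACTION + (1.0 - FINAL_FRACTION) * cosine)
```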
           During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple
           documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency.
           Sequences with multiple documents are not masked in any special way but instead documents within a sequence
           are delimited with a special end of text token, giving the language model the information necessary to infer that
           context separated by the end of text token is unrelated. This allows for efﬁcient training without need for any special
           sequence-speciﬁc masking.
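
A minimal sketch of the packing scheme described above, assuming documents are already tokenized into lists of integer ids. The function name, the end-of-text id of 0, and the choice to drop a trailing partial chunk are assumptions made for illustration.

```python
def pack_documents(token_docs, n_ctx=2048, eot_id=0):
    """Concatenate tokenized documents, each followed by eot_id, into n_ctx-token sequences."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eot_id)  # delimiter marks unrelated context
    # No special attention masking is applied; a trailing partial chunk is dropped (assumption).
    return [stream[i:i + n_ctx] for i in range(0, len(stream) - n_ctx + 1, n_ctx)]
```

For example, with `n_ctx=4` the documents `[[1, 2], [3, 4, 5], [6]]` pack into `[[1, 2, 0, 3], [4, 5, 0, 6]]`.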

           C Details of Test Set Contamination Studies

           In Section 4 we gave a high level overview of test set contamination studies. In this section we provide details on
           methodology and results.

           Initial training set ﬁltering We attempted to remove text occurring in benchmarks from training data by searching
           for 13-gram overlaps between all test/development sets used in this work and our training data, and we removed
           the colliding 13-gram as well as a 200 character window around it, splitting the original document into pieces. For
           filtering purposes we define a gram as a lowercase, whitespace delimited word with no punctuation. Pieces less than
           200 characters long were discarded. Documents split into more than 10 pieces were considered contaminated and
           removed entirely. Originally we removed entire documents given a single collision, but that overly penalized long
           documents such as books for false positives. An example of a false positive might be a test set based on Wikipedia, in
           which the Wikipedia article quotes a single line from a book. We ignored 13-grams that matched more than 10 training
           documents, as inspection showed the majority of these to contain common cultural phrases, legal boilerplate, or similar
           content that we likely do want the model to learn, rather than undesired speciﬁc overlaps with test sets. Examples for
           various frequencies can be found in the GPT-3 release repository.
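
The filtering steps above can be sketched as follows. This is a simplified illustration, not the pipeline used for the paper: it assumes the 200 character window is removed on each side of a colliding 13-gram, counts pieces before discarding short ones, and omits the rule that ignores 13-grams matching more than 10 training documents. All function names are hypothetical.

```python
import re

GRAM_SIZE = 13
WINDOW = 200      # characters removed on each side of a collision (assumption)
MIN_PIECE = 200   # pieces shorter than this are discarded
MAX_PIECES = 10   # documents split into more pieces are dropped entirely


def word_spans(text):
    """(gram, start, end) for each lowercase, whitespace-delimited word without punctuation."""
    spans = []
    for m in re.finditer(r"\S+", text):
        word = re.sub(r"[^\w]", "", m.group().lower())
        if word:
            spans.append((word, m.start(), m.end()))
    return spans


def ngrams(text, n=GRAM_SIZE):
    words = [w for w, _, _ in word_spans(text)]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(document, benchmark_grams, n=GRAM_SIZE):
    """Return the surviving pieces of `document`, or [] if it is considered contaminated."""
    spans = word_spans(document)
    keep = [True] * len(document)
    for i in range(len(spans) - n + 1):
        gram = tuple(w for w, _, _ in spans[i:i + n])
        if gram in benchmark_grams:
            start = max(0, spans[i][1] - WINDOW)
            end = min(len(document), spans[i + n - 1][2] + WINDOW)
            for j in range(start, end):
                keep[j] = False
    # Split the surviving characters into contiguous pieces.
    pieces, current = [], []
    for ch, kept in zip(document, keep):
        if kept:
            current.append(ch)
        elif current:
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    if len(pieces) > MAX_PIECES:
        return []
    return [p for p in pieces if len(p) >= MIN_PIECE]
```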

            Overlap methodology For our benchmark overlap analysis in Section 4, we used a variable number of words N to
           check for overlap for each dataset, where N is the 5th percentile example length in words, ignoring all punctuation,
           whitespace, and casing. Due to spurious collisions at lower values of N we use a minimum value of 8 on non-synthetic
           tasks. For performance reasons, we set a maximum value of 13 for all tasks. Values for N and the amount of data
           marked as dirty are shown in Table C.1. Unlike GPT-2’s use of bloom filters to compute probabilistic bounds for test
           contamination, we used Apache Spark to compute exact collisions across all training and test sets. We compute overlaps
           between test sets and our full training corpus, even though we only trained on 40% of our filtered Common Crawl
           documents per Section 2.2.
           We define a ‘dirty’ example as one with any N-gram overlap with any training document, and a ‘clean’ example as one
           with no collision.
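
As a rough illustration of the rule above, the sketch below picks N for a dataset and marks an example as dirty. The nearest-rank percentile definition and the function names are assumptions; the bounds of 8 and 13 come from the text. The analysis itself was run with Apache Spark, which this single-process sketch does not reproduce.

```python
import math
import re


def normalize(text):
    """Lowercase and strip punctuation, returning the list of words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def choose_n(examples, synthetic=False, min_n=8, max_n=13):
    """N is the 5th percentile example length in words, clamped as described in the text."""
    lengths = sorted(len(normalize(e)) for e in examples)
    index = max(0, math.ceil(0.05 * len(lengths)) - 1)  # nearest-rank percentile (assumption)
    n = lengths[index]
    if not synthetic:
        n = max(n, min_n)  # the minimum applies only to non-synthetic tasks
    return min(n, max_n)   # the maximum applies to all tasks


def is_dirty(example, training_ngrams, n):
    """An example is dirty if any of its N-grams occurs in the training N-gram set."""
    words = normalize(example)
    return any(tuple(words[i:i + n]) in training_ngrams
               for i in range(len(words) - n + 1))
```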
           Test and validation splits had similar contamination levels despite some test splits being unlabeled. Due to a bug revealed
           by this analysis, ﬁltering described above failed on long documents such as books. Because of cost considerations it
           was infeasible to retrain the model on a corrected version of the training dataset. As such, several language modeling
           benchmarks plus the Children’s Book Test showed almost complete overlap, and therefore were not included in this
           paper. Overlaps are shown in Table C.1.

           Overlap results To understand how much having seen some of the data helps the model perform on downstream
           tasks, we ﬁlter every validation and test set by dirtiness. Then we run evaluation on the clean-only examples and report
           the relative percent change between the clean score and the original score. If the clean score is more than 1% or 2%
           worse than the overall score, it suggests the model may have overﬁt to the examples it has seen. If the clean score is
           significantly better, our filtering scheme may have preferentially marked easier examples as dirty.
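
The comparison above reduces to a relative percent change. The sketch below is illustrative only: the function names and the 1% default threshold are assumptions (the text mentions "1% or 2%" without fixing one value).

```python
def relative_change_percent(clean_score, overall_score):
    """Percent change of the clean-only score relative to the score on all examples."""
    return 100.0 * (clean_score - overall_score) / overall_score


def flag_for_review(clean_score, overall_score, threshold_percent=1.0):
    """True when the clean subset is worse than the full set by more than the threshold."""
    return relative_change_percent(clean_score, overall_score) < -threshold_percent
```

A clean score of 49.0 against an overall score of 50.0 gives a relative change of -2%, which would be flagged at either threshold mentioned in the text.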
           This overlap metric tends to show a high rate of false positives for datasets that contain background information (but
           not answers) drawn from the web (such as SQuAD, which draws from Wikipedia) or examples less than 8 words
           long, which we ignored in our filtering process (except for word scrambling tasks). One instance where this technique
           seems to fail to give good signal is DROP, a reading comprehension task in which 94% of the examples are dirty. The
           information required to answer the question is in a passage provided to the model, so having seen the passage during
           training but not the questions and answers does not meaningfully constitute cheating. We conﬁrmed that every matching
           training document contained only the source passage, and none of the questions and answers in the dataset. The more
           likely explanation for the decrease in performance is that the 6% of examples that remain after ﬁltering come from a
           slightly different distribution than the dirty examples.
           Figure 4.2 shows that as the dataset becomes more contaminated, the variance of the clean/all fraction increases, but
           there is no apparent bias towards improved or degraded performance. This suggests that GPT-3 is relatively insensitive
           to contamination. See Section 4 for details on the datasets we flagged for further review.

                                                <<TABLE>>

             Table C.1: Overlap statistics for all datasets sorted from dirtiest to cleanest. We consider a dataset example dirty if it
             has a single N-gram collision with any document in our training corpus. “Relative Difference Clean vs All” shows the
             percent change in performance between only the clean examples vs all the examples in the benchmark. “Count” shows
             the number of examples. “Clean percentage” is the percent of examples that are clean vs total. For “Acc/F1/BLEU” we
             use the metric speciﬁed in “Metric”. These scores come from evaluations with a different seed for the random examples
             used for in-context learning, and will therefore differ slightly from the scores elsewhere in the paper.

                                                       D Total Compute Used to Train Language Models

           This appendix contains the calculations that were used to derive the approximate compute used to train the language
           models in Figure 2.2. As a simplifying assumption, we ignore the attention operation, as it typically uses less than 10%
           of the total compute for the models we are analyzing.
           Calculations can be seen in Table D.1 and are explained within the table caption.

                                                  <<TABLE>>

           Table D.1: Starting from the right hand side and moving left, we begin with the number of training tokens that each
           model was trained with. Next we note that since T5 uses an encoder-decoder model, only half of the parameters are
           active for each token during a forward or backwards pass. We then note that each token is involved in a single addition
           and a single multiply for each active parameter in the forward pass (ignoring attention). Then we add a multiplier of
           3x to account for the backwards pass (as computing both ∂params/∂loss and ∂acts/∂loss use a similar amount of compute
           as the forwards pass). Combining the previous two numbers, we get the total flops per parameter per token. We multiply this
           value by the total training tokens and the total parameters to yield the number of total flops used during training. We
           report both flops and petaflop/s-day (each of which are 8.64e+19 flops).
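
The arithmetic in the caption can be written out as a short sketch. The function name and the example inputs (175e9 parameters, 300e9 training tokens) are illustrative assumptions rather than values taken from the table; the petaflop/s-day constant is derived from its definition, 10^15 flops per second sustained for one day.

```python
PETAFLOP_S_DAY = 1e15 * 24 * 60 * 60  # flops in one petaflop/s-day = 8.64e19


def training_flops(params, tokens, active_fraction=1.0):
    """Approximate training compute, ignoring attention.

    2 flops (one add, one multiply) per active parameter per token in the
    forward pass, times 3 to account for the backward pass.
    active_fraction=0.5 models the encoder-decoder case described for T5.
    """
    flops_per_param_per_token = 2 * 3
    return flops_per_param_per_token * active_fraction * params * tokens


# Illustrative inputs only (assumed, not read from Table D.1).
total = training_flops(params=175e9, tokens=300e9)
pf_days = total / PETAFLOP_S_DAY
```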

           E Human Quality Assessment of Synthetic News Articles

           This appendix contains details on the experiments measuring human ability to distinguish GPT-3-generated synthetic
           news articles from real news articles. We first describe the experiments on the 200 word news articles, and then
           describe the preliminary investigation of 500 word news articles generated by GPT-3.
           Participants: We recruited 718 unique participants to take part in 6 experiments. 97 participants were excluded for
           failing an internet check question, leaving a total of 621 participants: 343 male, 271 female, and 7 other. Mean
           participant age was 38 years old. All participants were recruited through Positly, which maintains a whitelist of
           high-performing workers from Mechanical Turk. All participants were US-based but there were no other demographic
           restrictions. Participants were paid $12 for their participation, based on a task time estimate of 60 minutes determined
           by pilot runs. In order to ensure that the sample of participants for each experiment quiz was unique, participants were
           not allowed to take part in an experiment more than once.
           Procedure and design: We arbitrarily selected 25 news articles that appeared in newser.com in early 2020. We used
           the article titles and subtitles to produce outputs from the 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13.0B, and 175B
           (GPT-3) parameter language models. Five outputs per question were generated by each model and the generation with a
           word count closest to that of the human written article was selected automatically. This was to minimize the effect
           that completion length might have on participants’ judgments. The same output procedure was used for each model, with the
           exception of the removal of the intentionally bad control model, as described in the main text.

                                                  <<TABLE>>

           Table E.1: Participant details and article lengths for each experiment to evaluate human detection of 200 word model
           generated news articles. Participants were excluded due to internet check fails.

                                        <<TABLE>>

           Figure E.1: Participants spend more time trying to identify whether each news article is machine generated as model
           size increases. Duration on the control model is indicated with the dashed line. Line of best ﬁt is a linear model on a log
           scale with 95% conﬁdence intervals.


           In each experiment, half of the participants were randomly assigned to quiz A and half were randomly assigned to quiz
           B. Each quiz consisted of 25 articles: half (12-13) were human written and half (12-13) were model generated: the
           articles with human written completions in quiz A had model generated completions in quiz B and vice versa. The
           order of quiz question was shufﬂed for each participant. Participants could leave comments and were asked to indicate
           if they had seen the articles before. Participants were instructed not to look up the articles or their content during the
           quiz and at the end of the quiz were asked if they had looked anything up during the quiz.
           Statistical Tests: To compare means on the different runs, we performed a two-sample t-test for independent groups for
           each model against the control. This was implemented in Python using the scipy.stats.ttest_ind function. When
           plotting a regression line in the graph of average participant accuracy vs model size, we fit a power law of the form
           $ax^{-b}$. The 95% confidence intervals were estimated from the t-distribution of the sample mean.
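
To make the two computations above concrete, here is a dependency-free sketch. It computes the pooled-variance two-sample t statistic, which is the statistic scipy.stats.ttest_ind reports under its default equal-variance setting, and fits a power law by least squares in log-log space. The log-log fitting method and the function names are assumptions; the text does not say how the power law was fit, and this sketch does not compute p-values.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic for independent groups, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def fit_power_law(xs, ys):
    """Fit y = a * x**slope by linear regression on (log x, log y).

    Returns (a, slope); a decreasing trend yields a negative slope.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = mean(lx), mean(ly)
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - slope * mx)
    return a, slope
```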
           Duration statistics: In the main text, we discussed the finding that the ability of human participants to distinguish
           model and human generated news articles decreases as our models become larger. We have also found that the
           average time spent for a given set of questions increases as the model size increases, as shown in Figure E.1. Lower
           accuracy scores despite increased time investment from participants support the finding that larger models generate
           harder-to-distinguish news articles.

                                                  <<TABLE>>

           Table E.2: Participant details and article lengths for the experiments investigating human detection of 500 word
           model generated news articles. Participants were excluded due to internet check fails.

           Preliminary investigation of 500 word articles: We recruited 160 unique US-based participants to take part in 2
           experiments through Positly (details are given in Table E.2). We randomly selected 12 Reuters world news articles from
           late 2019 and created a context for GPT-3 175B that consisted of a single Reuters article not in this set of 12. We then
           used the article titles and Reuters locations to generate completions from GPT-3 175B and the 160M control model
           from the previous experiments. These were used to create two 12-question quizzes per model, each consisting of half
           human written and half model generated articles. Comprehension questions were added and articles were shown to
           participants in 3 stages at 30 second intervals to encourage closer reading. Participants were paid $12 for this task.
           Model generation selection methods, exclusion criteria, and statistical tests mirror those of the previous experiments.

           F Additional Samples from GPT-3

           GPT-3 adapts well to many tasks other than the ones explored in the main body of the paper. As an example, in Figure
           F.1, we show four uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the
           style of Wallace Stevens. We ﬁrst experimented with a few prompts, then generated four samples with no additional
           editing or selection (sampling at temperature 1 using nucleus sampling [HBFC19] with P = 0.9). Completions were
           truncated when the model began to write a new title and author heading, or broke into prose commentary.
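
For readers unfamiliar with nucleus sampling, the sketch below shows the idea on a single vector of logits: sample only from the smallest set of highest-probability tokens whose cumulative probability reaches P. It is a generic illustration with hypothetical names, not the sampling code used to produce the samples.

```python
import math
import random


def nucleus_sample(logits, top_p=0.9, temperature=1.0, rng=random):
    """Sample a token index from the top-p nucleus of softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Take tokens in order of decreasing probability until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample within the nucleus, renormalizing by its total mass.
    draw = rng.random() * cumulative
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if draw < running:
            return i
    return nucleus[-1]
```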

                                        <<FIGURE>>

                 Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace
                 Stevens with the title ‘Shadows on the Way’.


                                                     G Details of Task Phrasing and Speciﬁcations

             The following ﬁgures illustrate the formatting and phrasing of all the tasks included in the paper. All data comes from
             the ground truth datasets in this section, and no samples from GPT-3 are included here.

                                               <<FIGURE>>

             Figure G.1: Formatted dataset example for RACE-h. When predicting, we normalize by the unconditional probability
             of each answer as described in Section 2.

                                                                <<FIGURE>>

                                      Figure G.4: Formatted dataset example for PIQA

                                        <<FIGURE>>

                                      Figure G.5: Formatted dataset example for COPA

                                <<FIGURE>>

             Figure G.6: Formatted dataset example for ReCoRD. We consider the context above to be a single “problem” because
             this is how the task is presented in the ReCoRD dataset and scored in the ReCoRD evaluation script.

                      <<FIGURE>>

             Figure G.8: Formatted dataset example for OpenBookQA. When predicting, we normalize by the unconditional
             probability of each answer as described in Section 2.

                      Context →  Making a cake: Several cake pops are shown on a display. A woman and girl
                                 are shown making the cake pops in a kitchen. They
                Correct Answer →  bake them, then frost and decorate.
              Incorrect Answer →  taste them as they place them on plates.
              Incorrect Answer →  put the frosting on the cake as they pan it.
              Incorrect Answer →  come out and begin decorating the cake as well.

                                    Figure G.9: Formatted dataset example for HellaSwag

                      <<FIGURE>>

                                    Figure G.10: Formatted dataset example for ANLI R3

                      <<FIGURE>>

             Figure G.11: Formatted dataset example for ARC (Challenge). When predicting, we normalize by the unconditional
             probability of each answer as described in Section 2.

                      <<FIGURE>>

                                  Figure G.12: Formatted dataset example for SAT Analogies

                <<FIGURE>>

             Figure G.14: Formatted dataset example for Winogrande. The ‘partial’ evaluation method we use compares the
             probability of the completion given a correct and incorrect context.


                      <<FIGURE>>

             Figure G.15: Formatted dataset example for MultiRC. There are three levels within MultiRC: (1) the passage, (2) the
             questions, and (3) the answers. During evaluation, accuracy is determined at the per-question level, with a question
             being considered correct if and only if all the answers within the question are labeled correctly. For this reason, we use
             K to refer to the number of questions shown within the context.
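
The per-question scoring rule in the caption above can be stated as a short sketch. The data layout (a dict from question id to a list of per-answer boolean labels) and the function name are assumptions for illustration and do not reflect the actual evaluation code.

```python
def multirc_question_accuracy(predictions, labels):
    """Fraction of questions whose answers are ALL labeled correctly.

    Both arguments map a question id to a list of booleans, one per candidate answer.
    """
    correct = sum(predictions[q] == labels[q] for q in labels)
    return correct / len(labels)
```

For instance, if one of two questions has a single mislabeled answer, the accuracy is 0.5 even though most individual answers are correct.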

                      <<FIGURE>>

             Figure G.16: Formatted dataset example for ARC (Easy). When predicting, we normalize by the unconditional
             probability of each answer as described in Section 2.

                                                       <<FIGURE>>

                                   Figure G.17:Formatted dataset example for StoryCloze

                                        <<FIGURE>>

                                     Figure G.18:Formatted dataset example for CoQA

                                                                <<FIGURE>>

                                 Figure G.24:Formatted dataset example for Natural Questions

                                        <<FIGURE>>

                                 Figure G.26:Formatted dataset example for Symbol Insertion

                                                                <<FIGURE>>

                                      Figure G.30:Formatted dataset example for CB

                                                <<FIGURE>>

                                      Figure G.32: Formatted dataset example for WiC

                                                                <<FIGURE>>

             Figure G.36: Formatted dataset example for De→En. This is the format for one- and few-shot learning; for this and
             other language tasks, the format for zero-shot learning is “Q: What is the {language} translation of {sentence} A:
             {translation}.”

                                                        <<FIGURE>>

                                  Figure G.49: Formatted dataset example for Arithmetic 4D+

                                                <<FIGURE>>

                                 Figure G.50: Formatted dataset example for Arithmetic 5D

                                                                <<FIGURE>>

                                  Figure G.51: Formatted dataset example for Arithmetic 5D+


                                                       H Results on All Tasks for All Model Sizes

                                                         <<TABLE>>

                                      Table H.1: Scores for every task, setting, and model that we investigate in this paper.

                                                        <<FIGURE>>

                                                   Figure H.1: All results for all SuperGLUE tasks.

                <<FIGURE>>                                              <<FIGURE>>

             Figure H.2: Results for SAT task.              Figure H.3: All results for all Winograd tasks.

                                                                <<FIGURE>>

                                                Figure H.4: All results for all Arithmetic tasks.

                                        <<FIGURE>>

                               Figure H.5: All results for all Cloze and Completion tasks.

                                                                        <<FIGURE>>

                                        Figure H.6: All results for all Common Sense Reasoning tasks.

                                                <<FIGURE>>

                                     Figure H.7: All results for all QA tasks.

                                                <<FIGURE>>

                              Figure H.8: All results for all Reading Comprehension tasks.

                                                        <<FIGURE>>

                                    Figure H.9: All results for all ANLI rounds.

                                                        <<FIGURE>>

                                                Figure H.10: All results for all Scramble tasks.

                                        <<FIGURE>>

                                  Figure H.11: All results for all Translation tasks.


<|endoftext|>


<|startoftext|>
                  Learning both Weights and Connections for Efficient Neural Networks

                                  Song Han                     Jeff Pool
                               Stanford University                   NVIDIA
                           songhan@stanford.edu          jpool@nvidia.com

                                 John Tran                   William J. Dally
                                 NVIDIA                   Stanford University
                           johntran@nvidia.com                NVIDIA
                                                        dally@stanford.edu


                                               Abstract

                       Neural networks are both computationally intensive and memory intensive, making
                       them difﬁcult to deploy on embedded systems. Also, conventional networks ﬁx
                       the architecture before training starts; as a result, training cannot improve the
                       architecture. To address these limitations, we describe a method to reduce the
                       storage and computation required by neural networks by an order of magnitude
                       without affecting their accuracy by learning only the important connections. Our
                       method prunes redundant connections using a three-step method. First, we train
                       the network to learn which connections are important. Next, we prune the
                       unimportant connections. Finally, we retrain the network to ﬁne tune the weights of the
                       remaining connections. On the ImageNet dataset, our method reduced the number
                       of parameters of AlexNet by a factor of 9×, from 61 million to 6.7 million, without
                       incurring accuracy loss. Similar experiments with VGG-16 found that the total
                       number of parameters can be reduced by 13×, from 138 million to 10.3 million,
                       again with no loss of accuracy.


                 1 Introduction

                 Neural networks have become ubiquitous in applications ranging from computer vision [1] to speech
                 recognition [2] and natural language processing [3]. We consider convolutional neural networks used
                 for computer vision tasks which have grown over time. In 1998 LeCun et al. designed a CNN model
                 LeNet-5 with less than 1M parameters to classify handwritten digits [4], while in 2012, Krizhevsky
                 et al. [1] won the ImageNet competition with 60M parameters. Deepface classified human faces with
                 120M parameters [5], and Coates et al. [6] scaled up a network to 10B parameters.
                 While these large neural networks are very powerful, their size consumes considerable storage,
                 memory bandwidth, and computational resources. For embedded mobile applications, these resource
                 demands become prohibitive. Figure 1 shows the energy cost of basic arithmetic and memory
                 operations in a 45nm CMOS process. From this data we see the energy per connection is dominated
                 by memory access and ranges from 5pJ for 32-bit coefficients in on-chip SRAM to 640pJ for 32-bit
                 coefficients in off-chip DRAM [7]. Large networks do not fit in on-chip storage and hence require
                 the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at
                 20Hz would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access - well beyond the power
                 envelope of a typical mobile device. Our goal in pruning networks is to reduce the energy required to
                 run such large networks so they can run in real time on mobile devices. The model size reduction
                 from pruning also facilitates storage and transmission of mobile applications incorporating DNNs.
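
                 The DRAM power estimate above can be reproduced with a few lines of arithmetic. The sketch
                 assumes, as the text does, that every connection weight is fetched once per inference; the
                 function name and the SRAM comparison are illustrative additions, not part of the original
                 analysis.

```python
# Per-access energies for 32-bit values, as quoted in the text (45nm CMOS, [7]).
DRAM_PJ = 640.0   # off-chip DRAM
SRAM_PJ = 5.0     # on-chip SRAM


def weight_fetch_power_watts(connections, rate_hz, energy_pj):
    """Power spent reading every connection weight once per inference."""
    return connections * rate_hz * energy_pj * 1e-12  # pJ -> J


# 1 billion connections evaluated at 20Hz, as in the example in the text.
dram_watts = weight_fetch_power_watts(1e9, 20, DRAM_PJ)
sram_watts = weight_fetch_power_watts(1e9, 20, SRAM_PJ)
print(f"DRAM: {dram_watts:.1f} W, SRAM: {sram_watts:.1f} W")
```

                 Under these assumptions the same workload served from on-chip SRAM would cost roughly two
                 orders of magnitude less power, which is why fitting the pruned weights on chip matters.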

                                                  <<FIGURE>>

                 Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more
                 energy expensive than simple arithmetic.

                 To achieve this goal, we present a method to prune network connections in a manner that preserves the
                 original accuracy. After an initial training phase, we remove all connections whose weight is lower
                 than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This ﬁrst
                 phase learns the topology of the networks — learning which connections are important and removing
                 the unimportant connections. We then retrain the sparse network so the remaining connections can
                 compensate for the connections that have been removed. The phases of pruning and retraining may
                 be repeated iteratively to further reduce network complexity. In effect, this training process learns
                 the network connectivity in addition to the weights - much as in the mammalian brain [8][9], where
                 synapses are created in the ﬁrst few months of a child’s development, followed by gradual pruning of
                 little-used connections, falling to typical adult values.


                 2 Related Work


                 Neural networks are typically over-parameterized, and there is signiﬁcant redundancy for deep learn-
                 ing models [10]. This results in a waste of both computation and memory. There have been various
                 proposals to remove the redundancy: Vanhoucke et al. [11] explored a fixed-point implementation
                 with 8-bit integer (vs 32-bit floating point) activations. Denton et al. [12] exploited the linear
                 structure of the neural network by finding an appropriate low-rank approximation of the parameters
                 and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gong et al.
                 [13] compressed deep convnets using vector quantization. These approximation and quantization
                 techniques are orthogonal to network pruning, and they can be used together to obtain further gains
                 [14].
                 There have been other attempts to reduce the number of parameters of neural networks by replacing
                 the fully connected layer with global average pooling. The Network in Network architecture [15]
                 and GoogLenet [16] achieve state-of-the-art results on several benchmarks by adopting this idea.
                 However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them
                 to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This
                 problem is noted by Szegedy et al. [16] and motivates them to add a linear layer on the top of their
                 networks to enable transfer learning.
                 Network pruning has been used both to reduce network complexity and to reduce over-ﬁtting. An
                 early approach to pruning was biased weight decay [17]. Optimal Brain Damage [18] and Optimal
                 Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the
                 loss function and suggest that such pruning is more accurate than magnitude-based pruning such as
                 weight decay. However, computing second-order derivatives requires additional computation.
                 HashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly
                 group connection weights into hash buckets, so that all connections within the same hash bucket
                 share a single parameter value. This technique may benefit from pruning. As pointed out in Shi et al.
                 [21] and Weinberger et al. [22], sparsity will minimize hash collisions, making feature hashing even
                 more effective. HashedNets may be used together with pruning to give even better parameter savings.

                        <<FIGURE>>

                 Figure 2: Three-Step Training Pipeline.

                                                  <<FIGURE>>

                 Figure 3: Synapses and neurons before and after pruning.


                 3 Learning Connections in Addition to Weights

                 Our pruning method employs a three-step process, as illustrated in Figure 2, which begins by learning
                 the connectivity via normal network training. Unlike conventional training, however, we are not
                 learning the ﬁnal values of the weights, but rather we are learning which connections are important.
                 The second step is to prune the low-weight connections. All connections with weights below a
                 threshold are removed from the network — converting a dense network into a sparse network, as
                 shown in Figure 3. The ﬁnal step retrains the network to learn the ﬁnal weights for the remaining
                 sparse connections. This step is critical. If the pruned network is used without retraining, accuracy is
                 signiﬁcantly impacted.

                 3.1 Regularization

                 Choosing the correct regularization impacts the performance of pruning and retraining. L1 regularization
                 penalizes non-zero parameters resulting in more parameters near zero. This gives better accuracy
                 after pruning, but before retraining. However, the remaining connections are not as good as with L2
                 regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the best
                 pruning results. This is further discussed in the experiment section.

                 3.2 Dropout Ratio Adjustment

                 Dropout [23] is widely used to prevent over-ﬁtting, and this also applies to retraining. During
                 retraining, however, the dropout ratio must be adjusted to account for the change in model capacity.
                 In dropout, each parameter is probabilistically dropped during training, but will come back during
                 inference. In pruning, parameters are dropped forever after pruning and have no chance to come back
                 during both training and inference. As the parameters get sparse, the classiﬁer will select the most
                 informative predictors and thus have much less prediction variance, which reduces over-ﬁtting. As
                 pruning already reduced model capacity, the retraining dropout ratio should be smaller.
                 Quantitatively, let $C_i$ be the number of connections in layer $i$ ($C_{io}$ for the original network,
                 $C_{ir}$ for the network after retraining), and let $N_i$ be the number of neurons in layer $i$. Since
                 dropout works on neurons, and $C_i$ varies quadratically with $N_i$ according to Equation 1, the dropout
                 ratio after pruning the parameters should follow Equation 2, where $D_o$ represents the original dropout
                 rate and $D_r$ represents the dropout rate during retraining.
                                                   $C_i = N_i N_{i-1}$                      (1)
                                                   $D_r = D_o \sqrt{C_{ir} / C_{io}}$       (2)
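
                 As a small worked example of the dropout adjustment, the sketch below assumes Equation 2 has
                 the square-root form D_r = D_o * sqrt(C_ir / C_io), which is what a quadratic dependence of
                 connections on neurons suggests. The function name and the sample numbers are illustrative
                 only.

```python
import math


def retrain_dropout(d_orig, conn_orig, conn_retrain):
    """Scale the dropout rate by the square root of the surviving connection fraction.

    Assumes D_r = D_o * sqrt(C_ir / C_io); the argument names are hypothetical.
    """
    if conn_orig <= 0 or not 0 <= conn_retrain <= conn_orig:
        raise ValueError("need 0 <= conn_retrain <= conn_orig and conn_orig > 0")
    return d_orig * math.sqrt(conn_retrain / conn_orig)


# A layer trained with dropout 0.5 that keeps one ninth of its connections.
print(retrain_dropout(0.5, conn_orig=9_000_000, conn_retrain=1_000_000))
```

                 With one ninth of the connections left, the rate shrinks by a factor of three rather than
                 nine, reflecting that dropout acts on neurons while pruning acts on connections.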

                 3.3 Local Pruning and Parameter Co-adaptation

                 During retraining, it is better to retain the weights from the initial training phase for the connections
                 that survived pruning than it is to re-initialize the pruned layers. CNNs contain fragile co-adapted
                 features [24]: gradient descent is able to ﬁnd a good solution when the network is initially trained,
                 but not after re-initializing some layers and retraining them. So when we retrain the pruned layers,
                 we should keep the surviving parameters instead of re-initializing them.

                Table 1: Network pruning can save 9× to 13× parameters with no drop in predictive performance.

                                                                     <<TABLE>>


                 Retraining the pruned layers starting with retained weights requires less computation because we
                 don’t have to back propagate through the entire network. Also, neural networks are prone to suffer
                 the vanishing gradient problem [25] as the networks get deeper, which makes pruning errors harder to
                 recover for deep networks. To prevent this, we ﬁx the parameters for CONV layers and only retrain
                 the FC layers after pruning the FC layers, and vice versa.

                 3.4 Iterative Pruning

                 Learning the right connections is an iterative process. Pruning followed by retraining is one iteration;
                 after many such iterations the minimum number of connections can be found. Without loss of accuracy,
                 this method can boost the pruning rate from 5× to 9× on AlexNet compared with single-step aggressive
                 pruning. Each iteration is a greedy search in that we find the best connections. We also experimented
                 with probabilistically pruning parameters based on their absolute value, but this gave worse results.
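
                 The prune-and-retrain loop can be sketched on a toy problem. The example below is not the Caffe
                 implementation: it assumes a linear least-squares model trained by plain gradient descent, and
                 it prunes to a hypothetical schedule of kept fractions rather than to a per-layer threshold.
                 All names are illustrative.

```python
import numpy as np


def train(X, y, w, mask, lr=0.1, steps=500):
    """Gradient descent on mean squared error; pruned weights are held at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w


def iterative_prune(X, y, keep_schedule):
    """Train, then repeatedly prune small weights and retrain the survivors."""
    mask = np.ones(X.shape[1])
    w = train(X, y, np.zeros(X.shape[1]), mask)       # step 1: learn connectivity
    for keep in keep_schedule:
        k = max(1, int(round(keep * w.size)))
        cutoff = np.sort(np.abs(w))[-k]               # magnitude of the k-th largest weight
        mask = mask * (np.abs(w) >= cutoff)           # step 2: prune, never un-prune
        w = train(X, y, w * mask, mask)               # step 3: retrain without re-initializing
    return w, mask


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true

w, mask = iterative_prune(X, y, keep_schedule=[0.5, 0.25])
print(mask, np.round(w, 3))
```

                 Retraining starts from the surviving weights instead of fresh values, mirroring the advice in
                 Section 3.3.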

                 3.5 Pruning Neurons

                 After pruning connections, neurons with zero input connections or zero output connections may be
                 safely pruned. This pruning is furthered by removing all connections to or from a pruned neuron.
                 The retraining phase automatically arrives at the result where dead neurons will have both zero input
                 connections and zero output connections. This occurs due to gradient descent and regularization.
                 A neuron that has zero input connections (or zero output connections) will have no contribution
                 to the ﬁnal loss, leading the gradient to be zero for its output connection (or input connection),
                 respectively. Only the regularization term will push the weights to zero. Thus, the dead neurons will
                 be automatically removed during retraining.
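
                 The dead-neuron condition can also be checked directly on the pruning masks. The sketch below
                 is an explicit version of what retraining is described as achieving implicitly; it assumes
                 boolean masks for the weight matrices on either side of a hidden layer and ignores biases. The
                 function name and mask layout are assumptions.

```python
import numpy as np


def prune_dead_neurons(mask_in, mask_out):
    """Remove all connections of hidden neurons with no inputs or no outputs.

    mask_in  has shape (n_prev, n_hidden): True where an incoming connection survives.
    mask_out has shape (n_hidden, n_next): True where an outgoing connection survives.
    """
    no_input = ~mask_in.any(axis=0)
    no_output = ~mask_out.any(axis=1)
    dead = no_input | no_output
    mask_in, mask_out = mask_in.copy(), mask_out.copy()
    mask_in[:, dead] = False    # drop connections into dead neurons
    mask_out[dead, :] = False   # drop connections out of dead neurons
    return mask_in, mask_out, dead


mask_in = np.array([[1, 0, 1],
                    [1, 0, 0]], dtype=bool)      # hidden neuron 1 has no inputs
mask_out = np.array([[1, 1],
                     [1, 0],
                     [0, 0]], dtype=bool)        # hidden neuron 2 has no outputs
new_in, new_out, dead = prune_dead_neurons(mask_in, mask_out)
print(dead)
```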

                 4 Experiments

                 We implemented network pruning in Caffe [26]. Caffe was modiﬁed to add a mask which disregards
                 pruned parameters during network operation for each weight tensor. The pruning threshold is chosen
                 as a quality parameter multiplied by the standard deviation of a layer’s weights. We carried out the
                 experiments on Nvidia TitanX and GTX980 GPUs.
                 We pruned four representative networks: Lenet-300-100 and Lenet-5 on MNIST, together with
                 AlexNet and VGG-16 on ImageNet. The network parameters and accuracy 1 before and after pruning
                 are shown in Table 1.
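
                 A minimal NumPy sketch of the masking rule described above, with the threshold set to a quality
                 parameter times the standard deviation of the layer's weights. This is not the modified Caffe
                 code; the function names and the sample quality value are assumptions.

```python
import numpy as np


def magnitude_mask(weights, quality):
    """Keep weights whose magnitude reaches quality * std(weights)."""
    threshold = quality * weights.std()
    return np.abs(weights) >= threshold


def apply_mask(weights, mask):
    """Zero the pruned weights; the mask is reused on every forward and backward pass."""
    return weights * mask


weights = np.array([[0.5, -0.01],
                    [0.02, -0.8]])
mask = magnitude_mask(weights, quality=0.5)
pruned = apply_mask(weights, mask)
print(mask)
print(pruned)
```

                 Scaling the threshold by the standard deviation lets one quality parameter be reused across
                 layers whose weights have different ranges.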

                 4.1 LeNet on MNIST

                 We ﬁrst experimented on MNIST dataset with the LeNet-300-100 and LeNet-5 networks [4]. LeNet-
                 300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each, which
                 achieves 1.6% error rate on MNIST. LeNet-5 is a convolutional network that has two convolutional
                 layers and two fully connected layers, which achieves 0.8% error rate on MNIST. After pruning,
                 the network is retrained with 1/10 of the original network’s learning rate.

                    1 Reference model is from Caffe model zoo, accuracy is measured without data augmentation

                  Table 2: For Lenet-300-100, pruning reduces the number of weights by 12× and computation by 12×.

                               <<TABLE>>

                   Table 3: For Lenet-5, pruning reduces the number of weights by 12× and computation by 6×.

                              <<TABLE>>

                                                                    <<FIGURE>>

                 Figure 4: Visualization of the first FC layer’s sparsity pattern of Lenet-300-100. It has a banded
                 structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images,
                 since the digits are written in the center.


                 Table 1 shows pruning saves 12× parameters on these networks. For each layer of the network the table shows (left
                 to right) the original number of weights, the number of ﬂoating point operations to compute that
                 layer’s activations, the average percentage of activations that are non-zero, the percentage of non-zero
                 weights after pruning, and the percentage of actually required ﬂoating point operations.
                 An interesting byproduct is that network pruning detects visual attention regions. Figure 4 shows the
                 sparsity pattern of the ﬁrst fully connected layer of LeNet-300-100, the matrix size is 784x300. It
                 has 28 bands, each band’s width 28, corresponding to the 28x28 input pixels. The colored regions
                 of the ﬁgure, indicating non-zero parameters, correspond to the center of the image. Because digits
                 are written in the center of the image, these are the important parameters. The graph is sparse on the
                 left and right, corresponding to the less important regions on the top and bottom of the image. After
                 pruning, the neural network ﬁnds the center of the image more important, and the connections to the
                 peripheral regions are more heavily pruned.
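
                 One way to make the attention-region observation concrete is to fold the first layer's mask
                 back onto the image grid and count surviving connections per pixel. The sketch assumes the 784
                 rows of the mask are the 28x28 pixels flattened row by row, which is consistent with the 28
                 bands of width 28 but is not stated explicitly; the function name is hypothetical.

```python
import numpy as np


def pixel_survival_map(mask, side=28):
    """Count surviving connections per input pixel and reshape to the image grid.

    mask has shape (side * side, n_hidden); rows are assumed to be row-major pixels.
    """
    if mask.shape[0] != side * side:
        raise ValueError("mask must have side * side rows")
    return mask.sum(axis=1).reshape(side, side)


# Tiny 2x2 "image" feeding 3 hidden units, standing in for the 784x300 layer.
toy_mask = np.array([[1, 0, 0],
                     [1, 1, 1],
                     [0, 0, 0],
                     [0, 1, 0]], dtype=bool)
print(pixel_survival_map(toy_mask, side=2))
```

                 For the real layer, large counts would be expected near the center of the 28x28 map and small
                 counts near the border.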


                 4.2 AlexNet on ImageNet

                 We further examine the performance of pruning on the ImageNet ILSVRC-2012 dataset, which
                 has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as the
                 reference model, which has 61 million parameters across 5 convolutional layers and 3 fully connected
                 layers. The AlexNet Caffe model achieved a top-1 accuracy of 57.2% and a top-5 accuracy of 80.3%.
                 The original AlexNet took 75 hours to train on an NVIDIA Titan X GPU. After pruning, the whole
                 network is retrained with 1/100 of the original network’s initial learning rate. It took 173 hours to
                 retrain the pruned AlexNet. Pruning is not used when iteratively prototyping the model, but rather
                 used for model reduction when the model is ready for deployment. Thus, the retraining time is less
                 of a concern. Table 1 shows that AlexNet can be pruned to 1/9 of its original size without impacting
                 accuracy, and the amount of computation can be reduced by 3×.

                 Table 4: For AlexNet, pruning reduces the number of weights by 9× and computation by 3×.

                  <<TABLE>>

                   Table 5: For VGG-16, pruning reduces the number of weights by 12× and computation by 5×.

                          <<TABLE>>

                 4.3 VGG-16 on ImageNet

                 With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16 [27],
                 on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional layers but still only three
                 fully-connected layers. Following a similar methodology, we aggressively pruned both convolutional
                 and fully-connected layers to realize a signiﬁcant reduction in the number of weights, shown in
                 Table 5. We used five iterations of pruning and retraining.
                 The VGG-16 results are, like those for AlexNet, very promising. The network as a whole has
                 been reduced to 7.5% of its original size (13× smaller). In particular, note that the two largest
                 fully-connected layers can each be pruned to less than 4% of their original size. This reduction is
                 critical for real time image processing, where there is little reuse of fully connected layers across
                 images (unlike batch processing during training).


                 5 Discussion

                 The trade-off curve between accuracy and number of parameters is shown in Figure 5. The more
                 parameters pruned away, the lower the accuracy. We experimented with L1 and L2 regularization, with
                 and without retraining, together with iterative pruning to give five trade-off lines. Comparing solid and
                 dashed lines, the importance of retraining is clear: without retraining, accuracy begins dropping much
                 sooner, with 1/3 of the original connections, rather than with 1/10 of the original connections.
                 It’s interesting to see that we have the “free lunch” of reducing the connections by 2× without losing
                 accuracy even without retraining, while with retraining we are able to reduce connections by 9×.

                                                  <<FIGURE>>

                 Figure 5: Trade-off curve for parameter reduction and loss in top-5 accuracy. L1 regularization
                 performs better than L2 at learning the connections without retraining, while L2 regularization
                 performs better than L1 at retraining. Iterative pruning gives the best result.


                        <<FIGURE>>

                       Figure 6: Pruning sensitivity for CONV layer (left) and FC layer (right) of AlexNet.


                 L1 regularization gives better accuracy than L2 directly after pruning (dotted blue and purple lines)
                 since it pushes more parameters closer to zero. However, comparing the yellow and green lines shows
                 that L2 outperforms L1 after retraining, since there is no beneﬁt to further pushing values towards
                 zero. One extension is to use L1 regularization for pruning and then L2 for retraining, but this did not
                 beat simply using L2 for both phases. Parameters from one mode do not adapt well to the other.
                 The biggest gain comes from iterative pruning (solid red line with solid circles). Here we take the
                 pruned and retrained network (solid green line with circles) and prune and retrain it again. The
                 leftmost dot on this curve corresponds to the point on the green line at 80% (5× pruning) pruned to
                 8×. There’s no accuracy loss at 9×. Not until 10× does the accuracy begin to drop sharply.
                 Two green points achieve slightly better accuracy than the original model. We believe this accuracy
                 improvement is due to pruning ﬁnding the right capacity of the network and hence reducing overﬁtting.
                 Both CONV and FC layers can be pruned, but with different sensitivity. Figure 6 shows the sensitivity
                 of each layer to network pruning. The ﬁgure shows how accuracy drops as parameters are pruned on
                 a layer-by-layer basis. The CONV layers (on the left) are more sensitive to pruning than the fully
                 connected layers (on the right). The ﬁrst convolutional layer, which interacts with the input image
                 directly, is most sensitive to pruning. We suspect this sensitivity is due to the input layer having only
                 3 channels and thus less redundancy than the other convolutional layers. We used the sensitivity
                 results to ﬁnd each layer’s threshold: for example, the smallest threshold was applied to the most
                 sensitive layer, which is the ﬁrst convolutional layer.
                 Storing the pruned layers as sparse matrices has a storage overhead of only 15.6%. Storing relative
                 rather than absolute indices reduces the space taken by the FC layer indices to 5 bits. Similarly,
                 CONV layer indices can be represented with only 8 bits.
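
                 The benefit of relative indices can be illustrated with a small delta-encoding sketch. It
                 stores the gap to the previous non-zero weight instead of its absolute position; how gaps that
                 exceed the bit budget are handled is not described in the text and is left out here. As a
                 consistency check, a 5-bit index next to a 32-bit weight is 5/32, about 15.6%, which matches
                 the quoted overhead, although the text does not spell out that derivation. All function names
                 are hypothetical.

```python
def to_relative(positions):
    """Replace sorted absolute positions of non-zero weights by gaps to the previous one."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def to_absolute(gaps):
    """Invert to_relative with a running sum."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def bits_needed(values):
    """Bits required to store the largest value in the list."""
    return max(v.bit_length() for v in values)


positions = [3, 7, 20, 21]
gaps = to_relative(positions)
print(gaps, bits_needed(gaps), bits_needed(positions))
```

                 In a large layer the absolute positions grow with the layer size while the gaps stay bounded
                 by the local sparsity, so the saving is far larger than in this toy example.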

                 Table 6: Comparison with other model reduction methods on AlexNet. Data-free pruning [28]
                 saved only 1.5× parameters with much loss of accuracy. Deep Fried Convnets [29] worked on fully
                 connected layers only and reduced the parameters by less than 4×. [30] reduced the parameters by
                 4× with inferior accuracy. Naively cutting the layer size saves parameters but suffers from 4% loss
                 of accuracy. [12] exploited the linear structure of convnets and compressed each layer individually,
                 where model compression on a single layer incurred 0.9% accuracy penalty with biclustering + SVD.

                                             <<FIGURE>> 

                 Figure 7: Weight distribution before and after parameter pruning. The right figure has 10× smaller
                 scale.

                 After pruning, the storage requirements of AlexNet and VGGNet are small enough that all weights
                 can be stored on chip, instead of off-chip DRAM which takes orders of magnitude more energy to
                 access (Table 1). We are targeting our pruning method for ﬁxed-function hardware specialized for
                 sparse DNN, given the limitation of general purpose hardware on sparse computation.
                 Figure 7 shows histograms of weight distribution before (left) and after (right) pruning. The weights
                 are from the first fully connected layer of AlexNet. The two panels have different y-axis scales.
                 The original distribution of weights is centered on zero with tails dropping off quickly. Almost all
                 parameters are between <<FORMULA>>. After pruning the large center region is removed. The
                 network parameters adjust themselves during the retraining phase. The result is that the parameters
                 form a bimodal distribution and become more spread across the x-axis, between <<FORMULA>>.

                 6 Conclusion

                 We have presented a method to improve the energy efﬁciency and storage of neural networks without
                 affecting accuracy by ﬁnding the right connections. Our method, motivated in part by how learning
                 works in the mammalian brain, operates by learning which connections are important, pruning
                 the unimportant connections, and then retraining the remaining sparse network. We highlight our
                 experiments on AlexNet and VGGNet on ImageNet, showing that both fully connected layers and
                 convolutional layers can be pruned, reducing the number of connections by 9× to 13× without loss of
                 accuracy. This leads to smaller memory capacity and bandwidth requirements for real-time image
                 processing, making these networks easier to deploy on mobile systems.

                                            References  

                    [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional
                        neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
                    [2] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other
                        neural network architectures. Neural Networks, 18(5):602–610, 2005.
                    [3] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.
                        Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.
                    [4] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
                        document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
                    [5] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to
                        human-level performance in face verification. In CVPR, pages 1701–1708. IEEE, 2014.
                    [6] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with
                        cots hpc systems. In 30th ICML, pages 1337–1345, 2013.
                    [7] Mark Horowitz. Energy table for 45nm process, Stanford VLSI wiki.
                    [8] JP Rauschecker. Neuronal mechanisms of developmental plasticity in the cat’s visual system. Human
                        neurobiology, 3(2):109–114, 1983.
                    [9] Christopher A Walsh. Peter huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
                   [10] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning.
                        In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
                   [11] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus.
                        In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
                   [12] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure
                        within convolutional networks for efficient evaluation. In NIPS, pages 1269–1277, 2014.
                   [13] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks
                        using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
                   [14] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with
                        pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
                   [15] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
                   [16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
                        Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint
                        arXiv:1409.4842, 2014.
                   [17] Stephen José Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with
                        back-propagation. In Advances in neural information processing systems, pages 177–185, 1989.
                   [18] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information
                        Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
                   [19] Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon.
                        Advances in neural information processing systems, pages 164–164, 1993.
                   [20] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural
                        networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.
                   [21] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash
                        kernels for structured data. The Journal of Machine Learning Research, 10:2615–2637, 2009.
                   [22] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing
                        for large scale multitask learning. In ICML, pages 1113–1120. ACM, 2009.
                   [23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout:
                        A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
                   [24] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural
                        networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
                   [25] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient
                        descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
                   [26] Yangqing Jia, et al. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint
                        arXiv:1408.5093, 2014.
                   [27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
                        recognition. CoRR, abs/1409.1556, 2014.
                   [28] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv
                        preprint arXiv:1507.06149, 2015.
                   [29] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang.
                        Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.
                   [30] Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint
                        arXiv:1412.1442, 2014.
<|endoftext|>


<|startoftext|>
Learning Efficient Convolutional Networks through Network Slimming 

Abstract 

The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.

1. Introduction 

In recent years, convolutional neural networks (CNNs) have become the dominant approach for a variety of computer vision tasks, e.g., image classification [22], object detection [8], and semantic segmentation [26]. Large-scale datasets, high-end modern GPUs and new network architectures allow the development of unprecedentedly large CNN models. For instance, from AlexNet [22], VGGNet [31] and GoogleNet [34] to ResNets [14], the ImageNet Classification Challenge winner models have evolved from 8 layers to more than 100 layers.
This work was done when Zhuang Liu and Zhiqiang Shen were interns at Intel Labs China. Jianguo Li is the corresponding author. 
However, larger CNNs, although with stronger representation power, are more resource-hungry. For instance, a 152-layer ResNet [14] has more than 60 million parameters and requires more than 20 Giga floating-point operations (FLOPs) when running inference on an image with resolution 224×224. This is unlikely to be affordable on resource-constrained platforms such as mobile devices, wearables or Internet of Things (IoT) devices.
The deployment of CNNs in real world applications is mostly constrained by 1) Model size: CNNs' strong representation power comes from their millions of trainable parameters. Those parameters, along with network structure information, need to be stored on disk and loaded into memory during inference time. As an example, storing a typical CNN trained on ImageNet consumes more than 300MB of space, which is a big resource burden to embedded devices. 2) Run-time memory: During inference time, the intermediate activations/responses of CNNs could even take more memory space than storing the model parameters, even with batch size 1. This is not a problem for high-end GPUs, but unaffordable for many applications with low computational power. 3) Number of computing operations: The convolution operations are computationally intensive on high resolution images. A large CNN may take several minutes to process one single image on a mobile device, making it unrealistic to adopt for real applications.
Many works have been proposed to compress large CNNs or directly learn more efficient CNN models for fast inference. These include low-rank approximation [7], network quantization [3, 12] and binarization [28, 6], weight pruning [12], dynamic inference [16], etc. However, most of these methods can only address one or two challenges mentioned above. Moreover, some of the techniques require specially designed software/hardware accelerators for execution speedup [28, 6, 12].
Another direction to reduce the resource consumption of large CNNs is to sparsify the network. Sparsity can be imposed on different levels of structure [2, 37, 35, 29, 25], which yields considerable model-size compression and inference speedup. However, these approaches generally require special software/hardware accelerators to harvest the gain in memory or time savings, though this is easier than with the non-structured sparse weight matrices in [12].

<<FIGURE>>

Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then fine-tuned to achieve accuracy comparable to (or even higher than) that of the normally trained full network.
In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, so it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates the channel-level pruning in the following step. The additional regularization term rarely hurts the performance. In fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated by the subsequent fine-tuning of the pruned network. After pruning, the resulting narrower network is much more compact in terms of model size, run-time memory, and computing operations compared to the initial wide network. The above process can be repeated several times, yielding a multi-pass network slimming scheme which leads to an even more compact network.
Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20× model-size compression and 5× reduction in computing operations compared with the original ones, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storage format or sparse computing operations.

2. Related Work 

In this section, we discuss related work from five aspects. 
Low-rank Decomposition approximates the weight matrices in neural networks with low-rank matrices using techniques like Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yielding 3× model-size compression, but without notable speed acceleration, since the computing operations in a CNN mainly come from the convolutional layers.
Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups, and within each group the weight value is shared. In this way only the shared weights and hash indices need to be stored, so a large amount of storage space can be saved. [12] uses an improved quantization technique in a deep compression pipeline and achieves 35× to 49× compression rates on AlexNet and VGGNet. However, these techniques can save neither run-time memory nor inference time, since during inference the shared weights need to be restored to their original positions.
[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {-1, 1} or {-1, 0, 1}). This yields a large amount of model-size saving, and significant speedup could also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation method usually comes with a moderate accuracy loss. 
Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neural networks. The resulting network's weights are mostly zeros, so the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) instead of the weights.
In [12], there is no guidance for sparsity during training. [32] overcomes this limitation by explicitly imposing a sparse constraint over each weight with additional gate variables, and achieves high compression rates by pruning connections with zero gate values. This method achieves a better compression rate than [12], but suffers from the same drawback.

Structured Pruning / Sparsifying. Recently, [23] proposes to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by randomly deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to a smoother channel pruning process and little accuracy loss.
[37] imposes neuron-level sparsity during training so that some neurons can be pruned to obtain compact networks. [35] proposes a Structured Sparsity Learning (SSL) method to sparsify different levels of structure (e.g. filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, so the optimization objective is much simpler.
Since these methods prune or sparsify part of the network structures (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries (e.g. for sparse computing operations) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits.
Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces submodular/supermodular optimization for network architecture search with a given resource budget. Some recent works [38, 1] propose to learn neural architectures automatically with reinforcement learning. The search space of these methods is extremely large, so one needs to train hundreds of models to distinguish good from bad ones. Network slimming can also be treated as an approach for architecture learning, although the choices are limited to the width of each layer. However, in contrast to the aforementioned methods, network slimming learns the network architecture through only a single training process, which is in line with our goal of efficiency.

3. Network slimming

We aim to provide a simple scheme to achieve channel-level sparsity in deep CNNs. In this section, we first discuss the advantages and challenges of channel-level sparsity, and introduce how we leverage the scaling layers in batch normalization to effectively identify and prune unimportant channels in the network. 
Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at different levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained (e.g., weight-level) sparsity gives the highest flexibility and generality, which leads to a higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. In contrast, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, but it is less flexible, as whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNN or fully-connected network (treating each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, on which inference runs efficiently on conventional CNN platforms.
Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near-zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of 10% in the number of parameters without suffering from accuracy loss. [35] addresses this problem by adding sparsity regularization to the training objective. Specifically, they adopt group LASSO to push all the filter weights corresponding to the same channel towards zero simultaneously during training. However, this approach requires computing the gradients of the additional regularization term with respect to all the filter weights, which is non-trivial. We introduce a simple idea to address the above challenges, and the details are presented below.
Scaling Factors and Sparsity-induced Penalty. Our idea is to introduce a scaling factor γ for each channel, which is multiplied with the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by

$$L = \sum_{(x, y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$ (1)

where (x, y) denote the training input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(·) is a sparsity-induced penalty on the scaling factors, and λ balances the two terms. In our experiments, we choose g(s) = |s|, which is known as the L1-norm and is widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the non-smooth L1 penalty term. An alternative option is to replace the L1 penalty with the smooth-L1 penalty [30] to avoid using the subgradient at the non-smooth point.

<<FIGURE>>

Figure 2: Flow-chart of the network slimming procedure. The dotted line is for the multi-pass/iterative scheme.
As pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel, we can directly obtain a narrow network (see Figure 1) without resorting to any special sparse computation packages. The scaling factors act as the agents for channel selection. As they are jointly optimized with the network weights, the network can automatically identify insignificant channels, which can be safely removed without greatly affecting the generalization performance.
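
The sparsity term of Equation 1 and its subgradient can be sketched in a few lines of plain Python. This is only an illustration, not the authors' Torch implementation: the function names, the choice of 0 as the subgradient at zero, and the plain momentum-free update are assumptions made for the example.

```python
def l1_penalty(gammas, lam):
    """Sparsity term of Equation 1: lam * sum of |gamma| over all scaling factors."""
    return lam * sum(abs(g) for g in gammas)


def l1_subgradient(gammas, lam):
    """Subgradient of the penalty: lam * sign(gamma), taking 0 at gamma == 0 (assumed choice)."""
    return [lam * ((g > 0) - (g < 0)) for g in gammas]


def sgd_step(gammas, loss_grads, lam, lr):
    """One plain subgradient step on the scaling factors (hypothetical helper, no momentum)."""
    penalty_grads = l1_subgradient(gammas, lam)
    return [g - lr * (dg + pg) for g, dg, pg in zip(gammas, loss_grads, penalty_grads)]


gammas = [0.5, -0.25, 0.0]
print(l1_penalty(gammas, lam=1e-4))
print(sgd_step(gammas, loss_grads=[0.0, 0.0, 0.0], lam=1e-4, lr=0.1))
```

With a zero loss gradient, each step moves every non-zero factor toward zero by `lr * lam`, which is the shrinking pressure the penalty is meant to provide.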
Leveraging the Scaling Factors in BN Layers. Batch normalization [19] has been adopted by most modern CNNs as a standard approach to achieve fast convergence and better generalization performance. The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. In particular, a BN layer normalizes the internal activations using mini-batch statistics. Let z_in and z_out be the input and output of a BN layer, and let B denote the current mini-batch; the BN layer performs the following transformation:

                $$\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}; \quad z_{out} = \gamma \hat{z} + \beta$$

where μ_B and σ_B are the mean and standard deviation values of input activations over B, ε is a small constant added for numerical stability, and γ and β are trainable affine transformation parameters (scale and shift), which provide the possibility of linearly transforming normalized activations back to any scale.
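
As a minimal sketch, the transformation for a single channel can be written in plain Python as follows. The function name and the default `eps=1e-5` are assumptions for illustration; the text does not specify a value.

```python
import math


def batch_norm_channel(z_in, gamma, beta, eps=1e-5):
    """Apply the BN transformation to one channel's activations over a mini-batch."""
    n = len(z_in)
    mean = sum(z_in) / n                           # mini-batch mean
    var = sum((z - mean) ** 2 for z in z_in) / n   # mini-batch (biased) variance
    return [gamma * (z - mean) / math.sqrt(var + eps) + beta for z in z_in]


activations = [1.0, 2.0, 3.0]
print(batch_norm_channel(activations, gamma=1.0, beta=0.0))
# With gamma = 0 the channel output no longer depends on its input.
print(batch_norm_channel(activations, gamma=0.0, beta=0.5))
```

The second call shows why γ is a natural importance score: when it is zero, the channel emits the constant β regardless of its input.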
It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. This has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way to learn meaningful scaling factors for channel pruning. 1) If we add scaling layers to a CNN without BN layers, the values of the scaling factors are not meaningful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations. One can obtain the same results by decreasing the scaling factor values while amplifying the weights in the convolution layers. 2) If we insert a scaling layer before a BN layer, the scaling effect of the scaling layer will be completely canceled by the normalization process in BN. 3) If we insert a scaling layer after a BN layer, there are two consecutive scaling factors for each channel.
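
The first two cases can be checked numerically. The sketch below uses toy values and hypothetical helper names: a dot product stands in for a convolution, and `normalize` is the normalization step of BN with an assumed `eps`.

```python
import math


def normalize(z, eps=1e-5):
    """Normalization step of BN (zero mean, unit variance over the mini-batch)."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return [(v - mean) / math.sqrt(var + eps) for v in z]


def conv_then_scale(x, weights, scale):
    """A 1x1 'convolution' (dot product) followed by a scaling layer, without BN."""
    return scale * sum(w * xi for w, xi in zip(weights, x))


x = [1.0, -2.0, 0.5]
w = [0.3, 0.1, -0.4]

# Case 1: shrinking the scaling factor by 10 and amplifying the weights by 10
# gives the same output, so the factor alone says nothing about importance.
a = conv_then_scale(x, w, scale=1.0)
b = conv_then_scale(x, [10.0 * wi for wi in w], scale=0.1)
print(abs(a - b) < 1e-9)

# Case 2: a scaling layer placed before BN is undone by the normalization
# (exactly so when eps is 0, approximately for a small eps).
z = [1.0, 2.0, 4.0]
scaled = [3.0 * v for v in z]
print(max(abs(p - q) for p, q in zip(normalize(z), normalize(scaled))) < 1e-4)
```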
Channel Pruning and Fine-tuning. After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero (see Figure 1). Then we can prune channels with near-zero scaling factors, by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune the 70% of channels with the lowest scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with fewer parameters, less run-time memory, and fewer computing operations.
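
A minimal sketch of the global percentile threshold, assuming the scaling factors are available as one list per layer. The function name and the handling of ties at the threshold are assumptions of this example.

```python
def channel_masks(gammas_per_layer, prune_ratio):
    """Return per-layer keep masks using one global threshold on |gamma|.

    The threshold is the value at the requested percentile of all scaling
    factors pooled across layers; ties at the threshold are pruned (an
    assumption of this sketch).
    """
    pooled = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    num_prune = int(len(pooled) * prune_ratio)
    if num_prune == 0:
        return [[True] * len(layer) for layer in gammas_per_layer]
    threshold = pooled[num_prune - 1]
    return [[abs(g) > threshold for g in layer] for layer in gammas_per_layer]


layers = [[0.9, 0.01, 0.5], [0.02, 0.7, 0.03, 0.8], [0.04, 0.6, 0.05]]
masks = channel_masks(layers, prune_ratio=0.5)
print([sum(m) for m in masks])  # channels kept in each layer
```

Because the threshold is global, layers can lose different fractions of their channels, which is how the method learns a per-layer width.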
Pruning may temporarily lead to some accuracy loss when the pruning ratio is high, but this can be largely compensated for by the subsequent fine-tuning of the pruned network. In our experiments, the fine-tuned narrow network can even achieve higher accuracy than the original unpruned network in many cases.
Multi-pass Scheme. We can also extend the proposed method from a single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. Specifically, a network slimming procedure results in a narrow network, on which we can again apply the whole training procedure to learn an even more compact model. This is illustrated by the dotted line in Figure 2. Experimental results show that this multi-pass scheme can lead to even better results in terms of compression rate.
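
The multi-pass flow can be sketched as a simple loop. The three stage functions are hypothetical, caller-supplied callables, and the toy "model" below is only a list of layer widths used to make the example runnable.

```python
def network_slimming(model, num_passes, train_sparse, prune, fine_tune):
    """Repeat train-with-sparsity -> prune -> fine-tune for `num_passes` passes.

    `train_sparse`, `prune`, and `fine_tune` are caller-supplied callables
    (hypothetical interfaces for this sketch); each takes and returns a model.
    """
    for _ in range(num_passes):
        model = train_sparse(model)
        model = prune(model)
        model = fine_tune(model)
    return model


# Toy stand-in: the "model" is just a list of layer widths, and pruning
# halves every layer while training and fine-tuning leave widths unchanged.
widths = network_slimming(
    [64, 128, 256],
    num_passes=2,
    train_sparse=lambda m: m,
    prune=lambda m: [max(1, w // 2) for w in m],
    fine_tune=lambda m: m,
)
print(widths)
```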
Handling Cross Layer Connections and Pre-activation Structure. The network slimming process introduced above can be directly applied to most plain CNN architectures such as AlexNet [22] and VGGNet [31]. However, some adaptations are required when it is applied to modern networks with cross layer connections and the pre-activation design, such as ResNet [15] and DenseNet [17]. For these networks, the output of a layer may be treated as the input of multiple subsequent layers, in which a BN layer is placed before the convolutional layer. In this case, the sparsity is achieved at the incoming end of a layer, i.e., the layer selectively uses a subset of the channels it receives. To harvest the parameter and computation savings at test time, we need to place a channel selection layer to mask out the insignificant channels we have identified.
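
One way to picture a channel selection layer is as an index operation placed in front of the consuming layer. The class below is a hypothetical sketch on plain lists, not the authors' implementation.

```python
class ChannelSelection:
    """Pass through only the channels marked as kept (hypothetical helper)."""

    def __init__(self, keep_mask):
        # Indices of the channels that the following layer actually uses.
        self.indices = [i for i, keep in enumerate(keep_mask) if keep]

    def __call__(self, channels):
        # `channels` is a list with one feature map per channel.
        return [channels[i] for i in self.indices]


select = ChannelSelection([True, False, True, False])
feature_maps = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(select(feature_maps))
```

The shared feature maps stay intact for other consumers, while each layer sees only the subset it selected.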

4. Experiments 

We empirically demonstrate the effectiveness of network slimming on several benchmark datasets. We implement our method based on the publicly available Torch [5] implementation for ResNets by [10]. The code is available at https://github.com/liuzhuang13/slimming.

<<TABLE>>

Table 1: Results on CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column 1, "60% pruned" denotes the fine-tuned model with 60% of channels pruned from the model trained with sparsity, etc. The pruned ratios of parameters and FLOPs are also shown in columns 4 and 6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy could typically be maintained with ≥ 60% channels pruned.

4.1. Datasets 
CIFAR. The two CIFAR datasets [21] consist of natural images with resolution 32×32. CIFAR-10 is drawn from 10 classes and CIFAR-100 from 100 classes. The train and test sets contain 50,000 and 10,000 images respectively. On CIFAR-10, a validation set of 5,000 images is split from the training set for the search of λ (in Equation 1) on each model. We report the final test errors after training or fine-tuning on all training images. A standard data augmentation scheme (shifting/mirroring) [14, 18, 24] is adopted. The input data is normalized using channel means and standard deviations. We also compare our method with [23] on CIFAR datasets.
SVHN. The Street View House Number (SVHN) dataset [27] consists of 32×32 colored digit images. Following common practice [9, 18, 24] we use all the 604,388 training images, from which we split a validation set of 6,000 images for model selection during training. The test set contains 26,032 images. During training, we select the model with the lowest validation error as the model to be pruned (or the baseline model). We also report the test errors of the models with lowest validation errors during fine-tuning.
ImageNet. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. We adopt the data augmentation scheme as in [10]. We report the single-center-crop validation error of the final model. 
MNIST. MNIST is a handwritten digit dataset containing 60,000 training images and 10,000 test images. To test the effectiveness of our method on a fully-connected network (treating each neuron as a channel with 1×1 spatial size), we compare our method with [35] on this dataset.

4.2. Network Models 
On the CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. The VGGNet was originally designed for ImageNet classification. For our experiments, a variation of the original VGGNet for the CIFAR datasets is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).
On the ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) VGG-A network [31] model with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1×1 spatial size.
On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].

4.3. Training, Pruning and Fine-tuning 
Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On the CIFAR and SVHN datasets we train using mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet and MNIST datasets, we train our models for 60 and 30 epochs respectively, with a batch size of 256 and an initial learning rate of 0.1, which is divided by 10 after 1/3 and 2/3 of the training epochs. We use a weight decay of 10⁻⁴ and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to 1) from [10].
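
The step schedule for the CIFAR setting can be sketched as below. The function name, the 0-based epoch counting, and applying the drop from the milestone epoch onward are assumptions of this example.

```python
def learning_rate(epoch, total_epochs, base_lr=0.1, milestones=(0.5, 0.75)):
    """Step schedule: divide the rate by 10 at each milestone fraction of training.

    Epochs are counted from 0 and the drop applies from the milestone epoch
    onward (both are assumptions of this sketch).
    """
    drops = sum(1 for m in milestones if epoch >= int(total_epochs * m))
    return base_lr / (10 ** drops)


# CIFAR setting from the text: 160 epochs, drops at epochs 80 and 120.
for epoch in (0, 79, 80, 119, 120, 159):
    print(epoch, learning_rate(epoch, total_epochs=160))
```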
Training with Sparsity. For the CIFAR and SVHN datasets, when training with channel sparsity regularization, the hyperparameter λ, which controls the tradeoff between empirical loss and sparsity, is determined by a grid search over 10⁻³, 10⁻⁴, 10⁻⁵ on the CIFAR-10 validation set. For VGGNet we choose λ = 10⁻⁴ and for ResNet and DenseNet λ = 10⁻⁵. For VGG-A on ImageNet, we set λ = 10⁻⁵. All other settings are kept the same as in normal training.
Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of channels are pruned. The pruning process is implemented by building a new narrower model and copying the corresponding weights from the model trained with sparsity.
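
Copying weights into the narrower model amounts to slicing each layer's weight tensor along its input and output channel axes. The sketch below uses nested lists and an assumed `weight[out][in]` layout; the function name is hypothetical.

```python
def copy_pruned_weights(weight, keep_out, keep_in):
    """Copy the surviving slices of a conv weight into a narrower layer.

    `weight[o][i]` holds the kernel connecting input channel `i` to output
    channel `o` (the layout is an assumption for this sketch). `keep_out` and
    `keep_in` list the channel indices that survive pruning in this layer
    and in the previous layer.
    """
    return [[weight[o][i] for i in keep_in] for o in keep_out]


# 3 output channels x 4 input channels; each "kernel" is a single number here.
weight = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
narrow = copy_pruned_weights(weight, keep_out=[0, 2], keep_in=[1, 3])
print(narrow)
```

The result is an ordinary dense layer of smaller shape, which is why no sparse storage format or sparse kernels are needed afterwards.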
Fine-tuning. After pruning we obtain a narrower and more compact model, which is then fine-tuned. On the CIFAR, SVHN and MNIST datasets, the fine-tuning uses the same optimization setting as in training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10⁻³ for only 5 epochs.

<<FIGURE>>  

Figure 3: Comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are parameter and FLOP ratios between pruned and original models. 

4.4. Results 
CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.
Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥ 60% channels pruned while still maintaining similar accuracy to the baseline. The parameter saving can be up to 10×. The FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture that this is because its "bottleneck" structure already functions as a channel selector. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.
Regularization Effect. From Table 1, we can observe that, on ResNet and DenseNet, typically when 40% of channels are pruned, the fine-tuned network can achieve a lower test error than the original models. For example, DenseNet-40 with 40% of channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.

<<TABLE>> 

Table 3: Results on MNIST. 
ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve the savings with no accuracy loss on the 1000-class ImageNet dataset, where other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.
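
The counts quoted above can be checked with a few lines of arithmetic. The variable names are only for illustration; the numbers are those given in the text.

```python
# Channel and neuron counts quoted in the text for VGG-A at 50% pruning.
conv_pruned, conv_total = 378, 2752
fc_pruned, fc_total = 5094, 8192

overall = (conv_pruned + fc_pruned) / (conv_total + fc_total)
conv_share = conv_pruned / conv_total
fc_share = fc_pruned / fc_total

print(f"overall pruned: {overall:.1%}")
print(f"conv channels pruned: {conv_share:.1%}")
print(f"FC neurons pruned: {fc_share:.1%}")
```

The two groups together account for exactly half of all channels, yet the global threshold removes roughly 14% of the convolutional channels and roughly 62% of the fully-connected neurons, which is consistent with a large parameter saving and a comparatively small FLOP saving.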
MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well in pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, so we prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.
We provide some additional experimental results in the supplementary materials, including (1) detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; and (3) comparison with a previous channel pruning method [23].

4.5. Results for Multi-pass Scheme 
We employ the multi-pass scheme on CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer will completely destroy the models. Thus, besides setting the percentile threshold as 50%, we also put a constraint that at each layer, at most 50% of channels can be pruned. 
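
The text does not say how the per-layer constraint is implemented. One possible reading, sketched below with hypothetical names, is to prune the channels under the global threshold in each layer, smallest factors first, and stop once the layer's cap is reached.

```python
def capped_masks(gammas_per_layer, threshold, max_layer_ratio=0.5):
    """Keep masks from a global threshold, pruning at most a fixed share per layer."""
    masks = []
    for layer in gammas_per_layer:
        max_prune = int(len(layer) * max_layer_ratio)
        # Candidates at or below the global threshold, smallest |gamma| first.
        candidates = sorted(
            (i for i, g in enumerate(layer) if abs(g) <= threshold),
            key=lambda i: abs(layer[i]),
        )
        pruned = set(candidates[:max_prune])
        masks.append([i not in pruned for i in range(len(layer))])
    return masks


layers = [[0.01, 0.02, 0.03, 0.9], [0.5, 0.6]]
print(capped_masks(layers, threshold=0.05))
```

In the first layer three factors fall under the threshold, but only two channels are removed, so no layer can vanish entirely.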
The test errors of models in each iteration are shown in Table 4. As the pruning process goes on, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves 20× parameter reduction and 5× FLOP reduction, while still achieving lower test error. On CIFAR-100, after iteration 3, the test error begins to increase. This is possibly because it contains more classes than CIFAR-10, so pruning channels too aggressively will inevitably hurt the performance. However, we can still prune nearly 90% of the parameters and nearly 70% of the FLOPs without notable accuracy loss.

<<TABLE>>

Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. The "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity, and the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.

5. Analysis 

There are two crucial hyper-parameters in network slimming, the pruned percentage t and the coefficient of the sparsity regularization term λ (see Equation 1). In this section, we analyze their effects in more detail.
Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ = 10⁻⁵ on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.
From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate for the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on channel scaling factors. 

<<FIGURE>>

Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). With the increase of λ, scaling factors become sparser. 

<<FIGURE>>

Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ = 10^-5. 
Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter λ in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network with different λ values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset. 
It can be observed that with the increase of λ, the scaling factors are more and more concentrated near zero. When λ = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as a feature selection happening in intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process by a heatmap. Figure 6 shows the magnitude of scaling factors from one layer in VGGNet, along the training process. Each channel starts with equal weights; as the training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker). 

<<FIGURE>>

Figure 6: Visualization of the change in scale of channel scaling factors along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the selected channels, the dark lines indicate channels that can be pruned. 
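
To make the role of λ concrete, the following sketch computes the L1 sparsity term on a set of scaling factors, its subgradient, and the fraction of factors that lie near zero. It is an illustration under stated assumptions: the helper names, the `eps` cut-off and the toy values are hypothetical and not taken from the experiments above.

```python
import numpy as np

def l1_penalty(gammas, lam):
    """Sparsity term lam * sum(|gamma|) over all scaling factors."""
    return lam * sum(np.abs(g).sum() for g in gammas)

def l1_subgradient(gamma, lam):
    """Subgradient of lam * |gamma|; np.sign returns 0 at gamma == 0."""
    return lam * np.sign(gamma)

def fraction_near_zero(gammas, eps=1e-2):
    """Share of scaling factors whose magnitude is below eps (assumed cut-off)."""
    flat = np.concatenate([np.abs(g).ravel() for g in gammas])
    return float((flat < eps).mean())

# Hypothetical scaling factors of two batch-normalization layers.
gammas = [np.array([0.5, -0.001, 0.0]), np.array([0.2, 0.004])]
penalty = l1_penalty(gammas, lam=1e-4)
near_zero = fraction_near_zero(gammas)
```

A larger λ scales every subgradient up by the same factor, which is why the histograms in Figure 4 shift toward zero as λ grows.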

6. Conclusion 

We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20×) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory, and computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for efficient inference. 

Acknowledgements. 
Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No.20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008/DFG TRR-169. 

References 
[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017. 
[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017. 
[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015. 
[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch. 
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011. 
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016. 
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014. 
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014. 
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013. 
[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar-torch. 
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016. 
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015. 
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015. 
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016. 
[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017. 
[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017. 
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016. 
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and super-modularity. arXiv preprint arXiv:1609.00074, 2016. 
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. In Tech Report, 2009. 
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012. 
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016. 
[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014. 
[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015. 
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. 
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016. 
[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286–297, 2007. 
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 
[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016. 
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013. 
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1–9, 2015. 
[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016. 
[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch. 
[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016. 
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017. 
<|endoftext|>


<|startoftext|>
                       Learning Structured Sparsity in Deep Neural Networks

                   Wei Wen            Chunpeng Wu          Yandan Wang
                      University of Pittsburgh     University of Pittsburgh     University of Pittsburgh
                       wew57@pitt.edu        chw127@pitt.edu        yaw46@pitt.edu

                           Yiran Chen                     Hai Li
                             University of Pittsburgh            University of Pittsburgh
                              yic52@pitt.edu               hal66@pitt.edu

                                               Abstract

                       High demand for computation resources severely hinders deployment of large-scale
                       Deep Neural Networks (DNN) in resource constrained devices. In this work, we
                       propose a Structured Sparsity Learning (SSL) method to regularize the structures
                       (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1)
                       learn a compact structure from a bigger DNN to reduce computation cost; (2)
                       obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate
                       the DNN's evaluation. Experimental results show that SSL achieves on average
                       5.1× and 3.1× speedups of convolutional layer computation of AlexNet on
                       CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about
                       twice those of non-structured sparsity; (3) regularize the DNN structure to
                       improve classification accuracy. The results show that for CIFAR-10, regularization
                       on layer depth can reduce a 20-layer Deep Residual Network (ResNet) to
                       18 layers while improving the accuracy from 91.25% to 92.60%, which is still
                       slightly higher than that of the original ResNet with 32 layers. For AlexNet, structure
                       regularization by SSL also reduces the error by about 1%. Our source code can be
                       found at https://github.com/wenwei202/caffe/tree/scnn


                 1 Introduction

                 Deep neural networks (DNN), especially deep convolutional neural networks (CNN), have achieved
                 remarkable success in visual tasks [1][2][3][4][5] by leveraging large-scale networks learning from
                 a huge volume of data. Deployment of such big models, however, is computation-intensive and
                 memory-intensive. To reduce computation cost, many studies have been performed to compress the scale of
                 DNN, including sparsity regularization [6], connection pruning [7][8] and low rank approximation
                 [9][10][11][12][13]. Sparsity regularization and connection pruning approaches, however, often produce
                 non-structured random connectivity in DNN and thus irregular memory access that adversely
                 impacts practical acceleration in hardware platforms. Figure 1 depicts the practical speedup of each
                 layer of an AlexNet, which is non-structurally sparsified by l1-norm. Compared to the original model,
                 the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality
                 associated with the scattered weight distribution, the achieved speedups are either very limited or
                 negative even when the actual sparsity is high, say, >95%. We define sparsity as the ratio of zeros in this
                 paper. In recently proposed low rank approximation approaches, the DNN is trained first and then
                 each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally,
                 fine-tuning is performed to restore the model accuracy. Low rank approximation is able to achieve
                 practical speedups because it coordinates model parameters in dense matrices and avoids the locality
                 problem of non-structured sparsity regularization. However, low rank approximation can only obtain
                 
                 <<FIGURE>>

                 Figure 1: Evaluation speedups of AlexNet on GPU platforms and the sparsity. conv1 refers to
                 convolutional layer 1, and so forth. Baseline is profiled by GEMM of cuBLAS. The sparse matrices
                 are stored in the format of Compressed Sparse Row (CSR) and accelerated by cuSPARSE.


                 the compact structure within each layer, and the structures of the layers are fixed during fine-tuning
                 such that costly reiterations of decomposing and fine-tuning are required to find an optimal weight
                 approximation for performance speedup and accuracy retention.
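
The definition of sparsity used here (the ratio of zeros) and the Compressed Sparse Row layout mentioned in Figure 1 can be illustrated with a small sketch. The helper names and the toy matrix are assumptions for illustration; the profiled experiments use cuSPARSE, not this code.

```python
import numpy as np

def sparsity(weights):
    """Ratio of zero entries, the definition of sparsity used in the text."""
    return float(np.mean(weights == 0))

def to_csr(matrix):
    """Build the three CSR arrays: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        nz = np.nonzero(row)[0]
        values.extend(row[nz].tolist())
        col_idx.extend(nz.tolist())
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

# Toy weight matrix (hypothetical): nonzeros are scattered, so reading them
# requires an indirect lookup through col_idx for every stored value.
w = np.array([[0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 3.0]])
ratio = sparsity(w)
values, col_idx, row_ptr = to_csr(w)
```

The indirection through `col_idx` is one way to see the poor data locality that limits the speedup of non-structured sparsity.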
                 Inspired by the facts that (1) there is redundancy across filters and channels [11]; (2) shapes of
                 filters are usually fixed as cuboid but enabling arbitrary shapes can potentially eliminate unnecessary
                 computation imposed by this fixation; and (3) depth of the network is critical for classification
                 but deeper layers cannot always guarantee a lower error because of the exploding gradients and
                 degradation problem [5], we propose the Structured Sparsity Learning (SSL) method to directly learn
                 a compressed structure of deep CNNs by group Lasso regularization during the training. SSL is a
                 generic regularization to adaptively adjust multiple structures in DNN, including structures of filters,
                 channels, and filter shapes within each layer, and structure of depth beyond the layers. SSL combines
                 structure regularization (on DNN for classification accuracy) with locality optimization (on memory
                 access for computation efficiency), offering not only well-regularized big models with improved
                 accuracy but also greatly accelerated computation (e.g., 5.1× on CPU and 3.1× on GPU for AlexNet).

                 2 Related works

                 Connection pruning and weight sparsifying. Han et al. [7][8] reduced the number of parameters of
                 AlexNet by 9× and VGG-16 by 13× using connection pruning. Since most reduction is achieved
                 on fully-connected layers, the authors obtained 3× to 4× layer-wise speedup for fully-connected
                 layers. However, no practical speedups of convolutional layers are observed because of the issue
                 shown in Figure 1. As convolution is the computational bottleneck and many new DNNs use fewer
                 fully-connected layers, e.g., only 3.99% of the parameters of ResNet-152 in [5] are from fully-connected
                 layers, compression and acceleration on convolutional layers become essential. Liu et al. [6] achieved
                 >90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue
                 shown in Figure 1 by hardcoding the sparse weights into the program, achieving layer-wise 4.59×
                 speedup on a CPU. In this work, we also focus on convolutional layers. Compared to the above
                 techniques, our SSL method can coordinate sparse weights in adjacent memory space and achieve
                 higher speedups with the same accuracy. Note that hardware and program optimizations can further
                 boost the system performance on top of the level of SSL but are not covered in this work.
                 Low rank approximation. Denil et al. [9] predicted 95% of the parameters in a DNN by exploiting the
                 redundancy across filters and channels. Inspired by it, Jaderberg et al. [11] achieved 4.5× speedup
                 on CPUs for scene text character recognition and Denton et al. [10] achieved 2× speedups on both
                 CPUs and GPUs for the first two layers. Both of the works used Low Rank Approximation (LRA)
                 with about 1% accuracy drop. [13][12] improved and extended LRA to larger DNNs. However, the
                 network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning,
                 and cross-validating are still needed to find an optimal structure for accuracy and speed trade-off.
                 As the number of hyper-parameters in the LRA method increases linearly with layer depth [10][13], the
                 search space increases linearly or even polynomially for very deep DNNs. Compared to LRA, our
                 contributions are: (1) SSL can dynamically optimize the compactness of DNN structure with only
                 one hyper-parameter and no reiterations; (2) besides the redundancy within the layers, SSL also
                 exploits the necessity of deep layers and reduces them; (3) DNN filters regularized by SSL have lower
                 rank approximation, so it can work together with LRA for more efficient model compression.
                 Model structure learning. Group Lasso [14] is an efficient regularization to learn sparse structures.
                 Kim et al. [15] used group Lasso to regularize the structure of a correlation tree for a multi-task regression
                 problem and reduced prediction errors. Liu et al. [6] utilized group Lasso to constrain the scale

                 <<FORMULA>>        

                 <<FORMULA>>          

                 Figure 2: The proposed structured sparsity learning (SSL) for DNNs. Weights in filters are split
                 into multiple groups. Through group Lasso regularization, a more compact DNN is obtained by
                 removing some groups. The figure illustrates the filter-wise, channel-wise, shape-wise, and depth-wise
                 structured sparsity that were explored in the work.

                                 <<FORMULA>>           

                 of the structure of LRA. To adapt DNN structure to different databases, Feng et al. [16] learned
                 the appropriate number of filters in DNN. Different from these prior works, we apply group Lasso to
                 regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). Our source code
                 can be found at https://github.com/wenwei202/caffe/tree/scnn.


                 3 Structured Sparsity Learning Method for DNNs

                 We focus mainly on the Structured Sparsity Learning (SSL) on convolutional layers to regularize the
                 structure of DNNs. We first propose a generic method to regularize structures of DNN in Section 3.1,
                 and then specify the method to structures of filters, channels, filter shapes and depth in Section 3.2.
                 Variants of formulations are also discussed from the computational efficiency viewpoint in Section 3.3.

                 3.1 Proposed structured sparsity learning for generic structures           
                 Suppose weights of convolutional layers in a DNN form a sequence of 4-D tensors 

                 <<FORMULA>>, where <<FORMULA>> and <<FORMULA>> are the dimensions of the l-th
                 weight tensor along the axes of filter, channel, spatial height and spatial width, respectively.
                 L denotes the number of convolutional layers. Then the proposed generic optimization target of a DNN with
                 structured sparsity regularization can be formulated as:

                                 <<FORMULA>>             (1)
                                                         
                 Here W represents the collection of all weights in the DNN; <<FORMULA>> is the loss on data; <<FORMULA>> is
                 non-structured regularization applied to every weight, e.g., l2-norm; and <<FORMULA>> is the structured
                 sparsity regularization on each layer. Because group Lasso can effectively zero out all weights in
                 some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights
                 w can be represented as <<FORMULA>>, where <<FORMULA>> is a group of partial weights in w
                 and G is the total number of groups. Different groups may overlap. Here <<FORMULA>>, where
                 <<FORMULA>> is the number of weights in <<FORMULA>>.
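
Since the formulas of this subsection are elided in this copy, the following is a hedged reconstruction of the generic objective and of the group Lasso term, written from the verbal definitions above. The symbol names (E_D for the data loss, R for the non-structured regularizer, R_g for the structured one, and the weights λ and λ_g) are assumptions chosen to match that description rather than a verbatim quotation.

```latex
% Generic SSL objective: data loss + non-structured term + structured term per layer
E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g\!\left(W^{(l)}\right)

% Group Lasso over G (possibly overlapping) groups of a weight set w
R_g(w) = \sum_{g=1}^{G} \left\lVert w^{(g)} \right\rVert_g ,
\qquad
\left\lVert w^{(g)} \right\rVert_g = \sqrt{\sum_{i=1}^{|w^{(g)}|} \left(w_i^{(g)}\right)^2}
```

Because each group enters through an unsquared l2 norm, the penalty tends to drive whole groups to zero together, which is the property SSL relies on.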

                 3.2 Structured sparsity learning for structures of ﬁlters, channels, ﬁlter shapes and depth

                 In SSL, the learned "structure" is decided by the way of splitting groups of w^(g). We investigate and
                 formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity in Figure 2.
                 For simplicity, the <<FORMULA>> term of Eq. (1) is omitted in the following formulation expressions.
                 Penalizing unimportant filters and channels. Suppose <<FORMULA>> is the n_l-th filter and <<FORMULA>> is the
                 c_l-th channel of all filters in the l-th layer. The optimization target of learning the filter-wise and
                 channel-wise structured sparsity can be defined as
                                           
                                      <<FORMULA>>                        (2) 
                                         
                 As indicated in Eq. (2), our approach tends to remove less important ﬁlters and channels. Note
                 that zeroing out a ﬁlter in the l-th layer results in a dummy zero output feature map, which in turn
                 makes a corresponding channel in the (l+ 1)-th layer useless. Hence, we combine the ﬁlter-wise and
                 channel-wise structured sparsity in the learning simultaneously.
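
The filter-wise and channel-wise groupings can be sketched on a single 4-D weight tensor as follows. This is a minimal NumPy illustration assuming the axis order (filter, channel, height, width) given in Section 3.1; the function names and the toy tensor are hypothetical.

```python
import numpy as np

def group_lasso(groups):
    """Sum of the l2 norms of the given weight groups."""
    return float(sum(np.linalg.norm(g) for g in groups))

def filter_wise_penalty(w):
    """w has shape (N, C, M, K); each group is one 3-D filter w[n]."""
    return group_lasso(w[n] for n in range(w.shape[0]))

def channel_wise_penalty(w):
    """Each group is channel c taken across all filters, w[:, c]."""
    return group_lasso(w[:, c] for c in range(w.shape[1]))

def zero_filters(w):
    """Indices of filters whose weights are all zero."""
    norms = np.sqrt((w ** 2).sum(axis=(1, 2, 3)))
    return np.flatnonzero(norms == 0).tolist()

# Toy layer with 2 filters, 3 channels and 2x2 kernels; filter 1 is all zero.
w = np.zeros((2, 3, 2, 2))
w[0] = 1.0
filter_term = filter_wise_penalty(w)    # sqrt(12): only filter 0 contributes
channel_term = channel_wise_penalty(w)  # 3 channels, each with norm 2
dead = zero_filters(w)
```

A filter listed by `zero_filters` produces an all-zero output feature map, so the matching input channel of the next layer carries no information, which is the coupling between the two penalties noted above.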
                 Learning arbitrary shapes of filters. As illustrated in Figure 2, <<FORMULA>> denotes the vector of
                 all corresponding weights located at spatial position of <<FORMULA>> in the 2D filters across the c_l-th
                 channel. Thus, we define W^(l)_{:,c_l,m_l,k_l} as the shape fiber related to learning arbitrary filter shape, because a
                 homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The
                 optimization target of learning shapes of filters becomes:
                                                 
                              <<FORMULA>>          (3) 
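
The shape-wise grouping can be sketched in the same way: one group per (channel, row, column) position, collected across all filters of the layer. The function names and the toy tensor below are assumptions for illustration only.

```python
import numpy as np

def shape_wise_penalty(w):
    """w has shape (N, C, M, K); one group per (c, m, k) position,
    i.e. the fiber w[:, c, m, k] running across all N filters."""
    fiber_norms = np.sqrt((w ** 2).sum(axis=0))   # shape (C, M, K)
    return float(fiber_norms.sum())

def learned_shape(w):
    """Boolean (C, M, K) mask of positions whose fiber is not all zero."""
    return np.any(w != 0, axis=0)

# Toy layer: 4 filters, 1 channel, 3x3 kernels. Zeroing the top row and the
# left column in every filter leaves a common 2x2 shape.
w = np.ones((4, 1, 3, 3))
w[:, 0, 0, :] = 0.0
w[:, 0, :, 0] = 0.0
penalty = shape_wise_penalty(w)   # 4 surviving fibers, each with norm 2
mask = learned_shape(w)
```

Because a fiber spans every filter, removing it shrinks all filters of the layer in the same way, which gives the homogeneous non-cubic shape described above.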
                                               
                 Regularizing layer depth. We also explore the depth-wise sparsity to regularize the depth of DNNs
                 in order to improve accuracy and reduce computation cost. The corresponding optimization target is
                 E(W) = E_D(W) + λ_d · Σ_{l=1}^{L} ||W^(l)||_g. Different from other discussed sparsification techniques,
                 zeroing out all the filters in a layer will cut off the message propagation in the DNN so that the output
                 neurons cannot perform any classification. Inspired by the structure of highway networks [17] and
                 deep residual networks [5], we propose to leverage the shortcuts across layers to solve this issue. As
                 illustrated in Figure 2, even when SSL removes an entire unimportant layer, feature maps will still
                 be forwarded through the shortcut.
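
The role of the shortcut can be shown with a toy example: once all weights of a layer are zero, a plain layer outputs nothing useful, while a residual unit passes its input through unchanged. This sketch replaces the convolution with a matrix product and uses hypothetical names; it only illustrates the argument.

```python
import numpy as np

def depth_wise_penalty(layers):
    """One group per layer: sum of the l2 norms of whole weight tensors."""
    return float(sum(np.linalg.norm(w) for w in layers))

def plain_block(x, w):
    """Toy layer without shortcut: relu(w @ x)."""
    return np.maximum(w @ x, 0.0)

def residual_block(x, w):
    """Toy residual unit y = x + relu(w @ x); w stands in for the conv layer."""
    return x + plain_block(x, w)

x = np.array([1.0, -2.0, 3.0])
removed = np.zeros((3, 3))          # a layer fully zeroed out by the penalty
y_plain = plain_block(x, removed)   # all zeros: the signal is cut off
y_res = residual_block(x, removed)  # equals x: the shortcut forwards the input
penalty = depth_wise_penalty([removed, np.eye(3)])
```

A zeroed residual unit therefore behaves as an identity mapping and can be dropped from the network without changing its output.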

                 3.3 Structured sparsity learning for computationally efﬁcient structures

                 All proposed schemes in section 3.2 can learn a compact DNN for computation cost reduction.
                 Moreover, some variants of the formulations of these schemes can directly learn structures that can
                 be efﬁciently computed.
                 2D-filter-wise sparsity for convolution. 3D convolution in DNNs essentially is a composition of 2D
                 convolutions. To perform efficient convolution, we explored a fine-grain variant of filter-wise sparsity,
                 namely, 2D-filter-wise sparsity, to spatially enforce group Lasso on each 2D filter W^(l)_{n_l,c_l,:,:}. The
                 saved convolution is proportional to the percentage of the removed 2D filters. The fine-grain version
                 of filter-wise sparsity can more efficiently reduce the computation associated with convolution:
                 because the group sizes are much smaller and thus the weight updating gradients are sharper, it helps
                 group Lasso to quickly obtain a high ratio of zero groups for a large-scale DNN.
                 Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in
                 DNNs is commonly converted to the form of GEneral Matrix Multiplication (GEMM) by lowering
                 weight tensors and feature tensors to matrices [18]. For example, in Caffe [19], a 3D filter <<FORMULA>> is
                 reshaped to a row in the weight matrix where each column is the collection of weights <<FORMULA>>
                 related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can directly reduce the
                 dimension of the weight matrix in GEMM by removing zero rows and columns. In this context, we use
                 row-wise and column-wise sparsity as interchangeable terminology for filter-wise and shape-wise
                 sparsity, respectively.
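
The reduction of the GEMM dimensions can be sketched as follows. The weight tensor is lowered to an N x (C·M·K) matrix, all-zero rows (filters) and columns (shape fibers) are dropped, and the smaller product matches the surviving rows of the full one. The names, the toy sizes and the random-free test data are assumptions; this is not Caffe's code.

```python
import numpy as np

def lower_to_matrix(w):
    """Reshape an (N, C, M, K) weight tensor to an N x (C*M*K) matrix,
    so that each row is one 3-D filter and each column one shape fiber."""
    return w.reshape(w.shape[0], -1)

def compact(matrix):
    """Drop all-zero rows and columns; return the small matrix and indices."""
    rows = np.flatnonzero(np.any(matrix != 0, axis=1))
    cols = np.flatnonzero(np.any(matrix != 0, axis=0))
    return matrix[np.ix_(rows, cols)], rows, cols

# Toy layer: 4 filters, 2 channels, 3x3 kernels, all weights nonzero at first.
w = np.arange(1, 4 * 2 * 3 * 3 + 1, dtype=float).reshape(4, 2, 3, 3)
w[1] = 0.0              # filter-wise sparsity: one zero row
w[:, 0, 0, :] = 0.0     # shape-wise sparsity: three zero columns

full = lower_to_matrix(w)                  # 4 x 18
small, rows, cols = compact(full)          # 3 x 15

# Lowered feature matrix with 5 patches (hypothetical values).
features = np.arange(18 * 5, dtype=float).reshape(18, 5)
out_full = full @ features
out_small = small @ features[cols]
```

Only the rows of the feature matrix listed in `cols` are needed, and the output rows of removed filters are known to be zero, so the dense GEMM simply runs on smaller operands.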

                 4 Experiments

                 We evaluated the effectiveness of our SSL using published models on three databases – MNIST,
                 CIFAR-10, and ImageNet. Unless otherwise stated, SSL starts with the network whose weights
                 are initialized by the baseline, and speedups are measured in matrix-matrix multiplication by Caffe on
                 a single-thread Intel Xeon E5-2630 CPU.

                 Table 1: Results after penalizing unimportant filters and channels in LeNet

                           <<TABLE>>

                  4.1 LeNet and multilayer perceptron on MNIST

                 In the MNIST experiment, we examined the effectiveness of SSL in two types of networks:
                 LeNet [20] implemented by Caffe and a multilayer perceptron (MLP) network. Both networks were
                 trained without data augmentation.
                 LeNet: When applying SSL to LeNet, we constrain the network with filter-wise and channel-wise
                 sparsity in convolutional layers to penalize unimportant filters and channels. Table 1 summarizes
                 the remaining filters and channels, floating-point operations (FLOP), and practical speedups. In the
                 table, LeNet 1 is the baseline and the others are the results after applying SSL with different strengths
                 of structured sparsity regularization. The results show that our method achieves similar error
                 (±0.1%) with far fewer filters and channels, and saves significant FLOP and computation time.
                 To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters
                 in Figure 3. It can be seen that most filters in LeNet 2 are entirely zeroed out except for the five most
                 important detectors of stroke patterns that are sufficient for feature extraction. The accuracy of
                 LeNet 3 (that further removes the weakest and redundant stroke detector) drops only 0.2% from that
                 of LeNet 2. Compared to the random and blurry filter patterns in LeNet 1 that resulted from the high
                 freedom of parameter space, the filters in LeNet 2 & 3 are regularized and converge to smoother and
                 more natural patterns. This explains why our proposed SSL obtains the same-level accuracy but has
                 far fewer filters. The smoothness of the filters is also observed in the deeper layers.
                 The effectiveness of the shape-wise sparsity on LeNet is summarized in Table 2. The baseline LeNet 1
                 has conv1 filters with a regular 5×5 square (size = 25) while LeNet 5 reduces the dimension so that it
                 can be constrained by a 2×4 rectangle (size = 7). The 3D shape of conv2 filters in the baseline is
                 also regularized to the 2D shape in LeNet 5 within only one channel, indicating that only one filter in
                 conv1 is needed. This fact significantly saves FLOP and computation time.

                                    <<FIGURE>>

                     Figure 3: Learned conv1 filters in LeNet 1 (top), LeNet 2 (middle) and LeNet 3 (bottom)

                 MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e., the
                 number of neurons) of fully-connected layers. We enforce the group Lasso regularization on all the
                 input (or output) connections of each neuron. A neuron whose input connections are all zeroed out
                 can degenerate to a bias neuron in the next layer; similarly, a neuron can degenerate to a removable
                 dummy neuron if all of its output connections are zeroed out. Figure 4(a) summarizes the learned
                 structure and FLOP of different MLP networks. The results show that SSL can not only remove
                 hidden neurons but also discover the sparsity of images. For example, Figure 4(b) depicts the number
                 of connections of each input neuron in MLP 2, where 40.18% of input neurons have zero connections
                 and they concentrate at the boundary of the image. Such a distribution is consistent with our intuition:

                    Table 2: Results after learning filter shapes in LeNet
                
                                 <<TABLE>>
              
       Figure 4: The normalized reconstruction error of weight matrix vs. the percent of ranks. Principal
       Component Analysis (PCA) is utilized to explore the redundancy among filters. % ranks of eigenvectors
       corresponding to the largest eigenvalues are selected as basis to perform low rank approximation.
       Left: LeNet 2 in Table 1; middle: ConvNet 2 in Table 4; right: AlexNet 4 in Table 5. Dashed lines
       indicate baselines and solid lines indicate results of SSL.
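
The quantity plotted in this figure can be approximated with a short sketch. It uses a truncated SVD of the weight matrix as a stand-in for the PCA procedure named in the caption, so it is an assumption about the measurement, not the authors' script; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def normalized_reconstruction_error(matrix, rank_fraction):
    """Relative Frobenius error after keeping the top share of singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    keep = int(round(rank_fraction * len(s)))
    approx = (u[:, :keep] * s[:keep]) @ vt[:keep]
    return float(np.linalg.norm(matrix - approx) / np.linalg.norm(matrix))

# A rank-1 toy matrix: half of the available ranks already reproduce it.
m = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
err_half = normalized_reconstruction_error(m, 0.5)
err_none = normalized_reconstruction_error(m, 0.0)
```

A curve that drops to zero at a smaller percentage of ranks indicates a weight matrix of lower effective rank.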


    detectors of stroke patterns which are sufficient for feature extraction. The accuracy of LeNet 3
    (that further removes one weakest and one redundant stroke detector) compared with LeNet 2 drops
    by only 0.2%. Although the training processes of the three networks are independent, the corresponding
    regularized filters in LeNet 2 and LeNet 3 demonstrate very high similarity and represent a certain level
    of alikeness to those in LeNet 1. Compared with the random and blurry filter patterns in LeNet 1, which result
    from the high freedom of the parameter space, the filters in LeNet 2 & 3 are regularized through the
    filter-wise and channel-wise sparsity and therefore converge at smoother and more natural patterns.
    This explains why our proposed SSL obtains the same level of accuracy with far fewer filters.
    These regularity and similarity phenomena are also observed in deeper layers. Unlike low-rank
    decomposition, which only explores the redundancy and does not change the rank, SSL can reduce
    the redundancy, as shown in Figure 4.

    We also explore the effectiveness of the shape-wise sparsity on LeNet in Table 2. The baseline LeNet
    1 has conv1 filters of a regular 5×5 square size, while LeNet 5 reduces the dimension to less than
    2×4. The 3D shape of the filters in conv2 of LeNet 1 is regularized to the 2D shape of LeNet 5 with
    only one channel, indicating that only one filter in conv1 is needed. This saves significant FLOP and
    computing time.

    MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e.,
    the number of neurons) in fully-connected layers. Here, the baseline MLP network, composed of
    two hidden layers with 500 and 300 neurons respectively, obtains a test error of 1.43%. We enforced
    the group Lasso regularization on all the input (or output) connections of every neuron, including
    those of the input layer. Note that a neuron with all the input connections zeroed out degenerates
    to a bias neuron in the next layer; similarly, a neuron degenerates to a removable dummy neuron
    if all of its output connections are zeroed out. As such, the computation of the GEneral Matrix Vector
    (GEMV) product in fully-connected layers can be significantly reduced. Table 3 summarizes the


                  Table 3: Learning the number of neurons in multi-layer perceptron

                  <<TABLE>>

                Figure 4: (a) Results of learning the number of neurons in MLP. (b) the connection numbers of input
                
                              <<FIGURE>>

                handwriting digits are usually written in the center and pixels close to the boundary contain little
                discriminative classiﬁcation information.
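
The group Lasso penalty on a neuron's connections, as used in the MLP experiment, can be illustrated with a minimal sketch. It assumes a plain, unweighted sum of L2 norms over groups and a layout in which `weight[i][j]` connects input `j` to neuron `i`; the function names and the toy matrix are hypothetical and not taken from the paper's implementation.

```python
import math

def group_lasso(groups):
    """Sum of the L2 norms of the given weight groups (unweighted form, an assumption)."""
    return sum(math.sqrt(sum(w * w for w in group)) for group in groups)

def input_groups(weight):
    """One group per neuron of this layer: all of its input connections (a row)."""
    return [list(row) for row in weight]

def output_groups(weight):
    """One group per neuron of the previous layer: all of its output connections (a column)."""
    return [list(col) for col in zip(*weight)]

# Toy 2x2 layer: the second neuron has all input connections zeroed out,
# so its group contributes nothing to the penalty.
W = [
    [3.0, 4.0],
    [0.0, 0.0],
]

penalty_inputs = group_lasso(input_groups(W))    # norms of rows
penalty_outputs = group_lasso(output_groups(W))  # norms of columns
print(penalty_inputs, penalty_outputs)
```

Because each group enters through its norm rather than through its individual weights, the penalty pushes whole rows or columns to zero together, which is what allows a neuron to be removed.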

                4.2 ConvNet and ResNet on CIFAR-10
                We implemented the ConvNet of [1] and deep residual networks (ResNet) [5] on CIFAR-10. When
                regularizing filters, channels, and filter shapes, the results and observations of both networks are
                similar to those of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and
                shape-wise sparsity to reduce the dimension of the weight matrix in GEMM of ConvNet. We also learn
                the depth-wise sparsity of ResNet to regularize the depth of the DNNs.
                ConvNet: We use the network from Alex Krizhevsky et al. [1] as the baseline and implement it
                using Caffe. All the configurations remain the same as the original implementation except that we
                added a dropout layer with a ratio of 0.5 in the fully-connected layer to avoid over-fitting. ConvNet is
                trained without data augmentation. Table 3 summarizes the results of three ConvNet networks. Here,
                the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns.
                Figure 5 shows their learned conv1 filters. In Table 3, SSL can reduce the size of the weight matrix
                in ConvNet 2 by 50%, 70.7% and 36.1% for each convolutional layer and achieve good speedups
                without accuracy drop. Surprisingly, without SSL, four conv1 filters of the baseline are actually
                all-zeros as shown in Figure 5, demonstrating the great potential of filter sparsity. When SSL is
                applied, half of the conv1 filters in ConvNet 2 can be zeroed out without accuracy drop.
                On the other hand, in ConvNet 3, SSL achieves 1.0% (0.16%) lower error with a model even smaller
                than the baseline. In this scenario, SSL performs as a structure regularization to dynamically learn a
                better network structure (including the number of filters and filter shapes) to reduce the error.
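
The row/column sparsity defined above (the share of all-zero rows or columns in a weight matrix) can be computed as in the following sketch. The function name and the toy matrix are illustrative assumptions, and the result is returned as a fraction rather than a percentage.

```python
def row_col_sparsity(weight):
    """Return (row_sparsity, col_sparsity): fractions of all-zero rows and columns."""
    n_rows, n_cols = len(weight), len(weight[0])
    zero_rows = sum(1 for row in weight if all(v == 0 for v in row))
    zero_cols = sum(
        1 for j in range(n_cols) if all(row[j] == 0 for row in weight)
    )
    return zero_rows / n_rows, zero_cols / n_cols

# Toy 4x4 weight matrix (hypothetical values): rows 1 and 3 and
# columns 0 and 2 are entirely zero.
W = [
    [0.0, 0.3, 0.0, -0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]

row_sparsity, col_sparsity = row_col_sparsity(W)
print(f"row sparsity: {row_sparsity:.0%}, column sparsity: {col_sparsity:.0%}")
```

Only fully zero rows and columns count here; isolated zeros inside an otherwise nonzero row do not shrink the matrix handed to GEMM.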

                                 <<FIGURE>>

                 Figure 5: Learned conv1 filters in ConvNet 1 (top), ConvNet 2 (middle) and ConvNet 3 (bottom)

                 ResNet: To investigate the necessary depth of DNNs required by SSL, we use a 20-layer deep residual
                network (ResNet-20) proposed in [5] as the baseline. The network has 19 convolutional layers and
                1 fully-connected layer. Identity shortcuts are utilized to connect the feature maps with the same
                dimension while 1×1 convolutional layers are chosen as shortcuts between the feature maps with
                different dimensions. Batch normalization [21] is adopted after convolution and before activation.
                We use the same data augmentation and training hyper-parameters as those in [5]. The final error of
                the baseline is 8.82%. In SSL, the depth of ResNet-20 is regularized by depth-wise sparsity. Group Lasso
                regularization is only enforced on the convolutional layers between each pair of shortcut endpoints,
                excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers

                                               <<FIGURE>>

                                                   Figure 6: Error vs. layer number after depth regularization by SSL.


                 ResNet-# is the original ResNet in [5] with # layers. SSL-ResNet-# is the depth-regularized ResNet by SSL with # layers, including
                 the last fully-connected layer. “32×32” indicates the convolutional layers with an output map size of 32×32, and so forth.

                 with all zero weights are removed and the net is finally fine-tuned with a base learning rate of 0.01.
                 Figure 6 plots the trend of the error vs. the number of layers under different strengths of depth
                 regularization. Compared with the original ResNet in [5], SSL learns a ResNet with 14 layers (SSL-
                 ResNet-14) that reaches a lower error than the baseline with 20 layers (ResNet-20);
                 SSL-ResNet-18 and ResNet-32 achieve an error of 7.40% and 7.51%, respectively. This result implies
                 that SSL can work as a depth regularization to improve classification accuracy. Note that SSL can
                 efficiently learn shallower DNNs without accuracy loss to reduce computation cost; however, this
                 does not mean the depth of the network is unimportant. The trend in Figure 6 shows that the test
                 error generally declines as more layers are preserved. A slight error rise from SSL-ResNet-18 to
                 SSL-ResNet-20 shows the suboptimal selection of the depth in the group of “32×32”.

                 4.3  AlexNet on ImageNet

                 To show the generalization of our method to large scale DNNs, we evaluate SSL using AlexNet with
                 ILSVRC 2012. CaffeNet [19], the replication of AlexNet [1] with minor changes, is used in our
                 experiment. All training images are rescaled to the size of 256×256. A 227×227 image is randomly
                 cropped from each scaled image and mirrored for data augmentation, and only the center crop is
                 used for validation. The final top-1 validation error is 42.63%. In SSL, AlexNet is first trained with
                 structure regularization; when it converges, zero groups are removed to obtain a DNN with the new
                 structure; finally, the network is fine-tuned without SSL to regain the accuracy.
                 We first studied 2D-filter-wise and shape-wise sparsity by exploring the trade-offs between
                 computation complexity and classification accuracy. Figure 7(a) shows the 2D-filter sparsity (the ratio
                 between the removed 2D filters and total 2D filters) and the saved FLOP of 2D convolutions vs. the
                 validation error. In Figure 7(a), deeper layers generally have higher sparsity as the group size shrinks

                    <<FIGURE>>

                 Figure 7: (a) 2D-filter-wise sparsity and FLOP reduction vs. top-1 error. The vertical dashed line shows the
                 error of the original AlexNet; (b) The reconstruction error of the weight tensor vs. dimensionality. Principal
                 Component Analysis (PCA) is utilized to perform dimensionality reduction to exploit filter redundancy.
                 The eigenvectors corresponding to the largest eigenvalues are selected as the basis of the lower-dimensional
                 space. Dashed lines denote the results of the baselines and solid lines indicate those of AlexNet 5
                 in Table 4; (c) Speedups of ℓ1-norm and SSL on various CPU and GPU platforms (in the x-axis labels,
                 T# is the maximum number of physical threads in the Xeon CPU). AlexNet 1 and AlexNet 2 in Table 4
                 are used as test benches.


                 and the number of 2D filters grows. 2D-filter sparsity regularization can reduce the total FLOP by
                 30%–40% without accuracy loss, or reduce the error of AlexNet by about 1%, down to 41.69%, by retaining
                 the original number of parameters. Shape-wise sparsity obtains similar results. In Table 4, for
                 example, AlexNet 5 achieves on average 1.4× layer-wise speedup on both CPU and GPU without
                 accuracy loss after shape regularization; the top-1 error can also be reduced to 41.83% if
                 the parameters are retained. In Figure 7(a), the obtained DNN with the lowest error has a very low
                 sparsity, indicating that the number of parameters in a DNN is still important to maintain learning
                 capacity. In this case, SSL works as a regularization that adds a smoothness restriction to the model in
                 order to avoid over-fitting. Figure 7(b) compares the results of dimensionality reduction of weight
                 tensors in the baseline and our SSL-regularized AlexNet. The results show that the smoothness restriction
                 enforces parameter searching in a lower-dimensional space and enables lower rank approximation
                 of the DNNs. Therefore, SSL can work together with low rank approximation to achieve even higher
                 model compression.
                 Besides the above analyses, the computation efficiencies of structured sparsity and non-structured
                 sparsity are compared in Caffe using standard off-the-shelf libraries, i.e., Intel Math Kernel Library
                 on CPU and CUDA cuBLAS and cuSPARSE on GPU. We use SSL to learn an AlexNet with high
                 column-wise and row-wise sparsity as the representative of the structured sparsity method. ℓ1-norm is
                 selected as the representative of the non-structured sparsity method instead of connection pruning in
                 [7] because ℓ1-norm gets a higher sparsity on convolutional layers, as shown by the results of AlexNet 3 and
                 AlexNet 4 in Table 4. Speedups achieved by SSL are measured by subroutines of GEMM
                 where nonzero rows and columns in each weight matrix are concatenated in consecutive memory
                 space. Note that compared to GEMM, the overhead of concatenation can be ignored. To measure the
                 speedups of ℓ1-norm, sparse weight matrices are stored in the format of Compressed Sparse Row
                 (CSR) and computed by sparse-dense matrix multiplication subroutines.
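
The idea of concatenating nonzero rows and columns before calling GEMM can be sketched as follows. This is an illustrative pure-Python stand-in, not the Caffe/MKL/cuBLAS path used in the experiments: the helper names are hypothetical, the weight matrix is assumed to have shape (outputs × inputs), and `matmul` stands in for a real GEMM routine.

```python
def matmul(a, b):
    """Naive dense matrix product, standing in for a BLAS GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def compact(weight):
    """Drop all-zero rows and columns; return the dense block and kept indices."""
    keep_rows = [i for i, row in enumerate(weight) if any(v != 0 for v in row)]
    keep_cols = [
        j for j in range(len(weight[0])) if any(row[j] != 0 for row in weight)
    ]
    dense = [[weight[i][j] for j in keep_cols] for i in keep_rows]
    return dense, keep_rows, keep_cols

def structured_matmul(weight, x):
    """Compute weight @ x using only the compacted dense block."""
    dense, keep_rows, keep_cols = compact(weight)
    x_small = [x[j] for j in keep_cols]           # inputs that still matter
    y_small = matmul(dense, x_small)              # smaller dense GEMM
    y = [[0.0] * len(x[0]) for _ in weight]       # removed rows produce zeros
    for k, i in enumerate(keep_rows):
        y[i] = y_small[k]
    return y

W = [
    [0.0, 0.3, 0.0, -0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

dense_block, _, _ = compact(W)
Y_full = matmul(W, X)
Y_structured = structured_matmul(W, X)
print(len(dense_block), len(dense_block[0]))
```

The compacted block stays dense, so it can be handed to an ordinary GEMM routine, whereas scattered zeros need a sparse format such as CSR and sparse-dense multiplication.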
                 Table 4 compares the obtained sparsity and speedups of ℓ1-norm and SSL on CPU (Intel Xeon)
                 and GPU (GeForce GTX TITAN Black) under approximately the same errors, e.g., with acceptable
                 or no accuracy loss. For a fair comparison, after ℓ1-norm regularization, the DNN is also fine-tuned
                 by disconnecting all zero-weighted connections so that 1.39% accuracy is recovered for
                 AlexNet 1. Our experiments show that the DNNs require a very high non-structured sparsity to achieve
                 a reasonable speedup (the speedups are even negative when the sparsity is low). SSL, however, can
                 always achieve positive speedups. With an acceptable accuracy loss, our SSL achieves on average
                 5.1× and 3.1× layer-wise acceleration on CPU and GPU, respectively. In contrast, ℓ1-norm achieves
                 on average only 3.0× and 0.9× layer-wise acceleration on CPU and GPU, respectively. We note
                 that at the same accuracy, our average speedup is indeed higher than that of [6], which adopts heavy
                 hardware customization to overcome the negative impact of non-structured sparsity. Figure 7(c)
                 shows the speedups of ℓ1-norm and SSL on various platforms, including both GPU (Quadro, Tesla


                               Table 4: Sparsity and speedup of AlexNet on ILSVRC 2012

                    <<TABLE>>


                   and Titan) and CPU (Intel Xeon E5-2630). SSL can achieve on average about 3× speedup on GPU while
                   non-structured sparsity obtains no speedup on GPU platforms. On CPU platforms, both methods can
                   achieve good speedups and the benefit grows as the processors become weaker. Nonetheless, SSL
                   can always achieve on average about 2× speedup compared to non-structured sparsity.


                   5 Conclusion

                   In this work, we have proposed a Structured Sparsity Learning (SSL) method to regularize filter,
                   channel, filter shape, and depth structures in deep neural networks (DNN). Our method can enforce
                   the DNN to dynamically learn more compact structures without accuracy loss. The structured
                   compactness of the DNN achieves significant speedups for the DNN evaluation both on CPU
                   and GPU with off-the-shelf libraries. Moreover, a variant of SSL can be performed as structure
                   regularization to improve classification accuracy of state-of-the-art DNNs.

                   Acknowledgments

                   This work was supported in part by NSF XPS-1337198 and NSF CCF-1615475. The authors thank
                   Drs. Sheng Li and Jongsoo Park for valuable feedback on this work.


                   References

                    [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional
                        neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. 2012.
                    [2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate
                        object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern
                        Recognition (CVPR), 2014.
                    [3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
                        recognition. arXiv preprint arXiv:1409.1556, 2014.
                    [4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
                        Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint
                        arXiv:1409.4842, 2015.
                    [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
                        arXiv preprint arXiv:1512.03385, 2015.
                    [6] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional
                        neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
                    [7] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient
                        neural network. In Advances in Neural Information Processing Systems, pages 1135–1143. 2015.
                    [8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with
                        pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
                    [9] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting
                        parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156.
                        2013.
                   [10] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure
                        within convolutional networks for efficient evaluation. In Advances in Neural Information Processing
                        Systems, pages 1269–1277. 2014.
                   [11] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with
                        low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
                   [12] Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training
                        cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
                   [13] Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank
                        regularization. arXiv preprint arXiv:1511.06067, 2015.
                   [14] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of
                        the Royal Statistical Society. Series B (Statistical Methodology), 68(1):49–67, 2006.
                   [15] Seyoung Kim and Eric P Xing. Tree-guided group lasso for multi-task regression with structured sparsity.
                        In Proceedings of the 27th International Conference on Machine Learning, 2010.
                   [16] Jiashi Feng and Trevor Darrell. Learning the structure of deep convolutional networks. In The IEEE
                        International Conference on Computer Vision (ICCV), 2015.
                   [17] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint
                        arXiv:1505.00387, 2015.
                   [18] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and
                        Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
                   [19] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio
                        Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv
                        preprint arXiv:1408.5093, 2014.
                   [20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
                        document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
                   [21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
                        internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
<|endoftext|>


<|startoftext|>
                  MIXED PRECISION TRAINING


                   Sharan Narang*, Gregory Diamos, Erich Elsen†
                  Baidu Research
                  {sharan, gdiamos}@baidu.com

                  Paulius Micikevicius*, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston,
                  Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
                  NVIDIA
                  {pauliusm, alben, dagarcia, bginsburg, mhouston,
                   okuchaiev, gavenkatesh, skyw}@nvidia.com

                                        ABSTRACT

                       Increasing the size of a neural network typically improves accuracy but also
                       increases the memory and compute requirements for training the model. We
                       introduce methodology for training deep neural networks using half-precision
                       floating point numbers, without losing model accuracy or having to modify
                       hyper-parameters. This nearly halves memory requirements and, on recent GPUs,
                       speeds up arithmetic. Weights, activations, and gradients are stored in IEEE
                       half-precision format. Since this format has a narrower range than single-precision, we
                       propose three techniques for preventing the loss of critical information. Firstly,
                       we recommend maintaining a single-precision copy of weights that accumulates
                       the gradients after each optimizer step (this copy is rounded to half-precision for
                       the forward- and back-propagation). Secondly, we propose loss-scaling to
                       preserve gradient values with small magnitudes. Thirdly, we use half-precision
                       arithmetic that accumulates into single-precision outputs, which are converted to
                       half-precision before storing to memory. We demonstrate that the proposed methodology
                       works across a wide variety of tasks and modern large scale (exceeding 100
                       million parameters) model architectures, trained on large datasets.


                  1 INTRODUCTION

                 Deep Learning has enabled progress in many different applications, ranging from image recognition
                 (He et al., 2016a) to language modeling (Jozefowicz et al., 2016) to machine translation (Wu et al.,
                 2016) and speech recognition (Amodei et al., 2016). Two trends have been critical to these results
                 - increasingly large training data sets and increasingly complex models. For example, the neural
                 network used in Hannun et al. (2014) had 11 million parameters which grew to approximately 67
                 million for bidirectional RNNs and further to 116 million for the latest forward only Gated Recurrent
                 Unit (GRU) models in Amodei et al. (2016).
                 Larger models usually require more compute and memory resources to train. These requirements
                 can be lowered by using reduced precision representation and arithmetic. Performance (speed) of
                 any program, including neural network training and inference, is limited by one of three factors:
                 arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these
                 limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of
                 values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced
                 precision math. For example, half-precision math throughput in recent GPUs is 2× to 8× higher
                 than for single-precision. In addition to speed improvements, reduced precision formats also reduce
                 the amount of memory required for training.
                 Modern deep learning training systems use single-precision (FP32) format. In this paper, we address
                 the training with reduced precision while maintaining model accuracy. Speciﬁcally, we train various
                 neural networks using IEEE half-precision format (FP16). Since FP16 format has a narrower
                 dynamic range than FP32, we introduce three techniques to prevent model accuracy loss: maintaining
                 a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros,
                 and FP16 arithmetic with accumulation in FP32. Using these techniques we demonstrate that a
                 wide variety of network architectures and applications can be trained to match the accuracy of FP32
                 training. Experimental results include convolutional and recurrent network architectures, trained
                 for classiﬁcation, regression, and generative tasks. Applications include image classiﬁcation, image
                 generation, object detection, language modeling, machine translation, and speech recognition. The
                 proposed methodology requires no changes to models or training hyper-parameters.

                  2 RELATED WORK

                  There have been a number of publications on training Convolutional Neural Networks (CNNs) with
                  reduced precision. Courbariaux et al. (2015) proposed training with binary weights; all other tensors
                  and arithmetic were in full precision. Hubara et al. (2016a) extended that work to also binarize
                  the activations, but gradients were stored and computed in single precision. Hubara et al. (2016b)
                  considered quantization of weights and activations to 2, 4 and 6 bits; gradients were real numbers.
                  Rastegari et al. (2016) binarize all tensors, including the gradients. However, all of these approaches
                  lead to non-trivial loss of accuracy when larger CNN models were trained for ILSVRC classiﬁcation
                  task (Russakovsky et al., 2015). Zhou et al. (2016) quantize weights, activations, and gradients
                  to different bit counts to further improve result accuracy. This still incurs some accuracy loss and
                  requires a search over bit width conﬁgurations per network, which can be impractical for larger
                  models. Mishra et al. improve on the top-1 accuracy achieved by prior weight and activation 
                  quantizations by doubling or tripling the width of layers in popular CNNs. However, the gradients are
                  still computed and stored in single precision, while quantized model accuracy is lower than that of
                  the widened baseline. Gupta et al. (2015) demonstrate that 16 bit ﬁxed point representation can be
                  used to train CNNs on MNIST and CIFAR-10 datasets without accuracy loss. It is not clear how
                  this approach would work on the larger CNNs trained on large datasets or whether it would work for
                  Recurrent Neural Networks (RNNs).
                  There have also been several proposals to quantize RNN training. He et al. (2016c) train quantized
                  variants of the GRU (Cho et al., 2014) and Long Short Term Memory (LSTM) (Hochreiter and
                  Schmidhuber, 1997) cells to use fewer bits for weights and activations, albeit with a small loss in
                  accuracy. It is not clear whether their results hold for larger networks needed for larger datasets.
                  Hubara et al. (2016b) propose another approach to quantize RNNs without altering their structure.
                  Another approach to quantize RNNs is proposed in Ott et al. (2016). They evaluate binary, ternary
                  and exponential quantization for weights in various different RNN models trained for language
                  modelling and speech recognition. All of these approaches leave the gradients unmodiﬁed in single-
                  precision and therefore the computation cost during back propagation is unchanged.
                  The techniques proposed in this paper are different from the above approaches in three aspects.
                  First, all tensors and arithmetic for forward and backward passes use reduced precision, FP16 in
                  our case. Second, no hyper-parameters (such as layer width) are adjusted. Lastly, models trained
                  with these techniques do not incur accuracy loss when compared to single-precision baselines. We
                  demonstrate that this technique works across a variety of applications using state-of-the-art models
                  trained on large scale datasets.

                  3 IMPLEMENTATION

                 We introduce the key techniques for training with FP16 while still matching the model accuracy of
                 FP32 training session: single-precision master weights and updates, loss-scaling, and accumulating
                 FP16 products into FP32. Results of training with these techniques are presented in Section 4.

                  3.1 FP32 MASTER COPY OF WEIGHTS

                  In mixed precision training, weights, activations and gradients are stored as FP16. In order to match
                  the accuracy of the FP32 networks, an FP32 master copy of weights is maintained and updated with
                  the weight gradient during the optimizer step. In each iteration an FP16 copy of the master weights is
                 used in the forward and backward pass, halving the storage and bandwidth needed by FP32 training.
                 Figure 1 illustrates this mixed precision training process.
                 While the need for FP32 master weights is not universal, there are two possible reasons why a
                 number of networks require it. One explanation is that updates (weight gradients multiplied by the
                 learning rate) become too small to be represented in FP16: any value whose magnitude is smaller
                 than 2^-24 becomes zero in FP16. We can see in Figure 2b that approximately 5% of weight gradient
                 values have exponents smaller than -24. These small valued gradients would become zero in the
                 optimizer when multiplied with the learning rate and adversely affect the model accuracy. Using a
                 single-precision copy for the updates allows us to overcome this problem and recover the accuracy.
                 Another explanation is that the ratio of the weight value to the weight update is very large. In
                 this case, even though the weight update is representable in FP16, it could still become zero when
                 the addition operation right-shifts it to align the binary point with the weight. This can happen when
                 the magnitude of a normalized weight value is at least 2048 times larger than that of the weight update.
                 Since FP16 has 10 bits of mantissa, the implicit bit must be right-shifted by 11 or more positions to
                 potentially create a zero (in some cases rounding can recover the value). In cases where the ratio is
                 larger than 2048, the implicit bit would be right-shifted by 12 or more positions. This will cause the
                 weight update to become a zero which cannot be recovered. An even larger ratio will result in this
                 effect for de-normalized numbers. Again, this effect can be counteracted by computing the update
                 in FP32.
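
Both effects described above can be reproduced with a small sketch. It emulates FP16 and FP32 rounding with the `e` and `f` formats of Python's standard `struct` module rather than running on real half-precision hardware. The helper names, the weight value 1.0, the update value 1e-4 and the 1000-step loop are illustrative assumptions, not values from the experiments.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Effect 1: an update well below 2**-24 is not representable and becomes zero.
tiny_update = to_fp16(2.0 ** -26)

# Effect 2: a representable update can vanish when added to a larger weight.
weight = to_fp16(1.0)
lost = to_fp16(weight + 2.0 ** -11)  # ratio 2048: rounds back to the weight
kept = to_fp16(weight + 2.0 ** -10)  # ratio 1024: one FP16 step above the weight

# Hypothetical loop: apply the same small update 1000 times.
update = to_fp16(1e-4)
w_fp16 = weight
w_master = to_fp32(weight)
for _ in range(1000):
    w_fp16 = to_fp16(w_fp16 + update)      # FP16 weights: the update is rounded away
    w_master = to_fp32(w_master + update)  # FP32 master copy accumulates the update

print(tiny_update, lost, kept)
print(w_fp16, w_master)
```

In this sketch the FP16 weight never moves from 1.0, while the FP32 master copy ends near 1.1.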
                 To illustrate the need for an FP32 master copy of weights, we use the Mandarin speech model
                 (described in more detail in Section 4.3) trained on a dataset comprising approximately 800 hours
                 of speech data for 20 epochs. As shown in Figure 2a, we match FP32 training results when updating an
                 FP32 master copy of weights after FP16 forward and backward passes, while updating FP16 weights
                 results in 80% relative accuracy loss.
                 Even though maintaining an additional copy of weights increases the memory requirements for the
                 weights by 50% compared with single precision training, the impact on overall memory usage is much
                 smaller. For training, memory consumption is dominated by activations, due to larger batch sizes
                 and activations of each layer being saved for reuse in the back-propagation pass. Since activations
                 are also stored in half-precision format, the overall memory consumption for training deep neural
                 networks is roughly halved.

                  3.2 LOSS SCALING

                 FP16 exponent bias centers the range of normalized value exponents to [-14, 15] while gradient
                 values in practice tend to be dominated by small magnitudes (negative exponents). For example,
                 consider Figure 3 showing the histogram of activation gradient values, collected across all layers
                 during FP32 training of Multibox SSD detector network (Liu et al., 2015a). Note that much of
                 the FP16 representable range was left unused, while many values were below the minimum representable
                 range and became zeros. Scaling up the gradients will shift them to occupy more of the
                 representable range and preserve values that are otherwise lost to zeros. This particular network
                 diverges when gradients are not scaled, but scaling them by a factor of 8 (increasing the exponents
                 by 3) is sufficient to match the accuracy achieved with FP32 training. This suggests that activation
                  gradient values below 2^-27 in magnitude were irrelevant to the training of this model, but values in
                  the [2^-27, 2^-24) range were important to preserve.
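A small sketch of why a factor of 8 matters for exactly this range: FP16 rounding is emulated with Python's `struct` module (format `"e"`), and the helper name `to_fp16` is an assumption made for illustration. The loop checks which power-of-two gradient magnitudes survive conversion to FP16 with and without scaling.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

loss_scale = 8.0   # multiplying by 8 raises every exponent by 3

survives = {}
for exponent in (-28, -27, -26, -25, -24):
    grad = 2.0 ** exponent
    unscaled_ok = to_fp16(grad) != 0.0
    scaled_ok = to_fp16(grad * loss_scale) != 0.0
    survives[exponent] = (unscaled_ok, scaled_ok)
    print(f"2^{exponent}: nonzero unscaled={unscaled_ok}, nonzero scaled={scaled_ok}")
```

The smallest positive FP16 value is 2^-24, so in this sketch magnitudes from 2^-27 upward become representable only after scaling, while 2^-28 is still flushed to zero.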
                  One efficient way to shift the gradient values into FP16-representable range is to scale the loss value
                  computed in the forward pass, prior to starting back-propagation. By the chain rule, back-propagation
                  ensures that all the gradient values are scaled by the same amount. This requires no extra operations
                  during back-propagation and keeps the relevant gradient values from becoming zeros. Weight gradients
                  must be unscaled before weight update to maintain the update magnitudes as in FP32 training. It
                  is simplest to perform this unscaling right after the backward pass but before gradient clipping or any
                  other gradient-related computations, ensuring that no hyper-parameters (such as gradient clipping
                  threshold, weight decay, etc.) have to be adjusted.
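The procedure above (scale the loss, back-propagate in FP16, unscale before the update) can be sketched as follows. This is a toy chain of scalar derivatives, not the training code used in the paper: the local derivative values are hypothetical, FP16 is emulated with Python's `struct` module, and a Python float stands in for the FP32 value used after unscaling.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def backprop_fp16(upstream_grad: float, local_derivatives) -> float:
    """Apply the chain rule, storing every intermediate gradient in FP16."""
    grad = to_fp16(upstream_grad)
    for derivative in local_derivatives:
        grad = to_fp16(grad * derivative)
    return grad

# Hypothetical local derivatives of three layers (illustrative values only).
local_derivatives = [2.0 ** -10, 2.0 ** -10, 2.0 ** -6]
loss_scale = 8.0

# Without scaling, back-propagation starts from d(loss)/d(loss) = 1.
unscaled_grad = backprop_fp16(1.0, local_derivatives)

# With scaling, it starts from d(loss_scale * loss)/d(loss) = loss_scale,
# so every gradient in the chain is multiplied by the same factor.
scaled_grad = backprop_fp16(loss_scale, local_derivatives)

# Unscale in higher precision right after the backward pass, before clipping
# or the weight update.
recovered_grad = scaled_grad / loss_scale

print("unscaled FP16 gradient:", unscaled_grad)
print("recovered gradient:    ", recovered_grad)
```

Because the unscaling happens before any other gradient-related computation, quantities such as a clipping threshold would see the same magnitudes as in FP32 training.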
                  There are several options to choose the loss scaling factor. The simplest one is to pick a constant
                  scaling factor. We trained a variety of networks with scaling factors ranging from 8 to 32K
                  (many networks did not require a scaling factor). A constant scaling factor can be chosen empirically
                 or, if gradient statistics are available, directly by choosing a factor so that its product with
                 the maximum absolute gradient value is below 65,504 (the maximum value representable in FP16).
                 There is no downside to choosing a large scaling factor as long as it does not cause overflow during
                 back-propagation: overflows will result in infinities and NaNs in the weight gradients, which will
                 irreversibly damage the weights after an update. Note that overflows can be efficiently detected by
                 inspecting the computed weight gradients, for example, when weight gradient values are unscaled.
                 One option is to skip the weight update when an overflow is detected and simply move on to the
                 next iteration.
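The two ideas in this paragraph, picking a constant factor from gradient statistics and skipping an update on overflow, can be sketched as below. Restricting the factor to powers of two is an assumption made here for convenience, and the function names and the plain SGD update are illustrative rather than taken from the paper.

```python
import math

FP16_MAX = 65504.0   # maximum value representable in FP16

def choose_loss_scale(max_abs_grad: float) -> float:
    """Largest power-of-two factor whose product with max_abs_grad stays below FP16_MAX."""
    if not (math.isfinite(max_abs_grad) and max_abs_grad > 0.0):
        raise ValueError("max_abs_grad must be positive and finite")
    scale = 1.0
    while scale * max_abs_grad >= FP16_MAX:        # shrink if a factor of 1 already overflows
        scale /= 2.0
    while scale * 2.0 * max_abs_grad < FP16_MAX:   # grow while there is headroom
        scale *= 2.0
    return scale

def has_overflow(grads) -> bool:
    """Overflow shows up as infinities or NaNs in the weight gradients."""
    return any(math.isinf(g) or math.isnan(g) for g in grads)

def sgd_step(weights, scaled_grads, loss_scale, learning_rate):
    """Unscale and apply the update, or skip the step when an overflow is detected."""
    if has_overflow(scaled_grads):
        return list(weights), False
    updated = [w - learning_rate * (g / loss_scale) for w, g in zip(weights, scaled_grads)]
    return updated, True

print(choose_loss_scale(1.0))
print(sgd_step([1.0], [float("inf")], loss_scale=8.0, learning_rate=0.1))
```

Skipping the step leaves the weights untouched, so a single overflowing iteration does not propagate infinities or NaNs into the model.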

                   <<FIGURE>>

                  Figure 2: Figure 2a shows the results of three experiments: baseline (FP32), pseudo FP16 with
                  FP32 master copy, pseudo FP16 without FP32 master copy. Figure 2b shows the histogram for the
                  exponents of weight gradients for Mandarin speech recognition training with FP32 weights. The
                  gradients are sampled every 4,000 iterations during training for all the layers in the model.

                              <<FIGURE>>

                 Figure 3: Histogram of activation gradient values during the training of Multibox SSD network.
                 Note that the bins on the x-axis cover varying ranges and there’s a separate bin for zeros. For
                 example, 2% of the values are in the [2^-34, 2^-32) range, 2% of values are in the [2^-24, 2^-23) range,
                  and 67% of values are zero.


                  3.3 ARITHMETIC PRECISION

                  By and large neural network arithmetic falls into three categories: vector dot-products, reductions,
                  and point-wise operations. These categories beneﬁt from different treatment when it comes to
                  reduced precision arithmetic. To maintain model accuracy, we found that some networks require that
                  FP16 vector dot-product accumulates the partial products into an FP32 value, which is converted
                  to FP16 before writing to memory. Without this accumulation in FP32, some FP16 models did not
                  match the accuracy of the baseline models. Whereas previous GPUs supported only FP16 multiply-
                  add operation, NVIDIA Volta GPUs introduce Tensor Cores that multiply FP16 input matrices and
                  accumulate products into either FP16 or FP32 outputs (NVIDIA, 2017).
                  Large reductions (sums across elements of a vector) should be carried out in FP32. Such reductions
                  mostly come up in batch-normalization layers when accumulating statistics and softmax layers.
                  Both of the layer types in our implementations still read and write FP16 tensors from memory,
                  performing the arithmetic in FP32. This did not slow down the training process since these layers
                  are memory-bandwidth limited and not sensitive to arithmetic speed.
                  Point-wise operations, such as non-linearities and element-wise matrix products, are memory-
                  bandwidth limited. Since arithmetic precision does not impact the speed of these operations, either
                  FP16 or FP32 math can be used.
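The benefit of FP32 accumulation for dot-products and large reductions can be illustrated with a deliberately simple sum. The sketch emulates FP16 and FP32 rounding with Python's `struct` module; the helper names are assumptions, and the sequential loop is only a stand-in for how real hardware accumulates.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_with_fp16_accumulator(values) -> float:
    """Reduction whose running sum is rounded to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def sum_with_fp32_accumulator(values) -> float:
    """FP16 inputs, FP32 running sum, one conversion to FP16 on the final write."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + to_fp16(v))
    return to_fp16(acc)

values = [1.0] * 4096
fp16_result = sum_with_fp16_accumulator(values)
fp32_result = sum_with_fp32_accumulator(values)
print("FP16 accumulator:", fp16_result)
print("FP32 accumulator:", fp32_result)
```

With an FP16 accumulator the running sum stops growing once the spacing between neighbouring FP16 values exceeds the addend, whereas the FP32 accumulator reaches the exact total, which is itself representable in FP16.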

                  4 RESULTS

                  We have run experiments for a variety of deep learning tasks covering a wide range of deep learning
                  models. We conducted the following experiments for each application:

                      • Baseline (FP32): Single-precision storage is used for activations, weights and gradients.
                       All arithmetic is also in FP32.
                      • Mixed Precision (MP): FP16 is used for storage and arithmetic. Weights, activations and
                       gradients are stored in FP16; an FP32 master copy of weights is used for updates.
                       Loss-scaling is used for some applications. Experiments with FP16 arithmetic used Tensor
                       Core operations with accumulation into FP32 for convolutions, fully-connected layers, and
                       matrix multiplies in recurrent layers.

                  The Baseline experiments were conducted on NVIDIA’s Maxwell or Pascal GPU. Mixed Precision
                  experiments were conducted on Volta V100 that accumulates FP16 products into FP32. The mixed
                  precision speech recognition experiments (Section 4.3) were conducted using Maxwell GPUs using
                  FP16 storage only. This setup allows us to emulate the TensorCore operations on non-Volta hard-
                  ware. A number of networks were trained in this mode to conﬁrm that resulting model accuracies
                  are equivalent to MP training run on Volta V100 GPUs. This is intuitive since MP arithmetic was
                  accumulating FP16 products into FP32 before converting the result to FP16 on a memory write.

                  4.1 CNNS FOR ILSVRC CLASSIFICATION

                  We trained several CNNs for the ILSVRC classification task (Russakovsky et al., 2015) using mixed
                  precision: Alexnet, VGG-D, GoogLeNet, Inception v2, Inception v3, and pre-activation Resnet-50.
                  In all of these cases we were able to match the top-1 accuracy of the baseline FP32 training session
                  using identical hyper-parameters. Networks were trained using the Caffe (Jia et al., 2014) framework
                  modified to use Volta TensorOps, except for Resnet50, which used PyTorch (Paszke et al., 2017).
                  Training schedules were taken from public repositories, when available (the training schedule for VGG-D
                 has not been published). Top-1 accuracies on the ILSVRC validation set are shown in Table 1. Baseline
                 (FP32) accuracy in a few cases is different from published results due to single-crop testing and a
                 simpler data augmentation. Our data augmentation in Caffe included random horizontal flipping and
                 random cropping from 256x256 images; Resnet50 training in PyTorch used the full augmentation in
                 the training script from the PyTorch vision repository.

                                  Table 1: ILSVRC12 classiﬁcation top-1 accuracy.

                          <<TABLE>>


                 The loss-scaling technique was not required for successful mixed precision training of these networks.
                 While all tensors in the forward and backward passes were in FP16, a master copy of weights was
                 updated in FP32 as outlined in Section 3.1.

                  4.2 DETECTION CNNS

                  Object detection is a regression task, where bounding box coordinate values are predicted by the
                  network (compared to classiﬁcation, where the predicted values are passed through a softmax layer
                  to convert them to probabilities). Object detectors also have a classiﬁcation component, where prob-
                  abilities for an object type are predicted for each bounding box. We trained two popular detection
                  approaches: Faster-RCNN (Ren et al., 2015) and Multibox-SSD (Liu et al., 2015a). Both detectors
                  used VGG-16 network as the backbone. Models and training scripts were from public repositories
                  (Girshick; Liu). Mean average precision (mAP) was computed on Pascal VOC 2007 test set. Faster-
                  RCNN was trained on VOC 2007 training set, whereas SSD was trained on a union of VOC 2007
                  and 2012 data, which is the reason behind baseline mAP difference in Table 2.

                                 Table 2: Detection network mean average precision (mAP).

                            <<TABLE>>


                 As can be seen in Table 2, the SSD detector failed to train in FP16 without loss-scaling. When
                 small gradient values are lost to zeros, as described in Section 3.2, poor weights are learned and
                 training diverges. A loss-scaling factor of 8 recovers the relevant gradient values, and
                 mixed-precision training then matches FP32 mAP.

                  4.3 SPEECH RECOGNITION

                  We explore mixed precision training for speech data using the DeepSpeech 2 model for both English
                  and Mandarin datasets. The model used for training on the English dataset consists of two 2D convolution
                  layers, three recurrent layers with GRU cells, one row convolution layer, and a connectionist
                  temporal classification (CTC) cost layer (Graves et al., 2006). It has approximately 115 million
                  parameters. This model is trained on our internal dataset consisting of 6000 hours of English speech.
                  The Mandarin model has a similar architecture with a total of 215 million parameters. The Man-
                  darin model was trained on 2600 hours of our internal training set. For these models, we run the
                  Baseline and Pseudo FP16 experiments. All the models were trained for 20 epochs using Nesterov
                  Stochastic Gradient Descent (SGD). All hyper-parameters such as learning rate, annealing schedule
                  and momentum were the same for baseline and pseudo FP16 experiments. Table 3 shows the results
                  of these experiments on independent test sets.

                 Table 3: Character Error Rate (CER) using mixed precision training for speech recognition. English
                 results are reported on the WSJ ’92 test set. Mandarin results are reported on our internal test set.

                                   <<TABLE>>

                 Similar to classiﬁcation and detection networks, mixed precision training works well for recurrent
                 neural networks trained on large scale speech datasets. These speech models are the largest models
                 trained using this technique. Also, the number of time-steps involved in training a speech model is
                 unusually large compared to other applications using recurrent layers. As shown in Table 3, Pseudo
                 FP16 results are roughly 5 to 10% better than the baseline. This suggests that the half-precision
                 storage format may act as a regularizer during training.

                            <<TABLE>>

                 Figure 4: English to French translation network training perplexity, 3x1024 LSTM model with
                 attention. Ref1, ref2 and ref3 represent three different FP32 training runs.

                  4.4 MACHINE TRANSLATION

                  For language translation we trained several variants of the model in the TensorFlow tutorial for
                  English to French translation (Google). The model used word vocabularies of 100K and 40K entries for
                  English and French, respectively. The networks we trained had 3 or 5 layers in each of the encoder and
                  decoder. In both cases a layer consisted of 1024 LSTM cells. The SGD optimizer was used to
                  train on the WMT15 dataset. There was a noticeable variation in accuracy of different training sessions
                  with the same settings. For example, see the three FP32 curves in Figure 4, which shows the 3-layer
                  model. Mixed-precision with loss-scaling matched the FP32 results, while no loss-scaling resulted
                  in a slight degradation in the results. The 5-layer model exhibited the same training behavior.

                  4.5 LANGUAGE MODELING

                 We trained an English language model, designated as big LSTM (Jozefowicz et al., 2016), on the 1
                 billion word dataset. The model consists of two layers of 8192 LSTM cells with projection to a
                 1024-dimensional embedding. This model was trained for 50 epochs using the Adagrad optimizer.
                 The vocabulary size is 793K words. During training, we use a sampled softmax layer with 8K
                 negative samples. Batch size aggregated over 4 GPUs is 1024. To match FP32 perplexity, training
                 this network with FP16 requires loss-scaling, as shown in Figure 5. Without loss scaling, the training
                 perplexity curve for FP16 training diverges from that of FP32 training after 300K iterations.
                 A scaling factor of 128 recovers all the relevant gradient values, and the accuracy of FP16 training
                 matches the baseline run.

                  4.6 DCGAN RESULTS

                 Generative Adversarial Networks (GANs) combine regression and discrimination tasks during train-
                 ing. For image tasks, the generator network regresses pixel colors. In our case, the generator predicts
                 three channels of 8-bit color values each. The network was trained to generate 128x128 pixel im-
                 ages of faces, using DCGAN methodology (Radford et al., 2015) and CelebFaces dataset (Liu et al.,
                 2015b). The generator had 7 layers of fractionally-strided convolutions, 6 with leaky ReLU activations
                 and 1 with tanh. The discriminator had 6 convolutions and 2 fully-connected layers. All used
                  leaky ReLU activations except for the last layer, which used sigmoid. Batch normalization was applied
                  to all layers except the last fully-connected layer of the discriminator. The Adam optimizer was
                  used to train for 100K iterations. A set of output images is shown in Figure 6. Note that we show a randomly
                  selected set of output images, whereas GAN publications typically show a curated set of outputs by
                  excluding poor examples. Unlike other networks covered in this paper, GANs do not have a widely-
                  accepted quantiﬁcation of their result quality. Qualitatively the outputs of FP32 and mixed-precision
                  training appear comparable. This network did not require loss-scaling to match FP32 results.

                                                  <<FIGURE>>

                                      Figure 5: bigLSTM training perplexity

                                <<FIGURE>>

                 Figure 6: An uncurated set of face images generated by DCGAN. FP32 training (left) and mixed-
                 precision training (right).

                  5 CONCLUSIONS AND FUTURE WORK

                  Mixed precision training is an important technique that allows us to reduce the memory consumption
                  as well as time spent in memory and arithmetic operations of deep neural networks. We have
                  demonstrated that many different deep learning models can be trained using this technique with no
                  loss in accuracy and without any hyper-parameter tuning. For certain models with a large number of
                  small gradient values, we introduce the gradient scaling method to help them converge to the same
                  accuracy as FP32 baseline models.
                  DNN operations benchmarked with DeepBench on Volta GPU see 2-6x speedups compared to
                 FP32 implementations if they are limited by memory or arithmetic bandwidth. Speedups are lower
                 when operations are latency-limited. Full network training and inference speedups depend on library
                 and framework optimizations for mixed precision and are a focus of future work (experiments in this
                 paper were carried out with early versions of both libraries and frameworks).
                 We would also like to extend this work to include generative models like text-to-speech systems
                 and deep reinforcement learning applications. Furthermore, automating loss-scaling factor selection
                 would further simplify training with mixed precision. The loss-scaling factor could be dynamically
                 increased or decreased by inspecting the weight gradients for overflow, skipping weight updates
                 when an overflow is detected.
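One possible shape for such an automated scheme is sketched below. The paper only outlines the idea, so everything concrete here is an assumption: the class name, the initial scale, the factor of 2, and the rule of growing the scale after a fixed number of overflow-free steps.

```python
import math

class DynamicLossScaler:
    """Sketch of overflow-driven loss-scale adjustment (all constants are assumptions)."""

    def __init__(self, init_scale: float = 2.0 ** 15, factor: float = 2.0,
                 growth_interval: int = 1000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self._good_steps = 0   # consecutive steps without overflow

    def update(self, scaled_grads) -> bool:
        """Return True if the weight update should be applied, False if it is skipped."""
        overflow = any(not math.isfinite(g) for g in scaled_grads)
        if overflow:
            # Overflow detected: skip this update and decrease the scale.
            self.scale /= self.factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A run of clean steps suggests there is headroom: increase the scale.
            self.scale *= self.factor
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
history = []
for grads in ([float("inf")], [1.0], [1.0]):
    applied = scaler.update(grads)
    history.append((applied, scaler.scale))
    print("update applied:", applied, "| scale is now", scaler.scale)
```

The design choice of halving on overflow and doubling after a clean run keeps the factor a power of two, so scaling and unscaling change only the exponent of each gradient value.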

                                                  REFERENCES

                 D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski,
                   A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in english and
                   mandarin. In Proceedings of The 33rd International Conference on Machine Learning, pages
                   173–182, 2016.
                 K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio.
                   Learning phrase representations using rnn encoder-decoder for statistical machine translation.
                   arXiv preprint arXiv:1406.1078, 2014.
                 M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with
                   binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
                   and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages
                   3123–3131. Curran Associates, Inc., 2015. URL
                   http://papers.nips.cc/paper/5647-binaryconnect-training-deep-neural-networks-with-binary-weights-during-propagations.pdf.
                 R. Girshick. Faster r-cnn github repository. https://github.com/rbgirshick/py-faster-rcnn.
                 Google. Tensorflow tutorial: Sequence-to-sequence models. URL
                   https://www.tensorflow.org/tutorials/seq2seq.
                 A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification:
                   labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd
                   international conference on Machine learning, pages 369–376. ACM, 2006.
                 S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical
                   precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15),
                   pages 1737–1746, 2015.
                 A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh,
                   S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint
                   arXiv:1412.5567, 2014.
                 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings
                   of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
                 K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016b.
                 Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou. Effective quantization methods for
                   recurrent neural networks. arXiv preprint arXiv:1611.10176, 2016c.
                 S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov.
                   1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL
                   http://dx.doi.org/10.1162/neco.1997.9.8.1735.
                 I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In
                   Advances in Neural Information Processing Systems, pages 4107–4115, 2016a.
                 I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural net-
                   works: Training neural networks with low precision weights and activations. arXiv preprint
                   arXiv:1609.07061, 2016b.
                 S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing
                   internal covariate shift. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of
                   JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015. URL
                   http://dblp.uni-trier.de/db/conf/icml/icml2015.html#IoffeS15.
                 Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell.
                   Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093,
                   2014.
                 R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language
                   modeling, 2016. URL https://arxiv.org/pdf/1602.02410.pdf.
                 A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
                   neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,
                   editors, Advances in Neural Information Processing Systems 25, pages 1097–1105.
                   Curran Associates, Inc., 2012. URL
                   http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
                 W. Liu. Ssd github repository. https://github.com/weiliu89/caffe/tree/ssd.
                 W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. Ssd: Single shot multibox detector.
                   CoRR, abs/1512.02325, 2015a. URL
                   http://dblp.uni-trier.de/db/journals/corr/corr1512.html#LiuAESR15.
                 Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of
                   International Conference on Computer Vision (ICCV), 2015b.
                 A. Mishra, E. Nurvitadhi, J. Cook, and D. Marr. Wrpn: Wide reduced-precision networks. arXiv
                   preprint arXiv:1709.01134, 2017.
                 NVIDIA. Nvidia tesla v100 gpu architecture.
                   https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf,
                   2017.
                 J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio. Recurrent neural networks with limited numerical
                   precision. arXiv preprint arXiv:1608.06902, 2016.
                 A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga,
                   and A. Lerer. Automatic differentiation in pytorch. 2017.
                 A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional
                   generative adversarial networks. CoRR, abs/1511.06434, 2015. URL
                   http://dblp.uni-trier.de/db/journals/corr/corr1511.html#RadfordMC15.
                 M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using
                   Binary Convolutional Neural Networks, pages 525–542. Springer International Publishing, Cham,
                   2016. ISBN 978-3-319-46493-0. doi: 10.1007/978-3-319-46493-0_32. URL
                   https://doi.org/10.1007/978-3-319-46493-0_32.
                 S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
                   region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
                 O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
                   M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
                   International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
                   doi: 10.1007/s11263-015-0816-y.
                 K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
                   arXiv preprint arXiv:1409.1556, 2014.
                 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
                   A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR),
                   2015. URL http://arxiv.org/abs/1409.4842.
                 C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture
                   for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition
                   (CVPR), June 2016.
                 Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,
                   K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human
                   and machine translation. arXiv preprint arXiv:1609.08144, 2016.
                 S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. Dorefa-net: Training low bitwidth
                   convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL
                   http://arxiv.org/abs/1606.06160.
<|endoftext|>


<|startoftext|>
Learning to Generalize 
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING 
MANFRED OPPER 
Neural Computation Research Group Aston University Birmingham B4 7ET, United Kingdom 

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics, and that aims to model typical learning behavior, is compared with a worst-case framework. 

Introduction 
Neural networks learn from examples. This statement is obviously true for the brain, but artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, also adapt their synaptic couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent of the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network acquires some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. 
To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data. 
How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: The network may not be sufficiently complex to learn the rule completely or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data. 
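The measurement described above, repeating the learning experiment many times with fresh data and counting misclassified test inputs, can be sketched as a small simulation. Everything concrete here is an assumption made for illustration: the rule is a randomly drawn linear classifier, inputs are independent Gaussian vectors, and the network is a single unit trained with a simple Hebbian rule rather than by minimizing the training error.

```python
import random

def sign(z: float) -> int:
    return 1 if z >= 0 else -1

def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))

def estimate_generalization_error(num_examples: int, dim: int = 20, trials: int = 20,
                                  test_size: int = 500, seed: int = 0) -> float:
    """Average misclassification rate on fresh inputs over repeated experiments."""
    rng = random.Random(seed)

    def draw_input():
        # Inputs are drawn independently from the same distribution.
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    errors = 0
    for _ in range(trials):
        rule = draw_input()          # hidden rule that provides the labels
        couplings = [0.0] * dim      # adjustable parameters of the network
        for _ in range(num_examples):
            x = draw_input()
            label = sign(dot(rule, x))
            # Hebbian learning (assumed for simplicity): add label * input.
            couplings = [w + label * xi for w, xi in zip(couplings, x)]
        for _ in range(test_size):
            x = draw_input()         # a test input not seen during training
            errors += sign(dot(couplings, x)) != sign(dot(rule, x))
    return errors / (trials * test_size)

few = estimate_generalization_error(num_examples=5)
many = estimate_generalization_error(num_examples=200)
print("estimated generalization error with 5 examples:  ", few)
print("estimated generalization error with 200 examples:", many)
```

Plotting such estimates against the number of examples would trace out an empirical learning curve for this toy setting.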
Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit, where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data. 
A second important approach, which was introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, they are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide too pessimistic results which are also too crude to reveal interesting behavior in the intermediate region of the learning curve. 
In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables the exact calculation of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures. 
At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well-known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feed-forward neural networks. 

Artificial Neural Networks 

Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers. 
Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multiplied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, <<FORMULA>>, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between −1 and 1. 
Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large. 
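The computation of a single unit can be sketched in a few lines of Python. This is only an illustration: the function name, the use of tanh as the sigmoidal function, and the numerical values are assumptions, not taken from Fig. 1a.

```python
import math

def unit_output(inputs, weights, activation="sigmoid"):
    """Weighted sum of the inputs, passed through an activation function."""
    h = sum(w * x for w, x in zip(weights, inputs))
    if activation == "sigmoid":
        return math.tanh(h)          # soft classification between -1 and 1 (tanh is an assumed choice)
    if activation == "step":
        return 1 if h >= 0 else -1   # hard classification; h = 0 is mapped to +1 by convention
    if activation == "linear":
        return h                     # used for fitting continuous functions
    raise ValueError(f"unknown activation: {activation}")

# Hypothetical example with three couplings
example_inputs = [1.0, -1.0, 0.5]
example_weights = [0.2, 0.4, -0.6]
linear_value = unit_output(example_inputs, example_weights, "linear")
step_value = unit_output(example_inputs, example_weights, "step")
soft_value = unit_output(example_inputs, example_weights, "sigmoid")
```

All three activations act on the same weighted sum, so switching between them changes only how that sum is turned into an output.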

<<FIGURE>>

FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information. 

LEARNING TO GENERALIZE 

The Perceptron 

The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings <<FORMULA>>, and the output is simply 

<<FORMULA>>

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines. To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights <<FORMULA>>, for <<FORMULA>>). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. 

<<FIGURE>>

FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input. 

In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign but we decrease them for the opposite sign. This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of learning processes in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classifies all of the examples correctly (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant. 
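A minimal sketch of the update rule just described is given below. The function name, the step size, the sweep limit, and the teacher-generated test data are assumptions made for illustration. For inputs equal to +1 or −1 the update changes each weight by a fixed amount, as in the text; for real-valued inputs the same line scales the change by the input value, which is the usual textbook form.

```python
import random

def train_perceptron(patterns, labels, step=1.0, max_sweeps=10_000):
    """Cycle through the examples and correct the couplings after every mistake.

    Returns the couplings and a flag telling whether a full sweep without
    mistakes was reached within max_sweeps.
    """
    w = [0.0] * len(patterns[0])
    for _ in range(max_sweeps):
        mistakes = 0
        for x, y in zip(patterns, labels):
            h = sum(wi * xi for wi, xi in zip(w, x))
            if y * h <= 0:  # wrong side of (or exactly on) the boundary
                # For x_i = +1 or -1: raise w_i by `step` if x_i and y have the
                # same sign, lower it by `step` otherwise.
                w = [wi + step * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w, True
    return w, False

# Hypothetical usage: the labels come from a random teacher, so they are learnable.
rng = random.Random(0)
n, m = 20, 20
teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]
patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
labels = [1 if sum(t * x for t, x in zip(teacher, p)) >= 0 else -1 for p in patterns]
w, converged = train_perceptron(patterns, labels)
```

The sweep limit is only a safeguard: for examples that cannot be separated the loop would otherwise never terminate.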
It is often useful to obtain an intuition of a perceptron's classification performance by thinking in terms of a geometric picture. We may view the numerical values of the inputs as the coordinates of a point in some (usually) high-dimensional space. The case of two dimensions is shown in Fig. 2b. A corresponding point is also constructed for the couplings w_i. The arrow which points from the origin of the coordinate system to this latter point is called the weight vector or coupling vector. An application of linear algebra to the computation of the network shows that the line which is perpendicular to the coupling vector is the boundary between inputs belonging to the two different classes. Input points which are on the same side as the coupling vector are classified as +1 (the green region in Fig. 2b) and those on the other side as −1 (red region in Fig. 2b). 
Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line in the previous two-dimensional example. We can still obtain an intuitive picture by projecting on two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled red and blue) in a 200-dimensional input space are projected on the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm) we obtain the view shown in Fig. 3b, in which red and green points are clearly separated and there is even a gap between the two clouds. 
It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron. If this happens, we must attempt to determine the choice of the couplings which minimizes the number of errors on a given set of examples. Here, Rosenblatt's algorithm does not work and the problem of finding the minimum is much more difficult from an algorithmic point of view. The training error, which is the number of errors made on the training set, is usually a non-smooth function of the network couplings (i.e., it may have large variations for small changes of the couplings). Hence, in general, apart from the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is a difficult task which could take a large amount of computer time. In practice, however, iterative approaches, which are based on the minimization of other smooth cost functions, are used to train a neural network (Bishop, 1995). 
As previously shown, perceptrons are only able to realize a very restricted type of classification rules, the so-called linearly separable ones. Hence, independently from the issue of finding the best algorithm to learn the rule, one may ask the following question: In how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning. 

<<FIGURE>>

FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x1 and x2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron. 

In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated for any set of m input patterns the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished: 
Region in which m/N ≤ 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters). 
Region in which m/N > 1: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), if the network is large enough almost all mappings can be learned. If the output labels for each of the m inputs are chosen randomly as +1 or −1 with equal probability, the probability of finding a nonrealizable mapping goes to zero exponentially when N goes to infinity at fixed ratio m/N. 
Region in which m/N > 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases to zero rapidly, and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to <<FORMULA>>, where the function <<FORMULA>> vanishes for α < 2 and is positive for α > 2, with α = m/N). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size. 

<<FIGURE>>

FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red). Vertical axis: fraction of realizable mappings. 
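The three regions can be reproduced numerically. The sketch below uses the standard closed form of Cover's counting result for m points in general position, C(m, N) = 2 Σ_{k=0}^{N−1} binom(m−1, k), divided by 2^m. The formula is not written out in the text above and is supplied here as an assumption; the function name is illustrative.

```python
from math import comb

def separable_fraction(m, n):
    """Fraction of the 2**m labelings of m points in general position
    that a perceptron with n couplings can realize (Cover's counting formula)."""
    if m == 0:
        return 1.0
    count = 2 * sum(comb(m - 1, k) for k in range(n))
    return count / 2 ** m

# Values at the region boundaries and on either side of m/N = 2 for N = 100
at_capacity_half = separable_fraction(200, 100)   # m/N = 2
below = separable_fraction(150, 100)              # m/N = 1.5
above = separable_fraction(300, 100)              # m/N = 3
```

For m ≤ N the sum covers all binomial coefficients, so the fraction is exactly 1; at m = 2N it is exactly one half for every N, which is why the curves in Fig. 4 steepen around that point as N grows.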

Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained to learn m examples of the training set? 
To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single pattern for testing. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5 assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2. 
<<FIGURE>>

FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible. 

This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the type of classifier. It equals N for the perceptron. Vapnik and Chervonenkis were able to show that for any training set of size m larger than the VC dimension D_VC, the growth of the number of realizable mappings is bounded by an expression which grows much slower than 2^m (in fact, only like a polynomial in m). 
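To make the polynomial growth concrete, the sketch below evaluates the standard combinatorial bound usually attached to this statement, Σ_{k=0}^{D_VC} binom(m, k), often called Sauer's lemma. The text above does not name the bound, so using this particular expression is an assumption, and the function name is illustrative.

```python
from math import comb

def realizable_mappings_bound(m, d_vc):
    """Upper bound on the number of mappings of m inputs realizable by a
    classifier with VC dimension d_vc (Sauer's lemma)."""
    return sum(comb(m, k) for k in range(d_vc + 1))

# Compare the bound with the total number 2**m of mappings for d_vc = 5
comparison = [(m, realizable_mappings_bound(m, 5), 2 ** m) for m in (5, 10, 20, 40)]
```

Up to m = D_VC the bound coincides with 2^m; beyond that it grows only like m raised to the power D_VC, so the realizable mappings become a vanishing fraction of all mappings.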
They proved that a large difference between training error (i.e., the minimum percentage of errors that is made on the training set) and generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above D_VC. This theorem implies that perfect learning of the training set results in a small expected generalization error. The expected generalization error is bounded by a quantity which increases proportionally to D_VC and decreases (neglecting logarithmic corrections in m) inversely proportionally to m. Conversely, one can construct a worst-case distribution of input patterns, for which a size of the training set larger than D_VC is also necessary for good generalization. The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data. 
The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates by the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which are obtained from an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules. 

<<FIGURE>>

FIGURE 6 As the complexity of the network varies (i.e., the number of hidden units, as shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik and Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability. 

Typical Scenario: The Approach of Statistical Physics 

When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation? As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher-student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern is between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together and decreases to zero when both coincide. 
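The proportionality can be made explicit: the two wedges between the separating planes cover a fraction 2θ/2π of the circle, so the generalization error is θ/π, where θ is the angle between ST and TE. The sketch below computes this value and compares it with a direct sampling estimate. The function names are illustrative, and the use of isotropic Gaussian inputs as a stand-in for the uniform distribution of directions is an assumption.

```python
import math
import random

def generalization_error(student, teacher):
    """Angle between the two coupling vectors divided by pi."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm = math.sqrt(sum(s * s for s in student)) * math.sqrt(sum(t * t for t in teacher))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.acos(cos_angle) / math.pi

def sampled_error(student, teacher, samples=20_000, seed=0):
    """Fraction of random inputs on which student and teacher disagree."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in student]  # isotropic input direction
        h_student = sum(s * xi for s, xi in zip(student, x))
        h_teacher = sum(t * xi for t, xi in zip(teacher, x))
        if h_student * h_teacher < 0:
            disagreements += 1
    return disagreements / samples

# Hypothetical two-dimensional example with a 45 degree angle between the vectors
exact = generalization_error([1.0, 0.0], [1.0, 1.0])
estimate = sampled_error([1.0, 0.0], [1.0, 1.0])
```

Only the directions of the coupling vectors enter, so rescaling the student or the teacher leaves the error unchanged.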
In the limit, when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from the teacher, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case. 

<<FIGURE>>

FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively. 

The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then <<FORMULA>> would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets. 
From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of their behavior on the training set, and multiplying it by the probability B that a student with generalization error ε gives m times the correct answers on independent drawings of the input patterns. Since A increases exponentially with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN. Here α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N → ∞, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms. The first one (which is often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε = 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig. 8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of <<FORMULA>> at some value of ε which by definition is the typical generalization error. 
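The balance between the two terms can be tried out numerically. The sketch below assumes a perceptron with continuous couplings of fixed length and isotropic inputs, for which the averaged volume gives, up to an additive constant, (1/N) ln V(ε) ≈ ln sin(πε) + α ln(1 − ε): the first term is the entropic one (A grows like sin(πε)^N) and the second the energetic one (B = (1 − ε)^m). These explicit expressions are not given in the text and are stated here as assumptions; the function names are illustrative.

```python
import math

def log_volume_per_coupling(eps, alpha):
    """Entropic plus energetic term of (1/N) ln V(eps), averaged over training sets."""
    entropic = math.log(math.sin(math.pi * eps))   # increases with eps on (0, 0.5]
    energetic = alpha * math.log(1.0 - eps)        # decreases with eps
    return entropic + energetic

def typical_error(alpha, grid=20_000):
    """Value of eps in (0, 0.5] that maximizes the averaged log-volume (grid search)."""
    candidates = (0.5 * k / grid for k in range(1, grid + 1))
    return max(candidates, key=lambda eps: log_volume_per_coupling(eps, alpha))

# Typical generalization error for a few training set sizes alpha = m/N
learning_curve = [(alpha, typical_error(alpha)) for alpha in (0.0, 1.0, 2.0, 5.0, 20.0)]
```

Without examples (α = 0) only the entropic term is present and the maximum sits at ε = 0.5; as α grows the energetic term pulls the maximum toward ε = 0. Because the average over training sets replaces the correct treatment of the randomness, the resulting curve should be read as qualitative only.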
The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter term dominates when the number of examples is large. It will later be shown that such a competition can lead to a rich and interesting behavior as the number of examples is varied. The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al., 1990) of a perceptron obtained by the statistical physics approach (treating the random sampling the proper way) is shown by the red curve of Fig. 9. In contrast to the worst-case predictions of the VC theory, it is possible to have some generalization ability below VC dimension or capacity. As we might have expected, the generalization error decreases monotonically, showing that the more that is learned, the more that is understood. Asymptotically, the error is proportional to N and inversely proportional to m, in agreement with the VC predictions. This may not be true for more complicated networks. 

<<FIGURE>>

FIGURE 8 Logarithm of the average volume of students that have learned m examples and give ε generalization error (green curve). The blue and red curves represent the energetic and entropic contributions, respectively. 

<<FIGURE>>

FIGURE 9 Learning curves for typical student perceptrons. α = m/N is the ratio between the number of examples and the number of couplings. 

Query Learning 

Soon after Gardner's pioneering work, it was realized that the approach of statistical physics is closely related to ideas in information theory and Bayesian statistics (Levin et al., 1989; Györgyi and Tishby, 1990; Opper and Haussler, 1991), for which the reduction of an initial uncertainty about the true state of a system (teacher) by observing data is a central topic of interest. The logarithm of the volume of relevant microstates as defined in the previous section is a direct measure for such uncertainty. The moderate progress in generalization ability displayed by the red learning curve of Fig. 9 can be understood by the fact that as learning progresses less information about the teacher is gained from a new random example. Here, the information gain is defined as the reduction of the uncertainty when a new example is learned. The decrease in information gain is due to the increase in the generalization performance. This is plausible because inputs for which the majority of student networks give the correct answer are less informative than those for which a mistake is more likely. The situation changes if the student is free to ask the teacher questions, i.e., if the student can choose highly informative input patterns. For the simple perceptron a fruitful query strategy is to select a new input vector which is perpendicular to the current coupling vector of the student (Kinzel and Ruján, 1990). Such an input is a highly ambiguous pattern because small changes in the student couplings produce different classification answers. For more complicated networks it may be difficult to obtain similar ambiguous inputs by an explicit construction. A general algorithm has been proposed (Seung et al., 1992a) which uses the principle of maximal disagreement in a committee of several students as a selection process for training patterns. Using an appropriate randomized training strategy, different students are generated which all learn the same set of examples. Next, any new input vector is only accepted for training when the disagreement of its classification between the students is maximal. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast. 

Bad Students and Good Students 

Although the typical student perceptron has a smooth, monotonically decreasing learning curve, the possibility that some concrete learning algorithm may result in a set of student couplings which are untypical in the sense of our theory cannot be ruled out. For bad students, even nonmonotonic generalization behavior is possible. The problem of a concrete learning algorithm can be made to fit into the statistical physics framework if the algorithm minimizes a certain cost function. Treating the achieved values of the new cost function as a macroscopic constraint, the tools of statistical physics apply again. 
As an example, it is convenient to consider a case in which the teacher and the student have different architectures: In one of the simplest examples one tries to learn a classification problem by interpreting it as a regression problem, i.e., a problem of fitting a continuous function through data points. To be specific, we study the situation in which the teacher network is still given by a perceptron which computes binary valued outputs of the form y = sign(Σ_i w_i x_i) = ±1, but as the student we choose a network with a linear transfer function (the yellow curve in Fig. 1a) 

<<FORMULA>>

and try to fit this linear expression to the binary labels of the teacher. If the number of couplings is sufficiently large (larger than the number of examples) the linear function (unlike the sign function) is perfectly able to fit arbitrary continuous output values. This linear fit is an attempt to explain the data in a more complicated way than necessary, and the couplings have to be finely tuned in order to achieve this goal. We find that the student trained in such a way does not generalize well (Opper and Kinzel, 1995). In order to compare the classifications of teacher and student on a new random input after training, we finally convert the student's output into a classification label by taking the sign of its output. As shown by the red curve of Fig. 10, after an initial improvement of performance the generalization error increases again to the random guessing value ε = 0.5 at α = 1. This phenomenon is called overfitting. For α > 1 (i.e., for more data than parameters), it is no longer possible to have a perfect linear fit through the data, but a fit with a minimal deviation from a linear function leads to the second part of the learning curve. Here ε decreases again and approaches 0 asymptotically for α → ∞. This shows that when enough data are available, the details of the training algorithm are less important. 
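The overfitting peak can be observed in a small simulation. The sketch below is an illustration under several assumptions: Gaussian random inputs, a finite network with N = 100 couplings, the minimum-norm least-squares solution returned by NumPy as the linear fit, and the angle formula ε = θ/π for the error of the sign of the student's output. The function name and all parameter values are hypothetical.

```python
import numpy as np

def linear_student_error(alpha, n=100, trials=20, seed=0):
    """Average generalization error of a linear student fitted to the
    binary labels of a perceptron teacher, for m = alpha * n examples."""
    rng = np.random.default_rng(seed)
    m = max(1, int(round(alpha * n)))
    errors = []
    for _ in range(trials):
        teacher = rng.standard_normal(n)
        x = rng.standard_normal((m, n))
        y = np.sign(x @ teacher)                      # binary teacher labels
        # Least-squares fit; for m < n this is the minimum-norm exact fit
        student, *_ = np.linalg.lstsq(x, y, rcond=None)
        cos_angle = student @ teacher / (np.linalg.norm(student) * np.linalg.norm(teacher))
        errors.append(np.arccos(np.clip(cos_angle, -1.0, 1.0)) / np.pi)
    return float(np.mean(errors))

# Errors below, at, and above alpha = 1
error_below = linear_student_error(0.5)
error_at_one = linear_student_error(1.0)
error_above = linear_student_error(3.0)
```

At finite N the peak at α = 1 is rounded rather than reaching exactly 0.5, but the error there should still exceed the values on either side.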
The dependence of the generalization performance on the complexity of the assumed data model is well-known. If a function class is used that is too complex, data values can be perfectly fitted but the predicted function will be very sensitive to the variations of the data sample, leading to very unreliable predictions on novel inputs. On the other hand, functions that are too simple make the best fit almost insensitive to the data, which prevents us from learning enough from them. 
It is also possible to calculate the worst-case generalization ability of perceptron students learning from a perceptron teacher. The largest generalization error is obtained (Fig. 7) when the angle between the coupling vectors of teacher and student is maximized under the constraint that the student learns all examples perfectly. Although it may not be easy to construct a learning algorithm which performs such a maximization in practice, the resulting generalization error can be calculated using the statistical physics approach (Engel and Van den Broeck, 1993). The result is in agreement with the VC theory: There is no prediction better than random guessing below the capacity. 
Although the previous algorithms led to a behavior which is worse than the typical one, we now examine the opposite case of an algorithm which does better. Since the generalization ability of a neural network is related to the fact that similar input vectors are mapped onto the same output, one can assume that such a property can be enhanced if the separating gap between the two classes is maximized, which defines a new cost function for an algorithm. This optimal margin perceptron can be practically realized and when applied to a set of data leads to the projection of Fig. 11. As a remarkable result, it can be seen that there is a relatively large fraction of patterns which are located at the gap. These points are called support vectors (SVs). In order to understand their importance for the generalization ability, we make the following gedankenexperiment and assume that all the points which lie outside the gap (the nonsupport vectors) are eliminated from the training set of examples. 
From the two-dimensional projection of Fig. 11, we may conjecture that by running the maximal margin algorithm on the remaining examples (the SVs) we cannot create a larger gap between the points. Hence, the algorithm will converge to the same separating hyperplane as before. This intuitive picture is actually correct. If the SVs of a training set were known beforehand (unfortunately, they are only identified after running the algorithm), the margin classifier would have to be trained only on the SVs. It would automatically classify the rest of the training inputs correctly. 
FIGURE 11 Learning with a margin classifier and m = 300 examples in an N = 150-dimensional space. 

Bias/Variance trade-off

Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability. 
The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinear separable rules can be learned, providing an interesting alternative to multilayer networks. 

The Ising Perceptron 

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings wj are constrained to binary values +1 and −1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N. 
To obtain the learning curve for the typical student, we can proceed as before, replacing V(e) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of a. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small a, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number a_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression of the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For a slightly below a_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For a slightly above a_c only the couplings of the teacher survive. 

Learning with Errors 

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). 
A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures (T) we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples a by the effective number a/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples. 
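The acceptance rule described above can be illustrated with a minimal Python sketch. It is not taken from the article: the function names, the single-coupling flip proposal, the restriction to binary couplings, and the standard Metropolis acceptance probability exp(-delta/T) are all assumptions chosen for illustration.

```python
import math
import random

def training_error(w, examples):
    """Count the examples that the perceptron sign(w . x) misclassifies."""
    errors = 0
    for x, label in examples:
        field = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if field >= 0 else -1
        if prediction != label:
            errors += 1
    return errors

def metropolis_train(examples, n, temperature, steps, rng):
    """Metropolis sampling over binary (+1/-1) couplings of length n (assumed setup)."""
    w = [rng.choice((-1, 1)) for _ in range(n)]
    energy = training_error(w, examples)
    for _ in range(steps):
        j = rng.randrange(n)
        w[j] = -w[j]  # propose flipping a single coupling
        new_energy = training_error(w, examples)
        delta = new_energy - energy
        # Decreases (and ties) are always accepted; increases are accepted
        # with probability exp(-delta / T), which grows with the temperature.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
        else:
            w[j] = -w[j]  # reject the move: undo the flip
    return w, energy
```

Calling `metropolis_train(examples, n, temperature=0.5, steps=500, rng=random.Random(0))` returns the final couplings together with their training error. Lower temperatures make the search greedier, while higher temperatures accept more error-increasing moves.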
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of e in more detail (Fig. 12). The green curve is obtained for a small value of a where a strong maximum with high generalization error exists. By increasing a, this maximum decreases until it is the same as the second maximum at e = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger a, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with e = 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of a (yellow curve), the stochastic algorithm will converge always to the state of perfect generalization. On the other hand, since the state with e = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed. 

<<FIGURE>>

FIGURE 12 Logarithm of the number of relevant Ising students for different values of a. 
More Sophisticated Computations Are Needed for Multilayer Networks 
As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output. 
Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units; that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs; that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction. 
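The two prewired output functions can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, the inputs are assumed to split evenly into one block per hidden unit (the tree architecture), and ties in a sign are resolved to +1 by convention, so an odd number of hidden units is assumed for the committee machine.

```python
def sign(value):
    """Return +1 for non-negative values and -1 otherwise (tie convention assumed)."""
    return 1 if value >= 0 else -1

def hidden_outputs(couplings, x):
    """Tree architecture: hidden unit k sees only its own block of the input."""
    block = len(x) // len(couplings)
    outputs = []
    for k, w in enumerate(couplings):
        segment = x[k * block:(k + 1) * block]
        outputs.append(sign(sum(wi * xi for wi, xi in zip(w, segment))))
    return outputs

def committee_output(hidden):
    """Committee machine: majority vote of the hidden units."""
    return sign(sum(hidden))

def parity_output(hidden):
    """Parity machine: -1 for an odd number of negative hidden units, +1 otherwise."""
    result = 1
    for h in hidden:
        result *= h
    return result
```

For hidden outputs `[1, -1, -1]`, the committee machine returns -1 (the majority is negative) while the parity machine returns +1 (an even number of negative units).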
Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter. In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different. 

<<TABLE>>

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with a and which for sufficiently large a will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed. 
Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to −1 acts exactly as the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks. 

Outlook

The worst-case approach of the VC theory and the typical case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. 
Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications. 
The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research. 
Acknowledgments 

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality. 

References Cited 
AMARI, S., and MURATA, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140. 
BARKAI, E., HANSEL, D., and KANTER, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312. 
BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York. 
CARNEVALI, P., and PATARNELLO, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199. 
COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326. 
ENGEL, A., and VAN DEN BROECK, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772. 
GARDNER, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257. 
GARDNER, E., and DERRIDA, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271. 
GYÖRGYI, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097. 
GYÖRGYI, G., and TISHBY, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore. 
HANSEL, D., MATO, G., and MEUNIER, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471. 
KINZEL, W., and RUJÁN, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473. 
LEVIN, E., TISHBY, N., and SOLLA, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA. 
MÉZARD, M., PARISI, G., and VIRASORO, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore. 
MONASSON, R., and ZECCHINA, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432. 
OPPER, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett. 72, 2113. 
OPPER, M., and HAUSSLER, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677. 
OPPER, M., and KINZEL, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York. 
SAAD, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York. 
SCHWARZE, H., and HERTZ, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375. 
SCHWARZE, H., and HERTZ, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785. 
SEUNG, H. S., SOMPOLINSKY, H., and TISHBY, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056. 
SEUNG, H. S., OPPER, M., and SOMPOLINSKY, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York. 
SOMPOLINSKY, H., TISHBY, N., and SEUNG, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683. 
URBANCZIK, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553. 
VALLET, F., CAILTON, J., and REFREGIER, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315. 
VAPNIK, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York. 
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York. 
VAPNIK, V. N., and CHERVONENKIS, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254. 

General References 

ARBIB, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA. 
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York. 
HERTZ, J. A., KROGH, A., and PALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA. 
MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA. 
WATKIN, T. L. H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499. 
<|endoftext|>


<|startoftext|>
Model Compression and Acceleration for Deep Neural Networks: The principles, progress, and challenges 

In recent years, deep neural networks (DNNs) have received increased attention, have been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of graphics processing units (GPUs) with very high computation capability plays a key role in their success. For example, Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully connected layers. Usually, it takes two to three days to train the whole model on the ImageNet data set with an NVIDIA K40 machine. In another example, the top face-verification results from the Labeled Faces in the Wild (LFW) data set were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally connected, and fully connected layers [2], [3]. It is also very time-consuming to train such a model to obtain a reasonable performance. In architectures that only rely on fully connected layers, the number of parameters can grow to billions [4]. 

Introduction 

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications such as online learning and incremental learning. In addition, recent years witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle fundamental challenges in deploying deep-learning systems to portable devices with limited resources [e.g., memory, central processing units (CPUs), energy, bandwidth]. Efficient deep-learning methods can have a significant impact on distributed systems, embedded devices, and field-programmable gate arrays (FPGAs) for artificial intelligence (AI). For example, the residual network-50 (ResNet-50) [5], which has 50 convolutional layers, needs more than 95 megabytes of memory for storage and numerous floating-point multiplications to process each image. After discarding some redundant weights, the network still works as usual but saves more than 75% of parameters and 50% of computational time. 
For devices like cell phones and FPGAs with only a few megabytes of resources, how to compact the models used on them is also important. 
Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. 
In this article, we review recent works on compressing and accelerating DNNs, which have attracted much attention from the deep-learning community and have already achieved significant progress in past years. 
We classify these approaches into four categories: 
1) Parameter pruning and sharing: The parameter pruning and sharing-based methods explore the redundancy in the model parameters and try to remove the redundant and noncritical ones. 
2) Low-rank factorization: Low-rank factorization-based techniques use matrix/tensor decomposition to estimate the informative parameters of the deep convolutional neural networks (CNNs). 
3) Transferred/compact convolutional filters: The transferred/compact convolutional filters-based approaches design special structural convolutional filters to reduce the storage and computation complexity. 
4) Knowledge distillation (KD): The KD methods learn a distilled model and train a more compact neural network to reproduce the output of a larger network. 
In Table 1, we briefly summarize these four types of methods. Generally, the parameter pruning and sharing, low-rank factorization, and KD approaches can be used in DNNs with fully connected layers and convolutional layers, achieving comparable performances. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filters-based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment, which is straightforward, while parameter pruning and sharing use different methods such as vector quantization, binary coding, and sparse constraints to perform the task. Usually, it will take several steps to achieve the goal. 
Regarding training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pretrained ones or trained from scratch, while the transferred/compact filter and KD models can only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning and sharing can be used together, and model quantization and binarization can be used together with low-rank approximations to achieve further speedup. We will describe the details of each theme and their properties, strengths, and drawbacks in the following sections. 

Parameter pruning and sharing 

Early work [6] showed that network pruning is effective in reducing network complexity and addressing the overfitting problem. Since then, pruning has been widely studied as a way to compress DNN models by removing parameters that are not crucial to model performance. These techniques can be further classified into three categories: model quantization and binarization, parameter sharing, and structural matrix. 

Quantization and binarization 

| Theme Name | Description | Applications | More Details |
|---|---|---|---|
| Parameter pruning and sharing | Reducing redundant parameters that are not sensitive to the performance | Convolutional layer and fully connected layer | Robust to various settings, can achieve good performance, can support both training from scratch and pretrained model |
| Low-rank factorization | Using matrix/tensor decomposition to estimate the informative parameters | Convolutional layer and fully connected layer | Standardized pipeline, easily implemented, can support both training from scratch and pretrained model |
| Transferred/compact convolutional filters | Designing special structural convolutional filters to save parameters | Only for convolutional layer | Algorithms are dependent on applications, usually achieve good performance, only support training from scratch |
| KD | Training a compact neural network with distilled knowledge of a large model | Convolutional layer and fully connected layer | Model performances are sensitive to applications and network structure, only support training from scratch |

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speedup with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding-based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy. 
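To make the idea of k-means scalar quantization concrete, the following pure-Python sketch clusters a flat list of weights into k shared values. It is only an illustration under stated assumptions, not the procedure of any cited work: the function names, the evenly spaced initialization, and the fixed iteration count are hypothetical, and k is assumed to be at least 2.

```python
def kmeans_quantize(weights, k, iterations=20):
    """Cluster scalar weights into k shared values with Lloyd's algorithm."""
    lo, hi = min(weights), max(weights)
    # Assumed initialization: centroids spread evenly over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignments = [0] * len(weights)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every weight.
        assignments = [
            min(range(k), key=lambda c: abs(w - centroids[c])) for w in weights
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [w for w, a in zip(weights, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

def dequantize(centroids, assignments):
    """Rebuild approximate weights from the codebook and per-weight indices."""
    return [centroids[a] for a in assignments]
```

After quantization, each weight is stored as a small integer index into a codebook of k values, so the per-weight cost drops from a full floating-point number to roughly log2(k) bits plus the shared codebook.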
The method proposed in [10] first pruned the unimportant connections and retrained the sparsely connected networks. Then it quantized the link weights using weight-sharing and applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it starts by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network is retrained to learn the final weights for the remaining sparse connections. This work achieves the state-of-the-art performance among all parameter quantization-based methods. It was shown in [11] that Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize Hessian-weighted quantization errors on average for clustering network parameters. A novel quantization framework was introduced in [12], which reduced the precision of network weights to ternary values. 
In the extreme case of 1-bit representation of each weight, i.e., binary weight neural networks, there are also many works that directly train CNNs with binary weights; for instance, BinaryConnect [13], BinaryNet [14], and XNORNetworks [15]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [16] showed that networks trained with backpropagation could be robust against (or resilient to) specific weight distortions, including binary weights. 
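As a rough illustration of what 1-bit weights look like in practice, the sketch below binarizes real-valued weights either deterministically (by sign) or stochastically. The function names and the hard-sigmoid sampling probability are assumptions made for this example and are not claimed to match any of the cited methods exactly.

```python
import random

def binarize_deterministic(weights):
    """Map each real weight to +1 or -1 according to its sign."""
    return [1 if w >= 0 else -1 for w in weights]

def hard_sigmoid(x):
    """Clip (x + 1) / 2 into [0, 1]; used here as an assumed sampling probability."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))

def binarize_stochastic(weights, rng):
    """Sample +1 with probability hard_sigmoid(w), otherwise -1."""
    return [1 if rng.random() < hard_sigmoid(w) else -1 for w in weights]

def binary_dot(binary_weights, inputs):
    """With +1/-1 weights, every multiplication reduces to a sign change."""
    total = 0.0
    for b, x in zip(binary_weights, inputs):
        total += x if b == 1 else -x
    return total
```

For example, `binary_dot(binarize_deterministic([0.3, -0.2, 0.0]), [0.5, 2.0, -1.0])` only adds or subtracts inputs, which is where the potential savings in multiplications come from.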

Drawbacks 

However, the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of these binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss. To address this issue, the work in [17] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [18] significantly reduced the time on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes. 

Pruning and sharing

Network pruning and sharing has been used both to reduce network complexity and to address the overfitting issue. An early approach to pruning was biased weight decay [19]. The optimal brain damage [20] and the optimal brain surgeon [21] methods reduced the number of connections based on the Hessian of the loss function, and their works suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. Those methods supported training from scratch. 
A recent trend in this direction is to prune redundant, non-informative weights in a pretrained CNN model. For example, Srinivas and Babu [22] explored the redundancy among neurons and proposed a data-free pruning method to remove redundant neurons. Han et al. [23] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [24] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights and then used Huffman coding to encode the quantized weights. In [25], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re)training procedure. It is worth noting that the aforementioned pruning schemes typically produce connection pruning in CNNs. 
There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are 

<<FIGURE>>

FIGURE 1. The three-stage compression method proposed in [10]: pruning, quantization, and encoding. The input is the original model, and the output is the compressed model. 

nels, or even layers. In filter-level pruning, all of the aforementioned works used l2-norm regularizers. The work in [29] used the l1-norm to select and prune unimportant filters. 

Drawbacks 

There are some potential issues with the pruning and sharing approaches. First, pruning with l1 or l2 regularization requires more iterations to converge. Furthermore, all pruning criteria require manual setup of sensitivity for layers, which demands fine-tuning of the parameters and could be cumbersome for some applications. 
Designing the structural matrix 
In architectures that contain only fully connected layers, the number of parameters can grow up to billions [4]. Thus, it is critical to explore this redundancy of parameters in fully connected layers, which is often the bottleneck in terms of memory consumption. These network layers use the nonlinear transforms <<FORMULA>>, where σ(·) is an element-wise nonlinear operator, x is the input vector, and M is the m × n matrix of <<FORMULA>> parameters. When M is a large general dense matrix, storing its mn parameters and computing matrix-vector products both cost O(mn). Thus, an intuitive way to prune parameters is to impose M as a parameterized structural matrix. An m × n matrix <<FORMULA>> that can be described using much fewer parameters than mn is called a structured matrix. Typically, the structure should not only reduce the memory cost but also dramatically accelerate the inference and training stage via fast matrix-vector multiplication and gradient computations. 
Following this direction, the work in [30] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector <<FORMULA>>, a circulant matrix <<FORMULA>> is defined as 

<<FORMULA>>

Thus the memory cost becomes <<FORMULA>> instead of <<FORMULA>>. This circulant structure also enables the use of fast Fourier transform (FFT) to speed up the computation. Given a d-dimensional vector r, the 1-layer circulant neural network in (1) has time complexity of <<FORMULA>>. 
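The following Python sketch illustrates why a circulant structure helps. It stores only the defining vector r and computes the matrix-vector product through the FFT, then checks the result against the explicit dense matrix. The indexing convention R[i][j] = r[(i - j) mod d], the hand-written radix-2 FFT, and the toy vectors are assumptions made for illustration; they are not taken from [30].

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_dense(r):
    """Explicit d x d circulant matrix with R[i][j] = r[(i - j) mod d]."""
    d = len(r)
    return [[r[(i - j) % d] for j in range(d)] for i in range(d)]


def matvec_dense(R, x):
    """Ordinary matrix-vector product, quadratic in d."""
    return [sum(a * b for a, b in zip(row, x)) for row in R]


def matvec_circulant_fft(r, x):
    """Same product via circular convolution: ifft(fft(r) * fft(x))."""
    d = len(r)
    spectrum = [a * b for a, b in zip(fft(r), fft(x))]
    return [v.real / d for v in fft(spectrum, inverse=True)]


r = [1.0, 2.0, 3.0, 4.0]   # only d numbers are stored
x = [0.5, -1.0, 2.0, 0.0]
dense = matvec_dense(circulant_dense(r), x)
fast = matvec_circulant_fft(r, x)
```

The dense path materializes d × d entries, while the FFT path touches only vectors of length d, which is the source of the memory and time savings described above.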
In [31], a novel adaptive fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The adaptive fastfood transform matrix <<FORMULA>> was defined as 

<<FORMULA>>. (2) 

Here, <<FORMULA>> are random diagonal matrices. <<FORMULA>> is a random permutation matrix and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the adaptive fastfood transform reduces the storage and the computational costs from <<FORMULA>> to <<FORMULA>> and from <<FORMULA>> to <<FORMULA>>, respectively. 
The work in [32] showed the effectiveness of the new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multilevel Toeplitz-like [33] matrices related to multidimensional convolution [34]. 

Drawbacks 

One potential problem of this kind of approach is that the structural constraint will cause loss in accuracy since the constraint might bring bias to the model. On the other hand, finding a proper structural matrix is difficult, and there is no theoretical way to derive it. 
Low-rank factorization and sparsity 
As convolution operations constitute the bulk of all computations in CNNs, simplifying the convolution layer would have a direct impact on the overall speedup. The convolution kernels in a typical CNN form a four-dimensional tensor. The key observation is that there might be a significant amount of redundancy in the tensor. Ideas based on tensor decomposition seem to be a particularly promising way to remove the redundancy. As for the fully connected layer, it can be viewed as a two-dimensional (2-D) matrix, and low-rankness can also help. 
Using low-rank filters to accelerate convolution has a long history. Typical examples include high-dimensional discrete cosine transform (DCT) and wavelet systems constructed from one-dimensional (1-D) DCT transform and 1-D wavelets, respectively, using tensor products. In the context of dictionary learning, Rigamonti et al. [35] suggested learning separable 1-D filters. In [36], a few low-rank approximation and clustering schemes for the convolutional kernels were proposed. They achieved 2× speedup for a single convolutional layer with 1% drop in classification accuracy. The work in [37] suggested using different tensor decomposition schemes, reporting a 4.5× speedup with 1% drop in accuracy in text recognition. In both works, the approximation was done layer by layer. After one layer was approximated by the low-rank filters, the parameters of that layer were fixed, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2-D convolutional layers, as illustrated in Figure 2. In [38], canonical polyadic (CP) decomposition of the kernel tensors was proposed. Their work used nonlinear least squares to compute the CP decomposition, which was also based on the tensor decomposition idea. In [39], a new algorithm for computing the low-rank tensor decomposition and a new method for training low-rank constrained CNNs from scratch were proposed. It used batch normalization (BN) to transform the activations of the internal hidden units, and it was shown to be an effective way to deal with the exploding or vanishing gradients. 

<<FORMULA>> 

<<FIGURE>>

FIGURE 2. A typical framework of the low-rank regularization method. (a) is the original convolutional layer, and (b) is the low-rank constraint convolutional layer with rank-K. 

In principle, both the CP decomposition scheme and the decomposition scheme in [39] (BN low-rank) can be used to train CNNs from scratch. For the CP decomposition, finding the best low-rank approximation is an ill-posed problem, and the best rank-K approximation may not exist in the general case. For the scheme in [39], the decomposition always exists and can achieve better performance than general CP. Table 2 lists a performance comparison of both methods. The actual speedup and compression rates are used to measure the performances. We can see that the BN version can achieve slightly better performance while the CP version gives higher compression rates. 

<<FORMULA>>

Note that low-rank methods can also be applied in fully connected layers. For instance, Misha et al. [40] reduced the number of dynamic parameters in deep models using the low-rank method. Reference [41] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling. 
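To make the saving from a low-rank fully connected layer concrete, the sketch below represents an m × n weight matrix W as a product of an m × K factor U and a K × n factor V, and compares parameter counts and outputs. The factor values and sizes are made-up illustrative numbers, and W is built to be exactly rank K; in the cited works the factors would instead come from decomposing or training real weights, which is not shown.

```python
def matmul(A, B):
    """Product of two dense matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


m, n, K = 4, 6, 2
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # m x K factor
V = [[1.0, 2.0, 0.0, -1.0, 0.5, 3.0],
     [0.0, 1.0, 1.0, 2.0, -2.0, 0.0]]                   # K x n factor
W = matmul(U, V)          # the dense m x n layer that the two factors represent
x = [1.0, 0.0, -1.0, 2.0, 0.0, 1.0]

dense_out = matvec(W, x)                  # one m x n product
low_rank_out = matvec(U, matvec(V, x))    # a K x n product, then an m x K product

dense_params = m * n                      # parameters stored by the dense layer
low_rank_params = K * (m + n)             # parameters stored by the two factors
```

The saving grows as K becomes small relative to m and n, since the factored form needs K(m + n) parameters instead of mn.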
Drawbacks 
Low-rank approaches are straightforward for model compression and acceleration. The idea complements recent advances in deep learning such as dropout, rectified units, and maxout. However, the implementation is not that easy since it involves a decomposition operation, which is computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer, and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when compared to the original model. 
Transferred/compact convolutional filters 
CNNs are parameter-efficient because they exploit the translation invariant property of the representations to input image, which is the key to the success of training very deep models without severe overfitting. Although a strong theory is currently missing, a large amount of empirical evidence supports the notion that both the translation invariant property and convolutional weight-sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent works in [42], which introduced the equivariant group theory. Let x be an input, <<FORMULA>> be a network or layer, and <<FORMULA>> be the transform matrix. The concept of equivariance is defined as 

<<FORMULA>>, (3) 

which says that transforming the input x by the transform <<FORMULA>> and then passing it through the network or layer <<FORMULA>> should give the same result as first mapping x through the network and then transforming the representation. Note that, in [42], the transforms <<FORMULA>> and T'(·) are not necessarily the same as they operate on different objects. According to this theory, it is reasonable to apply the transform to layers or filters <<FORMULA>> to compress the whole network models. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters by applying a certain transform <<FORMULA>> to a small set of base filters since it acts as a regularizer for the model. 

Models compared (rows): AlexNet, VGG-16, and GoogleNet, each with its BN low-rank and CP low-rank versions. 

<<TABLE>>

Following this trend, many recent works have proposed building a convolutional layer from a set of base filters [42]-[45]. What they have in common is that the transform <<FORMULA>> lies in the family of functions that only operate in the spatial domain of the convolutional filters. For example, the work in [44] found that the lower convolution layers of CNNs learned redundant filters to extract both positive and negative phase information of an input signal, and defined <<FORMULA>> to be the simple negation function 

<<FORMULA>>. (4) 

Here, W_x is the basis convolutional filter and W_x^- is the filter consisting of the shifts whose activation is opposite to that of W_x and selected after the max-pooling operation. By doing this, the work in [44] can easily achieve a 2× compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer to improve the classification accuracy. The intuition is that the learning algorithm with pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones. 
In [45], it was observed that magnitudes of the responses from convolutional kernels had a wide diversity of pattern representations in the network, and it was not proper to discard weaker signals with a single threshold. Thus, a multibias nonlinearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform <<FORMULA>> was defined as 

<<FORMULA>>, (5) 

where δ were the multibias factors. The work in [46] considered a combination of rotation by a multiple of 90° and horizontal/vertical flipping with 

<<FORMULA>>, (6) 

where W^{T_θ} was the transformation matrix that rotated the original filters with angle θ ∈ {90, 180, 270}. In [42], the transform was generalized to any angle learned from data, and θ was directly obtained from data. Both [46] and [42] can achieve good classification performance. 
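As a rough illustration of the rotation-based transform, the sketch below derives a bank of four filters from a single stored base filter by rotating it in 90-degree steps. The helper names and the 3 × 3 base filter are illustrative assumptions; the flipping component of [46] and the learned angles of [42] are not modeled.

```python
def rotate90(filt):
    """Rotate a square 2-D filter by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*filt)][::-1]


def rotated_filter_bank(base):
    """Return the base filter rotated by 0, 90, 180, and 270 degrees."""
    bank = [[row[:] for row in base]]
    for _ in range(3):
        bank.append(rotate90(bank[-1]))
    return bank


# Hypothetical 3 x 3 base filter: only these 9 weights need to be stored.
base = [[1, 2, 0],
        [0, 1, 0],
        [0, 0, 0]]
bank = rotated_filter_bank(base)

stored_params = sum(len(row) for row in base)
effective_params = len(bank) * stored_params
```

The layer then behaves as if it had four filters while storing the weights of one, which is where the parameter saving of this family of methods comes from.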
Reference [43] defined <<FORMULA>> as the set of translation functions applied to 2-D filters 

<<FORMULA>>, (7)

where <<FORMULA>> denoted the translation of the first operand by <<FORMULA>> along its spatial dimensions, with proper zero padding at borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy as a regularized version of maxout networks and 2) achieve parameter efficiency by flexibly varying their architectures to compress networks. 
Table 3 briefly compares the performance of different methods with transferred convolutional filters, using VGG-Net (16 layers) as the baseline model. The results are reported on the CIFAR-10 and CIFAR-100 data sets with top-five error rates. It is observed that they can achieve reduction in parameters with little or no drop in classification accuracy. 

Drawbacks 

There are several issues that need to be addressed for approaches that apply transfer information to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not narrow/special ones (like GoogleNet and ResNet). Second, the transfer assumptions sometimes are too strong to guide the algorithm, making the results unstable on some data sets. 
Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace the loose and overparametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing a 3 × 3 convolution into two 1 × 1 convolutions was used in [47], which achieved state-of-the-art acceleration performance on object recognition. SqueezeNet [48] was proposed to replace 3 × 3 convolution with 1 × 1 convolution, which created a compact neural network with approximately 50× fewer parameters and comparable accuracy when compared to AlexNet. 
KD 
<<TABLE>>

To the best of our knowledge, exploiting knowledge transfer to compress models was first proposed by Caruana et al. [49]. They trained a compressed model with pseudo-data labeled by an ensemble of strong classifiers and reproduced the output of the original larger network. However, their work is limited to shallow models. The idea has been recently adopted in [50] as KD to compress deep and wide networks into shallower ones, where the compressed model mimicked the function learned by the complex model. The basic idea of KD is to distill knowledge from a large teacher model into a small one by learning the class distributions output by the teacher via softened softmax. 
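The softened softmax can be sketched as follows. Dividing the logits by a temperature greater than one flattens the teacher's class distribution, and the student is then trained to match it. The temperature value, the toy logits, and the choice of cross-entropy as the matching loss are illustrative assumptions rather than the exact formulation of [50].

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                     # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))


teacher_logits = [8.0, 2.0, 0.5]           # hypothetical teacher outputs
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

matched = distillation_loss(teacher_logits, teacher_logits, temperature=4.0)
uninformed = distillation_loss([0.0, 0.0, 0.0], teacher_logits, temperature=4.0)
```

With the higher temperature, the non-maximal classes receive noticeably more probability mass, so the student sees how the teacher ranks the wrong classes and not only which class it prefers.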
The work in [51] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm, in which the student was penalized according to a softened version of the teacher's output. The framework compressed an ensemble of deep networks (teacher) into a student network of similar depth. To do so, the student was trained to predict the output of the teacher, as well as the true classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [52] aimed to address the network compression problem by taking advantage of deep neural networks. It proposed an approach to train thin and deep networks, called FitNets, to compress wide and shallower (but still deep) networks. The method was rooted in KD and extended the idea to allow for thinner and deeper student models. To learn from the intermediate representations of the teacher network, FitNet made the student mimic the full feature maps of the teacher. However, such assumptions are too strict since the capacities of teacher and student may differ greatly. In certain circumstances, FitNet may adversely affect the performance and convergence. All the aforementioned methods are validated on the MNIST, CIFAR-10, CIFAR-100, SVHN, and AFLW benchmark data sets, and simulation results show that these methods match or outperform the teacher's performance, while requiring notably fewer parameters and multiplications. 
There are several extensions along this direction of knowledge distillation. The work in [53] trained a parametric student model to approximate a Monte Carlo teacher. The proposed framework used online training and used DNNs for the student model. Different from previous works, which represented the knowledge using the softened label probabilities, [54] represented the knowledge by using the neurons in the higher hidden layer, which preserved as much information as the label probabilities but were more compact. The work in [55] accelerated the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. The techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [56] proposed attention transfer to relax the assumption of FitNet. They transferred the attention maps that are summaries of the full activations. 

Drawbacks 

KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One of them is that KD can only be applied to classification tasks with softmax loss function, which hinders its usage. Another drawback is that the model assumptions sometimes are too strict to make the performance competitive with other types of approaches. 

Other types of approaches 

We first summarize the works utilizing attention-based methods. Note that attention-based systems [57] can reduce computations significantly by learning to selectively focus or attend to a few, task-relevant input regions. The work in [57] introduced the dynamic capacity network that combined two types of modules: the small subnetworks with low capacity, and the large ones with high capacity. The low-capacity subnetworks were active on the whole input to first find the task-relevant areas in the input, and then the attention mechanism was used to direct the high-capacity subnetworks to focus on the task-relevant regions in the input. By doing this, the size of the CNN model could be significantly reduced. 

Following this direction, the work in [58] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a new type of general-purpose neural network component: a sparsely gated mixture-of-experts (MoE) layer. The MoE consisted of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selected a sparse combination of the experts to process each input. In [59], dynamic DNNs (D2NNs) were introduced, which were a type of feed-forward DNN that selected and executed a subset of D2NN neurons based on the input. 
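The sparse selection performed by an MoE layer can be sketched as top-k gating. Only the k experts with the largest gate scores are evaluated, and their outputs are mixed with renormalized weights. The scalar toy experts, the fixed gate logits, and the helper names are illustrative assumptions; in [58] the gate scores come from a trainable gating network, which is not modeled here.

```python
import math


def top_k_gate(gate_logits, k):
    """Keep the k largest gate logits and renormalize them with a softmax."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    selected = order[:k]
    peak = max(gate_logits[i] for i in selected)
    exps = {i: math.exp(gate_logits[i] - peak) for i in selected}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_gate(gate_logits, k)
    # Experts that were not selected are never called.
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical scalar experts standing in for small feed-forward networks
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]        # assumed output of a gating network
y = moe_forward(3.0, experts, gate_logits, k=2)
```

Because the cost per input depends on k and not on the total number of experts, model capacity can grow without a proportional increase in computation.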
There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [43], [60]. Network architectures, such as GoogleNet or network in network, can achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e., reusing features learned on the ImageNet data set and applying them to new tasks, is more difficult with this approach. This problem was noted by Szegedy et al. [60] and motivated them to add a linear layer on top of their networks to enable transfer learning. 
The work in [61] targeted the ResNet-based model with a spatially varying computation time, called stochastic depth, which enabled the seemingly contradictory setup to train short networks and use deep networks at test time. It started with very deep networks and, during training, for each mini-batch, randomly dropped a subset of layers and bypassed them with the identity function. This model is end-to-end trainable, deterministic, and can be viewed as a black-box feature extractor. Following this direction, the work in [62] proposed a pyramidal residual network with stochastic depth. 
Other approaches to reduce the convolutional overheads include using FFT-based convolutions [63] and fast convolution using the Winograd algorithm [64]. Those works only aim to speed up the computation but not reduce the memory storage. 

Benchmarks, evaluation, and databases 

In the past five years, the deep-learning community has made great efforts in benchmark models. One of the most well-known models used in compression and acceleration for CNNs is AlexNet [1], which occasionally has been used for assessing the performance of compression. Other popular standard models include LeNets [65], All-CNN-nets [66], and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more state-of-the-art architectures are used as baseline models in many works, including network in networks [67], VGGNets [68], and ResNets [69]. Table 4 summarizes the baseline models commonly used in several typical compression methods. 
The standard criteria to measure the quality of model compression and acceleration are the compression and the speedup rates. Assume that a is the number of the parameters in the original model M and a* is that of the compressed model M*, then the compression rate α(M, M*) of M* over M is 

<<FORMULA>> (8)

Another widely used measurement is the index space saving defined in several papers [70], [71] as 

<<FORMULA>>, (9)

where a and a* are the number of the dimension of the index space in the original model and that of the compressed model, respectively. 
Similarly, given the running time s of M and s* of M*, the speedup rate δ(M, M*) is defined as 

<<FORMULA>> (10) 

Most work used the average training time per epoch to measure the running time, while in [70] and [71], the average testing time was used. Generally, the compression rate and speedup rate are highly correlated, as smaller models often result in faster computation for both the training and the testing stages. 
Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computational time. However, for different applications with varying CNN designs, the correlation between parameter size and computational time may be different. For example, it is observed that, for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers; while for image classification tasks, floating-point operations are mainly in the first few convolutional layers since each filter is convolved with the whole image, which is usually very large at the beginning. Different applications should focus on different layers. 

Discussion and challenges 

In this article, we summarized recent works on compressing and accelerating DNNs. Here we discuss more details about how to choose different compression approaches and possible challenges/solutions in this area. 

General suggestions 

There is no golden rule to measure which one of the four kinds of approaches is the best. How to choose the proper approaches really depends on the applications and requirements. Here, we provide some general suggestions. 
If the application needs compact models derived from pretrained models, one can choose either pruning and sharing or low-rank factorization-based methods. If end-to-end solutions are needed for the problem, the low-rank and transferred convolutional filters approaches are preferred. 
For applications in some specific domains, methods with human prior (like the transferred convolutional filters and structural matrix) sometimes have benefits. For example, when conducting medical image classification, transferred convolutional filters should work well as medical images (like organs) do have the rotation transformation property. 
Usually, the approaches of pruning and sharing could give a reasonable compression rate while not hurting the accuracy. Thus, for applications that require stable model accuracy, it is better to utilize pruning and sharing. 
If a problem involves small- or medium-size data sets, one can try the KD approaches. The compressed student model can benefit from the knowledge transferred from the teacher model, making it robust on data sets that are not large. 
As we mentioned in the Introduction, techniques of the four themes are orthogonal. It makes sense to combine two or three of them to maximize the compression/speedup rates. For some specific applications, like object detection, which requires both convolutional and fully connected layers, one can compress the convolutional layers with low-rank factorization and the fully connected layers with a pruning method. 
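The effect of combining methods on different layers can be estimated with the compression and speedup rates from the benchmarks section. The sketch below assumes that the compression rate is the ratio of original to compressed parameter counts and that the speedup rate is the ratio of original to compressed running time, since the corresponding formulas are not legible in this copy. The layer-wise parameter counts and the timings are made-up illustrative numbers.

```python
def compression_rate(original_params, compressed_params):
    """Assumed form of the compression rate: original size over compressed size."""
    return original_params / compressed_params


def speedup_rate(original_time, compressed_time):
    """Assumed form of the speedup rate: original time over compressed time."""
    return original_time / compressed_time


# Hypothetical model: convolutional layers compressed with low-rank
# factorization, fully connected layers compressed with pruning.
layers = {
    "conv": {"original": 2_000_000, "compressed": 1_000_000},
    "fc": {"original": 58_000_000, "compressed": 5_000_000},
}

per_layer = {
    name: compression_rate(p["original"], p["compressed"])
    for name, p in layers.items()
}
total_original = sum(p["original"] for p in layers.values())
total_compressed = sum(p["compressed"] for p in layers.values())
overall = compression_rate(total_original, total_compressed)

# Hypothetical per-batch running times in seconds
speedup = speedup_rate(1.2, 0.4)
```

In this toy setting the overall rate is dominated by the fully connected layers because they hold most of the parameters, which mirrors the earlier remark that different applications should focus on different layers.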

<<FORMULA>>

Technical challenges 

Techniques for deep model compression and acceleration are still in the early stages, and the following challenges still need to be addressed. 
Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyperparameters). To handle more complicated tasks, more plausible ways to configure the compressed models should be provided. 

Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, channel pruning can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging because removing channels might dramatically change the input of the following layer. It is important to focus on how to address this issue. 
As we mentioned previously, methods of structural matrix and transferred convolutional filters impose prior human knowledge to the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of the imposed prior knowledge. 
The methods of KD provide many benefits such as directly accelerating the model without special hardware or implementations. It is still worthwhile to develop KD-based approaches and explore how to improve the performance. 
Hardware constraints in various small platforms (e.g., mobile, robotic, self-driving cars) are still a major problem that hinders the extension of deep CNNs. How to make full use of the limited computational resources available and how to design special compression methods for such platforms are still challenges that need to be addressed. 

Possible solutions 

To solve the hyperparameters configuration problem, we can rely on the recent learning-to-learn strategy [72], [73]. This framework provides a mechanism, allowing the algorithm to automatically learn how to exploit structure in the problem of interest. There are two different ways to combine the learning-to-learn module with the model compression. The first designs compression and learning-to-learn simultaneously, while the second first configures the model with learning-to-learn and then prunes the parameters. 
Channel pruning provides the efficiency benefit on both CPUs and GPUs because no special implementation is required. But it is also challenging to handle the input configuration. One possible solution is to use the training-based channel pruning methods [74], which focus on imposing sparse constraints on weights during training, and could adaptively determine hyperparameters. However, training from scratch for such a method is costly for very deep CNNs. 
Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the KD approaches. Instead of directly reducing and transferring parameters from the teacher models, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task. The intuition is that, if a neuron is activated in certain regions or samples, this implies these regions or samples share some common properties that may relate to the task. Performing such steps is time-consuming, thus efficient implementation is important. 
For methods with convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2-D filters or the matrix, and 2) learn the transformation jointly with all of the model parameters. 
Proposing some general/unified approaches is one direction that can be taken regarding the use of CNNs in small platforms. Yuhen et al. [75] presented a feature map dimensionality reduction method by excavating and removing redundancy in feature maps generated by different filters, which could also preserve intrinsic information of the original network. The idea can be extended to make CNNs more applicable for different platforms. The work in [76] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work in mobile devices. From the systematic side, Facebook released the platform Caffe2 [77], which employed a particularly lightweight and modular framework and included mobile-specific optimizations based on the hardware design. Caffe2 can help developers and researchers train large machine-learning models and deliver AI on mobile devices. 

Acknowledgments 

We would like to thank the reviewers and broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying this article. This research is supported by National Science Foundation of China, grant number 61401169. The corresponding author of this article is Pan Zhou. 

Authors 

Yu Cheng (chengyu@us.ibm.com) received his bachelor's degree in automation from Tsinghua University, Beijing, China, in 2010 and his Ph.D. degree in computer science from Northwestern University, Evanston, Illinois in 2015. Currently, he is a research staff member at AI Foundations Lab, IBM T.J. Watson Research Center, Yorktown Heights, New York. His research is focused on deep learning in general, with specific interests in deep generative models and deep models compression. He also has published many works regarding the applications of deep learning in computer vision and natural language processing. 
Duo Wang (d-wang15@mails.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015, where he is currently pursuing his Ph.D. degree in the Department of Automation, Tsinghua University. His research interests are deep/machine learning and their applications in computer vision and robotics vision. 
Pan Zhou (panzhou@hust.edu.cn) received his B.S. degree in the Advanced Class of Huazhong University of Science and Technology (HUST), Wuhan, China, and his M.S. degree in electronics and information engineering from the same university in 2006 and 2008, respectively. He received his Ph.D. degree from the School of Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta in 2011. Currently, he is an associate professor with School of Electronic Information and Communications, HUST. His research interests include big data analytics and machine learning, security and privacy, and information networks. 
Tao Zhang (taozhang@mail.tsinghua.edu.cn) received his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and his Ph.D. degree from Saga University, Japan, in 2002, all in control engineering. He is a professor with the Department of Automation, Tsinghua University. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft. 

References 

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, Imagenet classification with deep convolutional neural networks, in Proc. Conf. Neural Information Processing Systems, 2012, pp. 10971105. 
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2014, pp. 17011708. 
[3] Y. Sun, X. Wang, and X. Tang, Deeply learned face representations are sparse, selective, and robust, in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015, pp. pp. 28922900. 
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, Large scale distributed deep networks, in Proc. Conf. Neural Information Processing Systems, 2012, pp. 12231231. 
[5] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recogni.tion, Computing Res. Repository, vol. abs/1512.03385, 2015. [Online]. Available: https://arxiv.org/pdf/1512.03385.pdf 
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, Compressing deep convolutional networks using vector quantization, Computing Res. Repository, vol. abs/1412.6115, 2014. [Online]. Available: https://arxiv.org/pdf/1412.6115.pdf 
[7] Y. W. Q. H. Jiaxiang Wu, C. Leng, and J. Cheng, Quantized convolutional neu.ral networks for mobile devices, in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 48204828. 
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural net.works on cpus, in Proc. Conf. Neural Information Processing Systems Deep Learning and Unsupervised Feature Learning Workshop, 2011. 
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, Deep learning with limited numerical precision, in Proc. 32nd Int. Conf. Machine Learning, 2015, vol. 37, pp. 17371746. 
[10] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, in Proc. Int. Conf. Learning Representations, 2016. 
[11] Y. Choi, M. El-Khamy, and J. Lee, Towards the limit of network quantization, Computing Res. Repository, vol. abs/1612.01543, 2016. [Online]. Available: https://arxiv.org/abs/1612.01543 
[12] C. Zhu, S. Han, H. Mao, and W. J. Dally, Trained ternary quantization, arXiv Preprint, arXiv:1612.01064, 2016. 
[13] M. Courbariaux, Y. Bengio, and J. David, Binaryconnect: Training deep neu.ral networks with binary weights during propagations, in Proc. Advances Neural Information Processing Systems Annu. Conf., 2015, pp. 31233131. 
[14] M. Courbariaux and Y. Bengio, Binarynet: Training deep neural networks with weights and activations constrained to +1 or .1, Computing Res. Repository, vol. abs/1602.02830, 2016. [Online]. Available: https://arxiv.org/abs/1602.02830 
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks, in Proc. European Conf. Computer Vision, 2016, pp. 525542. 
[16] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, Deep neural networks are robust to weight binarization and other non-linear distortions, Computing Res. Repository, vol. abs/1606.01981, 2016. [Online]. Available: https:// arxiv.org/abs/1606.01981 
[17] L. Hou, Q. Yao, and J. T. Kwok, Loss-aware binarization of deep networks, Computing Res. Repository, vol. abs/1611.01600, 2016. [Online]. Available: https:// arxiv.org/abs/1611.01600 
[18] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, Neural networks with few multiplications, Computing Res. Repository, vol. abs/1510.03009, 2015. [Online]. Available: https://arxiv.org/abs/1510.03009 
[19] S. J. Hanson and L. Y. Pratt, Comparing biases for minimal network con.struction with back-propagation, Adv. Neural Inform. Process. Syst. 1, 1989, pp. 177185. 
[20] Y. L. Cun, J. S. Denker, and S. A. Solla, Advances in neural information pro.cessing systems 2, in Optimal Brain Damage, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 598605. 
[21] B. Hassibi, D. G. Stork, and S. C. R. Com, Second order derivatives for network pruning: Optimal brain surgeon, in Advances in Neural Information Processing Systems, vol. 5. San Mateo, CA: Morgan Kaufmann, 1993, pp. 164 171. 
[22] S. Srinivas and R. V. Babu, Data-free parameter pruning for deep neural net.works, in Proc. British Machine Vision Conf., 2015, pp. 31.131.12. 
[23] S. Han, J. Pool, J. Tran, and W. J. Dally, Learning both weights and connections for efficient neural networks, in Proc. 28th Int. Conf. Neural Information Processing Systems, 2015, pp. 11351143. 
[24] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, Compressing neural networks with the hashing trick, in Proc. Machine Learning Research Workshop Conf., 2015, pp. 22852294. 
[25] K. Ullrich, E. Meeds, and M. Welling, Soft weight-sharing for neural network compression, Computing Res. Repository, vol. abs/1702.04008, 2017. [Online]. Available: https://arxiv.org/abs/1702.04008 
[26] V. Lebedev and V. S. Lempitsky, Fast convnets using group-wise brain dam.age, in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2554 2564. 
[27] H. Zhou, J. M. Alvarez, and F. Porikli, Less is more: Towards compact CNNs, in Proc. European Conf. Computer Vision, 2016, pp. 662677. 
[28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, Learning structured sparsity in deep neural networks, Adv. Neural Inform. Process. Syst., vol. 29, pp. 20742082, 2016. 
[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, Pruning filters for efficient convnets, Computing Res. Repository, vol. abs/1608.08710, 2016. [Online]. Available: https://arxiv.org/abs/1608.08710 
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, An exploration of parameter redundancy in deep networks with circulant projections, in Proc. Int. Conf. Computer Vision, 2015, pp. 28572865. 
[31] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, Deep fried convnets, in Proc. Int. Conf. Computer Vision, 2015, pp. 1476 1483. 
[32] V. Sindhwani, T. Sainath, and S. Kumar. (2015). Structured transforms for small-footprint deep learning. Advances in Neural Information Processing Systems, 28, pp. 30883096. [Online]. Available: http://papers.nips.cc/paper/5869.structured-transforms-for-small-footprint-deep-learning.pdf 
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-Block, and Toeplitz-Derived Matrices. Berlin, Germany: Springer, 1991, pp. 215236. 
[34] M. V. Rakhuba and I. V. Oseledets. (2015). Fast multidimensional convolution in low-rank tensor formats via cross approximation. SIAM J. Sci. Comput., 37(2). [Online]. Available: http://dx.doi.org/10.1137/140958529 
[35] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, Learning separable filters, in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2013, pp. 2754 2761. 
[36] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, Exploiting lin.ear structure within convolutional networks for efficient evaluation, Adv. Neural Inform. Process. Syst. vol. 27, pp. 12691277, 2014. 
[37] M. Jaderberg, A. Vedaldi, and A. Zisserman, Speeding up convolutional neu.ral networks with low rank expansions, in Proc. British Machine Vision Conf., 2014, pp. 113. 
[38] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, Speeding-up convolutional neural networks using fine-tuned CP-decomposition, Computing Res. Repository, vol. abs/1412.6553, 2014. [Online]. Available: https:// arxiv.org/abs/1412.6553 
[39] C. Tai, T. Xiao, X. Wang, and E. Weinan, Convolutional neural networks with low-rank regularization, Computing Res. Repository, vol. abs/1511.06067, 2015. 
[40] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas. (2013). Predicting parameters in deep learning. Advances in Neural Information Processing Systems, 26, 21482156. [Online]. Available: http://media.nips.cc/nips.books/nipspapers/paper_files/nips26/1053.pdf 
[41] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, Low-rank matrix factorization for deep neural network training with high-dimen.sional output targets, in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, 2013, pp. 66556659. 
[42] T. S. Cohen and M. Welling, Group equivariant convolutional networks, arXiv Preprint, arXiv:1602.07576, 2016. 
[43] S. Zhai, Y. Cheng, and Z. M. Zhang, Doubly convolutional neural networks, in Proc. Advances Neural Information Processing Systems, 2016, pp. 10821090. 
[44] W. Shang, K. Sohn, D. Almeida, and H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, arXiv Preprint, arXiv:1603.05201, 2016. 
[45] H. Li, W. Ouyang, and X. Wang, Multi-bias non-linear activation in deep neural networks, arXiv Preprint, arXiv:1604.00676, 2016. 
[46] S. Dieleman, J. D Fauw, and K. Kavukcuoglu, Exploiting cyclic symmetry in convolutional neural networks, in Proc. 33rd Int. Conf. Machine Learning, 2016, vol. 48, pp. 18891898. 
[47] C. Szegedy, S. Ioffe, and V. Vanhoucke. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning, Computing Res. Repository, vol. abs/1602.07261. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1602. html#SzegedyIV16 
[48] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autono.mous driving, Computing Res. Repository, vol. abs/1612.01051, 2016. [Online]. Available: https://arxiv.org/abs/1612.01051 
[49] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. (2006). Model compression. Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, pp. 535 541. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150464 
[50] J. Ba and R. Caruana, Do deep nets really need to be deep? Adv. Neural Inform. Process. Syst., vol. 27, pp. 26542662, 2014. 
[51] G. E. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural net.work, Computing Res. Repository, vol. abs/1503.02531, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531 
[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, Fitnets: Hints for thin deep nets, Computing Res. Repository, vol. abs/1412.6550, 2014. [Online]. Available: https://arxiv.org/abs/1412.6550 
[53] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling. (2015). Bayesian dark knowledge. Advances in Neural Information Processing Systems, 28, 34203428. [Online]. Available: http://papers.nips.cc/paper/5965-bayesian-dark-knowledge.pdf 
[54] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, Face model compression by dis.tilling knowledge from neurons, in Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 35603566. 
[55] T. Chen, I. J. Goodfellow, and J. Shlens, Net2net: Accelerating learning via knowledge transfer, Computing Res. Repository, vol. abs/1511.05641, 2015. [Online]. Available: https://arxiv.org/abs/1511.05641 
[56] S. Zagoruyko and N. Komodakis. (2016). Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, Computing Res. Repository, vol. abs/1612.03928. [Online]. Available: http://arxiv.org/ abs/1612.03928 
[57] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, Dynamic capacity networks, in Proc. 33rd Int. Conf. Machine Learning, 2016, pp. 25492558. 
[58] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. [Online]. Available: https://openreview.net/pdf?id=B1ckMDqlg 
[59] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, Deep dynamic neural networks for multimodal gesture segmentation and recognition, IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583 1597, 2016. 
[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. (2015). Going deeper with convolutions. Proc. IEEE Computer Vision Pattern Recognition. [Online]. Available: http://arxiv.org/ abs/1409.4842 
[61] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep networks with stochastic depth, Computing Res. Repository, vol. arXiv:1603.09382, 2016. 
[62] Y. Yamada, M. Iwamura, and K. Kise. (2016). Deep pyramidal residual networks with separated stochastic depth, Computing Res. Repository, vol. abs/1612.01230. [Online]. Available: http://arxiv.org/abs/1612.01230 
[63] M. Mathieu, M. Henaff, and Y. Lecun, Fast training of convolutional networks through FFTs, Computing Res. Repository, vol. arXiv:1312.5851, 2014. 
[64] A. Lavin and S. Gray, Fast algorithms for convolutional neural networks, in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4013 4021. 
[65] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, pp. 22782324, 1998. 
[66] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, Striving for simplicity: The all convolutional net, Computing Res. Repository, vol. abs/1412.6806, 2014. [Online]. Available: https://arxiv.org/abs/1412.6806 
[67] M. Lin, Q. Chen, and S. Yan, Network in network, in Proc. Int. Conf. Learning Representations, 2014. [Online]. Available: https://arxiv.org/abs/ 1312.4400 
[68] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Computing Res. Repository, vol. abs/1409.1556, 2014. [Online]. Available: https://arxiv.org/abs/1409.1556 
[69] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recogni.tion, arXiv Preprint, arXiv:1512.03385, 2015. 
[70] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, An exploration of parameter redundancy in deep networks with circulant projections, in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 28572865. 
[71] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, ACDC: A structured efficient linear layer, in Proc. Int. Conf. Learning Representations, 2016. 
[72] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, Learning to learn by gradient descent by gradient descent, in Proc. Neural Information Processing Systems Conf., 2016, pp. 3981 3989. 
[73] D. Ha, A. Dai, and Q. Le, Hypernetworks, in Proc. Int. Conf. Learning Representations, 2016. 
[74] J. M. Alvarez and M. Salzmann, Learning the number of neurons in deep net.works, in Proc. Neural Information Processing Systems Conf., 2016, pp. 2270 2278. 
[75] Y. Wang, C. Xu, C. Xu, and D. Tao, Beyond filters: Compact feature map for portable deep model, in Proc. 34th Int. Conf. Machine Learning, 2017, pp. 3703 3711. 
[76] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, Compression of deep convolutional neural networks for fast and low power mobile applications, Computing Res. Repository, vol. abs/1511.06530, 2015. [Online]. Available: https://arxiv.org/ abs/1511.06530 
[77] Facebook, Inc. Caffe2: A new lightweight, modular, and scalable deep learning framework. (2016). [Online]. Available: https://caffe2.ai/ 
<|endoftext|>


<|startoftext|>
                  MOGRIFIER LSTM


                  Gábor Melis†, Tomáš Kočiský†, Phil Blunsom†‡
                  {melisgl,tkocisky,pblunsom}@google.com
                  † DeepMind, London, UK
                  ‡ University of Oxford


                                              ABSTRACT


                       Many advances in Natural Language Processing have been based upon more expressive
                       models for how inputs interact with the context in which they occur. Recurrent
                       networks, which have enjoyed a modicum of success, still lack the generalization
                       and systematicity ultimately required for modelling language. In this work, we
                       propose an extension to the venerable Long Short-Term Memory in the form of
                       mutual gating of the current input and the previous output. This mechanism affords
                       the modelling of a richer space of interactions between inputs and their context.
                       Equivalently, our model can be viewed as making the transition function given
                       by the LSTM context-dependent. Experiments demonstrate markedly improved
                       generalization on language modelling in the range of 3–4 perplexity points on Penn
                       Treebank and Wikitext-2, and 0.01–0.05 bpc on four character-based datasets. We
                       establish a new state of the art on all datasets with the exception of Enwik8, where
                       we close a large gap between the LSTM and Transformer models.


                  1 INTRODUCTION

                 The domination of Natural Language Processing by neural models is hampered only by their limited
                 ability to generalize and questionable sample complexity (Belinkov and Bisk 2017; Jia and Liang
                 2017; Iyyer et al. 2018; Moosavi and Strube 2017; Agrawal et al. 2016), their poor grasp of grammar
                 (Linzen et al. 2016; Kuncoro et al. 2018), and their inability to chunk input sequences into meaningful
                 units (Wang et al. 2017). While direct attacks on the latter are possible, in this paper, we take a
                 language-agnostic approach to improving Recurrent Neural Networks (RNN, Rumelhart et al. (1988)),
                 which brought about many advances in tasks such as language modelling, semantic parsing, machine
                 translation, with no shortage of non-NLP applications either (Bakker 2002; Mayer et al. 2008). Many
                 neural models are built from RNNs including the sequence-to-sequence family (Sutskever et al. 2014)
                 and its attention-based branch (Bahdanau et al. 2014). Thus, innovations in RNN architecture tend to
                 have a trickle-down effect from language modelling, where evaluation is often the easiest and data
                 the most readily available, to many other tasks, a trend greatly strengthened by ULMFiT (Howard
                 and Ruder 2018), ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018), which promote language
                 models from architectural blueprints to pretrained building blocks.
                 To improve the generalization ability of language models, we propose an extension to the LSTM
                 (Hochreiter and Schmidhuber 1997), where the LSTM’s input x is gated conditioned on the output of
                 the previous step h_prev. Next, the gated input is used in a similar manner to gate the output of the
                 previous time step. After a couple of rounds of this mutual gating, the last updated x and h_prev are
                 fed to an LSTM. By introducing these additional gating operations, in one sense, our model joins
                 the long list of recurrent architectures with gating structures of varying complexity which followed
                 the invention of Elman Networks (Elman 1990). Examples include the LSTM, the GRU (Chung et al.
                 2015), and even designs by Neural Architecture Search (Zoph and Le 2016).
                 Intuitively, in the lowermost layer, the first gating step scales the input embedding (itself a representation
                 of the average context in which the token occurs) depending on the actual context, resulting in a
                 contextualized representation of the input. While intuitive, as Section 4 shows, this interpretation
                 cannot account for all the observed phenomena.
                 In a more encompassing view, our model can be seen as enriching the mostly additive dynamics of
                 recurrent transitions, placing it in the company of the Input Switched Affine Network (Foerster et al.
                 2017) with a separate transition matrix for each possible input, and the Multiplicative RNN (Sutskever
                 et al. 2011), which factorizes the three-way tensor of stacked transition matrices. Also following
                 this line of research are the Multiplicative Integration LSTM (Wu et al. 2016) and – closest to our
                 model in the literature – the Multiplicative LSTM (Krause et al. 2016). The results in Section 3.4
                 demonstrate the utility of our approach, which consistently improves on the LSTM and establishes a
                 new state of the art on all but the largest dataset, Enwik8, where we match similarly sized transformer
                 models.

                              <<FIGURE>>

                 Figure 1: Mogrifier with 5 rounds of updates. The previous state $h^0 = h_{prev}$ is transformed linearly (dashed
                 arrows), fed through a sigmoid and gates $x^{-1} = x$ in an element-wise manner producing $x^1$. Conversely, the
                 linearly transformed $x^1$ gates $h^0$ and produces $h^2$. After a number of repetitions of this mutual gating cycle, the
                 last values of the h and x sequences are fed to an LSTM cell. The prev subscript of h is omitted to reduce clutter.

                  2 MODEL

                 To allow for ease of subsequent extension, we present the standard LSTM update (Sak et al. 2014)
                 with input and state of size m and n respectively as the following function:

                                      $$\mathrm{LSTM}: \mathbb{R}^m \times \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n \times \mathbb{R}^n, \qquad \mathrm{LSTM}(x, c_{prev}, h_{prev}) = (c, h)$$

                 The updated state c and the output h are computed as follows:

                                      $$\begin{aligned}
                                      f &= \sigma(W^{fx} x + W^{fh} h_{prev} + b^f) \\
                                      i &= \sigma(W^{ix} x + W^{ih} h_{prev} + b^i) \\
                                      j &= \tanh(W^{jx} x + W^{jh} h_{prev} + b^j) \\
                                      o &= \sigma(W^{ox} x + W^{oh} h_{prev} + b^o) \\
                                      c &= f \odot c_{prev} + i \odot j \\
                                      h &= o \odot \tanh(c)
                                      \end{aligned}$$

                 where $\sigma$ is the logistic sigmoid function, $\odot$ is the elementwise product, and $W^{**}$ and $b^{*}$ are weight
                 matrices and biases.
                 While the LSTM is typically presented as a solution to the vanishing gradients problem, its gate i
                 can also be interpreted as scaling the rows of weight matrices $W^{j*}$ (ignoring the non-linearity in
                 j). In this sense, the LSTM nudges Elman Networks towards context-dependent transitions and
                 the extreme case of Input Switched Affine Networks. If we took another, larger step towards that
                 extreme, we could end up with Hypernetworks (Ha et al. 2016). Here, instead, we take a more
                 cautious step, and equip the LSTM with gates that scale the columns of all its weight matrices $W^{**}$
                 in a context-dependent manner. The scaling of the matrices $W^{*x}$ (those that transform the cell input)
                 makes the input embeddings dependent on the cell state, while the scaling of $W^{*h}$ does the reverse.
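The LSTM update described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the usual forget (f), input (i), candidate (j) and output (o) gate formulation; the helper names `init_lstm_params` and `lstm_step`, the parameter dictionary layout and the initialization scale are hypothetical choices made for this sketch, not taken from the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(m, n, rng):
    """Hypothetical helper: per gate, an (n x m) input matrix, an (n x n) state matrix and a bias."""
    params = {}
    for gate in "fijo":
        params["W" + gate + "x"] = rng.normal(0.0, 0.1, size=(n, m))
        params["W" + gate + "h"] = rng.normal(0.0, 0.1, size=(n, n))
        params["b" + gate] = np.zeros(n)
    return params

def lstm_step(x, c_prev, h_prev, p):
    """One LSTM update: maps (x, c_prev, h_prev) to the new cell state c and output h."""
    f = sigmoid(p["Wfx"] @ x + p["Wfh"] @ h_prev + p["bf"])  # forget gate
    i = sigmoid(p["Wix"] @ x + p["Wih"] @ h_prev + p["bi"])  # input gate
    j = np.tanh(p["Wjx"] @ x + p["Wjh"] @ h_prev + p["bj"])  # candidate update
    o = sigmoid(p["Wox"] @ x + p["Woh"] @ h_prev + p["bo"])  # output gate
    c = f * c_prev + i * j  # gate i scales each row of the candidate j
    h = o * np.tanh(c)
    return c, h

rng = np.random.default_rng(0)
params = init_lstm_params(m=4, n=3, rng=rng)
c, h = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), params)
```

Because `i` multiplies `j` elementwise, it acts as a per-row scaling of the candidate's weight matrices, which is the interpretation the text relies on.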
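                 The Mogrifier LSTM is an LSTM where two inputs x and h_prev modulate one another in
                 an alternating fashion before the usual LSTM computation takes place (see Fig. 1). That is,
                 $\mathrm{Mogrify}(x, c_{prev}, h_{prev}) = \mathrm{LSTM}(x^{\uparrow}, c_{prev}, h^{\uparrow}_{prev})$, where the modulated inputs $x^{\uparrow}$ and $h^{\uparrow}_{prev}$ are
                 defined as the highest indexed $x^i$ and $h^i_{prev}$, respectively, from the interleaved sequences

                              $$x^i = 2\sigma(Q^i h^{i-1}_{prev}) \odot x^{i-2}, \quad \text{for odd } i \in [1 \ldots r] \qquad (1)$$

                              $$h^i_{prev} = 2\sigma(R^i x^{i-1}) \odot h^{i-2}_{prev}, \quad \text{for even } i \in [1 \ldots r] \qquad (2)$$

                 with $x^{-1} = x$ and $h^0_{prev} = h_{prev}$. The number of “rounds”, $r \in \mathbb{N}$, is a hyperparameter; $r = 0$
                 recovers the LSTM. Multiplication with the constant 2 ensures that randomly initialized $Q^i$, $R^i$
                 matrices result in transformations close to identity. To reduce the number of additional model
                 parameters, we typically factorize the $Q^i$, $R^i$ matrices as products of low-rank matrices: $Q^i = Q^i_{left} Q^i_{right}$
                 with $Q^i \in \mathbb{R}^{m \times n}$, $Q^i_{left} \in \mathbb{R}^{m \times k}$, $Q^i_{right} \in \mathbb{R}^{k \times n}$, where $k < \min(m, n)$ is the rank.

                    ¹ It’s like a transmogrifier² without the magic: it can only shrink or expand objects.
                    ² Transmogrify (verb, 1650s): to completely alter the form of something in a surprising or magical manner.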
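The mutual gating rounds can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each round multiplies one vector elementwise by twice the sigmoid of a linear map of the other, with odd rounds updating x and even rounds updating h_prev. The names `mogrify` and `low_rank`, and the initialization scale, are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def low_rank(rows, cols, k, rng, scale=0.1):
    """Hypothetical helper: a (rows x cols) matrix of rank at most k, built as left @ right."""
    left = rng.normal(0.0, scale, size=(rows, k))
    right = rng.normal(0.0, scale, size=(k, cols))
    return left @ right

def mogrify(x, h_prev, Q, R):
    """Alternately gate x (odd rounds, matrices Q) and h_prev (even rounds, matrices R).

    Q holds one (m x n) matrix per odd round and R one (n x m) matrix per even
    round, so len(Q) - len(R) must be 0 or 1. Empty lists mean zero rounds.
    """
    rounds = len(Q) + len(R)
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            x = 2.0 * sigmoid(Q[i // 2] @ h_prev) * x
        else:
            h_prev = 2.0 * sigmoid(R[i // 2 - 1] @ x) * h_prev
    return x, h_prev

rng = np.random.default_rng(0)
m, n, k, r = 8, 6, 2, 5
Q = [low_rank(m, n, k, rng) for _ in range((r + 1) // 2)]  # rounds 1, 3, 5
R = [low_rank(n, m, k, rng) for _ in range(r // 2)]        # rounds 2, 4
x_up, h_up = mogrify(rng.normal(size=m), rng.normal(size=n), Q, R)

# Parameters per gating matrix: full rank m*n versus low rank k*(m+n).
full_params, low_rank_params = m * n, k * (m + n)
```

With all-zero gating matrices every gate equals 2·σ(0) = 1, so the inputs pass through unchanged. This is why the factor 2 keeps small random initializations close to the identity. The modulated pair would then be fed to an ordinary LSTM step.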

                  3 EXPERIMENTS

                  3.1 THE CASE FOR SMALL-SCALE

                 Before describing the details of the data, the experimental setup and the results, we take a short detour
                 to motivate work on smaller-scale datasets. A recurring theme in the history of sequence models is
                 that the problem of model design is intermingled with optimizability and scalability. Elman Networks
                 are notoriously difficult to optimize, a property that ultimately gave birth to the idea of the LSTM,
                 but also to more recent models such as the Unitary Evolution RNN (Arjovsky et al. 2016) and fixes
                 like gradient clipping (Pascanu et al. 2013). Still, it is far from clear – if we could optimize these
                 models well – how different their biases would turn out to be. The non-separability of model and
                 optimization is fairly evident in these cases.
                 Scalability, on the other hand, is often optimized for indirectly. Given the limited ability of current
                 models to generalize, we often compensate by throwing more data at the problem. To fit a larger
                 dataset, model size must be increased. Thus the best performing models are evaluated based on their
                 scalability³. Today, scaling up still yields tangible gains on down-stream tasks, and language
                 modelling data is abundant. However, we believe that simply scaling up will not solve the generalization
                 problem and better models will be needed. Our hope is that by choosing small enough datasets, so
                 that model size is no longer the limiting factor, we get a number of practical advantages:

                  • Generalization ability will be more clearly reflected in evaluations even without domain adaptation.

                  • Turnaround time in experiments will be reduced, and the freed up computational budget can be
                    put to good use by controlling for nuisance factors.

                  • The transient effects of changing hardware performance characteristics are somewhat lessened.

                 Thus, we develop, analyse and evaluate models primarily on small datasets. Evaluation on larger
                 datasets is included to learn more about the models’ scaling behaviour and because of its relevance
                 for applications, but it is to be understood that these evaluations come with much larger error bars
                 and provide more limited guidance for further research on better models.

                  3.2 DATASETS

                 We compare models on both word and character-level language modelling datasets. The two
                 word-level datasets we picked are the Penn Treebank (PTB) corpus by Marcus et al. (1993) with
                 preprocessing from Mikolov et al. (2010) and Wikitext-2 by Merity et al. (2016), which is about twice
                 the size of PTB with a larger vocabulary and lighter preprocessing. These datasets are definitely
                 on the small side, but – and because of this – they are suitable for exploring different model biases.
                 Their main shortcoming is the small vocabulary size, only in the tens of thousands, which makes
                 them inappropriate for exploring the behavior of the long tail. For that, open vocabulary language
                 modelling and byte pair encoding (Sennrich et al. 2015) would be an obvious choice. Still, our
                 primary goal here is the comparison of the LSTM and Mogrifier architectures, thus we instead opt
                 for character-based language modelling tasks, where vocabulary size is not an issue, the long tail
                 is not truncated, and there are no additional hyperparameters as in byte pair encoding that make
                 fair comparison harder. The first character-based corpus is Enwik8 from the Hutter Prize dataset
                 (Hutter 2012). Following common practice, we use the first 90 million characters for training and
                 the remaining 10 million evenly split between validation and test. The character-level task on the

                    ³ Note that the focus on scalability is not a problem per se. Indeed the unsupervised pretraining methods
                 (Peters et al. 2018; Devlin et al. 2018) take great advantage of this approach.

                 Table 1: Word-level perplexities of near state-of-the-art models, our LSTM baseline and the Mogrifier on PTB
                 and Wikitext-2. Models with Mixture of Softmaxes (Yang et al. 2017) are denoted with MoS, depth N with dN.
                 MC stands for Monte-Carlo dropout evaluation. Previous state-of-the-art results in italics. Note the comfortable
                 margin of 2.8–4.3 perplexity points the Mogrifier enjoys over the LSTM.

                                                             <<TABLE>>

                 Mikolov preprocessed PTB corpus (Merity et al. 2018) is unique in that it has the disadvantages of
                 closed vocabulary without the advantages of word-level modelling, but we include it for comparison
                 to previous work. The final character-level dataset is the Multilingual Wikipedia Corpus (MWC,
                 Kawakami et al. (2017)), from which we focus on the English and Finnish language subdatasets in
                 the single text, large setting.
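The Enwik8 split described in this section (the first 90 million characters for training, the remainder divided evenly between validation and test) can be written as a small helper. This is only a sketch: the function name `split_enwik8` is hypothetical, and it assumes the corpus has already been loaded as a single byte or character sequence.

```python
def split_enwik8(data, n_train=90_000_000):
    """Split a sequence into train / validation / test following the 90M / 5M / 5M convention."""
    train = data[:n_train]
    rest = data[n_train:]
    half = len(rest) // 2
    return train, rest[:half], rest[half:]

# Toy usage with a 100-element stand-in for the 100-million-character corpus.
toy = bytes(range(100))
train, valid, test = split_enwik8(toy, n_train=90)
```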

                  3.3 SETUP

                 We tune hyperparameters following the experimental setup of Melis et al. (2018) using a black-box
                 hyperparameter tuner based on batched Gaussian Process Bandits (Golovin et al. 2017). For the
                 LSTM, the tuned hyperparameters are the same: input_embedding_ratio, learning_rate, l2_penalty,
                 input_dropout, inter_layer_dropout, state_dropout, output_dropout. For the Mogrifier, the number
                 of rounds r and the rank k of the low-rank approximation are also tuned (allowing for full rank, too).
                 For word-level tasks, BPTT (Werbos et al. 1990) window size is set to 70 and batch size to 64. For
                 character-level tasks, BPTT window size is set to 150 and batch size to 128 except for Enwik8 where
                 the window size is 500. Input and output embeddings are tied for word-level tasks following Inan
                 et al. (2016) and Press and Wolf (2016). Optimization is performed with Adam (Kingma and Ba
                 2014) with $\beta_1 = 0$, a setting that resembles RMSProp without momentum. Gradients are clipped
                 (Pascanu et al. 2013) to norm 10. We switch to averaging weights similarly to Merity et al. (2017)
                 after a certain number of checkpoints with no improvement in validation cross-entropy or at 80% of
                 the training time at the latest. We found no benefit to using two-step finetuning.
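To make the optimization setting concrete, the sketch below shows gradient clipping to norm 10 and an Adam step with the first-moment decay set to zero, which is what "RMSProp without momentum" suggests. Several details are assumptions made for illustration: clipping is applied to the global norm over all gradients, `beta2` and `eps` take common default values, and the function names are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def adam_step(param, grad, m, v, t, lr, beta1=0.0, beta2=0.999, eps=1e-8):
    """One Adam update at step t >= 1; with beta1 = 0 the first moment is just the current gradient."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # equals m when beta1 = 0
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

grads, norm_before = clip_by_global_norm([np.array([30.0]), np.array([40.0])])
param, m, v = adam_step(np.zeros(1), grads[0], np.zeros(1), np.zeros(1), t=1, lr=0.1)
```

With `beta1 = 0` the update divides the raw gradient by a running root-mean-square of past gradients, so no momentum term accumulates across steps.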
                 Model evaluation is performed with the standard, deterministic dropout approximation or Monte-
                 Carlo averaging (Gal and Ghahramani 2016) where explicitly noted (MC). In standard dropout
                 evaluation, dropout is turned off while in MC dropout predictions are averaged over randomly
                 sampled dropout masks (200 in our experiments). Optimal softmax temperature is determined on
                 the validation set, and in the MC case dropout rates are scaled (Melis et al. 2018). Finally, we report
                 results with and without dynamic evaluation (Krause et al. 2017). Hyperparameters for dynamic
                 evaluation are tuned using the same method (see Appendix A for details).
                 We make the code and the tuner output available at https://github.com/deepmind/lamb.
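The two evaluation modes from this section can be illustrated on a toy softmax layer. The sketch below is not taken from the repository above. It assumes inverted dropout on a single hidden vector, arithmetic averaging of predicted probabilities over sampled masks, and a softmax temperature that divides the logits; all function names are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def deterministic_predict(h, W, temperature=1.0):
    """Standard evaluation: dropout is turned off."""
    return softmax(W @ h, temperature)

def mc_dropout_predict(h, W, rate, n_samples, rng, temperature=1.0):
    """Monte-Carlo evaluation: average predictions over randomly sampled dropout masks."""
    keep = 1.0 - rate
    probs = np.zeros(W.shape[0])
    for _ in range(n_samples):
        mask = rng.random(h.shape) < keep
        probs += softmax(W @ (h * mask / keep), temperature)
    return probs / n_samples

rng = np.random.default_rng(0)
W, h = rng.normal(size=(5, 4)), rng.normal(size=4)
p_det = deterministic_predict(h, W)
p_mc = mc_dropout_predict(h, W, rate=0.5, n_samples=200, rng=rng)
```

In practice the temperature would be chosen to minimize validation cross-entropy, as the text describes.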

                  3.4 RESULTS

                 Table 1 lists our results on word-level datasets. On the PTB and Wikitext-2 datasets, the Mogrifier
                 has lower perplexity than the LSTM by 3–4 perplexity points regardless of whether or not dynamic
                 evaluation (Krause et al. 2017) and Monte-Carlo averaging are used. On both datasets, the state of
                 the art is held by the AWD LSTM (Merity et al. 2017) extended with Mixture of Softmaxes (Yang
                 et al. 2017) and FRAGE (Gong et al. 2018). The Mogrifier improves the state of the art without either
                 of these methods on PTB, and without FRAGE on Wikitext-2.

                   Table 2: Bits per character on character-based datasets of near state-of-the-art models, our LSTM baseline
                   and the Mogrifier. Previous state-of-the-art results in italics. Depth N is denoted with dN. MC stands for
                   Monte-Carlo dropout evaluation. Once again the Mogrifier strictly dominates the LSTM and sets a new state of
                   the art on all but the Enwik8 dataset where with dynamic evaluation it closes the gap to the Transformer-XL of
                   similar size († Krause et al. (2019), ‡ Ben Krause, personal communications, May 17, 2019). On most datasets,
                   model size was set large enough for underfitting not to be an issue. This was very much not the case with Enwik8,
                   so we grouped models of similar sizes together for ease of comparison. Unfortunately, a couple of dynamic
                   evaluation test runs diverged (NaN) on the test set and some were just too expensive to run (Enwik8, MC).

                                                                   <<TABLE>>

                   Table 2 lists the character-level modelling results. On all datasets, our baseline LSTM results are much
                   better than those previously reported for LSTMs, highlighting the issue of scalability and experimental
                   controls. In some cases, these unexpectedly large gaps may be down to lack of hyperparameter tuning
                   as in the case of Merity et al. (2017), or in others, to using a BPTT window size (50) that is too small
                   for character-level modelling (Melis et al. 2017) in order to ﬁt the model into memory. The Mogriﬁer
                   further improves on these baselines by a considerable margin. Even the smallest improvement of
                   0.012 bpc on the highly idiosyncratic, character-based, Mikolov preprocessed PTB task is equivalent
                   to gaining about 3 perplexity points on word-level PTB. MWC, which was built for open-vocabulary
                   language modelling, is a much better smaller-scale character-level dataset. On the English and the
                   Finnish corpora in MWC, the Mogriﬁer enjoys a gap of 0.033-0.046 bpc. Finally, on the Enwik8
                   dataset, the gap is 0.029-0.039 bpc in favour of the Mogriﬁer.
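
The correspondence between a gain in bits per character and a gain in word-level perplexity can be checked with a short calculation. Bits per word equal bits per character times the average number of characters per word, so a gain of Δ bpc divides word-level perplexity by 2^(Δ · c). The sketch below is illustrative only: the value of 5.6 characters per word (including the separating space) and the baseline perplexity of 60 are assumptions of ours, not figures from the experiments.

```python
def perplexity_ratio(bpc_gain, chars_per_word):
    """Factor by which word-level perplexity shrinks for a given per-character gain in bits."""
    return 2.0 ** (bpc_gain * chars_per_word)

def perplexity_points(bpc_gain, chars_per_word, word_perplexity):
    """Absolute perplexity points gained when starting from word_perplexity."""
    return word_perplexity * (1.0 - 1.0 / perplexity_ratio(bpc_gain, chars_per_word))

# Assumed values: 5.6 characters per word and a baseline perplexity of 60.
gain = perplexity_points(0.012, chars_per_word=5.6, word_perplexity=60.0)
```

Under these assumptions the gain comes out a little under 3 perplexity points, which is consistent with the rough equivalence stated above.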

                                                       <<FIGURE>>

                    Figure 2: “No-zigzag” Mogriﬁer for the ablation study. Gating is always based on the original inputs.

                               Table 3: PTB ablation study validation perplexities with 24M parameters.
                
                                                         <<TABLE>>


                 Of particular note is the comparison to Transformer-XL (Dai et al. 2019), a state-of-the-art model
                 on larger datasets such as Wikitext-103 and Enwik8. On PTB, without dynamic evaluation, the
                 Transformer-XL is on par with our LSTM baseline which puts it about 3.5 perplexity points behind
                 the Mogriﬁer. On Enwik8, also without dynamic evaluation, the Transformer-XL has a large, 0.09 bpc
                 advantage at similar parameter budgets, but with dynamic evaluation this gap disappears. However,
                 we did not test the Transformer-XL ourselves, so fair comparison is not possible due to differing
                 experimental setups and the rather sparse result matrix for the Transformer-XL.

                  4 ANALYSIS

                  4.1 ABLATION STUDY

                 The Mogriﬁer consistently outperformed the LSTM in our experiments. The optimal settings were
                 similar across all datasets, with <<FORMULA>> and <<FORMULA>> (see Appendix B for a discussion of
                 hyperparameter sensitivity). In this section, we explore the effect of these hyperparameters and show
                 that the proposed model is not unnecessarily complicated. To save computation, we tune all models
                 using a shortened schedule with only 145 epochs instead of 964 and a truncated BPTT window
                 size of 35 on the word-level PTB dataset, and evaluate using the standard, deterministic dropout
                 approximation with a tuned softmax temperature.
                 Fig. 3 shows that the number of rounds r greatly influences the results. Second, we found the low-rank
                 factorization of Q^i and R^i to help a bit, but the full-rank variant is close behind, which is what we
                 observed on other datasets as well. Finally, to verify that the alternating gating scheme is not overly
                 complicated, we condition all newly introduced gates on the original inputs x and h_prev (see Fig. 2).
                 That is, instead of Eq. 1 and Eq. 2 the no-zigzag updates are

                              <<FORMULA>>

                 In our experiments, the no-zigzag variant underperformed the baseline Mogrifier by a small but
                 significant margin, and was on par with the <<FORMULA>> model in Fig. 3, suggesting that the Mogrifier's
                 iterative refinement scheme does more than simply widen the range of possible gating values of x
                 and h_prev to $(0, 2^{\lceil r/2 \rceil})$ and $(0, 2^{\lfloor r/2 \rfloor})$, respectively.
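
The ranges quoted above follow from each gate lying in (0, 2): x is rescaled in the odd rounds, ⌈r/2⌉ times in total, and h_prev in the even rounds, ⌊r/2⌋ times. The sketch below illustrates this under the assumption that each round multiplies one of the two vectors elementwise by 2σ(·) of a linear map of the other. The `zigzag=False` branch gates on the original inputs only, as in the ablation. Function names, sizes, and the random weights are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mogrify(x, h, Qs, Rs, zigzag=True):
    """Alternately rescale x (odd rounds) and h (even rounds) with gates in (0, 2)."""
    x0, h0 = x, h
    rounds = len(Qs) + len(Rs)
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            # Odd round: gate x on the current h (or on the original h without zigzag).
            src = h if zigzag else h0
            gate = matvec(Qs[i // 2], src)
            x = [2.0 * sigmoid(g) * xi for g, xi in zip(gate, x)]
        else:
            # Even round: gate h on the current x (or on the original x without zigzag).
            src = x if zigzag else x0
            gate = matvec(Rs[i // 2 - 1], src)
            h = [2.0 * sigmoid(g) * hi for g, hi in zip(gate, h)]
    return x, h

rng = random.Random(0)
m, n, r = 3, 4, 5  # input size, state size, number of rounds (illustrative values)
rand = lambda rows, cols: [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
Qs = [rand(m, n) for _ in range(math.ceil(r / 2))]
Rs = [rand(n, m) for _ in range(r // 2)]
x_in = [rng.gauss(0, 1) for _ in range(m)]
h_in = [rng.gauss(0, 1) for _ in range(n)]
x_out, h_out = mogrify(x_in, h_in, Qs, Rs)
x_nz, h_nz = mogrify(x_in, h_in, Qs, Rs, zigzag=False)
```

Both variants span the same range of scaling factors per element, so any difference in their results cannot be attributed to the range alone.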

                  4.2 COMPARISON TO THE mLSTM

                 The Multiplicative LSTM (Krause et al. 2016), or mLSTM for short, is closest to our model in
                 the literature. It is defined as mLSTM(x, c_prev, h_prev) = LSTM(x, c_prev, h^m_prev), where h^m_prev
                 is given by <<FORMULA>>. In this formulation, the differences are readily apparent. First, the mLSTM
                 allows for multiplicative interaction between x and h_prev, but it only overrides h_prev, while in the
                 Mogrifier the interaction is two-way, which, as the ablation study showed, is important. Second,
                 the mLSTM can change not only the magnitude but also the sign of values in h_prev, something with
                 which we experimented in the Mogrifier, but could not get to work. Furthermore, in the definition of
                 h^m_prev, the unsquashed linearities and their elementwise product make the mLSTM more sensitive to
                 initialization and unstable during optimization.

                                                  <<FIGURE>>

                 Figure 4: Cross-entropy vs sequence length in the reverse copy task with i.i.d. tokens. Lower is better. The
                 Mogrifier is better than the LSTM even in this synthetic task with no resemblance to natural language.

                 On the Enwik8 dataset, we greatly improved on the published results of the mLSTM (Krause et al.
                 2016). In fact, even our LSTM baseline outperformed the mLSTM by 0.03 bpc. We also conducted
                 experiments on PTB based on our reimplementation of the mLSTM following the same methodology
                 as the ablation study and found that the mLSTM did not improve on the LSTM (see Table 3).
                 Krause et al. (2016) posit and verify the recovery hypothesis which says that having just suffered
                 a large loss, the loss on the next time step will be smaller on average for the mLSTM than for the
                 LSTM. This was found not to be the case for the Mogriﬁer. Neither did we observe a signiﬁcant
                 change in the gap between the LSTM and the Mogriﬁer in the tied and untied embeddings settings,
                 which would be expected if recovery was affected by x and h_prev being in different domains.


                  4.3 THE REVERSE COPY TASK

                 Our original motivation for the Mogriﬁer was to allow the context to amplify salient and attenuate
                 nuisance features in the input embeddings. We conduct a simple experiment to support this point
                 of view. Consider the reverse copy task where the network reads an input sequence of tokens and
                 a marker token after which it has to repeat the input in reverse order. In this simple sequence-to-
                 sequence learning (Sutskever et al. 2014) setup, the reversal is intended to avoid the minimal time lag
                 problem (Hochreiter and Schmidhuber 1997), which is not our focus here.
                 The experimental setup is as follows. For the training set, we generate 500000 examples by uniformly
                 sampling a given number of tokens from a vocabulary of size 1000. The validation and test sets
                 are constructed similarly, and contain 10000 examples. The model consists of an independent,
                 unidirectional encoder and a decoder, whose total number of parameters is 10 million. The decoder
                 is initialized from the last state of the encoder. Since overfitting is not an issue here, no dropout is
                 necessary, and we only tune the learning rate, the l2 penalty, and the embedding size for the LSTM.
                 For the Mogrifier, the number of rounds r and the rank k of the low-rank approximation are also
                 tuned.
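
The data generation for the reverse copy task can be sketched as follows. This is a minimal sketch: the function name and the choice of id for the marker token (one past the sampled vocabulary) are assumptions made for illustration and are not taken from the released code.

```python
import random

def make_reverse_copy_example(length, vocab_size=1000, rng=random):
    """Return (source, target): uniformly sampled tokens plus a marker, and the tokens reversed."""
    tokens = [rng.randrange(vocab_size) for _ in range(length)]
    marker = vocab_size  # hypothetical marker id, outside the sampled vocabulary
    return tokens + [marker], tokens[::-1]

rng = random.Random(0)
# A handful of examples of length 50; the training set described above has 500000.
dataset = [make_reverse_copy_example(50, rng=rng) for _ in range(5)]
```

Because tokens are drawn independently and uniformly, the task has no linguistic structure and no rare tokens, which is what makes it a useful control.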
                 We compare the case where both the encoder and decoder are LSTMs to where both are Mogriﬁers.
                 Fig. 4a shows that, for sequences of length 50 and 100, both models can solve the task perfectly. At
                 higher lengths though, the Mogriﬁer has a considerable advantage. Examining the best hyperparameter
                 settings found, the embedding/hidden sizes for the LSTM and Mogriﬁer are 498/787 vs 41/1054 at
                 150 steps, and 493/790 vs 181/961 at 200 steps. Clearly, the Mogriﬁer was able to work with a much
                 smaller embedding size than the LSTM, which is in line with our expectations for a model with a
                 more ﬂexible interaction between the input and recurrent state. We also conducted experiments with
                 a larger model and vocabulary size, and found the effect even more pronounced (see Fig. 4b).

                  4.4 WHAT THE MOGRIFIER IS NOT

                 The results on the reverse copy task support our hypothesis that input embeddings are enriched by
                 the Mogriﬁer architecture, but that cannot be the full explanation as the results of the ablation study
                 indicate. In the following, we consider a number of hypotheses about where the advantage of the
                 Mogrifier lies and the experiments that provide evidence against them.

                  • Hypothesis: the benefit is in scaling x and h_prev. We verified that data dependency is a crucial
                    feature by adding a learnable scaling factor to the LSTM inputs. We observed no improvement.
                    Also, at extremely low-rank (less than 5) settings where the amount of information in its gating is
                    small, the Mogrifier loses its advantage.

                  • Hypothesis: the benefit is in making optimization easier. We performed experiments with different
                    optimizers (SGD, RMSProp), with intra-layer batch normalization and layer normalization on
                    the LSTM gates. While we cannot rule out an effect on optimization difficulty, in all of these
                    experiments the gap between the LSTM and the Mogrifier was the same.

                  • Hypothesis: exact tying of embeddings is too constraining, the benefit is in making this
                    relationship less strict. Experiments conducted with untied embeddings and character-based models
                    demonstrate improvements of similar magnitude.

                  • Hypothesis: the benefit is in the low-rank factorization of <<FORMULA>> implicitly imposing structure on
                    the LSTM weight matrices. We observed that the full-rank Mogrifier also performed better than
                    the plain LSTM. We conducted additional experiments where the LSTM's gate matrices were
                    factorized and observed no improvement.

                  • Hypothesis: the benefit comes from better performance on rare words. The observed advantage
                    on character-based modelling is harder to explain based on frequency. Also, in the reverse copy
                    experiments, a large number of tokens were sampled uniformly, so there were no rare words at all.

                  • Hypothesis: the benefit is specific to the English language. This is directly contradicted by the
                    Finnish MWC and the reverse copy experiments.

                  • Hypothesis: the benefit is in handling long-range dependencies better. Experiments in the episodic
                    setting (i.e. sentence-level language modelling) exhibited the same gap as the non-episodic ones.

                  • Hypothesis: the scaling up of inputs saturates the downstream LSTM gates. The idea here is that
                    saturated gates may make states more stable over time. We observed the opposite: the means
                    of the standard LSTM gates were very close between the two models, but their
                    variance was smaller in the Mogrifier.


                  5 CONCLUSIONS AND FUTURE WORK

                 We presented the Mogriﬁer LSTM, an extension to the LSTM, with state-of-the-art results on
                  several language modelling tasks. Our original motivation for this work was that the context-free
                  representation of input tokens may be a bottleneck in language models and by conditioning the
                 input embedding on the recurrent state some beneﬁt was indeed derived. While it may be part of the
                 explanation, this interpretation clearly does not account for the improvements brought by conditioning
                 the recurrent state on the input and especially the applicability to character-level datasets. Positioning
                 our work on the Multiplicative RNN line of research offers a more compelling perspective.
                 To give more credence to this interpretation, in the analysis we highlighted a number of possible
                  alternative explanations, and ruled them all out to varying degrees. In particular, the connection
                  to the mLSTM is weaker than expected as the Mogrifier does not exhibit improved recovery (see
                  Section 4.2), and on PTB the mLSTM works only as well as the LSTM. At the same time, the
                  evidence against easier optimization is weak, and the Mogriﬁer establishing some kind of sharing
                 between otherwise independent LSTM weight matrices is a distinct possibility.
                 Finally, note that as shown by Fig. 1 and Eq. 1-2, the Mogrifier is a series of preprocessing steps
                 composed with the LSTM function, but other architectures, such as Mogriﬁer GRU or Mogriﬁer
                 Elman Network are possible. We also leave investigations into other forms of parameterization of
                 context-dependent transitions for future work.

                    ACKNOWLEDGMENTS

                   We would like to thank Ben Krause for the Transformer-XL dynamic evaluation results, Laura
                   Rimell, Aida Nematzadeh, Angeliki Lazaridou, Karl Moritz Hermann, Daniel Fried for helping with
                   experiments, Chris Dyer, Sebastian Ruder and Jack Rae for their valuable feedback.


                    REFERENCES
                   Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models.
                     arXiv preprint arXiv:1606.07356, 2016.

                   Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International
                     Conference on Machine Learning, pages 1120–1128, 2016.

                   Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to
                     align and translate. arXiv preprint arXiv:1409.0473, 2014.

                   Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. arXiv preprint
                     arXiv:1810.06682, 2018.

                   Bram Bakker. Reinforcement learning with long short-term memory. In Advances in neural information
                     processing systems, pages 1475–1482, 2002.

                   Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation. arXiv
                     preprint arXiv:1711.02173, 2017.

                   Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural
                     networks. In International Conference on Machine Learning, pages 2067–2075, 2015.

                   Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan
                     Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint
                     arXiv:1901.02860, 2019.

                   Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
                     transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

                   Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

                   Jakob N Foerster, Justin Gilmer, Jascha Sohl-Dickstein, Jan Chorowski, and David Sussillo. Input switched
                     affine networks: An rnn architecture designed for interpretability. In Proceedings of the 34th International
                     Conference on Machine Learning-Volume 70, pages 1136–1145. JMLR.org, 2017.

                   Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks.
                     In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.

                   Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google
                     vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International
                     Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.

                   Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: frequency-agnostic word
                     representation. In Advances in Neural Information Processing Systems, pages 1334–1345, 2018.

                   David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

                   Sepp Hochreiter and Jürgen Schmidhuber. Lstm can solve hard long time lag problems. In Advances in neural
                     information processing systems, pages 473–479, 1997.

                   Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint
                     arXiv:1801.06146, 2018.

                   Marcus Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 6, 2012.

                   Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework
                     for language modeling. CoRR, abs/1611.01462, 2016. URL http://arxiv.org/abs/1611.01462.

                   Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example generation with
                     syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059, 2018.

                   Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. arXiv preprint
                     arXiv:1707.07328, 2017.

                   Kazuya Kawakami, Chris Dyer, and Phil Blunsom. Learning to create and reuse words in open-vocabulary
                     neural language modeling. arXiv preprint arXiv:1704.06986, 2017.

                   Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
                     2014.

                   Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative LSTM for sequence modelling. CoRR,
                     abs/1609.07959, 2016. URL http://arxiv.org/abs/1609.07959.

                   Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence
                     models. arXiv preprint arXiv:1709.07432, 2017.

                   Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language
                     models. arXiv preprint arXiv:1904.08378, 2019.

                   Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. Lstms can learn
                     syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th
                     Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436,
                     2018.

                   Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of lstms to learn syntax-sensitive
                     dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016.

                   Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of
                     english: The Penn treebank. Computational linguistics, 19(2):313–330, 1993.

                   Hermann Mayer, Faustino Gomez, Daan Wierstra, Istvan Nagy, Alois Knoll, and Jürgen Schmidhuber. A system
                     for robotic heart surgery that learns to tie knots using recurrent neural networks. Advanced Robotics, 22
                     (13-14):1521–1537, 2008.

                   Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models.
                     arXiv preprint arXiv:1707.05589, 2017.

                   Gábor Melis, Charles Blundell, Tomáš Kočiský, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. Pushing
                     the bounds of dropout. arXiv preprint arXiv:1805.09208, 2018.

                   Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. CoRR,
                     abs/1609.07843, 2016. URL http://arxiv.org/abs/1609.07843.

                   Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models.
                     arXiv preprint arXiv:1708.02182, 2017.

                   Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple
                     scales. arXiv preprint arXiv:1803.08240, 2018.

                   Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural
                     network based language model. In Interspeech, volume 2, page 3, 2010.

                   Naﬁse Sadat Moosavi and Michael Strube. Lexical features in coreference resolution: To be used with caution.
                     arXiv preprint arXiv:1704.06779, 2017.

                   Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difﬁculty of training recurrent neural networks. In
                     International conference on machine learning, pages 1310–1318, 2013.

                   Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
                     Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

                   Ofir Press and Lior Wolf. Using the output embedding to improve language models. CoRR, abs/1608.05859,
                     2016. URL http://arxiv.org/abs/1608.05859.

                   David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating
                     errors. Cognitive modeling, 5(3):1, 1988.

                   Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent neural
                     network architectures for large vocabulary speech recognition. CoRR, abs/1402.1128, 2014. URL
                     http://arxiv.org/abs/1402.1128.

                   Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword
                     units. arXiv preprint arXiv:1508.07909, 2015.

                   Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In
                     Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.

                   Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances
                     in neural information processing systems, pages 3104–3112, 2014.

                   Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. Sequence
                     modeling via segmentations. In Proceedings of the 34th International Conference on Machine Learning-
                     Volume 70, pages 3674–3683. JMLR.org, 2017.

                   Paul J Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78
                     (10):1550–1560, 1990.

                   Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. On multiplicative
                     integration with recurrent neural networks. In Advances in neural information processing systems, pages
                     2856–2864, 2016.

                   Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: a
                     high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.

                   Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578,
                     2016. URL http://arxiv.org/abs/1611.01578.


                  APPENDIX A HYPERPARAMETER TUNING RANGES

                 In all experiments, we tuned hyperparameters using Google Vizier (Golovin et al. 2017). The tuning
                 ranges are listed in Table 4. Obviously, mogrifier_rounds and mogrifier_rank are tuned only for the
                 Mogrifier. If input_embedding_ratio > 1, then the input/output embedding sizes and the hidden
                 sizes are set to be equal and the linear projection from the cell output into the output embedding space
                 is omitted. Similarly, mogrifier_rank ≤ 0 is taken to mean full rank <<FORMULA>> without factorization.
                 Since Enwik8 is a much larger dataset, we don't tune input_embedding_ratio and specify tighter
                 tuning ranges for dropout based on preliminary experiments (see Table 5).
                 Dynamic evaluation hyperparameters were tuned according to Table 6. The highest possible value
                 for max_time_steps, the BPTT window size, was 20 for word-level and 50 for character-level tasks. The
                 batch size for estimating the mean squared gradients over the training data was set to 1024, gradient
                 clipping was turned off, and the l2 penalty was set to zero.
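
The two special cases above can be made concrete with a small sketch. It is hypothetical throughout: `resolve_model_config` is not a function of the released code, and the rule that the embedding size equals the ratio times the hidden size when the ratio is at most 1 is our assumption. Only the handling of input_embedding_ratio > 1 and mogrifier_rank ≤ 0 follows the description.

```python
def resolve_model_config(hidden_size, input_embedding_ratio, mogrifier_rank):
    """Map raw tuner values to concrete sizes (illustrative helper, not the authors' code)."""
    if input_embedding_ratio > 1:
        # Embedding and hidden sizes are tied, so no output projection is needed.
        embedding_size, use_output_projection = hidden_size, False
    else:
        # Assumption: the ratio scales the hidden size to give the embedding size.
        embedding_size = max(1, round(input_embedding_ratio * hidden_size))
        use_output_projection = True
    # A non-positive rank stands for full-rank matrices without factorization.
    rank = None if mogrifier_rank <= 0 else mogrifier_rank
    return {
        "embedding_size": embedding_size,
        "use_output_projection": use_output_projection,
        "mogrifier_rank": rank,
    }

tied = resolve_model_config(hidden_size=1000, input_embedding_ratio=1.5, mogrifier_rank=-1)
untied = resolve_model_config(hidden_size=1000, input_embedding_ratio=0.5, mogrifier_rank=40)
```

Encoding these cases in the tuning range lets a single continuous or integer parameter cover both the factorized and the unfactorized variant.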

                              Table 4: Hyperparameter tuning ranges for all tasks except Enwik8.

                                                   <<TABLE>>


                                   Table 5: Hyperparameter tuning ranges for Enwik8.

                                                  <<TABLE>>


                               Table 6: Hyperparameter tuning ranges for dynamic evaluation.

                                                  <<TABLE>>


                  APPENDIX B HYPERPARAMETER SENSITIVITY

                 The parallel coordinate plots in Figs. 5 and 6 give a rough idea about hyperparameter sensitivity. The
                 red lines correspond to hyperparameter combinations closest to the best solution found. To ﬁnd the
                 closest combinations, we restricted the range for each hyperparameter separately to about 15% of its
                 entire tuning range.
                 For both the LSTM and the Mogriﬁer, the results are at most 1.2 perplexity points off the best result,
                 so our results are somewhat insensitive to jitter in the hyperparameters. Still, in this setup, grid search
                 would require orders of magnitude more trials to ﬁnd comparable solutions.
                 On the other hand, the tuner does take advantage of the stochasticity of training, and repeated runs
                 with the same parameters may give slightly worse results. To gauge the extent of this effect, on
                 PTB we estimated the standard deviation in reruns of the LSTM with the best hyperparameters to be
                 about 0.2 perplexity points, but the mean was about 0.7 perplexity points off the result produced with
                 the weights saved in the best tuning run.

                                              <<FIGURE>>

                 Figure 5: Average per-word validation cross-entropies for hyperparameter combinations in the neighbourhood of
                 the best solution for a 2-layer LSTM with 24M weights on the Penn Treebank dataset.

                                                  <<FIGURE>>

                 Figure 6: Average per-word validation cross-entropies for hyperparameter combinations in the neighbourhood
                 of the best solution for a 2-layer Mogriﬁer LSTM with 24M weights on the Penn Treebank dataset.
                 feature_mask_rank and feature_mask_rounds are aliases for mogrifier_rank and mogrifier_rounds.
<|endoftext|>


<|startoftext|>
                                      Movement Pruning:
                              Adaptive Sparsity by Fine-Tuning


                                Victor Sanh 1 , Thomas Wolf 1 , Alexander M. Rush 1,2
                                       1 Hugging Face, 2 Cornell University
                             {victor,thomas}@huggingface.co;arush@cornell.edu


                                              Abstract

                       Magnitude pruning is a widely used strategy for reducing model size in pure
                       supervised learning; however, it is less effective in the transfer learning regime that
                       has become standard for state-of-the-art natural language processing applications.
                       We propose the use of movement pruning, a simple, deterministic ﬁrst-order weight
                       pruning method that is more adaptive to pretrained model ﬁne-tuning. We give
                       mathematical foundations to the method and compare it to existing zeroth- and
                       ﬁrst-order pruning methods. Experiments show that when pruning large pretrained
                       language models, movement pruning shows signiﬁcant improvements in high-
                       sparsity regimes. When combined with distillation, the approach achieves minimal
                       accuracy loss with down to only 3% of the model parameters.


                 1 Introduction

                 Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art
                 performance in applications in natural language processing and related ﬁelds. In this setup, a large
                 model pretrained on a massive generic dataset is then ﬁne-tuned on a smaller annotated dataset to
                 perform a speciﬁc end-task. Model accuracy has been shown to scale with the pretrained model and
                 dataset size [Raffel et al., 2019]. However, signiﬁcant resources are required to ship and deploy these
                 large models, and training the models has high environmental costs [Strubell et al., 2019].
                 Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at
                 only a small cost in accuracy. Pruning methods, which remove weights based on their importance,
                 are a particularly simple and effective method for compressing models to be sent to edge devices such
                 as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high
                 absolute values, is the most widely used method for weight pruning. It has been applied to a large
                 variety of architectures in computer vision [Guo et al., 2016], in language processing [Gale et al.,
                 2019], and more recently has been leveraged as a core component in the lottery ticket hypothesis
                 [Frankle et al., 2019].
                 While magnitude pruning is highly effective for standard supervised learning, it is inherently less
                 useful in the transfer learning regime. In supervised learning, weight values are primarily determined
                 by the end-task training data. In transfer learning, weight values are mostly predetermined by the
                 original model and are only fine-tuned on the end task. This prevents magnitude-based methods from
                 learning to prune based on the fine-tuning step, or “fine-pruning.”
                 In this work, we argue that to effectively reduce the size of models for transfer learning, one should
                 instead use movement pruning, i.e., pruning approaches that consider the changes in weights during
                 fine-tuning. Movement pruning differs from magnitude pruning in that weights with both low and
                 high values can be pruned if they shrink during training. This strategy moves the selection criteria
                 from the 0th to the 1st-order and facilitates greater pruning based on the ﬁne-tuning objective. To
                 test this approach, we introduce a particularly simple, deterministic version of movement pruning
                 utilizing the straight-through estimator [Bengio et al., 2013].
                 We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019,
                 Vaswani et al., 2017] on a diverse set of ﬁne-tuning tasks. In highly sparse regimes (less than 15% of
                 remaining weights), we observe signiﬁcant improvements over magnitude pruning and other 1st-order
                 methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original
                 BERT performance with only 5% of the encoder’s weights on natural language inference (MNLI)
                 [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of
                 the differences between magnitude pruning and movement pruning shows that the two methods lead
                 to radically different pruned models with movement pruning showing greater ability to adapt to the
                 end-task.

                 2 Related Work

                 In addition to magnitude pruning, there are many other approaches for generic model weight pruning.
                 Most similar to our approach are methods for using parallel score matrices to augment the weight
                 matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to
                 convolutional networks. Unlike our method, these methods keep the weights of the model fixed
                 (either from a randomly initialized network or a pre-trained network) and the scores are updated to
                 ﬁnd a good sparse subnetwork.
                 Many previous works have also explored using higher-order information to select prunable weights.
                 LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for
                 deletion. Our method does not require the (possibly costly) computation of second-order derivatives
                 since the importance scores are obtained simply as the by-product of the standard fine-tuning. Theis
                 et al. [2018] and Ding et al. [2019] use the absolute value or the squared value of the gradient. In
                 contrast, we found it useful to preserve the direction of movement in our algorithm.
                 Compressing pretrained language models for transfer learning is also a popular area of study. Other
                 approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning
                 [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model
                 and targets individual weights. We also show that having a teacher can further improve our approach.
                 Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train
                 sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning
                 stage. Finally, another popular compression approach is quantization. Quantization has been applied
                 to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014],
                 providing high memory compression rates at little or no cost in performance. As shown in
                 previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and
                 can be combined to further improve the performance/size ratio.

                 3 Background: Score-Based Pruning

                 We ﬁrst establish shared notation for discussing different neural network pruning strategies. Let
                 <<FORMULA>> refer to a generic weight matrix in the model (we consider square matrices, but they
                 could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of
                 associated importance scores <<FORMULA>>. Given importance scores, each pruning strategy computes a
                 mask <<FORMULA>>. Inference for an input x becomes <<FORMULA>>, where <<FORMULA>> is the Hadamard
                 product. A common strategy is to keep the top-v percent of weights by importance. We define <<FORMULA>>
                 as a function which selects the v% highest values in S:
                 
                 <<FORMULA>>                                     (1)

                 Magnitude-based weight pruning determines the mask based on the absolute value of each weight <<FORMULA>>
                 as a measure of importance. Formally, we have importance scores <<FORMULA>>, and masks <<FORMULA>>. 
                 There are several extensions to this base setup. Han et al. [2015] use iterative magnitude
                 pruning: the model is first trained until convergence and weights with the lowest
                 magnitudes are removed afterward. The sparsiﬁed model is then re-trained with the removed weights
                 ﬁxed to 0. This loop is repeated until the desired sparsity level is reached.
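As a rough illustration of the masking step described above (a sketch, not the code used for the experiments), the Top v selection and magnitude-based scores can be written in plain Python. The function names `top_v_mask`, `magnitude_mask`, and `masked_matvec` are hypothetical, and rounding the number of kept entries up is an assumption.

```python
import math


def top_v_mask(scores, v):
    """Return a 0/1 mask that keeps the v% highest entries of a 2-D score matrix."""
    flat = [s for row in scores for s in row]
    # Number of entries to keep; rounding up is an assumption of this sketch.
    k = math.ceil(len(flat) * v / 100.0)
    if k <= 0:
        return [[0 for _ in row] for row in scores]
    # Smallest score that is still kept. Ties at the threshold are all kept,
    # so slightly more than k entries may survive when scores repeat.
    threshold = sorted(flat, reverse=True)[k - 1]
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def magnitude_mask(weights, v):
    """Magnitude pruning: the importance score of a weight is its absolute value."""
    scores = [[abs(w) for w in row] for row in weights]
    return top_v_mask(scores, v)


def masked_matvec(weights, mask, x):
    """Inference with the elementwise-masked matrix: a = (W * M) x."""
    return [
        sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


W = [[0.9, -0.1], [0.05, -0.7]]
M = magnitude_mask(W, 50)
print(M)  # [[1, 0], [0, 1]]
print(masked_matvec(W, M, [1.0, 2.0]))
```

With v = 50, the two entries of largest absolute value (0.9 and -0.7) are kept and the other two are masked out.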

                                                  <<FORMULA>>

                 Table 1: Summary of the pruning methods considered in this work and their speciﬁcities. The
                 expression of <<FORMULA>> regularization is detailed in Eq (3).


                 In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements
                 magnitude pruning by allowing masked weights to be updated such that they are not fixed for the
                 entire duration of the training. Automated gradual pruning enables the model to recover from previous
                 masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level <<FORMULA>>
                 during training using a cubic sparsity scheduler: <<FORMULA>>.
                 The sparsity level <<FORMULA>> at time step <<FORMULA>> is increased from an initial value v_i (usually 0)
                 to a final value v_f in n pruning steps after t_i steps of warm-up. The model is thus pruned and trained jointly.
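For reference, the cubic schedule of Zhu and Gupta [2018] is commonly written in the form below. The pruning frequency \Delta t (the number of training steps between two mask updates) is a symbol taken from that work and is an assumption here, since the expression itself is elided above.

```latex
v^{(t)} = v_f + (v_i - v_f)\left(1 - \frac{t - t_i}{n\,\Delta t}\right)^{3},
\qquad t \in \{\, t_i,\ t_i + \Delta t,\ \dots,\ t_i + n\,\Delta t \,\}
```

As a quick check, t = t_i gives v^{(t)} = v_i and t = t_i + n \Delta t gives v^{(t)} = v_f. Because of the cube, most of the change in sparsity happens early in the ramp and the schedule flattens as it approaches v_f.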

                 4 Movement Pruning

                 Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running
                 model. In this work, we focus on movement pruning methods where importance is derived from
                 ﬁrst-order information. Intuitively, instead of selecting weights that are far from zero, we retain
                 connections that are moving away from zero during the training process. We consider two versions of
                 movement pruning: hard and soft.
                 For (hard) movement pruning, masks are computed using the Top v function: <<FORMULA>>. Unlike magnitude
                 pruning, during training, we learn both the weights <<FORMULA>> and their importance scores S.
                 During the forward pass, we compute for all <<FORMULA>>.
                 Since the gradient of Top v is 0 everywhere it is defined, we follow Ramanujan et al. [2020] and Mallya
                 and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al.,
                 2013]. In the backward pass, Top v is ignored and the gradient goes "straight-through" to S. The
                 approximation of the gradient of the loss L with respect to <<FORMULA>> is given by

                                        <<FORMULA>>                   (2)
                                         
                 This implies that the scores of weights are updated, even if these weights are masked in the forward
                 pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
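The straight-through step can be sketched as follows. This is a minimal illustration that assumes the forward pass is a_i = sum_k W[i][k] * M[i][k] * x[k] and that ignoring Top v in the backward pass gives dL/dS[i][j] = dL/da[i] * W[i][j] * x[j]. The function names and the toy numbers are hypothetical.

```python
def masked_forward(weights, mask, x):
    """Forward pass: a[i] = sum_k W[i][k] * M[i][k] * x[k]."""
    return [
        sum(w * m * xk for w, m, xk in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def ste_score_gradient(weights, x, grad_a):
    """Straight-through gradient of the loss with respect to the scores.

    The mask is treated as the identity function of the scores in the
    backward pass, so dL/dS[i][j] = dL/da[i] * W[i][j] * x[j].
    The mask itself does not appear in this expression.
    """
    return [
        [g * w * xj for w, xj in zip(w_row, x)]
        for g, w_row in zip(grad_a, weights)
    ]


W = [[0.5, -2.0], [1.0, 0.25]]
M = [[1, 0], [0, 1]]          # W[0][1] and W[1][0] are masked in the forward pass
x = [1.0, 2.0]
grad_a = [1.0, -1.0]          # dL/da, assumed to come from the rest of the network

print(masked_forward(W, M, x))
print(ste_score_gradient(W, x, grad_a))
```

Every entry of the score gradient is generally non-zero, including the entries whose weights are masked in the forward pass. This is what allows a pruned weight to regain a high score later in training.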
                 We also consider a relaxed (soft) version of movement pruning based on the binary mask function
                 described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter v with a fixed global
                 threshold value <<FORMULA>> that controls the binary mask. The mask is calculated as <<FORMULA>>. In order to
                 control the sparsity level, we add a regularization term <<FORMULA>> which encourages
                 the importance scores to decrease over time 1 . The coefﬁcient <<FORMULA>> controls the penalty intensity and
                 thus the sparsity level.
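A minimal sketch of the soft variant is given below. It assumes that the mask is the elementwise test S > tau and that the penalty is the coefficient times the sum of sigmoids of the scores. Both expressions are elided above, so they are assumptions of this sketch, as are the names `tau`, `lam`, `threshold_mask`, and `score_penalty`.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def threshold_mask(scores, tau):
    """Binary mask: keep a weight when its score exceeds the global threshold tau."""
    return [[1 if s > tau else 0 for s in row] for row in scores]


def score_penalty(scores, lam):
    """Regularization term (assumed form): lam * sum of sigmoid(S[i][j])."""
    return lam * sum(sigmoid(s) for row in scores for s in row)


S = [[0.2, -1.0], [0.0, 3.0]]
print(threshold_mask(S, tau=0.0))   # [[1, 0], [0, 1]]
print(score_penalty(S, lam=0.1))
```

Unlike Top v, the number of kept weights is not fixed in advance: it depends on how many scores stay above the threshold, which in turn depends on how strongly the penalty pushes the scores down.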
                 Finally, we note that these approaches yield an update similar to that of L0 regularization based pruning, another
                 movement based pruning approach [Louizos et al., 2017]. Instead of straight-through, L0 uses the
                 hard-concrete distribution, where the mask M is sampled for all <<FORMULA>> with hyperparameters <<FORMULA>>,
                 <<FORMULA>>, and <<FORMULA>>:
                 
                 <<FORMULA>>

                    1 We also experimented with <<FORMULA>> but it turned out to be harder to tune while giving similar results.

                                                  <<FORMULA>>

                 Figure 1: During fine-tuning (on MNLI), the weights stay close to their pre-trained values which
                 limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are
                 plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning
                 selects weights that are moving away from 0.

                 The expected L0 norm has a closed form involving the parameters of the hard-concrete: E(L0) =
                 <<FORMULA>>. Thus, the weights and scores of the model can be optimized in
                 an end-to-end fashion to minimize the sum of the training loss L and the expected L0 penalty. A
                 coefficient λ_l0 controls the L0 penalty and indirectly the sparsity level. Gradients take a similar form:

                         <<FORMULA>>     (3)
                           
                 At test time, a non-stochastic estimation of the mask is used: <<FORMULA>>
                  and weights multiplied by 0 can simply be discarded.
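For readers unfamiliar with the hard-concrete distribution, the sketch below follows the general form given by Louizos et al. [2017], with the importance score playing the role of the location parameter. The hyperparameter names `b` (temperature), `l`, and `r` (bounds of the stretch interval, with l < 0 and r > 1), the default values used in the example, and all function names are assumptions of this sketch, not a transcription of the elided formulas.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_hard_concrete(score, b, l, r, rng=random):
    """Draw one stochastic mask value in [0, 1] for a single score."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s_bar = sigmoid((math.log(u) - math.log(1.0 - u) + score) / b)
    z = (r - l) * s_bar + l                       # stretch to the interval (l, r)
    return min(1.0, max(0.0, z))                  # clip to [0, 1]


def deterministic_mask(score, l, r):
    """Non-stochastic estimate used at test time (no noise, same stretch and clip)."""
    return min(1.0, max(0.0, (r - l) * sigmoid(score) + l))


def prob_nonzero(score, b, l, r):
    """Probability that the sampled mask value is non-zero (one term of the expected L0 norm)."""
    return sigmoid(score - b * math.log(-l / r))


rng = random.Random(0)
print([round(sample_hard_concrete(0.0, 2.0 / 3.0, -0.1, 1.1, rng), 3) for _ in range(5)])
print(deterministic_mask(10.0, -0.1, 1.1), deterministic_mask(-10.0, -0.1, 1.1))
```

Because the stretched value is clipped, the distribution puts non-zero probability mass on exactly 0 and exactly 1, which is what lets weights be discarded outright at test time.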

                                      <<TABLE>>

                 Table 1 highlights the characteristics of each pruning method. The main differences are in the masking
                 functions, pruning structure, and the ﬁnal gradient form.

                 Method Interpretation. In movement pruning, the gradient of L with respect to <<FORMULA>> is given
                 by the standard gradient derivation: <<FORMULA>>. By combining it with Eq (2), we
                 have <<FORMULA>> (we omit the binary mask term <<FORMULA>> for simplicity). From the gradient update in
                 Eq (2), <<FORMULA>> is increasing when <<FORMULA>>, which happens in two cases:

                      <<FORMULA>>           (a) 
                      <<FORMULA>>           (b) 

                 It means that during training <<FORMULA>> is increasing while being positive or is decreasing while being
                 negative. It is equivalent to saying that S_{i,j} is increasing when <<FORMULA>> is moving away from 0. Inversely,
                 <<FORMULA>> is decreasing when ∂L/∂S_{i,j} > 0, which means that <<FORMULA>> is shrinking towards 0.
                 While magnitude pruning selects the most important weights as the ones which maximize their
                 distance to 0 (<<FORMULA>>), movement pruning selects the weights which are moving the most away from
                 0 (<<FORMULA>>). For this reason, magnitude pruning can be seen as a 0th order method, whereas movement
                 pruning is based on a 1st order signal. In fact, S can be seen as an accumulator of movement: from
                 equation (2), after T gradient updates, we have

                                                <<FORMULA>>                   (4) 

                                    <<FIGURE>>

                 Figure 1 shows this difference empirically by comparing weight values during fine-tuning against
                 their pre-trained value. As observed by Gordon et al. [2020], fine-tuned weights stay close in absolute
                 value to their initial pre-trained values. For magnitude pruning, this stability around the pre-trained
                 values implies that we know with high confidence before even fine-tuning which weights will be
                 pruned as the weights with the smallest absolute value at pre-training will likely stay small and be
                 pruned. In contrast, in movement pruning, the pre-trained weights do not have such an awareness of
                 the pruning decision since the selection is made during fine-tuning (moving away from 0), and both
                 low and high values can be pruned. We posit that this is critical for the success of the approach as it
                 is able to prune based on the task-specific data, not only the pre-trained value.
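The accumulator view of the scores from Eq (4) can be illustrated on a single weight. The sketch below is a toy example under stated assumptions: the loss is L(w) = 0.5 * (w - target)^2, the input and mask terms are dropped, and the score update is S <- S - lr_s * (dL/dw) * w at each step. The function name and learning rates are hypothetical.

```python
def accumulate_score(w0, target, steps=50, lr_w=0.1, lr_s=1.0):
    """Train one weight by gradient descent and accumulate its movement score."""
    w, score = w0, 0.0
    for _ in range(steps):
        grad_w = w - target           # dL/dw for L = 0.5 * (w - target)^2
        score -= lr_s * grad_w * w    # one term of the sum over gradient updates
        w -= lr_w * grad_w            # standard weight update
    return w, score


# A small weight that grows away from zero ends with a positive score.
print(accumulate_score(w0=0.1, target=1.0))
# A large weight that shrinks towards zero ends with a negative score.
print(accumulate_score(w0=1.0, target=0.0))
```

The first weight starts with a small magnitude and would be an early candidate for removal under magnitude pruning, yet its score is the higher of the two because it is moving away from zero.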

                 5 Experimental Setup

                 Transfer learning for NLP uses large pre-trained language models that are ﬁne-tuned on target tasks
                 [Ruder et al., 2019, Devlin et al., 2019, Radford et al., 2019, Liu et al., 2019]. We experiment with
                 task-specific pruning of BERT-base-uncased, a pre-trained model that contains roughly 84M parameters.
                 We freeze the embedding modules and ﬁne-tune the transformer layers and the task-speciﬁc head.
                 We perform experiments on three monolingual (English) tasks, which are common benchmarks for
                 the recent progress in transfer learning for NLP: question answering (SQuAD v1.1) [Rajpurkar et al.,
                 2016], natural language inference (MNLI) [Williams et al., 2018], and sentence similarity (QQP)
                 [Iyer et al., 2017]. The datasets respectively contain 88K, 393K, and 364K training examples. SQuAD
                 is formulated as a span extraction task, while MNLI and QQP are paired sentence classification tasks.
                 For a given task, we ﬁne-tune the pre-trained model for the same number of updates (between 6
                 and 10 epochs) across pruning methods 2 . We follow Zhu and Gupta [2018] and use cubic sparsity
                 scheduling for Magnitude Pruning (MaP), Movement Pruning (MvP), and Soft Movement Pruning
                 (SMvP). Adding a few steps of cool-down at the end of pruning empirically improves the performance
                 especially in high sparsity regimes. The schedule for v is:
                                 
                                   <<FORMULA>>                        (5)
                                     
                 where tf is the number of cool-down steps.
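A schedule with warm-up and cool-down phases could be sketched as follows. The exact expression of Eq (5) is not reproduced here: this sketch assumes that v is held at v_i during the t_i warm-up steps, follows a cubic ramp over the steps in between, and is held at v_f during the last t_f steps. The function name and the parameter `total_steps` are hypothetical.

```python
def sparsity_schedule(t, total_steps, v_i, v_f, t_i, t_f):
    """Value of v at training step t, with warm-up and cool-down plateaus."""
    if t < t_i:                        # warm-up: keep the initial value
        return v_i
    if t >= total_steps - t_f:         # cool-down: keep the final value
        return v_f
    ramp_steps = total_steps - t_i - t_f
    progress = (t - t_i) / ramp_steps  # goes from 0 towards 1 during the ramp
    return v_f + (v_i - v_f) * (1.0 - progress) ** 3


for step in (0, 10, 45, 79, 80, 99):
    print(step, round(sparsity_schedule(step, 100, v_i=0.0, v_f=0.9, t_i=10, t_f=20), 4))
```

During the cool-down plateau the mask budget no longer changes, so the remaining updates only adjust the surviving weights and scores.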
                 We compare our results against several state-of-the-art pruning baselines: Re-weighted Proximal
                 Pruning (RPP) [Guo et al., 2019] combines re-weighted L1 minimization and Proximal Projection
                 [Parikh and Boyd, 2014] to perform unstructured pruning. LayerDrop [Fan et al., 2020a] leverages
                 structured dropout to prune models at test time. For RPP and LayerDrop, we report results from
                 the authors. We also compare our method against the mini-BERT models, a collection of smaller BERT
                 models with varying hyper-parameters [Turc et al., 2019].

                 6 Results

                 Figure 2 displays the results for the main pruning methods at different levels of pruning on each
                 dataset. First, we observe the consistency of the comparison between magnitude and movement
                 pruning: at low sparsity (more than 70% of remaining weights), magnitude pruning outperforms
                 all methods with little or no loss with respect to the dense model whereas the performance of
                 movement pruning methods quickly decreases even for low sparsity levels. However, magnitude
                 pruning performs poorly with high sparsity, and the performance drops extremely quickly. In contrast,
                 ﬁrst-order methods show strong performances with less than 15% of remaining weights.
                 Table 2 shows the speciﬁc model scores for different methods at high sparsity levels. Magnitude
                 pruning on SQuAD achieves 54.5 F1 with 3% of the weights compared to 73.6 F1 with L0 regularization,
                 76.3 F1 for movement pruning, and 79.9 F1 with soft movement pruning. These experiments
                 indicate that in high sparsity regimes, importance scores derived from the movement accumulated
                 during ﬁne-tuning induce signiﬁcantly better pruned models compared to absolute values.
                 Next, we compare the difference in performance between first-order methods. We see that
                 straight-through based hard movement pruning (MvP) is comparable with L0 regularization (with a significant
                 gap in favor of movement pruning on QQP). Soft movement pruning (SMvP) consistently outperforms
                 hard movement pruning and L0 regularization by a strong margin and yields the strongest performance
                 among all pruning methods in high sparsity regimes. These comparisons support the fact that even if
                 movement pruning (and its relaxed version soft movement pruning) is simpler than L0 regularization,
                 it still yields stronger performance for the same compute budget.
                    2 Preliminary experiments showed that increasing the number of pruning steps tended to improve the end
                 performance.

                 Figure 2: Comparisons between different pruning methods in high sparsity regimes. Soft movement
                 pruning consistently outperforms other methods in high sparsity regimes. We plot the
                 performance of the standard fine-tuned BERT along with 95% of its performance.

                                                    <<FIGURE>>

                 Table 2: Performance at high sparsity levels. (Soft) movement pruning outperforms current
                 state-of-the-art pruning methods at different high sparsity levels.

                                <<TABLE>>

                 Finally, movement pruning and soft movement pruning compare favorably to the other baselines, 
                 except for QQP where RPP is on par with soft movement pruning. Movement pruning also outperforms
                 the fine-tuned mini-BERT models. This is consistent with [Li et al., 2020]: it is both more efficient and
                 more effective to train a large model and compress it afterward than to train a smaller model from
                 scratch. We do note though that current hardware does not support optimized inference for sparse
                 models: from an inference speed perspective, it might often be desirable to use a small dense model
                 such as mini-BERT over a sparse alternative of the same size.

                 Figure 3: Comparisons between different pruning methods augmented with distillation. Distillation
                 improves the performance across all pruning methods and sparsity levels.

                                                              <<FIGURE>>

                 Table 3: Distillation-augmented performances for selected high sparsity levels. All pruning methods
                 benefit from the distillation signal, further enhancing the ratio of performance to model size.

                                <<TABLE>>

                 Figure 4: Magnitude pruning and movement pruning lead to pruned models with radically different
                 weight distributions.

                                              <<FIGURE>>

                 Distillation further boosts performance. Following previous work, we can further leverage knowledge
                 distillation [Bucila et al., 2006, Hinton et al., 2014] to boost performance for free in the pruned
                 domain [Jiao et al., 2019, Sanh et al., 2019] using our baseline fine-tuned BERT-base model as
                 teacher. The training objective is a linear combination of the training loss and a knowledge distillation
                 loss on the output distributions. Figure 3 shows the results on SQuAD, MNLI, and QQP for the three
                 pruning methods boosted with distillation. Overall, we observe that the relative comparisons of the
                 pruning methods remain unchanged while performance strictly improves. Table 3 shows for
                 instance that on SQuAD, movement pruning at 10% goes from 81.7 F1 to 84.3 F1. When combined
                 with distillation, soft movement pruning yields the strongest performances across all pruning methods
                 and studied datasets: it reaches 95% of BERT-base with only a fraction of the weights in the encoder
                 (5% on SQuAD and MNLI).
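The distillation objective described above, a linear combination of the training loss and a loss on the output distributions, can be sketched for a single classification example. The mixing weight `alpha`, the softmax `temperature`, and the use of a KL divergence between teacher and student distributions are assumptions of this sketch; the text does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard training loss against the gold label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_objective(student_logits, teacher_logits, label,
                           alpha=0.5, temperature=2.0):
    """alpha * training loss + (1 - alpha) * distillation loss (assumed form)."""
    ce = cross_entropy(student_logits, label)
    kd = kl_divergence(softmax(teacher_logits, temperature),
                       softmax(student_logits, temperature))
    return alpha * ce + (1.0 - alpha) * kd


teacher = [3.0, 0.5, -1.0]
print(distillation_objective([2.0, 1.0, -0.5], teacher, label=0))
print(distillation_objective(teacher, teacher, label=0))  # distillation term vanishes
```

When the student reproduces the teacher's logits exactly, the distillation term is zero and only the weighted training loss remains.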

                 7 Analysis

                 Movement pruning is adaptive. Figure 4a compares the distribution of the remaining weights for
                 the same matrix of a model pruned at the same sparsity using magnitude and movement pruning. We
                 observe that by deﬁnition, magnitude pruning removes all the weights that are close to zero, ending
                 up with two clusters. In contrast, movement pruning leads to a smoother distribution, which covers
                 the whole interval except for values close to 0.
                 Figure 4b displays each individual weight against its associated importance score in movement
                 pruning. We plot pruned weights in grey. We observe that movement pruning induces no simple
                 relationship between the scores and the weights. Weights with either high or low
                 absolute value can be considered important. However, high scores are systematically associated with
                 non-zero weights (and thus the “v-shape”). This is consistent with the interpretation we gave to the
                 scores (section 4): a high score S indicates that during fine-tuning, the associated weight moved away
                 from 0 and is thus non-null.

                 Local and global masks perform similarly. We study the influence of the locality of the pruning
                 decision. While local Top v selects the v% most important weights matrix by matrix, global Top v
                 uncovers non-uniform sparsity patterns in the network by selecting the v% most important weights in
                 the whole network. Previous work has shown that a non-uniform sparsity across layers is crucial to
                 the performance in high sparsity regimes [He et al., 2018]. In particular, Mallya and Lazebnik [2018]
                 found that the sparsity tends to increase with the depth of the network layer.

                 Figure 5: Comparison of local and global selections of weights on SQuAD at different sparsity
                 levels. For magnitude and movement pruning, local and global Top v perform similarly at all
                 levels of sparsity.

                                                        <<FIGURE>>

                 Figure 6: Remaining weights per layer in the Transformer. Global magnitude pruning tends to
                 prune layers uniformly. Global 1st order methods allocate the weights to the lower layers while
                 heavily pruning the highest layers.

                                                <<FIGURE>>

                 Figure 5 compares the performance of local selection (matrix by matrix) against global selection
                 (all the matrices) for magnitude pruning and movement pruning. Despite being able to ﬁnd a
                 global sparsity structure, we found that global did not signiﬁcantly outperform local, except in high
                 sparsity regimes (2.3 F1 points of difference with 3% of remaining weights for movement pruning).
                 Even though the distillation signal boosts the performance of pruned models, the end performance
                 difference between local and global selections remains marginal.
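The difference between the two selection modes can be made concrete with a small sketch. It assumes that each weight matrix has its own score matrix, that local selection applies the v% budget to each matrix separately, and that global selection applies one threshold to the pooled scores. The function names and the toy scores are hypothetical.

```python
import math


def keep_threshold(values, v):
    """Smallest score that is still kept when retaining the top v% of values."""
    k = math.ceil(len(values) * v / 100.0)  # rounding up is an assumption
    if k <= 0:
        return math.inf                     # nothing is kept
    return sorted(values, reverse=True)[k - 1]


def local_masks(score_matrices, v):
    """Apply the v% budget to each matrix independently."""
    masks = []
    for scores in score_matrices:
        thr = keep_threshold([s for row in scores for s in row], v)
        masks.append([[1 if s >= thr else 0 for s in row] for row in scores])
    return masks


def global_masks(score_matrices, v):
    """Apply a single threshold computed over the scores of all matrices."""
    pooled = [s for scores in score_matrices for row in scores for s in row]
    thr = keep_threshold(pooled, v)
    return [[[1 if s >= thr else 0 for s in row] for row in scores]
            for scores in score_matrices]


layer_a = [[0.9, 0.8], [0.7, 0.6]]   # every score here beats every score in layer_b
layer_b = [[0.4, 0.3], [0.2, 0.1]]

print(local_masks([layer_a, layer_b], 50))   # half of each layer is kept
print(global_masks([layer_a, layer_b], 50))  # layer_a fully kept, layer_b fully pruned
```

Both modes keep the same total number of weights in this example, but only the global mode is free to distribute them unevenly across layers.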
                 Figure 6 shows the remaining weights percentage obtained per layer when the model is pruned until
                 10% with global pruning methods. Global weight pruning is able to allocate sparsity non-uniformly
                 through the network, and it has been shown to be crucial for the performance in high sparsity regimes
                 [He et al., 2018]. We notice that except for global magnitude pruning, all the global pruning methods
                 tend to allocate a signiﬁcant part of the weights to the lowest layers while heavily pruning in the
                 highest layers. Global magnitude pruning tends to prune similarly to local magnitude pruning, i.e.,
                 uniformly across layers.

                 8 Conclusion

                 We consider the case of pruning of pretrained models for task-speciﬁc ﬁne-tuning and compare
                 zeroth- and ﬁrst-order pruning methods. We show that a simple method for weight pruning based on
                 straight-through gradients is effective for this task and that it adapts using a ﬁrst-order importance
                 score. We apply this movement pruning to a transformer-based architecture and empirically show that
                 our method consistently yields strong improvements over existing methods in high-sparsity regimes.
                 The analysis demonstrates how this approach adapts to the ﬁne-tuning regime in a way that magnitude
                 pruning cannot. In future work, it would also be interesting to leverage group-sparsity inducing
                 penalties [Bach et al., 2011] to remove entire columns or ﬁlters. In this setup, we would associate a
                 score to a group of weights (a column or a row for instance). In the transformer architecture, it would
                 give a systematic way to perform feature selection and remove entire columns of the embedding
                 matrix.


                 References
                 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
                   Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
                   transformer. ArXiv, abs/1910.10683, 2019.
                 Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep
                   learning in nlp. In ACL, 2019.
                 Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for
                   efficient neural network. In NIPS, 2015.
                 Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network
                   with pruning, trained quantization and huffman coding. In ICLR, 2016.
                 Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NIPS,
                   2016.
                 Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv,
                   abs/1902.09574, 2019.
                 Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket
                   hypothesis at scale. ArXiv, abs/1903.01611, 2019.
                 Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients
                   through stochastic neurons for conditional computation. ArXiv, abs/1308.3432, 2013.
                 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
                   bidirectional transformers for language understanding. In NAACL, 2019.
                 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
                   Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
                 Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through
                   l0 regularization. In ICLR, 2017.
                 Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for
                   sentence understanding through inference. In NAACL, 2018.
                 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for
                   machine comprehension of text. In EMNLP, 2016.
                 Arun Mallya and Svetlana Lazebnik. Piggyback: Adding multiple tasks to a single, fixed network by
                   learning to mask. ArXiv, abs/1801.06519, 2018.
                 Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari.
                   What’s hidden in a randomly weighted neural network? In CVPR, 2020.
                 Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1989.
                 Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon: Extensions and
                   performance comparisons. In NIPS, 1993.
                 Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with
                   dense networks and fisher pruning. ArXiv, abs/1801.05787, 2018.
                 Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Ji Liu, and Jungong Han. Global sparse
                   momentum sgd for pruning very deep neural networks. In NeurIPS, 2019.
                 Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
                   bert: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop, 2019.
                 Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling
                   task-specific knowledge from bert into simple neural networks. ArXiv, abs/1903.12136, 2019.
                 Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with
                   structured dropout. In ICLR, 2020a.
                 Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS,
                   2019.
                 Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and
                   multiple languages: lottery tickets in rl and nlp. In ICLR, 2020.
                 Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou,
                   and Armand Joulin. Training with quantization noise for extreme model compression. ArXiv,
                   abs/2004.07320, 2020b.
                 Oﬁr Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert.ArXiv,
                   abs/1910.06188, 2019.
                 Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional
                   networks using vector quantization.ArXiv, abs/1412.6115, 2014.
                 Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gon-
                   zalez. Train large, then compress: Rethinking model size for efﬁcient training and inference of
                   transformers.ArXiv, abs/2002.11794, 2020.
                 Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efﬁcacy of pruning for model
                   compression. InICLR, 2018.
                 Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Compressing bert: Studying the effects of
                   weight pruning on transfer learning.ArXiv, abs/2002.08307, 2020.
                 Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in
                   natural language processing. InNAACL, 2019.
                 Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
                   models are unsupervised multitask learners. 2019.
                 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
                   Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
                   approach.ArXiv, abs/1907.11692, 2019.
                 Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First quora dataset release: Question pairs, 2017.
                   URLhttps://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
                 Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lian Lin, and Yanzhi Wang. Reweighted proximal
                   pruning for large-scale language representation.ArXiv, abs/1909.12486, 2019.
                  Neal Parikh and Stephen P. Boyd. Proximal algorithms.Found. Trends Optim., 1:127–239, 2014.
                 Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better:
                   The impact of student initialization on knowledge distillation.ArXiv, abs/1908.08962, 2019.
                 Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. InKDD, 2006.
                 Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
                   InNIPS, 2014.
                 Xiaoqi Jiao, Y. Yin, Lifeng Shang, Xin Jiang, Xusong Chen, Linlin Li, Fang Wang, and Qun Liu.
                   Tinybert: Distilling bert for natural language understanding.ArXiv, abs/1909.10351, 2019.
                 Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model
                   compression and acceleration on mobile devices. InECCV, 2018.
                 Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity
                   through convex optimization.Statistical Science, 27, 09 2011. doi: 10.1214/12-STS394.

                                              A Appendices

                 A.1 Guarantees on the decrease of the training loss

                 As the scores are updated, the relative order of the importances is likely shuffled, and some connections
                 will be replaced by more important ones. Under certain conditions, we are able to formally prove that
                 as these replacements happen, the training loss is guaranteed to decrease. Our proof is adapted from
                 [Ramanujan et al., 2020] to consider the case of fine-tunable W.
                 We suppose that (a) the training loss L is smooth and admits a first-order Taylor expansion
                 everywhere it is defined and (b) the learning rate of <<FORMULA>> is small. We define the TopK
                 function as the analog of the Top_v function, where k is an integer instead of a proportion. We first
                 consider the case where k = 1 in the TopK masking, meaning that only one connection is remaining
                 (and the other weights are deactivated/masked). Let's denote by <<FORMULA>> this sole remaining connection at
                 step t. Following Eq (1), it means that <<FORMULA>>
                 We suppose that at step t + 1, connections are swapped and the only remaining connection at step
                 <<FORMULA>>. We have:

                                   <<FORMULA>>

                 Eq (6) gives the following inequality: <<FORMULA>>. After re-injecting the gradient <<FORMULA>> update in Eq (2), we have:
                                          
                                         <<FORMULA>>                                (7)

                 Moreover, the conditions in Eq (6) lead to the following inferences: <<FORMULA>> and <<FORMULA>>.

                 Since <<FORMULA>> is small, <<FORMULA>> is also small. Because the training loss L is
                 smooth, we can write the first-order Taylor expansion of L at point <<FORMULA>>


                               <<FORMULA>>                         (8) 
                                
                 The first term is null because of inequalities (6) and the second term is negative because of inequality
                 (7). Thus <<FORMULA>> when connection <<FORMULA>> becomes more important than
                 <<FORMULA>>, the connections are swapped and the training loss decreases between step
                 t and <<FORMULA>>. Similarly, we can generalize the proof to a set <<FORMULA>> of N swapping connections.
                 We note that this proof is not specific to the TopK masking function. In fact, we can extend the proof
                 using the Threshold masking function <<FORMULA>> [Mallya and Lazebnik, 2018]. Inequalities
                 (6) are still valid and the proof stays unchanged.
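
The swap of the sole remaining connection in the k = 1 case can be illustrated with a small sketch. It assumes the scores are stored as a flat Python list and that TopK simply keeps the k highest-scoring entries; the function name `top_k_mask` and the toy score values are hypothetical and are not taken from the paper.

```python
from typing import List


def top_k_mask(scores: List[float], k: int) -> List[int]:
    """Return a 0/1 mask that keeps the k entries with the highest scores."""
    # Indices sorted from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


# Step t: connection 1 has the highest score, so it is the only one kept (k = 1).
scores_t = [0.3, 0.7, 0.1]
mask_t = top_k_mask(scores_t, k=1)

# Step t + 1: after a score update, connection 0 overtakes connection 1,
# so the two connections are swapped in the mask.
scores_next = [0.9, 0.7, 0.1]
mask_next = top_k_mask(scores_next, k=1)
```

Here `mask_t` selects connection 1 and `mask_next` selects connection 0, which is the swapping situation the proof reasons about.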

                 Last, we note these guarantees do not hold if we consider the absolute value of the scores |S_{i,j}| (as
                 it is done in Ding et al. [2019] for instance). We prove it by contradiction. If it was the case, it
                 would also be true for one specific case: the negative threshold masking function <<FORMULA>> where
                 <<FORMULA>>.
                 We suppose that at step t + 1, the only remaining connection (i, j) is replaced by (k, l):
                                  
                                                      <<FORMULA>>                 (9)

                 The inequality on the gradient update becomes: <<FORMULA>> and following
                 the same development as in Eq (8), we have <<FORMULA>>: the loss increases.
                 <<FORMULA>> We proved by contradiction that the guarantees on the decrease of the loss do not hold if we consider
                 the absolute value of the score as a proxy for importance.
<|endoftext|>


  <|startoftext|>
  Network Pruning

     As one of the earliest works in network pruning, Yann LeCun's Optimal Brain 
     Damage (OBD) paper is cited in many of these papers.
     Some research focuses on module network designs. "These models, such as 
     SqueezeNet, MobileNet and ShuffleNet, are basically made up of low resolutions 
     convolution with lesser parameters and better performance."
     Many recent papers I've read emphasize structured pruning (or sparsifying) as a 
     compression and regularization method, as opposed to other techniques such as 
     non-structured pruning (weight sparsifying and connection pruning), low rank 
     approximation and vector quantization (references to these approaches can be 
     found in the related work sections of the following papers). 
     Difference between structured and non-structured pruning:
       "Non-structured pruning aims to remove single parameters that have little 
       influence on the accuracy of networks". For example, L1-norm regularization on 
       weights is considered non-structured pruning, since it's basically a weight 
       sparsifying method, i.e., it removes single parameters. 
       The term 'structure' refers to a structured unit in the network. So instead of 
       pruning individual weights or connections, structured pruning targets neurons, 
       filters, channels, layers etc. But the general implementation idea is the same as 
       penalizing individual weights: introducing a regularization term (mostly in the 
       form of L1-norm) to the loss function to penalize (sparsify) structures.
     I focused on structured pruning and read through the following papers:

   1. Structured Pruning of Convolutional Neural Networks via L1 
     Regularization (August 2019)
       "(...) network pruning is useful to remove redundant parameters, filters, 
       channels or neurons, and address the over-fitting issue."

       Provides a good review of previous work on non-structured and structured 
       pruning.
       "This study presents a scheme to prune filters or neurons of fully-connected 
       layers based on L1 regularization to zero out the weights of some filters or 
       neurons."
       Didn't quite understand the method and implementation. There are two key 
       elements: mask and threshold. "(...) the problem of zeroing out the values of 
       some filters can be transformed to zero some mask." || "Though the proposed 
       method introduces mask, the network topology will be preserved because the        
       mask can be absorbed into weight." || "Here the mask value cannot be 
       completely zeroed in practical application, because the objective function (7) is 
       non-convex and the global optimal solution may not be obtained. A strategy is 
       adopted in the proposed method to solve this problem. If the order of 
       magnitude of the mask value is small enough, it can be considered almost as 
       zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...) 
       The average value of the product of the mask and the weight is used to 
       determine whether the mask is exactly zero or not."
       From what I understand they use L1 norm in the loss function to penalize 
       useless filters through penalizing masks. And a threshold value is introduced 
       to determine when the mask is small enough to be considered zero. 
       They test on MNIST (models: Lenet-5) and CIFAR-10 (models: VGG-16, 
       ResNet-32)
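
A minimal sketch of the mask-and-threshold idea as summarized above, assuming one scalar mask per filter, an L1 penalty on the masks, and a pruning test based on the mean absolute value of mask times weight. The use of the absolute value, the function names, and the toy numbers are assumptions made for illustration; the paper's exact criterion may differ.

```python
from typing import List


def l1_mask_penalty(masks: List[float], lam: float) -> float:
    """L1 regularization term added to the loss to push masks toward zero."""
    return lam * sum(abs(m) for m in masks)


def filters_to_prune(masks: List[float],
                     filters: List[List[float]],
                     threshold: float) -> List[int]:
    """Return indices of filters whose mean |mask * weight| is below threshold."""
    pruned = []
    for idx, (m, weights) in enumerate(zip(masks, filters)):
        mean_abs = sum(abs(m * w) for w in weights) / len(weights)
        if mean_abs < threshold:
            pruned.append(idx)
    return pruned


# Toy example: three filters with two (flattened) weights each.
masks = [1.0, 1e-4, 0.5]
filters = [[0.5, -0.5], [2.0, 1.0], [0.1, 0.1]]

penalty = l1_mask_penalty(masks, lam=0.1)
pruned = filters_to_prune(masks, filters, threshold=1e-2)
```

In this toy case only the second filter falls below the threshold, because its mask has been driven close to zero even though its raw weights are large.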

   2. Learning Efficient Convolutional Networks through Network Slimming (August 
     2017) + Git repo
       "Our approach imposes L1 regularization on the scaling factors in batch 
       normalization (BN) layers, thus it is easy to implement without introducing any 
       change to existing CNN architectures. Pushing the values of BN scaling factors 
       towards zero with L1 regularization enables us to identify insignificant channels 
       (or neurons), as each scaling factor corresponds to a specific convolutional 
       channel (or a neuron in a fully-connected layer)."
       They provide a good insight on advantages and disadvantages of other 
       computation reduction methods such as low rank approximation, vector 
       quantization etc. 
       I believe here they use the word 'channel' to refer to filters (?).
       "Our idea is introducing a scaling factor γ for each channel, which is multiplied 
       to the output of that channel. Then we jointly train the network weights and 
       these scaling factors, with sparsity regularization imposed on the latter. Finally 
       we prune those channels with small factors, and fine-tune the pruned network." 
       --> so instead of 'mask' they use the 'scaling factor' and impose regularization 
       on that, but the idea is very similar.
       "The way BN normalizes the activations motivates us to design a simple and 
       efficient method to incorporate the channel-wise scaling factors. Particularly, 
       BN layer normalizes the internal activations using mini-batch statistics." || 
       "(...) we can directly leverage the γ parameters in BN layers as the scaling factors 
       we need for network slimming. It has the great advantage of introducing no 
       overhead to the network."
       They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40), 
       ImageNet (model: VGG-A) and MNIST (model: Lenet)
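
A rough sketch of the two pieces quoted above: the L1 penalty on the BN scaling factors, and the selection of channels to keep after training. It assumes the γ values are available as one list per layer and that a single global threshold is chosen from a target prune ratio; the names `slimming_penalty`, `channels_to_keep`, and `prune_ratio` are hypothetical.

```python
from typing import List


def slimming_penalty(gammas_per_layer: List[List[float]], lam: float) -> float:
    """Sparsity term: lam times the sum of |gamma| over all BN scaling factors."""
    return lam * sum(abs(g) for layer in gammas_per_layer for g in layer)


def channels_to_keep(gammas_per_layer: List[List[float]],
                     prune_ratio: float) -> List[List[int]]:
    """Keep channels whose |gamma| is above a global threshold.

    The threshold is the |gamma| value at the prune_ratio position of the
    sorted list of all scaling factors (ties at the threshold are pruned).
    """
    all_g = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    n_prune = int(len(all_g) * prune_ratio)
    threshold = all_g[n_prune - 1] if n_prune > 0 else float("-inf")
    return [[i for i, g in enumerate(layer) if abs(g) > threshold]
            for layer in gammas_per_layer]


# Toy example: two BN layers with 3 and 4 channels.
gammas = [[0.9, 0.01, 0.5], [0.02, 0.7, 0.03, 0.8]]
penalty = slimming_penalty(gammas, lam=1e-4)
kept = channels_to_keep(gammas, prune_ratio=0.4)
```

After this selection step, the pruned network would be fine-tuned, as the quote describes.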

   3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo
       " (...) we propose Structured Sparsity Learning (SSL) method to directly learn a 
       compressed structure of deep CNNs by group Lasso regularization during the 
       training. SSL is a generic regularization to adaptively adjust multiple structures 
       in DNN, including structures of filters, channels, and filter shapes within each 
       layer, and structure of depth beyond the layers." || " (...) offering not only well-
       regularized big models with improved accuracy but greatly accelerated 
       computation."

       "Here W represents the collection of all weights in the DNN; ED(W) is the loss 
       on data; R(·) is non-structured regularization applying on every weight, e.g., 
       L2-norm; and Rg(·) is the structured sparsity regularization on each layer. Because 
       Group Lasso can effectively zero out all weights in some groups [14][15], we 
       adopt it in our SSL. The regularization of group Lasso on a set of weights w can 
       be represented as <<FORMULA>>, where w(g) is a group of partial weights in w and G 
       is the total number of groups. In SSL, the learned “structure” is decided by the 
       way of splitting groups of w(g). We investigate and formulate the filter-wise, 
       channel-wise, shape-wise, and depth-wise structured sparsity (...)"
       They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-20) 
       and ImageNet (model: AlexNet)
       The authors also provide a visualization of filters after pruning, showing that 
       only important detectors of patterns remain after pruning.

       In the conclusions: "Moreover, a variant of SSL can be performed as structure 
       regularization to improve classification accuracy of state-of-the-art DNNs."
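
To make the group Lasso term concrete, here is a small sketch that assumes the standard form, i.e. the sum over groups of the L2 norm of each group's weights. Treating each filter's flattened weights as one group (filter-wise sparsity) is one of the splits mentioned in the quote; the function name and the toy values are hypothetical.

```python
import math
from typing import List


def group_lasso(groups: List[List[float]]) -> float:
    """Sum of the L2 norms of the weight groups."""
    return sum(math.sqrt(sum(w * w for w in group)) for group in groups)


# Filter-wise grouping: each inner list is one filter's flattened weights.
filter_groups = [[3.0, 4.0], [0.0, 0.0], [1.0]]
penalty = group_lasso(filter_groups)  # 5.0 + 0.0 + 1.0
```

Because the norm is taken per group rather than per weight, the penalty tends to push a whole group to zero at once, which is what makes the resulting sparsity structured.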

   4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)
       "After an initial training phase, we remove all connections whose weight is 
       lower than a threshold. This pruning converts a dense, fully-connected layer to 
       a sparse layer." || "We then retrain the sparse network so the remaining 
       connections can compensate for the connections that have been removed. The 
       phases of pruning and retraining may be repeated iteratively to further reduce        
       network complexity. In effect, this training process learns the network 
       connectivity in addition to the weights (...)"
       Although the description above implies the pruning was done only for FC 
       layers, they also do pruning on convolutional layers - although they don't 
       provide much detail on this in the methods. But there's this statement when 
       they explain retraining: "(...) we fix the parameters for CONV layers and only 
       retrain the FC layers after pruning the FC layers, and vice versa.". The results 
       section also shows that convolutional layer connections were 
       pruned on the tested models.
       They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and 
       ImageNet (models: AlexNet, VGG-16)
       The authors provide a visualization of the sparsity patterns of neurons after 
       pruning (for an FC layer) which shows that pruning can detect visual attention 
       regions.
       The method used in this paper targets individual parameters (weights) to 
       prune. So, technically this should be considered a non-structured pruning 
       method. However, the reason I think this is referenced as a structured pruning 
       method is that if all connections of a neuron are pruned (i.e., all input or all 
       output weights were below the threshold), the neuron itself will be removed from 
       the network: "After pruning connections, neurons with zero input connections or 
       zero output connections may be safely pruned."
       SIDENOTE: They touch on the use of global average pooling instead of fully 
       connected layers in CNNs: "There have been other attempts to reduce the 
       number of parameters of neural networks by replacing the fully connected 
       layer with global average pooling."
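
A minimal sketch of the connection-pruning step described in this item, assuming "lower than a threshold" refers to the weight magnitude and using plain nested lists for a single hidden layer. The function names are hypothetical, and the retraining phase is not shown.

```python
from typing import List

Matrix = List[List[float]]
Mask = List[List[int]]


def magnitude_mask(weights: Matrix, threshold: float) -> Mask:
    """Return a 0/1 mask that keeps weights with |w| >= threshold."""
    return [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]


def apply_mask(weights: Matrix, mask: Mask) -> Matrix:
    """Zero out pruned connections; reapply after each retraining update."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def removable_neurons(mask_in: Mask, mask_out: Mask) -> List[int]:
    """Hidden neurons with zero input or zero output connections.

    mask_in[j] holds the incoming connections of hidden neuron j;
    mask_out[o][j] is the connection from hidden neuron j to output o.
    """
    removable = []
    for j, incoming in enumerate(mask_in):
        outgoing = [row[j] for row in mask_out]
        if sum(incoming) == 0 or sum(outgoing) == 0:
            removable.append(j)
    return removable


# Toy layer: 3 inputs -> 2 hidden neurons -> 2 outputs.
w_in = [[0.8, -0.05, 0.3], [0.01, -0.02, 0.04]]
w_out = [[0.5, 0.9], [-0.7, 0.2]]

mask_in = magnitude_mask(w_in, threshold=0.1)
mask_out = magnitude_mask(w_out, threshold=0.1)
w_in_sparse = apply_mask(w_in, mask_in)
dead = removable_neurons(mask_in, mask_out)
```

In the toy layer, the second hidden neuron loses all of its input connections, so it can be dropped entirely, which is the case where weight-level pruning ends up removing a structural unit.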

   5. Many more can be picked from the references of these papers. 

     There's a paper on Bayesian compression for Deep Learning from 2017. Their 
     hypothesis is: "By employing sparsity inducing priors for hidden units (and not 
     individual weights) we can prune neurons including all their ingoing and outgoing 
     weights." However, the method is mathematically heavy and the related work 
     references are quite old (1990s, 2000s). 
<|endoftext|>


<|startoftext|>
                  Network Trimming: A Data-Driven Neuron Pruning
                     Approach towards Efﬁcient Deep Architectures

                    Hengyuan Hu *   Rui Peng *        Yu-Wing Tai        Chi-Keung Tang
                      HKUST       HKUST     SenseTime Group Limited       HKUST
                    hhuaa@ust.hk   rpeng@ust.hk  yuwing@sensetime.com  cktang@cse.ust.hk


                                              Abstract

                       State-of-the-art neural networks are getting deeper and wider. While their performance
                       increases with the increasing number of layers and neurons, it is crucial to
                       design an efﬁcient deep architecture in order to reduce computational and memory
                       costs. Designing an efficient neural network, however, is labor intensive, requiring
                       many experiments and fine-tuning. In this paper, we introduce network trimming
                       which iteratively optimizes the network by pruning unimportant neurons based on
                       analysis of their outputs on a large dataset. Our algorithm is inspired by an observation
                       that the outputs of a signiﬁcant portion of neurons in a large network are
                       mostly zero, regardless of what inputs the network received. These zero activation
                       neurons are redundant, and can be removed without affecting the overall accuracy
                       of the network. After pruning the zero activation neurons, we retrain the network
                       using the weights before pruning as initialization. We alternate the pruning and
                       retraining to further reduce zero activations in a network. Our experiments on the
                       LeNet and VGG-16 show that we can achieve a high compression ratio of parameters
                       without losing accuracy, or even while achieving higher accuracy than the original network.


                 1 Introduction

                 Neural networks have been widely adopted in many scenarios, achieving state-of-the-art results in
                 numerous tasks [1] [2]. One of the keys to improved performance is their increased depth and width
                 and thus the increased number of parameters. In computer vision, we have witnessed orders of
                 magnitude increase in the number of parameters in CNNs from LeNet with less than 1M parameters
                 in handwritten digit classiﬁcation [3] to Deepface with more than 120M parameters in human face
                 classiﬁcation [4].
                 Although CNNs with elegant network architectures are easy to deploy in real-world tasks, designing
                 one can be hard and labor-intensive, involving a significant amount of effort in empirical experiments.
                 In terms of designing the network architecture, one crucial part is to determine the number of
                 neurons in each layer. There is no way to directly arrive at an optimal number of neurons for each
                 layer and thus even the most successful network architectures use empirical numbers like 128, 512,
                 4096. Experienced scientists often arrive at the numbers once they deem the network has enough
                 representation power for the speciﬁc task. However, the extremely sparse matrices produced by top
                 layers of neural networks have caught our attention, indicating that empirically designed networks are
                 heavily oversized. After some simple statistics, we ﬁnd that many neurons in a CNN have very low
                 activations no matter what data is presented. Such weak neurons are highly likely to be redundant
                 and can be excluded without damaging the overall performance. Their existence can only increase
                 the chance of overﬁtting and optimization difﬁculty, both of which are harmful to the network.
                 With the motivation of achieving more efficient network architectures by finding the optimal number
                 of neurons in each layer, we come up with an iterative optimization method that gradually eliminates
                 weak neurons in a network via a pruning-retraining loop. Starting from an empirically designed
                 network, our algorithm first identifies redundant weak neurons by analyzing their activations on a
                 large validation dataset. Then those weak neurons are pruned while others are kept to initialize a new
                 model. Finally, the new model is retrained or fine-tuned depending on the performance drop. The
                 retrained new model can maintain the same or achieve higher performance with a smaller number of
                 neurons. This process can be carried out iteratively until a satisfying model is produced.

                   * Part of the work was done when Hengyuan Hu and Rui Peng were interns in SenseTime Group Limited.

                 2 Related Work

                 Signiﬁcant redundancy has been demonstrated in several deep learning models [5] and such
                 redundancy is mainly caused by the overwhelming amount of parameters in deep neural networks. An
                 over-parameterized model not only wastes memory and computation, but also leads to a serious overfitting
                 problem. Therefore, reducing the number of parameters has been studied by many researchers
                 in this field. However, there is little work directly addressing the optimization of the number of
                 neurons. Most previous works on improving network architectures fall into two main categories: one
                 concentrates on the high level architectural design and the other focuses on low level weight pruning.
                 On the high level side, some researchers invented new layers or modules to substitute main bottleneck
                 components in conventional neural networks. Two famous examples of this kind are the global
                 average pooling in Network in Network [6] invented to replace the extremely dense parameterized
                 fully connected layer and the inception module employed by GoogLeNet [7] to avoid explosion in
                 computational complexity at later stage. Both methods achieve state-of-the-art results on several
                 benchmarks with much less memory and computation consumption. More recently, SqueezeNet [8]
                 used a Fire module together with other strategies to achieve AlexNet-level accuracy with 50x fewer
                 parameters.
                 On the low level side, different methods have been explored to reduce the number of connections and
                 weights in neural networks. Some late 20th century methods, such as the magnitude-based approach [9]
                 and the Hessian matrix based approach [10], prune weights based on numerical properties of the weights
                 and loss functions without any external data involved. Han et al. recently proposed an iterative
                 method [11] to prune connections in deep architectures, together with an external compression by
                 quantization and encoding [12]. The network is first pruned by removing low-weight connections.
                 Then, a learned mapping of similar weights to fixed bits is used to perform quantization of weights
                 after pruning, which facilitates the Huffman coding compression in the last stage to reduce bits for
                 storage. When all three techniques are used in a pipeline, the number of parameters in the network can be
                 reduced by around 10x.
                 While the above methods work well in practice by reducing the number of parameters directly, we seek
                 answers to the fundamental problem that lies in the middle of those two levels of approaches:
                 determining the optimal number of neurons for each layer for a given network architecture and
                 specific task. Following this direction, not only can we achieve parameter savings without the need to
                 seek new network architectures, we can also evaluate the redundancy in each layer of a network,
                 and thus provide guidance on effective ways for architecture optimization in large neural networks.

                 3 Zero Activations and Network Trimming

                 In this section, we describe our algorithm for network trimming. To facilitate our discussions, we
                 use VGG-16 [13] as our case study. The VGG-16 network consists of 13 convolutional layers, and 3
                 fully connected layers. Each of the layers is followed by a ReLU [14] layer for non-linear mapping.
                 The VGG-16 is recognized as one of the representative networks and has been adopted in many
                 applications [15] [16], not limited to object classification and localization tasks.


                 3.1 Zero Activations in VGG-16

                 We define Average Percentage of Zeros (APoZ) to measure the percentage of zero activations of
                 a neuron after the ReLU mapping. Let <<FORMULA>> denote the output of the c-th channel in the i-th layer; our
                 <<FORMULA>> of the c-th neuron in the i-th layer is defined as:

                                                <<FORMULA>>                  (1) 

                 where <<FORMULA>> if true, and <<FORMULA>> if false, M denotes the dimension of the output feature map
                 of <<FORMULA>>, and N denotes the total number of validation examples. The larger the number of validation
                 examples, the more accurate the measurement of APoZ. In our experiment, we use the validation
                 set (N = 50,000) of the ImageNet classification task to measure APoZ.
                 We use the definition of APoZ to evaluate the importance of each neuron in a network. To validate
                 our observation that the outputs of some neurons in a large network are mostly zero, we calculate the
                 APoZ of each neuron and find that there are 631 neurons in the VGG-16 network which have APoZ
                 larger than 90%.
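
As a rough illustration of the definition above, the APoZ of a single neuron can be computed by counting zeros over all feature-map positions and all validation examples, then dividing by N × M. The sketch below uses plain Python with toy data; the function name `apoz` and the nested-list layout are assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def apoz(activations: Sequence[Sequence[float]]) -> float:
    """Average Percentage of Zeros for one neuron (channel).

    activations[k][j] is the post-ReLU output at feature-map position j
    for validation example k, so len(activations) plays the role of N
    and len(activations[k]) the role of M.
    """
    total = sum(len(feature_map) for feature_map in activations)
    zeros = sum(1 for feature_map in activations
                for value in feature_map if value == 0.0)
    return zeros / total


# Toy data: N = 2 validation examples, M = 4 feature-map positions.
outputs = [
    [0.0, 0.0, 1.5, 0.0],
    [0.0, 0.2, 0.0, 0.0],
]
score = apoz(outputs)  # 6 zeros out of 8 values
```

A neuron whose score is close to 1 outputs zero for almost every input, which is the signal used to mark it as a pruning candidate.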

                                   Table 1: Mean APoZ of each layer in VGG-16

                                                    <<TABLE>>

                 To better understand the behavior of zero activations in a network, we compute the mean APoZ
                 (Table 1) of all neurons in each layer (except for the last one) of the VGG-16 network. Since the
                 VGG-16 network has inverse pyramid shape, most redundancy occurs at the higher convolutional
                 layers and the fully connected layers. The higher mean APoZ also indicates more redundancy in a
                 layer. Detailed distributions of APoZ of 512 CONV5-3 neurons and 4096 FC6 neurons are shown in
                 Figures 1 and 2, respectively. Since a neural network has a multiplication-addition-activation computation
                 process, a neuron which has its outputs mostly zeros will have very little contribution to the output of
                 subsequent layers, as well as to the final results. Thus, we can remove those neurons without harming
                 the overall accuracy of the network too much. In this way, we can find the optimal number of
                 neurons for each layer and thus obtain a better network without redesign and extensive human labor.

                                <<FIGURE>>                                  <<FIGURE>>

                   Figure 1: CONV5-3 APoZ Distribution            Figure 2: FC6 APoZ Distribution

                 3.2 Network Trimming and Retraining

                 Our network trimming method consists of three main steps, as illustrated in Figure 3. First the network
                 is trained under conventional process and the number of neurons in each layer is set empirically. Next,
                 we run the network on a large validation dataset to obtain the APoZ of each neuron.
                 Neurons with high APoZ are pruned according to certain criteria. The connections to and from the
                 neuron are removed accordingly when a neuron is pruned (see Figure 4 5). After the neuron pruning,
                 the trimmed network is initialized using the weights before trimming. The trimmed network exhibits

                            <<FIGURE>>                   <<FIGURE>>                    <<FIGURE>>

                 Figure 3: Three main steps for    Figure 4: Before pruning    Figure 5: After pruning
                 trimming

                 some level of performance drop. Thus, in the final step, we retrain the network to strengthen the
                 remaining neurons and recover the performance of the trimmed network.
                 Weight initialization is necessary for the network to regain the performance it had before
                 trimming. If a trimmed network is trained from scratch, we find that it contains a larger percentage
                 of zero-activation neurons than its counterpart with weight initialization. This means that a network
                 retrained without weight initialization is much less efficient.
                 We experimented with different ways to prune the neurons according to the APoZ measurements. We
                 found that pruning too many neurons at once severely damaged the performance, and the performance
                 drops were unrecoverable. Therefore, we chose an iterative scheme to trim a network. However, it is
                 not trivial to trim a network with a deep architecture. If too many layers are trimmed in one step, the
                 performance drops by a large margin, and it is hard to recover the original performance
                 through retraining. For example, trimming CONV4, CONV5, FC6 and FC7 of the
                 VGG-16 network concurrently leads to a 46.650% top-5 accuracy in the image classification
                 task, whereas the original accuracy of VGG-16 (footnote 2) is 88.444%. On the other hand, if only CONV5-3
                 and FC6 are trimmed, the trimmed network with weight initialization achieves
                 85.900% top-5 accuracy before retraining. After retraining, it achieves 90.278% accuracy, which is
                 even higher than the original accuracy before trimming.
                 Empirically, we found that starting to trim from a few layers with high mean APoZ, and then
                 progressively trimming their neighboring layers, can rapidly reduce the number of neurons while maintaining
                 the performance of the original network. To decide which neurons to prune, we empirically found
                 that pruning the neurons whose APoZ is more than one standard deviation above the mean APoZ
                 of the target trimming layer produces good retraining results. Using this threshold, we
                 reject 16% of neurons on average from the trimmed layers, assuming that the APoZ values roughly
                 follow a Gaussian distribution.
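
                 The selection rule above can be sketched in a few lines. This is a minimal illustration, not the
                 Caffe implementation: it assumes APoZ is the fraction of zero outputs of a neuron (as defined in
                 Section 3.1) measured on post-ReLU activations, and the function names and toy activations are
                 hypothetical. The last lines check the 16% figure: for a Gaussian, the probability mass more than
                 one standard deviation above the mean is about 15.9%.

```python
import math
from statistics import mean, pstdev

def apoz(activations):
    """Fraction of recorded outputs of one neuron that are exactly zero."""
    return sum(1 for a in activations if a == 0) / len(activations)

def neurons_to_prune(apoz_values, num_std=1.0):
    """Indices of neurons whose APoZ exceeds the layer mean by more than num_std standard deviations."""
    threshold = mean(apoz_values) + num_std * pstdev(apoz_values)
    return [i for i, v in enumerate(apoz_values) if v > threshold]

# Toy post-ReLU outputs, one list per neuron (hypothetical data).
layer_outputs = [
    [0.0, 1.2, 0.0, 0.5],
    [0.3, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.6, 0.1],
    [0.4, 0.2, 0.6, 0.1],
]
apoz_values = [apoz(outputs) for outputs in layer_outputs]
pruned = neurons_to_prune(apoz_values)

# One-sided Gaussian tail beyond one standard deviation: P(Z > 1).
gaussian_tail = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
print(apoz_values, pruned, round(gaussian_tail, 4))
```

                 In this toy layer only the neuron that never fires is selected. On real layers with hundreds or
                 thousands of neurons, the selected share should approach the Gaussian tail only to the extent
                 that the APoZ distribution is actually close to Gaussian.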

                 4 Experiments

                 We implemented our algorithm using the standard Caffe [17] library. To obtain the weights used to
                 initialize retraining, we use the Python PyCaffe interface to copy the weights of the connections
                 that remain after trimming. We tested our algorithm primarily on two networks, LeNet [3] on the
                 MNIST dataset and VGG-16 on the ImageNet classification dataset [18].
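
                 The weight-copying step can be pictured as slicing weight matrices. The sketch below is plain
                 Python rather than PyCaffe, and the function name, argument layout, and toy values are
                 assumptions made for illustration: removing a neuron drops its row of incoming weights, its bias,
                 and the matching column in the next layer's weights, while all remaining values are kept.

```python
def trim_layer(w_in, b_in, w_out, pruned):
    """Drop pruned neurons from a layer and from the inputs of the next layer.

    w_in:  one row of incoming weights per neuron of the trimmed layer
    b_in:  one bias per neuron of the trimmed layer
    w_out: next layer's weights, one row per next-layer neuron,
           one column per neuron of the trimmed layer
    """
    pruned = set(pruned)
    keep = [i for i in range(len(w_in)) if i not in pruned]
    new_w_in = [list(w_in[i]) for i in keep]                # surviving rows
    new_b_in = [b_in[i] for i in keep]                      # surviving biases
    new_w_out = [[row[i] for i in keep] for row in w_out]   # surviving columns
    return new_w_in, new_b_in, new_w_out

# Toy layer with 3 neurons and 2 inputs, followed by a layer with 2 neurons.
w_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b_in = [0.01, 0.02, 0.03]
w_out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
new_w_in, new_b_in, new_w_out = trim_layer(w_in, b_in, w_out, pruned=[1])
print(new_w_in, new_b_in, new_w_out)
```

                 When a convolutional layer feeds a fully connected layer, each pruned channel corresponds to a
                 block of columns rather than a single column; the sketch does not cover that case.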

                 4.1 LeNet

                 The LeNet network consists of two convolutional layers followed by two fully connected layers; the
                 layers have 20, 50, 500, and 10 outputs respectively. We use the shorthand notation (20-50-500-10) to denote
                 the number of neurons in each layer of the network. In LeNet, 93% of the parameters are in the
                 connections between the CONV2 layer and the FC1 layer. Consequently, we can easily achieve a
                 more efficient network by trimming the CONV2 and FC1 layers.

                    2 Single scale, without dense evaluation [13]

                 4.1.1 Effectiveness
                 We apply our algorithm to iteratively prune the neurons in the CONV2 and FC1 layers, as shown in
                 Table 2. In the first pruning iteration, the numbers of neurons in the CONV2 and FC1 layers are
                 reduced to 41 and 426 respectively, which gives a 1.41× compression of the number of parameters.
                 The accuracy drops from 99.27% to 98.75% after the pruning but before
                 retraining. After retraining the network, we achieve 99.29% accuracy, which is slightly higher than
                 the original accuracy. We repeat this process for 4 iterations. As shown in Table 2, our algorithm
                 achieves more than 2× to 3× compression of the number of parameters without loss in accuracy.
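
                 The parameter share of FC1 and the compression after the first iteration can be checked with a
                 short calculation. The sketch assumes the Caffe reference LeNet configuration, which is not
                 spelled out in the text: 5×5 kernels, a single-channel 28×28 input, and 2×2 pooling after each
                 convolution, so that CONV2 produces 4×4 feature maps. Biases are counted as parameters. The
                 function name is hypothetical.

```python
def lenet_params(conv1=20, conv2=50, fc1=500, fc2=10,
                 kernel=5, in_channels=1, spatial=4):
    """Per-layer parameter counts (weights plus biases) for a LeNet-style network."""
    return {
        "conv1": conv1 * in_channels * kernel * kernel + conv1,
        "conv2": conv2 * conv1 * kernel * kernel + conv2,
        "fc1": fc1 * conv2 * spatial * spatial + fc1,
        "fc2": fc2 * fc1 + fc2,
    }

original = lenet_params()                    # (20-50-500-10)
trimmed = lenet_params(conv2=41, fc1=426)    # after the first pruning iteration

total_original = sum(original.values())
total_trimmed = sum(trimmed.values())
fc1_share = original["fc1"] / total_original
compression = total_original / total_trimmed
print(total_original, total_trimmed, round(fc1_share, 3), round(compression, 2))
```

                 Under these assumptions FC1 holds roughly 93% of the parameters, and shrinking CONV2 and FC1 to
                 41 and 426 neurons gives a ratio of about 1.41, in line with the figures quoted above.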

                                      Table 2: Iterative Trimming on LeNet

                                                    <<TABLE>>

                 4.1.2 Necessity of Weight Initialization
                 We evaluate retraining with and without weight initialization, as summarized
                 in Table 3. Without weight initialization, the classification accuracy of the network deteriorates,
                 whereas with proper weight initialization from the ancestor network of the previous iteration, the
                 trimmed network can retain its original accuracy or even exceed it.

                         Table 3: Iterative Trimming on LeNet with and without Weight Initialization

                                              <<TABLE>>

                 Moreover, we observe that with weight initialization, the trimmed network consistently has
                 smaller mean APoZ values than its ancestor network. This means that the retrained network has
                 less redundancy than its ancestor network. In contrast, mean APoZ values increase if we retrain the
                 network from scratch, even though the trimmed network has fewer neurons than its ancestor network.
                 This observation suggests that proper weight initialization is necessary to achieve an
                 efficient trimmed network.

                 4.2 VGG-16

                 4.2.1 Effectiveness
                 With the same objective of obtaining the optimal number of neurons in each layer, we analyzed the APoZ
                 values of O_c^{(i)} for all i and c in VGG-16 on the ImageNet classification validation set. As shown in
                 Table 1, the CONV4, CONV5 and FC layers have higher mean APoZ than the bottom layers,
                 exhibiting more redundancy. Drawing on our experience with LeNet, we focus on the parameter
                 bottleneck of VGG-16. We trim the VGG-16 network starting from the CONV5-3 and FC6 layers
                 since they account for about 100M of the 138M parameters.
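
                 The size of this bottleneck can be verified by counting parameters. The sketch assumes the
                 standard VGG-16 layout (configuration D in [13]): thirteen 3×3 convolutional layers, a 224×224
                 RGB input that reaches FC6 as 512 feature maps of size 7×7, and biases counted as parameters.
                 The layer names such as conv13 are labels invented here; conv13 stands for CONV5-3.

```python
# Output channels of the 13 convolutional layers of standard VGG-16.
CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
FC_SIZES = [4096, 4096, 1000]  # FC6, FC7, FC8

def vgg16_params(conv_channels=CONV_CHANNELS, fc_sizes=FC_SIZES, spatial=7):
    """Return (layer_name, parameter_count) pairs, weights plus biases."""
    counts = []
    in_ch = 3  # RGB input
    for idx, out_ch in enumerate(conv_channels):
        counts.append((f"conv{idx + 1}", out_ch * in_ch * 3 * 3 + out_ch))
        in_ch = out_ch
    in_features = in_ch * spatial * spatial  # 512 * 7 * 7 = 25088
    for idx, out_features in enumerate(fc_sizes):
        counts.append((f"fc{idx + 6}", out_features * in_features + out_features))
        in_features = out_features
    return counts

counts = dict(vgg16_params())
total = sum(counts.values())
bottleneck = counts["conv13"] + counts["fc6"]  # CONV5-3 plus FC6
print(total, bottleneck, round(bottleneck / total, 3))
```

                 With these assumptions the two layers hold about 105M of roughly 138M parameters, almost all
                 of them in FC6, which matches the rounded figure above.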
                 We iteratively prune neurons from the CONV5-3 and FC6 layers. As with LeNet, the
                 trimming process can effectively eliminate neurons with high APoZ. As shown in Figure 6, after
                 trimming, the entire distribution of APoZ in O^{(fc6)} shifts left, indicating a significant drop in network

                                                  <<FIGURE>>

                               Figure 6: FC6 APoZ distribution before and after trimming


                 redundancy. Meanwhile, the diminishing tail on the right side of the curve shows that the weak
                 neurons in FC6 are vanishing, which is evidence of the benefit gained from weight initialization as discussed in
                 Sections 3.2 and 4.1.2.
                             Table 4: Iterative Trimming Result on VGG-16 {CONV5-3, FC6}

                                               <<TABLE>>

                 After 6 iterations of trimming, we remove more than half of the total number of parameters and achieve
                 a compression rate of 2.59×, while the trimmed network has 2 to 3% higher Top-1/Top-5 accuracy
                 than the original VGG-16 model. The detailed performance of the intermediate models is summarized
                 in Table 4. There are two interesting observations in the table. First, the initial accuracy just after
                 trimming does not drop much from the previous model even though around 500 neurons in CONV5-3 and
                 FC6 are pruned in each iteration. This is strong evidence of redundancy in empirically designed neural
                 networks. Also, such a small decrease in accuracy can be remedied by fast fine-tuning instead
                 of a week-long retraining. In our experiments, it takes fewer than 5K iterations to reach the original
                 accuracy (with batch size = 256). Therefore, our trimming method allows fast optimization towards
                 a better architecture. Second, the trimmed networks surprisingly surpass the original VGG-16 in
                 accuracy with fewer parameters. The good initialization provided by the previous model sets a promising
                 starting point for the trimmed model. In addition, having fewer parameters in FC6 also reduces the
                 chance of overfitting, which may also contribute to the increase in accuracy.

                 4.2.2 Trimming Multiple Layers
                 VGG-16 differs from LeNet greatly in that it has a much deeper architecture with signiﬁcantly more
                 layers, which naturally gives us more options to determine which layers to trim. After the previous
                 experiments, we want to further investigate if trimming multiple layers simultaneously can achieve
                 the same effectiveness.
                 After trimming the CONV5-3 and FC6 layers, we continue to trim their neighboring layers. We
                 experimented with three sets of trimming layouts: {CONV5, FC6}, {CONV5, FC6, FC7}, {CONV4,
                 CONV5, FC6, FC7} (see Table 5). When more neurons are pruned, the large performance drop in the
                 trimmed network indicates retraining is necessary. We use the same set of training hyperparameters
                 in our experiments: {base-lr: 0.001, gamma: 0.1, step-size: 3000}. After retraining, the trimmed
                 networks gradually recover from the loss of neurons and rise to an accuracy level equivalent to the

                                       Table 5: Iterative Trimming Result on VGG-16 Many Layers

                                               <<TABLE>>

                 reference model or slightly higher. In contrast to trimming only one layer, these models regain
                 their capacity rather slowly, taking more than 10K iterations to recover the accuracy. Empirically, we
                 found that iteratively trimming the network starting from a few layers achieves better performance.
                 We also found that trimming the last convolutional layer and the fully connected layers is the most
                 effective. As shown in Table 6, additional trimming of the FC7 layer (based on the previously trimmed model
                 (CONV5-3, FC6) = (420, 2121)) can achieve a high 2.7× compression rate with improved accuracy.
                 The underlying reason is that once we have pruned the FC6 layer, the numerous zeros contribute to
                 the high APoZ values of neurons in the FC7 layer. For the goal of reducing network parameters, it
                 suffices to trim just the {CONV5-3, FC6, FC7} layers, since around 86% of all the parameters are in
                 these layers.
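
                 For reference, the retraining hyperparameters listed above ({base-lr: 0.001, gamma: 0.1,
                 step-size: 3000}) can be read as a step-decay schedule. The sketch below assumes Caffe's "step"
                 learning-rate policy, which the text does not name explicitly; under that policy the rate is
                 multiplied by gamma every step-size iterations. The function name is hypothetical.

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, step_size=3000):
    """Step-decay learning rate: base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)

# Learning rate at a few points of a 10K-iteration retraining run.
schedule = {it: step_lr(it) for it in (0, 2999, 3000, 6000, 10000)}
for it, lr in schedule.items():
    print(it, lr)
```

                 Under this reading, a model that needs more than 10K iterations to recover has already passed
                 three decay steps and finishes at a learning rate three orders of magnitude below the base rate.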

                           Table 6: Iterative Trimming Result on VGG-16 {CONV5-3, FC6, FC7}

                                                 <<TABLE>>


                 5 Discussion

                 5.1 Comparison with Connection Pruning

                 The work closest to ours is that of Han et al. [11], who iteratively prune network connections
                 whose corresponding weights are close to zero. They also prune a neuron when
                 all the connections to it are pruned. Compared with their work, ours is better in two major
                 aspects. First, although Han et al. claim to have achieved a 13× reduction in parameters
                 on VGG-16, their reduction is tailored to a CPU implementation of a neural network. In a GPU
                 implementation, the convolutional layer is implemented by first vectorizing a 2D feature map into
                 a 1D feature vector followed by a matrix multiplication [19]. Thus, if a neuron is not pruned, the
                 number of multiplications for the convolutional layers remains the same, since the vectorization is
                 performed in a universal manner for all neurons in the same layer. The same holds for fully
                 connected layers, where the number of multiplications is the same for all neurons in the same layer.
                 Note that the computational cost of re-vectorizing a 2D feature map to fit different shapes of neuron
                 connections, or of adding a conditional mask check, is much higher than that of a simple matrix multiplication
                 with redundancy. Our method, in contrast, removes all unneeded neurons so that they do not
                 consume any memory and are not involved in any computation at all. As shown in Section 4.2, the
                 trimmed VGG-16 has more than 2× fewer FLOPs in the first fully connected layer.
                 Second, pruning a neuron by ﬁrst pruning all of its connections is less efﬁcient and less effective than
                 our APoZ measurement. This is because the number of connections is signiﬁcantly larger than the
                 number of neurons in a network, especially for fully connected layers. In our experiments, we found
                 that most of the redundancy resides in fully connected layers, and in the connections between the last
                 convolutional layer and the ﬁrst fully connected layer. However, it is rarely the case that the weights
                 of all connections to a neuron in these layers are close to zero. Consequently, it is difﬁcult to prune a
                 neuron in these layers. On the other hand, our APoZ measurement can easily identify zero-activation
                 neurons for pruning regardless of the weights of their connections. The mean APoZ can also be used as a
                 guideline to evaluate the effectiveness of a network, as demonstrated in our experiments.

                 5.2 Dataset Used During Trimming

                 In all of our experiments, we train the network using the training set and run the network on the validation
                 set to obtain APoZs for neuron pruning. This method may be controversial because the validation set
                 should not be seen before finalizing the model, as doing so may lead to overfitting to the
                 validation set. We had the same suspicion, especially after the experiments in which the trimmed
                 model had 2% higher top-5 accuracy than the original VGG-16 on the validation set.
                 Therefore, we conducted two more experiments to explore the potential issue.
                 In the first experiment, we randomly sampled a subset of the training set with the same number of images
                 (50K) as the validation set. Then we used the same criteria to select weak neurons for pruning. The
                 weak neurons selected using the sampled training set have more than 95% overlap with the
                 neurons selected using the validation set. This shows that neurons have consistent activation
                 behavior on the training and validation sets. In other words, the trimmed networks learned from
                 sampled training data will be similar to the trimmed networks learned from the validation set.
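
                 One way to compute such an overlap is sketched below. The exact definition of the overlap ratio
                 is not given in the text, so the choice here (the share of validation-selected neurons that are
                 also selected from the training sample) is an assumption, and the neuron indices are made up.

```python
def overlap_ratio(selected_a, selected_b):
    """Share of the neurons in selected_b that also appear in selected_a."""
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(b) if b else 1.0

# Hypothetical indices of weak neurons found with each dataset.
from_training_sample = [3, 7, 12, 19, 25, 31, 40, 44]
from_validation = [3, 7, 12, 19, 25, 31, 40, 47]
ratio = overlap_ratio(from_training_sample, from_validation)
print(ratio)
```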
                 In addition, we tested our model on the test set of the ILSVRC2012 classification track. Using a
                 single model without dense evaluation, the original VGG-16 model with 11.56% validation error
                 has an error rate of 13.02% on the test set. Our trimmed network with configuration {CONV5-3: 420,
                 FC6: 2121, FC7: 2482, Compression Rate: 2.00, Validation Error: 9.7%} achieved a 10.02% error
                 rate on the test set. Note that the test set and validation set are non-overlapping in the ILSVRC2012
                 classification task. The data show that after network trimming, not only is the overall accuracy
                 higher, but the gap between validation error and test error also shrinks (from 1.46 to 0.32
                 percentage points), indicating that the trimmed network overfits less.
                 The two extra experiments dismiss our concern about overfitting. They also suggest that the validation
                 set can be used for analyzing APoZs.

                 6 Conclusion

                 We have presented Network Trimming, which prunes redundant neurons based on the statistics of neurons'
                 activations. With our method, one network architecture can be deployed to handle different tasks on
                 different datasets, and the algorithm can tailor the network accordingly by determining how many
                 neurons to use for each layer without intensive computation or human
                 labor. Our method iteratively removes low-activation neurons that contribute little to the final
                 results without damaging the performance of the model. We evaluated our algorithm on LeNet and
                 VGG-16, achieving the same accuracy with 2× to 3× fewer parameters. In VGG-16, the trimmed models
                 can even surpass the original one, which could be caused by the reduced optimization difficulty.
                 Lying between high-level network redesign and low-level weight pruning, neuron pruning can
                 be applied to any mature architecture together with weight pruning to sharply reduce the complexity
                 of the network.

                                                  References
                  [1]Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classiﬁcation with deep convolutional
                     neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
                  [2]Graves, A., Schmidhuber, J.: Framewise phoneme classiﬁcation with bidirectional lstm and
                      other neural network architectures. Neural Networks 18(5) (2005) 602–610
                  [3]Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
                      recognition. Proceedings of the IEEE 86(11) (Nov 1998) 2278–2324
                  [4]Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level
                     performance in face veriﬁcation. In: Proceedings of the IEEE Conference on Computer Vision
                     and Pattern Recognition. (2014) 1701–1708
                  [5]Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep
                      learning. CoRR abs/1306.0543 (2013)
                  [6] Lin, M., Chen, Q., Yan, S.: Network in network. CoRR abs/1312.4400 (2013)
                  [7]Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V.,
                      Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014)
                  [8]Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: Squeezenet:
                      Alexnet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint
                     arXiv:1602.07360 (2016)
                  [9]Hanson, S.J., Pratt, L.: Advances in neural information processing systems 1. Morgan Kaufmann
                     Publishers Inc., San Francisco, CA, USA (1989) 177–185
                 [10]Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal brain surgeon.
                     In: Advances in Neural Information Processing Systems 5, [NIPS Conference], San Francisco,
                     CA, USA, Morgan Kaufmann Publishers Inc. (1993) 164–171
                 [11]Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efﬁcient
                      neural networks. CoRR abs/1506.02626 (2015)
                 [12]Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with
                     pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)
                 [13]Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recogni-
                     tion. arXiv preprint arXiv:1409.1556 (2014)
                 [14]Nair, V., Hinton, G.E.: Rectiﬁed linear units improve restricted boltzmann machines. In:
                     Proceedings of the 27th International Conference on Machine Learning (ICML-10). (2010)
                     807–814
                 [15]Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with
                     region proposal networks. In: Advances in Neural Information Processing Systems. (2015)
                     91–99
                 [16]Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to
                     sequence-video to text. In: Proceedings of the IEEE International Conference on Computer
                     Vision. (2015) 4534–4542
                 [17]Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell,
                     T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM
                     International Conference on Multimedia, ACM (2014) 675–678
                 [18]Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
                     Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition
                      Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
                 [19]Scherer, D., Schulz, H., Behnke, S.: Accelerating large-scale convolutional neural networks
                     with parallel graphics multiprocessors. In: Artiﬁcial Neural Networks–ICANN 2010. Springer
                     (2010) 82–91
<|endoftext|>


<|startoftext|>
                  PLUG AND PLAY LANGUAGE MODELS : A SIMPLE APPROACH TO CONTROLLED TEXT GENERATION

                  Sumanth Dathathri       Andrea Madotto       Janice Lan       Jane Hung
                  CMS, Caltech            HKUST              Uber AI         Uber AI

                  Eric Frank        Piero Molino        Jason Yosinski yy        Rosanne Liu y
                  Uber AI           Uber AI            Uber AI              Uber AI
                  dathathris@gmail.com, amadotto@connect.ust.hk
                  {janlan, jane.hung, mysterefrank, piero, yosinski, rosanne}@uber.com

                                              ABSTRACT

                       Large transformer-based language models (LMs) trained on huge text corpora
                       have shown unparalleled generation capabilities. However, controlling attributes
                       of the generated language (e.g. switching topic or sentiment) is difﬁcult without
                       modifying the model architecture or ﬁne-tuning on attribute-speciﬁc data and en-
                       tailing the signiﬁcant cost of retraining. We propose a simple alternative: the Plug
                       and Play Language Model (PPLM) for controllable language generation, which
                       combines a pretrained LM with one or more simple attribute classiﬁers that guide
                       text generation without any further training of the LM. In the canonical scenario
                       we present, the attribute models are simple classiﬁers consisting of a user-speciﬁed
                       bag of words or a single learned layer with 100,000 times fewer parameters than
                       the LM. Sampling entails a forward and backward pass in which gradients from
                       the attribute model push the LM’s hidden activations and thus guide the
                       generation. Model samples demonstrate control over a range of topics and sentiment
                       styles, and extensive automated and human annotated evaluations show attribute
                       alignment and ﬂuency. PPLMs are ﬂexible in that any combination of differentiable
                       attribute models may be used to steer text generation, which will allow for
                       diverse and creative applications beyond the examples given in this paper.


                  1 INTRODUCTION

                 The Transformer architecture (Vaswani et al., 2017) has enabled large-scale language models (LMs)
                 trained on a huge amount of data (Radford et al., 2019; Dai et al., 2019b; Radford et al., 2018b) to
                 greatly improve the state-of-the-art on natural language processing tasks. These models are used to
                 extract contextualized word embeddings for transfer learning purposes (Devlin et al., 2019) and as
                 natural language generators. The latter can leverage large amounts of unannotated data and a simple
                 log-likelihood training objective. However, once such models are trained, controlling attributes of
                 generated text becomes difﬁcult without modifying the model architecture to allow for extra input
                 attributes or ﬁne-tuning with attribute-speciﬁc data (Keskar et al., 2019; Ziegler et al., 2019).
                 conceptualized PPLMs and led the manuscript writing. SD led the project, implemented the PPLM, set
                 up and ran all modeling experiments, engineered how to obtain workable
                 gradients via the weighted embedding approach, and made the model work. AM helped with preparing datasets
                 for discriminator training, automated evaluation, running experiments, and writing the manuscript. SD, RL &
                 AM ran the external baselines. RL & JL built and oversaw the human evaluation pipeline and computed the
                 statistics. JH ran the story generation with skeleton preﬁxes. EF assisted with detoxiﬁcation experiments. PM
                 led efforts to migrate to the new pytorch transformer, helped with code release. JY helped with the annotation
                 pipeline, ﬁnding bugs, navigating model and experimental directions, engineering workable gradients, and
                 posing the model mathematically. RL implemented preliminary experiments and multi-attribute control, and
                 cleaned and coordinated release of the code. RL & JY oversaw the project.

                                                  <<TABLE>>

                 Table 1: The PPLM employs a pre-trained language model (LM) without any changes to the model
                 parameters and can generate text with controlled attributes such as topic and sentiment. We demon-
                 strate control with two tiny and easy to construct attribute models: a bag of words (BoW) related to a
                 topic and a linear discriminator trained on top of LM latent representations to control sentiment. The
                 underlined prefix is what the LM is conditioned on to generate a passage of text (e.g. The potato).
                 The controlled attributes are colored and bracketed (e.g. [Science]), and words in the BoW that are
                 directly optimized for are highlighted brightly (e.g. research). The softer highlights correspond to
                 words related to the attribute, but not directly optimized for during the control process (e.g. health).
                  [–] The potato and cauliflower are both in season to make combo breads, mounds, or pads. For an added
                  challenge, try some garlic mashed potatoes.
                  [Negative] The potato is a pretty bad idea. It can make you fat, it can cause you to have a terrible immune
                  system, and it can even kill you....
                  [Positive] The potato chip recipe you asked for! We love making these, and I’ve been doing so for years.
                  I’ve always had a hard time keeping a recipe secret. I think it’s the way our kids love to eat them – so many
                  little ones.
                  [Science] The potato was once thought to have no health problems and has been promoted as a nutritious
                  food source since the mid-1800s, but recent reports indicate that it has many harmful health issues. In fact,
                  researchers from Johns Hopkins University...
                  [Politics][Positive] To conclude this series of articles, I will present three of the most popular and
                  influential works on this topic. The first article deals with the role of women’s political participation in
                  building a political system that is representative of the will of the people.
                  [Politics][Negative] To conclude, the most significant and lasting damage from the economic crisis in
                  2008 was that many governments, including those in the political center, lost power for the first time in
                  modern history.

                 Controllable generation entails modeling <<FORMULA>>, where a is some desired controllable attribute(s)
                 and x  the generated sample. However, generative models only learn <<FORMULA>>. In computer vision,
                  Plug & Play Generative Networks (PPGN) from Nguyen et al. (2017) developed a mechanism for
                  generating images with different attributes by plugging a discriminator (attribute model) <<FORMULA>>
                 together with a base generative model <<FORMULA>> and sampling from the resulting <<FORMULA>>,
                 effectively creating a conditional generative model on the ﬂy from any supplied attribute model. In
                 a similar manner, we propose the Plug and Play Language Model (PPLM) for conditional language
                 generation that combines one or more simple attribute models <<FORMULA>>—either in the form of a bag-
                 of-words (BoW) or single layer classiﬁers—with a pre-trained, unconditional language model <<FORMULA>>.
                 We sample from the resulting combined model by following gradients in the latent representation
                 space in a manner inspired by the approximate Metropolis-adjusted Langevin (MALA) (Roberts
                 et al., 1996; Roberts & Rosenthal, 1998) sampler deployed in Nguyen et al. (2017).
                 Optimization is performed ex post facto in the activation space; therefore, no re-training or
                 fine-tuning is needed. Control is fine-grained, with a strength parameter determining how strong the
                 attribute influence should be; a strength of 0 fully recovers the original model <<FORMULA>>. This design
                 allows vast ﬂexibility: users can combine a state-of-the-art generative model, which may be large
                 and difﬁcult to train, with any number of attribute controllers. Attribute models may be easier to train
                 or untrained (in the case of BoW models), and multiple controllers may be combined ﬂexibly during
                 inference. In this paper, we demonstrate the PPLM approach using a GPT-2 345M model (Radford
                 et al., 2019) as the general-purpose LM <<FORMULA>>, but the method applies in any representation space
                 from any transformer-based text generator and allows combination with any attribute model <<FORMULA>>.
                 We demonstrate controlled generation with a number of attribute controllers, assembled and 
                 combined during generation, each with a different strength, acting as a set of “control knobs” that tune
                 generation towards the desired attribute (see examples in Table 1). Code for the experiments is
                 available at: https://github.com/uber-research/PPLM. Our key contributions are:

                      • We introduce the Plug and Play LM for controlled language generation, discuss its relation
                        to existing work, and explain how sampling from a PPLM works (Sections 2 and 3).
                      • We demonstrate control of text generation on a range of attributes, including 7 topics,
                        each defined using a bag of words, and 1 simple discriminator on sentiment. We quantify
                        effectiveness using both automated evaluation (separately trained perplexity and sentiment
                        models) and human evaluation (for attribute relevance and fluency). All evaluations
                        point toward the ability of PPLMs to generate attribute-controlled, fluent text (Section 4).
                      • We compare PPLM with CTRL (Keskar et al., 2019) and GPT-2 fine-tuned for positivity
                        (Ziegler et al., 2019). Our method, without any LM training, is on par with and often
                        outperforms the baselines on attribute relevance and fluency (Sections 4.2 and 4.3).
                      • We show that the PPLM approach can be used to detoxify instances where generation
                        of toxic content is likely by following the negative gradient of a model trained to detect
                        toxicity (Section 4.4). We also show how PPLM can be used for structurally constrained
                        story writing (Section 4.5).

                  2 RELATED WORK

                  Controlled generation Current methods for controlled text generation involve either ﬁne-tuning
                  existing models with Reinforcement Learning (RL) (Ziegler et al., 2019), training Generative
                  Adversarial Networks (Yu et al., 2017), or training conditional generative models (Kikuchi et al., 2016;
                  Ficler & Goldberg, 2017). Different from our approach, these methodologies are not plug and
                  play, since the entire model needs to be separately ﬁne-tuned for each speciﬁc attribute. Keskar
                  et al. (2019) train a large language model with over 50 different control codes. The results are high
                  quality because they train exactly to maximize <<FORMULA>>, but this comes at the expense of ﬁxing control
                 codes upfront and of training a very large model (1.6B parameters). Our method does not require
                 retraining any conditional generative model, and both the language model and the conditional model
                 can be ﬂexibly assembled. Table 2 gives a comparison of recent approaches to language modeling
                 tuned for speciﬁc attributes. In another interesting but tangential piece of work, Subramani et al.
                 (2019) recently showed that a pre-trained language model can be steered to recover arbitrary
                 sentences. In earlier work, Gu et al. (2016; 2017) and Chen et al. (2018) explored the idea of using a
                 small neural network to steer an LM.

                 Noisy Channel Modeling Yu et al. (2016), and more recently Yu et al. (2019); Yee et al. (2019);
                 Ng et al. (2019), leveraged the Shannon Noisy Channel Theory (Shannon, 1948) for improving
                 sequence-to-sequence modeling. Their approach translates a source language sentence y into a target
                 language sentence x by first sampling from a forward model proposal distribution p_forward <<FORMULA>> and
                 then reranking samples based on probabilities given by p_backward <<FORMULA>>. PPLM scores
                 samples using the same basic equation, but as we have no forward or proposal model p_forward <<FORMULA>>,
                 we rely on the latent space updates, similar to Nguyen et al. (2017). As a baseline, we consider
                 using <<FORMULA>> as a “forward model” and then reranking, which we will see works moderately well in
                 some scenarios and poorly in others (see Tables 4 and 6).

                 Weighted decoding Holtzman et al. (2018); Ghazvininejad et al. (2017) consider controlled
                 language generation – the former with discriminators, and the latter with a bag of words – where the
                 decoding procedure is modiﬁed to consider the scoring function used for decoding. See et al. (2019)
                 note that control with weighted decoding (WD) is difﬁcult and often leads to sacriﬁcing ﬂuency and
                 coherence. Further, Ghazvininejad et al. (2017) rely strongly on sampling from a set of keywords
                 on a specific topic, which does not allow biasing generation towards a topic in a manner that does not
                 necessarily include a set of keywords. Similarly, Baheti et al. (2018) proposed a decoding strategy
                 for generating interesting responses in dialogue systems, using bags of words and word embeddings.
                 Sophisticated sampling methods (Metropolis et al., 1953) can be used to constrain the model
                 generation to certain keywords and topics. We evaluate WD as a baseline.

                 Text Style Transfer Outside of language modeling, text style transfer studies a related task.
                 Shen et al. (2017); Hu et al. (2017) train variational auto-encoders for style transfer that rely on
                 learning disentangled latent representations for style and content. Li et al. (2018) demonstrate the
                 efﬁcacy of a simple approach based on replacing attribute related n-grams with n-grams corresponding
                 to the desired attribute based on a conditional generative model. A key difference between the
                 above and our approach is that we use an ofﬂine discriminator and perform optimization based on
                 this discriminator, which as suggested by Elazar & Goldberg (2018) may outperform adversarial
                 training approaches. More recently, Lample et al. (2019) adapt an approach from unsupervised
                 language translation to style transfer, where a denoised auto-encoder is trained with an objective

                 Table 2: Comparison of the different models and distributions. All models in this table are useful in
                 different scenarios. The particular advantage of PPLM is that very small, custom attribute models,
                 <<FORMULA>>, may be combined with powerful, general pre-trained language models, <<FORMULA>>, to create cheap
                 but still powerful conditional generative models, <<FORMULA>>.

                                                     <<TABLE>>

                 consisting of a weighted combination of a re-construction loss and a back-translation loss. While
                 the above approaches have shown impressive success on style transfer tasks, the main focus is not
                 controlled language generation, and further, the methods are not plug and play.

                  3 PLUG AND PLAY LANGUAGE MODELS

                  3.1 LANGUAGE MODELING WITH TRANSFORMERS

                  Given a sequence of tokens <<FORMULA>>, LMs are trained to compute the unconditional
                  probability of the sequence <<FORMULA>>. This probability can be rewritten in terms of a product of conditional
                  probabilities by recursively applying the chain rule (Manning et al., 1999; Bengio et al., 2003) as:
                                            
                                        <<FORMULA>>                    (1)
                                              
                 In this paper, we use a transformer (Vaswani et al., 2017) to model the distribution of natural
                 language. To present our approach clearly, we first briefly summarize the transformer using
                 recurrent notation. Let us define the history matrix H_t to consist of the key-value pairs from the past
                 <<FORMULA>>, where <<FORMULA>> corresponds to the key-value pairs <<FORMULA>> from the i-th layer
                 generated at all time-steps from 0 to t. Efficient implementations of the transformer
                 (Wolf et al., 2019) use the cached H_t to generate <<FORMULA>>, given <<FORMULA>>. This recurrent interpretation
                 of a transformer can be summarized as:

                                         <<FORMULA>>;                     (2)

                 where <<FORMULA>> is a linear transformation that maps the logit vector <<FORMULA>> to a vector of vocabulary size, and
                 then <<FORMULA>> is sampled as <<FORMULA>> p_{t+1} = Softmax(W o_{t+1}). This allows for efficient language
                 generation without repeated forward passes corresponding to the prior conditioning text <<FORMULA>>.

                  3.2 STEERING GENERATION: ASCENDING log <<FORMULA>>

                 In order to control the output of the language model, at every generation step t, we shift the history
                 H_t in the direction of the sum of two gradients: one toward higher log-likelihood (LL) of the attribute
                 a under the conditional attribute model <<FORMULA>> and one toward higher LL of the unmodified language
                 model <<FORMULA>>. Combining these factors with a variable multiplier provides us with a controllable
                 “knob” to guide generation in a given direction with a specified strength. The updates are restricted
                 to H_t and not the other model activations because future predictions depend on the past only via H_t
                 (note that H_t is composed of all transformer key and value pairs generated up to time t). Taking
                 steps in H_t space leads to gradual changes to model activations, which may be thought of as
                 gradual reinterpretations of the past, that guide future generation in the desired direction.
                 Let <<FORMULA>> be the update to H_t, such that generation with (<<FORMULA>>) shifts the distribution of
                 the generated text so that it is more likely to possess the desired attribute. ΔH_t is initialized

                                                  <<FIGURE>>                 

                 Figure 1: Simpliﬁed illustration of the proposed approach in three phases. In Step 1, a forward pass
                 is performed through the language model to compute the likelihood of a desired attribute using an
                 attribute model that predicts <<FORMULA>>. In Step 2, a backward pass updates the internal latent
                 representations of the LM, using gradients from the attribute model, to increase the likelihood of the passage
                 having the desired attribute. In Step 3, a new distribution over the vocabulary (<<FORMULA>>) is generated
                 from the updated latents (H̃_t) and the current token <<FORMULA>>. The next token is then sampled from the
                 updated distribution. This process of updating the latents is repeated at each time-step, leading to
                 a gradual transition towards the desired attribute. For computational efﬁciency, one may choose to
                 modify only the latents within some window of the recent past, depicted as the dotted-red region.

                 at zero and updated with gradients from an attribute model that measures the extent to which the
                 generated text possesses the desired attribute (e.g. positivity). We rewrite the attribute model <<FORMULA>>
                 as <<FORMULA>> and then make gradient-based updates to <<FORMULA>> as follows:

                                                <<FORMULA>>                (3)

                 where <<FORMULA>> is the step size and <<FORMULA>> is the scaling coefficient for the normalization term.¹ This update step
                 can be repeated m times; in practice we use 3 to 10. Subsequently, a forward pass through the LM
                 with the updated key-value pairs is performed to obtain the updated <<FORMULA>>, where <<FORMULA>>.
                 The perturbed õ_{t+1} is then used to generate a new distribution <<FORMULA>> as in Equation 2.
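
The repeated update step can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes Equation 3 is a normalized gradient-ascent step of the form ΔH ← ΔH + α · g / ‖g‖^γ, where g is the gradient of log p(a | H + ΔH). The history is flattened to a plain vector, the attribute model is a toy logistic scorer, and the names `log_p_attr`, `grad_log_p_attr`, and `perturb_history` as well as the default values are hypothetical.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_p_attr(h, w):
    """Toy attribute model (assumption): log p(a | h) = log sigmoid(w . h)."""
    z = sum(wi * hi for wi, hi in zip(w, h))
    return math.log(_sigmoid(z))

def grad_log_p_attr(h, w):
    """Gradient of log sigmoid(w . h) with respect to h: (1 - sigmoid(z)) * w."""
    z = sum(wi * hi for wi, hi in zip(w, h))
    return [(1.0 - _sigmoid(z)) * wi for wi in w]

def perturb_history(h, w, alpha=0.02, gamma=1.0, m=3):
    """Return the perturbation after m normalized ascent steps, starting at zero."""
    delta = [0.0] * len(h)
    for _ in range(m):
        shifted = [hi + di for hi, di in zip(h, delta)]
        g = grad_log_p_attr(shifted, w)
        norm = math.sqrt(sum(gi * gi for gi in g))
        # The small constant only guards against division by zero.
        scale = alpha / (norm ** gamma + 1e-12)
        delta = [di + scale * gi for di, gi in zip(delta, g)]
    return delta
```

With γ = 1 each step has length α regardless of the gradient magnitude, so the strength of the shift is set by α and m alone. In the actual method the gradient would come from backpropagation through the LM and attribute model rather than a closed form.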

                  3.3 ENSURING FLUENCY: ASCENDING log <<FORMULA>>

                  The approach described in the previous section is able to generate text tuned for a particular
                  discriminator, but left unchecked it will quickly result in unrealistic adversarial or fooling examples
                  (Szegedy et al., 2013; Nguyen et al., 2015) as the text moves into low probability regions. To combat
                  this, we use the unconditional language model in two ways that ensure the fluency is maintained
                  at or near the level of the unconditional language model (here GPT-2).

                  Kullback–Leibler (KL) Divergence We update <<FORMULA>> to minimize the KL divergence between the
                  output distribution of the modified and unmodified language models in addition to the step above.
                  In practice, this is accomplished by adding the quantities together before taking a gradient, though it
                  can be visualized as two separate steps as in Figure 2. We scale the KL coefficient by a scalar λ_KL,
                  and in practice, setting this hyperparameter to 0.01 works well in general across tasks.
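
A small sketch of how the two quantities might be added before taking a gradient. The direction of the divergence, KL(modified ‖ unmodified), and the function names are assumptions made for illustration; the text above only states that the KL term is scaled (here by `kl_scale`, standing in for λ_KL) and combined with the attribute term.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

def combined_objective(log_p_attr, p_modified, p_unmodified, kl_scale=0.01):
    """Quantity to ascend: attribute log-likelihood minus the scaled KL penalty."""
    return log_p_attr - kl_scale * kl_divergence(p_modified, p_unmodified)
```

When the modified distribution equals the unmodified one, the penalty vanishes and only the attribute term remains; the further the perturbed model drifts, the more the objective is reduced.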

                  Post-norm Geometric Mean Fusion In addition to minimizing KL divergence, which affects the
                  past via <<FORMULA>>, we perform post-norm fusion similarly to Stahlberg et al. (2018). This does not
                  directly affect <<FORMULA>>; rather, it just serves to constantly tie the generated text to the unconditional
                  <<FORMULA>> LM distribution. We accomplish this by sampling from <<FORMULA>>, where <<FORMULA>>
                  and <<FORMULA>> are the unmodified and modified output distributions, respectively, and <<FORMULA>> is a normalizing
                 factor such that it forms a valid distribution. As <<FORMULA>> this converges to the distribution from
                 the updated LM, and as <<FORMULA>> it converges to the unconditional LM distribution. We find that in
                 practice values for <<FORMULA>> in the range 0.8 to 0.95 work well.
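
The fusion step can be illustrated as follows. The sketch assumes the fused distribution is proportional to p̃^γ · p^(1−γ), with p the unmodified and p̃ the modified distribution, which matches the limiting behaviour described above; the function name and the parameter name `gamma_gm` are hypothetical.

```python
def fuse_geometric_mean(p_unmodified, p_modified, gamma_gm=0.9):
    """Weighted geometric mean of two distributions, renormalized to sum to 1."""
    unnorm = [
        (pm ** gamma_gm) * (pu ** (1.0 - gamma_gm))
        for pu, pm in zip(p_unmodified, p_modified)
    ]
    beta = sum(unnorm)  # normalizing factor
    return [u / beta for u in unnorm]
```

Setting `gamma_gm=1.0` returns the modified distribution and `gamma_gm=0.0` returns the unmodified one, so intermediate values interpolate between attribute control and the fluency of the base LM.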

                 1 One normalization term is computed for each layer of the transformer.

                             <<FIGURE>>  (arrows labeled “ascend <<FORMULA>>” and “ascend <<FORMULA>>”; regions labeled “higher” and “lower” <<FORMULA>>)

                 Figure 2: An oversimplified view into why steps that maximize both log <<FORMULA>> and log <<FORMULA>> are
                 needed. The sentence under consideration is shown as a black dot, which is first pushed in the
                 direction of maximizing <<FORMULA>> and then in the direction of maximizing <<FORMULA>>. In practice we
                 use a single step and simply add the log probabilities; we take steps in the continuous space of hidden
                 representations H rather than in the discrete x (byte pair) space, and rather than resampling the
                 entire sentence each step, we take one step in H space per byte-pair sample.


                  3.4 SAMPLING AND RANKING

                  The attribute model <<FORMULA>> in PPLM provides two functionalities: first, a score that can be used to
                  rank samples based on the LL of the desired attribute (forward pass only; Step 1, Figure 1), and
                  second, a gradient ascent direction to perform an update in the latent space (Steps 2 and 3, Figure 1).
                  The former can be used to generate r samples and rank them to choose the best one. This can
                 serve as a further method for attribute control, complementing sampling with updated latents.
                 Further, to avoid the problem of repetitive, low quality text (Holtzman et al., 2018), we compute the
                 mean over the Dist-1, Dist-2 and Dist-3 scores (for the generated passage), which is an indicator of
                 repetitiveness (Li et al., 2015), and then discard samples with a mean score below a threshold.
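
A minimal sketch of this filter-then-rank procedure. It assumes whitespace tokenization, Dist-n computed as the number of distinct n-grams divided by the number of tokens, and an illustrative threshold of 0.5; the function names and the threshold value are hypothetical and not taken from the experiments.

```python
def dist_n(tokens, n):
    """Distinct n-grams divided by the number of tokens in the passage."""
    if len(tokens) < n:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)

def mean_dist(text):
    """Mean of Dist-1, Dist-2 and Dist-3 for a single generated passage."""
    tokens = text.split()
    return sum(dist_n(tokens, n) for n in (1, 2, 3)) / 3.0

def rank_samples(samples, attr_scores, threshold=0.5):
    """Drop repetitive samples, then sort the rest by attribute LL (best first)."""
    kept = [(s, sc) for s, sc in zip(samples, attr_scores) if mean_dist(s) >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [s for s, _ in kept]
```

Filtering before ranking matters because a degenerate passage that repeats a single attribute word can receive a high attribute score while being unusable as text.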


                  4 EXPERIMENTS, RESULTS, AND EVALUATION

                  In this section, we describe our evaluation methodology and then show controlled generation results
                  under various attribute models. We also show use cases of PPLM in language detoxiﬁcation and in
                  controlled story telling. For all results reported in this section, we use top-k sampling (Fan et al.,
                  2018) with k=10 to draw from the softmax distribution over the vocabulary.
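
For reference, top-k sampling can be sketched as below: keep the k highest-scoring tokens, renormalize over them with a softmax, and draw one. The function name and the `rng` parameter are illustrative, and the logits are assumed to be a plain list of floats.

```python
import math
import random

def top_k_sample(logits, k=10, rng=random):
    """Sample a token index from the softmax restricted to the k largest logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding, while larger k trades determinism for diversity.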

                  4.1 EVALUATION METHODS AND ABLATION STUDY

                  We evaluate to assess two properties: whether PPLM generates text that satisﬁes the desired attribute
                  (topic or sentiment) and whether the quality of its text deteriorates as we intensify control of the
                  attribute. Note we can always turn the control knob down to zero to disable control of attributes
                  and reach the ﬂuency of the original model. If desired, a user can tune the knobs at inference until a
                  chosen tradeoff between attribute strength and ﬂuency is reached. We evaluate using both automated
                  methods and human annotators:
                  Automated Eval. Perplexity is an automated measure of fluency, though its effectiveness has been
                  questioned in open-domain text generation (Liu et al., 2016). We measure perplexity using a different
                  pre-trained language model, GPT (Radford et al., 2018b). The diversity of text in the passages
                  is measured using the number of distinct n-grams (normalized by the length of text) as in Li et al.
                  (2015). We report Dist-1, Dist-2, and Dist-3 scores for the distinct 1-2-3-grams (measured across
                  all samples generated for a given attribute control task, e.g. a speciﬁc topic for topic control). Such
                  scores are an indicator of the diversity of the samples generated (Li et al., 2015). We also use external
                  sentiment classiﬁers for sentiment evaluation.
                  Human Eval. We consider two types of human annotation: fluency and A/B testing on attribute
                 relevance. Annotators are asked to evaluate the ﬂuency of each individual sample on a scale of 1-5,
                 with 1 being “not ﬂuent at all” and 5 being “very ﬂuent,” as done in Lample et al. (2019). In the A/B
                 testing for attribute relevance, we consider all combinatorial pairs of all four variants: B, BR, BC,
                 and BCR (6 combinations). We then ask annotators to rank the pair on the desired attribute (e.g. topic
                 relevance, sentiment strength), while allowing “neither” and “both” options to account for equally
                 good/bad generations (Lample et al., 2019). We obtain annotations from nine external occupational
                 annotators. Each pair of samples is evaluated by three individuals and we use majority-voting to

                   Table 3: Comparison of different samples generated by (top row) baseline GPT-2 and (other rows)
                   PPLM with different BoW corresponding to different topics (e.g. [Military]), all conditioned on a
                   single prefix: "The issue focused". Both directly optimized (in red) and related words (in soft red)
                   are highlighted, showing how the optimization takes effect.

                                                      <<TABLE>>

                   compute attribute relevance. For fluency, we use the average of the three annotations. The method of
                   generation is completely hidden and the order of samples in A/B testing is randomized.
                   Ablation study and baselines. We conduct an ablation study with four variants: B: the baseline,
                   unchanged GPT-2 LM, sampled once; BR: B but sampled r times, with the best sample chosen based
                   on the LL ranking and filtering based on Dist score; BC: update the latent representations (H̃_t) and
                   then sample once; and lastly BCR: update the latent representations (H̃_t) and generate r samples,
                   then choose the best sample based on the LL score (after filtering out samples with low Dist scores). As
                   baseline approaches we consider CTRL (Keskar et al., 2019), a recent language model; GPT2-FT-RL,
                   a GPT-2 LM fine-tuned for human-evaluated positivity with RL (Ziegler et al., 2019); and WD,
                   a weighted decoding baseline in which the B LM’s outputs are weighted directly toward maximizing
                   <<FORMULA>> (Ghazvininejad et al., 2017); see Section S7 for details, and Section S11 for hyperparameters.

                    4.2 BOW ATTRIBUTE MODELS

                   The simplest attribute model we use gives the log of the sum of likelihoods of each word in some
                   predeﬁned Bag of Words (BoW). Given a set of keywords <<FORMULA>> that specify a topic of
                   interest and the output distribution of the language model <<FORMULA>>, the log likelihood is:
                                                        
                                            <<FORMULA>>                     (4)
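
As a concrete reading of this definition, the BoW score is the log of the total probability mass the next-token distribution assigns to the bag. In the sketch below the distribution is a plain list indexed by token id and the bag is a list of token ids; both representations and the function name are assumptions for illustration.

```python
import math

def bow_log_likelihood(p_next, bow_token_ids):
    """log of the summed next-token probability over all words in the bag."""
    return math.log(sum(p_next[i] for i in bow_token_ids))
```

For example, with `p_next = [0.1, 0.2, 0.3, 0.4]` and a bag containing token ids 1 and 3, the score is log(0.2 + 0.4) = log 0.6. Because the sum sits inside the log, the gradient rewards raising the probability of any word in the bag rather than of every word at once.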
                                                           
                   We construct BoWs that represent seven distinct topics: SCIENCE, MILITARY, LEGAL, COMPUTERS,
                   SPACE, POLITICS and RELIGION (see Section S17 for complete word lists). Samples are
                   shown in Table 3, generated from a single preﬁx, while being controlled towards each topic.
                   Interestingly, we ﬁnd that increasing the probability of generating the words in the bag also increases
                   the probability of generating related topical words not in the BoW (e.g. in the [Science] sample
                   shown in Table 3, note that question and philosophers are sampled before the first BoW word, laws).
                   Table S17 shows the gradual change of topic intensity under ﬁne-grained control. We found that
                   the optimization procedure works better with updating representations from the past over a ﬁnite
                   window and using an adaptive normalization scheme (see Section S11.3).
                   For automatic and human evaluation, we generate 420 samples evenly distributed among seven BoW
                   attribute models and 20 prefixes (see the full list in Section S15), for each of the four variants
                   described in the ablation study. See Section S8 for further details on evaluation and results. Table 4
                   shows that human annotators ﬁnd text from BCR (51.7%) and BC (46.9%) to be signiﬁcantly more

                 Table 4: For each treatment in the ablation study, we report mean ± std-dev across (human and
                 automated) ﬂuency metrics. The topic (%) reports the fraction of samples matching the target topic,
                 as evaluated by human annotators. Table S8 provides per-topic results. Approaches BC and BCR
                 demonstrate signiﬁcant control over the topic of the generated text, while retaining similar diversity
                 (Dist-1, Dist-2, Dist-3) scores and minimal degradation in Perplexity and Fluency evaluations vs the
                 baseline LM (B). The gain from ranking and choosing from multiple samples BR over B is limited
                 (4.7%). The gain in topic-accuracy from latent (H̃_t) manipulation (from B to BC) is significantly
                  higher (35.8%). Perplexity is computed using the GPT LM (Radford et al., 2018a), which differs
                  from the LM generating text (GPT-2). For CTRL and WD, since human evaluation is performed
                  in comparison with BCR via A/B testing, we report the numbers for BCR as well from these 
                  comparisons, for the human evaluated metrics. Further, we consider one sample per preﬁx for CTRL,
                  resulting in fewer samples and higher Dist-1, 2, 3 scores as a consequence. PPLM outperforms
                  CTRL and WD on topic-relevance, while being comparable on ﬂuency scores.

                                               <<TABLE>>

                 on topic than B (11.1%) and BR (15.8%). With only a slight degradation in fluency scores, passages
                 generated with manipulated latents (BCR and BC) are significantly on topic, demonstrating the
                 desired attribute control on this task. The Dist-1, Dist-2 and Dist-3 scores, which account for diversity
                 of text across the generated passages, are similar across all four ablation approaches. Further, BCR
                 slightly outperforms CTRL (51.7% vs. 50.0%), and significantly outperforms WD (36%). BC itself
                 outperforms WD (36%). BCR, CTRL and WD all score similarly on the fluency metric.
                 We note that gradient-based latent updates have significantly greater influence on topic relevance
                 (C with or without R) than reranking based on the score (R with or without C), showing that shifting
                 meaning in latent space is more effective than shifting the output distribution directly through
                 reweighting. The effectiveness of shifting latents is further corroborated by WD’s relatively
                 worse performance. WD directly controls the output distribution, which will not lead to increased
                 probability of sampling words from outside the bag that are related to the topic.
                 Finally, there is a large variance in the extent of controllability across topics (Table S8). We ﬁnd
                 that some topics (religion, science, politics) are easier to control for compared to others (computers,
                 space). Section S9 considers unusual or nonsensical combinations of preﬁxes and attributes
                 (e.g. prefix ‘potato’ and topic ‘religion’), and we find that even for these settings PPLM is able to
                 successfully control for the desired attribute, often with hilarious twists!

                  4.3 DISCRIMINATOR ATTRIBUTE MODELS

                  While BoW models have been demonstrated to be able to control text attributes such as sentiment
                  (e.g., Li et al. (2018) rely on extracting a set of attribute-based phrases to control the sentiment
                  during style transfer), being able to control attributes using more sophisticated discriminators is
                  desirable when it is difﬁcult to express the attribute with a simple bag of words.
                  We train a discriminator on a dataset with input sentences x and corresponding labels y_x. For an
                 input x of length t, we compute o^x_{:t} and train f on the mean (<<FORMULA>>) of the embeddings across time.
                 All discriminators in this work consist of a single-layer classifier that predicts the target label from
                 <<FORMULA>>. The number of parameters in this layer is (embedding-dimension (e) × number of attributes
                  (a) + number of attributes (a)), which is negligible compared to the number of parameters in the
                  LM itself (Table 2). Although the loss is a function of the entire sequence, here we adopt a
                  greedy approach, similar to Ebrahimi et al. (2018); Wallace et al. (2019), in which we optimize for
                                
                 Table 5: Sentence samples in triplets, generated by {baseline GPT-2, PPLM-Discrim POSITIVE,
                 PPLM-Discrim NEGATIVE}, conditioned on prefixes: "The chicken" & "The country". Words related to
                 the sentiment are highlighted (in soft red). Each triplet is generated from the same random seed.
                 
                                                          <<TABLE>>

                 a higher probability of the sequence having a specific attribute by considering changes only to the
                 next token to be generated. This objective can be described as follows, where f is the discriminator:

                                        <<FORMULA>>                   (5)

                  Note that <<FORMULA>> is a function of <<FORMULA>> . Further, <<FORMULA>>, which depends on <<FORMULA>>.
                 In the limit, minimizing the objective in Equation 5 corresponds to choosing <<FORMULA>> that produces the
                  optimal <<FORMULA>> that maximizes <<FORMULA>>. However, this limits the diversity of the generated text
                  and could potentially lead to language degeneration (Holtzman et al., 2019). Alternatively, we focus
                  on a softer optimization approach where we aim to shift the distribution <<FORMULA>>
                 towards one that in expectation has a higher likelihood of having the desired attribute a. Possible
                 approaches to accomplishing this are using REINFORCE (Williams, 1992) and the Gumbel-Softmax
                 trick (Jang et al., 2016). However, both of these would slow down convergence. Instead, as in Dai
                 et al. (2019a), we use the distribution <<FORMULA>> (instead of a hard sample <<FORMULA>>), and feed it forward to
                 obtain a (biased) estimate of the next token’s embedding and then update <<FORMULA>>.
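
                 To illustrate the difference between feeding forward a distribution and feeding forward a hard
                 sample, the following is a minimal sketch in plain Python. The toy vocabulary, the embedding
                 values, and the function name expected_embedding are assumptions made for illustration only;
                 they are not taken from the PPLM codebase.

```python
def expected_embedding(probs, embedding_matrix):
    """Return sum_w probs[w] * embedding_matrix[w], a probability-weighted average."""
    dim = len(embedding_matrix[0])
    out = [0.0] * dim
    for p, row in zip(probs, embedding_matrix):
        for j in range(dim):
            out[j] += p * row[j]
    return out


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up numbers).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p_next = [0.5, 0.25, 0.25]

# Soft input: differentiable with respect to p_next.
soft = expected_embedding(p_next, E)

# Hard input: embedding of the argmax token; the argmax blocks gradients.
hard = E[max(range(len(p_next)), key=p_next.__getitem__)]
```

                 The soft input is a smooth function of the distribution, which is what allows a gradient to
                 flow back to the latents; the hard input is not.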
                  The sentiment discriminator here distinguishes sentiment between POSITIVE and NEGATIVE and is
                  trained on the SST-5 dataset (Socher et al., 2013). Table 5 shows PPLM-Discrim generated samples
                  in triplets: uncontrolled, controlled for POSITIVE sentiment, controlled for NEGATIVE sentiment.
                 For automatic and human evaluation, we use 15 preﬁxes (see the full list in Section S15) to generate
                 45 samples for each of two sentiment classes: very positive and very negative. Note
                 that even though the sentiment discriminator is trained with movie review data, the preﬁxes (e.g.
                 “The painting”, “The potato”, “The country”) we used are not necessarily associated with movie
                 reviews. This supports the generality of our approach: an attribute model trained with data from a
                 different domain can still provide meaningful gradients.
                 Table 6 shows evaluation results. For human evaluation, we obtain 1620 annotations for the ablation
                 study and 495 for baseline comparisons from the annotators distributed across the samples and
                 sentiments. Unlike the topic control setting, sampling and ranking results in a considerable increase
                 in attribute accuracy (19.3% → 41.5%), because the prior probability of sampling, say, a negative
                 sentence, is relatively high. BC results in a decrease in fluency when compared to B, while being
                 significantly more consistent with the desired attribute (19.3% → 39.6%). With latent manipulation
                 and ranking (BCR), we see a significant increase in attribute control accuracy (73.7%) while retaining
                 fluency similar to B and BR. Further, the gain in sentiment accuracy from re-sampling is larger
                 in the case of manipulated latents vs. non-manipulated (34.1% increase from BC to BCR > 22.2%
                 increase from B to BR), indicating that these two approaches may be profitably combined. We also
                  evaluate attribute control with an external sentiment classiﬁer trained on IMDB movie reviews (Maas
                  et al., 2011), which is a different dataset from the one used to train the attribute model (Socher et al.,
                  2013), and the same rough story holds, albeit with smaller gaps between approaches. We compare to
                  baselines CTRL, GPT2-FT-RL, and WD. BCR performs comparably to CTRL (73.7% and 80.0%),
                  and BR, BC and BCR all outperform GPT2-FT-RL, the GPT-2 LM ﬁne tuned for positivity, and WD.
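
                 The two re-sampling gains quoted above follow directly from the four accuracies reported in the
                 text. As a small worked check (the dictionary and variable names are ours, chosen for illustration):

```python
# Sentiment accuracies (%) as quoted in the text for each ablation variant.
acc = {"B": 19.3, "BR": 41.5, "BC": 39.6, "BCR": 73.7}

# Gain from ranking without latent manipulation (B -> BR).
gain_unperturbed = round(acc["BR"] - acc["B"], 1)

# Gain from ranking with latent manipulation (BC -> BCR).
gain_perturbed = round(acc["BCR"] - acc["BC"], 1)

print(gain_unperturbed, gain_perturbed)
```

                 The larger gain in the manipulated case is the basis for the claim that the two approaches combine well.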

                 Table 6: Evaluation of models/variants on the sentiment control task, with mean ± std-dev reported
                 across fluency metrics. Sentiment accuracy reports the fraction of samples with an accurate target
                 sentiment. Approach BCR provides significant control over sentiment while showing minimal
                 degradation in ﬂuency. See Table S9 for full results on individual sentiments. *GPT2-FT-RL is only
                 evaluated for the positivity half of the task, as it is ﬁne-tuned only for positivity (Ziegler et al., 2019).
                 For human evaluation metrics, we compare the baselines CTRL, GPT2-FT-RL and WD with BCR
                 and perform A/B style testing. We include both numbers for comparison.

                                <<TABLE>>

                  4.4 LANGUAGE DETOXIFICATION

                  Language models trained with large corpora of Internet data reﬂect biases and discrimination 
                  existing in the data. A recent paper by Wallace et al. (2019) conducted adversarial attacks that make
                  GPT-2 produce racist output when given a carefully optimized trigger string as preﬁx. They also
                  ﬁnd that when simply using “Blacks” as preﬁx, 2% of GPT-2 samples contain explicit racism. Other
                  preﬁxes (e.g., “Asians” or “Jews”) are mentioned but no percentage is reported. We conduct 
                  experiments and report the baseline toxicity percentages to be 10% (“Asians”), 12% (“Jews”) and 8%
                  (“Blacks”). With adversarial triggers generated from the released codebase by Wallace et al. (2019)
                  the average toxicity percentage is 63.6%. Further details can be found in Section S13.
                  PPLMs can be easily adapted for language detoxification by plugging in a toxicity classifier as the
                  attribute control model and updating latents with the negative gradient. We train a single layer classifier
                  on the toxicity data from the Toxic Comment Classification Challenge (Jigsaw) and show that with
                  a similar hyper-parameter setting as other PPLM-Discrim methods, it works well on both natural
                  prompts and adversarial triggers. For natural prompts, the percentages of toxicity are 6%, 4% and 10%,
                  respectively, and for adversarial triggers the percentage drops drastically to 4.6% on average, with statistical
                  signiﬁcance. Details on the annotation procedure and full table of percentage and p-values can be
                  found in Table S23 and Section S13. Note that a model for detoxifying language can also potentially
                  be maliciously used for generating toxic language, a topic we brieﬂy discuss in Section S6.
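
                  The idea of updating with the negative gradient can be sketched on a toy problem. The example
                  below assumes a logistic (single layer) toxicity classifier acting directly on a small latent
                  vector; the weights, the latent values, the step size, and the function names are hypothetical,
                  and the actual PPLM update operates on the LM's key-value history rather than on a raw vector.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toxicity_prob(h, w, b):
    """p(toxic | h) under a toy single-layer logistic classifier."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def detox_step(h, w, b, step_size):
    """Move h along the NEGATIVE gradient of log p(toxic | h)."""
    p = toxicity_prob(h, w, b)
    # For a logistic classifier, d/dh log p(toxic | h) = (1 - p) * w.
    grad = [(1.0 - p) * wi for wi in w]
    return [hi - step_size * gi for hi, gi in zip(h, grad)]


# Hypothetical classifier weights and latent vector.
w, b = [1.0, 2.0, -1.0], 0.0
h = [0.5, -0.2, 0.1]

p_before = toxicity_prob(h, w, b)
h_new = detox_step(h, w, b, step_size=0.1)
p_after = toxicity_prob(h_new, w, b)
```

                  Flipping the sign of the step would instead increase the toxicity probability, which is the
                  dual-use concern noted above.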

                  4.5 CONTROLLED STORY WRITING

                  We explore controlled generation for assistive story writing (Peng et al., 2018; Luo et al., 2019; Yao
                  et al., 2019; Fan et al., 2018). Using uncontrolled LMs for assistive art creation can be difﬁcult. To
                  help with the structure, we use predeﬁned story skeletons often used in improvisation (Adams). We
                  ﬁll in the blank between these preﬁxes with a PPLM. See examples in Table S20 and Table S21.


                  5 CONCLUSION

                  We have presented PPLM, a plug and play method for controlled language generation that ﬂexibly
                  combines a large, pre-trained LM and a BoW or a small, easy-to-train discriminator. In Section S6
                  we discuss the ethics of controlled LMs. PPLM achieves ﬁne-grained control of attributes via a
                  simple gradient-based sampling mechanism. Because PPLMs can ﬂexibly control generation while
                  maintaining ﬂuency, they hold great promise for enabling the next generation of language models.

                  ACKNOWLEDGEMENTS

                  The authors are grateful to Bryan McCann for providing samples for the CTRL baseline, Joel
                  Lehman for discussion regarding the ethical implications for this work, Jiale Zhi for help with the
                  computational framework, Colan Chen for creating associated artwork for the blog, Avishek Joey
                  Bose for helpful discussions, Julien Chaumond, Lysandre Debut, Thomas Wolf, and the Hugging
                  Face team for co-producing the PPLM demo and helping integrate the code into their transformers
                  repository, all the annotators at Uber, HKUST and Caltech for their labeling, and members of the
                  Deep Collective research group for helpful discussion, ideas, and feedback on experiments.

                  REFERENCES
                  Kenn Adams. Improv encyclopedia story spine. http://improvencyclopedia.org/
                   games/Story_Spine.html. (accessed September 20, 2019).
                 Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. Generating more interesting responses in
                   neural conversation models with distributional constraints. In Proceedings of the 2018 Conference
                   on Empirical Methods in Natural Language Processing, pp. 3970–3980, 2018.
                 Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic
                   language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
                 Yun Chen, Victor OK Li, Kyunghyun Cho, and Samuel R Bowman. A stable and effective learning
                   strategy for trainable greedy decoding. arXiv preprint arXiv:1804.07915, 2018.
                 Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. Style transformer: Unpaired text style
                   transfer without disentangled latent representation. arXiv preprint arXiv:1905.05621, 2019a.
                 Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan
                   Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv
                   preprint arXiv:1901.02860, 2019b.
                 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
                   bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
                   the North American Chapter of the Association for Computational Linguistics: Human Language
                   Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
                 Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples
                   for text classification. In Proceedings of the 56th Annual Meeting of the Association
                   for Computational Linguistics (Volume 2: Short Papers), pp. 31–36, Melbourne, Australia,
                   July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2006. URL
                   https://www.aclweb.org/anthology/P18-2006.
                 Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data.
                   In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
                   pp. 11–21, Brussels, Belgium, October-November 2018. Association for Computational
                   Linguistics. doi: 10.18653/v1/D18-1002. URL
                   https://www.aclweb.org/anthology/D18-1002.
                 Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. arXiv preprint
                   arXiv:1805.04833, 2018.
                 Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation.
                   In Proceedings of the Workshop on Stylistic Variation, pp. 94–104, 2017.
                 Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an interactive poetry
                   generation system. In Proceedings of ACL 2017, System Demonstrations, pp. 43–48, Vancouver,
                   Canada, July 2017. Association for Computational Linguistics. URL
                   https://www.aclweb.org/anthology/P17-4008.
                 Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor OK Li. Learning to translate in real-time
                   with neural machine translation. arXiv preprint arXiv:1610.00388, 2016.
                 Jiatao Gu, Kyunghyun Cho, and Victor OK Li. Trainable greedy decoding for neural machine
                   translation. arXiv preprint arXiv:1702.02429, 2017.
                 Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning
                   to write with cooperative discriminators. CoRR, abs/1805.06087, 2018. URL
                   http://arxiv.org/abs/1805.06087.
                 Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text
                   degeneration. arXiv preprint arXiv:1904.09751, 2019.
                 Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Controllable text
                   generation. CoRR, abs/1703.00955, 2017. URL http://arxiv.org/abs/1703.00955.
                 Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2016.
                 Jigsaw. Toxic comment classiﬁcation challenge.   https://www.kaggle.com/c/
                   jigsaw-toxic-comment-classification-challenge/. Accessed: 2019-11-13.
                 Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong, and Richard Socher. CTRL
                   - A Conditional Transformer Language Model for Controllable Generation. arXiv preprint
                   arXiv:1909, 2019.
                 Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura.
                   Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on
                   Empirical Methods in Natural Language Processing, pp. 1328–1338, Austin, Texas,
                   November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1140. URL
                   https://www.aclweb.org/anthology/D16-1140.
                 Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato,
                   and Y-Lan Boureau. Multiple-attribute text rewriting. In International Conference on Learning
                   Representations, 2019. URL https://openreview.net/forum?id=H1g2NhC5KQ.
                 Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A Diversity-Promoting
                   Objective Function for Neural Conversation Models. arXiv e-prints, art. arXiv:1510.03055, Oct
                   2015.
                 Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to
                   sentiment and style transfer. CoRR, abs/1804.06437, 2018. URL
                   http://arxiv.org/abs/1804.06437.
                 Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau.
                   How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics
                   for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods
                   in Natural Language Processing, pp. 2122–2132, 2016.
                 Fuli Luo, Damai Dai, Pengcheng Yang, Tianyu Liu, Baobao Chang, Zhifang Sui, and Xu Sun.
                   Learning to control the fine-grained sentiment for story ending generation. In Proceedings of the
                   57th Annual Meeting of the Association for Computational Linguistics, pp. 6020–6026, 2019.
                 Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher
                   Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting
                   of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150,
                   Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL
                   http://www.aclweb.org/anthology/P11-1015.
                 Christopher D Manning and Hinrich Schütze. Foundations of statistical
                   natural language processing. MIT press, 1999.
                  Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward
                   Teller. Equation of state calculations by fast computing machines. The journal of chemical
                   physics, 21(6):1087–1092, 1953.
                 Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. Facebook fair’s
                   wmt19 news translation task submission. arXiv preprint arXiv:1907.06616, 2019.
                 Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High
                   confidence predictions for unrecognizable images. In The IEEE Conference on Computer Vision and
                   Pattern Recognition (CVPR), June 2015.
                 Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & Play
                   Generative Networks: Conditional Iterative Generation of Images in Latent Space. In The IEEE
                   Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
                 Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. Towards controllable story
                   generation. In Proceedings of the First Workshop on Storytelling, pp. 43–49, 2018.
                 Martin Potthast, Tim Gollub, Kristof Komlossy, Sebastian Schuster, Matti Wiegmann, Erika
                   Patricia Garces Fernandez, Matthias Hagen, and Benno Stein. Crowdsourcing a large corpus of
                   clickbait on twitter. In Proceedings of the 27th International Conference on Computational
                   Linguistics, pp. 1498–1507, 2018.
                 Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
                   understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-
                   assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018a.
                 Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
                   understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-
                   assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018b.
                 Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
                   models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
                  Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin
                   diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):
                   255–268, 1998.
                 Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of langevin distributions and
                   their discrete approximations. Bernoulli, 2(4):341–363, 1996.
                 Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. What makes a good conversation?
                   How controllable attributes affect human judgments. arXiv e-prints, art. arXiv:1902.08654, Feb
                   2019.
                 Claude Elwood Shannon. A mathematical theory of communication. Bell system technical journal,
                   27(3):379–423, 1948.
                 Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. Style transfer from non-parallel
                   text by cross-alignment. CoRR, abs/1705.09655, 2017. URL
                   http://arxiv.org/abs/1705.09655.
                 Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng,
                   and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
                   treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language
                   Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for
                   Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1170.
                  Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model.
                   arXiv preprint arXiv:1809.00125, 2018.
                 Nishant Subramani, Sam Bowman, and Kyunghyun Cho. Can unconditional language models
                   recover arbitrary sentences? arXiv preprint arXiv:1907.04944, 2019.
                 Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J.
                   Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
                 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
                   Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural
                   Information Processing Systems, pp. 6000–6010, 2017.
                 Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial
                   triggers for nlp. arXiv preprint arXiv:1908.07125, 2019.
                 Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement
                   learning. Machine learning, 8(3-4):229–256, 1992.
                 Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
                   Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Transformers:
                   State-of-the-art natural language processing, 2019.
                 Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan.
                   Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on
                   Artificial Intelligence, volume 33, pp. 7378–7385, 2019.
                 Kyra Yee, Nathan Ng, Yann N Dauphin, and Michael Auli. Simple and effective noisy channel
                   modeling for neural machine translation. arXiv preprint arXiv:1908.05731, 2019.
                 Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets
                   with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
                 Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. The neural noisy
                   channel. arXiv preprint arXiv:1611.02554, 2016.
                 Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and
                   Chris Dyer. Putting machine translation in context with the noisy channel model. arXiv preprint
                   arXiv:1910.00553, 2019.
                 Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul
                   Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv
                   preprint arXiv:1909.08593, 2019. URL https://arxiv.org/abs/1909.08593.


                           SUPPLEMENTARY INFORMATION FOR:
                    PLUG AND PLAY LANGUAGE MODELS: A SIMPLE
                     APPROACH TO CONTROLLED TEXT GENERATION

                  S6 ETHICS OF CONTROLLED LANGUAGE MODELS

                  There has recently been a substantial discussion around the ethics of capable language models
                  (Radford et al., 2019; Keskar et al., 2019), both in their potential to recapitulate problematic social biases
                  and for them to be directly abused for societal harm (e.g. to generate disinformation). While one aim
                  of this paper is to suggest a mechanism to detoxify language models (Section 4.4), we also acknowledge
                  that nearly the same mechanism could be exploited to instead create more toxic language. Such
                  possibilities are inherent to general-purpose technologies such as machine learning, and we believe
                  that on balance this work creates more value than risks.

                  S7 DETAILS ON BASELINE METHODS

                  We consider three baselines: CTRL, GPT2-FT-RL, and WD. The ﬁrst two are strong baselines where
                  large language models are trained (or ﬁne-tuned) speciﬁcally to generate texts conditioned on certain
                  attributes, while WD is considered a weak baseline based on a direct integration of the conditioning
                  into the decoding.
                  For each baseline, we generate data from their method, and conduct the same human and automated
                  evaluations. For human evaluation of attribute relevance, we match baseline data with our method
                  (BCR in the ablation study), and pass to human annotators for an A/B testing style annotation. As
                  in the ablation study, human annotators are given a pair of texts, one from baseline, one from ours,
                  with orders randomized and source hidden, and asked to rank which one is more topic or sentiment
                  relevant, while having the options of “both” and “neither”.
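
                  The blinded A/B setup described above (orders randomized, source hidden) can be sketched as
                  follows. This is only an illustration under our own assumptions: the function name, the record
                  layout, and the example texts are hypothetical and do not describe the annotation tooling used
                  for the paper.

```python
import random


def make_blinded_pairs(baseline_texts, ours_texts, rng):
    """Pair each baseline sample with one of ours and shuffle the display order."""
    pairs = []
    for base, ours in zip(baseline_texts, ours_texts):
        items = [("baseline", base), ("ours", ours)]
        rng.shuffle(items)  # randomize which text is shown first
        pairs.append({
            "shown": [text for _, text in items],    # what the annotator sees
            "key": [source for source, _ in items],  # kept hidden until scoring
        })
    return pairs


rng = random.Random(0)  # fixed seed so the pairing is reproducible
pairs = make_blinded_pairs(
    ["baseline sample 1", "baseline sample 2"],
    ["our sample 1", "our sample 2"],
    rng,
)
```

                  Annotators would see only the "shown" field and choose the first text, the second, “both”, or
                  “neither”; the "key" field is used afterwards to map votes back to methods.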
                  In addition, we ask human annotators to give a fluency score to each text sample under
                  each method individually. Automated evaluations of perplexity, sentiment, etc. are also done
                  individually.

                  S7.1 CTRL

                  The recent conditional language model, CTRL, from Keskar et al. (2019), trains a 1.6B LM
                  conditioned on around 50 control codes. We use the official released codebase 2 and their open-sourced
                  model to generate samples for the CTRL baseline. Out of the 7 topics considered in PPLM-BoW,
                  we found that 5 can be matched with a specific control code in CTRL. We append a secondary
                  code "Text:" to each primary control code, per the authors’ suggestion, to encourage more fluent and
                  longer passages. The 2 topics missing a match with CTRL are: Military, Space. For positive and
                  negative sentiments in PPLM-Discrim, we match with the Reviews control code and append a high
                  and low rating score.
                  The matched attributes and control codes are listed in Table S7.
                  Under this setting, for each control code we generate texts prompted by the same preﬁxes used for
                  corresponding PPLM attribute model (20 for PPLM-BoW, 15 for PPLM-Discrim). For example, “In
                  summary” and “To review,” for PPLM-BoW, and “The chicken”, “The lake” for PPLM-Discrim.
                  Due to the near-greedy sampling method CTRL uses, for each preﬁx it generates one sample. Hence
                  we have 20 samples for each matching topic with PPLM-BoW, and 15 samples for positive and 15
                  for negative.

                  S7.2 GPT2-FT-RL

                  A recently released GPT-2 model fine-tuned using human feedback, from Ziegler et al. (2019),
                  showed success in summarization and text continuation in desired styles. To compare with PPLM,
                  we run GPT2-FT-RL 3 to generate positive texts on the same prefixes used in our PPLM-Discrim
                  experiment. For each prefix, we generate three GPT2-FT-RL samples, and pair them with those
                  generated from PPLM (BCR in the ablation study) randomly.

                 2 CTRL codebase: https://github.com/salesforce/ctrl

                 Table S7: Control codes used for the model from Keskar et al. (2019) for experiments in Section 4.

                                  <<TABLE>>

                  S7.3 WEIGHTED DECODING (WD)

                 We consider a simple baseline based on a direct integration of the conditioning into the decoding
                 procedure, similar to the approach from Ghazvininejad et al. (2017).

                 Topic Control with Bag of Words In Ghazvininejad et al. (2017), the authors consider increasing
                 the likelihood of sampling from a bag of key-words by performing beam-search with a modiﬁed
                 scoring function.
                                               <<FORMULA>>;
                                                              
                 where 1_BoW(<<FORMULA>>) is an indicator function indicating whether the token w_i is present in the bag BoW. Since
                 it has been shown that beam search results in degradation of language for GPT-2 (Holtzman et al.,
                 2019), we consider top-5 sampling from a distribution <<FORMULA>> defined such that:

                                    <<FORMULA>>

                 where <<FORMULA>> and <<FORMULA>> is the distribution over the vocabulary as predicted by the GPT-2 LM. For
                 the experiments in Section 4, we set <<FORMULA>>.
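
                 A minimal sketch of weighted decoding with a bag of words followed by top-5 sampling is given
                 below. Because the exact form of the modified distribution is not recoverable from the text here,
                 the sketch assumes a multiplicative boost of (1 + tau) for in-bag tokens followed by
                 renormalization; this form, the value of tau, the toy vocabulary, and all function names are
                 assumptions made for illustration.

```python
import random


def boost_bag_of_words(probs, bag, tau):
    """Scale the probability of every in-bag token by (1 + tau), then renormalize.

    ASSUMPTION: the multiplicative form is illustrative, not taken from the paper.
    """
    weights = {w: (p * (1.0 + tau) if w in bag else p) for w, p in probs.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens and sample one in proportion to its mass."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [w for w, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution and a toy "space" bag of words (made-up values).
lm_probs = {"the": 0.4, "a": 0.3, "dog": 0.15, "cat": 0.06, "rocket": 0.05, "orbit": 0.04}
bag = {"rocket", "orbit"}

boosted = boost_bag_of_words(lm_probs, bag, tau=10.0)  # tau is hypothetical
token = top_k_sample(boosted, k=5, rng=random.Random(0))
```

                 Unlike the PPLM update, this reweighting touches only the output distribution and leaves the
                 LM's internal latents unchanged.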

                 Sentiment Control with Discriminator Here, we implement weighted decoding similarly for
                 sentiment control, incorporating the score from the attribute model into decoding. To
                 control for style â, instead of sampling from the distribution p_{t+1}, we sample from <<FORMULA>> defined as:

                                      <<FORMULA>>

                 <<FORMULA>> is the probability of the sequence <<FORMULA>> possessing attribute <<FORMULA>> assigned by the
                 attribute model. By Bayes’ rule, <<FORMULA>>, and we do top-5
                 sampling from this distribution. Recall that <<FORMULA>> under the language model.
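
                 The discriminator variant can be sketched in the same spirit. The example assumes that the
                 reweighted distribution is proportional to the product of the attribute model's probability for
                 the extended sequence and the LM probability of the candidate token; the toy attribute model,
                 its scores, and the function names are hypothetical stand-ins, not the trained discriminator.

```python
import random


def attribute_prob(prefix, token):
    """Hypothetical stand-in for the attribute model's p(a | prefix + token)."""
    positive = {"great", "lovely"}
    return 0.9 if token in positive else 0.2


def weighted_decoding_distribution(prefix, lm_probs, k=5):
    """Weight each candidate by attribute probability times LM probability,
    keep the top-k candidates, and renormalize over them."""
    joint = {w: attribute_prob(prefix, w) * p for w, p in lm_probs.items()}
    top = dict(sorted(joint.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(top.values())
    return {w: v / total for w, v in top.items()}


# Toy next-token distribution after the prefix (made-up values).
lm_probs = {"great": 0.1, "bad": 0.3, "fine": 0.2, "lovely": 0.1, "awful": 0.2, "okay": 0.1}

dist = weighted_decoding_distribution("The chicken is", lm_probs, k=5)
rng = random.Random(0)
token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
```

                 Note that this requires one attribute-model evaluation per candidate token at every step, which
                 is one practical cost of weighted decoding with a discriminator.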

                  S8 FURTHER DETAILS ON HUMAN AND AUTOMATED EVALUATION

                 We conduct evaluations on attribute relevance and language ﬂuency, both including human and
                 automated evaluation.
                 For topic relevance (a.k.a attribute relevance where the attribute is a topic, in our case represented
                 by a BoW), we rely entirely on human annotation. For sentiment relevance, we rely on human
                 annotation as well as a separately trained sentiment classiﬁer. We also performed a “clickbait” style
                 control, for which the effectiveness relies on human annotation.

                 3 GPT2-FT-RL codebase: https://github.com/openai/lm-human-preferences

                 For fluency, we use human annotations (on a scale from 1 to 5) and automated methods: perplexity, Dist-1,
                 Dist-2, and Dist-3 scores.
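
                 For reference, a Dist-n score is commonly computed as the number of distinct n-grams divided by
                 the total number of n-grams (Li et al., 2015). The sketch below assumes whitespace tokenization
                 and a single sample; the exact tokenization and aggregation used in the experiments are not
                 specified here, so both are assumptions.

```python
def dist_n(tokens, n):
    """Fraction of n-grams in `tokens` that are distinct (0.0 if there are none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


sample = "the chicken is the chicken".split()

dist_1 = dist_n(sample, 1)  # 3 distinct unigrams out of 5
dist_2 = dist_n(sample, 2)  # 3 distinct bigrams out of 4
dist_3 = dist_n(sample, 3)  # 3 distinct trigrams out of 3
```

                 Higher values indicate less repetition, which is why these scores serve as a diversity measure
                 alongside perplexity.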
                 The numbers of human evaluations are as follows:

                                               <<TABLE>>

                  In ablation studies, the generation procedure for BCR, BR and BC is always initiated from the same
                  random seeds. The set of random seeds that leads to the samples chosen with BCR is stored
                  and used to generate the samples with B.
                  The full table of all these measures, human and automated, on PPLM-BoW, separated by topic,
                  is in Table S8. Included also are strong baselines (CTRL and WD) for each topic.
                  The human annotated topic relevance is further visualized in Figure S3. The fluency scores, while
                  reported separately for the {B, BC, BR, BCR} methods in the table, are very similar when shown as
                  distributions, as seen in Figure S5.
                  The full table of all these measures, human and automated, on PPLM-Discrim, separated by sentiment
                  and style, is in Table S9. Included also are strong baselines (CTRL, WD and GPT2-FT-RL) for each
                  sentiment. The human annotated sentiment and style (e.g. “Clickbait”) relevance is further visualized
                  in Figure S4, along with aggregated measures: all sentiments, all discriminators, all topics. The fluency
                  scores again have similar distributions across {B, BC, BR, BCR} methods, as seen in Figure S6.

                                  <<FIGURE>>

                 Figure S3: Topic relevance by human evaluation. We can see that taking a PPLM gradient step
                 (B → BC) makes a big difference. Reranking is mostly helpful (B → BR; BC → BCR). We can also
                 see a rough distribution of various topics in unperturbed, GPT-2 generation (B), which possibly
                 mirrors the distribution of topics in its training data. Some topics, like science, naturally appear
                 rather frequently.


                  S9 ODD COMBINATION OF TOPICS AND PREFIXES

                  It is interesting to see how PPLM can steer the text generation when the topic and prefix combination
                  appears odd or illogical. For example, will “The potato” still prompt sensible text generation under
                  the topic RELIGION? In this study we design a set of odd combinations, as below.

                 Table S8: Full result of human and automated evaluation of PPLM-BoW, attribute relevance and
                 language ﬂuency. This is a detailed version of Table 4, where results were averaged over all topics.
                 Results here correspond to the average over all samples in each topic, for each method in the ablation
                 study (B, BC, BR, BCR), and in baselines (CTRL, WD). Perplexity is computed based on an
                 external LM (Radford et al., 2018a), that is different from the LM generating text.
                  
                                 <<TABLE>>

                 Figure S4: Bar charts of discriminator relevance by human evaluation, together with different ver-
                 sions of combined results.

                                  <<FIGURE>>

                 Table S9: Full result of human and automated evaluation of PPLM-Discrim, attribute relevance and
                 language ﬂuency. The top two rows are a detailed version of Table 6, where results were averaged
                 over both sentiments (except for GPT2-FT-RL, where there is only positive sentiment). The last
                 row is the additional CLICKBAIT style control, where there is only ablation study and no baseline
                 comparison. Results here correspond to the average over all samples in each sentiment and style,
                 for each method in the ablation study (B, BC, BR, BCR), and in baselines (CTRL, GPT-2-FT-RL,
                 WD). Perplexity is computed based on an external LM (Radford et al., 2018a), that is different from
                 the LM generating text.
                  
                                    <<TABLE>>

                  We found that PPLM control is easy even under those scenarios. We had to increase the strength
                  two or three fold (to 0.02 or 0.03 as opposed to 0.01 in most studies) to allow for a stronger
                  influence of the attribute, but this is as expected: the strength parameter is a knob that the user can tune to
                  reach fine-grained control. The resulting generations are included in Table S10 - Table S16.

                   Table S10: Examples generated from a designed odd combination of topic and prefix pairs. The
                   topic here is [Military]. We show that PPLM is still able to generate fluent, sensible and interesting
                   samples, respecting both the topic and the prefix.

                               <<TABLE>>

                   Table S11: Examples generated from a designed odd combination of topic and prefix pairs. The
                   topic here is [Legal]. We show that PPLM is still able to generate fluent, sensible and interesting
                   samples, respecting both the topic and the prefix.

                               <<TABLE>>

                   Table S12: Examples generated from a designed odd combination of topic and prefix pairs. The
                   topic here is [Computers]. We show that PPLM is still able to generate fluent, sensible and
                   interesting samples, respecting both the topic and the prefix.

                               <<TABLE>>

                   Table S13: Examples generated from a designed odd combination of topic and prefix pairs. The
                   topic here is [Politics]. We show that PPLM is still able to generate fluent, sensible and interesting
                   samples, respecting both the topic and the prefix.

                               <<TABLE>>

                   Table S14: Examples generated from a designed odd combination of topic and prefix pairs. The
                   topic here is [Religion]. We show that PPLM is still able to generate fluent, sensible and interesting
                   samples, respecting both the topic and the prefix.

                               <<TABLE>>

                   Table S15: Examples generated from a designed odd combination of topic and prefix pairs. The
                   topic here is [Space]. We show that PPLM is still able to generate fluent, sensible and interesting
                   samples, respecting both the topic and the prefix.

                              <<TABLE>>

                 Table S16: Examples generated from a designed odd combination of topic and prefix pairs. The
                 sentiment here is [Positive] and [Negative]. We show that PPLM is still able to generate fluent,
                  sensible and interesting samples, respecting both the topic and the prefix.

                           <<TABLE>>

                  S10 FINE-GRAINED CONTROL WITH PPLM-BOW

                  Table S17 shows the subtle effect of turning the step size α up, while keeping everything else
                  (hyperparameters, text prefix) the same.

                  S11 HYPERPARAMETERS

                  We list, in Table S18, the full set of hyperparameters used in each task in the experiments section,
                  corresponding to results in Table 4 and Table 6, as well as in Section 4.4. In addition, we explain
                  three hyperparameters and their effects in detail below.

                  S11.1 EARLY STOPPING OF LATENT UPDATES

                  Degeneration (the occurrence of repetitive words) is a known issue with language generation (Holtzman
                  et al., 2019), and we found it to occur in PPLM-BoW when the update step size α is too
                  large. The model tends to degenerate towards repeating certain keywords targeted in the optimization
                  (e.g. words in the BoW). In this case, we can either reduce α, or use the trick of early stopping
                  latent updates.
                  Examples are shown in Table S19. With the exact same setting, but just stopping latent updates after 20
                  time steps, the samples show much less degeneration.

                  S11.2 FINITE HORIZON UPDATE

                  As opposed to updating the entire vector H_t, which consists of key-value pairs corresponding to
                  every token in the prefix, we consider modifying the key-value pairs corresponding to the most
                  recent w tokens. At each time-step t, we only modify <<FORMULA>>. This means that we modify
                  H_i at most w times, which requires less computation than updating the whole past. We find that
                  w = 5 produces more fluent passages for control with the bag of words. For control with the neural
                  attribute model, we update the entire latent history.
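
The finite horizon update can be illustrated with a minimal sketch. The exact index range is elided in the text above, so the window used here (the w most recent positions before time-step t, 0-based and end-exclusive) is an assumption, and the function names are hypothetical. The sketch only counts how often each latent position would be modified; it does not touch a real model.

```python
from typing import List


def finite_horizon_window(t: int, w: int) -> range:
    """Positions modified at time-step t: the w most recent ones (assumed window)."""
    return range(max(0, t - w), t)


def count_updates(num_steps: int, w: int) -> List[int]:
    """Count how many times each position is modified over num_steps time-steps."""
    counts = [0] * num_steps
    for t in range(1, num_steps + 1):
        for i in finite_horizon_window(t, w):
            counts[i] += 1
    return counts


if __name__ == "__main__":
    # With w = 5 (the value reported for bag-of-words control),
    # no position is modified more than 5 times.
    print(count_updates(num_steps=8, w=5))
```

Under this assumed window, every position is touched at most w times regardless of how long generation runs, which is where the computational saving comes from.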

                  S11.3 ADAPTIVE GRADIENT NORMALIZATION

                  For the bag-of-words based attribute model, what we wish to enforce is that a word from the bag
                  appears at least once in the generated passage and not at every time-step. To account for this, instead
                  of normalizing directly by the gradient norm as in Equation 3, we normalize by the maximum
                  gradient norm over time. This implies that we make smaller updates when it is less likely for
                  a word from the bag of words to appear. Formally, the normalization constant at time-step t is:

                    <<FORMULA>>

                 3 We choose top 3 samples from a single batch of 10 here

                  Table S17: Comparison of different sentences generated by (top row) unconditioned GPT-2 and

                   <<TABLE>>
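
The adaptive gradient normalization of Section S11.3 can be sketched as follows. This is a simplified illustration, not the authors' implementation: Equation 3 is not reproduced in this appendix, so the update is assumed to be a plain step-size-scaled gradient divided by the running maximum of the gradient norms seen so far. The function names and the default step size are hypothetical.

```python
import math
from typing import List, Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def adaptively_normalized_updates(
    grads: Sequence[Sequence[float]], step_size: float = 0.01
) -> List[List[float]]:
    """Scale each gradient by step_size / (max gradient norm up to that step)."""
    updates: List[List[float]] = []
    max_norm = 0.0
    for grad in grads:
        # Running maximum over time, instead of the current norm.
        max_norm = max(max_norm, l2_norm(grad))
        scale = step_size / max_norm if max_norm > 0.0 else 0.0
        updates.append([scale * g for g in grad])
    return updates


if __name__ == "__main__":
    # Second gradient is 5x smaller, so its update is 5x smaller too.
    for update in adaptively_normalized_updates([[3.0, 4.0], [0.6, 0.8]]):
        print(l2_norm(update))
```

Normalizing by the current norm would give every step the same magnitude; dividing by the running maximum keeps small gradients small, which matches the stated goal of making smaller updates when a bag word is unlikely to appear.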

                         
                   Table S18: The full set of hyperparameters used in each task in the experiments section. Note that
                   for PPLM-BoW, we select three of the highest scoring samples from a single batch of r = 10. For
                    PPLM-Discrim, we get 1 sample per batch, across 3 batches of r = 10.

                               <<TABLE>>

                     Table S19: The effect of using early stopping of latent updates to prevent sample degeneration.
                     Before (Latent updates at every generation step) After (Latent updates for only the ﬁrst 20
                                                         generation steps)

                                              <<TABLE>>

                   Table S20: Skeleton story generation with different attribute models. Each story is generated within
                   
                                      <<TABLE>>

                    S12 ADDITIONAL DETAILS AND SAMPLES OF CONTROLLED STORY TELLING

                   Table S20 and Table S21 show examples of controlled storytelling with a skeleton. Table S22
                   shows samples of combining multiple (three to four) attribute models, across different types
                   (PPLM-BoW and PPLM-Discrim).


                    S13 FURTHER DETAILS ON LANGUAGE DETOXIFICATION

                   We ask a group of four volunteers to assign binary toxicity labels to texts generated from both the
                   original GPT-2 and toxicity-controlled GPT-2 and to texts prompted with natural words as well as
                   adversarial triggers. Natural prompts include “Blacks”, “Asians”, “Jews” as suggested in Wallace
                   et al. (2019). Adversarial triggers are generated using the released codebase from the same work.
                   The percentages of toxicity, before (with uncontrolled GPT-2) and after (with toxicity-controlled
                   GPT-2), are reported in Table S23.


                    S14 MORE EXAMPLES

                   We include more PPLM controlled generation examples in Table S24 – Table S27.


                    S15 PREFIXES USED IN PPLM EVALUATION

                   We consider 20 preﬁxes as sentence starters for evaluating PPLM-BoW generation, chosen randomly
                   from www2.eit.ac.nz/library/ls_guides_sentencestarters.html. For PPLM-Discrim, we use 15 preﬁxes. 
                   The entire set is below.

                   PPLM-BoW  “In summary”, “This essay discusses”, “Views on”, “The
                   connection”, “Foundational to this is”, “To review,”, “In brief,”,
                   “An illustration of”, “Furthermore,”, “The central theme”, “To
                   conclude,”, “The key aspect”, “Prior to this”, “Emphasised are”,
                   “To summarise”, “The relationship”, “More importantly,”, “It has
                   been shown”, “The issue focused on”, “In this essay”.

                   PPLM-Discrim  “Once upon a time”, “The book”, “The chicken”, “The
                   city”, “The country”, “The horse”, “The lake”, “The last time”,


                   Table S21: More examples of skeleton story generation with different attribute models. Each story
                   
                                            <<TABLE>>                    
                   
                   S16 COMBINING MULTIPLE CONTROLLERS FOR INSPIRATION

                   Earlier we demonstrated attribute control using a single attribute model or two attribute models of
                   the same type (e.g. BoW from two separate topics). Here we mix different types of attribute models
                   (BoW and discriminator). For example, we can control the generation toward a mixed topic about
                    WINTER, POLITICS, KITCHEN, while turning POSITIVE. See examples in Table S22.

                                                    <<FIGURE>>

                 Figure S5: Histogram illustrating the distribution of fluency scores for controlled generation
                 with PPLM-BoW from the four methods considered for the ablation study. We find that fluency scores
                 from all four approaches are similarly distributed.

                                <<FIGURE>>

                 Figure S6: Histogram illustrating the distribution of fluency scores for controlled generation
                 with PPLM-Discrim from the four methods considered for the ablation study. We find that fluency
                 scores from all four approaches are similarly distributed.


                  S17 WORD LISTS FOR BAG OF WORDS APPROACHES

                 We curate word lists from www.enchantedlearning.com/wordlist.

                  <<TABLE>>

                 Table S22: Examples of attribute controlled text generation with multiple knobs. We train a clickbait
                 discriminator using the dataset from Potthast et al. (2018)

                  <<TABLE>>

                   Table S23: Language detoxification applied to natural prompts and adversarial triggers. Shown are
                   number of toxic passages / number of samples annotated, and percentage of toxicity. The column
                   p-value shows the statistical significance of "After" being lower than "Before".

                                                          <<TABLE>>

                   Table S24: Comparison of different samples generated with different prefixes using the same PPLM-BoW
                   control under the [Military] topic. All samples are generated using the exact same
                   hyperparameters.

                    <<TABLE>>

                   Table S25: Comparison of different samples generated with different prefixes using the same PPLM-BoW
                   control under the [Space] topic. All samples are generated using the exact same hyperparameters.

                    <<TABLE>>

                   Table S26: Comparison of different samples generated with different prefixes using the same PPLM-BoW
                   control under the [Science] topic. All samples are generated using the exact same
                   hyperparameters.

                    <<TABLE>>

                   Table S27: Comparison of different samples generated with different prefixes using the same PPLM-BoW
                   control under the [Politics] topic. All samples are generated using the exact same
                   hyperparameters.

                    <<TABLE>>
                    
<|endoftext|>


<|startoftext|>
Predicting Performance for Natural Language Processing Tasks 

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig 
Language Technologies Institute, Carnegie Mellon University 
{mengzhox,aanastas,yiming,gneubig}@cs.cmu.edu ruochenx@gmail.com 

Abstract 

Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.

1 Introduction 

Natural language processing (NLP) is an extraordinarily vast field, with a wide variety of models being applied to a multitude of tasks across a plenitude of domains and languages. In order to measure progress in all these scenarios, it is necessary to compare performance on test datasets representing each scenario. However, the cross-product of tasks, languages, and domains creates an explosion of potential application scenarios, and it is infeasible to collect high-quality test sets for each. In addition, even for tasks where we do have a wide variety of test data, e.g. for well-resourced tasks such as machine translation (MT), it is still computationally prohibitive as well as not environmentally friendly (Strubell et al., 2019) to build and test systems for all languages or domains we are interested in. Because of this, the common practice is to test new methods on a small number of languages or domains, often semi-arbitrarily chosen based on previous work or the experimenters' intuition.
As a result, this practice impedes the NLP community from gaining a comprehensive understanding of newly proposed models. Table 1 illustrates this fact with an example from bilingual lexicon induction, a task that aims to find word translation pairs from cross-lingual word embeddings. As displayed in Table 1, almost all the works report evaluation results on a different subset of language pairs. Evaluating only on a small subset raises concerns about making inferences when comparing the merits of these methods: there is no guarantee that performance on English/Spanish (EN-ES, the only common evaluation dataset) is representative of the expected performance of the models over all other language pairs (Anastasopoulos and Neubig, 2020). Such phenomena lead us to consider whether it is possible to estimate the performance on an untested language pair with reasonable accuracy without actually running the NLP model, thereby bypassing the computational restriction.
Toward that end, drawing on the idea of characterizing an experiment from Lin et al. (2019), we propose a framework, which we call NLPERF, to provide an exploratory solution. We build regression models to predict the performance on a particular experimental setting given past experimental records of the same task, with each record consisting of a characterization of its training dataset and a performance score of the corresponding metric. Concretely, in §2, we start with a partly populated table (such as the one from 

<<TABLE>>

Table 1: An illustration of the comparability issues across methods and multiple evaluation datasets from the Bilingual Lexicon Induction task. Our prediction model can reasonably fill in the blanks, as illustrated in Section 4. 

Table 1) and attempt to infer the missing values with the predictor. We begin by introducing the process of characterizing an NLP experiment for each task in §3. We evaluate the effectiveness and robustness of NLPERF by comparing to multiple baselines, human experts, and by perturbing a single feature to simulate a grid search over that feature (§4). Evaluations on multiple tasks show that NLPERF is able to outperform all baselines. Notably, on a machine translation (MT) task, the predictions made by the predictor turn out to be more accurate than human experts. 
An effective predictor can be very useful for multiple applications associated with practical scenarios. In §5, we show how it is possible to adopt the predictor as a scoring function to find a small subset of experiments that are most representative of a bigger set of experiments. We argue that this will allow researchers to make informed decisions on what datasets to use for training and evaluation, in the case where they cannot experiment on all experimental settings. Last, in §6, we show that we can adequately predict the performance of new models even with a minimal number of experimental records. 

2 Problem Formulation 

In this section we formalize the problem of predicting performance on supervised NLP tasks. Given an NLP model of architecture M trained over dataset(s) D of a specific task involving language(s) L with a training procedure (optimization algorithms, learning rate scheduling etc.) P, we can test the model on a test dataset D′ and get a score S of a specific evaluation metric. The resulting score will vary depending on all the above-mentioned factors, and we denote this relation as g: 

<<FORMULA>>. (1) 

In the ideal scenario, for each test dataset D′ of a specific task, one could enumerate all different settings and find the one that leads to the best performance. As mentioned in Section 1, however, such a brute-force method is computationally infeasible. Thus, we turn to modeling the process and formulating our problem as a regression task by using a parametric function f_θ to approximate the true function g as follows: 

<<FORMULA>> 

where <<FORMULA>> denotes a set of features for each influencing factor. 
For the purpose of this study, we mainly focus on dataset and language features Φ_L and Φ_D, as this already results in a significant search space, and gathering extensive experimental results with fine-grained tuning over model and training hyper-parameters is both expensive and relatively complicated. In the cases where we handle multiple models, we only use a single categorical model feature to denote the combination of model architecture and training procedure, denoted as Φ_C. We still use the term model to refer to this combination in the rest of the paper. We also omit the test set features, under the assumption that the data distributions for training and testing data are the same (a fairly reasonable assumption if we ignore possible domain shift). Therefore, for all experiments below, our final prediction function is the following: 

<<FORMULA>>

In the next section we describe concrete instantiations of this function for several NLP tasks. 

3 NLP Task Instantiations 

To build a predictor for NLP task performance, we must 1) select a task, 2) describe its featurization, and 3) train a predictor. We describe details of these three steps in this section. 

<<TABLE>> 

Table 2: Statistics of the datasets we use for training predictors. # EXs denote the total number of experiment instances; Task Metric reflects how the models are evaluated. 

Tasks. We test on tasks including bilingual lexicon induction (BLI); machine translation trained on aligned Wikipedia data (Wiki-MT), on TED talks (TED-MT), and with cross-lingual transfer for translation into English (TSF-MT); cross-lingual dependency parsing (TSF-Parsing); cross-lingual POS tagging (TSF-POS); cross-lingual entity linking (TSF-EL); morphological analysis (MA) and universal dependency parsing (UD). Basic statistics on the datasets are outlined in Table 2. 
For Wiki-MT tasks, we collect experimental records directly from the paper describing the corresponding datasets (Schwenk et al., 2019). For TED-MT and all the transfer tasks, we use the results of Lin et al. (2019). For BLI, we conduct experiments using published results from three papers, namely Artetxe et al. (2016), Artetxe et al. (2017) and Xu et al. (2018). For MA, we use the results of the SIGMORPHON 2019 shared task 2 (McCarthy et al., 2019). Last, the UD results are taken from the CoNLL 2018 Shared Task on universal dependency parsing (Zeman et al., 2018b). 
Featurization. For language features, we utilize six distance features from the URIEL Typological Database (Littell et al., 2017), namely geographic, genetic, inventory, syntactic, phonological, and featural distance. 
The complete set of dataset features includes the following: 
1. Dataset Size: The number of data entries used for training. 

2. Word/Subword Vocabulary Size: The number of word/subword types. 

3. Average Sentence Length: The average length of sentences from all experimental. 

4. Word/Subword Overlap: <<FORMULA>> where T1 and T2 denote vocabularies of any two corpora. 

5. Type-Token Ratio (TTR): The ratio between the number of types and number of tokens (Richards, 1987) of one corpus. 

6. Type-Token Ratio Distance: <<FORMULA>> where TTR1 and TTR2 denote TTR of any two corpora. 

7. Single Tag Type: Number of single tag types. 

8. Fused Tag Type: Number of fused tag types. 

9. Average Tag Length Per Word: Average number of single tags for each word. 

10. Dependency Arcs Matching WALS features: the proportion of dependency parsing arcs matching the following WALS features, computed over the training set: subject/object/oblique before/after verb and adjective/numeral before/after noun. 
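
As a rough illustration of the simpler dataset features in the list above, the following sketch computes dataset size, vocabulary size, average sentence length, and type-token ratio from a tokenized corpus. The function name and the dictionary keys are hypothetical, and whitespace-style tokens are assumed; the overlap and TTR distance features are left out because their formulas are not reproduced in the text.

```python
from typing import Dict, List


def basic_dataset_features(sentences: List[List[str]]) -> Dict[str, float]:
    """Compute features 1, 2, 3 and 5 for a corpus given as lists of tokens."""
    tokens = [tok for sent in sentences for tok in sent]
    types = set(tokens)
    n_sentences = len(sentences)
    return {
        # 1. Dataset size: number of training entries (here, sentences).
        "dataset_size": float(n_sentences),
        # 2. Vocabulary size: number of distinct word types.
        "vocab_size": float(len(types)),
        # 3. Average sentence length in tokens.
        "avg_sentence_length": len(tokens) / n_sentences if n_sentences else 0.0,
        # 5. Type-token ratio: number of types over number of tokens.
        "ttr": len(types) / len(tokens) if tokens else 0.0,
    }


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog"]]
    print(basic_dataset_features(corpus))
```

For the subword variants of feature 2, the same function could be applied after subword segmentation of each sentence.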


For transfer tasks, we use the same set of dataset features Φ_D as Lin et al. (2019), including features 1–6 on the source and the transfer language side. We also include language distance features between source and transfer language, as well as between source and target language. For MT tasks, we use features 1–6 and language distance features, but only between the source and target language. For MA, we use features 1, 2, 5 and morphological tag related features 7–9. For UD, we use features 1, 2, 5, and 10. For BLI, we use language distance features and URIEL syntactic features for the source and the target language. 
Predictor. Our prediction model is based on gradient boosting trees (Friedman, 2001), implemented with XGBoost (Chen and Guestrin, 2016). This method is widely known as an effective means for solving problems including ranking, classification and regression. We also experimented with Gaussian processes (Williams and Rasmussen, 1996), but settled on gradient boosted trees because performance was similar and XGBoost's implementation is very efficient through the use of parallelism. We use squared error as the objective function for the regression and adopt a fixed learning rate of 0.1. To allow the model to fully fit the data we set the maximum tree depth to 10 and the number of trees to 100, and use the default regularization terms to prevent the model from overfitting. 
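
The hyperparameters stated above map onto XGBoost's scikit-learn style regressor roughly as follows. This is a sketch of one plausible configuration, not the authors' code: the function names are hypothetical, and anything not listed in the text (for example the regularization terms) is left at the library defaults. The `xgboost` import is done lazily so that the parameter dictionary can be inspected without the package installed.

```python
from typing import Any, Dict


def nlperf_style_params() -> Dict[str, Any]:
    """Hyperparameters stated in the text, as XGBRegressor keyword arguments."""
    return {
        "objective": "reg:squarederror",  # squared error regression objective
        "learning_rate": 0.1,             # fixed learning rate
        "max_depth": 10,                  # maximum tree depth
        "n_estimators": 100,              # number of trees
    }


def build_predictor():
    """Create an untrained regressor; requires the third-party xgboost package."""
    import xgboost as xgb  # imported lazily on purpose

    return xgb.XGBRegressor(**nlperf_style_params())
```

A predictor built this way would be trained with `fit(X, y)`, where each row of `X` is the feature vector of one experimental record and `y` holds the corresponding evaluation scores.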

4 Can We Predict NLP Performance? 

In this section we investigate the effectiveness of NLPERF across different tasks on various metrics. Following Lin et al. (2019), we conduct k-fold cross validation for evaluation. To be specific, we randomly partition the experimental records of ⟨L, D, C, S⟩ tuples into k folds, and use k−1 folds to train a prediction model and evaluate on the remaining fold. Note that this scenario is similar to filling in the blanks in Table 1, where we have some experimental records that we can train the model on, and predict the remaining ones. 
For evaluation, we calculate the average root mean square error (RMSE) between the predicted scores and the true scores. 
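
The evaluation protocol above can be sketched in a few lines of standard-library Python. The names below are hypothetical, and `fit` stands in for any training routine that takes the training records and returns a prediction function; the sketch averages the per-fold RMSE, which is one reading of "average RMSE".

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Record = Tuple[Sequence[float], float]           # (feature vector, true score)
Predictor = Callable[[Sequence[float]], float]   # maps features to a predicted score


def rmse(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error between two equally long score lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    if k < 1 or k > n:
        raise ValueError("need 1 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(
    records: Sequence[Record],
    k: int,
    fit: Callable[[List[Record]], Predictor],
    seed: int = 0,
) -> float:
    """Train on k-1 folds, evaluate on the held-out fold, average RMSE over folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(records), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(records) if i not in held]
        test = [records[i] for i in held_out]
        predict = fit(train)
        predictions = [predict(features) for features, _ in test]
        fold_scores.append(rmse(predictions, [score for _, score in test]))
    return sum(fold_scores) / len(fold_scores)
```

Passing a `fit` that ignores the features and returns the training mean gives a simple mean-style baseline under the same protocol.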
Baselines. We compare against a simple mean value baseline, as well as against language-wise mean value and model-wise mean value baselines. The simple mean value baseline outputs an average of scores s from the training folds for all test entries in the left-out fold (i) as follows: 

<<FORMULA>>


<<FORMULA>> (2)

Note that for tasks involving multiple models, we calculate the RMSE score separately on each model and use the mean RMSE of all models as the final RMSE score. 
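
A minimal sketch of the simple mean value baseline and of the per-model RMSE averaging described above. The function names and the dictionary layout (model name mapped to predicted and true scores) are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_value_baseline(train_scores: Sequence[float], n_test: int) -> List[float]:
    """Predict the mean of the training-fold scores for every held-out entry."""
    mean = sum(train_scores) / len(train_scores)
    return [mean] * n_test


def rmse(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error between two equally long score lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


def mean_rmse_over_models(
    per_model: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Compute RMSE separately for each model, then average across models."""
    scores = [rmse(pred, true) for pred, true in per_model.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(mean_value_baseline([10.0, 20.0, 30.0], n_test=2))
```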
The language-wise baselines make more informed predictions, taking into account only training instances with the same transfer, source, or target language (depending on the task setting). For example, the source-language mean value baseline ŝ_{s-lang}^{(i,j)} for the jth test instance in fold i outputs an average of the scores s of the training instances that share the same source language features s-lang, as shown in Equation 3: 

<<FORMULA>> (3) 
 
where δ is the indicator function. Similarly, we define the target- and the transfer-language mean value baselines. 
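
The source-language mean value baseline can be sketched as below. This is an illustration under stated assumptions: a language code string stands in for the source language features compared by the indicator function, the function name is hypothetical, and returning `None` when no training instance matches is a choice made here, since the text does not specify that case.

```python
from typing import Optional, Sequence, Tuple


def source_language_mean_baseline(
    train: Sequence[Tuple[str, float]], test_source_lang: str
) -> Optional[float]:
    """Average the scores of training instances sharing the test source language.

    `train` holds (source language, score) pairs from the training folds.
    """
    matching = [score for lang, score in train if lang == test_source_lang]
    if not matching:
        return None  # no training instance with this source language
    return sum(matching) / len(matching)


if __name__ == "__main__":
    train = [("tr", 10.0), ("tr", 14.0), ("pt", 30.0)]
    print(source_language_mean_baseline(train, "tr"))
```

The target- and transfer-language variants follow by keying the pairs on the corresponding language instead.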
In a similar manner, we also compare against a model-wise mean value baseline for tasks that include experimental records from multiple models. Now, the prediction for the jth test instance in the left-out fold i is an average of the scores on the same dataset (as characterized by the language Φ_L and dataset Φ_D features) from all other models: 
<<FORMULA>> (4) 

where <<FORMULA>> and <<FORMULA>> respectively denote the language and dataset features of the test instance. 
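
The model-wise mean value baseline admits a similar sketch. Here a single string key stands in for the pair of language and dataset features that identifies a dataset, and both the function name and the `None` fallback are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def model_wise_mean_baseline(
    train: Sequence[Tuple[str, str, float]], test_dataset: str, test_model: str
) -> Optional[float]:
    """Average the scores that all other models obtained on the same dataset.

    `train` holds (dataset key, model name, score) triples from the training folds.
    """
    scores = [
        score
        for dataset, model, score in train
        if dataset == test_dataset and model != test_model
    ]
    if not scores:
        return None  # no other model has a record for this dataset
    return sum(scores) / len(scores)


if __name__ == "__main__":
    train = [("en-es", "A", 40.0), ("en-es", "B", 44.0), ("en-de", "A", 30.0)]
    print(model_wise_mean_baseline(train, "en-es", "C"))
```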
Main Results. For multi-model tasks, we can do either Single Model prediction (SM), restricting training and testing of the predictor within a single model, or Multi-Model (MM) prediction using a categorical model feature. The RMSE scores of NLPERF along with the baselines are shown in Table 3. For all tasks, our single model predictor is able to more accurately estimate the evaluation score of unseen experiments compared to the single model baselines, confirming our hypothesis that there exists a correlation that can be captured between experimental settings and the downstream performance of NLP systems. The language-wise baselines are much stronger than the simple mean value baseline but still perform worse than our single model predictor. Similarly, the model-wise baseline significantly outperforms the mean value baseline because results from other models reveal much information about the dataset. 

<<TABLE>> 

Table 3: RMSE scores of three baselines and our predictions under the single model and multi model setting (missing values correspond to settings not applicable to the task). All results are from k-fold (k = 5) evaluations averaged over 10 random runs. 

Even so, our multi-model predictor still outperforms the model-wise baseline. 
These results imply that for a wide range of tasks, our predictor is able to reasonably estimate left-out slots in a partly populated table given results of other experiment records, without actually running the system. 
We should note that RMSE scores across different tasks should not be directly compared, mainly because the scale of each evaluation metric is different. For example, a BLEU score (Papineni et al., 2002) for MT experiments typically ranges from 1 to 40, while an accuracy score usually has a much larger range: BLI accuracy ranges from 0.333 to 78.2 and TSF-POS accuracy ranges from 1.84 to 87.98, which consequently makes the RMSE scores of these tasks higher. 
Comparison to Expert Human Performance 
We constructed a small-scale case study to evaluate whether NLPERF is competitive with the performance of NLP sub-field experts. We focused on the TED-MT task and recruited 10 MT practitioners, all of whom had published at least 3 MT-related papers in ACL-related conferences. 
In the first set of questions, the participants were presented with language pairs from one of the k data folds along with the dataset features and were asked to estimate an eventual BLEU score for each data entry. In the second part of the questionnaire, the participants were tasked with making estimations on the same set of language pairs, but this time they also had access to features and BLEU scores from all the other folds.3 

<<TABLE>>

Table 4: Our model performs better than human MT experts on the TED-MT prediction task. 

The partition of the folds is consistent between the human study and the training/evaluation for the predictor. While the first sheet is intended to familiarize the participants with the task, the second sheet fairly adopts the training/evaluation setting for our predictor. As shown in Table 4, our participants outperform the mean baseline even without information from other folds, demonstrating their own strong prior knowledge in the field. In addition, the participants make more accurate guesses after acquiring more information on experimental records in other folds. In neither case, though, are the human experts competitive with our predictor. In fact, only one of the participants achieved performance comparable to our predictor. 
Feature Perturbation. Another question of interest concerning performance prediction is how the model will perform when trained on data of a different size (Kolachina et al., 2012a). To test NLPERF's extrapolation ability in this regard, we conduct an array of experiments with various data sizes on the Wiki-MT task. We pick two language pairs, Turkish to English (TR-EN) and Portuguese to English (PT-EN), as our testbed for the Wiki-MT task. We sample parallel datasets with different sizes and train MT models with each sampled dataset to obtain the true BLEU scores. On the other hand, we collect the features of all sampled datasets and use our predictor (trained over all other language pairs) to obtain predictions. The plot of true BLEU scores and predicted BLEU scores is shown in Figure 1. Our predictor achieves a very low average RMSE of 1.83 for the TR-EN pair but a relatively higher RMSE of 9.97 for the PT-EN pair. The favorable performance on the TR-EN pair demonstrates that our predictor can extrapolate over dataset size. In contrast, the predictions on the PT-EN pair are significantly less accurate. This is because there are only two other experimental settings scoring as high as 34 BLEU, with data sizes of 3378k (en-es) and 611k (gl-es), leading to the predictor's inadequacy in predicting high BLEU scores for low-resourced datasets during extrapolation. This reveals that while the predictor is able to extrapolate performance on settings similar to what it has seen in the data, NLPERF may be less successful under circumstances unlike its training inputs. 

2 None of the study participants were affiliated with the authors' institutions, nor were they familiar with this paper's content.

3 The interested reader can find an example questionnaire (and make estimations over one of the folds) in the

<<FIGURE>>

Figure 1: Our model's predicted BLEU scores and true BLEU scores on sampled TREN datasets (sizes 10k/50k/100k/200k/478k) and PTEN datasets (sizes 100k/500k/1000k/2000k/2462k), achieving RMSE scores of 1.83 and 9.97, respectively.
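
The RMSE figures quoted in the feature perturbation experiment compare true and predicted BLEU scores over the sampled dataset sizes. As a minimal sketch of that computation, the helper below uses a hypothetical name `rmse` and made-up BLEU values that are for illustration only and are not the paper's data.

```python
import math


def rmse(true_scores, predicted_scores):
    """Root mean squared error between two equal-length score lists."""
    if len(true_scores) != len(predicted_scores):
        raise ValueError("score lists must have the same length")
    if not true_scores:
        raise ValueError("score lists must be non-empty")
    squared = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical BLEU values for three sampled dataset sizes (not the paper's data).
true_bleu = [10.0, 14.0, 18.0]
predicted_bleu = [11.0, 14.0, 16.0]
print(rmse(true_bleu, predicted_bleu))
```

A lower value means the predicted curve tracks the true curve more closely across the sampled sizes.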

5 What Datasets Should We Test On? 

As shown in Table 1, it is common practice to test models on a subset of all available datasets. The reason for this is practical: it is computationally prohibitive to evaluate on all settings. However, if we pick test sets that are not representative of the data as a whole, we may mistakenly reach unfounded conclusions about how well models perform on other data with distinct properties. For example, models trained on a small-sized dataset may not scale well to a large-sized one, or models that perform well on languages with a particular linguistic characteristic may not do well on languages with other characteristics (Bender and Friedman, 2018). 
Here we ask the following question: if we are only practically able to test on a small number of experimental settings, which ones should we test on to achieve maximally representative results? Answering the question could have practical implications: organizers of large shared tasks like SIGMORPHON (McCarthy et al., 2019) or UD (Zeman et al., 2018a) could create a minimal subset of settings upon which they would ask participants to test to get representative results; similarly, participants could possibly expedite the iteration of model development by testing on the representative subset only. A similar avenue for researchers and companies deploying systems over multiple languages could lead to not only financial savings, but potentially a significant cut-down of emissions from model training (Strubell et al., 2019). 
We present an approximate explorative solution to the problem mentioned above. Formally, assume that we have a set N, comprising experimental records (both features and scores) of n datasets for one task. We set a number m (m < n) of datasets that we would like to select as the representative subset. Defining RMSE_A(B) as the RMSE score obtained by evaluating on a subset B the predictor trained on another subset of experimental records A, we consider the most representative subset D to be the one that minimizes the RMSE score when predicting all of the other datasets: 

<<FORMULA>>. (5) 

<<FIGURE>>

Figure 2: Beam search results (beam size=100) for up to the 5 most (and least) representative datasets for 4 NLP tasks. We also show random search results averaged over 100 random runs. 

Naturally, enumerating all possible subsets would be prohibitively costly, even though it would lead to the optimal solution. Instead, we employ a beam-search-like approach to efficiently search for an approximate solution to the best-performing subset of arbitrary size. Concretely, we start our approximate search with an exhaustive enumeration of all subsets of size 2. At each following step t, we only take the best k subsets <<FORMULA>> into account and discard the rest. As shown in Equation 6, for each candidate subset, we expand it with one more data point, 

<<FORMULA>>. (6)

For tasks that involve multiple models, we take experimental records of the selected dataset from all models into account during expansion. Given all expanded subsets, we train a predictor for each to evaluate on the rest of the datasets, and keep the best-performing k subsets <<FORMULA>> with minimum RMSE scores for the next step. Furthermore, note that by simply changing the arg min to an arg max in Equation 5, we can also find the least representative datasets. 
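
The search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a 1-nearest-neighbour lookup (`predict_1nn`) stands in for the trained predictor, each record is a hypothetical `(features, score)` pair, and the toy records at the bottom are invented for the example. The case of multiple models per dataset is not handled.

```python
import math
from itertools import combinations


def rmse(true_scores, predicted_scores):
    """Root mean squared error between two equal-length score lists."""
    errors = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return math.sqrt(sum(errors) / len(errors))


def predict_1nn(train, features):
    """Stand-in predictor (assumption): copy the score of the nearest training record."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], features))
    return nearest[1]


def subset_rmse(records, subset):
    """Train on the records indexed by `subset`, evaluate on all the others."""
    train = [records[i] for i in subset]
    rest = [records[i] for i in range(len(records)) if i not in subset]
    predictions = [predict_1nn(train, feats) for feats, _ in rest]
    return rmse([score for _, score in rest], predictions)


def beam_search(records, m, beam_size, least=False):
    """Approximate the most (or least) representative subset of size m."""
    n = len(records)
    if not 2 <= m < n:
        raise ValueError("m must satisfy 2 <= m < number of records")
    sign = -1.0 if least else 1.0  # flip the ordering to emulate arg max

    def rank(candidates):
        scored = [(sign * subset_rmse(records, s), sorted(s)) for s in candidates]
        scored.sort()  # ties are broken by the sorted index list
        return [frozenset(s) for _, s in scored[:beam_size]]

    # Start from an exhaustive enumeration of all subsets of size 2.
    beam = rank({frozenset(pair) for pair in combinations(range(n), 2)})
    # At each following step, expand every kept subset by one more record.
    for _ in range(m - 2):
        beam = rank({s | {i} for s in beam for i in range(n) if i not in s})
    best = beam[0]
    return sorted(best), subset_rmse(records, best)


# Toy records (hypothetical): one feature per dataset and one score.
records = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 15.0),
           ((10.0,), 30.0), ((11.0,), 31.0), ((12.0,), 33.0)]
most, most_rmse = beam_search(records, m=2, beam_size=3)
least, least_rmse = beam_search(records, m=2, beam_size=3, least=True)
print(most, most_rmse)
print(least, least_rmse)
```

Because the first step enumerates every pair, the result for m = 2 is exact; for larger m the beam only approximates the optimum, which is the trade-off the text describes.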
We present search results for four tasks4 as beam search progresses in Figure 2, with corresponding RMSE scores from all remaining datasets as the y-axis. For comparison, we also conduct random searches by expanding the subset with a randomly selected experimental record. In all cases, the most representative sets are an aggregation of datasets with diverse characteristics such as languages and dataset sizes. For example, in the Wiki-MT task, the 5 most representative datasets include languages that fall into a diverse range of language families such as Romance, Turkic, Slavic, etc., while the least representative ones include duplicate pairs (opposite directions) mostly involving English. The phenomenon is more pronounced in the TED-MT task, where not only are the 5 most representative source languages diverse, but so are the dataset sizes. Specifically, Malay-English (msa-eng) is a tiny dataset (5k parallel sentences), and Hebrew-English (heb-eng) is a high-resource case (212k parallel sentences). 
Notably, for the BLI task, to test how representative the commonly used datasets are, we select the 5 most frequent language pairs shown in Table 1, namely en-de, es-en, en-es, fr-en, and en-fr, for evaluation. Unsurprisingly, we get an RMSE score as high as 43.44, quite close to the performance of the least representative set found using beam search. This finding indicates that the standard practice of choosing datasets for evaluation is likely unrepresentative of results over the full dataset spectrum, well aligned with the claims in Anastasopoulos and Neubig (2020). 
A particularly encouraging observation is that the predictor trained with only the 5 most representative datasets can achieve an RMSE score comparable to k-fold validation, which required using all of the datasets for training.5 This indicates that one would only need to train NLP models on a small set of representative datasets to obtain reasonably plausible predictions for the rest. 

5 To be accurate, k − 1 folds of all datasets. 

6 Can We Extrapolate Performance for New Models? 

In another common scenario, researchers propose new models for an existing task. It is both time-consuming and computationally intensive to run experiments with all settings for a new model. In this section, we explore if we can use past experimental records from other models and a minimal set of experiments from the new model to give a plausible prediction over the rest of the datasets, potentially reducing the time and resources needed for experimenting with the new model to a large extent. We use the task of UD parsing as our testbed6 as it is the task with the most unique models (25 to be exact). Note that we still only use a single categorical feature for the model type. 
To investigate how many experiments are needed to have a plausible prediction for a new model, we first split the experimental records equally into a sample set and a test set. Then we randomly sample <<FORMULA>> experimental records from the sample set and add them into the collection of experiment records of past models. Each time we re-train a predictor and evaluate on the test set. The random split repeats 50 times and the random sampling repeats 50 times, adding up to a total of 2500 experiments. We use the mean value of the results from other models, shown in Equation 7, as the prediction baseline for the left-out model, and because experiment results of other models reveal significant information about the dataset, this serves as a relatively strong baseline: 

<<FORMULA>>. (7) 

M denotes a collection of models and k denotes the left-out model. 
We show the prediction performance (in RMSE) over 8 systems7 in Figure 3. Interestingly, the predictor trained with no model records (0) outperforms the mean value baseline for the 4 best systems, while it is the opposite case on the 4 worst systems. Since there is no information provided about the new-coming model, the predictions are solely based on dataset and language features. One reason might explain the phenomenon: the correlation between the features and the scores of the worse-performing systems is different from that of the better-performing systems, so the predictor is unable to generalize well (ONLP). 

6 MA and BLI task results are in Appendix C. 
7 The best and worst 4 systems from the shared task. 

In the following discussion, we use RMSE@n to denote the RMSE from the predictor trained with n data points of a new model. The relatively low RMSE@0 scores indicate that other models' features and scores are informative for predicting the performance of the new model even without new model information. Comparing RMSE@0 and RMSE@1, we observe a consistent improvement for almost all systems, indicating that NLPERF trained on even a single extra random example achieves more accurate estimates over the test sets. Adding more data points consistently leads to additional gains. However, predictions on worse-performing systems benefit more from it than those on better-performing systems, indicating that their feature-performance correlation might be considerably different. The findings here indicate that by extrapolating from past experiments, one can make plausible judgments for newly developed models. 
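
The protocol in this section (the mean-value baseline over other models, and RMSE@n obtained by adding n sampled records of the new model to the past records) can be sketched as below. This is a rough illustration under stated assumptions rather than the paper's code: records are hypothetical `(model, dataset, features, score)` tuples, a 1-nearest-neighbour lookup with a mismatch penalty on the categorical model feature stands in for the trained predictor, and the toy data is invented.

```python
import math
import random


def rmse(true_scores, predicted_scores):
    """Root mean squared error between two equal-length score lists."""
    errors = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return math.sqrt(sum(errors) / len(errors))


def mean_baseline(records, dataset, left_out_model):
    """Mean score that all models other than `left_out_model` reach on `dataset`."""
    others = [score for model, name, _, score in records
              if name == dataset and model != left_out_model]
    return sum(others) / len(others)


def predict_1nn(train, model, features):
    """Stand-in predictor (assumption): nearest record, penalising a different model."""
    def distance(rec):
        mismatch = 0.0 if rec[0] == model else 1.0
        return math.dist(rec[2], features) + mismatch
    return min(train, key=distance)[3]


def rmse_at_n(past_records, new_records, n, rng):
    """One repetition: split the new model's records, sample n, train, evaluate."""
    shuffled = list(new_records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    sample_set, test_set = shuffled[:half], shuffled[half:]
    if n > len(sample_set):
        raise ValueError("n exceeds the size of the sample set")
    train = list(past_records) + rng.sample(sample_set, n)
    predictions = [predict_1nn(train, m, f) for m, _, f, _ in test_set]
    return rmse([s for _, _, _, s in test_set], predictions)


def average_rmse_at_n(past_records, new_records, n, repeats, seed=0):
    """Average RMSE@n over several random splits and samples."""
    rng = random.Random(seed)
    runs = [rmse_at_n(past_records, new_records, n, rng) for _ in range(repeats)]
    return sum(runs) / len(runs)


# Toy records (hypothetical): two past models "A" and "B", one new model "C".
past = [(m, f"d{i}", (float(i),), base + i)
        for m, base in (("A", 4.0), ("B", 6.0)) for i in range(1, 7)]
new = [("C", f"d{i}", (float(i),), 5.5 + i) for i in range(1, 7)]
print(mean_baseline(past, "d1", "C"))
for n in (0, 1, 2, 3):
    print(n, average_rmse_at_n(past, new, n, repeats=20))
```

With n = 0 the predictions come only from other models' records, which mirrors the RMSE@0 setting discussed above.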

7 Related Work 

As discussed in Domhan et al. (2015), there are two main threads of work focusing on predicting performance of machine learning algorithms. The first thread is to predict the performance of a method as a function of its training time, while the second thread is to predict a method's performance as a function of the training dataset size. Our work belongs in the second thread, but could easily be extended to encompass training time/procedure. 
In the first thread, Kolachina et al. (2012b) attempt to infer learning curves based on training data features and extrapolate the initial learning curves based on BLEU measurements for statistical machine translation (SMT). By extrapolating the performance of initial learning curves, the predictions on the remainder allow for early termination of a bad run (Domhan et al., 2015). 
In the second thread, Birch et al. (2008) adopt linear regression to capture the relationship between data features and SMT performance and find that the amount of reordering, the morphological complexity of the target language and the relatedness of the two languages explain the majority of performance variability. More recently, Elsahar and Gallé (2019) use domain shift metrics such as H-divergence based metrics to predict the drop in performance under domain shift. Rosenfeld et al. (2020) explore the functional form of the dependency of the generalization error of neural models on model and data size. We view our work as a generalization of such approaches, appropriate for application on any NLP task. 

<<FIGURE>>

Figure 3: RMSE scores of UD task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and <<FORMULA>> records from a new model. 

8 Conclusion and Future Work 
In this work, we investigate whether the experiment setting itself is informative for predicting the evaluation scores of NLP tasks. Our findings promisingly show that given a sufficient number of past training experimental records, our predictor can 1) outperform human experts; 2) make plausible predictions even over new-coming models and languages; 3) extrapolate well on features like dataset size; 4) provide a guide on how we should choose representative datasets for fast iteration. 
While this discovery is a promising start, there are still several avenues for improvement in future work. 
First, the dataset and language settings covered in our study are still limited. Experimental records we use are from relatively homogeneous settings, e.g. all datasets in the Wiki-MT task are sentence-pieced to have 5000 subwords, indicating that our predictor may fail for other subword settings. Our model also failed to generalize to cases where feature values are out of the range of the training experimental records. We attempted to apply the predictor of Wiki-MT to evaluate on a low-resource MT dataset, translating from Mapudungun (arn) to Spanish (spa) with the dataset from Duan et al. (2019), but ended up with a poor RMSE score. It turned out that the average sentence length of the arn/spa dataset is much lower than that of the training datasets and our predictors fail to generalize to this different setting. 
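
One way to anticipate the out-of-range failure described above is to compare a new dataset's feature values against the ranges seen in the training records before trusting a prediction. The sketch below is only an illustration of that check: the feature names `avg_sentence_length` and `dataset_size` and all values are hypothetical, and the paper does not describe such a step.

```python
def feature_ranges(training_features):
    """Per-feature (min, max) over the training experimental records."""
    names = training_features[0].keys()
    return {
        name: (min(rec[name] for rec in training_features),
               max(rec[name] for rec in training_features))
        for name in names
    }


def out_of_range(ranges, new_features):
    """Names of features whose value falls outside the training range."""
    return sorted(
        name for name, value in new_features.items()
        if name in ranges and not ranges[name][0] <= value <= ranges[name][1]
    )


# Hypothetical feature values, for illustration only.
train = [
    {"avg_sentence_length": 20.0, "dataset_size": 100000.0},
    {"avg_sentence_length": 30.0, "dataset_size": 500000.0},
]
candidate = {"avg_sentence_length": 8.0, "dataset_size": 200000.0}
print(out_of_range(feature_ranges(train), candidate))
```

A non-empty result flags a setting where the predictor would be extrapolating beyond its training records.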
Second, using a categorical feature to denote model types constrains its expressive power for modeling performance. In reality, a slight change in model hyperparameters (Hoos and Leyton-Brown, 2014; Probst et al., 2019), optimization algorithms (Kingma and Ba, 2014), or even random seeds (Madhyastha and Jain, 2019) may give rise to a significant variation in performance, which our predictor is not able to capture. While investigating the systematic implications of model structures or hyperparameters is practically infeasible in this study, we may use additional information such as textual model descriptions for modeling NLP models and training procedures more elaborately in the future. 
Lastly, we assume that the distribution of training and testing data is the same, which does not consider domain shift. On top of this, there might also be a domain shift between datasets of training and testing experimental records. We believe that modeling domain shift is a promising future direction to improve performance prediction. 

Acknowledgement 

The authors sincerely thank all the reviewers for their insightful comments and suggestions, Philipp Koehn, Kevin Duh, Matt Post, Shuoyang Ding, Xuan Zhang, Adi Renduchintala, Paul McNamee, Toan Nguyen and Kenton Murray for conducting human evaluation for the TED-MT task, Daniel Beck for discussions on Gaussian Processes, and Shruti Rijhwani, Xinyi Wang, and Paul Michel for discussions on this paper. This work is generously supported by the National Science Foundation under grant 1761548. 

References 
Antonios Anastasopoulos and Graham Neubig. 2020. Should all cross-lingual embeddings speak english? In Proc. ACL. To appear. 
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294. 
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics. 
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019. Bilingual lexicon induction through unsupervised machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5002–5007, Florence, Italy. Association for Computational Linguistics. 
Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604. 
Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 745–754. Association for Computational Linguistics. 
Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM. 
Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270, Brussels, Belgium. Association for Computational Linguistics. 
Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence. 
Mingjun Duan, Carlos Fasola, Sai Krishna Rallabandi, Rodolfo M. Vega, Antonios Anastasopoulos, Lori Levin, and Alan W Black. 2019. A resource for computational experiments on Mapudungun. In Proc. LREC. To appear. 
Hady Elsahar and Matthias Gallé. 2019. To annotate or not? Predicting performance drop under domain shift. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2163–2173. 
Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232. 
Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902. 
Holger Hoos and Kevin Leyton-Brown. 2014. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762. 
Jiaji Huang, Qiang Qiu, and Kenneth Church. 2019. Hubless nearest neighbor search for bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4072–4080, Florence, Italy. Association for Computational Linguistics. 
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 
Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012a. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22–30, Jeju Island, Korea. Association for Computational Linguistics. 
Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012b. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 22–30. Association for Computational Linguistics. 
Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations. 
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics. 
Patrick Littell, David R Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14. 
Pranava Madhyastha and Rishabh Jain. 2019. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 929–939, Hong Kong, China. Association for Computational Linguistics. 
Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics. 
Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, John Bauer, Sandra Bellato, Kepa Bengoetxea, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Rogier Blokland, Victoria Bobicev, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Gülşen Cebiroğlu Eryiğit, Giuseppe G. A. Celano, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Arantza Diaz de Ilarraza, Carly Dickerson, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Tomaž Erjavec, Aline Etienne, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta Gonzáles Saavedra, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Radu Ion, Elena Irimia, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Hüner Kaşıkara, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Tolga Kayadelen, Václava Kettnerová, Jesse Kirchner, Natalia Kotsyba, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Gustavo Mendonça, Niko Miekka, Anna Missilä, Cătălin Mititelu, Yusuke Miyao, Simonetta Montemagni, Amir More, Laura Moreno Romero, Shinsuke Mori, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Adédayọ̀ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Slav Petrov, Jussi Piitulainen, Emily Pitler, Barbara Plank, Thierry Poibeau, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Michael Rießler, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Roșca, Olga Rudina, Shoval Sadde, Shadi Saleh, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Muh Shohibussirri, Dmitry Sichinava, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Isabela Soares-Bastos, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Dima Taji, Yuta Takahashi, Takaaki Tanaka, Isabelle Tellier, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Veronika Vincze, Lars Wallin, Jonathan North Washington, Seyi Williams, Mats Wirén, Tsegay Woldemariam, Tak-sum Wong, Chunxiao Yan, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman, Manying Zhang, and Hanzhi Zhu. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. 
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics. 
Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl. 2019. Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research, 20(53):1–32. 
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, USA. 
Brian Richards. 1987. Type/token ratios: What do they really tell us? Journal of Child Language, 14(2):201–209. 
Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), Honolulu, Hawaii. 
Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. 2020. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations. 
Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791. 
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics. 
Christopher KI Williams and Carl Edward Rasmussen. 1996. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520. 
Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin Wu. 2018. Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2465–2474. 
Pengcheng Yang, Fuli Luo, Peng Chen, Tianyu Liu, and Xu Sun. 2019. MAAM: A morphology-aware alignment model for unsupervised bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3190–3196. 
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018a. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium. Association for Computational Linguistics. 
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018b. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21. 
Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945. 

Appendix A Questionnaire 
An example of the first questionnaire from our user case study is shown below. The second sheet also included the results in 44 more language pairs. We provide an answer key after the second sheet. 

Please provide your prediction of the BLEU score based on the language pair and dataset features (the domain of the training and test sets is TED talks). After you finish, please go to sheet v2. 

<<TABLE>>

Please provide your prediction of the BLEU score in the yellow area given all the information in this sheet. Note that all experiments are trained with the same model. 

<<TABLE>>

B Representative datasets 

In this section, we show the search results of the most/least representative subsets for the rest of the tasks. 

<<FIGURE>>

Figure 4: Beam search results (beam size=100) for up to the 5 most (and least) representative datasets for the remaining NLP tasks. We also show random search results of corresponding sizes. 

C New Model 

In this section, we show the extrapolation performance for new models on BLI, MA and the remaining systems of UD. 

<<FIGURE>>

Figure 5: RMSE scores of BLI task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph). 

<<FIGURE>>

Figure 6: RMSE scores of MA task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph). 

<<FIGURE>>

Figure 7: RMSE scores of UD task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph). 

D Feature importance 

In this section, we show the plots of feature importance for all the tasks. 
<|endoftext|>


<|startoftext|>
               Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data


                     Charles H. Martin     Tongsu (Serena) Peng †    Michael W. Mahoney ‡


                                                 Abstract

                   In many practical applications, one works with deep neural network (DNN) models trained
                   by someone else. For such pretrained models, one typically does not have access to training
                   data or test data. Moreover, one does not know many details about the model, such as
                   the specifics of the training data, the loss function, the hyperparameter values, etc. Given
                   one or many pretrained models, can one say anything about the expected performance or
                   quality of the models? Here, we present and evaluate empirical quality metrics for pretrained
                   DNN models at scale. Using the open-source Weight Watcher tool, we analyze hundreds of
                   publicly-available pretrained models, including older and current state-of-the-art models in
                   computer vision (CV) and natural language processing (NLP). We examine both familiar
                   norm-based capacity control metrics (Frobenius and Spectral norms) as well as newer Power
                   Law (PL) based metrics (including fitted PL exponents, $\alpha$, and the Weighted Alpha metric,
                   $\hat{\alpha}$), from the recently-developed Theory of Heavy-Tailed Self Regularization (HT-SR). We
                   also introduce the $\alpha$-Schatten Norm metric. We find that norm-based metrics correlate well
                   with reported test accuracies for well-trained models across nearly all CV architecture series.
                   On the other hand, we find that norm-based metrics cannot distinguish "good-versus-bad"
                   models, which, arguably, is the point of needing quality metrics. Indeed, they may give
                   spurious results. We also find that PL-based metrics do much better: quantitatively better
                   at discriminating among a series of "good-better-best" models, and qualitatively better at
                   discriminating "good-versus-bad" models. PL-based metrics can also be used to characterize
                   fine-scale properties of these models, and we introduce the layer-wise Correlation Flow as
                   a new quality assessment. We show how poorly-trained (and/or poorly fine-tuned) models may
                   exhibit both Scale Collapse and unusually large PL exponents ($\alpha$ above roughly 6), in particular
                   for recent NLP models. Our techniques, as implemented in the Weight Watcher tool, can be used to
                   identify when a pretrained DNN has problems that cannot be detected simply by examining
                   training/test accuracies.


              1 Introduction

              A common problem in machine learning (ML) is to evaluate the quality of a given model. A
              popular way to accomplish this is to train a model and then evaluate its training/testing error.
              There are many problems with this approach. The training/testing curves give very limited insight
              into the overall properties of the model; they do not take into account the (often large human
              and CPU/GPU) time for hyperparameter fiddling; they typically do not correlate with other
              properties of interest such as robustness or fairness or interpretability; and so on. A less well-
              known problem, but one that is increasingly important, in particular in industrial-scale artificial
              intelligence (AI), arises when the model user is not the model developer. Here, one may not
             have access to either the training data or the testing data. Instead, one may simply be given a
             model that has already been trained (a pretrained model) and need to use it as-is, or to fine-tune
             and/or compress it and then use it.
                Naively (but in our experience commonly, among ML practitioners and ML theorists), if one
             does not have access to training or testing data, then one can say absolutely nothing about the
             quality of an ML model. This may be true in worst-case theory, but models are used in practice,
             and there is a need for a practical theory to guide that practice. Moreover, if ML is to become
             an industrial process, then that process will become siloed: some groups will gather data, other
             groups will develop models, and other groups will use those models. Users of models can not be
             expected to know the precise details of how models were built, the specifics of data that were
             used to train the model, what was the loss function or hyperparameter values, how precisely the
             model was regularized, etc.
                Moreover, for many large scale, practical applications, there is no obvious way to define an
             ideal test metric. For example, models that generate fake text or conversational chatbots may
             use a proxy, like perplexity, as a test metric. In the end, however, they really require human
             evaluation. Alternatively, models that cluster user profiles, which are widely used in areas such
             as marketing and advertising, are unsupervised and have no obvious labels for comparison and/or
             evaluation. In these and other areas, ML objectives can be poor proxies for downstream goals.
                Most importantly, in industry, one faces unique practical problems such as: do we have enough
             data for this model? Indeed, high quality, labeled data can be very expensive to acquire, and this
             cost can make or break a project. Methods that are developed and evaluated on any well-defined
             publicly-available corpus of data, no matter how large or diverse or interesting, are clearly not
             going to be well-suited to address problems such as this. It is of great practical interest to have
             metrics to evaluate the quality of a trained model in the absence of training/testing data and
             without any detailed knowledge of the training/testing process. We seek a practical theory for
             pretrained models which can predict how, when, and why such models can be expected to perform
             well or poorly.
                In this paper, we present and evaluate quality metrics for pretrained deep neural network
             (DNN) models, and we do so at scale. We consider a large suite of hundreds of publicly-available
             models, mostly from computer vision (CV) and natural language processing (NLP). By now, there
             are many such state-of-the-art models that are publicly-available, e.g., there are now hundreds
             of pretrained models in CV (500) and NLP (100). 1 These provide a large corpus of models
             that by some community standard are state-of-the-art. 2 Importantly, all of these models have
             been trained by someone else and have been viewed to be of sufficient interest/quality to be made
             publicly-available; and, for all of these models, we have no access to training data or testing data,
             and we have no knowledge of the training/testing protocols.
                The quality metrics we consider are based on the spectral properties of the layer weight
             matrices. They are based on norms of weight matrices (such norms have been used in traditional
             statistical learning theory to bound capacity and construct regularizers) and/or parameters of
             power law (PL) fits of the eigenvalues of weight matrices (such PL fits are based on statistical
             mechanics approaches to DNNs). Note that, while we use traditional norm-based and PL-based
             metrics, our goals are not the traditional goals. Unlike more common ML approaches, we do
             not seek a bound on the generalization (e.g., by evaluating training/test error during training),
             we do not seek a new regularizer, and we do not aim to evaluate a single model (e.g., as with
             hyperparameter optimization). 3 Instead, we want to examine different models across common
             architecture series, and we want to compare models between different architectures themselves,
             and in both cases, we ask:
                 Can we predict trends in the quality of pretrained DNN models without access to
                 training or testing data?

                1 When we began this work in 2018, there were fewer than tens of such models; now in 2020, there are hundreds
             of such models; and we expect that in a year or two there will be an order of magnitude or more of such models.
                2 Clearly, there is a selection bias or survivorship bias here (people tend not to make publicly-available their
             poorly-performing models), but these models are things in the world that (like social networks or the internet) can
             be analyzed for their properties.

                To answer this question, we analyze hundreds of publicly-available pretrained state-of-the-art
             CV and NLP models. Here is a summary of our main results.
               • Norm-based metrics and well-trained models. Norm-based metrics do a reasonably
                 good job at predicting quality trends in well-trained CV/NLP models.
               • Norm-based metrics and poorly-trained models. Norm-based metrics may give
                 spurious results when applied to poorly-trained models (e.g., models trained without enough
                 data, etc.), exhibiting Scale Collapse for these models.
               • PL-based metrics and model quality. PL-based metrics do much better at predicting
                 quality trends in pretrained CV/NLP models. They are quantitatively better at discriminating
                 good-better-best trends, and qualitatively better at distinguishing "good-versus-bad"
                 models.
               • PL-based metrics and model diagnostics. PL-based metrics can also be used to
                 characterize fine-scale model properties (including layer-wise Correlation Flow) in well-trained
                 and poorly-trained models, and they can be used to evaluate model enhancements
                 (e.g., distillation, fine-tuning, etc.).
             We emphasize that our goal is a practical theory to predict trends in the quality of state-of-the-
             art DNN models, i.e., not to make a statement about every publicly-available model. We have
             examined hundreds of models, and we identify general trends, but we also highlight interesting
             exceptions.

             The WeightWatcher Tool. All of our computations were performed with the publicly-available
             Weight Watcher tool (version 0.2.7) [1]. To be fully reproducible, we only examine publicly-available,
             pretrained models, and we also provide all Jupyter and Google Colab notebooks used
             in an accompanying github repository [2]. See Appendix A for details on how to reproduce all
             results.

             Organization of this paper. We start in Section 2 and Section 3 with background and an
             overview of our general approach. In Section 4, we study three well-known, widely-available
             DNN CV architectures (the VGG, ResNet, and DenseNet series of models); and we provide an
             illustration of our basic methodology, both to evaluate the different metrics against reported test
             accuracies and to use quality metrics to understand model properties. Then, in Section 5, we
             look at several variations of a popular NLP DNN architecture (the OpenAI GPT and GPT2
             models); and we show how model quality and properties vary between several variants of GPT
             and GPT2, including how metrics behave similarly and differently. Then, in Section 6, we present
             results based on an analysis of hundreds of pretrained DNN models, showing how well each metric
             predicts the reported test accuracies, and how the PL-based metrics perform remarkably well.
             Finally, in Section 7, we provide a brief discussion and conclusion.
                3 One could of course use these techniques to improve training, and we have been asked about that, but we are
             not interested in that here. Our main goal here is to use these techniques to evaluate properties of state-of-the-art
             pretrained DNN models.

                                                2 Background and Related Work

             Most theory for DNNs is applied to small toy models and assumes access to data. There is very
             little work asking how to predict, in a theoretically-principled manner, the quality of large-scale
             state-of-the-art DNNs, and how to do so without access to training data or testing data or details
             of the training protocol, etc. Our approach is, however, related to two other lines of work.

             Statistical mechanics theory for DNNs. Statistical mechanics ideas have long had influence
             on DNN theory and practice [3,4,5]; and our best-performing metrics (those using fitted PL
             exponents) are based on statistical mechanics [4,6,7,8,9], in particular the recently-developed
             Theory of Heavy Tailed Self Regularization (HT-SR) [6,7,9]. We emphasize that the way in
             which we (and HT-SR Theory) use statistical mechanics theory is quite different than the way
             it is more commonly formulated. Several very good overviews of the more common approach are
             available [3,5]. We use statistical mechanics in a broader sense, drawing upon techniques from
             quantitative finance and random matrix theory. Thus, much more relevant for our methodological
             approach is older work of Bouchaud, Potters, Sornette, and coworkers [10,11,12,13] on the
             statistical mechanics of heavy tailed and strongly correlated systems.

             Norm-based capacity control theory. There is also a large body of work on using norm-
             based metrics to bound generalization error [14,15,16]. In this area, theoretical work aims
             to prove generalization bounds, and applied work uses these norms to construct regularizers to
             improve training. While we do find that norms provide relatively good quality metrics, at least
             for distinguishing good-better-best among well-trained models, we are not interested in proving
             generalization bounds or developing new regularizers.


             3 Methods

             Let us write the Energy Landscape (or optimization function, parameterized by the $W_l$ and $b_l$)
             for a DNN with $L$ layers, activation functions <<FORMULA>>, and $N \times M$ weight matrices $W_l$ and biases
             $b_l$, as:

                          <<FORMULA>>           (1)
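
             As a rough illustration of the nested structure that Eqn. (1) describes, the sketch below
             composes layers of the form h(W x + b) in plain Python. The function names, the choice of
             ReLU as the activation, and the list-of-rows matrix representation are assumptions made
             for this example; they are not taken from the paper.

```python
def relu(values):
    """Element-wise ReLU, used here only as a stand-in activation function."""
    return [max(0.0, v) for v in values]


def affine(weight, bias, x):
    """Compute W x + b, with the matrix W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weight, bias)]


def forward(x, weights, biases, activation=relu):
    """Apply the layers in order: x -> h(W_1 x + b_1) -> ... -> h(W_L x + b_L)."""
    for weight, bias in zip(weights, biases):
        x = activation(affine(weight, bias, x))
    return x
```

             The quality metrics discussed below never call such a forward pass; they only read the
             weight matrices passed in as `weights`.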

             Each DNN layer contains one or more 2D $N \times M$ layer weight matrices, $W_l$, or pre-activation
             maps, $W_{i,l}$, extracted from 2D Convolutional layers, where $N > M$. 4 (We may drop the $i$
             and/or $l$ subscripts below.) See Appendix A for how we define the Conv2D layer matrices and
             for our choices of normalization.
                Assume we are given several pretrained DNNs, e.g., as part of an architecture series. The
             models have been trained and evaluated on labeled data $\{d_i, y_i\} \in \mathcal{D}$, using standard techniques.
             The pretrained PyTorch model files are publicly-available, and the test accuracies have been
             reported online. In this study, we do not have access to this data, and we have not trained
             any of the models ourselves, nor have we re-evaluated the test accuracies. We expect that most
             well-trained, production-quality models will employ one or more forms of regularization, such as
             Batch Normalization (BN), Dropout, etc., and many will also contain additional structure such
             as Skip Connections, etc. Here, we will ignore these details, and will focus only on the pretrained
             layer weight matrices $W_l$.

                4 We do not use intra-layer information from the models in our quality metrics, but (as we will describe) our
             metrics can be used to learn about intra-layer model properties.

             DNN Empirical Quality Metrics. The best performing empirical quality metrics depend
             on the norms and/or spectral properties of each weight matrix, W, and/or, equivalently, its
             Empirical Correlation Matrix: $X = W^T W$.
                Here, we consider the following metrics.

                                            <<FORMULA>>

             Here, $\lambda_i$ is the $i$-th eigenvalue of X, and $\lambda_{\max}$ is the maximum eigenvalue. Recall that the
             eigenvalues are squares of the singular values $\sigma_i$ of W, i.e., $\lambda_i = \sigma_i^2$. Also, note that we do not
             normalize X by <<FORMULA>>; see Appendix A for a discussion of this issue.
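
             The per-layer quantities can be sketched directly from the eigenvalues of X. The example
             below assumes the standard identities (squared Frobenius norm equals the sum of the
             eigenvalues of X, squared Spectral norm equals the largest eigenvalue, and the
             $\alpha$-norm raised to the power $\alpha$ equals the sum of the eigenvalues each raised to $\alpha$).
             The function name, the dictionary keys, and the use of base-10 logarithms are choices
             made for this illustration, and the PL exponent `alpha` is assumed to be supplied by the caller.

```python
import math


def layer_metrics(eigenvalues, alpha):
    """Per-layer quality metrics computed from the eigenvalues of X = W^T W.

    `eigenvalues` are the (positive) eigenvalues of X; `alpha` is the fitted
    PL exponent for this layer, supplied by the caller.
    """
    lam_max = max(eigenvalues)
    return {
        # log ||W||_F^2 = log sum_i lambda_i
        "log_frobenius": math.log10(sum(eigenvalues)),
        # log of the squared Spectral norm = log lambda_max
        "log_spectral": math.log10(lam_max),
        # per-layer Weighted Alpha term: alpha * log lambda_max
        "weighted_alpha": alpha * math.log10(lam_max),
        # log ||X||_alpha^alpha = log sum_i lambda_i^alpha
        "log_alpha_norm": math.log10(sum(lam ** alpha for lam in eigenvalues)),
    }
```

             Switching to natural logarithms rescales every value by the same constant factor, so it
             does not change trends across models.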
                The first two norms are well-known in ML; the last two deserve special mention. The empirical
             parameter $\alpha$ is the Power Law (PL) exponent that arises in the recently-developed HT-SR
             Theory [6,7,9]. Operationally, $\alpha$ is determined by using the publicly-available Weight Watcher
             tool [1] to fit the Empirical Spectral Density (ESD) of X, i.e., a histogram of the eigenvalues, call
             it $\rho(\lambda)$, to a truncated PL,
                                        $\rho(\lambda) \sim \lambda^{-\alpha}, \quad \lambda \le \lambda_{\max}$.                        (2)

             Each of these quantities is defined for a given layer W matrix.
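
             For intuition about what fitting a PL exponent involves, the sketch below uses the
             textbook continuous maximum-likelihood estimator for a power-law tail above a cutoff.
             This is a simplified stand-in, not the fitting procedure of the Weight Watcher tool: the
             function name is hypothetical, the cutoff `lam_min` is assumed to be chosen by the caller,
             and the upper truncation at the maximum eigenvalue is ignored.

```python
import math


def fit_pl_exponent(eigenvalues, lam_min):
    """Maximum-likelihood estimate of alpha for a tail rho(lambda) ~ lambda^(-alpha).

    Only eigenvalues >= lam_min enter the estimate; lam_min is chosen by the caller.
    """
    tail = [lam for lam in eigenvalues if lam >= lam_min]
    log_ratio_sum = sum(math.log(lam / lam_min) for lam in tail)
    if log_ratio_sum <= 0.0:
        raise ValueError("need at least one eigenvalue strictly above lam_min")
    # Continuous power-law MLE: alpha = 1 + n / sum_i ln(lambda_i / lam_min)
    return 1.0 + len(tail) / log_ratio_sum
```

             In practice the choice of cutoff strongly affects the estimate, which is one reason a
             dedicated tool is used for the fits reported here.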
                For norm-based metrics, we use the average of the log norm, and to the appropriate power.
             Informally, this amounts to assuming that the layer weight matrices are statistically independent,
             in which case we can estimate the model complexity $\mathcal{C}$, or test accuracy, with a standard Product
             Norm (which resembles a data dependent VC complexity),

                                    $\mathcal{C} \sim \|W_1\| \times \|W_2\| \times \cdots \times \|W_L\|$,                    (3)

             where $\|\cdot\|$ is a matrix norm. The log complexity,

                               $\log \mathcal{C} \sim \log \|W_1\| + \log \|W_2\| + \cdots + \log \|W_L\| = \sum_l \log \|W_l\|$,               (4)

             takes the form of an average Log Norm. For the Frobenius Norm metric and Spectral Norm
             metric, we can use Eqn. (4) directly. 6
                The Weighted Alpha metric is an average of $\alpha_l$ over all layers $l$, weighted by the
             size, or scale, of each matrix,

                                       $\hat{\alpha} = \frac{1}{L} \sum_l \alpha_l \log \lambda_{\max,l}$,                   (5)
                                         
             where L is the total number of layer weight matrices. The Weighted Alpha metric was introduced
             previously [9], where it was shown to correlate well with trends in reported test accuracies of
             pretrained DNNs, albeit on a limited set of models.
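
             The two model-level aggregations described above, the average Log Norm of Eqn. (4) and
             the Weighted Alpha of Eqn. (5), can be sketched as follows. The function names are
             hypothetical, base-10 logarithms are an assumption, and the per-layer norms, exponents,
             and maximum eigenvalues are assumed to have been computed beforehand.

```python
import math


def average_log_norm(layer_norms):
    """Average over layers of log ||W_l||, for any one choice of matrix norm."""
    return sum(math.log10(norm) for norm in layer_norms) / len(layer_norms)


def weighted_alpha(alphas, lam_maxes):
    """Average over layers of alpha_l * log lambda_max,l."""
    if len(alphas) != len(lam_maxes):
        raise ValueError("need one alpha and one lambda_max per layer")
    total = sum(a * math.log10(lam) for a, lam in zip(alphas, lam_maxes))
    return total / len(alphas)
```

             Both functions return one number per model, which is what gets compared against the
             reported test accuracy across an architecture series.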
                Based on this, in this paper, we introduce and evaluate the $\alpha$-Schatten Norm metric. Notice
             for the $\alpha$-Schatten Norm metric, however, $\alpha_l$ varies from layer to layer, and so in Eqn. (6) it
             cannot be taken out of the sum:

                                                <<FORMULA>>                    (6)

             We use X to emphasize that $\alpha$ depends on the ESD of X. For small $\alpha$, the Weighted Alpha metric
             approximates the Log $\alpha$-Schatten norm, as can be shown with a statistical mechanics and random
             matrix theory derivation [17]; and the Weighted Alpha and $\alpha$-Schatten norm metrics often behave
             like an improved, weighted average Log Spectral Norm, and may track this metric in some cases.

                5 Notice <<FORMULA>>.
                6 When taking $\log \|W\|_F^2$, the 2 comes down and out of the sum, and thus ignoring it only changes the metric by a constant factor.

                To avoid confusion, let us clarify the relationship between $\alpha$ and $\hat{\alpha}$. We fit the ESD of the
             correlation matrix X to a truncated PL, parameterized by 2 values: the PL exponent $\alpha$, and the
             maximum eigenvalue $\lambda_{\max}$. (Technically, we also need the minimum eigenvalue $\lambda_{\min}$, but this
             detail does not affect our analysis.) The PL exponent $\alpha$ measures the amount of correlation
             in a DNN layer weight matrix W. It is valid for <<FORMULA>>, and it is scale-invariant, i.e., it does
             not depend on the normalization of W or X. The $\lambda_{\max}$ is a measure of the size, or scale, of W.
             Multiplying each $\alpha$ by the corresponding $\log \lambda_{\max}$ weighs "bigger" layers more, and averaging
             this product leads to a balanced, Weighted Alpha metric for the entire DNN.
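
             The distinction between the scale-invariant exponent and the scale-dependent maximum
             eigenvalue can be made concrete with a short worked step. The rescaling constant $c$ below
             is introduced only for illustration; the argument follows from the definitions of X and of
             the PL tail, and is not a derivation taken from the paper.

```latex
% Rescale a layer weight matrix by a constant c > 0 (illustrative assumption).
W \to c\,W
\;\Longrightarrow\;
X = W^{T} W \to c^{2} X
\;\Longrightarrow\;
\lambda_i \to c^{2} \lambda_i .

% The tail exponent of the ESD is unchanged, but the scale term shifts:
\rho(\lambda) \sim \lambda^{-\alpha} \ \text{(same } \alpha\text{)},
\qquad
\log \lambda_{\max} \to \log \lambda_{\max} + 2 \log c .

% Hence the per-layer Weighted Alpha term changes by an amount proportional to alpha:
\alpha \log \lambda_{\max} \to \alpha \log \lambda_{\max} + 2\,\alpha \log c .
```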

             Convolutional Layers and Normalization issues. There are several technical issues
             (regarding spectral analysis of convolutional layers and normalization of empirical matrices) that
             are important for reproducibility of our results. See Appendix A for a discussion.

             4 Comparison of CV models

             In this section, we examine empirical quality metrics described in Section 3 for several CV model
             architecture series. This includes the VGG, ResNet, and DenseNet series of models, each of which
             consists of several pretrained DNN models, trained on the full ImageNet [18] dataset, and each
             of which is distributed with the current open source PyTorch framework (version 1.4) [19]. This
             also includes a larger set of ResNet models, trained on the ImageNet-1K dataset [18], provided
             on the OSMR "Sandbox for training convolutional networks for computer vision" [20], which we
             call the ResNet-1K series.
                We perform coarse model analysis, comparing and contrasting the four model series, and
             predicting trends in model quality. We also perform fine layer analysis, as a function of depth
             for these models, illustrating that PL-based metrics can provide novel insights among the VGG,
             ResNet/ResNet-1K, and DenseNet architectures.

             Average Quality Metrics versus Reported Test Accuracies. We have examined the
             performance of the four quality metrics (Log Frobenius norm, Log Spectral norm, Weighted Alpha,
             and Log $\alpha$-Norm) applied to each of the VGG, ResNet, ResNet-1K, and DenseNet series. To start,
             Figure 1 considers the VGG series (in particular, the pretrained models VGG11, VGG13, VGG16,
             and VGG19, with and without BN), and it plots the four quality metrics versus the reported test
             accuracies [19], 7 as well as a basic linear regression line. All four metrics correlate quite well
             with the reported Top1 accuracies, with smaller norms and smaller values of $\hat{\alpha}$ implying better
             generalization (i.e., greater accuracy, lower error). While all four metrics perform well, notice
             that the Log $\alpha$-Norm metric (<<FORMULA>>) performs best (with an RMSE of 0.42, see Table 1);
             and the Weighted Alpha metric ($\hat{\alpha}$), which is an approximation to the Log $\alpha$-Norm
             metric [17], performs second best (with an RMSE of 0.48, see Table 1).
                7 That is, these test accuracies have been previously reported and made publicly-available by others. We take
             them as given, and we do not attempt to reproduce/verify them, since we do not permit ourselves any access to
             training/test data.

                                               <<FIGURE>>

             Figure 1: Comparison of Average Log Norm and Weighted Alpha quality metrics versus reported
             test accuracy for pretrained VGG models (with and without BN), trained on ImageNet, available
             in PyTorch (v1.4). Metrics fit by linear regression, RMSE reported.
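
             The evaluation step behind Figure 1 and Table 1, fitting a line of reported accuracy
             against a quality metric and reporting the RMSE of the fit, can be sketched with ordinary
             least squares. The function name is hypothetical, and dividing the squared residuals by
             the number of models (rather than by the residual degrees of freedom) is an assumption;
             the text does not state which normalization was used.

```python
import math


def linear_fit_rmse(metric_values, accuracies):
    """Least-squares line accuracy ~ slope * metric + intercept, plus its RMSE."""
    n = len(metric_values)
    mean_x = sum(metric_values) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in metric_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(metric_values, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Root-mean-square of the residuals around the fitted line.
    squared_residuals = [(y - (slope * x + intercept)) ** 2
                         for x, y in zip(metric_values, accuracies)]
    rmse = math.sqrt(sum(squared_residuals) / n)
    return slope, intercept, rmse
```

             A negative slope with a small RMSE corresponds to the pattern described in the text:
             smaller metric values going with higher reported accuracy.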


                See Table 1 for a summary of results for Top1 accuracies for all four metrics for the VGG,
             ResNet, and DenseNet series. Similar results (not shown) are obtained for the Top5 accuracies.
             Overall, for the ResNet, ResNet-1K, and DenseNet series, all metrics perform relatively well,
             the Log $\alpha$-Norm metric performs second best, and the Weighted Alpha metric performs best.
             These model series are all well-trodden, and our results indicate that norm-based metrics and
             PL-based metrics can both distinguish among a series of "good-better-best" models, with
             PL-based metrics performing somewhat (i.e., quantitatively) better.
                The DenseNet series has similar behavior to what we see in Figures 1 and 2 for the other
             models. However, as noted in Table 1, it has only 4 data points. In our larger analysis, in
             Section 6, we will only include series with 5 or more models. (Note that these and many other
             such plots can be seen on our publicly-available repo.)

             Variation in Data Set Size. We are interested in how our four quality metrics depend on
             data set size. To examine this, we look at results on ResNet versus ResNet-1K. See Figure 2,
             which plots and compares the Log $\alpha$-Norm metric for the full ResNet model, trained on the
             full ImageNet dataset, against the ResNet-1K model, which has been trained on a much smaller
             ImageNet-1K data set. The Log $\alpha$-Norm is much better than the Log Frobenius/Spectral norm
             metrics (although, as Table 1 shows, it is actually slightly worse than the Weighted Alpha metric).
             The ResNet series has strong correlation, with an RMSE of 0.66, whereas the ResNet-1K series
             also shows good correlation, but has a much larger RMSE of 1.9. (Other metrics exhibit similar
             behavior.) As expected, the higher quality data set shows a better fit, even with fewer data points.

                                                <<TABLE>>

             Table 1: RMSE (smaller is better) for linear fits of quality metrics to reported Top1 test error
             for pretrained models in each architecture series. Column # refers to number of models. VGG,
             ResNet, and DenseNet were pretrained on ImageNet, and ResNet-1K was pretrained on
             ImageNet-1K.

             Layer Analysis: Metrics as a Function of Depth. We can learn much more about a
             pretrained model by going beyond average values of quality metrics to examining quality metrics
             for each layer weight matrix, W, as a function of depth (or layer id). For example, we can
             plot (just) the PL exponent, $\alpha$, for each layer, as a function of depth. See Figure 3, which
             plots $\alpha$ for each layer (the first layer corresponds to data, the last layer to labels) for the least
             accurate (shallowest) and most accurate (deepest) model in each of the VGG (no BN), ResNet,
             and DenseNet series. (Again, a much more detailed set of plots is available at our repo; but note
             that the corresponding layer-wise plots for Frobenius and Spectral norms are much less interesting
             than the results we present here.)
                In the VGG models, Figure 3 (a) shows that the PL exponent $\alpha$ systematically increases as
             we move down the network, from data to labels, in the Conv2D layers, starting with <<FORMULA>> and
             reaching all the way to <<FORMULA>> and then, in the last three, large, fully-connected (FC) layers, $\alpha$
             stabilizes back down to <<FORMULA>>. This is seen for all the VGG models (again, only the shallowest
             and deepest are shown in this figure), indicating that the main effect of increasing depth is to
             increase the range over which $\alpha$ increases, thus leading to larger $\alpha$ values in later Conv2D layers
             of the VGG models. This is quite different than the behavior of either the ResNet-1K models or
             the DenseNet models.
                For the ResNet-1K models, Figure 3 (b) shows that $\alpha$ also increases in the last few layers
             (more dramatically, in fact, than for VGG; observe the differing scales on the Y axes). However,
             as the ResNet-1K models get deeper, there is a wide range over which $\alpha$ values tend to remain
             quite small. This is seen for other models in the ResNet-1K series, but it is most pronounced for
             the larger ResNet-1K (152) model, where $\alpha$ remains relatively stable at <<FORMULA>>, from the earliest
             layers all the way until we reach close to the final layers.

                                                <<FIGURE>>

             Figure 3: PL exponent ($\alpha$) versus layer id, for the least and the most accurate models in VGG
             (a), ResNet (b), and DenseNet (c) series. (VGG is without BN; and note that the Y axes on
             each plot are different.) Subfigure (d) displays the ResNet models (b), zoomed in to $\alpha \in [1, 5]$,
             and with the layer ids overlaid on the X-axis, from smallest to largest, to allow a more detailed
             analysis of the most strongly correlated layers. Notice that ResNet152 exhibits different and much
             more stable behavior of $\alpha$ across layers. This contrasts with how both VGG models gradually
             worsen in deeper layers and how the DenseNet models are much more erratic. In the text, this is
             interpreted in terms of Correlation Flow.

                For the DenseNet models, Figure 3 (c) shows that $\alpha$ tends to increase as the layer id increases,
             in particular for layers toward the end. While this is similar to what is seen in the VGG models,
             with the DenseNet models, $\alpha$ values increase almost immediately after the first few layers, and
             the variance is much larger (in particular for the earlier and middle layers, where it can range all
             the way to <<FORMULA>>) and much less systematic throughout the network.

             Comparison of VGG, ResNet, and DenseNet Architectures. We can interpret these
             observations by recalling the architectural differences between the VGG, ResNet, and DenseNet
             architectures, and, in particular, the number of residual connections. VGG resembles the
             traditional convolutional architectures, such as LeNet5, and consists of several
             [Conv2D-Maxpool-ReLU] blocks, followed by 3 large Fully Connected (FC) layers. ResNet greatly
             improved on VGG by replacing the large FC layers, shrinking the Conv2D blocks, and introducing
             residual connections. This optimized approach allows for greater accuracy with far fewer parameters
             (and GPU memory requirements), and ResNet models of up to 1000 layers have been trained [21].

                                                <<FIGURE>>

             Figure 4: ResNet20, distilled with Group Regularization, as implemented in the distiller
             (4D regularized 5L removed) pretrained models. Log Spectral Norm (<<FORMULA>>) and PL exponent
             ($\alpha$) for individual layers, versus layer id, for both baseline (before distillation, green) and
             fine-tuned (after distillation, red) pretrained models.

                We conjecture that the efficiency and effectiveness of ResNet is reflected in the smaller and
             more stable <<FORMULA>>, across nearly all layers, indicating that the inner layers are very well
             correlated and strongly optimized. Contrast this with the DenseNet models, which contain
             many connections between every layer. Our results (large $\alpha$, meaning that even a PL model
             is probably a poor fit) suggest that DenseNet has too many connections, diluting high quality
             interactions across layers, and leaving many layers very poorly optimized.

             Correlation Flow. More generally, we can understand the results presented in Figure 3 in
             terms of what we will call the Correlation Flow of the model. Recall that the average Log
             α-Norm metric and the Weighted Alpha metric are based on HT-SR Theory [6,7,9], which is
             in turn based on ideas from the statistical mechanics of heavy tailed and strongly correlated
             systems [10,11,12,13]. There, one expects the weight matrices of well-trained DNNs to exhibit
             correlations over many size scales. Their ESDs can be well-fit by a (truncated) PL, with exponents
             <<FORMULA>>. Much larger values (<<FORMULA>>) may reflect poorer PL fits, whereas smaller values (<<FORMULA>>)
             are associated with models that generalize better. Informally, one would expect a DNN model to
             perform well when it facilitates the propagation of information/features across layers. Previous
             work argues this by computing the gradients over the input data. In the absence of training/test
             data, one might hope that this leaves empirical signatures on weight matrices, and thus we can
             try to quantify this by measuring the PL properties of weight matrices. In this case, smaller α
             values correspond to layers in which correlations across multiple scales are better captured [6,11],
             and we expect that small α values that are stable across multiple layers enable better correlation
             flow through the network. We have seen this in many models, including those shown in Figure 3.
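To make the PL exponent concrete, here is a minimal sketch of how α could be estimated from the eigenvalues of a single weight matrix. It assumes the plain continuous power-law maximum-likelihood estimator with a caller-supplied lower cutoff `xmin`. The function names `esd`, `fit_alpha`, and `sample_power_law` are illustrative only, and the truncated PL fit used in the paper may select `xmin` and handle the upper tail differently.

```python
import numpy as np

def esd(W):
    """Eigenvalues of X = W^T W, computed as the squared singular values of W."""
    singular_values = np.linalg.svd(W, compute_uv=False)
    return np.sort(singular_values ** 2)

def fit_alpha(eigs, xmin):
    """Continuous power-law MLE on the tail eigs >= xmin.

    alpha = 1 + n / sum(log(x_i / xmin)). Here xmin is supplied by the caller
    (an assumption of this sketch); a full fit would also choose xmin.
    """
    eigs = np.asarray(eigs, dtype=float)
    tail = eigs[eigs >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

def sample_power_law(alpha, xmin, n, rng):
    """Draw n samples with density proportional to x^(-alpha) for x >= xmin."""
    u = rng.random(n)
    return xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))
```

On synthetic data drawn with a known exponent, `fit_alpha(sample_power_law(3.0, 1.0, 20000, rng), 1.0)` should recover a value close to 3; on a real layer one would pass `esd(W)` together with a cutoff chosen from the tail of the spectrum.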

             Scale Collapse; or How Distillation May Break Models. The similarity between norm-
             based metrics and PL-based metrics suggests a question: is the Weighted Alpha metric just a
             variation of the more familiar norm-based metrics? More generally, do fitted α values contain
             information not captured by norms? In examining hundreds of pretrained models, we have found
             several anomalies that demonstrate the power of our approach. In particular, to show that α does
             capture something different, consider the following example, which looks at a compressed/distilled
             DNN model [22]. In this example, we show that some distillation methods may actually break
             models unexpectedly by introducing what we call Scale Collapse, where several distilled layers
             have unexpectedly small Spectral Norms.
                We consider ResNet20, trained on CIFAR10, before and after applying the Group Regularization
             distillation technique, as implemented in the distiller package [23]. We analyze the
             pretrained 4D regularized 5L removed baseline and fine-tuned models. The reported baseline test
             accuracies (Top1 = 91.45 and Top5 = 99.75) are better than the reported fine-tuned test accuracies
             (Top1 = 91.02 and Top5 = 99.67). Because the baseline accuracy is greater, the previous results
             on ResNet (Table 1 and Figure 2) suggest that the baseline Spectral Norms should be smaller on
             average than the fine-tuned ones. The opposite is observed. Figure 4 presents the Spectral Norm
             (here denoted <<FORMULA>>) and PL exponent (α) for each individual layer weight matrix W. On
             the other hand, the α values (in Figure 4(b)) do not differ systematically between the baseline
             and fine-tuned models. Also (not shown), the average (unweighted) baseline α is smaller than
             the fine-tuned average (as predicted by HT-SR Theory, the basis of <<FORMULA>>).
                That being said, Figure 4(b) also depicts two very large (α > 6) values for the baseline,
             but not for the fine-tuned, model. This suggests the baseline model has at least two over-
             parameterized/under-trained layers, and that the distillation method does, in fact, improve the
             fine-tuned model by compressing these layers.
                The pretrained models in the distiller package have passed some quality metric, but they
             are much less well trodden than any of the VGG, ResNet, or DenseNet series. While norms
             make good regularizers for a single model, there is no reason a priori to expect them to correlate
             so well with test accuracies across different models. We do expect, however, the PL fit to do so
             because it effectively measures the amount of correlation in the model [6,7,9]. The reason for the
             anomalous behavior shown in Figure 4 is that the distiller Group Regularization technique
             causes the norms of the W pre-activation maps for two Conv2D layers to increase spuriously.
             This is difficult to diagnose by analyzing training/test curves, but it is easy to diagnose with
             our approach.
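One way to look for this kind of anomaly without training or test data is to compare per-layer Log Spectral Norms and flag the layers that sit far below the rest. The sketch below is only an illustration. The median/MAD threshold and the names `log_spectral_norm` and `flag_scale_collapse` are assumptions introduced here, not part of the paper or of the WeightWatcher tool.

```python
import numpy as np

def log_spectral_norm(W):
    """log10 of the squared largest singular value of W (largest eigenvalue of W^T W)."""
    s_max = np.linalg.norm(W, ord=2)
    return float(np.log10(s_max ** 2))

def flag_scale_collapse(weights, n_mads=3.0):
    """Indices of layers whose log spectral norm is far below the median.

    Hypothetical heuristic: a layer is flagged when its value falls more than
    n_mads median absolute deviations below the median over all layers.
    """
    vals = np.array([log_spectral_norm(W) for W in weights])
    med = np.median(vals)
    mad = max(np.median(np.abs(vals - med)), 1e-12)
    return [i for i, v in enumerate(vals) if v < med - n_mads * mad]
```

Given a list of layer weight matrices, `flag_scale_collapse(weights)` returns the positions of suspiciously small layers, which can then be inspected individually (for example by plotting their ESDs).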

             5 Comparison of NLP Models

             In this section, we examine empirical quality metrics described in Section 3 for several NLP
             model architectures. Within the past two years, nearly 100 open source, pretrained NLP DNNs
             based on the revolutionary Transformer architecture have emerged. These include variants of
             BERT, Transformer-XL, GPT, etc. The Transformer architectures consist of blocks of so-called
             Attention layers, containing two large, Feed Forward (Linear) weight matrices [24]. In contrast to
             smaller pre-Activation maps arising in Conv2D layers, Attention matrices are significantly larger.
             In general, we have found that they have larger PL exponents α. Based on HT-SR Theory (in
             particular, the interpretation of values of α ≈ 2 as modeling systems with good correlations over
             many size scales [10,11]), this suggests that these models fail to capture successfully many of the
             correlations in the data (relative to their size) and thus are substantially under-trained. More
             generally, compared to the CV models of Section 4, modern NLP models have larger weight
             matrices and display different spectral properties. Thus, they provide a very different test for our
             empirical quality metrics.
                While norm-based metrics perform reasonably well on well-trained NLP models, they often
             behave anomalously on poorly-trained models. Indeed, for such "bad" models, weight matrices
             may display rank collapse, decreased Frobenius mass, or unusually small Spectral norms. (This
             may be misinterpreted as "smaller is better.") In contrast, PL-based metrics, including the Log
             α-Norm metric (<<FORMULA>>) and the Weighted Alpha metric (<<FORMULA>>), display consistent
             behavior, even on poorly trained models. Indeed, we can use these metrics to help identify when
             architectures need repair and when more and/or better data are needed.

             What do large values of α mean? Many NLP models, such as GPT and BERT, have some
             weight matrices with unusually large PL exponents (e.g., α > 6). This indicates these matrices
             may be under-correlated (i.e., over-parameterized, relative to the amount of data). In this regime,
             the truncated PL fit itself may not be very reliable because the MLE estimator it uses is unreliable
             in this range (i.e., the specific α values returned by the truncated PL fits are less reliable, but
             having large versus small values of α is reliable). Phenomenologically, if we examine the ESD
             visually, we can usually describe these W as in the Bulk-Decay or Bulk-plus-Spikes phase [6,7].
             Previous work [6,7] has conjectured that very well-trained DNNs would not have many outlier
             α > 6; and improved versions of GPT (shown below) and BERT (not shown) confirm this.
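One standard way to see why large fitted exponents are less precise is the asymptotic standard error of the continuous power-law MLE, which is approximately (α − 1)/√n for n tail points. The sketch below assumes that untruncated estimator. The truncated fit discussed above may behave differently, and the helper name `alpha_std_error` is illustrative.

```python
import math

def alpha_std_error(alpha, n_tail):
    """Asymptotic standard error of the continuous power-law MLE.

    Assumes the untruncated estimator; n_tail is the number of eigenvalues
    at or above the lower cutoff used in the fit.
    """
    return (alpha - 1.0) / math.sqrt(n_tail)

# With the same number of tail points, the uncertainty grows linearly in (alpha - 1).
for alpha in (2.0, 4.0, 6.0, 12.0):
    print(alpha, alpha_std_error(alpha, 100))
```

Under this approximation, a fit with α = 6 on 100 tail eigenvalues is five times less precise than a fit with α = 2, which is consistent with treating large values as a qualitative signal rather than as precise estimates.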

             OpenAI GPT Models. The OpenAI GPT and GPT2 models provide us with the opportunity
             to analyze two effects: training the same model with different data set sizes; and increasing
             the sizes of both the data set and the architectures simultaneously. These models have the
             remarkable ability to generate fake text that appears to the human to be real, and they have
             generated significant media attention because of the potential for their misuse. For this reason,
             the original GPT model released by OpenAI was trained on a deficient data set, rendering the
             model interesting but not fully functional. Later, OpenAI released a much improved model,
             GPT2-small, which has the same architecture and number of layers as GPT, but which has been
             trained on a larger and better data set (and with other changes), making it remarkably good at
             generating (near) human-quality fake text. By comparing the poorly-trained (i.e., "bad") GPT to
             the well-trained (i.e., "good") GPT2-small, we can identify empirical indicators for when a model
             has in fact been poorly-trained and thus may perform poorly when deployed. By comparing
             GPT2-medium to GPT2-large to GPT2-xl, we can examine the effect of increasing data set and
             model size simultaneously, an example of what we call a series of "good-better-best" models.
                The GPT models we analyze are deployed with the popular HuggingFace PyTorch library [25].
             GPT has 12 layers, with 4 Multi-head Attention Blocks, giving 48 layer Weight Matrices, W.
             Each Block has 2 components, the Self Attention (attn) and the Projection (proj) matrices. The
             self-attention matrices are larger, of dimension (2304 × 768) or (3072 × 768). The projection
             layer concatenates the self-attention results into a vector (of dimension 768). This gives 50
             large matrices. Because GPT and GPT2 are trained on different data sets, the initial Embedding
             matrices differ in shape. GPT has initial Token and Positional Embedding layers, of dimension
             (40478 × 768) and (512 × 768), respectively, whereas GPT2 has input Embeddings of shape
             (50257 × 768) and (1024 × 768), respectively. The OpenAI GPT2 (English) models are: GPT2-small,
             GPT2-medium, GPT2-large, and GPT2-xl, having 12, 24, 36, and 48 layers, respectively,
             with increasingly larger weight matrices.

             Average Quality Metrics for GPT and GPT2. We have analyzed the four quality metrics
             described in Section 3 for the OpenAI GPT and GPT2 pretrained models. See Table 2 for a
             summary of results. We start by examining trends between GPT and GPT2-small. Observe
             that all four metrics increase when going from GPT to GPT2-small, i.e., they are larger for the
             higher-quality model (higher quality since GPT2-small was trained on better data), when the number of
             layers is held fixed. Notice that in the GPT model, being poorly trained, the norm metrics all
             exhibit Scale Collapse, compared to GPT2-small.

                                                <<TABLE>>

             Table 2: Average value for the average Log Norm and Weighted Alpha metrics for pretrained
             OpenAI GPT and GPT2 models. Column # refers to number of layers treated. Note that the
             averages do not include the first embedding layer(s) because they are not (implicitly) normalized.
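As a reminder of what is being averaged, the sketch below computes the four per-layer quantities from a layer's eigenvalues λ_i (of W^T W) and its fitted exponent α, and then averages them over layers. The formulas reflect our reading of the definitions referenced from Section 3 (log10 Σλ_i, log10 λ_max, α·log10 λ_max, and log10 Σλ_i^α) and should be treated as assumptions, as should the function names.

```python
import numpy as np

def layer_metrics(eigs, alpha):
    """Four per-layer metrics from eigenvalues of W^T W and a fitted PL exponent."""
    eigs = np.asarray(eigs, dtype=float)
    lam_max = eigs.max()
    return {
        "log_frobenius": np.log10(eigs.sum()),           # log10 of squared Frobenius norm
        "log_spectral": np.log10(lam_max),               # log10 of squared spectral norm
        "weighted_alpha": alpha * np.log10(lam_max),     # alpha weighted by log10 lambda_max
        "log_alpha_norm": np.log10(np.sum(eigs ** alpha)),
    }

def average_metrics(layers):
    """Average each metric over a list of (eigs, alpha) pairs, one pair per layer."""
    per_layer = [layer_metrics(eigs, alpha) for eigs, alpha in layers]
    return {name: float(np.mean([m[name] for m in per_layer])) for name in per_layer[0]}
```

Because the sum of λ_i^α is at least λ_max^α, the log α-norm of a layer is never smaller than its weighted alpha, which is one reason the two PL-based columns tend to track each other.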


                We next examine trends between GPT2-medium to GPT2-large to GPT2-xl. Observe that
             (with one minor exception involving the log Frobenius norm metric) all four metrics decrease as
             one goes from medium to large to xl, indicating that the larger models indeed look better than
             the smaller models. Notice that, for these well-trained models, the norm metrics now behave as
             expected, decreasing with increasing accuracy.
                Going beyond average values, Figure 5(a) shows the histogram (empirical density), for all
             layers, of α for GPT and GPT2-small. These two histograms are very different. The older
             deficient GPT has numerous unusually large exponents α, meaning they are not really
             well-described by a PL fit. Indeed, we expect that a poorly-trained model will lack good (i.e.,
             small α) PL behavior in many/most layers. On the other hand, as expected, the newer improved
             GPT2-small model has, on average, smaller α values than the older GPT, with all α ≤ 6 and with
             smaller mean/median α. It also has far fewer unusually-large outlying α values than GPT. From
             this (and other results not shown), we see that α provides a good quality metric for comparing
             these two models, the "bad" GPT versus the "good" GPT2-small. This should be contrasted
             with the behavior displayed by the Frobenius norm (not shown) and the Spectral norm.

             Scale Collapse in Poorly Trained Models. We next describe the behavior of the Spectral
             norm in GPT versus GPT2-small. In Figure 5(b), the "bad" GPT model has a smaller
             mean/median Spectral norm as well as, spuriously, many much smaller Spectral norms,
             compared to the "good" GPT2-small, violating the conventional wisdom that smaller Spectral norms
             are better. Indeed, because there are so many anomalously small Spectral norms, it appears that
             the GPT model may be exhibiting a kind of Scale Collapse, like that observed in the distilled
             CV models (in Figure 4). This is important because it demonstrates that, while the Spectral
             (or Frobenius) norm may correlate well with predicted test error, it is not a good indicator of
             the overall model quality. It can mispredict good-versus-bad questions in ways not seen with
             PL-based metrics. Using it as an empirical quality metric may give spurious results when applied
             to poorly-trained or otherwise deficient models.
                (Note that Figure 5(b) also shows some unusually large Spectral Norms. Upon examination,
             e.g., from Figure 6(b) (below), we see that these correspond to the first embedding layer(s).
             These layers have a different effective normalization, and therefore a different scale. We discuss
             this further in Appendix A. Here, we do not include them in our computed average metrics in
             Table 2, and we do not include them in the histogram plot in Figure 5(b).)

             Layer Analysis: Correlation Flow and Scale Collapse in GPT and GPT2. We also
             examine in Figure 6 the PL exponent α and Log Spectral Norm versus layer id, for GPT and
             GPT2-small. Let’s start with Figure 6(a), which plots α versus the depth (i.e., layer id) for
             each model. The deficient GPT model displays two trends in α, one stable with α ≈ 4, and one

                                                <<FIGURE>>

             Figure 5: Histogram of PL exponents (<<FORMULA>>) and Log Spectral Norms (<<FORMULA>>) for weight matrices
             from the OpenAI GPT and GPT2-small pretrained models.

             increasing with layer id, with α reaching as high as 12. In contrast, the well-trained GPT2-small
             model shows consistent and stable patterns, again with one stable <<FORMULA>> (and below the GPT
             trend), and the other only slightly trending up, with α ≤ 6. The scale-invariant α metric lets us
             identify potentially poorly-trained models. These results show that the Correlation Flow differs
             significantly between GPT and GPT2-small (with the better GPT2-small looking more like the
             better ResNet-1K from Figure 3(b)).
                These results should be contrasted with the corresponding results for Spectral Norms, shown
             in Figure 6(b). Attention models have two types of layers, one small and one large; and the Spectral
             Norm, in particular, displays unusually small values for some of these layers for GPT. This Scale
             Collapse for the poorly-trained GPT is similar to what we observed for the distilled ResNet20
             model in Figure 4(b). Because of the anomalous scale collapse that is frequently observed in
             poorly-trained models, these results suggest that scale-dependent norm metrics should not be
             directly applied to distinguish good-versus-bad models.

                                                <<FIGURE>>

             Figure 6: PL exponents (<<FORMULA>>) (in (a)) and Log Spectral Norms (<<FORMULA>>) (in (b)) for weight
             matrices from the OpenAI GPT and GPT2-small pretrained models. (Note that the quantities
             being shown on each Y axis are different.) In the text, this is interpreted in terms of Correlation
             Flow and Scale Collapse.


             GPT2: medium, large, xl. We now look across series of increasingly improving GPT2 models
             (i.e., we consider good-better-best questions), by examining both the PL exponent α as well as
             the Log Norm metrics. In general, as we move from GPT2-medium to GPT2-xl, histograms
             for both α exponents and the Log Norm metrics downshift from larger to smaller values. For
             example, see Figure 7, which shows the histograms over the layer weight matrices for fitted PL
             exponent (<<FORMULA>>) and the Log Alpha Norm (<<FORMULA>>) metric. We see that the average α
             decreases with increasing model size, although the differences are less noticeable between the
             differing good-better-best GPT2 models than between the good-versus-bad GPT and GPT2-small
             models. Unlike GPT, however, the layer Log Alpha Norms
             behave more as expected for GPT2 layers, with the larger models consistently having smaller
             norms. Similarly, the Log Spectral Norm also decreases on average with the larger models (not
             shown). As expected, the norm metrics can indeed distinguish among good-better-best models
             among a series of well-trained models.
                We do notice, however, that while the peaks of the α distribution are getting smaller, towards 2.0,
             the tails of the distribution shift right, with larger GPT2 models having more unusually large α (also not
             shown). We suspect this indicates that these larger GPT2 models are still under-optimized/over-
             parameterized (relative to the data on which they were trained) and that they have capacity to
             support datasets even larger than the recent XL 1.5B release [26].

                            <<FIGURE>>

             Figure 7: Histogram of PL exponents (<<FORMULA>>) and Log Alpha Norm (<<FORMULA>>) for weight matrices  
             from models of different sizes in the GPT2 architecture series. (Plots omit the first 2 (embedding)
             layers, because they are normalized differently giving anomalously large values.)


             6 Comparing Hundreds of CV Models

             In this section, we summarize results from a large-scale analysis of hundreds of CV models,
             including models developed for image classification, segmentation, and a range of related tasks. Our
             aim is to complement the detailed results from Sections 4 and 5 by providing broader conclusions.
             The models we consider have been pretrained on nine datasets. We provide full details about
             how to reproduce these results in Appendix A.
                We choose ordinary least squares (OLS) regression to quantify the relationship between quality
             metrics (computed with the Weight Watcher tool) and the reported test error and/or accuracy
             metrics. We regress the metrics on the Top1 (and Top5) reported errors (as dependent variables).
             These include Top5 errors for the ImageNet-1K model, percent error for the CIFAR-10/100,
             SVHN, CUB-200-2011 models, and Pixel accuracy (Pix.Acc.) and Intersection-Over-Union (IOU)
             for other models. We regress them individually on each of the norm-based and PL-based metrics,
             as described in Section 4.
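The regression step can be sketched as follows: for one architecture series, fit a line predicting the reported test error from a single quality metric, then record R² and MSE. This is a minimal NumPy illustration under our own naming (`ols_fit`), and the exact procedure in the accompanying notebooks may differ.

```python
import numpy as np

def ols_fit(metric, error):
    """OLS fit of error = slope * metric + intercept; returns slope, intercept, R^2, MSE."""
    x = np.asarray(metric, dtype=float)
    y = np.asarray(error, dtype=float)
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - (slope * x + intercept)
    mse = float(np.mean(residuals ** 2))
    r2 = float(1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2))
    return float(slope), float(intercept), r2, mse
```

Running `ols_fit` once per (architecture series, metric) pair and then taking the mean and standard deviation of the resulting R² and MSE values gives summary statistics of the kind reported for each metric.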
                Our results are summarized in Table 3. For the mean, larger R² and smaller MSE are
             desirable; and for the standard deviation, smaller values are desirable. Taken as a whole, over the
             entire corpus of data, PL-based metrics are somewhat better for both the R² mean and standard
             deviation; and PL-based metrics are much better for MSE mean and standard deviation. These

                                                <<TABLE>>

             Table 3: Comparison of linear regression fits for different average Log Norm and Weighted Alpha
             metrics across 5 CV datasets, 17 architectures, covering 108 (out of over 400) different pretrained
             DNNs. We include regressions only for architectures with five or more data points, and which are
             positively correlated with test error. These results can be readily reproduced using the Google
             Colab notebooks (see Appendix A).


             (and other) results suggest our conclusions from Sections 4 and 5 hold much more generally, and
             they suggest obvious questions for future work.

             7 Conclusion

             We have developed (based on strong theory) and evaluated (on a large corpus of publicly-available
             pretrained models from CV and NLP) methods to predict trends in the quality of state-of-the-art
             neural networks, without access to training or testing data. Prior to our work, it was not obvious
             that norm-based metrics would perform well to predict trends in quality across models (as they
             are usually used within a given model or parameterized model class, e.g., to bound generalization
             error or to construct regularizers). Our results are the first to demonstrate that they can be used
             for this important practical problem. That PL-based metrics perform better (than norm-based
             metrics) should not be surprising, at least to those familiar with the statistical mechanics of
             heavy tailed and strongly correlated systems [10,11,12,13] (since our use of PL exponents is
             designed to capture the idea that well-trained models capture correlations over many size scales
             in the data). Again, though, our results are the first to demonstrate this. It is also gratifying
             that our approach can be used to provide fine-scale insight (such as rationalizing the flow of
             correlations or the collapse of size scale) throughout a network.
                We conclude with a few comments on what a practical theory of DNNs should look like. To do
             so, we distinguish between two types of theories: non-empirical or analogical theories, in which one
             creates, often from general principles, a very simple toy model that can be analyzed rigorously,
             and one then argues that the model is relevant to the system of interest; and semi-empirical
             theories, in which there exists a rigorous asymptotic theory, which comes with parameters, for
             the system of interest, and one then adjusts or fits those parameters to the finite non-asymptotic
             data. A drawback of the former approach is that it typically makes very strong assumptions
             on the data, and the strength of those assumptions can limit the practical applicability of the
             theory. Nearly all of the work on the theory of DNNs focuses on the former type of theory. Our
             approach focuses on the latter type of theory. Our results, which are based on using sophisticated
             statistical mechanics theory and solving important practical DNN problems, suggest that the
             latter approach should be of interest more generally for those interested in developing a practical
             DNN theory.

             Acknowledgements. MWM would like to acknowledge ARO, DARPA, NSF, and ONR as well
             as the UC Berkeley BDD project and a gift from Intel for providing partial support of this work.
             We would also like to thank Amir Khosrowshahi and colleagues at Intel for helpful discussion
             regarding the Group Regularization distillation technique.


                                                References

               [1] WeightWatcher, 2018. https://pypi.org/project/WeightWatcher/.
               [2] https://github.com/CalculatedContent/ww-trends-2020.
               [3] A. Engel and C. P. L. Van den Broeck. Statistical mechanics of learning. Cambridge University Press,
                   New York, NY, USA, 2001.
               [4] C. H. Martin and M. W. Mahoney. Rethinking generalization requires revisiting old ideas: statistical
                   mechanics approaches and complex learning behavior. Technical Report Preprint: arXiv:1710.09553,
                   2017.
               [5] Y. Bahri, J. Kadmon, J. Pennington, S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli. Statistical
                   mechanics of deep learning. Annual Review of Condensed Matter Physics, pages 000–000, 2020.
               [6] C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from
                   random matrix theory and implications for learning. Technical Report Preprint: arXiv:1810.01075,
                   2018.
               [7] C. H. Martin and M. W. Mahoney. Traditional and heavy-tailed self regularization in neural network
                   models. In Proceedings of the 36th International Conference on Machine Learning, pages 4284–4293,
                   2019.
               [8] C. H. Martin and M. W. Mahoney. Statistical mechanics methods for discovering knowledge from
                   modern production quality neural networks. In Proceedings of the 25th Annual ACM SIGKDD
                   Conference, pages 3239–3240, 2019.
               [9] C. H. Martin and M. W. Mahoney. Heavy-tailed Universality predicts trends in test accuracies for very
                   large pre-trained deep neural networks. In Proceedings of the 20th SIAM International Conference on
                   Data Mining, 2020.
              [10] J. P. Bouchaud and M. Potters. Theory of Financial Risk and Derivative Pricing: From Statistical
                   Physics to Risk Management. Cambridge University Press, 2003.
              [11] D. Sornette. Critical phenomena in natural sciences: chaos, fractals, selforganization and disorder:
                   concepts and tools. Springer-Verlag, Berlin, 2006.
              [12] J. P. Bouchaud and M. Potters. Financial applications of random matrix theory: a short review. In
                   G. Akemann, J. Baik, and P. Di Francesco, editors, The Oxford Handbook of Random Matrix Theory.
                   Oxford University Press, 2011.
              [13] J. Bun, J.-P. Bouchaud, and M. Potters. Cleaning large correlation matrices: tools from random
                   matrix theory. Physics Reports, 666:1–109, 2017.
              [14] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In
                   Proceedings of the 28th Annual Conference on Learning Theory, pages 1376–1401, 2015.
              [15] P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks.
                   Technical Report Preprint: arXiv:1706.08498, 2017.
              [16] Q. Liao, B. Miranda, A. Banburski, J. Hidary, and T. Poggio. A surprising linear relationship predicts
                   test performance in deep networks. Technical Report Preprint: arXiv:1807.09659, 2018.
              [17] C. H. Martin and M. W. Mahoney. Unpublished results, 2020.
              [18] O. Russakovsky et al. Imagenet large scale visual recognition challenge. International Journal of
                   Computer Vision, 115(3):211–252, 2015.
              [19] A. Paszke et al. Pytorch: An imperative style, high-performance deep learning library. In Annual
                   Advances in Neural Information Processing Systems 32: Proceedings of the 2019 Conference, pages
                   8024–8035, 2019.
              [20] Sandbox for training convolutional networks for computer vision.
                   https://github.com/osmr/imgclsmob.
              [21] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. Technical Report
                   Preprint: arXiv:1603.05027, 2016.
              [22] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for
                   deep neural networks. Technical Report Preprint: arXiv:1710.09282, 2017.
              [23] Intel Distiller package. https://nervanasystems.github.io/distiller.
              [24] A. Vaswani et al. Attention is all you need. Technical Report Preprint: arXiv:1706.03762, 2017.
              [25] T. Wolf et al. Huggingface’s transformers: State-of-the-art natural language processing. Technical
                   Report Preprint: arXiv:1910.03771, 2019.
              [26] OpenAI GPT-2: 1.5B Release. https://openai.com/blog/gpt-2-1-5b-release/.
              [27] H. Sedghi, V. Gupta, and P. M. Long. The singular values of convolutional layers. Technical Report
                   Preprint: arXiv:1805.10408, 2018.
              [28] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks.
                   In Proceedings of the 13th International Workshop on Artificial Intelligence and Statistics, pages
                   249–256, 2010.


             A Appendix

             In this appendix, we provide more details on several issues that are important for the reproducibility
             of our results. All of our computations were performed with the Weight Watcher tool (version
             0.2.7) [1]. More details and more results are available in an accompanying github repository [2].

             A.1 Reproducibility Considerations
             SVD of Convolutional 2D Layers. There is some ambiguity in performing spectral analysis
             on Conv2D layers. Each layer is a 4-index tensor of dimension (<<FORMULA>>), with a (w × h)
             filter (or kernel) and (in, out) channels. When w = h = k, it gives (k × k) tensor slices, or
             pre-Activation Maps W_{i,L} of dimension (in/out) each. We identify 3 different approaches for
             running SVD on a Conv2D layer:

               1. run SVD on each pre-Activation Map W_{i,L}, yielding (k × k) sets of M singular values;

               2. stack the maps into a single matrix of, say, dimension (<<FORMULA>>), and run SVD to
                  get in singular values;

               3. compute the 2D Fourier Transform (FFT) for each of the (in, out) pairs, and run SVD on
                  the Fourier coefficients [27], leading to <<FORMULA>> non-zero singular values.

             Each method has tradeoffs. Method (3) is mathematically sound, but computationally expensive.
             Method (2) is ambiguous. For our analysis, because we need thousands of runs, we select method
             (1), which is the fastest (and is easiest to reproduce).
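Method (1) can be sketched in a few lines of NumPy. The sketch assumes the kernel is stored in PyTorch's (out, in, k, k) layout, which is an assumption about storage order rather than something stated above, and the function name `conv2d_singular_values` is illustrative.

```python
import numpy as np

def conv2d_singular_values(kernel):
    """Run SVD on each pre-activation map of a Conv2D kernel.

    kernel: array of shape (out, in, k_h, k_w) (assumed PyTorch layout).
    Returns a list of k_h * k_w arrays, each holding min(out, in) singular values.
    """
    out_ch, in_ch, k_h, k_w = kernel.shape
    singular_values = []
    for i in range(k_h):
        for j in range(k_w):
            pre_activation_map = kernel[:, :, i, j]   # one (out x in) slice
            singular_values.append(np.linalg.svd(pre_activation_map, compute_uv=False))
    return singular_values
```

For a 3 × 3 kernel this yields nine small SVDs per layer, each on an (out × in) matrix, which is why the approach stays cheap even across thousands of runs.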

             Normalization of Empirical Matrices. Normalization is an important, if underappreciated,
             practical issue. Importantly, the normalization of weight matrices does not affect the PL fits
             because α is scale-invariant. Norm-based metrics, however, do depend strongly on the scale of the
             weight matrix; that is the point. To apply RMT, we usually define X with a <<FORMULA>> normalization,
             assuming variance of <<FORMULA>>. Pretrained DNNs are typically initialized with random weight
             matrices <<FORMULA>>, with <<FORMULA>>, or some variant, e.g., the Glorot/Xavier normalization [28],
             or <<FORMULA>> normalization for Convolutional 2D Layers. With this implicit scale, we do not
             renormalize the empirical weight matrices, i.e., we use them as-is. The only exception is that

                                                <<FORMULA>>

                    Table 4: Jupyter notebooks used to reproduce all results in Sections 4 and 5.
             pwe do rescale the Conv2D pre-activation mapsWi;L byk= 2 so that they are on the same scale
             as the Linear / Fully Connected (FC) layers.

             Special consideration for NLP models. NLP models, and other models with large initial
             embeddings, require special care because the embedding layers frequently lack the implicit <<FORMULA>>
             normalization present in other layers. For example, in GPT, for most layers, the maximum
             eigenvalue <<FORMULA>>, but in the first embedding layer, the maximum eigenvalue is of
             order N (the number of words in the embedding), or <<FORMULA>>. For GPT and GPT2, we
             treat all layers as-is (although one may want to normalize the first 2 layers X by <<FORMULA>>, or to treat
             them as outliers).

             A.2 Reproducing Sections 4 and 5

             We provide a github repository for this paper that includes Jupyter notebooks that fully reproduce
             all results (as well as many other results) [2]. All results have been produced using the Weight-
             Watcher tool (v0.2.7) [1]. The ImageNet and OpenAI GPT pretrained models are provided in the
             current pyTorch [19] and Huggingface [25] distributions, as specified in the requirements.txt file.

             A.3 Reproducing Figure 4, for the Distiller Model

             In the distiller folder of our github repo, we provide the original Jupyter Notebooks, which use
             the Intel distiller framework [23]. Figure 4 is from the "...-Distiller-ResNet20.ipynb"
             notebook (see Table 4). For completeness, we provide both the results described here, as well as
             additional results on other pretrained and distilled models using the Weight Watcher tool.

             A.4 Reproducing Table 3 in Section 6

             In the ww-colab folder of our github repo, we provide several Google Colab notebooks which can
             be used to reproduce the results of Section 6. The ImageNet-1K and other pretrained models are
             taken from the pytorch models in the osmr/imgclsmob "Sandbox for training convolutional
             networks for computer vision" github repository [20]. The data for each regression can be generated
             in parallel by running each Google Colab notebook (i.e., wwcolab0100.ipynb) simultaneously
             on the same account. The data generated are analyzed with ww colabresults.ipynb, which
             runs all regressions and which tabulates the results presented in Table 3.
                We attempt to run linear regressions for all pyTorch models for each architecture series for
             all datasets provided. There are over 450 models in all, and we note that the osmr/imgclsmob
             repository is constantly being updated with new models. We omit the results for CUB-200-2011,
             Pascal-VOC2012, ADE20K, and COCO datasets, as there are fewer than 15 models for those
             datasets. Also, we filter out regressions with fewer than 5 datapoints.
                We remove the following outliers, as identified by visual inspection: efficient b0, b2. We
             also remove the entire cifar100 ResNeXT series, which is the only example to show no trends
             with the norm metrics. The final datasets used are shown in Table 5. The final architecture series
             used are shown in Table 6, with the number of models in each.

                                                <<TABLE>>

                                        Table 5: Datasets used

                                              <<TABLE>>

                                      Table 6: Architectures used
                To explain further how to reproduce our analysis, we run three batches of linear regressions.
             First, at the global level, we divide models by datasets and run regressions separately on all
             models of a certain dataset, regardless of the architecture. At this level, the plots are quite
             noisy and clustered, as each architecture has its own accuracy trend; but one can still see that
             most plots show a positive relationship with positive coefficients. Example regressions are shown
             in Figure 8, as available in the results notebook.
                To generate the results in Table 3, we run linear regressions for each architecture series in
             Table 6, regressing each empirical Log Norm metric against the reported Top1 (and Top5) errors
             (as listed on the osmr/imgclsmob github repository README file [20], with the relevant data
             extracted and provided in our github repo as pytorchcv.html). We record the R2 and MSE
             for each metric, averaged over all regressions for all architectures and datasets. See Table 7 and
             Table 8. In the repo, plots are provided for every regression, and more fine grained results may
             be computed by the reader by analyzing the data in the df all.xlsx file. The final analysis
             includes 108 regressions in all, those with 4 or more models, with a positive R2.
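
             The per-series regression step can be illustrated with a minimal sketch. It is not the code from the accompanying notebooks: the function name `fit_line` and the five data points are hypothetical, and ordinary least squares with a single predictor is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys against xs; returns slope, intercept, R2 and MSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot   # fraction of variance explained by the line
    mse = ss_res / n             # mean squared error of the fit
    return slope, intercept, r2, mse

# Hypothetical (metric, test error) pairs for one architecture series.
metric = [1.0, 2.0, 3.0, 4.0, 5.0]
error = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r2, mse = fit_line(metric, error)
print(round(slope, 2), round(r2, 3), round(mse, 4))
```

             Averaging the resulting R2 and MSE values over all series gives summary numbers of the kind reported in the tables.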

                                                <<TABLE>>

                                       Table 7: MSE Results for all CV model regressions.

                                                          <<TABLE>>

                                         Table 8: R2 Results for all CV model regressions.

                                                                <<FIGURE>>

             Figure 8: PL exponent α versus reported Top1 Test Accuracies for pretrained DNNs available
             for five different data sets.
<|endoftext|>


<|startoftext|>
                        Pruning neural networks without any data by iteratively conserving synaptic ﬂow

                             Hidenori Tanaka                     Daniel Kunin 
                       Physics & Informatics Laboratories          Institute for Computational and
                            NTT Research, Inc.                 Mathematical Engineering
                         Department of Applied Physics               Stanford University

                      Stanford University

                            Daniel L. K. Yamins                   Surya Ganguli
                          Department of Psychology            Department of Applied Physics
                        Department of Computer Science              Stanford University
                            Stanford University

                                               Abstract

                    Pruning the parameters of deep neural networks has generated intense interest due to potential
                    savings in time, memory and energy both during training and at test time. Recent works
                    have identiﬁed, through an expensive sequence of training and pruning cycles, the existence
                    of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a
                    foundational question: can we identify highly sparse trainable subnetworks at initialization,
                    without ever training, or indeed without ever looking at the data? We provide an afﬁrmative
                    answer to this question through theory driven algorithm design. We ﬁrst mathematically
                    formulate and experimentally verify a conservation law that explains why existing gradient-
                    based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of
                    an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse
                    can be entirely avoided, motivating a novel pruning algorithm, Iterative Synaptic Flow Pruning
                    (SynFlow). This algorithm can be interpreted as preserving the total ﬂow of synaptic strengths
                    through the network at initialization subject to a sparsity constraint. Notably, this algorithm
                    makes no reference to the training data and consistently outperforms existing state-of-the-art
                    pruning algorithms at initialization over a range of models (VGG and ResNet), datasets
                    (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to 99.9 percent). Thus our
                    data-agnostic pruning algorithm challenges the existing paradigm that data must be used to
                    quantify which synapses are important.


              1 Introduction

              Network pruning, or the compression of neural networks by removing parameters, has been an important subject
              both for reasons of practical deployment [1,2,3,4,5,6,7] and for theoretical understanding of artiﬁcial [8] and
              biological [9] neural networks. Conventionally, pruning algorithms have focused on compressing pre-trained
              models [1,2,3,5,6]. However, recent works [10,11] have identiﬁed through iterative training and pruning
              cycles (iterative magnitude pruning) that there exist sparse subnetworks (winning tickets) in randomly-initialized
              neural networks that, when trained in isolation, can match the test accuracy of the original network. Moreover,
              it's been shown that some of these winning ticket subnetworks can generalize across datasets and optimizers
              [12]. While these results suggest training can be made more efﬁcient by identifying winning ticket subnetworks
              at initialization, they do not provide efﬁcient algorithms to ﬁnd them. Typically, it requires signiﬁcantly more
              computational costs to identify winning tickets through iterative training and pruning cycles than simply training
              the original network from scratch [10,11]. Thus, the fundamental unanswered question is: can we identify
              highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the
              data? Towards this goal, we start by investigating the limitations of existing pruning algorithms at initialization
              [13,14], determine simple strategies for avoiding these limitations, and provide a novel data-agnostic algorithm
              that improves upon state-of-the-art results. Our main contributions are:
                  1. We study layer-collapse, the premature pruning of an entire layer making a network untrainable, and
                    formulate the axiom Maximal Critical Compression that posits a pruning algorithm should avoid
                    layer-collapse whenever possible (Sec. 3).
                  2. We demonstrate theoretically and empirically that synaptic saliency, a general class of gradient-based
                    scores for pruning, is conserved at every hidden unit and layer of a neural network (Sec. 4).
                  3. We show that these conservation laws imply parameters in large layers receive lower scores than
                    parameters in small layers, which elucidates why single-shot pruning disproportionately prunes the
                    largest layer leading to layer-collapse (Sec. 4).
                  4. We hypothesize that iterative magnitude pruning [10] avoids layer-collapse because gradient descent
                    effectively encourages the magnitude scores to observe a conservation law, which combined with
                    iteration results in the relative scores for the largest layers increasing during pruning (Sec. 5).
                  5. We prove that a pruning algorithm avoids layer-collapse entirely and satisfies Maximal Critical
                    Compression if it uses iterative, positive synaptic saliency scores (Sec. 6).
                  6. We introduce a new data-agnostic algorithm, Iterative Synaptic Flow Pruning (SynFlow), that satisfies
                    Maximal Critical Compression (Sec. 6) and demonstrate empirically² that this algorithm achieves
                    state-of-the-art pruning performance on 12 distinct combinations of models and datasets (Sec. 7).

              2 Related work

              While there are a variety of approaches to compressing neural networks, such as novel design of micro-
              architectures [15,16,17], dimensionality reduction of network parameters [18,19], and training of dynamic
              sparse networks [20, 21], in this work we will focus on neural network pruning.
              Pruning after training. Conventional pruning algorithms assign scores to parameters in neural networks after
              training and remove the parameters with the lowest scores [5,22,23]. Popular scoring metrics include weight
              magnitudes [4,6], its generalization to multi-layers [24], ﬁrst- [1,25,26,27] and second-order [2,3,27] Taylor
              coefﬁcients of the training loss with respect to the parameters, and more sophisticated variants [28,29,30].
              While these pruning algorithms can indeed compress neural networks at test time, there is no reduction in the
              cost of training.
              Pruning before training. Recent works demonstrated that randomly initialized neural networks can be pruned
              before training with little or no loss in the ﬁnal test accuracy [10,13,31]. In particular, the Iterative Magnitude
              Pruning (IMP) algorithm [10,11] repeats multiple cycles of training, pruning, and weight rewinding to identify
              extremely sparse neural networks at initialization that can be trained to match the test accuracy of the original
              network. While IMP is powerful, it requires multiple cycles of expensive training and pruning with very speciﬁc
              sets of hyperparameters. Avoiding these difﬁculties, a different approach uses the gradients of the training
              loss at initialization to prune the network in a single-shot [13,14]. While these single-shot pruning algorithms
              at initialization are much more efﬁcient, and work as well as IMP at moderate levels of sparsity, they suffer
              from layer-collapse, or the premature pruning of an entire layer rendering a network untrainable [32,33].
              Understanding and circumventing this layer-collapse issue is the fundamental motivation for our study.

              3 Layer-collapse: the key obstacle to pruning at initialization

              Broadly speaking, a pruning algorithm at initialization is deﬁned by two steps. The ﬁrst step scores the
              parameters of a network according to some metric and the second step masks the parameters (removes or
              keeps the parameter) according to their scores. The pruning algorithms we consider will always mask the
              parameters by simply removing the parameters with the smallest scores. This ranking process can be applied
              globally across the network, or layer-wise. Empirically, it's been shown that global-masking performs far better
              than layer-masking, in part because it introduces fewer hyperparameters and allows for flexible pruning rates
              across the network [23]. However, recent works [32,14,33] have identified a key failure mode, layer-collapse,
              for existing pruning algorithms using global-masking. Layer-collapse occurs when an algorithm prunes all
              parameters in a single weight layer even when prunable parameters remain elsewhere in the network. This
              renders the network untrainable, evident by sudden drops in the achievable accuracy for the network as shown in
              Fig. 1. To gain insight into the phenomenon of layer-collapse we will define some useful terms inspired by a
              recent paper studying the failure mode [33].
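
              The score-then-mask procedure, and the way global-masking can empty a layer, can be illustrated with a small sketch. The function names, the toy scores, and the choice to keep at least one parameter per layer in the layer-wise variant are all assumptions made for illustration; they are not taken from any of the cited implementations.

```python
def global_mask(scores, n_keep):
    """Keep the n_keep highest-scoring parameters across all layers.

    `scores` maps a layer name to a list of per-parameter scores.
    Returns a dict of boolean lists (True = parameter kept).
    """
    flat = [(s, name, i) for name, layer in scores.items() for i, s in enumerate(layer)]
    flat.sort(key=lambda t: t[0], reverse=True)
    mask = {name: [False] * len(layer) for name, layer in scores.items()}
    for _, name, i in flat[:n_keep]:
        mask[name][i] = True
    return mask

def layerwise_mask(scores, keep_fraction):
    """Keep the same fraction of parameters in every layer (at least one, by assumption)."""
    mask = {}
    for name, layer in scores.items():
        n_keep = max(1, round(keep_fraction * len(layer)))
        order = sorted(range(len(layer)), key=lambda i: layer[i], reverse=True)
        kept = set(order[:n_keep])
        mask[name] = [i in kept for i in range(len(layer))]
    return mask

def collapsed_layers(mask):
    """Names of layers in which no parameter survived."""
    return [name for name, m in mask.items() if not any(m)]

# Toy scores: every parameter of "fc" scores lower than every parameter of "conv1".
scores = {
    "conv1": [0.9, 0.8, 0.7, 0.6],
    "fc": [0.05, 0.04, 0.03, 0.02, 0.01, 0.005, 0.004, 0.003],
}
g = global_mask(scores, n_keep=3)
lw = layerwise_mask(scores, keep_fraction=0.25)
print(collapsed_layers(g), collapsed_layers(lw))
```

              With these toy scores, keeping three parameters globally removes the whole "fc" layer, while the layer-wise rule keeps every layer populated at the cost of an extra per-layer rate.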

                                                   <<FIGURE>>

              Figure 1: Layer-collapse leads to a sudden drop in accuracy. Top-1 test accuracy as a function of the
              compression ratio for a VGG-16 model pruned at initialization and trained on CIFAR-100. Colored arrows
              represent the critical compression of the corresponding pruning algorithm. Only our algorithm, SynFlow,
              reaches the theoretical limit of max compression (black dashed line) without collapsing the network. See
              Sec. 7 for more details on the experiments.

              Given a network, compression ratio (<<FORMULA>>) is the number of parameters in the original network divided
              by the number of parameters remaining after pruning. For example, when the compression ratio <<FORMULA>>,
              then only one out of a thousand of the parameters remain after pruning. Max compression (<<FORMULA>>) is the
              maximal possible compression ratio for a network that doesn't lead to layer-collapse. For example, for a
              network with L layers and N parameters, <<FORMULA>>, which is the compression ratio associated with pruning
              all but one parameter per layer. Critical compression (<<FORMULA>>) is the maximal compression ratio a given
              algorithm can achieve without inducing layer-collapse. In particular, the critical compression of an
              algorithm is always upper bounded by the max compression of the network: <<FORMULA>>. This inequality
              motivates the following axiom we postulate any successful pruning algorithm should satisfy.
              Axiom. Maximal Critical Compression. The critical compression of a pruning algorithm applied to a network
              should always equal the max compression of that network.
              In other words, this axiom implies a pruning algorithm should never prune a set of parameters that results
              in layer-collapse if there exists another set of the same cardinality that will keep the network trainable.
              To the best of our knowledge, no existing pruning algorithm with global-masking satisfies this simple
              axiom. Of course any pruning algorithm could be modiﬁed to satisfy the axiom by introducing specialized
              layer-wise pruning rates. However, to retain the beneﬁts of global-masking [23], we will formulate an algorithm,
              Iterative Synaptic Flow Pruning (SynFlow), which satisﬁes this property by construction. SynFlow is a natural
              extension of magnitude pruning, that preserves the total ﬂow of synaptic strengths from input to output rather
              than the individual synaptic strengths themselves. We will demonstrate that not only does the SynFlow
              algorithm achieve Maximal Critical Compression, but it consistently outperforms existing state-of-the-art pruning
              algorithms (as shown in Fig. 1 and in Sec. 7), all while not using the data.
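
              The three quantities defined in this section can be made concrete with a short sketch. The function names and the toy scores are hypothetical, scores are assumed to be distinct, and the critical compression is computed for a global-masking rule only.

```python
def compression_ratio(n_original, n_remaining):
    """Parameters in the original network divided by parameters left after pruning."""
    return n_original / n_remaining

def max_compression(layer_sizes):
    """Ratio reached when exactly one parameter survives in every layer."""
    return sum(layer_sizes) / len(layer_sizes)

def critical_compression(scores):
    """Largest ratio a global-masking rule reaches before some layer is emptied.

    `scores` is a list of per-layer score lists with distinct values. Under global
    masking the top-n parameters survive, so every layer is still populated exactly
    when n is at least the global rank of each layer's best parameter.
    """
    flat = sorted((s for layer in scores for s in layer), reverse=True)
    ranks = [flat.index(max(layer)) + 1 for layer in scores]  # 1-based global ranks
    return compression_ratio(len(flat), max(ranks))

# Toy scores for three layers of sizes 2, 4 and 2.
scores = [[5.0, 4.0], [3.0, 0.5, 0.4, 0.3], [0.2, 0.1]]
rho_cr = critical_compression(scores)
rho_max = max_compression([len(layer) for layer in scores])
print(rho_cr, rho_max)
```

              Here the best parameter of the last layer is ranked seventh of eight, so this scoring collapses a layer long before one parameter per layer is reached; the axiom asks for the two printed numbers to coincide.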
              Throughout this work, we benchmark our algorithm, SynFlow, against two simple baselines, random scoring
              and scoring based on weight magnitudes, as well as two state-of-the-art single-shot pruning algorithms, Single-
              shot Network Pruning based on Connection Sensitivity (SNIP) [13] and Gradient Signal Preservation (GraSP)
              [14]. SNIP [13] is a pioneering algorithm to prune neural networks at initialization by scoring weights based
              on the gradients of the training loss. GraSP [14] is a more recent algorithm that aims to preserve gradient
              ﬂow at initialization by scoring weights based on the Hessian-gradient product. Both SNIP and GraSP have
              been thoroughly benchmarked by [14] against other state-of-the-art pruning algorithms that involve training
              [2, 34, 10, 11, 35, 21, 20], demonstrating competitive performance.

              4 Conservation laws of synaptic saliency

              In this section, we will further verify that layer-collapse is a key obstacle to effective pruning at initialization
              and explore what is causing this failure mode. As shown in Fig. 2, with increasing compression ratios, existing
              random, magnitude, and gradient-based pruning algorithms will prematurely prune an entire layer making the
              network untrainable. Understanding why certain score metrics lead to layer-collapse is essential to improve the
              design of pruning algorithms.
              Random pruning prunes every layer in a network by the same amount, evident by the horizontal lines in
              Fig. 2. With random pruning the smallest layer, the layer with the fewest parameters, is the first to be fully
              pruned. Conversely, magnitude pruning prunes layers at different rates, evident by the staircase pattern in Fig. 2.
              Magnitude pruning effectively prunes parameters based on the variance of their initialization, which for common
              network initializations, such as Xavier [36] or Kaiming [37], are inversely proportional to the width of a layer
              [33]. With magnitude pruning the widest layers, the layers with largest input or output dimensions, are the

                                                  <<FIGURE>>
                                                   
              Figure 2: Where does layer-collapse occur? Fraction of parameters remaining at each layer of a VGG-19
              model pruned at initialization with ImageNet over a range of compression ratios (<<FORMULA>>). A
              higher transparency represents a higher compression ratio. A dashed line indicates that there is at least one layer
              with no parameters, implying layer-collapse has occurred.


              ﬁrst to be fully pruned. Gradient-based pruning algorithms SNIP [13] and GraSP [14] also prune layers at
              different rates, but it is less clear what the root cause for this preference is. In particular, both SNIP and GraSP
              aggressively prune the largest layer, the layer with the most trainable parameters, evident by the sharp peaks
              in Fig. 2. Based on this observation, we hypothesize that gradient-based scores averaged within a layer are
              inversely proportional to the layer size. We examine this hypothesis by constructing a theoretical framework
              grounded in ﬂow networks. We ﬁrst deﬁne a general class of gradient-based scores, prove a conservation law for
              these scores, and then use this law to prove that our hypothesis of inverse proportionality between layer size and
              average layer score holds exactly.
              A general class of gradient-based scores. Synaptic saliency is any score metric that can be expressed as the
              Hadamard product
                                                  <<FORMULA>>                            (1)

              where R is a scalar loss function of the output of a feed-forward network parameterized by θ. When R is the
              training loss L, the resulting synaptic saliency metric is equivalent (modulo sign) to <<FORMULA>>, the score metric
              used in Skeletonization [1], one of the first network pruning algorithms. The resulting metric is also closely
              related to <<FORMULA>> the score used in SNIP [13], <<FORMULA>> the score used in GraSP, and <<FORMULA>> the
              score used in the pruning after training algorithm Taylor-FO [27]. This general class of score metrics, while not
              encompassing, exposes key properties of gradient-based scores used for pruning.
              The conservation of synaptic saliency. All synaptic saliency metrics respect two surprising conservation laws
              that hold at any initialization and step in training.
              Theorem 1. Neuron-wise Conservation of Synaptic Saliency. For a feedforward neural network with homogeneous
              activation functions, <<FORMULA>>, (e.g. ReLU, Leaky ReLU, linear), the sum of the synaptic saliency for
              the incoming parameters to a hidden neuron (S^in) is equal to the sum of the synaptic saliency for the outgoing
              parameters from the hidden neuron (S^out).

              Proof. Consider the jth hidden neuron of a network with outgoing parameters θ^out and incoming parameters
              <<FORMULA>>, such that <<FORMULA>> and <<FORMULA>>. The sum of the synaptic saliency for the outgoing
              parameters is
                                                            <<FORMULA>>      (2)

              The sum of the synaptic saliency for the incoming parameters is

                                             <<FORMULA>>         (3)

              When φ is homogeneous, then <<FORMULA>>
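
              The neuron-wise statement can be checked numerically on a toy network. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the saliency of a parameter to be the parameter times the derivative of R with respect to it (the Hadamard-product form described above), counts the bias among the incoming parameters, uses a squared-error R, and estimates derivatives by central differences. All names and values are hypothetical.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def loss(params, x, target):
    """R for a toy net: inputs -> one hidden ReLU unit -> two linear outputs."""
    w_in, b, w_out = params["w_in"], params["b"], params["w_out"]
    h = relu(sum(w * xi for w, xi in zip(w_in, x)) + b[0])
    return 0.5 * sum((wo * h - t) ** 2 for wo, t in zip(w_out, target))

def saliency_sum(params, group, x, target, eps=1e-6):
    """Sum of theta * dR/dtheta over one parameter group (central differences)."""
    total = 0.0
    for i, theta in enumerate(params[group]):
        params[group][i] = theta + eps
        up = loss(params, x, target)
        params[group][i] = theta - eps
        down = loss(params, x, target)
        params[group][i] = theta  # restore the original value
        total += theta * (up - down) / (2 * eps)
    return total

params = {"w_in": [0.7, -0.3], "b": [0.2], "w_out": [1.5, -0.8]}
x, target = [1.0, 2.0], [0.5, 1.0]

# Incoming parameters (weights and bias) versus outgoing weights of the hidden unit.
s_in = saliency_sum(params, "w_in", x, target) + saliency_sum(params, "b", x, target)
s_out = saliency_sum(params, "w_out", x, target)
print(s_in, s_out)
```

              The two sums agree up to finite-difference error because ReLU is homogeneous; the pre-activation is chosen away from zero so that the perturbations do not cross the kink.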

                                                <<FIGURE>>

              Figure 3: Neuron-wise conservation of score. Each dot represents a hidden unit from the feature-extractor of a
              VGG-19 model pruned at initialization with ImageNet. The location of each dot corresponds to the total score
              for the unit’s incoming and outgoing parameters, <<FORMULA>>. The black dotted line represents exact neuron-wise
              conservation of score.

                      <<FORMULA>>

              Figure 4: Inverse relationship between layer size and average layer score. Each dot represents a layer from
              a VGG-19 model pruned at initialization with ImageNet. The location of each dot corresponds to the layer's
              average score⁴ and inverse number of elements. The black dotted line represents a perfect linear relationship.

              The neuron-wise conservation of synaptic saliency implies network conservation as well.
              Theorem 2. Network-wise Conservation of Synaptic Saliency. The sum of the synaptic saliency across any
              set of parameters that exactly³ separates the input neurons x from the output neurons y of a feedforward neural
              network with homogeneous activation functions equals <<FORMULA>>
              We prove this theorem in Appendix 10 by applying the neuron-wise conservation law recursively. Similar
              conservation properties have been noted in the neural network interpretability literature and have motivated the
              construction of interpretability methods such as Conductance [38] and Layer-wise Relevance Propagation [39],
              which have recently been modiﬁed for network pruning [9,40]. While the interpretability literature has focused
              on attribution to the input pixels and hidden neuron activations, we have formulated conservation laws that are
              more general and applicable to any parameter and neuron in a network. Remarkably, these conservation laws of
              synaptic saliency apply to modern neural network architectures and a wide variety of neural network layers (e.g.
              dense, convolutional, batchnorm, pooling, residual) as visually demonstrated in Fig. 3.
              Conservation and single-shot pruning leads to layer-collapse. The conservation laws of synaptic saliency
              provide us with the theoretical tools to validate our earlier hypothesis of inverse proportionality between layer
              size and average layer score as a root cause for layer-collapse of gradient-based pruning methods. Consider the
              set of parameters in a layer of a simple, fully connected neural network. This set would exactly separate the input
              neurons from the output neurons. Thus, by the network-wise conservation of synaptic saliency (Theorem 2), the
              total score for this set is constant for all layers, implying the average is inversely proportional to the layer size.
              We can empirically evaluate this relationship at scale for existing pruning methods by computing the total score
              for each layer of a model, as shown in Fig. 4. While this inverse relationship is exact for synaptic saliency, other
              closely related gradient-based scores, such as the scores used in SNIP and GraSP, also respect this relationship.
              This validates the empirical observation that for a given compression ratio, gradient-based pruning methods will
              disproportionately prune the largest layers. Thus, if the compression ratio is large enough and the pruning score
              is only evaluated once, then a gradient-based pruning method will completely prune the largest layer leading to
              layer-collapse.
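
              The argument of this paragraph can be worked through with toy numbers. The sketch below is an idealization: the layer sizes, the constant total score of 1.0 per layer, and the assumption that every parameter carries exactly its layer's average score are all hypothetical.

```python
layer_sizes = {"small": 10, "medium": 100, "large": 1000}
total_score_per_layer = 1.0  # hypothetical constant implied by network-wise conservation

# Constant total per layer means the average score is inversely proportional to size.
avg_scores = {name: total_score_per_layer / n for name, n in layer_sizes.items()}

def remaining_after_global_pruning(sizes, averages, n_prune):
    """Prune the n_prune lowest-scoring parameters in one shot.

    Every parameter is assumed to carry its layer's average score, so layers are
    emptied in order of increasing average.
    """
    remaining = dict(sizes)
    for name in sorted(sizes, key=lambda n: averages[n]):  # lowest average first
        removed = min(remaining[name], n_prune)
        remaining[name] -= removed
        n_prune -= removed
    return remaining

lowest = min(avg_scores, key=avg_scores.get)
after = remaining_after_global_pruning(layer_sizes, avg_scores, n_prune=1000)
print(lowest, after)
```

              Under these assumptions the largest layer has the lowest average score and is emptied once 1000 of the 1110 parameters are pruned, while both smaller layers are still fully intact.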

                3 Every element of the set is needed to separate the input neurons from the output neurons.
                4 For GraSP we negated the average layer score so that we could plot on a log-log plot.

              5 Magnitude pruning avoids layer-collapse with conservation and iteration

              Having demonstrated and investigated the cause of layer-collapse
              in single-shot pruning methods at initialization, we now explore   
              an iterative pruning method that appears to avoid the issue
              entirely. Iterative Magnitude Pruning (IMP) is a recently proposed   
              pruning algorithm that has proven to be successful in ﬁnding
              extremely sparse trainable neural networks at initialization  
              (winning lottery tickets) [10,11,12,41,42,43,44]. The algorithm           
              follows three simple steps. First train a network, second prune  
              parameters with the smallest magnitude, third reset the unpruned
              parameters to their initialization and repeat until the desired     
              compression ratio. While simple and powerful, IMP is impractical as
              it involves training the network several times, essentially defeating       <<FIGURE>>
              the purpose of constructing a sparse initialization. That being                     
              said it does not suffer from the same catastrophic layer-collapse
              that other pruning at initialization methods are susceptible to.
              Thus, understanding better how IMP avoids layer-collapse might
              shed light on how to improve pruning at initialization.
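
              The three steps of IMP can be sketched as a loop. This is a schematic under stated assumptions rather than the algorithm of [10,11] as implemented there: `train_fn` is a hypothetical user-supplied training routine, the geometric schedule for the number of parameters kept per cycle is an assumption, and the toy "training" below simply sets surviving parameters to fixed targets.

```python
def iterative_magnitude_pruning(init_params, train_fn, n_final, n_iterations):
    """Return a keep-mask found by repeated train / prune / reset cycles.

    train_fn(params, mask) is a user-supplied (hypothetical) training routine that
    returns trained values; pruned entries are passed in as 0.0.
    """
    n = len(init_params)
    mask = [True] * n
    for it in range(1, n_iterations + 1):
        # Reset the surviving parameters to their initial values, then train.
        start = [p if keep else 0.0 for p, keep in zip(init_params, mask)]
        trained = train_fn(start, mask)
        # Prune the smallest-magnitude survivors. The geometric schedule for the
        # number kept per cycle is an assumption of this sketch.
        n_keep = max(n_final, round(n * (n_final / n) ** (it / n_iterations)))
        alive = sorted((i for i in range(n) if mask[i]),
                       key=lambda i: abs(trained[i]), reverse=True)
        for i in alive[n_keep:]:
            mask[i] = False
    return mask

# Toy stand-in for training: surviving parameters jump straight to fixed targets.
targets = [0.1, -2.0, 0.5, 3.0, -0.05, 1.0]

def toy_train(params, mask):
    return [t if keep else 0.0 for t, keep in zip(targets, mask)]

mask = iterative_magnitude_pruning([0.3] * 6, toy_train, n_final=2, n_iterations=2)
print(mask)  # [False, True, False, True, False, False]
```

              The training call sits inside the loop, which is what makes the procedure expensive: every pruning cycle pays for a full training run.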
              As has been noted previously [10,11], iteration is essential for
              stabilizing IMP. In fact, without sufficient pruning iterations, IMP
              will suffer from layer-collapse, evident in the sudden accuracy
              drops for the darker curves in Fig. 5a. However, the number of
              pruning iterations alone cannot explain IMP's success at avoiding
              layer-collapse. Notice that if IMP didn't train the network during
              each prune cycle, then, no matter the number of pruning iterations,
              it would be equivalent to single-shot magnitude pruning.
              Thus, something very critical must happen to the magnitude of 
              the parameters during training, that when coupled with sufﬁcient 
              pruning iterations allows IMP to avoid layer-collapse. We 
              hypothesize that gradient descent training effectively encourages 
              the scores to observe an approximate layer-wise conservation 
              law, which when coupled with sufﬁcient pruning iterations allows 
              IMP to avoid layer-collapse.                         
              Gradient descent encourages conservation. To better understand the dynamics of the IMP algorithm during
              training, we will consider a differentiable score <<FORMULA>> algorithmically equivalent to the magnitude score. 
              Consider these scores throughout training with gradient descent on a loss function L using an inﬁnitesimal step
              size (i.e. gradient ﬂow). In this setting, the temporal derivative of the parameters is equivalent to <<FORMULA>>,
              and thus the temporal derivative of the score is                             

                                        <<FORMULA>>                                 (4)
                                                         
              Surprisingly, this is a form of synaptic saliency and thus the neuron-wise and layer-wise conservation laws
              from Sec. 4 apply. In particular, this implies that for any two layers l and k of a simple, fully connected
              network, <<FORMULA>>. This invariance has been noticed before by [45] as a form of implicit 
              regularization and used to explain the empirical phenomenon that trained multi-layer models can have similar
              layer-wise magnitudes. In the context of pruning, this phenomenon implies that gradient descent training, with a
              small enough learning rate, encourages the squared magnitude scores to converge to an approximate layer-wise
              conservation, as shown in Fig. 5b.
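              The conservation property can be checked numerically on a toy problem. The sketch below assumes a
              hypothetical two-layer scalar network f(x) = w2 * w1 * x with a squared-error loss; neither the network
              nor the hyperparameters come from the paper. With a small learning rate, the difference of squared
              magnitudes w1^2 - w2^2 should stay nearly constant while the loss is minimized.

```python
def train_two_layer_scalar(w1, w2, x, target, lr, steps):
    """Gradient descent on L = 0.5 * (w2 * w1 * x - target) ** 2."""
    for _ in range(steps):
        err = w2 * w1 * x - target
        g1 = err * w2 * x  # dL/dw1
        g2 = err * w1 * x  # dL/dw2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

w1_0, w2_0 = 1.5, 0.5
w1_T, w2_T = train_two_layer_scalar(w1_0, w2_0, x=1.0, target=2.0, lr=1e-3, steps=5000)

# Under gradient flow, w1 * dL/dw1 == w2 * dL/dw2, so this gap is conserved;
# with a finite step size it drifts only at second order in the learning rate.
gap_before = w1_0 ** 2 - w2_0 ** 2
gap_after = w1_T ** 2 - w2_T ** 2
print(gap_before, gap_after)
```

              Both layers receive the same increment in squared magnitude at every step, which is the layer-wise
              balancing effect described above.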
              Conservation and iterative pruning avoid layer-collapse. As explained in Section 4, conservation alone
              leads to layer-collapse by assigning parameters in the largest layers lower scores relative to parameters in
              smaller layers. However, if conservation is coupled with iterative pruning, then as the largest layer is pruned
              and becomes smaller, in subsequent iterations the remaining parameters of this layer will be assigned higher
              relative scores. With sufﬁcient iterations, conservation coupled with iteration leads to a self-balancing pruning
              strategy allowing IMP to avoid layer-collapse. This insight on the importance of conservation and iteration
              applies more broadly to other algorithms with exact or approximate conservation properties (e.g. Skeletonization,
              SNIP, and GraSP as demonstrated in Sec. 3). Indeed, very recent work empirically conﬁrms that iteration
              improves the performance of SNIP [46].

              6 A data-agnostic algorithm satisfying Maximal Critical Compression

              In the previous section we identiﬁed two key ingredients of IMP’s ability to avoid layer-collapse: (i) approximate
              layer-wise conservation of the pruning scores, and (ii) the iterative re-evaluation of these scores. While these
              properties allow the IMP algorithm to identify high performing and highly sparse, trainable neural networks,
              it requires an impractical amount of computation to obtain them. Thus, we aim to construct a more efﬁcient
              pruning algorithm while still inheriting the key aspects of IMP’s success. So what are the essential ingredients
              for a pruning algorithm to avoid layer-collapse and provably attain Maximal Critical Compression? We prove
              the following theorem in Appendix 10.
              Theorem 3. Iterative, positive, conservative scoring achieves Maximal Critical Compression. If a pruning
              algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the algorithm
              re-evaluates the scores every time a parameter is pruned, then the algorithm satisﬁes the Maximal Critical
              Compression axiom.
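              A toy simulation can illustrate the statement of Theorem 3. The sketch below is an assumption-laden
              example rather than the paper's algorithm: each parameter gets a hypothetical random positive
              importance, scores are renormalized so that every layer's scores sum to the same total (layer-wise
              conservation), and the globally lowest-scoring parameter is pruned with scores re-evaluated after
              every removal.

```python
import random

def conservative_scores(layers, total=1.0):
    """Positive scores whose sum within every non-empty layer equals `total`."""
    scores = {}
    for l, raw in enumerate(layers):
        z = sum(raw.values())
        for i, r in raw.items():
            scores[(l, i)] = total * r / z
    return scores

def prune_one_at_a_time(layer_sizes, n_prune, seed=0):
    rng = random.Random(seed)
    # Hypothetical raw positive importances for each parameter.
    layers = [{i: rng.uniform(0.1, 1.0) for i in range(n)} for n in layer_sizes]
    history = []
    for _ in range(n_prune):
        scores = conservative_scores(layers)  # re-evaluate after every prune
        l, i = min(scores, key=scores.get)    # global masking: lowest score overall
        del layers[l][i]
        history.append([len(layer) for layer in layers])
    return history

sizes = [50, 5, 20]
history = prune_one_at_a_time(sizes, n_prune=sum(sizes) - len(sizes))
print(history[-1])
```

              The last parameter of a layer carries that layer's entire score, so it can never be the global minimum
              while another layer still holds more than one parameter. In this toy setting, pruning therefore
              continues until every layer is down to a single parameter.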

              The Iterative Synaptic Flow Pruning (SynFlow) algorithm

              Theorem 3 directly motivates the design of our novel pruning algorithm, SynFlow, that provably reaches Maximal
              Critical Compression. First, the necessity for iterative score evaluation discourages algorithms that involve
              backpropagation on batches of data, and instead motivates the development of an efﬁcient data-independent
              scoring procedure. Second, positivity and conservation motivate the construction of a loss function that yields
              positive synaptic saliency scores. We combine these insights to introduce a new loss function (where 1 is the all
              ones vector and <<FORMULA>> the element-wise absolute value of parameters in the lth layer),
                                                       
                                          <<FORMULA>>                          (5)

              that yields the positive synaptic saliency scores ($\frac{\partial \mathcal{R}_{\mathrm{SF}}}{\partial \theta} \odot \theta$) we term Synaptic Flow. For a simple, fully connected
              network (i.e. <<FORMULA>>), we can factor the Synaptic Flow score for a parameter <<FORMULA>> as <<FORMULA>>

                                                        <<FORMULA>>               (6)

              This perspective demonstrates that the Synaptic Flow score is a generalization of the magnitude score ($|w^{[l]}_{ij}|$),
              where the scores consider the product of synaptic strengths ﬂowing through each parameter, taking the inter-layer
              interactions of parameters into account. We use the Synaptic Flow score in the Iterative Synaptic Flow Pruning
              (SynFlow) algorithm summarized in the pseudocode below.
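              For a fully connected network without biases, the factored score can be computed with one forward and
              one backward pass over the absolute weights. The sketch below assumes the factorization has the form
              "paths from the inputs into unit j" times |w_ij| times "paths from unit i to the outputs"; the function
              names and the nested-list weight layout are illustrative choices, not the authors' code.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vecmat(v, W):
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def synflow_scores(weights):
    """weights[l] is the (out x in) matrix of layer l, ordered from input to output."""
    absw = [[[abs(w) for w in row] for row in W] for W in weights]
    # forward[l][j]: summed strength of all paths from the inputs to input unit j of layer l.
    forward = [[1.0] * len(absw[0][0])]
    for W in absw[:-1]:
        forward.append(matvec(W, forward[-1]))
    # backward[l][i]: summed strength of all paths from output unit i of layer l to the outputs.
    backward = [[1.0] * len(absw[-1])]
    for W in reversed(absw[1:]):
        backward.insert(0, vecmat(backward[0], W))
    return [[[backward[l][i] * absw[l][i][j] * forward[l][j]
              for j in range(len(absw[l][0]))]
             for i in range(len(absw[l]))]
            for l in range(len(absw))]

weights = [[[1.0, -2.0], [0.5, 0.0]],  # layer 1: 2 inputs -> 2 hidden units
           [[3.0, -1.0]]]              # layer 2: 2 hidden units -> 1 output
scores = synflow_scores(weights)
layer_totals = [sum(sum(row) for row in layer) for layer in scores]
print(scores, layer_totals)
```

              In this example the scores of each layer sum to the same value, which is the layer-wise conservation
              that the theory relies on.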

              Algorithm 1: Iterative Synaptic Flow Pruning (SynFlow).

                                 <<ALGORITHM>>

              Given a network <<FORMULA>> and speciﬁed compression ratio $\rho$, the SynFlow algorithm requires only one additional
              hyperparameter, the number of pruning iterations n. We demonstrate in Appendix 11 that an exponential pruning
              schedule <<FORMULA>> with n=100 pruning iterations essentially prevents layer-collapse whenever avoidable (Fig. 1),
              while remaining computationally feasible, even for large networks.
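              The iterative loop can be sketched generically. The code below is a hedged illustration, not the
              paper's pseudocode: it assumes that after iteration k a fraction rho ** (-k / n) of the parameters is
              kept, takes the scoring rule as a hypothetical callable `score_fn`, and demonstrates it on a toy
              one-input, one-output network with a single hidden layer, where the Synaptic Flow score of both weights
              on a path through hidden unit h is |u_h| * |v_h|.

```python
def iterative_prune(params, score_fn, rho, n_iters):
    """params: dict key -> value; score_fn: masked params -> dict key -> score."""
    mask = {key: 1.0 for key in params}
    for k in range(1, n_iters + 1):
        masked = {key: params[key] * mask[key] for key in params}
        scores = score_fn(masked)  # re-evaluate scores on the masked network
        keep = round(len(params) * rho ** (-k / n_iters))  # assumed exponential schedule
        # Already-pruned parameters rank last so that they stay pruned.
        ranked = sorted(params, key=lambda key: (mask[key], scores[key]), reverse=True)
        mask = {key: 0.0 for key in params}
        for key in ranked[:keep]:
            mask[key] = 1.0
    return mask

def toy_synflow(masked):
    # Score of each weight = product of absolute weights along its (only) path.
    return {(name, h): abs(masked[("u", h)]) * abs(masked[("v", h)]) for (name, h) in masked}

u = [1.0, 0.2, 0.5, 0.9]  # input -> hidden weights (hypothetical values)
v = [0.1, 1.0, 0.5, 0.8]  # hidden -> output weights (hypothetical values)
params = {("u", h): u[h] for h in range(4)}
params.update({("v", h): v[h] for h in range(4)})
final_mask = iterative_prune(params, toy_synflow, rho=4, n_iters=2)
print(final_mask)
```

              No data is needed at any point: the scores depend only on the current masked weights, which is what
              makes many re-evaluations affordable.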

                                                  7 Experiments

              We empirically benchmark the performance of our algorithm, SynFlow (red), against the baselines random
              pruning and magnitude pruning, as well as the state-of-the-art algorithms SNIP [13] and GraSP [14]. In Fig. 6,
              we test the ﬁve algorithms on 12 distinct combinations of modern architectures (VGG-11, VGG-16, ResNet-
              18, WideResNet-18) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) over an exponential sweep of
              compression ratios (<<FORMULA>>). See Appendix 12 for more details and hyperparameters
              of the experiments. Consistently, SynFlow outperforms the other algorithms in the high compression regime
              ($10^{1.5} < \rho$) and demonstrates signiﬁcantly more stability, as indicated by its tight intervals. Furthermore,
              SynFlow is the only algorithm that reliably shows better performance than the random pruning baseline: SNIP and
              GraSP perform signiﬁcantly worse than random pruning with ResNet-18 and WideResNet-18 trained on Tiny
              ImageNet. SynFlow is also quite competitive in the low compression regime (<<FORMULA>>). Although magnitude
              pruning can partially outperform SynFlow in this regime with models trained on Tiny ImageNet, it suffers from
              catastrophic layer-collapse as indicated by the sharp drops in accuracy.

                         <<FORMULA>>

              Figure 6: SynFlow consistently outperforms other pruning methods. Top-1 test accuracy as a function of
              different compression ratios over 12 distinct combinations of models and datasets. We performed three runs
              with the same hyperparameter conditions and different random seeds. The solid line represents the mean, and the
              shaded region represents the area between the minimum and maximum performance of the three runs.


              8 Conclusion

              In this paper, we developed a unifying theoretical framework that explains why existing single-shot pruning
              algorithms at initialization suffer from layer-collapse. We applied our framework to elucidate how iterative
              magnitude pruning [10] overcomes layer-collapse to identify winning lottery tickets at initialization. Building
              on the theory, we designed a new data-agnostic pruning algorithm, SynFlow, that provably avoids layer-collapse
              and reaches Maximal Critical Compression. Finally, we empirically conﬁrmed that our SynFlow algorithm
              consistently performs better than existing algorithms across 12 distinct combinations of models and datasets,
              despite the fact that our algorithm is data-agnostic and requires no pre-training. Promising future directions
              for this work are to (i) explore a larger space of potential pruning algorithms that satisfy Maximal Critical
              Compression, (ii) harness SynFlow as an efﬁcient way to compute appropriate per-layer compression ratios to
              combine with existing scoring metrics, and (iii) incorporate pruning as a part of neural network initialization
              schemes. Overall, our data-agnostic pruning algorithm challenges the existing paradigm that data must be used
              to quantify which synapses of a neural network are important.


                                                  9 Acknowledgements

              We thank Jonathan M. Bloom, Weihua Hu, Javier Sagastuy-Brena, Chengxu Zhuang, and members of the
              Stanford Neuroscience and Artiﬁcial Intelligence Laboratory for helpful discussions. We thank the Stanford
              Data Science Scholars program (DK), the Burroughs Wellcome, Simons and James S. McDonnell foundations,
              and an NSF career award (SG) for support.

              References
               [1] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network
                 via relevance assessment. In Advances in Neural Information Processing Systems, pages 107–115, 1989.

               [2] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information
                 Processing Systems, pages 598–605, 1990.

               [3] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon.
                 In Advances in Neural Information Processing Systems, pages 164–171, 1993.

               [4] Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600, 1989.

               [5] Russell Reed. Pruning algorithms-a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.

               [6] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efﬁcient
                 neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

               [7] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efﬁcient processing of deep neural networks:
                 A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.

               [8] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep
                 nets via a compression approach. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th
                 International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research,
                 pages 254–263. PMLR, 2018.

               [9] Hidenori Tanaka, Aran Nayebi, Niru Maheswaranathan, Lane McIntosh, Stephen Baccus, and Surya
                 Ganguli. From deep learning to mechanistic understanding in neuroscience: the structure of retinal
                 prediction. In Advances in Neural Information Processing Systems, pages 8535–8545, 2019.

              [10] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural
                 networks. In International Conference on Learning Representations, 2019.

              [11] Jonathan Frankle, G Karolina Dziugaite, DM Roy, and M Carbin. Stabilizing the lottery ticket hypothesis.
                 arXiv, page.

              [12] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing
                 lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing
                 Systems, pages 4933–4943, 2019.

              [13] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning
                 based on connection sensitivity. In International Conference on Learning Representations,
                 2019.

              [14] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving
                 gradient ﬂow. In International Conference on Learning Representations, 2020.

              [15] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer.
                 SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint
                 arXiv:1602.07360, 2016.

              [16] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco
                 Andreetto, and Hartwig Adam. MobileNets: Efﬁcient convolutional neural networks for mobile vision
                 applications. arXiv preprint arXiv:1704.04861, 2017.

              [17] Ameya Prabhu, Girish Varma, and Anoop Namboodiri. Deep expander networks: Efﬁcient deep networks
                 from graph theory. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–35,
                 2018.
              [18] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with
                 low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014. doi:
                 http://dx.doi.org/10.5244/C.28.88.
              [19] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks.
                 In Advances in Neural Information Processing Systems, pages 442–450, 2015.
              [20] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very
                 sparse deep networks. In International Conference on Learning Representations, 2018.
              [21] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and
                 Antonio Liotta. Scalable training of artiﬁcial neural networks with adaptive sparse connectivity inspired by
                 network science. Nature Communications, 9(1):1–12, 2018.
              [22] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint
                 arXiv:1902.09574, 2019.
              [23] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural
                 network pruning? arXiv preprint arXiv:2003.03033, 2020.
              [24] Sejun Park, Jaeho Lee, Sangwoo Mo, and Jinwoo Shin. Lookahead: A far-sighted alternative of
                 magnitude-based pruning. In International Conference on Learning Representations, 2020.
              [25] Ehud D Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE
                 Transactions on Neural Networks, 1(2):239–242, 1990.
              [26] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural
                 networks for resource efﬁcient inference. arXiv preprint arXiv:1611.06440, 2016.
              [27] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation
                 for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
                 Recognition, pages 11264–11272, 2019.
              [28] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efﬁcient DNNs. In Advances in
                 Neural Information Processing Systems, pages 1379–1387, 2016.
              [29] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal
                 brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.
              [30] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao,
                 Ching-Yung Lin, and Larry S Davis. NISP: Pruning networks using neuron importance score propagation. In
                 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203,
                 2018.
              [31] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network
                 pruning. In International Conference on Learning Representations, 2019.
              [32] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A signal propagation
                 perspective for pruning neural networks at initialization. In International Conference on Learning
                 Representations, 2020.
              [33] Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk,
                 Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efﬁcient training of deep
                 networks. In International Conference on Learning Representations, 2020.
              [34] Wenyuan Zeng and Raquel Urtasun. MLPrune: Multi-layer pruning for automated neural network
                 compression. 2018.
              [35] Hesham Mostafa and Xin Wang. Parameter efﬁcient training of deep convolutional neural networks by
                 dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine
                 Learning, volume 97 of Proceedings of Machine Learning Research, pages 4646–4655. PMLR, 2019.
              [36] Xavier Glorot and Yoshua Bengio. Understanding the difﬁculty of training deep feedforward neural
                 networks. In Proceedings of the Thirteenth International Conference on Artiﬁcial Intelligence and Statistics,
                 pages 249–256, 2010.
              [37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
                 In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
              [38] Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron. In International
                 Conference on Learning Representations, 2019.
              [39] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and
                 Wojciech Samek. On pixel-wise explanations for non-linear classiﬁer decisions by layer-wise relevance
                 propagation. PLoS ONE, 10(7), 2015.
              [40] Seul-Ki Yeom, Philipp Seegerer, Sebastian Lapuschkin, Simon Wiedemann, Klaus-Robert Müller, and
                 Wojciech Samek. Pruning by explaining: A novel criterion for deep neural network pruning. arXiv preprint
                 arXiv:1912.08881, 2019.
              [41] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs,
                 and the supermask. In Advances in Neural Information Processing Systems, pages 3592–3602, 2019.
              [42] Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk,
                 Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efﬁcient training of deep
                 networks. In International Conference on Learning Representations, 2020.
              [43] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Linear mode
                 connectivity and the lottery ticket hypothesis. arXiv preprint arXiv:1912.05671, 2019.
              [44] Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and
                 multiple languages: lottery tickets in RL and NLP. In International Conference on Learning Representations,
                 2020.
              [45] Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models:
                 Layers are automatically balanced. In Advances in Neural Information Processing Systems, pages 384–395,
                 2018.
              [46] Stijn Verdenius, Maarten Stol, and Patrick Forré. Pruning via iterative ranking of sensitivity statistics.
                 arXiv preprint arXiv:2006.00896, 2020.

                                                  Appendix

              10 Proofs

              We provide a proof for Theorem 2, which we restate below.
              Theorem 2. Network-wise Conservation of Synaptic Saliency. The sum of the synaptic saliency across any set
              of parameters that exactly separates the input neurons $x$ from the output neurons $y$ of a feedforward neural
              network with homogeneous activation functions equals <<FORMULA>>

              Proof. We begin by deﬁning the set of neurons ($V$) and the set of prunable parameters ($E$) for a neural network.
              Consider a subset of the neurons <<FORMULA>>, such that all output neurons $y_c \in S$ and all input neurons $x_i \in V \setminus S$.
              Consider the set of parameters cut by this partition

                                     <<FORMULA>>                     (7)

              By Theorem 1, we know that the sum of the synaptic saliency over $C(S)$ is equal to the sum of the synaptic
              saliency over the set of parameters adjacent to $C(S)$ and between neurons in <<FORMULA>>.
              Continuing this argument, we eventually get that this sum must be equal to the sum of the synaptic saliency
              over the set of parameters incident to the output neurons $y$, which is 

                    <<FORMULA>>     (8)

              We can repeat this argument, iterating through the set $V \setminus S$ until we reach the input neurons $x$, to show that this
              sum is also equal to <<FORMULA>>.
              We provide a proof for Theorem 3, which we restate below.

              Theorem 3. Iterative, positive, conservative scoring achieves Maximal Critical Compression. If a pruning
              algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the algorithm
              re-evaluates the scores every time a parameter is pruned, then the algorithm satisﬁes the Maximal Critical
              Compression axiom.

              Proof. We prove this theorem by contradiction. Assume that a pruning algorithm with global-masking and
              iterative, positive, conservative scoring does not satisfy the Maximal Critical Compression axiom. This implies
              that at some iteration, the algorithm will prune the last parameter in a layer (layer $l$), despite there existing more
              than one parameter <<FORMULA>> in another layer (layer $k$). Because the algorithm uses global-masking, the
              score for the last parameter in layer $l$, $S^{[l]}$, is less than or equal to the score of each parameter, $S^{[k]}_i$, in layer $k$:

                                               <<FORMULA>>                             (9) 

              Because the scores respect layer-wise conservation, $S^{[l]} = \sum_{i=1}^{N^{[k]}} S^{[k]}_i$. This implies, by the positivity of
              the scores and because $N^{[k]} > 1$, that for all $i$,

                                               <<FORMULA>>                             (10) 

              This is a contradiction to the previous inequality.

              11 Hyperparameter choices for the SynFlow algorithm

              Theorem 3 requires that an algorithm re-evaluate the scores every time a parameter is pruned. However,
              Theorem 2 provides a theoretical insight to drastically reduce the number of iterations needed to practically
              attain Maximal Critical Compression. We now introduce a modiﬁcation to Theorem 3 that motivates practical
              hyperparameter choices used in the SynFlow algorithm.
              Theorem 4. Achieving Maximal Critical Compression practically. If a pruning algorithm, with global-
              masking, assigns positive scores that respect layer-wise conservation and if the prune size, the total score for the
              parameters pruned at any iteration, is strictly less than the cut size, the total score for an entire layer, whenever
              possible, then the algorithm satisﬁes the Maximal Critical Compression axiom.

              Proof. We prove this theorem by contradiction. Assume there is an iterative pruning algorithm that uses positive,
              layer-wise conserved scores and maintains that the prune size at any iteration is less than the cut size whenever
              possible, but doesn’t satisfy the Maximal Critical Compression axiom. At some iteration the algorithm will
              prune a set of parameters containing a subset separating the input neurons from the output neurons, despite
              there existing a set of the same cardinality that does not lead to layer-collapse. By Theorem 2, the total score
              for the separating subset is <<FORMULA>>, which implies, by the positivity of the scores, that the total prune size is at
              least <<FORMULA>>. This contradicts the assumption that the algorithm maintains that the prune size at any iteration is
              always strictly less than the cut size whenever possible.


              Motivated by Theorem 4, we can now choose a practical, yet effective, number of pruning iterations ($n$) and
              schedule for the compression ratios <<FORMULA>> applied at each iteration ($k$) for the SynFlow algorithm. Two natural
              candidates for a compression schedule would be either linear <<FORMULA>> or exponential <<FORMULA>>. Empirically
              we ﬁnd that the SynFlow algorithm with 100 pruning iterations and an exponential compression schedule
              satisﬁes the conditions of Theorem 4 over a reasonable range of compression ratios <<FORMULA>>, as
              shown in Fig. 7b. This is not true if we use a linear schedule for the compression ratios, as shown in Fig. 7a.
              Interestingly, Iterative Magnitude Pruning also uses an exponential compression schedule, but does not provide
              a thorough explanation for this hyperparameter choice [10].
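              The difference between the two schedules can be made concrete by counting how much of the remaining
              network each iteration removes. The sketch below assumes the linear schedule has the form
              rho_k = (k / n) * rho and the exponential schedule the form rho_k = rho ** (k / n); both forms are
              assumptions here. It counts parameters only, so it is a rough proxy for, not a reproduction of, the
              score-based prune-size ratio plotted in Fig. 7.

```python
def linear_schedule(rho, n):
    # Assumed form: rho_k = (k / n) * rho, floored at 1 (no compression).
    return [max(1.0, k / n * rho) for k in range(1, n + 1)]

def exponential_schedule(rho, n):
    # Assumed form: rho_k = rho ** (k / n).
    return [rho ** (k / n) for k in range(1, n + 1)]

def fraction_removed_per_iteration(schedule):
    """Fraction of the currently remaining parameters removed at each iteration."""
    fractions, prev = [], 1.0
    for rho_k in schedule:
        fractions.append(1.0 - prev / rho_k)
        prev = rho_k
    return fractions

lin = fraction_removed_per_iteration(linear_schedule(1000, 100))
exp = fraction_removed_per_iteration(exponential_schedule(1000, 100))
print(max(lin), max(exp))
```

              Under these assumed forms, the linear schedule removes most of the network in its first iteration,
              whereas the exponential schedule removes the same small fraction of the remaining parameters at every
              iteration, which makes it easier to keep the prune size below the cut size.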


                      <<FIGURE>>

              Figure 7: Choosing the number of pruning iterations and compression schedule for SynFlow. Maximum
              ratio of prune size to cut size for an increasing number of pruning iterations for SynFlow with a linear (left) or
              exponential (right) compression schedule. Higher transparency represents higher compression ratios. The black
              dotted line represents the maximal prune size ratio that can be obtained while still satisfying the conditions of
              Theorem 4. All data is from a VGG-19 model at initialization using ImageNet.


              Potential numerical instability. The SynFlow algorithm involves computing the SynFlow objective, <<FORMULA>>,
              whose singular values may vanish or explode exponentially with depth $L$. This may lead to
              numerical instability for very deep networks, although we did not observe this for the models presented
              in this paper. One way to address this potential challenge would be to appropriately scale network parameters
              at each layer to maintain stability. Because the SynFlow algorithm is scale invariant at each layer <<FORMULA>>, this
              modiﬁcation will not affect the performance of the algorithm.
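              The scale invariance can be verified in one line for a simple, fully connected network, assuming the
              product form of the objective in Eq. (5). Rescaling a single layer by a hypothetical factor
              $\alpha > 0$ gives:

```latex
\mathcal{R}_{\mathrm{SF}}\big(\ldots, \alpha\,\theta^{[l]}, \ldots\big)
  = \mathbb{1}^{\top}\Big(\prod_{k>l}|\theta^{[k]}|\Big)\,\big|\alpha\,\theta^{[l]}\big|\,\Big(\prod_{k<l}|\theta^{[k]}|\Big)\mathbb{1}
  = \alpha\,\mathcal{R}_{\mathrm{SF}}(\theta),
\qquad
\frac{\partial \mathcal{R}_{\mathrm{SF}}}{\partial \theta}\odot\theta \;\longmapsto\;
\alpha\,\frac{\partial \mathcal{R}_{\mathrm{SF}}}{\partial \theta}\odot\theta .
```

              Every score in every layer is multiplied by the same factor, so the ranking of parameters, and hence
              the set pruned under global masking, is unchanged.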

              12 Experimental details

              An open source version of our code and the data used to generate all the ﬁgures in this paper are available at
              github.com/ganguli-lab/Synaptic-Flow.

                                                  12.1 Pruning algorithms

              All pruning algorithms we considered in our experiments use the following two steps: (i) scoring parameters,
              and (ii) masking parameters globally across the network with the lowest scores. Here we describe details of how
              we computed scores used in each of the pruning algorithms.
              Random: We sampled scores independently from a standard Gaussian.
              Magnitude: We computed the absolute value of the parameters.
              SNIP: We computed the score <<FORMULA>>
              using a random subset of the training dataset with a size ten times the
              number of classes, namely 100 for CIFAR-10, 1000 for CIFAR-100, 2000 for Tiny ImageNet, and 10000 for
              ImageNet. The score was computed on a batch of size 256 for CIFAR-10/100, 64 for Tiny ImageNet, and 16 for
              ImageNet, then summed across batches to obtain the score used for pruning.
              GraSP: We computed the score <<FORMULA>> using a random subset of the training dataset with a size ten
              times the number of classes, namely 100 for CIFAR-10, 1000 for CIFAR-100, 2000 for Tiny ImageNet, and
              10000 for ImageNet. The score was computed on a batch of size 256 for CIFAR-10/100, 64 for Tiny ImageNet,
              and 16 for ImageNet, then summed across batches to obtain the score used for pruning.
              SynFlow: We applied Algorithm 1 with 100 pruning iterations, motivated by the theoretical and empirical
              results discussed in Sec. 11.
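              The second step shared by all of these algorithms, global masking, can be sketched as follows. This is
              an illustrative helper under assumed conventions (scores stored as a dictionary of per-layer lists, and
              a compression ratio defined as the total number of parameters divided by the number kept); it is not
              taken from the released code.

```python
def global_mask(scores, compression_ratio):
    """Keep the top 1 / compression_ratio fraction of parameters across all layers.

    scores: dict layer_name -> list of per-parameter scores (hypothetical layout)
    """
    flat = [(s, name, i) for name, layer in scores.items() for i, s in enumerate(layer)]
    n_keep = round(len(flat) / compression_ratio)
    flat.sort(key=lambda t: t[0], reverse=True)  # highest scores first
    mask = {name: [0] * len(layer) for name, layer in scores.items()}
    for _, name, i in flat[:n_keep]:
        mask[name][i] = 1
    return mask

# Hypothetical single-shot scores in which one layer dominates the ranking.
example_scores = {"conv": [0.9, 0.1, 0.4, 0.3], "fc": [0.05, 0.02, 0.01, 0.03]}
example_mask = global_mask(example_scores, compression_ratio=2)
print(example_mask)
```

              In this example every parameter of the `fc` layer is masked, which is a small instance of the
              layer-collapse that single-shot global masking can produce when scores are unbalanced across layers.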

              12.2 Model architectures

              We adapted standard implementations of VGG-11 and VGG-16 from OpenLTH, and ResNet-18 and WideResNet-
              18 from PyTorch models. We considered all weights from convolutional and linear layers of these models as
              prunable parameters, but did not prune biases nor the parameters involved in batchnorm layers. For convolutional
              and linear layers, the weights were initialized with a Kaiming normal strategy and the biases were initialized to zero.

              12.3 Training hyperparameters

              Here we provide hyperparameters that we used to train the models presented in Fig. 1 and Fig. 6. These
              hyperparameters were chosen for the performance of the original model and were not optimized for the
              performance of the pruned networks.

                             <<TABLE>>

<|endoftext|>


<|startoftext|>
Scalable Gradients for Stochastic Differential Equations 

Xuechen Li, Ting-Kam Leonard Wong 

Google Research University of Toronto 

Abstract 

The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset. 

1 Introduction 

Deterministic dynamical systems can often be modeled by ordinary differential equations (ODEs). The adjoint sensitivity method can efficiently compute gradients of ODE solutions with constant memory cost. This method has been well known in the physics, numerical analysis, and control communities for decades [3, 4, 60, 65]. Recently, it was combined with modern reverse-mode automatic differentiation packages, enabling ODEs with millions of parameters to be fit to data [12] and allowing more flexible density estimation and time-series models [23, 32, 72]. 
Stochastic differential equations (SDEs) generalize ODEs, adding instantaneous noise to their dynamics [55, 77, 78]. They are a natural model for phenomena governed by many small and unobserved interactions, such as motion of molecules in a liquid [8], allele frequencies in a gene pool [15], or prices in a market [79]. Previous attempts at fitting SDEs mostly relied on methods with poor scaling properties. The pathwise approach [22, 89], a form of forward-mode automatic differentiation, scales poorly in time with the number of parameters and states in the model. On the other hand, simply differentiating through the operations of an SDE solver [19] scales poorly in memory. 
In this work, we generalize the adjoint method to stochastic dynamics defined by SDEs. We give a simple and practical algorithm for fitting SDEs with tens of thousands of parameters, while allowing the use of high-order adaptive time-stepping SDE solvers. We call this approach the stochastic adjoint sensitivity method. 

<<TABLE>> 

Table 1: Asymptotic complexity comparison. L is the number of steps used in a fixed-step solve, and D is the number of states and parameters. Both memory and time are expressed in units of the cost of evaluating the drift and diffusion functions once each. 

There are two main difficulties in generalizing the adjoint formulation for ODEs to SDEs. The first is mathematical: SDEs are defined using nonstandard integrals that usually rely on Itô calculus. The adjoint method requires solving the dynamics backwards in time from the end state. However, it is not clear exactly what running the SDE backwards means in the context of stochastic calculus, and when it correctly reconstructs the forward trajectory. We address this problem in Section 3, deriving a backward Stratonovich SDE whose dynamics compute the necessary gradient. 
The second difficulty is computational: To retrace the steps, one needs to reconstruct the noise sampled on the forward pass, ideally without storing it. In Section 4, we give an algorithm that allows querying a Brownian motion sample at any time point arbitrarily-precisely, while only storing a single random seed. 

We combine our adjoint approach with a gradient-based stochastic variational inference scheme for efficiently marginalizing over latent SDE models with arbitrary differentiable likelihoods. This model family generalizes several existing families such as latent ODEs [12, 72], Gaussian state-space models [36, 81], and deep Kalman filters [40], and can naturally handle irregularly-sampled time series and missing observations. We train latent SDEs on toy and real datasets, demonstrating competitive performance compared to existing approaches for dynamics modeling. 

2 Background: Stochastic Flows 

2.1 Adjoint Sensitivity Method 
The adjoint sensitivity method is an efficient approach to solve control problems relying on the adjoint (co-state) system [65]. Chen et al. [12] used this method to compute the gradient with respect to parameters of a neural ODE, which is a particular model among many others inspired by the theory of dynamical systems [10, 11, 26, 44, 46, 74, 86]. The method, shown in Algorithm 1, is scalable, since the most costly computation is a vector-Jacobian product defining its backwards dynamics. In addition, since the gradient is obtained by solving another ODE, no intermediate computation is stored as in the case of regular backpropagation [73]. 
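
As a rough illustration of the idea behind Algorithm 1 (not the authors' implementation), the sketch below applies the ODE adjoint to the scalar system dz/dt = θz with loss L = z(T). The helper names rk4 and ode_adjoint, the test problem, and the fixed-step RK4 integrator are all assumptions chosen for brevity. The backward pass integrates the augmented state [z, a, g] from T to 0, where a = ∂L/∂z(t) and g accumulates ∂L/∂θ, so no intermediate forward states are stored.

```python
def rk4(f, y, t0, t1, n):
    """Integrate dy/dt = f(t, y) from t0 to t1 with n RK4 steps (t1 < t0 is allowed)."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y


def ode_adjoint(theta, z0, T, n=200):
    """Gradients of L = z(T) for dz/dt = theta * z, computed with the adjoint ODE."""
    # Forward solve: only the terminal state is kept.
    zT = rk4(lambda t, y: [theta * y[0]], [z0], 0.0, T, n)[0]

    def augmented(t, y):
        z, a, g = y
        # dz/dt = f,  da/dt = -a * df/dz,  dg/dt = -a * df/dtheta
        return [theta * z, -a * theta, -a * z]

    # Backward solve from [z(T), dL/dz(T) = 1, 0] down to t = 0.
    _, dL_dz0, dL_dtheta = rk4(augmented, [zT, 1.0, 0.0], T, 0.0, n)
    return zT, dL_dz0, dL_dtheta
```

For this test problem the closed-form answers are ∂L/∂z0 = exp(θT) and ∂L/∂θ = T z0 exp(θT), which gives a simple check of the backward solve.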

2.2 Stochastic Differential Equations 
We briefly define SDEs: Consider a filtered probability space <<FORMULA>> on which an m-dimensional adapted Wiener process (also known as Brownian motion) <<FORMULA>> is defined. For a fixed terminal time t <<FORMULA>>, we denote by <<FORMULA>> the time horizon. We denote the ith component of W_t by <<FORMULA>>. A stochastic process <<FORMULA>> can be defined by an Itô SDE 

<<FORMULA>>,        (1)

where z_0 ∈ R^d is the starting state, and <<FORMULA>> and <<FORMULA>> are the drift and diffusion functions, respectively. For ease of presentation, we let m = 1 in the following unless otherwise stated. Our contributions can be easily generalized to cases where m > 1. Here, the second integral on the right hand side of (1) is the Itô stochastic integral [55]. When the coefficients are globally Lipschitz in both the state and time, there exists a unique strong solution to the SDE [55]. 
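
To make the definition concrete, a sample path of a scalar Itô SDE of the form (1) can be approximated with the Euler-Maruyama scheme. The sketch below is a minimal illustration under stated assumptions: the function name euler_maruyama, the uniform grid, and the use of Python's random module for the Wiener increments are choices made here, not part of the paper.

```python
import math
import random


def euler_maruyama(b, sigma, z0, T, n, rng):
    """Approximate dZ = b(Z, t) dt + sigma(Z, t) dW (Ito) on a uniform grid of n steps."""
    h = T / n
    z, t = z0, 0.0
    path = [z]
    for _ in range(n):
        dW = rng.gauss(0.0, math.sqrt(h))   # Wiener increment ~ N(0, h)
        z = z + b(z, t) * h + sigma(z, t) * dW
        t += h
        path.append(z)
    return path


# Example: geometric Brownian motion with drift 0.5 * z and diffusion 0.3 * z.
path = euler_maruyama(lambda z, t: 0.5 * z, lambda z, t: 0.3 * z,
                      z0=1.0, T=1.0, n=100, rng=random.Random(0))
```

With the diffusion set to zero the scheme reduces to the explicit Euler method for the ODE dz/dt = b(z, t).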
2.3 Neural Stochastic Differential Equations 
Similar to neural ODEs, one can consider drift and diffusion functions defined by neural networks, a model known as the neural SDE [32, 45, 82, 83]. 
Amongst work on neural SDEs, none has enabled an efficient training framework. In particular, Tzen and Raginsky [82] and Liu et al. [45] considered computing the gradient by simulating the forward dynamics of an explicit Jacobian matrix. This Jacobian has size of either the square of the number of parameters, or the number of parameters times the number of states, building on the pathwise approach [22, 89]. In contrast, our approach only requires a small number of cheap vector-Jacobian products, independent of the dimension of the parameter and state vectors. These vector-Jacobian products have the same asymptotic time cost as evaluating the drift and diffusion functions, and can be easily computed by modern automatic differentiation libraries [1, 16, 49, 59]. 
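
The difference between a vector-Jacobian product and an explicit Jacobian can be seen on a single layer. The sketch below is an illustrative assumption (the layer f(z) = tanh(Wz) and the names layer and vjp are hypothetical): it computes v^T ∂f/∂z with one pass whose cost matches evaluating f, and never materializes the Jacobian matrix. Automatic differentiation libraries perform the same computation for arbitrary compositions.

```python
import math


def layer(W, z):
    """f(z) = tanh(W z) for a weight matrix W given as a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W]


def vjp(W, z, v):
    """Return v^T (df/dz) for f(z) = tanh(W z) without forming the Jacobian."""
    y = layer(W, z)
    # Scale v elementwise by tanh'(.) = 1 - tanh(.)^2, then multiply by W from the left.
    u = [vi * (1.0 - yi * yi) for vi, yi in zip(v, y)]
    return [sum(u[i] * W[i][j] for i in range(len(W))) for j in range(len(z))]
```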
2.4 Backward Stratonovich Integral 
Our stochastic adjoint sensitivity method involves stochastic processes running both forward and backward in time. The Stratonovich stochastic integral, due to its symmetry, gives nice expressions for the backward dynamics and is more convenient for our purpose. Our results can be straightforwardly applied to Itô SDEs as well, using a simple conversion (see e.g. [64, Sec. 2]). 
Following the treatment of Kunita [41], we introduce the forward and backward Stratonovich integrals. Let <<FORMULA>> be a two-sided filtration, where <<FORMULA>> is the σ-algebra generated by <<FORMULA>> for <<FORMULA>> such that <<FORMULA>>. For a continuous semimartingale <<FORMULA>> adapted to the forward filtration <<FORMULA>>, the Stratonovich stochastic integral is 

<<FORMULA>>

where <<FORMULA>> is a partition of the interval <<FORMULA>> denotes the size of the largest segment of the partition, and the limit is to be interpreted in the L2 sense. The Itô integral instead uses the left endpoint <<FORMULA>> rather than the average. In general, the Itô and Stratonovich integrals differ by a term of finite variation. 
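
The difference between the two integrals can be observed numerically on the integral of W against dW. In the sketch below (the function name and the choice of integrand are assumptions made for illustration), the averaged-endpoint sum telescopes to W_T^2 / 2, while the left-endpoint sum falls short of it by half the realized quadratic variation, which is close to T/2 on a fine grid.

```python
import math
import random


def ito_and_stratonovich(n, T, rng):
    """Left-endpoint (Ito) and averaged-endpoint (Stratonovich) sums for the integral of W dW."""
    h = T / n
    w = [0.0]
    for _ in range(n):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(h)))
    ito = sum(w[k] * (w[k + 1] - w[k]) for k in range(n))
    strat = sum(0.5 * (w[k] + w[k + 1]) * (w[k + 1] - w[k]) for k in range(n))
    return w[-1], ito, strat


wT, ito, strat = ito_and_stratonovich(10000, 1.0, random.Random(0))
print(strat - 0.5 * wT ** 2)   # zero up to floating-point error
print(strat - ito)             # half the realized quadratic variation, near T / 2
```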

To define the backward Stratonovich integral, we consider the backward Wiener process <<FORMULA>> defined as <<FORMULA>> for all t that is adapted to the backward filtration <<FORMULA>>. For a continuous semimartingale <<FORMULA>> adapted to the backward filtration, the backward Stratonovich integral is 

<<FORMULA>>

where <<FORMULA>> is the partition. 

Algorithm 1 ODE Adjoint Sensitivity 

<<ALGORITHM>>

Algorithm 2 SDE Adjoint Sensitivity (Ours) 

<<ALGORITHM>> 

Figure 1: Pseudocode of the (ODE) adjoint sensitivity method (left), and our generalization to Stratonovich SDEs (right). Differences are highlighted in blue. Square brackets denote vector concatenation. 

<<FIGURE>>

2.5 Stochastic Flow of Diffeomorphisms 
It is well known that an ODE defines a flow of diffeomorphisms [6]. Here we consider the stochastic analog for the Stratonovich SDE 

<<FORMULA>> (2) 

Throughout the paper, we assume that both b and <<FORMULA>> have infinitely many bounded derivatives w.r.t. the state, and bounded first derivatives w.r.t. time, i.e. <<FORMULA>>, so that the SDE has a unique strong solution. Let <<FORMULA>> be the solution at time t when the process is started at z at time s. Given a realization of the Wiener process, this defines a collection of continuous maps <<FORMULA>> from R^d to itself. 
The following theorem shows that these maps are diffeomorphisms (after choosing a suitable modification) and that they satisfy backward SDEs. 
Theorem 2.1 ([41, Theorem 3.7.1]). (a) With probability 1, the collection <<FORMULA>> satisfies the flow property 

<<FORMULA>>. 

Moreover, each Φ_{s,t} is a smooth diffeomorphism from R^d to itself. We thus call S the stochastic flow of diffeomorphisms generated by the SDE (2). 
(b) The backward flow <<FORMULA>> satisfies the backward SDE: 

<<FORMULA>>, (3) 

for all <<FORMULA>> and <<FORMULA>> such that <<FORMULA>>. 

The coefficients in (2) and (3) differ by only a negative sign. This symmetry is due to our use of the Stratonovich integral (see Figure 2). 

3 Sensitivity via Stochastic Adjoint 

We present our main contribution: a stochastic analog of the adjoint sensitivity method for SDEs. We use (3) to derive another backward Stratonovich SDE, which we call the stochastic adjoint process. The direct implication is a gradient computation algorithm that works by solving a set of dynamics in reverse time, and relies on cheap vector-Jacobian products without storing any intermediate quantities. 
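
Before turning to the adjoint itself, the claim that the backward SDE (3) retraces the forward flow of (2) with negated coefficients can be checked numerically. The sketch below is an assumption-laden illustration rather than the paper's procedure: it uses a scalar linear Stratonovich SDE with drift 0.1z and diffusion 0.3z, a Heun scheme (which converges to the Stratonovich solution for scalar noise), and stored increments replayed in reverse order.

```python
import math
import random


def heun_step(z, dt, dW, b, sigma):
    """One Heun (Stratonovich) step; negative dt and dW step backwards in time."""
    z_pred = z + b(z) * dt + sigma(z) * dW
    return z + 0.5 * (b(z) + b(z_pred)) * dt + 0.5 * (sigma(z) + sigma(z_pred)) * dW


def forward_then_backward(z0, T=1.0, n=1000, seed=0):
    rng = random.Random(seed)
    h = T / n
    b = lambda z: 0.1 * z
    sigma = lambda z: 0.3 * z
    dWs = [rng.gauss(0.0, math.sqrt(h)) for _ in range(n)]
    z = z0
    for dW in dWs:                 # forward flow
        z = heun_step(z, h, dW, b, sigma)
    zT = z
    for dW in reversed(dWs):       # same noise, reversed, with flipped signs
        z = heun_step(z, -h, -dW, b, sigma)
    return zT, z, sum(dWs)


zT, z_rec, wT = forward_then_backward(1.5)
print(abs(z_rec - 1.5))                              # reconstruction error of the start state
print(abs(zT - 1.5 * math.exp(0.1 + 0.3 * wT)))      # error against the closed-form solution
```

For this linear SDE the Stratonovich solution is z0 exp(0.1 T + 0.3 W_T), so both the forward solve and the reconstruction can be compared against known values.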

3.1 Stochastic Adjoint Process 
The goal is to derive a stochastic adjoint process <<FORMULA>> that can be simulated by evaluating only vector-Jacobian products, where <<FORMULA>> is a scalar loss of the terminal state from the forward flow <<FORMULA>>. 

We first derive a backward SDE for the process <<FORMULA>>, assuming that <<FORMULA>> follows the inverse flow from a deterministic end state <<FORMULA>> that does not depend on the realized Wiener process (Lemma 3.1). We then extend to the case where <<FORMULA>> is obtained by the forward flow starting from a deterministic initial state z_0 (Theorem 3.2). This latter part is unconventional, and the resulting value cannot be interpreted as the solution to a backward SDE anymore due to loss of adaptedness. Instead, we will formulate the result with the Itô map [69]. Finally, it is straightforward to extend the state Z_t to include parameters of the drift and diffusion functions such that the desired gradient can be obtained for stochastic optimization; we comment on this step in Section 3.3. 

We first present the SDE for the Jacobian matrix of the backward flow. 

Lemma 3.1 (Dynamics of <<FORMULA>>). Consider the stochastic flow generated by the backward SDE (3) as in Theorem 2.1(b). Letting J_{s,t}(z) := ∇Ψ_{s,t}(z), we have 

<<FORMULA>>, (4)

for all <<FORMULA>> and <<FORMULA>>. Furthermore, letting <<FORMULA>>, we have 

<<FORMULA>>, (5) 

for all <<FORMULA>> and <<FORMULA>>. 

The proof included in Appendix 9.1 relies on Itô's lemma in the Stratonovich form [41, Theorem 2.4.1]. We stress that this lemma considers only the case where the endpoint z is fixed and deterministic. 
Now, we extend to the case where the endpoint is not deterministic, but rather computed from the forward flow. To achieve this, we compose the state process and the loss function. Consider <<FORMULA>>. The chain rule gives <<FORMULA>>. Let 

<<FORMULA>>        (6) 

Note that <<FORMULA>>. Since <<FORMULA>> is a constant, <<FORMULA>> satisfies the augmented backward SDE system 

<<FORMULA>>        (7) 

Since the drift and diffusion functions of this augmented system are <<FORMULA>>, the system has a unique strong solution. Let s = 0 and t = T. Since (7) admits a strong solution, we may write 

<<FORMULA>>,         (8) 

where <<FORMULA>> denotes the path of the Wiener process and 

<<FORMULA>>

is a deterministic measurable function (the Itô map) [69, Chapter V, Definition 10.9]. Intuitively, F can be thought of as a black box that computes the solution to the backward SDE system (7) given the position at time T and the realized Wiener process samples. Similarly, we let G be the solution map for the forward flow (2). The next theorem follows immediately from (6) and the definition of <<FORMULA>>. 

Theorem 3.2. For <<FORMULA>>-almost all <<FORMULA>>, 

<<FORMULA>> 

where <<FORMULA>>. 

Proof. This is a consequence of composing <<FORMULA>> and (8). 


This shows that one can obtain the gradient by "composing" the backward SDE system (7) with the original forward SDE (2) and ends our continuous-time analysis. 

3.2 Numerical Approximation 
In practice, we compute solutions to SDEs with numerical solvers F_h and G_h, where h = T/L denotes the mesh size of a fixed grid. The approximate algorithm thus outputs <<FORMULA>>. The following theorem provides sufficient conditions for convergence. 
Theorem 3.3. Suppose the schemes F_h and G_h satisfy the following conditions: (i) <<FORMULA>> in probability as <<FORMULA>>, and (ii) for any <<FORMULA>>, we have <<FORMULA>> in probability as <<FORMULA>>. Then, for any starting point z of the forward flow, we have 

<<FORMULA>>

in probability as <<FORMULA>>. 

See Appendix 9.2 for the proof. Usual schemes such as the Euler-Maruyama scheme (more generally Itô-Taylor schemes) converge pathwise (i.e. almost surely) from any fixed starting point [38] and satisfy (i). While (ii) is strong, we note that the SDEs considered here have smooth coefficients, and thus their solutions enjoy nice regularity properties in the starting position. Therefore, it is reasonable to expect the corresponding numerical schemes to also behave nicely as a function of both the mesh size and the starting position. To the best of our knowledge, this property is not considered at all in the literature on numerical methods for SDEs (where the initial position is fixed), but is crucial in the proof of Theorem 3.3. In Appendix 9.3, we prove that condition (ii) holds for the Euler-Maruyama scheme. Detailed analysis for other schemes is beyond the scope of this paper. 

3.3 The Algorithm 
So far we have derived the gradient of the loss with respect to the initial state. We can extend these results to give gradients with respect to parameters of the drift and diffusion functions by treating them as an additional part of the state whose dynamics has zero drift and diffusion. We summarize this in Algorithm 2, assuming access only to a black-box solver sdeint. All terms in the augmented dynamics, such as <<FORMULA>> can be cheaply evaluated by calling <<FORMULA>> and <<FORMULA>>, respectively. 
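
The augmented backward solve can be sketched on a problem with a closed-form answer. The example below is an illustration under stated assumptions, not a transcription of Algorithm 2: it uses the scalar Stratonovich SDE dZ = θZ dt + σZ ∘ dW with loss L = Z_T, a hand-written Heun step in place of the black-box sdeint, and stored Brownian increments (the paper avoids this storage with the virtual Brownian tree of Section 4). The augmented state is [z, a, g_θ, g_σ], where a is the adjoint and the last two entries accumulate the parameter gradients.

```python
import math
import random


def heun(y, dt, dW, drift, diffusion):
    """One Heun (Stratonovich) step for a vector state; dt, dW < 0 steps backwards."""
    f0, g0 = drift(y), diffusion(y)
    pred = [yi + fi * dt + gi * dW for yi, fi, gi in zip(y, f0, g0)]
    f1, g1 = drift(pred), diffusion(pred)
    return [yi + 0.5 * (a + c) * dt + 0.5 * (b + d) * dW
            for yi, a, b, c, d in zip(y, f0, g0, f1, g1)]


def stochastic_adjoint(theta, sigma, z0, T=1.0, n=2000, seed=0):
    rng = random.Random(seed)
    h = T / n
    dWs = [rng.gauss(0.0, math.sqrt(h)) for _ in range(n)]   # stored only for simplicity

    # Forward pass: keep only the terminal state.
    y = [z0]
    for dW in dWs:
        y = heun(y, h, dW, lambda s: [theta * s[0]], lambda s: [sigma * s[0]])
    zT = y[0]

    # Backward pass on [z, a, g_theta, g_sigma]; each adjoint term is a
    # vector-Jacobian product, here written out by hand for the linear case.
    def drift(s):
        z, a = s[0], s[1]
        return [theta * z, -a * theta, -a * z, 0.0]

    def diffusion(s):
        z, a = s[0], s[1]
        return [sigma * z, -a * sigma, 0.0, -a * z]

    s = [zT, 1.0, 0.0, 0.0]            # dL/dZ_T = 1 for L = Z_T
    for dW in reversed(dWs):
        s = heun(s, -h, -dW, drift, diffusion)
    return zT, sum(dWs), s
```

Since Z_T = z0 exp(θT + σW_T) for this SDE, the expected outputs are ∂L/∂z0 = Z_T / z0, ∂L/∂θ = T Z_T and ∂L/∂σ = W_T Z_T, which the backward pass should match up to discretization error.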
Difficulties with non-diagonal diffusion. In principle, we can simulate the forward and backward adjoint dynamics with any high-order solver of choice. However, for general matrix-valued diffusion functions σ, to obtain a numerical solution with strong order¹ beyond 1/2, we need to simulate multiple integrals of the Wiener process such as <<FORMULA>>. These random variables are difficult to simulate and costly to approximate [87]. 
Fortunately, if we restrict our SDE to have diagonal noise, then even though the backward SDE for the stochastic adjoint will not in general have diagonal noise, it will satisfy a commutativity property [70]. In that case, we can safely adopt certain numerical schemes of strong order 1.0 (e.g. Milstein [52] and stochastic Runge-Kutta [71]) without approximating multiple integrals or the Lévy area during simulation. We formally show this in Appendix 9.4. 
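
For scalar (and more generally diagonal) noise, the Milstein scheme needs only the Wiener increment itself. A minimal sketch for geometric Brownian motion in Itô form follows; the test SDE and the function name milstein_gbm are assumptions chosen because the exact solution z0 exp((μ − σ²/2)T + σW_T) is available for comparison.

```python
import math
import random


def milstein_gbm(mu, sigma, z0, T, n, rng):
    """Milstein scheme for the Ito SDE dZ = mu * Z dt + sigma * Z dW (scalar noise)."""
    h = T / n
    z, w = z0, 0.0
    for _ in range(n):
        dW = rng.gauss(0.0, math.sqrt(h))
        # Euler-Maruyama terms plus the correction 0.5 * s(z) * s'(z) * (dW^2 - h),
        # where s(z) = sigma * z and s'(z) = sigma.
        z += mu * z * h + sigma * z * dW + 0.5 * (sigma * z) * sigma * (dW * dW - h)
        w += dW
    return z, w


z, w = milstein_gbm(0.1, 0.3, 1.0, 1.0, 2000, random.Random(0))
exact = math.exp((0.1 - 0.5 * 0.3 ** 2) * 1.0 + 0.3 * w)
print(abs(z - exact))
```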
One may also consider numerical schemes with high weak order [39]. However, analysis of this scenario is beyond the current scope. 

3.4 Software and Implementation 
We have implemented several common SDE solvers in PyTorch [59] with adaptive time-stepping using a PI controller [9, 30]. Following torchdiffeq [12], we have created a user-friendly subclass of torch.autograd.Function that facilitates gradient computation using our stochastic adjoint framework for SDEs that are subclasses of torch.nn.Module. We include a short code snippet covering the main idea of the stochastic adjoint in Appendix 9.12. The complete codebase can be found at https://github.com/google-research/torchsde. 

4 Virtual Brownian Tree 

Our formulation of the adjoint can be numerically integrated efficiently, since simulating its dynamics only requires evaluating cheap vector-Jacobian products, as opposed to whole Jacobians. However, the backward-in-time nature introduces a new difficulty: the same Wiener process sample path used in the forward pass must be queried again during the backward pass. Storing Brownian motion increments implies a large memory consumption and complicates the usage of adaptive time-stepping integrators, where the evaluation times in the backward pass may be different from those in the forward pass. 
To overcome this issue, we combine Brownian trees with splittable pseudorandom number generators (PRNGs) to give an algorithm that can query values of a Wiener process sample path at arbitrary times. This algorithm, which we call the virtual Brownian tree, has O(1) memory cost, and time cost logarithmic with respect to the inverse error tolerance. 

¹ A numerical scheme is of strong order p if <<FORMULA>> for all <<FORMULA>>, where X_t and X_{Nη} are respectively the coupled true solution and numerical solution, N and η are respectively the iteration index and step size such that Nη = T, and C is independent of η. 

<<FIGURE>>

Figure 3: Evaluating a Brownian motion sample at time tq using a virtual Brownian tree. Our algorithm repeatedly bisects the interval, sampling from a Brownian bridge at each halving to determine intermediate values. Each call to the random number generator uses a unique key whose value depends on the path taken to reach it. 

4.1 Brownian Bridges and Brownian Trees 
Lévy's Brownian bridge [67] states that given a start time t_s and end time t_e along with their respective Wiener process values w_s and w_e, the marginal of the process at time <<FORMULA>> is a normal distribution: 
    
<<FORMULA>>.  (9)  

We can recursively apply this formula to evaluate the process at the midpoint of any two distinct timestamps where the values are already known. Constructing the whole sample path of a Wiener process in this manner results in what is known as the Brownian tree [17]. Storing this tree would be memory-intensive, but we show how to reconstruct any node in this tree as desired. 
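
A single Brownian bridge draw can be sketched as follows. The mean and variance used here are the standard conditional moments of a Wiener process pinned at both ends, which we assume is what (9) states; the function name and return signature are hypothetical.

```python
import math
import random


def brownian_bridge(ts, ws, te, we, t, rng):
    """Sample W(t) given W(ts) = ws and W(te) = we, for ts < t < te.

    Returns the sample together with the conditional mean and variance.
    """
    mean = ((te - t) * ws + (t - ts) * we) / (te - ts)   # linear interpolation
    var = (te - t) * (t - ts) / (te - ts)                # vanishes at both endpoints
    return rng.gauss(mean, math.sqrt(var)), mean, var


# Midpoint of [0, 1] with W(0) = 0 and W(1) = 2: mean 1.0, variance 0.25.
sample, mean, var = brownian_bridge(0.0, 0.0, 1.0, 2.0, 0.5, random.Random(0))
```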

4.2 Brownian Trees using Splittable Seeds 
We assume access to a splittable PRNG [14], which has an operation split that deterministically generates two keys from an existing key. Given a key, the function BrownianBridge samples deterministically from (9). To obtain the Wiener process value at a specific time, we must first know or sample the values at the initial and terminal times. Then, the virtual Brownian tree recursively samples from the midpoint of Brownian bridges, each sample using a key split from that of its parent node. The algorithm terminates when the most recently sampled time is close enough to the desired time. We outline the full procedure in Algorithm 3. 

Algorithm 3 Virtual Brownian Tree 

<<ALGORITHM>>

This algorithm has constant memory cost. For a fixed-step-size solver taking L steps, the tolerance that the tree will need to be queried at scales as 1/L. Thus the per-step time complexity scales as log L. Our implementation uses an efficient counter-based PRNG [76] which avoids passing large random states, and instead simply passes integers. Table 1 compares the asymptotic time complexity of this approach against existing alternatives. 
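
The procedure described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' Algorithm 3 or implementation: the split helper emulates a splittable PRNG by hashing integer keys with SHA-256, the process is assumed to start at W(t0) = 0, and the terminal value is drawn from a key derived from the seed. Only the current interval, its endpoint values, and one key are held in memory at any time.

```python
import hashlib
import math
import random


def split(key):
    """Deterministically derive two child keys from an integer key (stand-in for a splittable PRNG)."""
    def child(tag):
        digest = hashlib.sha256(f"{key}:{tag}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return child("left"), child("right")


def bridge_sample(key, ts, ws, te, we, t):
    """Deterministic Brownian bridge draw at time t, keyed by `key`."""
    mean = ((te - t) * ws + (t - ts) * we) / (te - ts)
    std = math.sqrt((te - t) * (t - ts) / (te - ts))
    return random.Random(key).gauss(mean, std)


def virtual_brownian_tree(seed, t, t0=0.0, t1=1.0, tol=1e-6):
    """Query W(t) on [t0, t1] to within time tolerance `tol`, storing only a seed."""
    if not t0 <= t <= t1:
        raise ValueError("query time must lie in [t0, t1]")
    key_end, key = split(seed)
    ts, ws = t0, 0.0
    te, we = t1, random.Random(key_end).gauss(0.0, math.sqrt(t1 - t0))
    tm = 0.5 * (ts + te)
    wm = bridge_sample(key, ts, ws, te, we, tm)
    while abs(t - tm) > tol:
        left, right = split(key)
        if t < tm:                       # descend into the left half
            te, we, key = tm, wm, left
        else:                            # descend into the right half
            ts, ws, key = tm, wm, right
        tm = 0.5 * (ts + te)
        wm = bridge_sample(key, ts, ws, te, we, tm)
    return wm
```

Because every node's key is a function of the path from the root, repeated queries at the same time return the same value, and queries at nearby times share the ancestors that determine the coarse shape of the path.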

5 Latent Stochastic Differential Equations 

The algorithms presented in Sections 3 and 4 allow us to efficiently compute gradients of scalar objectives with respect to SDE parameters, letting us fit SDEs to data. This raises the question: Which loss to optimize? 
Simply fitting SDE parameters to maximize likelihood will in general cause overfitting, and will result in the diffusion function going to zero. In this section, we show how to do efficient variational inference in SDE models, and optimize the marginal log-likelihood to fit both prior (hyper-)parameters and the parameters of a tractable approximate posterior over functions. 
In particular, we can parameterize both a prior over functions and an approximate posterior using SDEs: 

<<FORMULA>>, (prior) 

<<FORMULA>>, (approx. post.) 

where <<FORMULA>> and <<FORMULA>> are Lipschitz in both arguments, and both processes have the same starting value: <<FORMULA>>. 
<<FIGURE>>

Figure 4: Graphical models for the generative process (decoder) and recognition network (encoder) of the latent stochastic differential equation model. This model can be viewed as a variational autoencoder with infinite-dimensional noise. Red circles represent entire function draws from Brownian motion. Given the initial state z_0 and a Brownian motion sample path <<FORMULA>>, the intermediate states <<FORMULA>> are deterministically approximated by a numerical SDE solver. 

If both processes share the same diffusion function <<FORMULA>>, then the KL divergence between them is finite (under additional mild regularity conditions; see Appendix 9.6), and can be estimated by sampling paths from the posterior process. Then, the evidence lower bound (ELBO) can be written as: 

<<FORMULA>>, (10)

where <<FORMULA>> satisfies <<FORMULA>>, and the expectation is taken over the approximate posterior process defined by (approx. post.). The likelihoods of the observations x1,...,xN at times t1,...,tN depend only on the latent states zt at corresponding times. 
To compute the gradient with respect to prior parameters <<FORMULA>> and variational parameters <<FORMULA>>, we need only augment the forward SDE with an extra scalar variable whose drift function is <<FORMULA>> and whose diffusion function is zero. The backward dynamics can be derived analogously using (7). We include a detailed derivation in Appendix 9.6. Thus, a stochastic estimate of the gradients of the loss w.r.t. all parameters can be computed in a single pair of forward and backward SDE solves. 
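
The augmentation can be sketched for a scalar latent state. The code below assumes that the extra variable has drift ½u², where u is the difference between posterior and prior drifts divided by the shared diffusion; this is the usual Girsanov-type path-space KL term, and it is an assumption here because the corresponding formulas are not reproduced in this text. The function and argument names are hypothetical, and Euler-Maruyama is used only for brevity.

```python
import math
import random


def sample_posterior_with_kl(h_post, h_prior, sigma, z0, T, n, rng):
    """Simulate the posterior SDE and accumulate the integral of 0.5 * u^2 along the path."""
    dt = T / n
    z, t, kl = z0, 0.0, 0.0
    for _ in range(n):
        u = (h_post(z, t) - h_prior(z, t)) / sigma(z, t)
        kl += 0.5 * u * u * dt            # extra state: drift 0.5 * u^2, zero diffusion
        z += h_post(z, t) * dt + sigma(z, t) * rng.gauss(0.0, math.sqrt(dt))
        t += dt
    return z, kl


# Prior drift -z, posterior drift -z + 0.4, shared constant diffusion 0.2:
# u = 2 everywhere, so the accumulated KL term is 0.5 * 4 * T = 2.0 for T = 1.
zT, kl = sample_posterior_with_kl(lambda z, t: -z + 0.4, lambda z, t: -z,
                                  lambda z, t: 0.2, 0.0, 1.0, 100, random.Random(0))
```

A single-sample ELBO estimate would then combine the log-likelihood of the observations at the sampled latent states with the negative of this accumulated term.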
The variational parameters φ can either be optimized individually for each sequence, or, if multiple time series share parameters, an encoder network can be trained to input the observations and output φ. This architecture, shown in Figure 4, can be viewed as an infinite-dimensional variational autoencoder [35, 68]. 

6 Related Work 

Sensitivity Analysis for SDEs. Gradient computation is closely related to sensitivity analysis. Computing gradients with respect to parameters of vector fields of an SDE has been extensively studied in the stochastic control literature [42]. In particular, for low dimensional problems, this is done effectively using dynamic programming [7] and finite differences [20, 43]. However, both approaches scale poorly with the dimensionality of the parameter vector. 
Analogous to REINFORCE (or the score-function estimator) [21, 37, 88], Yang and Kushner [89] considered deriving the gradient as ∇E[L(Z_T)] = E[L(Z_T)H] for some random variable H. However, H usually depends on the density of Z_T with respect to the Lebesgue measure, which can be difficult to compute. Gobet and Munos [22] extended this approach by weakening a non-degeneracy condition using Malliavin calculus [53]. 
Closely related to the current approach is the pathwise method [89], which is also a continuous-time analog of the reparameterization trick [35, 68]. Existing methods in this regime [22, 45, 82] all require simulating a (forward) SDE where each step requires computing entire Jacobian matrices. This computational cost is prohibitive for high-dimensional systems with a large number of parameters. 
Based on the Euler discretization, Giles and Glasserman [19] considered simply performing reverse-mode automatic differentiation through all intermediate steps. They named this method the adjoint approach, which, by modern standards, is a form of "backpropagation through the operations of a numerical solver". This approach, widely adopted in the field of finance for calibrating market models [19], has high memory cost, and relies on a fixed Euler-Maruyama discretization. Recently, this approach was also used by Hegde et al. [27] to learn parameterized drift and diffusion functions of an SDE. In scientific computing, Innes et al. [31] considered backpropagating through high-order implicit SDE solvers. 

Figure 5: (a) Same fixed step size used in both forward and reverse simulation. Boxplot generated by repeating the experiment with different Brownian motion sample paths 64 times. (b) Colors of dots represent tolerance levels and correspond to the colorbar on the right. Only atol was varied and rtol was set to 0. 

In the machine learning literature, Ryder et al. [75] perform variational inference over the state and parameters for Euler-discretized latent SDEs and optimize the model with regular backpropagation. This approach should not be confused with the formulation of variational inference for non-discretized SDEs presented in previous works [25, 57, 82] and our work, as it is unclear whether the limit of their discretization corresponds to that obtained by operating with continuous-time SDEs using Girsanov's theorem. 
Backward SDEs. Our stochastic adjoint process relies on the notion of backward SDEs devised by Kunita [41], which is based on two-sided filtrations. This is different from the more traditional notion of backward SDEs where only a single filtration is defined [58, 62]. 
Based on the latter notion, forward-backward SDEs (FBSDEs) have been proposed to solve stochastic optimal control problems [63]. However, simulating FBSDEs is costly due to the need to estimate conditional expectations in the backward pass [58]. 
Bayesian Learning of SDEs. Recent works considered the problem of inferring an approximate posterior SDE given observed data under a prior SDE with the same diffusion coefficient [25, 57, 82]. The special case with constant diffusion coefficients was considered more than a decade ago [5]. Notably, computing the KL divergence between two SDEs over a finite time horizon was well-explored in the control literature [33, 80]. We include background on this topic in Appendix 9.5. 
Bayesian learning and parameter estimation for SDEs have a long history [24]. Techniques which don't require positing a variational family, such as the extended Kalman filter and Markov chain Monte Carlo, have been considered in the literature [50]. 

7 Experiments 

The aim of this section is threefold. We first empirically verify our theory by comparing the gradients obtained by our stochastic adjoint framework against analytically derived gradients for problems having closed-form solutions. We then fit latent SDE models with our framework on two synthetic datasets, verifying that the variational inference framework allows learning a generative model of time series. Finally, we learn dynamics parameterized by neural networks with a latent SDE from a motion capture dataset, demonstrating competitive performance compared to existing approaches. 
We report results based on an implementation of Brownian motion that stores all intermediate queries. The virtual Brownian tree allowed training with much larger batch sizes on GPUs, but was not necessary for our small-scale experiments. Notably, our adjoint approach, even when combined with the Brownian motion implementation that stores noise, was able to reduce the memory usage by 1/2-1/3 compared to directly back-propagating through solver operations on the tasks we considered. 
7.1 Numerical Studies 
We consider three test problems (examples 1-3 from [66]; details in Appendix 9.7), all of which have closed-form solutions. We compare the gradient computed from simulating our stochastic adjoint process using the Milstein scheme against the exact gradient. Figure 5(a) shows that for test example 2, the error between the adjoint gradient and analytical gradient decreases with step size. 
For all three test problems, the mean squared error across dimensions tends to be smaller as the absolute tolerance of the adaptive solver is reduced (e.g. see Fig. 5(b)). However, the number of function evaluations (NFEs) tends to be much larger than that in the ODE case [12]. 

Additionally, for two out of three test problems, we found that our adjoint approach with the Milstein scheme and fixed step size can be much more time-efficient than regular backpropagation through operations of the Milstein and Euler schemes (see e.g. Fig. 5(c)). Backpropagating through the Euler scheme gives gradients of higher error compared to the Milstein method. On the other hand, directly backpropagating through the Milstein solve requires evaluating high-order derivatives and can be costly. 
Results for examples 1 and 3 are in Appendix 9.8. 

Figure 6: Learned posterior and prior dynamics on data from a stochastic Lorenz attractor. All samples from our model are continuous-time paths, and form a multi-modal, non-Gaussian distribution. 
7.2 Synthetic Datasets 
We trained latent SDEs with our adjoint framework to recover (1) a 1D Geometric Brownian motion, and (2) a 3D stochastic Lorenz attractor process. The main objective is to verify that the learned posterior can reconstruct the training data, and that the learned priors are not deterministic. We jointly optimize the evidence lower bound (10) with respect to parameters of the prior and posterior distributions at the initial latent state z0, the prior and posterior drift, the diffusion function, the encoder, and the decoder. We include the details of datasets and architectures in Appendix 9.9. 
For the stochastic Lorenz attractor, not only is the model able to reconstruct the data well, but also the learned prior process can produce bimodal samples in both data and latent space. This is showcased in the last row of Figure 6 where the latent and data space samples cluster around two modes. This is hard to achieve using a latent ODE with a unimodal Gaussian initial approximate posterior. We include additional visualizations in Appendix 9.10. 
7.3 Motion Capture Dataset 
To demonstrate that latent SDEs can learn complex dynamics from real-world datasets, we evaluated their predictive performance on a 50-dimensional motion capture dataset. The dataset, from Gan et al. [18], consists of 23 walking sequences of subject 35 partitioned into 16 training, 3 validation, and 4 test sequences. We follow the preprocessing of Wang et al. [85]. 
In designing the recognition network, we follow Yıldız et al. [90] and use a fully connected network to encode the first three observations of each sequence and thereafter predict the remaining sequence. This encoder is chosen for fair comparison to existing models, and could be extended to a recurrent or attention model [84]. The overall architecture is described in Appendix 9.11 and is similar to that of ODE2VAE [90], with a similar number of parameters. We also use a fixed step size that is 1/5 of the smallest interval between any two observations [90]. 
We train latent ODE and latent SDE models with the Adam optimizer [34] and its default hyperparameter settings, with an initial learning rate of 0.01 that is exponentially decayed with rate 0.999 during each iteration. We perform validation over the number of training iterations, KL penalty [29], and KL annealing schedule. All models were trained for at most 400 iterations, where we start to observe severe overfitting for most model instances. We report the test MSE on future observations following Yıldız et al. [90]. We believe that the improved performance is due to the strong regularization in path space, as removing the KL penalty improved training error but caused validation error to deteriorate. 
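
For reference, the stated learning-rate schedule amounts to the following one-line rule. This is a sketch of the arithmetic only; the function name is hypothetical and the optimizer itself is not shown.

```python
def learning_rate(iteration, lr0=0.01, decay=0.999):
    """Exponentially decayed learning rate: lr0 * decay ** iteration."""
    return lr0 * decay ** iteration


# After the maximum of 400 iterations the rate has dropped to roughly two thirds of lr0.
final_lr = learning_rate(400)
```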

Table 2: Test MSE on 297 future frames averaged over 50 samples. 95% confidence interval reported based on t-statistic results from [90]. 

<<TABLE>>

8 Discussion 

We presented a generalization of the adjoint sensitivity method to compute gradients through solutions of SDEs. In contrast to existing approaches, this method has nearly the same time and memory complexity as simply solving the SDE. We showed how our stochastic adjoint framework can be combined with a gradient-based stochastic variational inference scheme for training latent SDEs. 
It is worthwhile to mention that SDEs and the commonly used GP models define two distinct classes of stochastic processes, albeit having a nonempty intersection (e.g. Ornstein-Uhlenbeck processes fall under both). Computationally, the cost of fitting GPs lies in the matrix inversion, whereas the computational bottleneck of training SDEs is the sequential numerical solve. Empirically, another avenue of research is to reduce the variance of gradient estimates. In the future, we may adopt techniques such as control variates or antithetic paths. 
On the application side, our method opens up a broad set of opportunities for fitting any differentiable SDE model, such as Wright-Fisher models with selection and mutation parameters [15], derivative pricing models in finance, or infinitely-deep Bayesian neural networks [61]. In addition, the latent SDE model enabled by our framework can be extended to include domain knowledge and structural or stationarity constraints [48] in the prior process for specific applications. 
On the theory side, there remain fundamental questions to be answered. Convergence rates of numerical gradients estimated with general schemes are unknown. Additionally, since our analyses are based on strong orders of schemes, it is natural to question whether convergence results still hold when we consider weak errors, and moreover if the method could be reformulated more coherently with rough paths theory [47]. 

Acknowledgements 
We thank Yulia Rubanova, Danijar Hafner, Mufan Li, Shengyang Sun, Kenneth R. Jackson, Simo Särkkä, Daniel Lacker, and Philippe Casgrain for helpful discussions. We thank Çağatay Yıldız for helpful discussions regarding evaluation settings of the mocap task. We also thank Guodong Zhang, Kevin Swersky, Chris Rackauckas, and members of the Vector Institute for helpful comments on an early draft of this paper. 

References 
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation, pages 265–283, 2016. 
[2] R Adams. Sobolev Spaces. Academic Press, 1975. 
[3] Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, Arenberg Doctoral School, KU Leuven, 2013. 
[4] Joel Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019. 
[5] Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John S Shawe-Taylor. Variational inference for diffusion processes. In Advances in Neural Information Processing Systems, pages 17–24, 2008. 
[6] VI Arnold. Ordinary Differential Equations. The MIT Press, 1978. 
[7] Jonathan Baxter and Peter L Bartlett. Infinite-horizon gradient-based policy search. 2001. 
[8] Robert Brown. ... microscopical observations ... on the particles contained in the pollen of plants. The Philosophical Magazine, 4(21):161–173, 1828. 
[9] Pamela M Burrage, R Herdiana, and Kevin Burrage. Adaptive stepsize based on control theory for stochastic differential equations. Journal of Computational and Applied Mathematics, 170(2):317–336, 2004. 
[10] Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017. 
[11] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 
[12] Ricky Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018. 
[13] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. 
[14] Koen Claessen and Michał H Pałka. Splittable pseudorandom number generators using cryptographic hashing. In ACM SIGPLAN Notices, volume 48, pages 47–58. ACM, 2013. 
[15] Warren J Ewens. Mathematical population genetics 1: theoretical introduction, volume 27. Springer Science & Business Media, 2012. 
[16] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing, 2018. 
[17] Jessica G Gaines and Terry J Lyons. Variable step size control in the numerical solution of stochastic differential equations. SIAM Journal on Applied Mathematics, 57(5):1455–1484, 1997. 
[18] Zhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pages 2467–2475, 2015. 
[19] Mike Giles and Paul Glasserman. Smoking adjoints: Fast Monte Carlo greeks. Risk, 19(1):88–92, 2006. 
[20] Paul Glasserman and David D Yao. Some guidelines and guarantees for common random numbers. Management Science, 38(6):884–908, 1992. 
[21] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990. 
[22] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005. 
[23] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2019. 
[24] Narendra Gupta and Raman Mehra. Computational aspects of maximum likelihood estimation and reduction in sensitivity function calculations. IEEE Transactions on Automatic Control, 19(6):774–783, 1974. 
[25] Jung-Su Ha, Young-Jin Park, Hyeok-Joo Chae, Soon-Seo Park, and Han-Lim Choi. Adaptive path-integral autoencoders: Representation learning and planning for dynamical systems. In Advances in Neural Information Processing Systems, pages 8927–8938, 2018. 
[26] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017. 
[27] Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential gaussian process flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1812–1821, 2019. 
[28] Markus Heinonen, Cagatay Yildiz, Henrik Mannerström, Jukka Intosalmi, and Harri Lähdesmäki. Learning unknown ode models with gaussian processes. arXiv preprint arXiv:1803.04303, 2018. 
[29] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017. 
[30] Silvana Ilie, Kenneth R Jackson, and Wayne H Enright. Adaptive time-stepping for the strong numerical solution of stochastic differential equations. Numerical Algorithms, 68(4):791–812, 2015. 
[31] Mike Innes, Alan Edelman, Keno Fischer, Chris Rackauckas, Elliot Saba, Viral B Shah, and Will Tebbutt. Zygote: A differentiable programming system to bridge machine learning and scientific computing. arXiv preprint arXiv:1907.07587, 2019. 
[32] Junteng Jia and Austin R. Benson. Neural Jump Stochastic Differential Equations. arXiv e-prints, art. arXiv:1905.10403, May 2019. 
[33] Hilbert Johan Kappen and Hans Christian Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244–1266, 2016. 
[34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 
[35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 
[36] Genshiro Kitagawa and Will Gersch. Linear gaussian state space modeling. In Smoothness Priors Analysis of Time Series, pages 55–65. Springer, 1996. 
[37] Jack PC Kleijnen and Reuven Y Rubinstein. Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3):413–427, 1996. 
[38] Peter E Kloeden and Andreas Neuenkirch. The pathwise convergence of approximation schemes for stochastic differential equations. LMS Journal of Computation and Mathematics, 10:235–253, 2007. 
[39] Peter E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations, volume 23. Springer Science & Business Media, 2013. 
[40] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 
[41] Hiroshi Kunita. Stochastic Flows and Jump-Diffusions. Springer, 2019. 
[42] Harold Kushner and Paul G Dupuis. Numerical methods for stochastic control problems in continuous time, volume 24. Springer Science & Business Media, 2013. 
[43] Pierre L'Ecuyer and Gaétan Perron. On the convergence rates of ipa and fdc derivative estimators. Operations Research, 42(4):643–656, 1994. 
[44] Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017. 
[45] Xuanqing Liu, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural sde: Stabilizing neural ode networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019. 
[46] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017. 
[47] Terry J Lyons. Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2):215–310, 1998. 
[48] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015. 
[49] Dougal Maclaurin, David Duvenaud, M Johnson, and RP Adams. Autograd: Reverse-mode differentiation of native python. In ICML workshop on Automatic Machine Learning, 2015. 
[50] Isambi S Mbalawata, Simo Särkkä, and Heikki Haario. Parameter estimation in stochastic differential equations with markov chain monte carlo and non-linear kalman filtering. Computational Statistics, 28(3):1195–1223, 2013. 
[51] Grigori Noah Milstein and Michael V Tretyakov. Stochastic Numerics for Mathematical Physics. Springer Science & Business Media, 2013. 
[52] Grigorii Noikhovich Milstein. Numerical integration of stochastic differential equations, volume 313. Springer Science & Business Media, 1994. 
[53] Ivan Nourdin and Giovanni Peccati. Normal approximations with Malliavin calculus: from Stein's method to universality, volume 192. Cambridge University Press, 2012. 
[54] Daniel Ocone and Étienne Pardoux. A generalized Itô–Ventzell formula. Application to a class of anticipating stochastic differential equations. 25(1):39–71, 1989. 
[55] Bernt Øksendal. Stochastic Differential Equations. Springer, 2003. 
[56] Bernt Øksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013. 
[57] Manfred Opper. Variational inference for stochastic differential equations. Annalen der Physik, 531(3):1800233, 2019. 
[58] Etienne Pardoux and Shige Peng. Backward stochastic differential equations and quasilinear parabolic partial differential equations. In Stochastic Partial Differential Equations and Their Applications, pages 200–217. Springer, 1992. 
[59] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 
[60] Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212–1228, 1995. 
[61] Stefano Peluchetti and Stefano Favaro. Neural stochastic differential equations. arXiv preprint arXiv:1904.01681, 2019. 
[62] Shige Peng. A general stochastic maximum principle for optimal control problems. SIAM Journal on Control and Optimization, 28(4):966–979, 1990. 
[63] Shige Peng and Zhen Wu. Fully coupled forward-backward stochastic differential equations and applications to optimal control. SIAM Journal on Control and Optimization, 37(3):825–843, 1999. 
[64] Eckhard Platen. An introduction to numerical methods for stochastic differential equations. Acta Numerica, 8:197–246, 1999. 
[65] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018. 
[66] Christopher Rackauckas and Qing Nie. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical Systems. Series B, 22(7):2731, 2017. 
[67] Daniel Revuz and Marc Yor. Continuous martingales and Brownian motion, volume 293. Springer Science & Business Media, 2013. 
[68] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014. 
[69] L Chris G Rogers and David Williams. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus, volume 2. Cambridge University Press, 2000. 
[70] Andreas Rößler. Runge–Kutta methods for Stratonovich stochastic differential equation systems with commutative noise. Journal of Computational and Applied Mathematics, 164:613–627, 2004. 
[71] Andreas Rößler. Runge–Kutta methods for the strong approximation of solutions of stochastic differential equations. SIAM Journal on Numerical Analysis, 48(3):922–952, 2010. 
[72] Yulia Rubanova, Ricky TQ Chen, and David Duvenaud. Latent odes for irregularly-sampled time series. Neural Information Processing Systems, 2019. 
[73] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988. 
[74] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018. 
[75] Thomas Ryder, Andrew Golightly, A Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. arXiv preprint arXiv:1802.03335, 2018. 
[76] John K Salmon, Mark A Moraes, Ron O Dror, and David E Shaw. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 16. ACM, 2011. 
[77] Simo Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013. 
[78] Simo Särkkä and Arno Solin. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019. 
[79] Steven E Shreve. Stochastic calculus for finance II: Continuous-time models, volume 11. Springer Science & Business Media, 2004. 
[80] Evangelos Theodorou. Nonlinear stochastic control and information theoretic dualities: Connections, interdependencies and thermodynamic interpretations. Entropy, 17(5):3352–3375, 2015. 
[81] Ryan Turner, Marc Deisenroth, and Carl Rasmussen. State-space inference and learning with gaussian processes. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 868–875, 2010. 
[82] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019. 
[83] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. Proceedings of the Conference on Learning Theory, 2019. 
[84] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. 
[85] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007. 
[86] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017. 
[87] Magnus Wiktorsson et al. Joint characteristic function and simultaneous simulation of iterated Itô integrals for multiple independent brownian motions. The Annals of Applied Probability, 11(2):470–487, 2001. 
[88] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. 
[89] Jichuan Yang and Harold J Kushner. A monte carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991. 
[90] Çağatay Yıldız, Markus Heinonen, and Harri Lähdesmäki. Ode2vae: Deep generative second order odes with bayesian neural networks. arXiv preprint arXiv:1905.10994, 2019. 

9 Appendix 

Notation. For a fixed terminal time <<FORMULA>>, we denote by <<FORMULA>> the time horizon. Let <<FORMULA>> be the class of infinitely differentiable functions from $\mathbb{R}^d$ to itself. Let $C^{p,q}$ be the class of functions from <<FORMULA>> to <<FORMULA>> that are p and q times continuously differentiable in the first and second component, respectively. Let <<FORMULA>> be the subclass with bounded derivatives of all possible orders. For a positive integer m, we adopt the shorthand $[m] = \{1, 2, \ldots, m\}$. We denote the Euclidean norm of a vector v by |v|. For $f \in C^{p,q}$, we denote its Jacobian with respect to the first component by $\nabla f$. 

9.1 Proof of Theorem 3.1 
Proof of Theorem 3.1. We have <<FORMULA>>, where <<FORMULA>> is defined in (3). Now we take the gradient with respect to z on both sides. The solution is differentiable with respect to z and we may differentiate under the stochastic integral [41, Proposition 2.4.3]. Theorem 3.4.3 [41] is sufficient for the regularity conditions required. Since <<FORMULA>>, applying the Stratonovich version of Itô's formula to (4), we have (5). 

9.2 Proof of Theorem 3.3 
Proof of Theorem 3.3. By the triangle inequality, 

<<FORMULA>>

We show that both $I^{(1)}$ and $I^{(2)}$ converge to 0 in probability as <<FORMULA>>. For simplicity, we suppress z and W.
Bounding $I^{(1)}$. Let $\epsilon > 0$ be given. Since $G_h \to G$ in probability, there exist $M_1 > 0$ and $h_0 > 0$ such that <<FORMULA>>, <<FORMULA>>, for all <<FORMULA>>. 
By Lemma 2.1 (iv) of Ocone and Pardoux [54], which can be easily adapted to our context, there exists a positive random variable $C_1$, finite almost surely, such that <<FORMULA>>, and there exists $M_2 > 0$ such that <<FORMULA>>. Given $M_2$, there exists $h_1 > 0$ such that 
 
<<FORMULA>> 

Now, suppose <<FORMULA>>. Then, by the union bound, with probability at least 1, we have  

<<FORMULA>>

On this event, we have 

<<FORMULA>>            (1) 

Thus, we have shown that $I^{(1)}$ converges to 0 in probability as <<FORMULA>>.
Bounding <<FORMULA>>. The idea is similar. By condition (ii), we have

<<FORMULA>> 

in probability. Using this and condition (i), for given <<FORMULA>>, there exist <<FORMULA>> and <<FORMULA>> such that for all <<FORMULA>>, we have 

<<FORMULA>>

with probability at least 1. On this event, we have 

<<FORMULA>>

Thus <<FORMULA>> also converges to 0 in probability as <<FORMULA>>.

9.3 Euler-Maruyama Scheme satisfies Local Uniform Convergence 
Here we verify that the Euler-Maruyama scheme satisfies condition (ii) when d = 1. Our proof can be extended to the case where d > 1 assuming an $L^p$ estimate of the error; see the discussion after the proof of Proposition 9.1.
Proposition 9.1. Let $F_h(z)$ be the Euler-Maruyama discretization with mesh size h of F(z) for a 1-dimensional SDE. Then, for any compact <<FORMULA>>, we have 

<<FORMULA>>

Usual convergence results in stochastic numerics only control the error for a single fixed starting point. Here, we strengthen the result to local uniform convergence. Our main idea is to apply a Sobolev inequality argument [54, Part II]. To do so, we need some preliminary results about the Euler-Maruyama discretization of the original SDE and its derivative. We first recall a theorem characterizing the expected squared error for general schemes. 
Theorem 9.2 (Mean-square order of convergence [51, Theorem 1.1]). Let <<FORMULA>> be the solution to an Itô SDE, and <<FORMULA>> be a numerical discretization with fixed step size h, both of which are started at <<FORMULA>> and defined on the same probability space. Let the coefficients of the SDE be <<FORMULA>>. Furthermore, suppose that the numerical scheme has order of accuracy $p_1$ for the expectation of deviation and order of accuracy $p_2$ for the mean-square deviation. If <<FORMULA>> and <<FORMULA>>, then, for any <<FORMULA>>, and <<FORMULA>>
for a constant C that does not depend on h or z. 

We refer the reader to [51] for the precise definitions of orders of accuracy and the proof. Given this theorem, we establish an estimate regarding errors of the discretization and its derivative with respect to the initial position. 

Lemma 9.3. We have 

<<FORMULA>>,

where C1 is a constant independent of z and h. 

Proof of Lemma 9.3. Since the coefficients of the SDE are of class <<FORMULA>>, we may differentiate the SDE in z to get the SDE for the derivative $\nabla_z Z^z$ [41]. Specifically, letting <<FORMULA>>, we have 

<<FORMULA>>

Note that the augmented process $(F(z), \nabla_z F(z))$ satisfies an SDE with <<FORMULA>> coefficients. By the chain rule, one can easily show that the derivative of the Euler-Maruyama discretization $F_h(z)$ is the discretization of the derivative process $Y^z$. Thus, $(F_h(z), \nabla_z F_h(z))$ is simply the discretization of $(F(z), \nabla_z F(z))$.
Since the Euler-Maruyama scheme has orders of accuracy $(p_1, p_2) = (1.5, 1.0)$ [51, Section 1.1.5], by Theorem 9.2, we have 
 
<<FORMULA>>

for some constant C1 that does not depend on z or h. 
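
The chain-rule step in the proof above can be checked numerically. The following is a minimal sketch, assuming a scalar Itô SDE with hypothetical coefficients b(z) = sin z and sigma(z) = 0.5 cos z (these are illustrative choices, not taken from the paper). It runs the Euler-Maruyama recursion for the state together with the same recursion applied to the derivative process, and compares the latter with a finite-difference derivative of the discretized solution along a fixed Brownian path.

```python
import math
import random


def euler_maruyama_with_derivative(z0, b, db, sigma, dsigma, dws, h):
    """Euler-Maruyama for dZ = b(Z) dt + sigma(Z) dW, together with the same
    scheme applied to dY = b'(Z) Y dt + sigma'(Z) Y dW with Y_0 = 1."""
    z, y = z0, 1.0
    for dw in dws:
        # The right-hand side is evaluated first, so both updates use the old z.
        z, y = (
            z + b(z) * h + sigma(z) * dw,
            y + db(z) * y * h + dsigma(z) * y * dw,
        )
    return z, y


# Hypothetical smooth coefficients with bounded derivatives (assumption).
b, db = math.sin, math.cos
sigma = lambda z: 0.5 * math.cos(z)
dsigma = lambda z: -0.5 * math.sin(z)

rng = random.Random(0)
h, n_steps = 0.01, 100
dws = [rng.gauss(0.0, math.sqrt(h)) for _ in range(n_steps)]  # one fixed path

z0, eps = 0.3, 1e-6
_, y_T = euler_maruyama_with_derivative(z0, b, db, sigma, dsigma, dws, h)
f_plus, _ = euler_maruyama_with_derivative(z0 + eps, b, db, sigma, dsigma, dws, h)
f_minus, _ = euler_maruyama_with_derivative(z0 - eps, b, db, sigma, dsigma, dws, h)

# Central finite difference of the discretized solution with respect to z0.
fd_derivative = (f_plus - f_minus) / (2 * eps)
```

Here `y_T` and `fd_derivative` should agree up to finite-difference error, which illustrates that differentiating the discretization gives the discretization of the derivative process.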

We also recall a variant of the Sobolev inequality which we will apply for d = 1.
Theorem 9.4 (Sobolev inequality [2, Theorem 5.4.1.c]). For any p > d, there exists a universal constant $c_p$ such that 

<<FORMULA>>

where 

<<FORMULA>>

for all continuously differentiable <<FORMULA>>. 

Proof of Proposition 9.1. Define $H_h^\alpha : \Omega \times \mathbb{R} \to \mathbb{R}$, regarded as a random function <<FORMULA>>, by 

<<FORMULA>>

where <<FORMULA>> is a fixed constant. Since $H_h^\alpha$ is continuously differentiable a.s., by Theorem 9.4,

<<FORMULA>>, 

Without loss of generality, we may let the compact set be <<FORMULA>> where <<FORMULA>>. Then, 

<<FORMULA>>          (11) 

It remains to estimate <<FORMULA>>. Starting from the definition of <<FORMULA>>, a standard estimation yields 

<<FORMULA>> 

where $C_2$ is a deterministic constant depending only on $\alpha$ (but not z and h). 
Now we take expectation on both sides. By Lemma 9.3, we have 

<<FORMULA>>

where the last integral is finite since <<FORMULA>>. 

We have shown that <<FORMULA>>. Thus $\|H_h^\alpha\| \to 0$ in $L^2$, and hence also in probability, as <<FORMULA>>. From (11), we have that <<FORMULA>> converges to 0 in probability as <<FORMULA>>. 
It is clear from the above proof that we may generalize to the case where d > 1 and other numerical schemes if we can bound the expected $W^{1,p}$-norm of <<FORMULA>> in terms of z and h, for p > d, where $W^{1,p}$ here denotes the Sobolev space consisting of all real-valued functions on $\mathbb{R}^d$ whose weak derivatives are functions in $L^p$. For the Euler scheme and <<FORMULA>>, we need only bound the $L^p$ norm of the discretization error in terms of z and h for general p. To achieve this, we would need to make explicit the dependence on z for existing estimates (see e.g. [39, Chapter 10]). 
Generically extending the argument to other numerical schemes, however, is technically non-trivial. We plan to address this question in future research. 

9.4 Stochastic Adjoint has Commutative Noise when Original SDE has Diagonal Noise 
Recall the Stratonovich SDE (2) with drift and diffusion functions <<FORMULA>> governed by a set of parameters <<FORMULA>>. Consider the augmented state composed of the original state and parameters, $Y_t = (Z_t, \theta)$. The augmented state satisfies a Stratonovich SDE with the drift function <<FORMULA>> and diffusion functions <<FORMULA>> for <<FORMULA>>. By (5) and (6), the dynamics of the adjoint process of the augmented state are characterized by the backward SDE: 

<<FORMULA>> 

By the definitions of f and $g_i$, the Jacobian matrices $\nabla f(x, s)$ and $\nabla g_i(x, s)$ can be written as: 
 
<<FORMULA>>

Thus, we can write out the backward SDEs for the adjoint processes of the state and parameters separately: 

<<FORMULA>>

Now assume the original SDE has diagonal noise. Then, m = d and the Jacobian matrix $\nabla \sigma_i(z)$ can be written as: 

<<FORMULA>>

Consider the adjoint process for the augmented state along with the backward flow of the backward SDE (3). We write the overall state as <<FORMULA>>, where we abuse notation slightly to let <<FORMULA>> denote the backward flow process. Then, by (12) and (13), $\{X_t\}_{t \in \mathbb{T}}$ satisfies a backward SDE with a diffusion function that can be written as: 

<<FORMULA>>

Recall that an SDE with diffusion function <<FORMULA>> is said to satisfy the commutativity property [70] if 

<<FORMULA>>

for all $j_1, j_2 \in [m]$ and $k \in [d]$. When an SDE has commutative noise, the computationally intensive double Itô integrals (and the Lévy areas) need not be simulated, because the numerical scheme can take advantage of the following property of iterated integrals [30]: 

<<FORMULA>>

where the Brownian motion increment <<FORMULA>> for <<FORMULA>> can be easily sampled. To see that the diffusion function (14) indeed satisfies the commutativity condition (15), we consider several cases: 
<<FORMULA>> Both LHS and RHS are zero unless $j_1 = j_2 = k$, since for $\sigma_{i,j_2}(x)$ to be non-zero, <<FORMULA>>
Similar to the case above.
Write <<FORMULA>>, where <<FORMULA>>. Both LHS and RHS are zero unless <<FORMULA>>, since 

<<FORMULA>>

for <<FORMULA>> to be non-zero <<FORMULA>> or <<FORMULA>> and <<FORMULA>>.

Since in all scenarios, LHS = RHS, we conclude that the commutativity condition holds. 
Finally, we comment that the Milstein scheme for the stochastic adjoint of diagonal noise SDEs can be implemented such that, during each iteration of the backward solve, vjp is called only a number of times that is independent of the dimensionality of the original SDE. 
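
To illustrate why commutative noise removes the need to simulate Lévy areas, the sketch below assumes that the iterated-integral property referred to above is the standard identity $I_{(j_1,j_2)} + I_{(j_2,j_1)} = \Delta W^{(j_1)} \Delta W^{(j_2)}$ for $j_1 \neq j_2$; this is an assumption, since the formula itself is not reproduced here. It approximates the two double Itô integrals over a single solver step by left-point sums on a fine sub-grid (the step size and sub-grid resolution are illustrative values) and checks that their sum is determined by the Brownian increments alone, up to a cross term that vanishes as the sub-grid is refined.

```python
import math
import random


def iterated_ito_sums(dw1, dw2):
    """Left-point sums approximating I_(1,2) = int W1 dW2 and
    I_(2,1) = int W2 dW1 over one step, given fine sub-increments."""
    w1 = w2 = 0.0
    i12 = i21 = cross = 0.0
    for a, b in zip(dw1, dw2):
        i12 += w1 * b      # W1 evaluated at the left endpoint
        i21 += w2 * a      # W2 evaluated at the left endpoint
        cross += a * b     # discrete covariation term, tends to 0
        w1 += a
        w2 += b
    return i12, i21, cross, w1, w2


rng = random.Random(0)
h, n_sub = 0.1, 10_000     # one solver step of size h, finely subdivided
scale = math.sqrt(h / n_sub)
dw1 = [rng.gauss(0.0, scale) for _ in range(n_sub)]
dw2 = [rng.gauss(0.0, scale) for _ in range(n_sub)]

i12, i21, cross, w1, w2 = iterated_ito_sums(dw1, dw2)
# Discrete product rule: i12 + i21 + cross equals w1 * w2 exactly.
symmetric_part = i12 + i21
```

The individual values `i12` and `i21` depend on the whole sub-path, but `symmetric_part` is close to `w1 * w2`, which only needs the two increments over the step. A scheme for commutative noise only ever uses this symmetric combination.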

9.5 Background on Latent SDE 

Consider a filtered probability space <<FORMULA>>, where <<FORMULA>> is a finite time horizon. 
Recall the approximate posterior process that we intend to learn is governed by the SDE: 

<<FORMULA>>, (16) 

Suppose there exists a measurable function u(z, t) such that <<FORMULA>>, and <<FORMULA>> satisfies Novikov's condition, i.e. <<FORMULA>>. Novikov's condition ensures that the process 

<<FORMULA>>

is a P-martingale. By Girsanov Theorem II [56, Theorem 8.6.4], the process <<FORMULA>> is a Wiener process under the probability measure Q defined by 

<<FORMULA>>, 

Moreover, since a simple rewrite shows that 

<<FORMULA>>, (17) 

we conclude that the Q-law of (17) (or equivalently (16)) is the same as the P-law of the prior process. 

9.5.1 Deriving the Variational Bound 

Let xt1,...,xtN be observation data at times t1,...,tN , whose conditionals only depend on the respective latent states zt1,...,ztN . Since the Q-law of the approximate posterior is the same as the P-law of the prior, 

<<FORMULA>>

where the second line follows from the definition of Q and the third line follows from Jensen's inequality. In the last equality we used the fact that the Itô integral <<FORMULA>> is a martingale.

9.6 Stochastic Adjoint for Latent SDE 

Note that the variational free energy (10) can be derived from Girsanov's change of measure theorem [57]. To efficiently estimate this quantity and its gradient by Monte Carlo, we simplify the equation by noting that for a one-dimensional process <<FORMULA>> adapted to the filtration generated by a one-dimensional Wiener process <<FORMULA>>, if Novikov's condition [55] is satisfied, then the process defined by the Itô integral $\int V_s \, dW_s$ is a martingale [55]. Hence, <<FORMULA>>, and 

<<FORMULA>>

To simulate this quantity by Monte Carlo in the forward pass along with the original dynamics, we need only extend the original augmented state with an extra variable $L_t$ such that the new drift and diffusion functions for the new augmented state <<FORMULA>> are

<<FORMULA>>

By (7), the backward SDEs of the adjoint processes become 
 
<<FORMULA>>              (18) 

In this case, we need to simulate neither the backward SDE of the extra variable nor its adjoint. Moreover, when considered as a single system for the augmented adjoint state, the diffusion function of the backward SDE (18) satisfies the commutativity property (15). 
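
The forward-pass augmentation described in this section can be sketched as follows. This is a minimal scalar illustration, assuming that u is defined through diffusion * u = posterior drift - prior drift and that the extra variable has drift 0.5 * u^2 and zero diffusion. The function name `simulate_with_logqp`, the Euler-Maruyama solver, and the constant drifts in the usage example are hypothetical choices made for illustration only.

```python
import math
import random


def simulate_with_logqp(z0, post_drift, prior_drift, diffusion, T, n_steps, rng):
    """Euler-Maruyama on the posterior SDE while accumulating the extra state
    L_t = int_0^t 0.5 * u(Z_s, s)^2 ds, with u = (post_drift - prior_drift) / diffusion."""
    h = T / n_steps
    z, L, t = z0, 0.0, 0.0
    for _ in range(n_steps):
        g = diffusion(z, t)
        u = (post_drift(z, t) - prior_drift(z, t)) / g
        dw = rng.gauss(0.0, math.sqrt(h))
        L += 0.5 * u * u * h                 # extra variable: drift only, no noise
        z += post_drift(z, t) * h + g * dw   # original posterior dynamics
        t += h
    return z, L


rng = random.Random(0)
z_T, logqp = simulate_with_logqp(
    z0=0.0,
    post_drift=lambda z, t: 1.0,    # hypothetical constant posterior drift
    prior_drift=lambda z, t: 0.0,   # hypothetical constant prior drift
    diffusion=lambda z, t: 0.5,     # shared diffusion
    T=1.0,
    n_steps=100,
    rng=rng,
)
```

With these constant coefficients, u = 2 everywhere, so the accumulated value is 0.5 * 2^2 * 1 = 2 regardless of the sampled noise. With state-dependent drifts the value varies per path and is averaged over samples.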

9.7 Test Problems 

In the following, <<FORMULA>> and p are parameters of SDEs, and x0 is a fixed initial value. 

Example 1. 

<<FORMULA>>

Analytical solution: 

<<FORMULA>> 

Example 2. 

<<FORMULA>>

Analytical solution: 

<<FORMULA>>

Example 3. 

<<FORMULA>>

Analytical solution: 

<<FORMULA>>

In each numerical experiment, we duplicate the equation 10 times to obtain a system of SDEs where each dimension has its own parameter values, sampled from the standard Gaussian distribution and then passed through a sigmoid to ensure positivity. Moreover, we also sample the initial value for each dimension from a Gaussian distribution. 
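
The parameter sampling described above can be sketched as follows. The parameter names `alpha`, `beta`, and `p`, the function name, and the use of a standard Gaussian for the initial values are assumptions made for illustration; the text only states that parameters are standard Gaussian draws passed through a sigmoid and that initial values are Gaussian.

```python
import math
import random


def sample_test_problem(dim=10, seed=0):
    """Draw per-dimension positive parameters and initial values."""
    rng = random.Random(seed)

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    # One value in (0, 1) per dimension for each (hypothetically named) parameter.
    params = {
        name: [sigmoid(rng.gauss(0.0, 1.0)) for _ in range(dim)]
        for name in ("alpha", "beta", "p")
    }
    # Initial value per dimension; the Gaussian's mean and scale are assumed.
    x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return params, x0


params, x0 = sample_test_problem()
```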

<<FIGURE>>

Figure 7: (a-c) Example 1. (d-f) Example 3. 

9.8 Results for Examples 1 and 3 

9.9 Toy Datasets Configuration

9.9.1 Geometric Brownian Motion 
Consider a geometric Brownian motion SDE: 

<<FORMULA>>.

We use <<FORMULA>>, and <<FORMULA>> as the ground-truth model, where <<FORMULA>>. We sample 1024 time series, each of which is observed at intervals of 0.02 from time 0 to time 1. We corrupt this data using Gaussian noise with mean zero and standard deviation 0.01. 
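
A data-generation procedure of this kind can be sketched as below. The sketch assumes the Itô form dX = mu X dt + sigma X dW, for which the exact solution X_t = X_0 exp((mu - sigma^2 / 2) t + sigma W_t) is available. The values of `mu`, `sigma`, and the initial-value sampler in the usage example are hypothetical placeholders, not the ground-truth settings of the experiment; only the number of series, the observation interval, and the noise level are taken from the text.

```python
import math
import random


def sample_gbm_dataset(mu, sigma, x0_sampler, n_series=1024, dt=0.02,
                       t1=1.0, noise_std=0.01, seed=0):
    """Sample noisy observations of geometric Brownian motion on a regular grid."""
    rng = random.Random(seed)
    n_obs = int(round(t1 / dt)) + 1          # observations at 0, dt, ..., t1
    data = []
    for _ in range(n_series):
        x0, w, series = x0_sampler(rng), 0.0, []
        for k in range(n_obs):
            if k > 0:
                w += rng.gauss(0.0, math.sqrt(dt))   # Brownian increment
            t = k * dt
            clean = x0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w)
            series.append(clean + rng.gauss(0.0, noise_std))  # observation noise
        data.append(series)
    return data


# Hypothetical parameter values, for illustration only.
dataset = sample_gbm_dataset(
    mu=1.0, sigma=0.5, x0_sampler=lambda rng: 0.1 + 0.03 * abs(rng.gauss(0.0, 1.0))
)
```

Because the exact solution is used, the clean trajectories carry no discretization error; only the additive observation noise perturbs them.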
To recover the dynamics, we use a GRU-based [13] latent SDE model where the GRU has 1 layer and 100 hidden units, the prior and posterior drift functions are MLPs with 1 hidden layer of 100 units, and the diffusion function is an MLP with 1 hidden layer of 100 hidden units and the sigmoid activation applied at the end. The posterior drift function is time-inhomogeneous in the sense that, at each observation, it takes in a context vector of size 1 that the GRU outputs when run backwards in time, after it has processed all future observations. The decoder is a linear mapping from a 4-dimensional latent space to observation space. For all nonlinearities, we use the softplus function. We fix the observation model to be Gaussian with noise standard deviation 0.01. 
We optimize the model jointly with respect to the parameters of a Gaussian distribution for the initial latent state distribution, the prior and posterior drift functions, the diffusion function, the GRU encoder, and the decoder. We use a fixed discretization with step size of 0.01 in both the forward and backward pass. We use the Adam optimizer [34] with an initial learning rate of 0.01 that is decayed by a factor of 0.999 after each iteration. We use a linear KL annealing schedule over the first 50 iterations. 
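
The two schedules mentioned above can be written down in a few lines. This sketch assumes that the KL weight ramps linearly from 0 to 1 over the annealing window and stays at 1 afterwards; the endpoints of the ramp and the function names are assumptions, as the text only states that the schedule is linear over the first 50 iterations.

```python
def learning_rate(iteration, base_lr=0.01, decay=0.999):
    """Learning rate after `iteration` multiplicative decays."""
    return base_lr * decay ** iteration


def kl_weight(iteration, anneal_iters=50):
    """Linear KL annealing coefficient, assumed to ramp from 0 to 1."""
    return min(1.0, iteration / anneal_iters)
```

In a training loop, the KL term of the objective would be multiplied by `kl_weight(iteration)` and the optimizer step size set to `learning_rate(iteration)`.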
9.9.2 Stochastic Lorenz Attractor 

Consider a stochastic Lorenz attractor SDE with diagonal noise: 

<<FORMULA>>,

<<FORMULA>>, 

<<FORMULA>>. 

We use <<FORMULA>>, and (x0,y0,z0) sampled from the standard Gaussian distribution as the ground-truth model. We sample 1024 time series, each of which is observed at intervals of 0.025 from time 0 to time 1. We normalize these samples by their mean and standard deviation across each dimension and corrupt this data by Gaussian noise with mean zero and standard deviation 0.01. 
We use the same architecture and training procedure for the latent SDE model as in the geometric Brownian motion section, except that the diffusion function consists of four small neural networks, each for a single dimension of the latent SDE. 

9.10 Additional Visualization 

<<FIGURE>>

Figure 8: Additional visualizations of learned posterior and prior dynamics on the synthetic stochastic Lorenz attractor dataset. First row displays the true data and posterior reconstructions. Second row displays samples where the initial latent state for each trajectory is sampled independently. Third row displays samples where the initial latent state is sampled once and fixed to be the same for different trajectories. 

See Figure 8 for additional visualizations on the synthetic Lorenz attractor dataset. See Figure 9 for visualizations on the synthetic geometric Brownian motion dataset. We comment that for the second example, the posterior reconstructs the data well, and the prior process exhibits the behavior of the data. However, from the third row, we can observe that the prior process is learned such that most of the uncertainty is accounted for in the initial latent state. We leave the investigation of more interpretable prior processes for future work. 

<<FIGURE>>

Figure 9: Visualizations of learned posterior and prior dynamics on the synthetic geometric Brownian motion dataset. First row displays the true data and posterior reconstructions. Orange contour covers 95% of 512 samples. Second row displays samples where the initial latent state for each trajectory is sampled independently. Third row displays samples where the initial latent state is sampled once and fixed to be the same for different trajectories. 

9.11 Model Architecture for Learning from Motion Capture Dataset 
We use a latent SDE model with an MLP encoder which takes in the first three frames and outputs the mean and log-variance of the variational distribution of the initial latent state and a context vector. The decoder has an architecture similar to that of the ODE2VAE model [90] and projects the 6-dimensional latent state into the 50-dimensional observation space. The posterior drift function takes in a 3-dimensional context vector output by the encoder and the current state and time, whereas the prior drift only takes in the current state and time. The diffusion function is composed of multiple small neural nets, each producing a scalar for the corresponding dimension such that the posterior SDE has diagonal noise. We use the same observation likelihood as that of the ODE2VAE model [90]. We comment that the overall parameter count of our model (11605) is smaller than that of ODE2VAE for the same task (12157). 
The latent ODE baseline was implemented with a similar architecture, except it does not have the diffusion and prior drift components, and its vector field defining the ODE does not take in a context vector. Therefore, the model has slightly fewer parameters (10573) than the latent SDE model. See Figure 10 for overall details of the architecture. 
The main hyperparameter we tuned was the coefficient for reweighting the KL. For both the latent ODE and SDE, we considered training the model with a reweighting coefficient in {1, 0.1, 0.01, 0.001}, either with or without a linear KL annealing schedule that increased from 0 to the prescribed value over the first 200 iterations of training. 

9.12 Stochastic Adjoint Implementation 

We include the core implementation of the stochastic adjoint, assuming access to a callable Brownian motion bm, an Euler-Maruyama integrator ito_int_diag for diagonal noise SDEs, and several helper functions whose purposes can be inferred from their names. 
<<ALGORITHM>>
<|endoftext|>


<|startoftext|>
                        Scaling Laws for Neural Language Models


                                 Jared Kaplan                   Sam McCandlish 

                          Johns Hopkins University, OpenAI                OpenAI
                                jaredk@jhu.edu                  sam@openai.com


                Tom Henighan       Tom B. Brown     Benjamin Chess       Rewon Child
                      OpenAI            OpenAI           OpenAI            OpenAI
                 henighan@openai.com   tom@openai.com   bchess@openai.com   rewon@openai.com

                    Scott Gray        Alec Radford        Jeffrey Wu         Dario Amodei
                     OpenAI           OpenAI           OpenAI             OpenAI
                 scott@openai.com   alec@openai.com   jeffwu@openai.com   damodei@openai.com


                                               Abstract

                    We study empirical scaling laws for language model performance on the cross-entropy loss.
                    The loss scales as a power-law with model size, dataset size, and the amount of compute
                    used for training, with some trends spanning more than seven orders of magnitude. Other
                    architectural details such as network width or depth have minimal effects within a wide
                    range. Simple equations govern the dependence of overﬁtting on model/dataset size and the
                    dependence of training speed on model size. These relationships allow us to determine the
                    optimal allocation of a ﬁxed compute budget. Larger models are signiﬁcantly more sample-
                    efﬁcient, such that optimally compute-efﬁcient training involves training very large models
                    on a relatively modest amount of data and stopping signiﬁcantly before convergence.


                 Equal contribution.

              Contributions: Jared Kaplan and Sam McCandlish led the research. Tom Henighan contributed the LSTM
              experiments. Tom Brown, Rewon Child, Scott Gray, and Alec Radford developed the optimized Transformer
              implementation. Jeff Wu, Benjamin Chess, and Alec Radford developed the text datasets. Dario Amodei provided
              guidance throughout the project.

              Contents

              1 Introduction

              2 Background and Methods

              3 Empirical Results and Basic Power Laws

              4 Charting the Infinite Data Limit and Overfitting

              5 Scaling Laws with Model Size and Training Time

              6 Optimal Allocation of the Compute Budget

              7 Related Work

              8 Discussion

              Appendices

              A Summary of Power Laws

              B Empirical Model of Compute-Efficient Frontier

              C Caveats

              D Supplemental Figures


              1 Introduction

              Language provides a natural domain for the study of artiﬁcial intelligence, as the vast majority of reasoning
              tasks can be efﬁciently expressed and evaluated in language, and the world’s text provides a wealth of
              data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in
              language modeling, with state of the art models [RNSS18, DCLT18, YDY+19, LOG+19, RSR+19] approaching
              human-level performance on many specific tasks [WPN+19], including the composition of coherent
              multiparagraph prompted text samples [RWC+19].
              One might expect language modeling performance to depend on model architecture, the size of neural models,
              the computing power used to train them, and the data available for this training process. In this work we will
              empirically investigate the dependence of language modeling loss on all of these factors, focusing on the
              Transformer architecture [VSP+17, LSP+18]. The high ceiling and low floor for performance on language
              tasks allow us to study trends over more than seven orders of magnitude in scale.
               Throughout we will observe precise power-law scaling for performance as a function of training time,
               context length, dataset size, model size, and compute budget.

               1.1 Summary

              Our key findings for Transformer language models are as follows:

                 2 Here we display predicted compute when using a sufﬁciently small batch size. See Figure 13 for comparison to the
              purely empirical data.

                                                  <<FIGURE>>

              Figure 1 Language modeling performance improves smoothly as we increase the model size, dataset
              size, and amount of compute (see footnote 2) used for training. For optimal performance all three factors
              must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor
              when not bottlenecked by the other two.


              Performance depends strongly on scale, weakly on model shape: Model performance depends most
              strongly on scale, which consists of three factors: the number of model parameters N (excluding
              embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits,
              performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)

              Smooth power laws: Performance has a power-law relationship with each of the three scale factors
              N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude
              (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance
              must flatten out eventually before reaching zero loss. (Section 3)

              Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem,
              but enters a regime of diminishing returns if either N or D is held fixed while the other increases. The
              performance penalty depends predictably on the ratio N^0.74/D, meaning that every time we increase the
              model size 8x, we only need to increase the data by roughly 5x to avoid a penalty. (Section 4)

              Universality of training: Training curves follow predictable power-laws whose parameters are roughly
              independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the
              loss that would be achieved if we trained for much longer. (Section 5)

              Transfer improves with test performance: When we evaluate models on text with a different distribution
              than they were trained on, the results are strongly correlated to those on the training validation set with
              a roughly constant offset in the loss – in other words, transfer to a different distribution incurs a constant
              penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)

              Sample efﬁciency: Large models are more sample-efﬁcient than small models, reaching the same level of
              performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).

              Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions
              on the model size N or available data D, we attain optimal performance by training very large models
              and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would
              therefore be far more sample efficient than one might expect based on training small models to convergence,
              with data requirements growing very slowly as <<FORMULA>> with training compute. (Section 6)

              Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only,
              and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million
              tokens at convergence for the largest models we can train. (Section 5.1)

              Taken together, these results show that language modeling performance improves smoothly and predictably
              as we appropriately scale up model size, data, and compute. We expect that larger language models will
              perform better and be more sample efﬁcient than current models.
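
              The "8x model, roughly 5x data" rule of thumb in the overfitting finding above follows directly from holding
              the ratio N^0.74/D fixed. A small sketch of that arithmetic (the function name is hypothetical; only the
              exponent 0.74 comes from the text):

```python
def data_multiplier(model_multiplier, exponent=0.74):
    """Factor by which D must grow to keep N**exponent / D constant
    when N grows by model_multiplier."""
    return model_multiplier ** exponent


# Growing the model 8x calls for a bit under 5x more data.
factor_for_8x = data_multiplier(8.0)
```

              Because the exponent is below 1, the required dataset grows sublinearly with model size.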

                                                  <<FIGURE>>

              Figure 2 We show a series of language model training runs, with models ranging in size from 10^3 to 10^9
              parameters (excluding embeddings).

                                 <<FIGURE>>

              Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger
              models, using larger batches, and training for more steps. We illustrate this for a billion-fold increase in
              compute. For optimally compute-efﬁcient training, most of the increase should go towards increased model
              size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to
              increase parallelism through larger batch sizes, with only a very small increase in serial training time required.


              1.2 Summary of Scaling Laws

              The test loss of a Transformer trained to autoregressively model language can be predicted using a power law
              when performance is limited by only either the number of non-embedding parameters N, the dataset size D,
              or the optimally allocated compute budget C_min (see Figure 1):
                  1. For models with a limited number of parameters, trained to convergence on sufficiently large
                     datasets:
                       <<FORMULA>> (non-embedding parameters) (1.1)
                  2. For large models trained with a limited dataset with early stopping:
                                <<FORMULA>> (tokens) (1.2)
                  3. When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized
                     model, and a sufficiently small batch size (making optimal use of compute; see footnote 3):
                                 <<FORMULA>> (1.3)

                 3 We also observe an empirical power-law trend with the training compute C (Figure 1) while training at fixed batch
              size, but it is the trend with C_min that should be used to make predictions. They are related by equation (5.5).

                                                  <<FIGURE>>

              Figure 4 Left: The early-stopped test loss L(N, D) varies predictably with the dataset size D and model
              size N according to Equation (1.5). Right: After an initial transient period, learning curves for all model
              sizes N can be fit with Equation (1.6), which is parameterized in terms of S_min, the number of steps when
              training at large batch size (details in Section 5.1).


              These relations hold across eight orders of magnitude in C_min, six orders of magnitude in N, and over two
              orders of magnitude in D. They depend very weakly on model shape and other Transformer hyperparameters
              (depth, width, number of self-attention heads), with specific numerical values associated with the WebText2
              training set [RWC+19]. The power laws α_N, α_D, α_C^min specify the degree of performance improvement
              expected as we scale up N, D, or C_min; for example, doubling the number of parameters yields a loss that
              is smaller by a factor <<FORMULA>>. The precise numerical values of N_c, C_c^min, and D_c depend on the
              vocabulary size and tokenization and hence do not have a fundamental meaning.
              The critical batch size, which determines the speed/efﬁciency tradeoff for data parallelism ([MKAT18]), also
              roughly obeys a power law in L:

                                        <<FORMULA>>                       (1.4)

              Equations (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset
              size sublinearly according to <<FORMULA>>. In fact, we ﬁnd that there is a single equation combining
              (1.1) and (1.2) that governs the simultaneous dependence on N and D and governs the degree of overﬁtting:

                                               <<FORMULA>>                       (1.5)

              with fits pictured on the left in Figure 4. We conjecture that this functional form may also parameterize the
              trained log-likelihood for other generative modeling tasks.
              When training a given model for a finite number of parameter update steps S in the infinite data limit, after
              an initial transient period, the learning curves can be accurately fit by (see the right of Figure 4)
                                             
                                     <<FORMULA>>                        (1.6)
                                     
              where <<FORMULA>> and <<FORMULA>>, and S_min(S) is the minimum possible number of optimization steps
              (parameter updates) estimated using Equation (5.4).
              When training within a fixed compute budget C, but with no other constraints, Equation (1.6) leads to the
              prediction that the optimal model size N, optimal batch size B, optimal number of steps S, and dataset size
              D should grow as
                           <<FORMULA>>           (1.7)
              with
                                      <<FORMULA>>                   (1.8)
              which closely matches the empirically optimal results N ∝ C^0.73, B ∝ C^0.24, and S ∝ C^0.03. As the
               computational budget C increases, it should be spent primarily on larger models, without dramatic increases
               in training time or dataset size (see Figure 3). This also implies that as models grow larger, they become
               increasingly sample efﬁcient. In practice, researchers typically train smaller models for longer than would
               be maximally compute-efﬁcient because of hardware constraints. Optimal performance depends on total
              compute as a power law (see Equation (1.3)).
              We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve ﬁts and their
              implications for training time, and a breakdown of our results per token. We also make some brief comparisons
              to LSTMs and recurrent Transformers [DGV+18].

              1.3 Notation

              We use the following notation:
                   L – the cross entropy loss in nats. Typically it will be averaged over the tokens in a context, but in
                    some cases we report the loss for specific tokens within the context.
                   N – the number of model parameters, excluding all vocabulary and positional embeddings
                   C ≈ 6NBS – an estimate of the total non-embedding training compute, where B is the batch size,
                    and S is the number of training steps (i.e. parameter updates). We quote numerical values in PF-days,
                    where one PF-day = 10^15 × 24 × 3600 = 8.64 × 10^19 floating point operations.
                   D – the dataset size in tokens
                   B_crit – the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the
                    critical batch size provides a roughly optimal compromise between time and compute efficiency.
                   C_min – an estimate of the minimum amount of non-embedding compute to reach a given value of
                    the loss. This is the training compute that would be used if the model were trained at a batch size
                    much less than the critical batch size.
                   S_min – an estimate of the minimal number of training steps needed to reach a given value of the loss.
                    This is also the number of training steps that would be used if the model were trained at a batch size
                    much greater than the critical batch size.
                   α_X – power-law exponents for the scaling of the loss as <<FORMULA>> where X can be any of
                    <<FORMULA>>.
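
              The compute estimate and the PF-day unit defined above can be combined in a short sketch. The function name
              and the example run are hypothetical; the batch size is taken to be measured in tokens, which is an
              assumption consistent with the per-token compute estimate used later in the text.

```python
# One PF-day: 10^15 floating point operations per second sustained for one day.
PF_DAY_FLOPS = 1e15 * 24 * 3600  # = 8.64e19


def training_compute_pf_days(n_params, batch_tokens, steps):
    """Estimate C ~ 6 * N * B * S and express it in PF-days.

    n_params:     non-embedding parameter count N
    batch_tokens: batch size B in tokens (assumed unit)
    steps:        number of parameter updates S
    """
    flops = 6 * n_params * batch_tokens * steps
    return flops / PF_DAY_FLOPS


# Hypothetical run: 1.5e9 parameters, 2**19 tokens per batch, 1e5 steps.
example_pf_days = training_compute_pf_days(1.5e9, 2 ** 19, 1e5)
```

              For the hypothetical run above the estimate comes to a few PF-days.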

              2 Background and Methods

              We train language models on WebText2, an extended version of the WebText [RWC+19] dataset, tokenized
              using byte-pair encoding [SHB15] with a vocabulary size n_vocab = 50257. We optimize the autoregressive
              log-likelihood (i.e. cross-entropy loss) averaged over a 1024-token context, which is also our principal
              performance metric. We record the loss on the WebText2 test distribution and on a selection of other text
              distributions. We primarily train decoder-only [LSP+18, RNSS18] Transformer [VSP+17] models, though
              we also train LSTM models and Universal Transformers [DGV+18] for comparison.

               2.1 Parameter and Compute Scaling of Transformers

              We parameterize the Transformer architecture using hyperparameters n_layer (number of layers), d_model
              (dimension of the residual stream), d_ff (dimension of the intermediate feed-forward layer), d_attn (dimension of
              the attention output), and n_heads (number of attention heads per layer). We include n_ctx tokens in the input
              context, with n_ctx = 1024 except where otherwise noted.
              We use N to denote the model size, which we define as the number of non-embedding parameters

                        <<FORMULA>>              (2.1)

              where we have excluded biases and other sub-leading terms. Our models also have n_vocab d_model parameters
              in an embedding matrix, and use n_ctx d_model parameters for positional embeddings, but we do not include
              these when discussing the ‘model size’ N; we will see that this produces significantly cleaner scaling laws.
              Evaluating a forward pass of the Transformer involves roughly

                                      <<FORMULA>>                    (2.2)

               add-multiply operations, where the factor of two comes from the multiply-accumulate operation used in
               matrix multiplication. A more detailed per-operation parameter and compute count is included in Table 1.

                                                  <<TABLE>>

              Table 1 Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading
              terms such as nonlinearities, biases, and layer normalization are omitted.


              For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a
              relatively small fraction of the total compute. Since we primarily study models where d_model ≫ n_ctx/12,
              we do not include context-dependent terms in our training compute estimate. Accounting for the backwards
              pass (approximately twice the compute of the forwards pass), we then define the estimated non-embedding
              compute as <<FORMULA>> floating point operations per training token.
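
              The counting argument of this subsection can be sketched in code. Equations (2.1) and (2.2) are not legible
              in this text, so the sketch assumes the forms N ≈ 2 d_model n_layer (2 d_attn + d_ff), which reduces to
              12 n_layer d_model^2 when d_attn = d_ff/4 = d_model, and a forward cost of about 2N + 2 n_layer n_ctx d_model
              per token. The function names are hypothetical. As a sanity check, (n_layer, d_model) = (48, 1600) gives about
              1.5 billion parameters, in line with the largest model size quoted in Section 3.

```python
def non_embedding_params(n_layer, d_model, d_attn=None, d_ff=None):
    """Approximate non-embedding parameter count (biases and other sub-leading terms omitted).

    Assumed form: N ~ 2 * d_model * n_layer * (2 * d_attn + d_ff).
    Defaults follow the assumed standard shape d_attn = d_model, d_ff = 4 * d_model.
    """
    d_attn = d_model if d_attn is None else d_attn
    d_ff = 4 * d_model if d_ff is None else d_ff
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


def context_cost_ratio(n_layer, d_model, n_ctx=1024):
    """Ratio of the context-dependent forward cost to the 2N parameter cost per token.

    With the default shape this equals n_ctx / (12 * d_model).
    """
    n_params = non_embedding_params(n_layer, d_model)
    return (2 * n_layer * n_ctx * d_model) / (2 * n_params)


def train_flops_per_token(n_params):
    """Forward (2N) plus backward (about twice the forward) gives roughly 6N per token."""
    return 6 * n_params
```

              The ratio falls below 1 exactly when d_model > n_ctx/12, which is the condition stated above for neglecting
              the context-dependent terms.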

              2.2 Training Procedures

              Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed <<FORMULA>> steps with
              a batch size of 512 sequences of 1024 tokens. Due to memory constraints, our largest models (more than
              1B parameters) were trained with Adafactor [SS18]. We experimented with a variety of learning rates and
              schedules, as discussed in Appendix D.6. We found that results at convergence were largely independent of
              learning rate schedule. Unless otherwise noted, all training runs included in our data used a learning rate
              schedule with a 3000 step linear warmup followed by a cosine decay to zero.
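
              A minimal sketch of the default schedule described above (linear warmup for 3000 steps, then cosine decay to
              zero). The function name, the peak learning rate, and the total step count are hypothetical parameters; the
              text does not give a peak value, and the step count is not legible here.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps=3000):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

              The rate is zero at step 0, reaches `peak_lr` at the end of warmup, and returns to zero at `total_steps`.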

              2.3 Datasets

              We train our models on an extended version of the WebText dataset described in [RWC+19]. The original
              WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at
              least 3 karma. In the second version, WebText2, we added outbound Reddit links from the period of January
              to October 2018, also with a minimum of 3 karma. The karma threshold served as a heuristic for whether
              people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k
              python library. In total, the dataset consists of 20.3M documents containing 96 GB of text and <<FORMULA>>
              words (as defined by wc). We then apply the reversible tokenizer described in [RWC+19], which yields
              <<FORMULA>> tokens. We reserve <<FORMULA>> of these tokens for use as a test set, and we also test on
              similarly-prepared samples of Books Corpus [ZKZ+15], Common Crawl [Fou], English Wikipedia, and a collection
              of publicly-available Internet Books.

              3 Empirical Results and Basic Power Laws

              To characterize language model scaling we train a wide variety of models, varying a number of factors
              including:

                   Model size (ranging from 768 to 1.5 billion non-embedding parameters)
                   Dataset size (ranging from 22 million to 23 billion tokens)
                   Shape (including depth, width, attention heads, and feed-forward dimension)
                   Context length (1024 for most runs, though we also experiment with shorter contexts)
                   Batch size (2^19 for most runs, but we also vary it to measure the critical batch size)

                                                  <<FIGURE>>

              Figure 5 Performance depends very mildly on model shape when the total number of non-embedding
              parameters N is held fixed. The loss varies only a few percent over a wide range of shapes. Small differences
              in parameter counts are compensated for by using the fit to L(N) as a baseline. Aspect ratio in particular can
              vary by a factor of 40 while only slightly impacting performance; an (n_layer, d_model) = (6, 4288) model
              reaches a loss within 3% of the (48, 1600) model used in [RWC+19].

                    <<FIGURE>>

              Figure 6 Left: When we include embedding parameters, performance appears to depend strongly on the
              number of layers in addition to the number of parameters. Right: When we exclude embedding parameters,
              the performance of models with different depths converges to a single trend. Only models with fewer than 2
              layers or with extreme depth-to-width ratios deviate significantly from the trend.


              In this section we will display data along with empirically-motivated ﬁts, deferring theoretical analysis to
              later sections.

              3.1 Approximate Transformer Shape and Hyperparameter Independence

              Transformer performance depends very weakly on the shape parameters n_layer, n_heads, and d_ff when we hold
              the total non-embedding parameter count N fixed. To establish these results we trained models with fixed
              size while varying a single hyperparameter. This was simplest for the case of n_heads. When varying n_layer,
              we simultaneously varied d_model while keeping N ≈ 12 n_layer d_model^2 fixed. Similarly, to vary d_ff at fixed
              model size we also simultaneously varied the d_model parameter, as required by the parameter counts in Table
              1. Independence of n_layer would follow if deeper Transformers effectively behave as ensembles of shallower
              models, as has been suggested for ResNets [VWB16]. The results are shown in Figure 5.

              3.2 Performance with Non-Embedding Parameter Count N

              In Figure 6 we display the performance of a wide variety of models, ranging from small models with shape
              (n_layer, d_model) = (2, 128) through billion-parameter models, ranging in shape from (6, 4288) through
              (207, 768). Here we have trained to near convergence on the full WebText2 dataset and observe no
              overfitting (except possibly for the very largest models).
              As shown in Figure 1, we find a steady trend with non-embedding parameter count N, which can be fit to the
              first term of Equation (1.5), so that

                                            <<FORMULA>>                             (3.1)

                                                 <<FIGURE>>

                                                Figure 7


              To observe these trends it is crucial to study performance as a function of N; if we instead use the total
              parameter count (including the embedding parameters) the trend is somewhat obscured (see Figure 6). This
              suggests that the embedding matrix can be made smaller without impacting performance, as has been seen in
              recent work [LCG+19].
              Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets
              is also a power-law in N with nearly identical power, as shown in Figure 8.

              3.2.1 Comparing to LSTMs and Universal Transformers
              In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter
              count N. The LSTMs were trained with the same dataset and context length. We see from these figures
              that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match
              the Transformer performance for later tokens. We present power-law relationships between performance and
              context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability
              to quickly recognize patterns.
              We also compare the performance of standard Transformers to recurrent Transformers [DGV+18] in Figure
              17 in the appendix. These models re-use parameters, and so perform slightly better as a function of N, at the
              cost of additional compute per-parameter.

              3.2.2 Generalization Among Data Distributions
              We have also tested our models on a set of additional text data distributions. The test loss on these datasets
              as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2
              dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct
              parallel with the improvement on WebText2. We ﬁnd that generalization depends almost exclusively on the
              in-distribution validation loss, and does not depend on the duration of training or proximity to convergence.
              We also observe no dependence on model depth (see Appendix D.8).

              3.3 Performance with Dataset Size and Compute

              We display empirical trends for the test loss as a function of dataset size D (in tokens) and training compute
              C in Figure 1.
              For the trend with D we trained a model with <<FORMULA>> on fixed subsets of the WebText2
              dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be
              fit with a simple power law

                                           <<FORMULA>>                             (3.2)
                                        
              in the dataset size. The data and ﬁt appear in Figure 1.
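
As a hedged illustration of how such a fit can be obtained, the sketch below fits a power law by linear regression in log-log space. It assumes the functional form L(D) = (D_c/D)^alpha_D; the data are synthetic, and `true_alpha` and `true_dc` are hypothetical constants rather than the fitted values reported in this paper.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss = (x_c / x)**alpha by least squares in log-log space.

    Taking logs gives log(loss) = alpha*log(x_c) - alpha*log(x),
    which is a straight line with slope -alpha.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return alpha, x_c

# Synthetic example with hypothetical constants (not the paper's fit).
true_alpha, true_dc = 0.1, 1e13
d = np.logspace(7, 10, 8)              # dataset sizes in tokens
loss = (true_dc / d) ** true_alpha     # noiseless power law
alpha_hat, dc_hat = fit_power_law(d, loss)
```

On real measurements the points are noisy, so the recovered exponent depends on which range of D is included in the regression.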
              The total amount of non-embedding compute used during training can be estimated as C = 6NBS, where
              B is the batch size, S is the number of parameter updates, and the factor of 6 accounts for the forward and
              backward passes. Thus for a given value of C we can scan over all models with various N to find the model

                                                  <<FIGURE>>

              Figure 8 Left: Generalization performance to other data distributions improves smoothly with model size,
              with only a small and very slowly growing offset from the WebText2 training distribution. Right:
              Generalization performance depends only on training distribution performance, and not on the phase of training.
              We compare generalization of converged models (points) to that of a single large model (dashed curves) as it
              trains.


              with the best performance on step S = C/(6NB). Note that in these results the batch size B remains fixed for
              all models, which means that these empirical results are not truly optimal. We will account for this in later
              sections using an adjusted C_min to produce cleaner trends.
              The result appears as the heavy black line on the left-hand plot in Figure 1. It can be ﬁt with
                                                  
                                            <<FORMULA>>                             (3.3)

              The ﬁgure also includes images of individual learning curves to clarify when individual models are optimal.
              We will study the optimal allocation of compute more closely later on. The data strongly suggests that sample
              efﬁciency improves with model size, and we also illustrate this directly in Figure 19 in the appendix.
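
The compute estimate C = 6NBS used in this section can be sketched as follows. The conversion to PF-days uses 1 PF-day = 10^15 x 24 x 3600 floating point operations; the model size and step count in the example are hypothetical.

```python
PF_DAY_FLOPS = 1e15 * 24 * 3600  # one petaflop/s-day, in floating point operations

def train_compute(n_params, batch_tokens, steps):
    """Non-embedding training compute C = 6 * N * B * S, in FLOPs."""
    return 6.0 * n_params * batch_tokens * steps

def steps_for_budget(compute, n_params, batch_tokens):
    """Invert C = 6NBS for the number of parameter updates S."""
    return compute / (6.0 * n_params * batch_tokens)

# Hypothetical run: 1e8 non-embedding parameters, 2**19-token batches, 1e5 steps.
c = train_compute(1e8, 2**19, 1e5)
pf_days = c / PF_DAY_FLOPS
```

Scanning over N at fixed C then amounts to calling `steps_for_budget` for each candidate model and reading off its loss at that step.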

              4 Charting the Inﬁnite Data Limit and Overﬁtting

              In Section 3 we found a number of basic scaling laws for language modeling performance. Here we will
              study the performance of a model of size N trained on a dataset with D tokens while varying N and D
              simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling
              law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing
              size while keeping overﬁtting under control.

              4.1 Proposed L(N, D) Equation

              We have chosen the parameterization (1.5) (repeated here for convenience):
                                               
                                       <<FORMULA>>                        (4.1)

              using three principles:

                  1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The
                    parameterization of L(N, D) (and all models of the loss) must naturally allow for such a rescaling.
                  2. Fixing D and sending N → ∞, the overall loss should approach L(D). Conversely, fixing N and
                    sending D → ∞ the loss must approach L(N).
                  3. L(N, D) should be analytic at D = ∞, so that it has a series expansion in 1/D with integer powers.
                    Theoretical support for this principle is significantly weaker than for the first two.

              Our choice of L(N, D) satisfies the first requirement because we can rescale N_c, D_c with changes in the
              vocabulary. This also implies that the values of N_c, D_c have no fundamental meaning.
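
The first two principles can be checked numerically. The sketch below assumes that Equation (1.5) has the form L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D, which is not reproduced in this text, and uses illustrative constants that are not the fitted values of Table 2.

```python
import math

# Illustrative constants only; NOT the fitted values of Table 2.
ALPHA_N, ALPHA_D = 0.08, 0.10
N_C, D_C = 1e14, 1e13

def loss_nd(n, d, n_c=N_C, d_c=D_C):
    """Assumed form of Equation (1.5): [(N_c/N)^(a_N/a_D) + D_c/D]^a_D."""
    return ((n_c / n) ** (ALPHA_N / ALPHA_D) + d_c / d) ** ALPHA_D

def loss_n(n):
    """Infinite-data limit L(N) = (N_c/N)^a_N."""
    return (N_C / n) ** ALPHA_N

def loss_d(d):
    """Infinite-model limit L(D) = (D_c/D)^a_D."""
    return (D_C / d) ** ALPHA_D

# Principle 1: multiplying the loss by k is the same as rescaling N_c and D_c.
k = 2.0
rescaled = loss_nd(1e8, 1e9, n_c=k ** (1 / ALPHA_N) * N_C, d_c=k ** (1 / ALPHA_D) * D_C)

# Principle 2: each limit recovers the corresponding one-variable law.
limit_d_inf = loss_nd(1e8, math.inf)
limit_n_inf = loss_nd(math.inf, 1e9)
```

Under this assumed form, `rescaled` equals `k * loss_nd(1e8, 1e9)`, which is why the absolute values of N_c and D_c carry no fundamental meaning.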

                                                  <<FIGURE>>

              Figure 9 The early-stopped test loss L(N, D) depends predictably on the dataset size D and model size N
              according to Equation (1.5). Left: For large D, performance is a straight power law in N. For a smaller fixed
              D, performance stops improving as N increases and the model begins to overfit. (The reverse is also true,
              see Figure 4.) Right: The extent of overfitting depends predominantly on the ratio <<FORMULA>>, as predicted in
              Equation (4.3). The line is our fit to that equation.


               Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we
               expect that larger models should always perform better than smaller models. But with fixed finite D, we also
               do not expect any model to be capable of approaching the best possible loss (i.e. the entropy of text). Similarly,
               a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note
               that knowledge of L(N) at infinite D and L(D) at infinite N fully determines all the parameters in L(N, D).
               The third principle is more speculative. There is a simple and general reason one might expect overfitting
               to scale ∝ 1/D at very large D. Overfitting should be related to the variance or the signal-to-noise ratio
               of the dataset [AS17], and this scales as 1/D. This expectation should hold for any smooth loss function,
               since we expect to be able to expand the loss about the D → ∞ limit. However, this argument assumes that
               1/D corrections dominate over other sources of variance, such as the finite batch size and other limits on the
               efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability.
               Our third principle explains the asymmetry between the roles of N and D in Equation (1.5). Very similar
               symmetric expressions 4 are possible, but they would not have a 1/D expansion with integer powers, and
               would require the introduction of an additional parameter.
               In any case, we will see that our equation for L(N, D) fits the data well, which is the most important justification
               for our L(N, D).

               4.2 Results

               We regularize all our models with 10% dropout, and by tracking test loss and stopping once it is no longer
               decreasing. The results are displayed in Figure 9, including a ﬁt to the four parameters <<FORMULA>> in
              Equation (1.5):

                                <<TABLE>>

                                          Table 2 Fits to L(N, D)

              We obtain an excellent fit, with the exception of the runs where the dataset has been reduced by a factor of
              1024, to about <<FORMULA>> tokens. With such a small dataset, an epoch consists of only 40 parameter updates.
              Perhaps such a tiny dataset represents a different regime for language modeling, as overfitting happens very
              early in training (see Figure 16). Also note that the parameters differ very slightly from those obtained in
              Section 3, as here we are fitting the full L(N, D) rather than just L(N, ∞) or L(∞, D).
              To chart the borderlands of the infinite data limit, we can directly study the extent of overfitting. For all but
              the largest models, we see no sign of overfitting when training with the full 22B token WebText2 dataset,
              so we can take it as representative of D = ∞. Thus we can compare finite D to the infinite data limit by

              <<FORMULA>> For example, one might have used <<FORMULA>>, but this does not have a 1/D expansion.

                                                  <<FIGURE>>

              Figure 10 The critical batch size B_crit follows a power law in the loss as performance increases, and does
              not depend directly on the model size. We find that the critical batch size approximately doubles for every
              13% decrease in loss. B_crit is measured empirically from the data shown in Figure 18, but it is also roughly
              predicted by the gradient noise scale, as in [MKAT18].


               deﬁning
                                                  <<FORMULA>>                      (4.2)

               and studying it as a function of N, D. In fact, we see empirically that δL depends only on a specific combination
               of N and D, as shown in Figure 16. This follows from the scaling law of Equation (1.5), which implies
                                            
                                                        <<FORMULA>>                    (4.3)

              Note that at large D this formula also has a series expansion in powers of 1/D.
              We estimate that the variation in the loss with different random seeds is roughly <<FORMULA>>, which means that to
              avoid overﬁtting when training to within that threshold of convergence we require

                                           <<FORMULA>>                         (4.4)

               With this relation, models smaller than 10^9 parameters can be trained with minimal overfitting on the 22B
               token WebText2 dataset, but our largest models will encounter some mild overfitting. More generally, this
               relation shows that dataset size may grow sub-linearly in model size while avoiding overfitting. Note however
               that this does not typically represent maximally compute-efficient training. We should also emphasize that
               we have not optimized regularization (e.g. the dropout probability) while varying dataset and model size.
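
The sub-linear growth of the required dataset size can be made concrete with a short sketch. It assumes that the overfitting measure of Equation (4.2) is the fractional excess loss L(N, D)/L(N, ∞) − 1 and that Equation (1.5) has the form [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D, so that the excess is (1 + (N/N_c)^(alpha_N/alpha_D) D_c/D)^alpha_D − 1. The constants and the tolerance `tol` are illustrative, not the fitted values.

```python
ALPHA_N, ALPHA_D = 0.08, 0.10   # illustrative, not the Table 2 fit
N_C, D_C = 1e14, 1e13           # illustrative, not the Table 2 fit

def overfit_fraction(n, d):
    """Fractional excess loss L(N, D) / L(N, inf) - 1 under the assumed form."""
    x = (n / N_C) ** (ALPHA_N / ALPHA_D) * D_C / d
    return (1.0 + x) ** ALPHA_D - 1.0

def tokens_needed(n, tol):
    """Smallest D that keeps the excess loss at or below tol for model size n."""
    x_max = (1.0 + tol) ** (1.0 / ALPHA_D) - 1.0
    return (n / N_C) ** (ALPHA_N / ALPHA_D) * D_C / x_max

d_small = tokens_needed(1e8, 0.02)   # hypothetical tolerance of 0.02
d_large = tokens_needed(1e9, 0.02)
growth = d_large / d_small           # equals 10 ** (ALPHA_N / ALPHA_D)
```

With these illustrative exponents a 10x larger model needs roughly 6.3x more data, i.e. D grows as N^(alpha_N/alpha_D) with an exponent below one.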

               5 Scaling Laws with Model Size and Training Time

               In this section we will demonstrate that a simple scaling law provides a good description for the loss as a
               function of model size N and training time. First we will explain how to use the results of [MKAT18] to
               deﬁne a universal training step S_min , which accounts for the fact that most of our models have not been
               trained at an optimal batch size. Then we will demonstrate that we can ﬁt the model size and training time
               dependence of the loss using Equation (1.6). Later we will use these results to predict the optimal allocation
               of training compute between model size and training time, and then conﬁrm that prediction.

               5.1 Adjustment for Training at B_crit (L)

               A simple empirical theory for the batch size dependence of training was developed in [MKAT18] (see also
               [SLA+18, ZLN+19]). It was argued that there is a critical batch size B_crit for training; for B up to B_crit
               the batch size can be increased with very minimal degradation in compute-efficiency, whereas for <<FORMULA>> increases in
               B result in diminishing returns. It was also argued that the gradient noise scale provides a simple
               prediction for B_crit , and that neither depends directly on model size except through the value of the loss that
              has been attained. These results can be used to predict how training time and compute will vary with the
              batch size. To utilize both training time and compute as effectively as possible, it is best to train with a batch
              size <<FORMULA>>. Training at <<FORMULA>> minimizes the number of training steps, while <<FORMULA>> minimizes
               the use of compute.
               More specifically, it was demonstrated that for a wide variety of neural network tasks, the number of training
               steps S and the number of data examples processed E = BS satisfy the simple relation

                                        <<FORMULA>>                     (5.1)
                                        
               when training to any fixed value of the loss L. Here S_min is the minimum number of steps necessary to reach
               L, while E_min is the minimum number of data examples that must be processed.
               We demonstrate the relation (5.1) for Transformers in Figure 18 in the appendix. This relation deﬁnes the
               critical batch size                          
               
               <<FORMULA>>                           (5.2)

              which is a function of the target value of the loss. Training at the critical batch size makes a roughly optimal
              time/compute tradeoff, requiring 2S_min training steps and processing <<FORMULA>> data examples.
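
A minimal sketch of this tradeoff, assuming that relation (5.1) has the form (S/S_min − 1)(E/E_min − 1) = 1 and that the critical batch size is B_crit = E_min/S_min (neither formula is reproduced in this text). The values of `s_min` and `e_min` are hypothetical.

```python
def steps_and_examples(batch, s_min, e_min):
    """Steps S and examples E = B*S needed at batch size `batch`.

    Solves the assumed relation (S/S_min - 1) * (E/E_min - 1) = 1
    together with E = B*S, using B_crit = E_min / S_min.
    """
    b_crit = e_min / s_min
    steps = s_min * (1.0 + b_crit / batch)
    examples = e_min * (1.0 + batch / b_crit)
    return steps, examples

# Hypothetical values of S_min and E_min for some target loss.
s_min, e_min = 1.0e4, 1.0e9
b_crit = e_min / s_min
s, e = steps_and_examples(b_crit, s_min, e_min)   # train exactly at B_crit
```

At B = B_crit the sketch returns S = 2 S_min, matching the step count quoted above; much larger batches approach S_min steps at growing data cost, and much smaller ones approach E_min examples at growing step count.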
               In Figure 10 we have plotted the critical batch size and gradient noise scale 5 as a function of training loss for
               two different models. We see that B_crit(L) is independent of model size, and only depends on the loss L. So
               the predictions of [MKAT18] continue to hold for Transformer language models. The critical batch size can
               be fit with a power law in the loss

               <<FORMULA>>                           (5.3)

              where <<FORMULA>> and <<FORMULA>>.

              We have chosen this parameterization for B_crit(L) because as the loss approaches its minimum value L_min,
               the gradient noise scale is expected to diverge, and we expect B_crit to track this noise scale. We do not
               know L_min, as we see no sign that our models are approaching it, but L_min > 0 since the entropy of natural
               language is non-zero. Since apparently L_min is much smaller than the values of L we have achieved, we used
               a parameterization where B_crit diverges as L → 0.
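
The caption of Figure 10 states that the critical batch size approximately doubles for every 13% decrease in loss. Assuming Equation (5.3) has the power-law form B_crit(L) = B_*/L^(1/alpha_B), that statement fixes the exponent, as the sketch below shows. `B_STAR` is a hypothetical scale chosen for illustration.

```python
import math

# "Doubles for every 13% decrease in loss": 0.87 ** (-1 / alpha_B) = 2.
ALPHA_B = -math.log(0.87) / math.log(2.0)
B_STAR = 2.0e8   # hypothetical scale in tokens, for illustration only

def critical_batch(loss):
    """Assumed form of Equation (5.3): B_crit(L) = B_* / L**(1/alpha_B)."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

# Ratio of critical batch sizes after a 13% decrease from a loss of 3.0.
ratio = critical_batch(0.87 * 3.0) / critical_batch(3.0)
```

Because the doubling rule in the caption is approximate, the exponent obtained this way (about 0.2) is only a rough estimate.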
               We will use B_crit(L) to estimate the relation between the number of training steps S while training at batch
               size B = 2^19 tokens and the number of training steps while training at <<FORMULA>>. This is simply

                                          <<FORMULA>>         (5.4)
                                         
               for any given target value L for the loss. This also defines a critical value of the compute needed to train to L
               with a model of size N if we were to train at <<FORMULA>>. This is

                                         <<FORMULA>>        (5.5)
                                          
               where <<FORMULA>> estimates the (non-embedding) compute used at batch size B.
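
The two adjustments can be sketched as follows. Equations (5.4) and (5.5) are not reproduced in this text; the code assumes the forms S_min = S/(1 + B_crit/B) and C_min = C/(1 + B/B_crit), which follow from a relation of the form (S/S_min − 1)(E/E_min − 1) = 1 with B_crit = E_min/S_min and C = 6NBS. The model size, step count and value of B_crit are hypothetical.

```python
def adjusted_steps(steps, batch, b_crit):
    """Assumed Equation (5.4): S_min = S / (1 + B_crit/B)."""
    return steps / (1.0 + b_crit / batch)

def adjusted_compute(compute, batch, b_crit):
    """Assumed Equation (5.5): C_min = C / (1 + B/B_crit)."""
    return compute / (1.0 + batch / b_crit)

BATCH = 2**19                      # tokens per batch, as used in the text
n_params, steps = 1.0e8, 1.0e5     # hypothetical model size and step count
b_crit = 4.0 * BATCH               # hypothetical B_crit at the target loss
compute = 6.0 * n_params * BATCH * steps
s_min = adjusted_steps(steps, BATCH, b_crit)
c_min = adjusted_compute(compute, BATCH, b_crit)
```

In this example the run used a batch four times smaller than B_crit, so it took five times more steps than S_min but only 25% more compute than C_min.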

               5.2 Results for <<FORMULA>> and Performance with Model Size and Compute

               Now we will use  S_min  deﬁned in Equation (5.4) to obtain a simple and universal ﬁt for the dependence of the
               loss on model size and training time in the inﬁnite data limit. We will ﬁt the stable, Adam-optimized training
               runs using Equation (1.6), repeated here for convenience:
                                               
                                     <<FORMULA>>                       (5.6)

               for the loss. We include all training steps after the warmup period of the learning rate schedule, and ﬁnd a ﬁt
               to the data with the parameters:
                 5 Although the critical batch size roughly matches the gradient noise scale, we are using direct measurements of
               B_crit from Figures 18 and 10 for all our later analyses.

                                                  <<FIGURE>>

              Figure 11 When we hold either total compute or number of training steps fixed, performance follows
              L(N, S) from Equation (5.6). Each value of compute budget has an associated optimal model size that
              maximizes performance. Mediocre fits at small S are unsurprising, as the power-law equation for the learning
              curves breaks down very early in training.

                                 <<TABLE>>

                                          Table 3 Fits to L(N, S)


              With these parameters, we obtain the learning curve ﬁts in Figure 4. Though the ﬁts are imperfect, we believe
              they are quite compelling given the simplicity of Equation (5.6).
              The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we
              study the test loss as a function of model size while fixing either the total non-embedding compute C used
              in training, or the number of steps S. For the fits we use Equations (5.5) and (5.4) along with the parameters
              above and Equation (5.6).
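
The existence of an optimal model size at fixed compute can be illustrated with a small scan. This is a sketch under stated assumptions: Equation (5.6) is taken to have the form L(N, S_min) = (N_c/N)^alpha_N + (S_c/S_min)^alpha_S, the batch size is held fixed so that S_min is approximated by C/(6NB) without the B_crit adjustment, and all constants are illustrative rather than the fitted values of Table 3.

```python
import numpy as np

# Illustrative constants only; NOT the fitted values of Table 3.
ALPHA_N, ALPHA_S = 0.08, 0.75
N_C, S_C = 1e14, 2e3

def loss_ns(n, s_min):
    """Assumed form of Equation (5.6): (N_c/N)^a_N + (S_c/S_min)^a_S."""
    return (N_C / n) ** ALPHA_N + (S_C / s_min) ** ALPHA_S

def best_model_size(compute, batch, sizes):
    """Scan candidate sizes at fixed compute, taking S_min = C / (6 N B)."""
    steps = compute / (6.0 * sizes * batch)
    losses = loss_ns(sizes, steps)
    i = int(np.argmin(losses))
    return sizes[i], losses[i]

sizes = np.logspace(5, 11, 601)                  # candidate parameter counts
n_opt, l_opt = best_model_size(1e19, 2**19, sizes)
n_big, l_big = best_model_size(1e21, 2**19, sizes)
```

Small models are limited by the first term and large models by the second (too few steps), so the minimum lies strictly inside the scanned range and moves to larger N as the budget grows.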
              The power-law dependence of the loss on S_min  reﬂects the interplay of optimizer dynamics and the loss
               landscape. Since the ﬁts are best late in training, when the loss may be approximately quadratic, the power-
               law should provide information about the spectrum of the Hessian of the loss. Its universality suggests that
               the Hessian eigenvalue density is roughly independent of model size.

               5.3 Lower Bound on Early Stopping Step

               The results for <<FORMULA>> can be used to derive a lower bound (and rough estimate) of the step at which
               early stopping should occur when training is data limited. It is motivated by the idea that finite and infinite D
              learning curves for a given model will be very similar until we reach <<FORMULA>>. Thus overfitting should
              be proportional to the correction from simply ending training at S_stop. This will underestimate S_stop, because
               in reality the test loss will decrease more slowly when we have a finite D, and therefore we will require more
              training steps to reach the optimal test loss at finite D. This line of reasoning leads to the inequality

                                                       <<FORMULA>>                          (5.7)

              where L(N, ∞) is the converged loss, evaluated with infinite available data. This inequality and its
              comparison to the empirical data are displayed in Figure 16 in the appendix. In that figure, the values of
              S_stop and L(N, D) are empirical (though S_stop is adjusted to mimic training at <<FORMULA>>), while L(N, ∞) is
              computed from the fit to L(N, D) evaluated at D = ∞.


               6 Optimal Allocation of the Compute Budget

               We displayed the empirical trend of performance as a function of the computation used during training in
               the top-right of Figure 1. However, this result involved training at a fixed batch size B, whereas we know

                                                  <<FIGURE>>

              Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger
              or smaller models can be trained with minimal additional compute. Right: Models larger than the
              compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional
              parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the
              power-law region of the learning curve, after initial transient effects.


                                   <<FIGURE>>

              Figure 13 When adjusting performance to simulate training far below the critical batch size, we ﬁnd a
              somewhat altered power law for L(C_min) when compared with the fully empirical results. The conspicuous
              lump at <<FORMULA>> PF-days marks the transition from 1-layer to 2-layer networks; we exclude 1-layer networks
              in the power-law ﬁts. It is the L(C_min) trend that we expect to provide a reliable extrapolation for larger
              compute.


              that in fact we could train more efﬁciently 6 by training at the batch size B_crit  discussed in Section 5.1.
              Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively,
              and correcting for this inefﬁciency by standardizing to the critical batch size results in cleaner and more
              predictable trends.
              In this section we will adjust for this oversight. More importantly, we will use the results of Section 5
              to determine the optimal allocation of compute between model size N and the quantity of data processed
              during training, namely <<FORMULA>>. We will determine this allocation both empirically and theoretically, by
               using the equation for <<FORMULA>>, and we will demonstrate that these methods agree.

               6.1 Optimal Performance and Allocations

               Let us ﬁrst study the loss as a function of the optimally allocated compute from Equation (5.5). The result is
               plotted in Figure 13, along with a power-law ﬁt. We see that as compared to the compute plot of Figure 1, the
               new ﬁt with C_min is somewhat improved.
               Given L(C_min), it is natural to ask for the optimal model size N(C_min) that provides the minimal loss with a
              given quantity of training compute. The optimal model size is shown in Figure 14. We observe that N(C_min)

                 6 One might ask why we did not simply train at B_crit  in the ﬁrst place. The reason is that it depends not only on the
              model but also on the target value of the loss we wish to achieve, and so is a moving target.

                                                  <<FORMULA>>

              Figure 14 Left: Each value of the compute budget C_min has an associated optimal model size N. Optimal
              model size grows very rapidly with C_min, increasing by 5x for each 10x increase in compute. The number
              of data examples processed makes up the remainder of the increase, growing relatively modestly by only 2x.
              Right: The batch-adjusted number of optimization steps also grows very slowly, if at all, meaning that most
              of the growth in data examples processed can be used for increased batch sizes.


               can be ﬁt very well with a power-law

                                          <<FORMULA>>                        (6.1)
                                          
              In Figure 12, we show the effect of training models of sub-optimal sizes (see Appendix B.4).
              By definition <<FORMULA>>, and so we can use <<FORMULA>> to extract further results. In particular, since
              prior fits show <<FORMULA>> and <<FORMULA>>, we can conclude that <<FORMULA>>. This leads us to conclude that
              the optimal number of steps will only grow very slowly with compute, as

                                             <<FORMULA>>;                          (6.2)

               matching the empirical results in Figure 14. In fact the measured exponent is sufﬁciently small that our results
               may even be consistent with an exponent of zero.

               Thus we conclude that as we scale up language modeling with an optimal allocation of computation, we
               should predominantly increase the model size N, while simultaneously scaling up the batch size via <<FORMULA>>
               with negligible increase in the number of serial steps. Since compute-efficient training uses relatively
               few optimization steps, additional work on speeding up early training dynamics may be warranted.
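
The caption of Figure 14 quotes growth factors per 10x increase in compute; the arithmetic below converts them into approximate power-law exponents. It assumes that compute is proportional to the product of model size and data examples processed, so the two exponents must sum to one. Because the factors 5x and 2x are rounded, the resulting exponents are only approximate.

```python
import math

def exponent_from_growth(factor, per=10.0):
    """Exponent p such that y = x**p grows by `factor` when x grows by `per`."""
    return math.log(factor) / math.log(per)

p_model = exponent_from_growth(5.0)   # optimal model size N versus compute
p_data = exponent_from_growth(2.0)    # data examples processed versus compute
```

This gives an exponent of roughly 0.7 for model size and 0.3 for data processed, which is the sense in which most additional compute should go into a larger model.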

               6.2 Predictions from <<FORMULA>>

               The results for <<FORMULA>> and the allocations can be predicted from the <<FORMULA>> equation obtained in
               Section 5. Given our equation for <<FORMULA>>, we can substitute <<FORMULA>> and then find the minimum
              of the loss as a function of N, while fixing the training compute. We carry out this procedure in detail in
              Appendix B, where we also provide some additional predictions.
              For the loss as a function of training compute, we predict that
                                                  
                                          <<FORMULA>>                             (6.3)

              in excellent agreement with the exponent of Figure 13. We also predict that

                                    <<FORMULA>>                 (6.5)

              which also matches the scaling of Figure 14 to within a few percent. Our scaling laws provide a predictive
              framework for the performance of language modeling.

                                                  <<FIGURE>>

              Figure 15 Far beyond the model sizes we study empirically, we find a contradiction between our equations
              for <<FORMULA>> and L(D) due to the slow growth of data needed for compute-efficient training. The intersection
              marks the point before which we expect our predictions to break down. The location of this point is highly
              sensitive to the precise exponents from our power-law fits.


              6.3 Contradictions and a Conjecture

              We observe no signs of deviation from straight power-law trends at large values of compute, data, or model
              size. Our trends must eventually level off, though, since natural language has non-zero entropy.
              Indeed, the trends for compute-efficient training described in this section already contain an apparent
              contradiction. At scales several orders of magnitude above those documented here, the performance predicted by
              the <<FORMULA>> scaling law decreases below what should be possible given the slow growth in training data with
              compute. This implies that our scaling laws must break down before this point, but we conjecture that the
              intersection point has a deeper meaning: it provides an estimate of the point at which Transformer language
              models reach maximal performance.
              Since the amount of data used by compute-efficient training grows slowly with the compute budget, the
              performance predicted by <<FORMULA>> eventually hits a lower bound set by the L(D) power law (see Figure 15).
              Let us work this out in more detail.
              To keep overﬁtting under control, the results of Section 4 imply that we should scale the dataset size as

                                            <<FORMULA>>                         (6.6) 

              where we have used the compute-efﬁcient <<FORMULA>> from Figure 14.
               Let us compare this to the data requirements of compute-efﬁcient training. If we train at the critical batch
               size (i.e. <<FORMULA>>) and never re-use data during training, we ﬁnd that data usage grows with compute as

                                      <<FORMULA>>                      (6.7)
                                         
              This is the maximum rate at which the dataset size can productively grow with compute, since it means that
              we are only training for a single epoch. But it grows the dataset much more slowly than in Equation (6.6).
              It appears to imply that compute-efﬁcient training will eventually run into a problem with overﬁtting, even if
              the training process never re-uses any data!
              According to Figure 1, we expect that when we are bottlenecked by the dataset size (i.e. by overfitting), the
              loss should scale as <<FORMULA>>. This implies that the loss would scale with compute as <<FORMULA>>
              once we are data-limited. Once again, we have a contradiction, as this will eventually intersect with
              our prediction for <<FORMULA>> from Figure 13, where we found a scaling <<FORMULA>>.
              The intersection point of <<FORMULA>> and <<FORMULA>> occurs at

                  <<FORMULA>>                    (6.8)

              though the numerical values are highly uncertain, varying by an order of magnitude in either direction
              depending on the precise values of the exponents from the power-law fits. The most obvious interpretation is
              that our scaling laws break down at or before we reach this point, which is still many orders of magnitude
              away in both compute and model size.
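
The sensitivity of the intersection point to the fitted exponents can be illustrated with two generic power laws. The sketch below is not a reproduction of Equation (6.8): the scales and exponents are hypothetical and were chosen only to show how a small change in one exponent moves the crossing.

```python
import math

def intersection(c1, p1, c2, p2):
    """Return x where (c1/x)**p1 == (c2/x)**p2, for p1 != p2.

    Taking logs: p1*(ln c1 - ln x) = p2*(ln c2 - ln x), so
    ln x = (p1*ln c1 - p2*ln c2) / (p1 - p2).
    """
    log_x = (p1 * math.log(c1) - p2 * math.log(c2)) / (p1 - p2)
    return math.exp(log_x)

# Hypothetical constants, chosen only to illustrate the sensitivity.
x_star = intersection(1e8, 0.050, 1e10, 0.030)
x_shifted = intersection(1e8, 0.050, 1e10, 0.033)   # 10% change in one exponent
```

With these numbers a 10% change in the smaller exponent moves the crossing by close to an order of magnitude, because the difference of two similar exponents appears in the denominator.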
              One might also conjecture that this intersection point has a deeper meaning. If we cannot increase the model
              size beyond N* without qualitatively different data requirements, perhaps this means that once we reach
              C*_min and N*, we have extracted all of the reliable information available in natural language data. In this
              interpretation, L* would provide a rough estimate for the entropy-per-token 7 of natural language. In this
              scenario, we would expect the loss trend to level off at or before L*.
              We can guess at the functional form of <<FORMULA>> as it levels off by considering a version of our training
              dataset with added noise. For example, we could append a random string of tokens to each context shown
              to the model to artificially boost the loss by a constant additive factor. Then, the distance from the noise
              floor L − L_noise would be a more meaningful performance metric, with even a small decrease in this distance
              potentially representing a significant boost in qualitative performance. Since the artificial noise would affect
              all of our trends equally, the critical point of Equation (6.8) would not change (aside from the absolute value of L*), and
              may be meaningful even if it occurs after the leveling off.

              7 Related Work

              Power laws can arise from a wide variety of sources [THK18]. Power-law scalings with model and dataset
              size in density estimation [Was06] and in random forest models [Bia12] may be connected with our results.
              These models suggest that power-law exponents may have a very rough interpretation as the inverse of the
              number of relevant features in the data.
              Some early [BB01, Goo01] work found power-law scalings between performance and dataset size. More
              recent work [HNA+17, HAD19] also investigated scaling between model size and data size; their work is
              perhaps the closest to ours in the literature 8. Note, however, that [HNA+17] found super-linear scaling of
              dataset size with model size, whereas we ﬁnd a sub-linear scaling. There are some parallels between our
              ﬁndings on optimal allocation of compute and [Kom19], including power-law learning curves. EfﬁcientNets
              [TL19] also appear to obey an approximate power-law relation between accuracy and model size. Very recent
              work [RRBS19b] studies scaling with both dataset size and model size for a variety of datasets, and ﬁts an
              ansatz similar to ours.
              EfﬁcientNet [TL19] advocates scaling depth and width exponentially (with different coefﬁcients) for optimal
              performance of image models, resulting in a power-law scaling of width as a function of depth. We ﬁnd that
              for language models this power should be roughly one when scaling up (as width/depth should remain ﬁxed).
              But more importantly, we ﬁnd that the precise architectural hyperparameters are unimportant compared to the
              overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles
              of shallower models, which could potentially explain this ﬁnding. Earlier work [ZK16] has compared width
              and depth, and found that wide ResNets can outperform deep ResNets on image classiﬁcation. Some studies
              ﬁx computation per data example, which tends to scale in proportion to the number of model parameters,
              whereas we investigate scaling with both model size and the quantity of training computation.
              Various works [AS17, BHMM18] have investigated generalization in highly overparameterized models, finding
              a “jamming transition” [GJS + 19] when the model size reaches the dataset size (this may require training
              many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do
              not observe such a transition, and ﬁnd that the necessary training data scales sublinearly in the model size.
              Expansions in the model size, particularly at large width [JGH18, LXS + 19], may provide a useful framework
              for thinking about some of our scaling relations. Our results on optimization, such as the shape of learning
              curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions
              [ZLN + 19] in realistic settings. Making this connection quantitative will require a characterization of the
              Hessian spectrum [Pap18, GKX19, GARD18].

              8 Discussion

              We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter
              count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and
              (1.6). Conversely, we ﬁnd very weak dependence on many architectural and optimization hyperparameters.
              Since scalings with <<FORMULA>> are power-laws, there are diminishing returns with increasing scale.

                 7 Defining words using the wc utility, the WebText2 dataset has 1.4 tokens per word and <<FORMULA>> characters per token.
                 8 After this work was completed, [RRBS19a] also appeared, which makes similar predictions for the dependence of
              loss on both model and dataset size.
              We were able to precisely model the dependence of the loss on N and D, and alternatively on N and S, when
               these parameters are varied simultaneously. We used these relations to derive the compute scaling, magnitude
               of overﬁtting, early stopping step, and data requirements when training large language models. So our scaling
               relations go beyond mere observation to provide a predictive framework. One might interpret these relations
               as analogues of the ideal gas law, which relates the macroscopic properties of a gas in a universal way,
               independent of most of the details of its microscopic constituents.
               It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a
               maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to
               test these relations on other domains, such as images, audio, and video models, and perhaps also for random
               network distillation. At this point we do not know which of our results depend on the structure of natural
               language data, and which are universal. It would also be exciting to ﬁnd a theoretical framework from
               which the scaling relations can be derived: a ‘statistical mechanics’ underlying the ‘thermodynamics’ we
               have observed. Such a theory might make it possible to derive other more precise predictions, and provide a
               systematic understanding of the limitations of the scaling laws.
               In the domain of natural language, it will be important to investigate whether continued improvement on the
               loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major
               qualitative improvements: “more is different”. For example, the smooth aggregate growth of the economy
               provides no indication of the speciﬁc technological developments that underwrite it. Similarly, the smooth
               improvements in language model loss may hide seemingly qualitative changes in capability.
               Our results strongly suggest that larger models will continue to perform better, and will also be much more
               sample efﬁcient than has been previously appreciated. Big models may be more important than big data.
               In this context, further investigation into model parallelism is warranted. Deep models can be trained using
               pipelining [HCC + 18], which splits parameters depth-wise between devices, but eventually requires increased
               batch sizes as more devices are used. Wide networks on the other hand are more amenable to parallelization
               [SCP + 18], since large layers can be split between multiple workers with less serial dependency. Sparsity
               [CGRS19, GRK17] or branching (e.g. [KSH12]) may allow for even faster training of large networks through
               increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train,
               it might be possible to remain on the compute-efﬁcient frontier for an entire training run.

               Acknowledgements

               We would like to thank Shan Carter, Paul Christiano, Jack Clark, Ajeya Cotra, Ethan Dyer, Jason Eisner,
               Danny Hernandez, Jacob Hilton, Brice Menard, Chris Olah, and Ilya Sutskever for discussions and for
               feedback on drafts of this work.


                                                  Appendices


              A Summary of Power Laws

              For easier reference, we provide a summary below of the key trends described throughout the paper.

                    <<TABLE>>

                                                Table 4

              The empirical ﬁtted values for these trends are:

                                <<TABLE>>

                                                Table 5

              The optimal parameters for compute efﬁcient training are given by:

                          <<TABLE>>

                                                Table 6


              B Empirical Model of Compute-Efﬁcient Frontier

              Throughout this appendix all values of C, S, and α_C are adjusted for training at the critical batch size B_crit.
              We have left off the ‘adj’ label to avoid cluttering the notation.

              B.1 Deﬁning Equations

              The power-law ﬁt to the learning curves implies a simple prescription for compute-efﬁcient training. In this
              appendix, we will derive the optimal performance, model size, and number of training steps as a function of
              the compute budget. We start with Equation (1.6), repeated here for convenience:
                                               
                                      <<FORMULA>>                    (B.1)

               Here, S represents the number of parameter updates when training at the critical batch size [MKAT18],
               which was defined in Equation (5.2) (see footnote 9):

                                                    <<FORMULA>>                         (B.2)

              We would like to determine optimal training parameters for a fixed compute budget, so we replace S =
              <<FORMULA>>, where C is the number of FLOPs used in the training run:

                                  <<FORMULA>>               (B.3)

              Now, we set ∂L/∂N|_C = 0 to find the condition for optimality:
                                     
                     <<FORMULA>>                                       (B.4)
                         
               Equations (B.3) and (B.4) together determine the compute-efficient frontier.

               B.2 Efﬁcient Training

               Now we assemble the implications of (B.3) and (B.4). First, note that inserting (B.4) into (B.3) yields

                                                <<FORMULA>>                 (B.5)

              which implies that for compute-efficient training, we should train to a fixed percentage α_N/α_S ≈ 10% above the converged loss.
              Next, let’s determine how the optimal loss depends on the compute budget. Eliminating
              N yields a power-law dependence of performance on compute:
                                                  
                                            <<FORMULA>>                            (B.6)
              where we deﬁned

                                 <<FORMULA>>                    (B.7)
                                            
                                 <<FORMULA>>              (B.8)

              Similarly, we can eliminate L to find N(C):

                                          <<FORMULA>>                       (B.9)

               and
                                              
                                  <<FORMULA>>                             (B.10) 
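
              The exponents in this subsection can be sanity-checked numerically. The sketch below is illustrative
              only. It assumes exponent values alpha_N = 0.076, alpha_S = 0.76, and alpha_B = 0.21 (the fitted values
              are not reproduced in this appendix text, so treat them as assumptions), and it assumes that along the
              frontier C = 6NBS with N, S, and B each a power law in L, so that the reciprocal exponents add. All
              function names are hypothetical.

```python
# Sanity check of the compute-efficient exponents.
# ASSUMPTION: these exponent values are illustrative inputs, not taken from this section.
ALPHA_N = 0.076  # model-size exponent
ALPHA_S = 0.76   # training-step exponent
ALPHA_B = 0.21   # critical-batch-size exponent


def convergence_fraction(alpha_n: float, alpha_s: float) -> float:
    """Fraction above the converged loss at which efficient training stops."""
    return alpha_n / alpha_s


def compute_exponent(alpha_n: float, alpha_s: float, alpha_b: float) -> float:
    """Exponent alpha_C in L(C) ~ C**(-alpha_C).

    Assumes C = 6*N*B*S with N ~ L**(-1/alpha_N), S ~ L**(-1/alpha_S) and
    B ~ L**(-1/alpha_B), so C ~ L**(-(1/alpha_N + 1/alpha_S + 1/alpha_B)).
    """
    return 1.0 / (1.0 / alpha_n + 1.0 / alpha_s + 1.0 / alpha_b)


def frontier_exponents(alpha_n: float, alpha_s: float, alpha_b: float):
    """Exponents (p_N, p_S) in N(C) ~ C**p_N and S(C) ~ C**p_S."""
    alpha_c = compute_exponent(alpha_n, alpha_s, alpha_b)
    return alpha_c / alpha_n, alpha_c / alpha_s


if __name__ == "__main__":
    print("fraction above converged loss:", convergence_fraction(ALPHA_N, ALPHA_S))
    print("alpha_C:", compute_exponent(ALPHA_N, ALPHA_S, ALPHA_B))
    print("(p_N, p_S):", frontier_exponents(ALPHA_N, ALPHA_S, ALPHA_B))
```

              Under these assumed inputs the stopping fraction is close to 10% and p_N is roughly ten times p_S, which
              is the sense in which most additional compute should go into model size rather than training steps.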

                 9 There is a slight ambiguity here: we can imagine training either at a constant batch size <<FORMULA>>, or we could
               instead train at a variable batch size B̃(L), where B̃ is the instantaneous critical batch size (as opposed to B, which is
               the averaged version). These two prescriptions result in the same number of steps, so we can ignore this subtlety (see
               [MKAT18]).

               B.3 Comparison to Inefficient

              Typically, researchers train models until they appear to be close to convergence. In this section, we compare
              the efficient training procedure described above to this more typical setup. We define the convergence factor
              f as the percent deviation from the converged loss:

                                        <<FORMULA>>                     (B.11)

               For compute-efﬁcient training we have <<FORMULA>> from the previous section, but researchers
              typically use a much smaller value. Here, we choose f′ = 2% as an estimate. For a fixed value of the loss,
              we predict:
                                             <<FORMULA>>                     (B.12)
                                                  
                                          <<FORMULA>>                     (B.13)
                                                  
                                         <<FORMULA>>                          (B.14) 

              Thus compute-efficient training uses 7.7x fewer parameter updates, 2.7x more parameters, and 65% less
              compute to reach the same loss.
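
              The three ratios above can be reproduced approximately with a few lines of arithmetic. This is a sketch
              under stated assumptions, not the authors' code: it assumes the loss has the two-term form
              L = (N_c/N)^alpha_N + (S_c/S)^alpha_S, so that at convergence factor f the first term equals L/(1+f) and
              the second equals L f/(1+f), and it assumes alpha_N = 0.076 and alpha_S = 0.76. At a fixed loss the
              critical batch size is fixed, so compute scales as the product N·S.

```python
# Compare compute-efficient training (f ~ 10%) with near-convergence training (f' ~ 2%)
# at the same target loss.
# ASSUMPTION: L = (N_c/N)**alpha_N + (S_c/S)**alpha_S and the exponent values below.


def efficiency_ratios(f_eff: float, f_typ: float, alpha_n: float, alpha_s: float):
    """Return (N ratio, S ratio, C ratio) of efficient vs. typical training.

    At fixed L: (N_c/N)**alpha_N = L/(1+f)     ->  N ~ (1+f)**(1/alpha_N)
                (S_c/S)**alpha_S = L*f/(1+f)   ->  S ~ (1+1/f)**(1/alpha_S)
    """
    n_ratio = ((1 + f_eff) / (1 + f_typ)) ** (1 / alpha_n)
    s_ratio = ((1 + 1 / f_eff) / (1 + 1 / f_typ)) ** (1 / alpha_s)
    c_ratio = n_ratio * s_ratio  # batch size is fixed at fixed L, so C ~ N * S
    return n_ratio, s_ratio, c_ratio


if __name__ == "__main__":
    n_ratio, s_ratio, c_ratio = efficiency_ratios(0.10, 0.02, alpha_n=0.076, alpha_s=0.76)
    print(f"parameters: {n_ratio:.2f}x, steps: {s_ratio:.3f}x, compute: {c_ratio:.2f}x")
```

              With these inputs the model is a bit under 3x larger, the step count falls to roughly an eighth, and total
              compute drops to roughly a third, consistent with the figures quoted above up to rounding of the
              assumed exponents.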

              B.4 Suboptimal Model Sizes

              We can solve (B.1) to find an expression for the amount of compute needed to reach a given value of the loss
              L with a model of size N:
                                        
                                <<FORMULA>>            (B.15)

              Using (B.6) and (B.9), we can eliminate L in favor of N_eff(L), the model size which reaches L most efficiently.
              From there, we ﬁnd an expression for the excess compute needed as a consequence of using a suboptimal
              model size:                       

                 <<FORMULA>>                     (B.16)
                 
              The result is shown in Figure X. Models between 0.6x and 2.2x the optimal size can be used with only a
              20% increase in compute budget. Using a smaller model is useful when accounting for the cost of inference. A
              larger model can be trained to the same level of performance in fewer steps, allowing for more parallelism
              and faster training if sufficient hardware is available (see Figure Y):

                                          <<FORMULA>>             (B.17) 

               A 2.2x larger model requires 45% fewer steps at a cost of 20% more training compute. Note that this equation
               should not be trusted for very large models, as it is only valid in the power-law region of the learning curve
               after initial transient effects.
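
              The trade-off described in this subsection can be explored numerically. The sketch below assumes the same
              two-term loss L = (N_c/N)^alpha_N + (S_c/S)^alpha_S, the efficient-training condition
              L = (1 + alpha_N/alpha_S)(N_c/N_eff)^alpha_N, and the exponent values alpha_N = 0.076 and alpha_S = 0.76.
              These are assumptions made for illustration, and the function names are hypothetical.

```python
# Excess compute and step count when training a model of size N instead of the
# compute-efficient size N_eff, at a fixed target loss.
# ASSUMPTION: two-term power-law loss and the exponent values passed in below.


def step_ratio(size_ratio: float, alpha_n: float, alpha_s: float) -> float:
    """S(N) / S(N_eff) for size_ratio = N / N_eff."""
    bracket = 1 + (alpha_s / alpha_n) * (1 - size_ratio ** (-alpha_n))
    if bracket <= 0:
        # The model is too small to reach the target loss at any number of steps.
        raise ValueError("model too small to reach the target loss")
    return bracket ** (-1 / alpha_s)


def excess_compute(size_ratio: float, alpha_n: float, alpha_s: float) -> float:
    """C(N) / C(N_eff): compute scales as N * S at a fixed loss."""
    return size_ratio * step_ratio(size_ratio, alpha_n, alpha_s)


if __name__ == "__main__":
    for ratio in (0.6, 1.0, 2.2):
        print(ratio,
              round(excess_compute(ratio, 0.076, 0.76), 3),
              round(step_ratio(ratio, 0.076, 0.76), 3))
```

              Under these assumptions both 0.6x and 2.2x the efficient size stay within about 20% extra compute, and the
              2.2x model needs a little over half as many steps, matching the numbers quoted above.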

               C Caveats

               In this section we list some potential caveats to our analysis.

                   At present we do not have a solid theoretical understanding for any of our proposed scaling laws.
                    The scaling relations with model size and compute are especially mysterious. It may be possible to
                    understand scaling at very large D holding model size fixed [AS17], and also the shape of learning
                    curves late in training, by modeling the loss with a noisy quadratic. But the scaling with D at very
                    large model size still remains mysterious. Without a theory or a systematic understanding of the
                    corrections to our scaling laws, it’s difficult to determine in what circumstances they can be trusted.

                                                  <<FIGURE>>

              Figure 16 Left: We characterize the step on which early stopping occurs, as a function of the extent of
              overfitting. The red line indicates a lower bound for early stopping that is derived in Section 5.3. Right:
              We display train and test loss for a series of 300M parameter models trained on different sized dataset
              sub-samples. The test loss typically follows that of a run done with unrestricted data until diverging. Note that the
              degree of overfitting (as compared to the infinite data limit) is significantly overestimated by L_test − L_train
              (denoted by a black bar for each run).


                   We are not especially confident in the prediction of B_crit(L) for values of the loss far outside the
                    range we have explored. Changes in B_crit could have a significant impact on trade-offs between
                    data parallelism and the number of serial training steps required, which would have a major impact
                    on training time.
                   We did not thoroughly investigate the small data regime, and our fits for L(N, D) were poor for
                    the smallest values of D (where an epoch corresponded to only 40 steps). Furthermore, we did
                    not experiment with regularization and data augmentation. Improvements in these could alter our
                    results, quantitatively or qualitatively.
                   We used the estimated training compute <<FORMULA>>, which did not include contributions proportional
                    to nctx (see Section 2.1). So our scalings with compute may be confounded in practice in the
                    regime of very large nctx, specifically where nctx ≳ 12 d_model.
                   We tuned learning rates, and we experimented with learning rate schedules. But we may have
                    neglected to tune some hyperparameters (e.g. initialization scale or momentum) that have an important
                    effect on scaling.
                   The optimal choice of learning rate is sensitive to the target loss. When training close to convergence,
                    it may be necessary to use a smaller learning rate to avoid divergences. But when conducting a short
                    training run (e.g. due to compute limitations), it may be possible to use a larger learning rate. We did
                    not experiment with higher learning rates for training runs that did not proceed to convergence.
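
              The caveat about context length can be made concrete with a per-token FLOP count. The sketch below
              assumes the usual approximations N ≈ 12 n_layer d_model^2 for non-embedding parameters (attention
              dimension equal to d_model, feed-forward dimension 4 d_model) and a forward cost of about
              2N + 2 n_layer nctx d_model FLOPs per token. These counts are our reading of the estimates referred
              to in Section 2.1 and should be treated as assumptions; the function names are hypothetical.

```python
# How large is the context-dependent part of the forward pass relative to the
# parameter-dependent part?
# ASSUMPTION: N ~ 12 * n_layer * d_model**2 and
#             forward FLOPs per token ~ 2*N + 2 * n_layer * n_ctx * d_model.


def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count of a standard Transformer."""
    return 12 * n_layer * d_model ** 2


def forward_flops_per_token(n_layer: int, d_model: int, n_ctx: int):
    """Return (parameter term, context term) of the per-token forward cost."""
    param_term = 2 * non_embedding_params(n_layer, d_model)
    context_term = 2 * n_layer * n_ctx * d_model
    return param_term, context_term


def context_fraction(n_layer: int, d_model: int, n_ctx: int) -> float:
    """Share of forward FLOPs that the 2N-per-token estimate leaves out."""
    param_term, context_term = forward_flops_per_token(n_layer, d_model, n_ctx)
    return context_term / (param_term + context_term)


if __name__ == "__main__":
    # Hypothetical model shape: 12 layers, d_model = 768.
    for n_ctx in (1024, 12 * 768, 8 * 12 * 768):
        print(n_ctx, context_fraction(12, 768, n_ctx))
```

              The ratio of the two terms is nctx / (12 d_model), independent of depth, so the neglected term matches
              the parameter term exactly when nctx = 12 d_model and dominates beyond that.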

              D Supplemental Figures

              D.1 Early Stopping and Test vs Train

              In Section 5.3 we described the result shown in Figure 16, which provides a prediction for a lower bound on
              the early stopping step. We also show the train and test loss for a given model size when training on different
              sized datasets.

              D.2 Universal Transformers

              We compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure 17.
              These models re-use parameters, and so perform slightly better as a function of N, but slightly worse as a
              function of compute C. We include several different possibilities for parameter re-use.

              D.3 Batch Size

              We measure the critical batch size using the data displayed in Figure 18. This made it possible to estimate
              B_crit(L) in Figure 10.

                                                  <<FIGURE>>

              Figure 17 We compare recurrent Transformers [DGV + 18], which re-use parameters, to standard
              Transformers. Recurrent Transformers perform slightly better when comparing models with equal parameter count,
              but slightly worse when accounting for reuse and comparing per FLOP.

                     <<FIGURE>>

              Figure 18 These figures demonstrate fits to Equation (5.1) for a large number of values of the loss L, and
              for two different Transformer model sizes. These fits were used to measure B_crit(L) for Figure 10.


              D.4 Sample Efﬁciency vs Model Size

              It is easy to see from Figure 2 that larger models train faster, and are therefore more sample efficient. We
              provide another way of looking at this phenomenon in Figure 19, which shows when different models reach
              various ﬁxed values of the loss.

                                        <<FIGURE>>

              Figure 19 The minimum number of serial steps needed to reach any fixed value of the test loss decreases
              precipitously with model size. Sample efficiency (shown here for training far below the critical batch size)
              improves greatly as well, improving by a factor of almost 100 when comparing the smallest possible model
              to a very large one.

                                        <<FIGURE>>

              Figure 20 This figure provides information about the performance per token as a function of model size
              and training time. Left: Loss per token as a function of its position T in the 1024-token context. Loss scales
              predictably as a power-law in T. Right: Test loss per token as a function of training step.

                                                        <<FIGURE>>

              Figure 21 In addition to the averaged loss, individual tokens within the 1024-token context also improve
              smoothly as model size increases. Training runs with shorter context nctx = 8 (dashed lines) perform better
               on early tokens, since they can allocate all of their capacity to them.


               D.5 Context Dependence

               The trends for loss as a function of model size are displayed for different tokens in the context in Figure 21.
               We see that models trained on nctx = 1024 show steady improvement with model size on all but the ﬁrst
               token.
               Fixing model size, it appears that the loss scales as a power-law as a function of position T in the context, see
              Figure 20. This may be a consequence of underlying power-law correlations in language [EP94, ACDE12,
              LT16], or a more general feature of the model architecture and optimization. It provides some suggestion for
              the potential benefits (or lack thereof) from training on larger contexts. Not only do larger models converge
              to better performance at T = 1024, but they also improve more quickly at early tokens, suggesting that larger
              models are more efﬁcient at detecting patterns with less contextual information. In the right-hand plot we
              show how per-token performance varies for a ﬁxed model as a function of the training step. The model begins
              by learning short-range information, and only learns longer-range correlations later in training.
              We have also included models trained with a tiny context nctx = 8 in order to compare with our longer
              context models. Even modestly sized models trained on nctx = 8 can dominate our largest nctx = 1024
               models on very early tokens. This also suggests that further improvements should be possible with much
               larger models trained on large contexts.

               D.6 Learning Rate Schedules and Error Analysis

               We experimented with a variety of learning rates and schedules. A host of schedules and resulting test
               performances for a small language model are plotted in Figure 22. We conclude that the choice of learning
               rate schedule is mostly irrelevant, as long as the total summed learning rate is sufﬁciently large, and the
               schedule includes a warmup period and a final decay to near-vanishing learning rate. Variations among
               schedules appear to be statistical noise, and provide a rough gauge for the scale of variation between different
               training runs. Experiments on larger models suggest that the variation in the final test loss between different
               random seeds is roughly constant in magnitude for different model sizes.

                                                  <<FIGURE>>

              Figure 22 We test a variety of learning rate schedules, including cosine decay, linear decay, and other
              faster/slower decay schedules, on a 3 million parameter model, shown on the left. For these experiments we
              do not decay to zero, since we find that this tends to give a fixed improvement close to the end of training.
              We find that, as long as the learning rate is not too small and does not decay too quickly, performance does
              not depend strongly on learning rate. Run-to-run variation is at the level of 0.05 in the loss, so averaging
              multiple runs is necessary to validate performance changes smaller than this level.

                                      <<FIGURE>>

              Figure 23 The trend for performance as a function of parameter count, L(N), is fit better by a power law
              than by other functions such as a logarithm at a qualitative level.

              We found that larger models require a smaller learning rate to prevent divergence, while smaller models can
              tolerate a larger learning rate. To implement this, the following rule of thumb was used for most runs:

                                    <<FIGURE>>                (D.1)

              We expect that this formula could be improved. There may be a dependence on network width, likely set by
              the initialization scale. The formula also breaks down for N > 10^10 parameters. Nevertheless, we found that
              it works sufﬁciently well for the models we considered.
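
              As an illustration of the kind of schedule discussed in this section, the sketch below implements a linear
              warmup followed by a cosine decay. It is a generic example, not the schedule used for the experiments:
              the function names, the parameter names, and the example values are all hypothetical. The
              final_fraction argument allows stopping the decay short of zero, as was done for the Figure 22 runs.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_steps: int, final_fraction: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    final_fraction is the fraction of peak_lr remaining at the end of training
    (0.0 decays all the way to zero).
    """
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    floor = peak_lr * final_fraction
    return floor + (peak_lr - floor) * cosine


def summed_lr(total_steps: int, peak_lr: float, warmup_steps: int,
              final_fraction: float = 0.0) -> float:
    """Total summed learning rate over the run."""
    return sum(lr_at_step(s, total_steps, peak_lr, warmup_steps, final_fraction)
               for s in range(total_steps))


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    print(lr_at_step(0, 1000, 1e-3, 100), lr_at_step(100, 1000, 1e-3, 100))
    print(summed_lr(1000, 1e-3, 100))
```

              Comparing summed_lr across candidate schedules gives a quick way to check whether two schedules have a
              similar total summed learning rate before comparing their final losses.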

              D.7 Fit Details and Power Law Quality

              We experimented with a number of functional forms for the ﬁts to <<FORMULA>>, and <<FORMULA>> the power-law
              ﬁts were qualitatively much more accurate than other functions such as logarithms (see Figure 23).
              For L(C), we do not include small models with only 1 layer in the fit, as the transition from 1 to 2 layers
              causes a noticeable lump in the data. For L(N) we also do not include very small models with only 1 layer in
              the ﬁt, and we exclude the largest models that have not trained fully to convergence. Fit parameters change
              marginally if we do include them, and the trend extrapolates well in both directions regardless.
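
              The comparison between functional forms can be sketched as follows. The data here are synthetic, generated
              from an exact power law with hypothetical parameters (scale 8.8e13, exponent 0.076), so the example only
              illustrates the fitting procedure and says nothing about the measured losses. The power law is fit by
              least squares in log-log space and the logarithm by least squares in log N; both choices are
              assumptions of this sketch.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_power_law(sizes, losses):
    """Fit L = scale * N**(-alpha) via a straight line in log-log space."""
    slope, intercept = linear_fit([math.log(n) for n in sizes],
                                  [math.log(loss) for loss in losses])
    alpha, scale = -slope, math.exp(intercept)
    return alpha, (lambda n: scale * n ** (-alpha))


def fit_logarithm(sizes, losses):
    """Fit L = a + b * log(N)."""
    slope, intercept = linear_fit([math.log(n) for n in sizes], losses)
    return lambda n: intercept + slope * math.log(n)


def rms_error(model, sizes, losses):
    """Root-mean-square residual of a fitted model."""
    return math.sqrt(sum((model(n) - loss) ** 2
                         for n, loss in zip(sizes, losses)) / len(sizes))


# Synthetic data from a hypothetical power law (not measured values).
SIZES = [10.0 ** k for k in range(5, 10)]
LOSSES = [(8.8e13 / n) ** 0.076 for n in SIZES]

ALPHA_FIT, POWER_MODEL = fit_power_law(SIZES, LOSSES)
LOG_MODEL = fit_logarithm(SIZES, LOSSES)

if __name__ == "__main__":
    print("fitted exponent:", ALPHA_FIT)
    print("rms (power law):", rms_error(POWER_MODEL, SIZES, LOSSES))
    print("rms (logarithm):", rms_error(LOG_MODEL, SIZES, LOSSES))
```

              On real measurements the same residual comparison would be run on held-out model sizes as well, since a
              fit that looks adequate inside the fitted range can still extrapolate poorly.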

              D.8 Generalization and Architecture

              In Figure 24 we show that generalization to other data distributions does not depend on network depth when we
              hold the total parameter count ﬁxed. It seems to depend only on the performance on the training distribution.

                                                  <<FORMULA>>
                                                  
              Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 billion parameters.
              We observe no effect of depth on generalization; generalization performance depends primarily on
              training distribution performance. The 12-layer model overﬁt the Internet Books dataset and we show the
              early-stopped performance; we have not seen this surprising result in other experiments.


              List of Figures

                 1 Summary of simple power laws
                 2 Illustration of sample efficiency and compute efficiency
                 3 How to scale up model size, batch size, and serial steps
                 4 Performance when varying model and data size, or model and training steps, simultaneously
                 5 Weak dependence of performance on hyperparameter tuning
                 6 Comparison of performance trend when including or excluding embeddings
                 7 LSTM and Transformer performance comparison
                 8 Generalization to other test datasets
                 9 Universality of overfitting
                 10 Critical batch size
                 11 Performance versus compute budget or number of parameter updates
                 12 Training on suboptimal models
                 13 Comparison between empirical and adjusted compute trends
                 14 Optimal model size and serial number of steps versus compute budget
                 15 Contradiction between compute and data trends
                 16 Early stopping lower bound and training curves for overfit models
                 17 Universal transformers
                 18 Batch size scans
                 19 Another look at sample efficiency
                 20 Power-law dependence of performance on position in context
                 21 Performance at different context positions versus model size
                 22 Learning rate schedule scan
                 23 Comparison of Power-Law and Logarithmic Fits
                 24 Generalization versus depth

                                                  List of Tables

                 1 Parameter and compute counts for Transformer
                 2 Fits to L(N, D)
                 3 Fits to L(N, S)
                 4 Key trend equations
                 5 Key parameters to trend fits
                 6 Trends for compute-efficient training

              References

              [ACDE12] Eduardo G Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti. On the origin of long-range
                       correlations in texts. Proceedings of the National Academy of Sciences, 109(29):11582–11587,
                       2012.
              [AS17] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in
                       neural networks. arXiv, 2017, 1710.03667.
              [BB01] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language
                       disambiguation. In Proceedings of the 39th annual meeting on association for computational
                       linguistics, pages 26–33. Association for Computational Linguistics, 2001.
              [BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine
                       learning and the bias-variance trade-off. arXiv, 2018, 1812.11118.
              [Bia12] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research,
                       13(Apr):1063–1095, 2012.
              [CGRS19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with
                       sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509. URL
                       http://arxiv.org/abs/1904.10509.
              [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
                       bidirectional transformers for language understanding, 2018, arXiv:1810.04805.
              [DGV + 18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
                       Universal transformers. CoRR, abs/1807.03819, 2018, 1807.03819. URL
                       http://arxiv.org/abs/1807.03819.
              [EP94] Werner Ebeling and Thorsten Pöschel. Entropy and long-range correlations in literary english.
                       EPL (Europhysics Letters), 26(4):241, 1994.
              [Fou] The Common Crawl Foundation. Common crawl. URL http://commoncrawl.org.
              [GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace.
                       2018, arXiv:1812.04754.
              [GJS + 19] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli,
                       Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with
                       number of parameters in deep learning. arXiv, 2019, 1901.01608.
              [GKX19] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net
                       optimization via hessian eigenvalue density. CoRR, abs/1901.10159, 2019, 1901.10159. URL
                       http://arxiv.org/abs/1901.10159.
              [Goo01] Joshua Goodman. A bit of progress in language modeling. CoRR, cs.CL/0108005, 2001. URL
                       http://arxiv.org/abs/cs.CL/0108005.
              [GRK17] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights.
                       openai.com, 2017.
              [HAD19]Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy: Compu-
                       tational challenges in deep learning. InProceedings of the 24th Symposium on Principles and
                       Practice of Parallel Programming, PPoPP ’19, pages 1–14, New York, NY, USA, 2019. ACM.
                       doi:10.1145/3293883.3295710. 18
               [HCC + 18]Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le,
                       and Zhifeng Chen. Gpipe: Efﬁcient training of giant neural networks using pipeline parallelism.
                       CoRR, abs/1811.06965, 2018, 1811.06965. URLhttp://arxiv.org/abs/1811.06965. 19
               [HNA + 17]Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kia-
                       ninejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is pre-
                       dictable, empirically, 2017, 1712.00409. 18
               [JGH18]Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
                       generalization in neural networks. InAdvances in neural information processing systems, pages
                       8571–8580, 2018. 18
              [KB14]Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014,
                       1412.6980. 7
              [Kom19]Aran Komatsuzaki. One epoch is all you need, 2019, arXiv:1906.06669. 18
              [KSH12]Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classiﬁcation with deep
                       convolutional neural networks. InProceedings of the 25th International Conference on Neural
                       Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran
                       Associates Inc. URLhttp://dl.acm.org/citation.cfm?id=2999134.2999257. 19
               [LCG + 19]Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
                       Soricut. Albert: A lite bert for self-supervised learning of language representations, 2019,
                       1909.11942. 9
               [LOG + 19]Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
                       Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretrain-
                       ing approach. CoRR, abs/1907.11692, 2019, 1907.11692. URLhttp://arxiv.org/abs/
                       1907.11692. 2
               [LSP + 18]Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and
                       Noam Shazeer. Generating wikipedia by summarizing long sequences.arXiv:1801.10198 [cs],
                       2018, 1801.10198. URLhttp://arxiv.org/abs/1801.10198. 2, 6
               [LT16]Henry W Lin and Max Tegmark. Criticality in formal languages and statistical physics.arXiv
                       preprint arXiv:1606.06737, 2016. 25
               [LXS + 19]Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-
                       Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models
                       under gradient descent, 2019, arXiv:1902.06720. 18
               [MKAT18]Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model
                       of large-batch training, 2018, arXiv:1812.06162. 3, 5, 6, 12, 13, 21
               [Pap18]Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size.
                       CoRR, abs/1811.07062, 2018, 1811.07062. URLhttp://arxiv.org/abs/1811.07062. 18
               [RNSS18]Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
                       understanding by generative pre-training.URL https://s3-us-west-2. amazonaws. com/openai-
                       assets/research-covers/languageunsupervised/language understanding paper. pdf, 2018. 2, 6
              [RRBS19a]Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive
                       prediction of the generalization error across scales, 2019, 1909.12673. 18
              [RRBS19b]Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive
                       prediction of the generalization error across scales, 2019, arXiv:1909.12673. 18
              [RSR + 19]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
                       Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a uniﬁed
                       text-to-text transformer, 2019, arXiv:1910.10683. 2
              [RWC + 19]Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
                       models are unsupervised multitask learners.openai.com, 2019. 2, 5, 6, 7, 8
              [SCP + 18]Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanan-
                       takool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and
                       Blake Hechtman. Mesh-tensorﬂow: Deep learning for supercomputers, 2018, 1811.02084. 19
              [SHB15]Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words
                       with subword units.CoRR, 2015, 1508.07909. 6
              [SLA + 18]Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and
                       George E. Dahl. Measuring the effects of data parallelism on neural network training, 2018,
                       arXiv:1811.03600. 12
              [SS18]Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory
                       cost.CoRR, abs/1804.04235, 2018, 1804.04235. URLhttp://arxiv.org/abs/1804.04235.
                       7
              [THK18]Stefan Thurner, Rudolf Hanel, and Peter Klimek.Introduction to the theory of complex systems.
                       Oxford University Press, 2018. 18
              [TL19]Mingxing Tan and Quoc V. Le. Efﬁcientnet: Rethinking model scaling for convolutional neural
                       networks.CoRR, abs/1905.11946, 2019, 1905.11946. URLhttp://arxiv.org/abs/1905.
                       11946. 18
              [VSP + 17]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
                       Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,
                       S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural
                       Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL
                       http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. 2, 6
              [VWB16]Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles
                       of relatively shallow networks, 2016, arXiv:1605.06431. 8, 18
              [Was06]Larry Wasserman.All of nonparametric statistics. Springer Science & Business Media, 2006.
              [WPN + 19]Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill,
                       Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose
                       language understanding systems, 2019, 1905.00537. 2
              [WRH17]Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by in-
                       creasing model capacity.2017 IEEE Conference on Computer Vision and Pattern Recognition
                       (CVPR), Jul 2017. doi:10.1109/cvpr.2017.323. 19
              [WYL19]Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional
                       networks, 2019, 1906.02909. 19
              [YDY + 19]Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V.
                       Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2019,
                       arXiv:1906.08237. 2
              [ZK16]Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.Procedings of the British
                       Machine Vision Conference 2016, 2016. doi:10.5244/c.30.87. 18
              [ZKZ + 15]Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Tor-
                       ralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by
                       watching movies and reading books.2015 IEEE International Conference on Computer Vision
                       (ICCV), Dec 2015. doi:10.1109/iccv.2015.11. 7
              [ZLN + 19]Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl,
                       Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch
                       sizes? insights from a noisy quadratic model.CoRR, abs/1907.04164, 2019, 1907.04164. URL
                       http://arxiv.org/abs/1907.04164. 12, 18
<|endoftext|>


<|startoftext|>
Structured Pruning of Convolutional Neural Networks via L1 Regularization 

CHEN YANG1,2, ZHENGHONG YANG1,2, ABDUL MATEEN KHATTAK2,3 , LIU YANG1,2, WENXIN ZHANG1,2, WANLIN GAO1,2 , AND MINJUAN WANG1,2 
1Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
2College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
3Department of Horticulture, The University of Agriculture, Peshawar 25120, Pakistan
Corresponding authors: Wanlin Gao (wanlin_cau@163.com) and Minjuan Wang (minjuan@cau.edu.cn) 
This work was supported by the Project of Scientific Operating Expenses from Ministry of Education of China under Grant 2017PT19. 

ABSTRACT 
Deep learning architectures have achieved remarkable success in many areas with the recent advancements in convolutional neural networks (CNNs). However, real-time applications of CNNs are seriously hindered by their significant storage and computational costs. Structured pruning is a promising method to compress and accelerate CNNs, and it does not need special hardware or software for auxiliary calculation. Here, a simple structured pruning strategy is proposed to crop unimportant filters or neurons automatically during the training stage. The proposed method introduces a mask for every filter or neuron to evaluate its importance, and the filters or neurons with a zero mask are removed. To achieve this, the proposed method adopts L1 regularization to zero out filters or neurons of CNNs. Experiments were conducted to assess the validity of this technique. They showed that the proposed approach could crop 90.4%, 95.6% and 34.04% of the parameters on LeNet-5, VGG-16, and ResNet-32, respectively, with a negligible loss of accuracy.


INDEX TERMS Convolutional neural networks, regularization, structured pruning.


I. INTRODUCTION 

In recent years, convolutional neural networks (CNNs) [1] have been applied successfully in many areas such as image classification [2], object detection [3], neural style transfer [4], identity authentication [5], information security [6], speech recognition and natural language processing. However, these achievements were made by leveraging large-scale networks with millions or even billions of parameters. Such large-scale networks rely heavily on GPUs to accelerate computation. Moreover, devices with limited resources, such as mobile, FPGA or embedded devices, have difficulty deploying CNNs in actual applications. Thus, it is critical to accelerate the inference of CNNs and reduce their storage requirements for a wide range of applications [7].
According to the studies done so far, the major approaches for compressing deep neural networks can be categorized into four groups, i.e. low-rank decomposition [8], parameter quantization [9], knowledge distillation [10], [13], and network pruning [14]. For deep neural networks (DNNs) that have already been trained, low-rank decomposition decomposes and approximates a tensor at a smaller scale to achieve compression. It achieves an efficient speedup because it reduces the number of elements in the matrix. However, it can only decompose or approximate tensors one by one within every layer, and cannot discover the redundant parameters of a DNN. Besides, more research has focused on network module designs that are smaller, more efficient and more sophisticated. These models, such as SqueezeNet [15], MobileNet [16] and ShuffleNet [17], are basically made up of low-resolution convolutions with fewer parameters and better performance.
At present, network pruning is a major focus of research, as it not only accelerates DNNs but also reduces redundant parameters. Using a large-scale network directly may provide state-of-the-art performance, so learning a large-scale network is often needed. However, the optimal network architecture is usually not known in advance, so massive redundancy exists in large neural networks. To combat this problem, network pruning can remove redundant parameters, filters, channels or neurons, and address the over-fitting issue.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

<<FIGURE>>

FIGURE 1. The architecture of the layer with the mask. (a) The architecture of a convolutional layer with the mask. (b) The architecture of a fully-connected layer with the mask. The proposed approach chooses the unimportant filters and neurons (highlighted in yellow) by the order of magnitude of mask value. 
Network pruning techniques can also be broadly categorized as structured pruning and non-structured pruning. Non-structured pruning aims to remove single parameters that have little influence on the accuracy of networks, and it is efficient and effective for compacting networks. Nonetheless, non-structured pruning is difficult to use widely in practical applications. In many prevalent deep learning frameworks, the convolution operation is reformulated as a matrix-by-matrix multiplication, so non-structured pruning requires additional information to represent the pruned locations. Therefore, special hardware or software is needed to assist with the calculation, which may increase computation time. Instead, structured pruning directly removes entire filters, channels or neurons. Thus, the remaining network architecture can be used directly by the existing hardware. For example, Anwar et al. [18] employed particle filtering to induce structured sparsity in convolutional neural networks at channel-wise, kernel-wise, and intra-kernel stride levels. At present, several structured pruning methods [24], [25], [27] are mainly based on the statistical information of parameters or activation outputs. These methods do not consider the loss and are unable to remove parameters during training. In addition, some methods, such as those in [19], [20], require layer-by-layer iterative pruning and accuracy recovery, which involves enormous computation. In contrast, the proposed approach links pruning with minimization of the loss and can be implemented during training.
A key observation is that filters whose weights are all zero can be safely removed, because, whatever the input, they would not extract any features. This study presents a scheme to prune filters, or neurons of fully-connected layers, based on L1 regularization [21], which zeroes out the weights of some filters or neurons. Similar to this method, Wen et al. [31] adopted group LASSO regularization [40] to zero out filters. However, all the weights are required to compute an extra gradient, which is computationally expensive for a large-scale network.
In contrast, the proposed method introduces a mask to address this issue, and the regularization term is only the l1-norm of the mask, whose gradients are easy to calculate. In this method, the parameters of filters or neurons are multiplied by a mask to pick out unimportant filters or neurons, and once a mask is zero the corresponding filter or neuron is removed. Although a mask is introduced for filters or neurons, the method does not change the architecture of the network. This allows other compression methods to be used together with the proposed technique. Similar to the proposed method, Lin et al. [32] also adopted a mask to identify unimportant filters or neurons, but the value of the mask could not be changed by training. In addition, removing unimportant filters or neurons may temporarily degrade accuracy, but the network can be retrained to recover performance. FIGURE 1 shows the framework of the proposed method.
In this article, a structured pruning technique is presented, which allows for simultaneously learning and removing unimportant filters or neurons of CNNs. The main contributions are as follows:

• A simple yet effective method based on L1 regularization is presented to compress CNN models during the training stage.

• A threshold is adopted to solve the optimization problem of the l1-norm. In this approach, some mask values are only required to be near zero, not exactly zero. The details are provided in the following section.


II. PREVIOUS WORK 

The importance of compressing deep learning models before deployment is self-evident, especially for expanding the application scenarios of deep learning [11]. For example, a compressed deep learning model can be combined with edge computing [12] to enable Internet of Things devices to understand data. In this section, we review the contributions of others.
LeCun et al. [14] first proposed a saliency measurement method called Optimal Brain Damage (OBD) to selectively delete weights using second-derivative information of the error function. Later, Hassibi and Stork [22] proposed the Optimal Brain Surgeon (OBS) algorithm based on OBD. OBS not only removed unimportant weights but also automatically adjusted the remaining weights, which improved accuracy and generalization ability. Both methods are based on a Taylor expansion and require computing the Hessian matrix, which may be computationally intensive, especially for large networks. In addition, they use a criterion of minimal increase in error on the training data. Guo et al. [23] introduced a binary matrix to dynamically choose important weights. Han et al. [24], [25] directly removed weights with values lower than a predefined threshold to compress networks, followed by retraining to recover accuracy. Considering that most filters in CNNs tend to be smooth in the spatial domain, Liu et al. [26] extended Guo's work to the frequency domain by applying the Discrete Cosine Transform (DCT) to filters in the spatial domain. However, these non-structured pruning technologies are hard to use in real applications, because extra software or hardware is required for the calculation.
Directly cropping a trained model by the value of its weights is a widely used method. It normally relies on an effective measure to judge the importance of weights and cuts the unimportant connections or filters to reduce the redundancy of a model. Hu et al. [27] observed that the activation outputs of a significant portion of neurons were zero in a large network, whatever inputs the network received. These zero-activation neurons were considered unimportant, so they defined the Average Percentage of Zeros (APoZ) to measure the percentage of zero activations of a neuron and cropped the neurons with fewer activations. Li et al. [28] introduced a structured pruning method that measures the norm of filters to remove unimportant filters. Luo et al. [29] took advantage of a subset of input channels to approximate the output for compressing convolutional layers. Changpinyo et al. [30] proposed a random method to compress CNNs, in which each output channel is randomly connected to a small subset of input channels. Though successful to an extent, their method did not directly relate to the loss, hence it was necessary to retrain the network to recover accuracy. Moreover, such a scheme could only be used layer by layer, so pruning had to be iterated many times, which would result in massive computation costs.
Ding et al. [37] applied a customized L2 regularization to remove unimportant filters and simultaneously stimulate important filters to grow stronger. Lin et al. [32] proposed a Global & Dynamic Filter Pruning (GDP) method, which could dynamically recover the previously removed filters. Liu et al. [33] enforced channel-level sparsity in the network to compress DNNs in the training phase. In addition, Gordon et al. [39] iteratively shrank and expanded a network targeting a reduction of particular resources (e.g. FLOPs, or the number of parameters).

III. THE APPROACH OF STRUCTURED PRUNING FOR CNNs

A. NOTATIONS 
First of all, the notation is clarified in this section. A CNN is a multi-layer deep feed-forward neural network, which is composed of a stack of convolutional layers, pooling layers, and fully-connected layers. In an l-layer CNN model, <<FORMULA>> represents the k-th filter of layer l, <<FORMULA>> denotes the number of feature maps in layer l-1 and d indicates the kernel size. Let us denote the feature maps in layer l by <<FORMULA>>, where <<FORMULA>> is the size, C_l is the number of channels, and Z^l is the output of layer l-1. In addition, Z^l_k represents the k-th feature map of layer l. The output feature map Z^l_k can be computed as:

<<FORMULA>>,         (1)

where f is a non-linear activation function, ∗ is the convolution operation and b_k is the bias. <<FORMULA>> represents the training set, where x_i and y_i represent a training sample and its label respectively, and N indicates the number of samples.

B. THE PROPOSED SCHEME

The goal of structured pruning is to remove redundant filters or neurons, which are unimportant or useless for the performance of the network. Essentially, the main role of the convolutional layer filters is to extract local features. However, once all the parameters of a filter are zero, the filter can be regarded as unimportant: whatever its inputs, its outputs are always zero, so it is unable to extract any information. Likewise, when a filter is multiplied by zero, all of its parameters become zero. Based on this observation, a mask is introduced for every filter to estimate its importance. This can be formulated as:

<<FORMULA>>,                (2)

where m^l_k represents the k-th mask of layer l.
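
As a minimal sketch of the masking idea behind Equation (2), the following pure-Python example scales a single-channel filter by one scalar mask before the convolution. The helper names (`conv2d_valid`, `masked_filter_output`), the ReLU activation, the zero bias and the toy values are illustrative assumptions and not the implementation used in the experiments.

```python
def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation with 'valid' padding on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def masked_filter_output(image, kernel, mask, bias=0.0):
    """Scale the whole filter by one scalar mask, convolve, then apply ReLU."""
    scaled = [[mask * w for w in row] for row in kernel]
    pre_activation = conv2d_valid(image, scaled)
    return [[max(0.0, v + bias) for v in row] for row in pre_activation]


# Toy input and filter (assumed values, chosen only for illustration).
image = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
kernel = [[1.0, -1.0], [0.5, 2.0]]

kept = masked_filter_output(image, kernel, mask=1.0)    # filter unchanged
pruned = masked_filter_output(image, kernel, mask=0.0)  # filter zeroed out
```

With a mask of 1 the filter behaves as usual, while a mask of 0 makes the feature map identically zero for any input (given the zero bias assumed here), which is why such a filter can be removed.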

Therefore, the problem of zeroing out some filters can be transformed into the problem of zeroing out some masks. For this purpose, the following optimization problem is proposed:

<<FORMULA>>,        (3) 

where <<FORMULA>> is a loss function, such as cross-entropy loss, <<FORMULA>> is the output of the CNN and C is a hyper-parameter that controls the number of pruned filters. Equation (3) is the core of the proposed method. Once the optimal solution of the equation is obtained, the pruning is achieved.
In addition, this method can also remove redundant neurons in a fully-connected layer. The inference of a fully-connected layer can be represented by:

<<FORMULA>>,       (4) 

where <<FORMULA>> is a weight matrix and Z^{l-1} ∈ R^{n×1} is the input of the l-th layer. When masks are introduced into fully-connected layers, the inference of these layers can be reformulated as:

<<FORMULA>>,        (5) 

where <<FORMULA>> is a mask vector and <<FORMULA>> is the Hadamard product operator.
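
The masked fully-connected layer can be sketched as follows. Since the exact form of Equation (5) is not reproduced here, the sketch assumes one mask entry per output neuron that scales the corresponding row of the weight matrix, a ReLU activation and no bias; `dense` and `masked_dense` are hypothetical helper names.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(weights, inputs):
    """Fully-connected layer without bias: one weight row per output neuron."""
    return relu([sum(w * x for w, x in zip(row, inputs)) for row in weights])


def masked_dense(weights, mask, inputs):
    """Scale every row (output neuron) by its mask entry, then run the layer."""
    scaled = [[m_k * w for w in row] for row, m_k in zip(weights, mask)]
    return dense(scaled, inputs)


# Three output neurons, two inputs (assumed toy values).
weights = [[1.0, 2.0], [0.5, -1.0], [3.0, 1.0]]
mask = [1.0, 0.0, 0.5]
x = [2.0, 1.0]

out = masked_dense(weights, mask, x)
```

The second neuron has a zero mask, so its output is zero regardless of the input. Because the mask is folded into the weight rows before the layer is evaluated, the same result is obtained from an ordinary layer whose weights already contain the mask.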
Equation (3) can be transformed into the following form based on the Lagrange multiplier:

<<FORMULA>>,      (6)

where <<FORMULA>> is a coefficient associated with C. 
Equation (6) is an NP-hard problem because of the l0-norm. Thus, it is quite difficult to obtain an optimal solution of equation (6).
Therefore, the l1-norm is adopted to replace the l0-norm, as:

<<FORMULA>>.             (7)

Equation (7) can be solved by SGD in practice, so the proposed method is simple and easy to implement. We just need to introduce a mask for each layer and train the network. Although the proposed method introduces masks, the network topology is preserved because the masks can be absorbed into the weights.
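
To illustrate how an l1 penalty on the masks behaves under gradient-based training, the toy sketch below fits a two-term linear model in which only the first input is informative. It is a simplified stand-in for Equation (7), under several assumptions: full-batch subgradient descent replaces SGD, the weights are frozen so that only the masks are trained, the loss is a squared error, and the data, `lam` and `lr` values are arbitrary.

```python
def sign(v):
    return (v > 0) - (v < 0)


def train_masks(data, weights, lam=0.1, lr=0.1, steps=200):
    """Minimize mean squared error / 2 + lam * ||masks||_1 over the masks only."""
    masks = [1.0] * len(weights)  # masks start at 1
    n = len(data)
    for _ in range(steps):
        grads = [0.0] * len(masks)
        for x, y in data:
            pred = sum(m * w * xi for m, w, xi in zip(masks, weights, x))
            err = pred - y
            for k in range(len(masks)):
                grads[k] += err * weights[k] * x[k] / n
        # Subgradient step: data-fit gradient plus lam * sign(mask).
        masks = [m - lr * (g + lam * sign(m)) for m, g in zip(masks, grads)]
    return masks


# The target depends only on the first input: y = 2 * x1.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 0.0),
        ((1.0, 1.0), 2.0), ((1.0, -1.0), 2.0)]
masks = train_masks(data, weights=[2.0, 1.0])
```

In this toy problem the mask of the informative term settles slightly below 1, while the mask of the uninformative term is driven to within roughly `lr * lam` of zero. It is not guaranteed to land exactly on zero, which is the practical issue that the threshold in the next subsection addresses.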

C. THRESHOLD 

L1 regularization is a widely used sparsification technique, which pushes the coefficients of uninformative features to zero, so a sparse network is achieved by solving equation (7). However, there is a problem in solving equation (7): the mask values cannot be driven exactly to zero in practice, because the objective function (7) is non-convex and the global optimum may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of a mask value is small enough, it can be considered as almost zero. Thus, a threshold is introduced to decide whether a mask is zero. However, considering only the value of the mask is meaningless if the mask is not exactly zero, because the mask and the convolution weights are related by a linear transformation: one can shrink the masks while expanding the weights to keep their product the same. Hence, it is necessary to consider the mask and the weights simultaneously. The average value of the product of the mask and the weights is used to determine whether the mask is treated as exactly zero. The specific definition is:

<<FORMULA>> (8)
 
where <<FORMULA>> is a pre-defined threshold and <<FORMULA>> is the average operation. This strategy is efficient and reasonable, as supported by the experimental results.
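
A possible reading of the threshold rule is sketched below. It assumes that the average is taken over the absolute values of mask times weight for each filter (the use of the absolute value is an assumption), and the function names and toy numbers are hypothetical.

```python
def effective_magnitude(mask, filter_weights):
    """Mean absolute value of mask * weight over all weights of one filter."""
    flat = [w for row in filter_weights for w in row]
    return sum(abs(mask * w) for w in flat) / len(flat)


def apply_threshold(masks, filters, threshold=0.01):
    """Set a mask to zero when its filter's effective magnitude is below the threshold."""
    return [
        0.0 if effective_magnitude(m, f) < threshold else m
        for m, f in zip(masks, filters)
    ]


filters = [
    [[0.5, -0.25], [0.75, 0.5]],     # ordinary filter, mask 1.0
    [[0.5, -0.25], [0.75, 0.5]],     # same weights, tiny mask
    [[50.0, -25.0], [75.0, 50.0]],   # small mask compensated by large weights
]
masks = [1.0, 0.001, 0.01]

new_masks = apply_threshold(masks, filters)
```

The third filter shows why the mask alone is not a reliable criterion: its mask is small, but the weights are correspondingly large, so the product is unchanged and the filter is kept.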

Algorithm 1 The Proposed Pruning Approach

<<ALGORITHM>>

Merging weights and masks and then removing the mask layer. Return the pruned network architecture and preserved weights.
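
The final merging step can be sketched for two consecutive fully-connected layers. The sketch assumes no bias and a ReLU activation, and `merge_and_prune` and `forward` are hypothetical helpers rather than the authors' code. Neurons whose mask is zero are deleted from the first layer together with the matching input columns of the next layer, and the remaining masks are folded into the weights.

```python
def merge_and_prune(w1, masks, w2):
    """Fold masks into w1, drop zero-mask neurons and the matching columns of w2."""
    keep = [k for k, m in enumerate(masks) if m != 0.0]
    new_w1 = [[masks[k] * w for w in w1[k]] for k in keep]
    new_w2 = [[row[k] for k in keep] for row in w2]
    return new_w1, new_w2


def forward(w1, w2, x, masks=None):
    """Two-layer network: masked ReLU hidden layer followed by a linear layer."""
    if masks is None:
        masks = [1.0] * len(w1)
    hidden = [
        max(0.0, m * sum(w * xi for w, xi in zip(row, x)))
        for row, m in zip(w1, masks)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


w1 = [[1.0, -1.0], [2.0, 0.5], [0.5, 0.5]]   # 3 hidden neurons, 2 inputs
masks = [0.5, 0.0, 2.0]                      # second neuron is pruned
w2 = [[1.0, 3.0, -1.0], [0.0, 1.0, 2.0]]     # 2 outputs, 3 hidden inputs
x = [4.0, 2.0]

before = forward(w1, w2, x, masks)
new_w1, new_w2 = merge_and_prune(w1, masks, w2)
after = forward(new_w1, new_w2, x)
```

The pruned network has no mask layer and one hidden neuron fewer, yet it produces the same output as the masked network on this input.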

D. FINE-TUNING AND OTHER REGULARIZATION STRATEGIES 

Pruning may temporarily degrade accuracy, so fine-tuning is necessary to recover it. Furthermore, the proposed method can be applied iteratively to obtain a narrower architecture. In practice, a single iteration of the proposed method is enough to yield noticeable compaction. The method is elaborated in Algorithm 1.
Essentially, the purpose of this approach is to drive some masks to an adequately small order of magnitude. Therefore, L2 regularization can also serve as the regularization strategy in this approach.

IV. EXPERIMENTS 

The approach was primarily evaluated on three networks: LeNet-5 on the MNIST dataset, VGG-16 on the CIFAR-10 dataset and ResNet-32 on the CIFAR-10 dataset. The approach was implemented with the standard Keras library. All experiments were conducted on an Intel E5-2630 V4 CPU and an NVIDIA 1080Ti GPU.

A. DATASETS 

1) MNIST 
The MNIST dataset of handwritten digits from 0 to 9 is widely used to evaluate machine learning models. It contains 60000 training samples and 10000 test samples.

2) CIFAR-10 
The CIFAR-10 dataset [41] has a total of 60000 images in 10 classes, each class having 6000 images with 32x32 resolution. There are 50000 training images and 10000 test images. During training, a data augmentation scheme was adopted, which consisted of random horizontal flips, rotation, and translation. The input data was normalized using the means and standard deviations.

B. NETWORK MODELS 

1) LENET-5 
LeNet-5 is a convolutional neural network designed by LeCun et al. [34]. It has two convolutional and two fully-connected layers. This network has 44.2K learnable parameters. In this network, dropout is used in the fully-connected layer.

<<TABLE>>

TABLE 1. The result of LeNet-5 on MNIST.

2) VGG-16 
The original VGG-16 [35] has thirteen convolutional and two fully-connected layers and has 130M learnable parameters. However, VGG-16 is overly complex for the CIFAR-10 dataset, so the fully-connected layers were removed. Moreover, Batch Normalization was used after each convolution operation. The modified model has 14.7M learnable parameters.
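
As a rough check of the 14.7M figure, the parameters of the thirteen convolutional layers can be counted directly. The sketch assumes the standard VGG-16 channel widths, 3x3 kernels with biases and a 3-channel input, and it ignores Batch Normalization and any final classifier parameters, whose details are not given here; `conv_param_count` is a hypothetical helper.

```python
# Output channels of the thirteen 3x3 convolutional layers of standard VGG-16.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]


def conv_param_count(channels, in_channels=3, kernel_size=3, bias=True):
    """Total number of weights (and biases) in a chain of convolutional layers."""
    total = 0
    for out_channels in channels:
        total += in_channels * out_channels * kernel_size * kernel_size
        if bias:
            total += out_channels
        in_channels = out_channels  # the next layer consumes these feature maps
    return total


baseline = conv_param_count(VGG16_CONV_CHANNELS)

# Removing half of the filters in every layer also halves the input channels
# of the following layer, so the count shrinks by roughly a factor of four.
halved = conv_param_count([c // 2 for c in VGG16_CONV_CHANNELS])
```

The baseline count is about 14.7M, consistent with the figure quoted above. The second computation illustrates why structured pruning reduces parameters faster than the fraction of removed filters alone would suggest.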

3) RESNET-32 

Deep residual network (ResNet) [42] is a state-of-the-art CNN architecture. In this paper, ResNet-32 was implemented to evaluate the proposed method. The ResNet-32 used here had the same architecture as described in [42], which contained three stages of convolutional layers, one global average pooling layer after the last convolutional layer and one fully-connected layer. In addition, when the dimensions increased, a 1x1 convolution was adopted as the identity mapping to match the dimensions. This network has 0.47M learnable parameters.

C. THE DETAIL OF TRAINING, PRUNING, AND FINE-TUNING 

To obtain the baseline accuracy in the experiments, we trained LeNet-5 on MNIST, VGG-16 on CIFAR-10, and ResNet-32 on CIFAR-10 from scratch. Then, pruning was performed on the trained network, with L1 regularization as the regularization strategy and the masks initialized to 1. Finally, we retrained the pruned network to recover accuracy.

1) LENET-5 ON MNIST 
The original network was trained from scratch for a total of 30 epochs by Adam [43] with a batch size of 128. The learning rate was initialized to 0.001 and the weight decay was set to 0.0005. The momentum was set to 0.9 and the dropout rate was set to 0.5 for the fully-connected layer. For pruning training, only the number of epochs was modified: it was set to 10, and the threshold mentioned above for selecting pruned filters was set to 0.01. The pruned network was then retrained to compensate for the loss of accuracy, with the same hyper-parameter settings as in normal training.

2) VGG-16 ON CIFAR-10 
To get the baseline accuracy, the network was trained from scratch by SGD with a batch size of 128 for a total of 60 epochs. The initial learning rate was set to 0.01 and then scaled by 0.1 every 20 epochs. The weight decay was set to 0.0005 and the momentum to 0.9. For pruning training, the number of epochs was set to 30, the learning rate was scaled by 0.1 every 10 epochs, the threshold was set to 0.01, and the other settings remained the same. Finally, the pruned model was retrained following the same pre-processing and hyper-parameter settings as the normal training.
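
The step schedule described above can be written as a small function. This is a sketch under the assumption that "scaled by 0.1 every 20 epochs" means a multiplication by 0.1 at epochs 20, 40, and so on, with epochs counted from zero; `step_decay` is a hypothetical name.

```python
def step_decay(epoch, initial_lr=0.01, factor=0.1, step=20):
    """Learning rate at a given zero-based epoch under a step schedule."""
    return initial_lr * factor ** (epoch // step)


# Baseline VGG-16 training: 60 epochs, decay every 20 epochs.
baseline_schedule = [step_decay(e) for e in range(60)]

# Pruning training: 30 epochs, decay every 10 epochs.
pruning_schedule = [step_decay(e, step=10) for e in range(30)]
```

In a Keras training loop, a function of this form could be passed to the `LearningRateScheduler` callback, although how the schedule was actually wired up in the experiments is not stated.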

3) RESNET-32 ON CIFAR-10 
As the baseline, the network was trained from scratch by SGD with a batch size of 128. The weight decay was set to 0.0001, the number of epochs to 120, and the momentum to 0.9. The initial learning rate was set to 0.1 and then scaled by 0.1 at epochs 60 and 100. For pruning training, the number of epochs was set to 30, the learning rate was scaled by 0.1 every 10 epochs and the other settings remained the same. After pruning, the network was retrained from scratch. The number of epochs was changed to 60 and the learning rate was scaled by 0.1 every 20 epochs.

D. RESULTS OF THE EXPERIMENTS 

1) LENET-5 ON MNIST 
As shown in TABLE 1, 88.84% of the parameters were removed without any impact on performance. With the proposed method, 95.46% of the parameters could also be discarded with an accuracy loss of 0.57%.

<<TABLE>>

TABLE 2. Result of VGG-16 on CIFAR-10 datasets. 

<<TABLE>>

TABLE 1 also reveals that there was enormous redundancy in the fully-connected layers, because at least 90% of their parameters could easily be dropped. According to the table, the proposed method may indeed find the important connections, for two reasons. First, when 83.83% of the parameters are removed, the accuracy does not change, which indicates that the pruned parameters are unimportant for maintaining the accuracy of the network. Second, as the pruning rate gradually increases, it becomes difficult to remove further filters or neurons, especially the neurons of fully-connected layers, so the remaining connections are crucial.
In addition, the convolutional layers, especially the first one, are hard to prune in comparison with the later layers. A possible explanation is that the proposed method selects the unimportant filters automatically through the backpropagation algorithm, and backpropagation causes the earlier layers to suffer from the vanishing gradient problem. That is why the earlier layers are harder to prune than the later ones.

2) VGG-16 ON CIFAR-10 
As depicted in TABLE 2, over 94.4% of the parameters could be removed with a negligible accuracy loss of 0.51%. The loss of accuracy was only 2.04% when 97.76% of the parameters were pruned. The proposed method again proved effective in reducing redundancy. 
In fact, preserving the remaining architecture without retaining the parameters (training the pruned network from scratch) is also a strategy for fine-tuning the network. This strategy was adopted here to retrain the network and the results were promising, as shown in TABLE 2. The results reveal that better accuracy can be achieved by directly retraining the pruned network from scratch. Perhaps the significance of the proposed method is that it provides a way to discover excellent architectures, as also noted by Liu et al. [36]. Nevertheless, training a pruned network from scratch is expensive in terms of computation cost, especially for large-scale datasets and networks. 

FIGURE 2. Comparison of L1 regularization and L2 regularization. "Accuracy loss" represents the difference in accuracy between the pruned CNNs and the original CNNs. A positive value indicates an improvement in network accuracy after pruning, while a negative value indicates a decrease in accuracy. 

3) RESNET-32 ON CIFAR-10 
Pruning ResNet-32 based on the order of magnitude of the mask may result in different output map dimensions in the residual module, so a 1x1 convolution would be needed as the identity mapping to match dimensions. However, this operation brings extra parameters and computation. To avoid this problem, a percentile was defined to remove the same proportion of filters in every convolutional layer. TABLE 3 shows that the proposed method removed 34% of the parameters with an accuracy loss of 0.65%. Moreover, over 62.3% of the parameters could be discarded with an accuracy loss of 1.76%. Thus, it was confirmed that the proposed method can reduce the redundancy of a complex network such as ResNet. 
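
The per-layer percentile rule can be illustrated with a short sketch. It assumes one scalar mask value per filter and ranks filters by the absolute value of that mask; the layer names, the mask values, and the helper `filters_to_prune` are hypothetical and are not taken from the paper.

```python
def filters_to_prune(mask_values, prune_ratio):
    """Return indices of the filters with the smallest |mask| in one layer.

    `mask_values` holds one scalar mask per filter; `prune_ratio` is the
    fraction of filters to remove (the same value is used for every layer).
    """
    n_prune = int(len(mask_values) * prune_ratio)
    order = sorted(range(len(mask_values)), key=lambda i: abs(mask_values[i]))
    return sorted(order[:n_prune])


# Hypothetical mask values for two layers of different width.
layer_masks = {
    "conv1": [0.9, 0.01, 0.5, 0.002],
    "conv2": [0.3, 0.0005, 0.7, 0.04, 0.6, 0.001, 0.8, 0.2],
}
pruned = {name: filters_to_prune(m, prune_ratio=0.5) for name, m in layer_masks.items()}
kept = {name: len(m) - len(pruned[name]) for name, m in layer_masks.items()}
```

Because every layer loses the same proportion of filters, the layers inside a residual module keep matching widths and no extra 1x1 convolution is required.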

<<FIGURE>>

FIGURE 3. The comparison of pruned and reserved filters. (a) The comparison of parameters order of magnitude between pruned and reserved filters. The x-axis represents the distribution interval and the y-axis represents the percentage of the parameter in the interval. (b) The comparison of non-zero activations. The left bar represents average non-zero activation percentage, and the right bar represents average non-zero activation value. 

<<TABLE>>

TABLE 3. Result of ResNet-32 on CIFAR-10 datasets. 

V. ANALYSIS 

A. L2 REGULARIZATION 

L2 regularization was also explored as a regularization strategy in this study. As shown in FIGURE 2, LeNet-5 can also be compressed without degrading accuracy based on L2 regularization. Nevertheless, there are some differences between L1 and L2 regularization. Both can improve accuracy when the pruning rate is less than 84%, but the effect of L2 regularization is better. The main reason is that regularization techniques can prevent overfitting and improve generalization ability. Moreover, as the pruning rate increases, L1 regularization achieves greater compression at the same accuracy. As per Han et al. [24], L1 regularization pushes more parameters closer to zero, so it can prune more parameters. Given these differences, L1 regularization is preferred from the perspective of the trade-off between compression and accuracy. 
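
The different behaviour of the two penalties near zero can be seen in a toy sketch that applies only the penalty term and ignores the data loss. The soft-threshold form of the L1 update and all numeric values are illustrative assumptions, not the training procedure used in this study.

```python
def l1_step(w, lr, lam):
    """Proximal (soft-threshold) update for the penalty lam * |w|."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0
    return w - shrink if w > 0 else w + shrink


def l2_step(w, lr, lam):
    """Gradient update for the penalty 0.5 * lam * w**2."""
    return w * (1.0 - lr * lam)


weights = [0.5, -0.05, 0.002]
w_l1, w_l2 = list(weights), list(weights)
for _ in range(100):
    w_l1 = [l1_step(w, lr=0.1, lam=0.01) for w in w_l1]
    w_l2 = [l2_step(w, lr=0.1, lam=0.01) for w in w_l2]
```

Under L1 the two small weights reach exactly zero, whereas under L2 every weight is only scaled by a constant factor and stays non-zero.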

B. THE EFFECT OF PRUNING 

To better describe the effect of the proposed method, a comparison was made between the pruned filters and the reserved filters. The CONV3-1 layer of VGG-16, which has 256 filters, was chosen, with λ set at 0.008. With this setting, 125 filters of the CONV3-1 layer could be removed. Empirically, a weak filter or neuron always has lower activation outputs, lower activation frequency, and lower weight values. Hence weight values and activation outputs were chosen here to evaluate the difference between pruned and preserved filters. 
As shown in FIGURE 3(a), the bulk of the pruned parameters (96.9%) have absolute weight values less than 10^-6, whereas most of the reserved parameters (94.5%) have absolute values greater than 0.001. The results indicate an enormous difference between the value distributions of the pruned and the reserved parameters. Therefore, the present approach can effectively reduce the order of magnitude of the pruned parameters. 
In addition, the test set was chosen as a sample to calculate the average non-zero activation values and percentage of CONV3-1. As is evident from FIGURE 3(b), both the average percentage of non-zero activations and the average value of non-zero activations of the pruned filters were much lower than those of the reserved filters. From the activation perspective, the pruned filters were weak, because their output and weight values were negligible compared with those of the reserved filters and could be completely ignored. Thus, using the order of magnitude of the mask to determine which filters or neurons to prune was reasonable. 
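
The two activation statistics used in this comparison can be computed as in the sketch below. It assumes the post-activation outputs of a filter have been flattened into one list over all test images and spatial positions; the sample values and the function name are hypothetical.

```python
def nonzero_activation_stats(activations):
    """Return (fraction of non-zero entries, mean of the non-zero entries).

    `activations` is a flat list of post-ReLU outputs of one filter,
    collected over all test images and spatial positions.
    """
    nonzero = [a for a in activations if a != 0.0]
    fraction = len(nonzero) / len(activations)
    mean_value = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return fraction, mean_value


# Hypothetical outputs of one reserved and one pruned filter.
reserved = [0.0, 1.2, 0.8, 0.0, 2.0, 0.4, 0.0, 1.6]
pruned = [0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 0.03]

reserved_stats = nonzero_activation_stats(reserved)
pruned_stats = nonzero_activation_stats(pruned)
```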

C. COMPARISON WITH OTHER METHODS 
In this section, two classical structured pruning methods were compared with the proposed method. First, for LeNet-5 on the MNIST dataset, the proposed method was compared with that of Wen et al. [31]. In this experiment, both methods adopted the same coefficient of sparsity regularization (λ = 0.03). The results (TABLE 5) show that the two methods were similar in terms of accuracy and compression effect. However, the proposed method is simpler and costs less computation in practice. Further, the proposed method was also compared with that of Liu et al. [33] for VGG-16 on CIFAR-10. Again, the same sparsity regularization coefficient (λ = 0.005) was adopted for both methods. However, Liu et al. [33] adopted a fixed percentage threshold, whereas the proposed method sets its threshold differently. The results (TABLE 4) reveal that the proposed method was superior in terms of compression efficiency, although there was a slight loss of accuracy. In general, the proposed method can not only generate sparsity but also achieve a better pruning effect with its improved threshold. 

<<TABLE>>

TABLE 4. Comparison of VGG-16 on CIFAR-10. 

<<TABLE>>

TABLE 5. Comparison of LeNet-5 on MNIST. 

Nevertheless, some shortcomings were also observed with this approach. One is that though this approach doesn't change the existing CNNs architecture, the added mask layer essentially increases the number of layers in the network. This may increase optimization difficulty. However, this problem can be solved by Batch Normalization (BN [38]). The other is that, as this method introduces a threshold, the pruning effect may not be smooth. The pruning rate may change drastically with small changes in the <<FORMULA>>, which is not conducive to finding the best <<FORMULA>>. 
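
The lack of smoothness introduced by a hard threshold can be illustrated with a small sketch. It assumes the threshold is compared against the absolute value of each mask entry; the mask values are hypothetical and deliberately clustered near the cut-off of 0.01.

```python
def apply_threshold(mask_values, threshold):
    """Zero every mask entry whose absolute value is below `threshold`."""
    return [0.0 if abs(m) < threshold else m for m in mask_values]


def pruning_rate(mask_values, threshold):
    """Fraction of filters (or neurons) removed at the given threshold."""
    zeroed = apply_threshold(mask_values, threshold)
    return sum(1 for m in zeroed if m == 0.0) / len(mask_values)


# Hypothetical trained mask values, clustered near 0.01.
masks = [0.8, 0.009, 0.011, 0.0095, 0.6, 0.0105, 0.012, 0.5]
rates = {t: pruning_rate(masks, t) for t in (0.009, 0.01, 0.011, 0.0125)}
```

When many mask values sit close to the cut-off, a small shift in where they land relative to it moves a large share of filters from "kept" to "pruned" at once.
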

VI. CONCLUSION 

In this article, a structured pruning technique is proposed to automatically remove redundant filters or neurons based on regularization. A mask is introduced to remove unimportant filters or neurons by zeroing the values of some masks during training. In addition, to deal with the problem that the mask cannot be completely zeroed in practice, a threshold is designed to zero the mask. Experiments on multiple datasets have shown that the proposed method can effectively remove parameters with a negligible loss of accuracy. In the future, establishing a relation between the hyper-parameter <<FORMULA>> and the pruning rate will be considered to facilitate the adjustment of this hyper-parameter. 


ACKNOWLEDGMENT 
All the mentioned support is gratefully acknowledged. 


REFERENCES 
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015. 
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105. 
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587. 
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680. 
[5] C. Shen, Y. Li, Y. Chen, X. Guan, and R. Maxion, "Performance analysis of multi-motion sensor behavior for active smartphone authentication," IEEE Trans. Inf. Forensics Security, vol. 13, no. 1, pp. 48–62, Jan. 2018. 
[6] C. Shen, Y. Chen, X. Guan, and R. Maxion, "Pattern-growth based mining mouse-interaction behavior for an active user authentication system," IEEE Trans. Dependable Secure Comput., to be published. 
[7] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," 2017, arXiv:1710.09282. [Online]. Available: https://arxiv.org/abs/1710.09282 
[8] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," 2015, arXiv:1511.06067. [Online]. Available: https://arxiv.org/abs/1511.06067 
[9] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2285–2294. 
[10] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," 2014, arXiv:1412.6115. [Online]. Available: https://arxiv.org/abs/1412.6115 
[11] Z. Tian, S. Su, W. Shi, X. Du, M. Guizani, and X. Yu, "A data-driven method for future Internet route decision modeling," Future Gener. Comput. Syst., vol. 95, pp. 212–220, Jun. 2018. 
[12] Z. Tian, W. Shi, Y. Wang, C. Zhu, X. Du, S. Su, Y. Sun, and N. Guizani, "Real-time lateral movement detection based on evidence reasoning network for edge computing environment," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4285–4294, Jul. 2019. 
[13] R. Liu, N. Fusi, and L. Mackey, "Teacher-student compression with generative adversarial networks," 2018, arXiv:1812.02271. [Online]. Available: https://arxiv.org/abs/1812.02271 
[14] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598–605. 
[15] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360. [Online]. Available: https://arxiv.org/abs/1602.07360 
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861 
[17] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848–6856. 
[18] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, p. 32, 2017. 
[19] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 1389–1397. 
[20] J.-H. Luo and J. Wu, "An entropy-based pruning method for CNN compression," 2017, arXiv:1706.05791. [Online]. Available: https://arxiv.org/abs/1706.05791 
[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267–288, 1996. 
[22] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164–171. 
[23] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379–1387. 
[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135–1143. 
[25] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: https://arxiv.org/abs/1510.00149 
[26] Z. Liu, J. Xu, X. Peng, and R. Xiong, "Frequency-domain dynamic pruning for convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1043–1053. 
[27] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," 2016, arXiv:1607.03250. [Online]. Available: https://arxiv.org/abs/1607.03250 
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convNets," 2016, arXiv:1608.08710. [Online]. Available: https://arxiv.org/abs/1608.08710 
[29] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 5058–5066. 
[30] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The power of sparsity in convolutional neural networks," 2017, arXiv:1702.06257. [Online]. Available: https://arxiv.org/abs/1702.06257 
[31] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074–2082. 
[32] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in Proc. IJCAI, 2018, pp. 2425–2432. 
[33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 2736–2744. 
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. 
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556 
[36] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," 2018, arXiv:1810.05270. [Online]. Available: https://arxiv.org/abs/1810.05270 
[37] X. Ding, G. Ding, J. Han, and S. Tang, "Auto-balanced filter pruning for efficient convolutional neural networks," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 6797–6804. 
[38] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: https://arxiv.org/abs/1502.03167 
[39] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1586–1595. 
[40] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc., B (Statist. Methodol.), vol. 68, no. 1, pp. 49–67, 2006. 
[41] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009. 
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778. 
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980 

CHEN YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. His research covers general deep learning and machine learning, and his main research interest is deep model compression. 

ZHENGHONG YANG received the master's and Ph.D. degrees from Beijing Normal University, in 1990 and 2001, respectively. He is currently a Professor with the College of Science, China Agricultural University. He has presided over two projects of the National Natural Science Foundation. He has written two teaching and research books and has published more than 40 academic papers in domestic and foreign journals, about 30 of which are cited by SCI/EI/ISTP. His major research interests include matrix theory, numerical algebra, image processing, and so on. He is a member of the Beijing and Chinese Society of Computational Mathematics. 

ABDUL MATEEN KHATTAK received the Ph.D. degree in horticulture and landscape from the University of Reading, U.K., in 1999. He was a Research Scientist in different agriculture research organizations before joining the University of Agriculture, Peshawar, Pakistan, where he is currently a Professor with the Department of Horticulture. He has conducted academic and applied research on different aspects of tropical fruits, vegetables, and ornamental plants. He has also worked for Alberta Agriculture and Forestry, Canada, as a Research Associate, and for the Organic Agriculture Centre of Canada as a Research and Extension Coordinator for Alberta province. There he helped in developing organic standards for greenhouse production and energy saving technologies for Alberta greenhouses. He is a Professor with considerable experience in teaching and research. He is currently a Visiting Professor with the College of Information and Electrical Engineering, China Agricultural University, Beijing. He has published 59 research articles in scientific journals of international repute. He has also attended and presented at several international scientific conferences. His research interests include greenhouse production, medicinal, aromatic and ornamental plants, light quality, supplemental lighting, temperature effects on greenhouse crops, aquaponics, and organic production. 

LIU YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interests include the application of image recognition and intelligent robots in the field of agriculture. 

WENXIN ZHANG is currently pursuing the master's degree with the School of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interest includes deep-learning-based pose estimation methods for pigs, for timely access to pig information. 

WANLIN GAO received the B.S., M.S., and Ph.D. degrees from China Agricultural University, in 1990, 2000, and 2010, respectively. He is currently the Dean of the College of Information and Electrical Engineering, China Agricultural University. He has been the principal investigator (PI) of over 20 national plans and projects. He has published 90 academic papers in domestic and foreign journals, over 40 of which are cited by SCI/EI/ISTP. He has written two teaching materials, which are supported by the National Key Technology Research and Development Program of China during the 11th Five-Year Plan Period, and five monographs. He holds 101 software copyrights, 11 patents for inventions, and eight patents for new practical inventions. His major research interests include the informationization of new rural areas, intelligence agriculture, and the service for rural comprehensive information. He is a member of the Science and Technology Committee of the Ministry of Agriculture, a member of the Agriculture and Forestry Committee of Computer Basic Education in colleges and universities, and a Senior Member of the Society of Chinese Agricultural Engineering, etc. 

MINJUAN WANG received the Ph.D. degree from the School of Biological Science and Medical Engineering, Beihang University, under the supervision of Prof. Hong Liu, in June 2017. She was a Visiting Scholar with the School of Environmental Science, Ontario Agriculture College, University of Guelph, from October 2015 to May 2017. She is currently a Postdoctoral Fellow with the College of Information and Electrical Engineering, China Agricultural University. Her research interests mainly include bioinformatics and the Internet of Things key technologies. 
<|endoftext|>


<|startoftext|>
     The 4 Research Techniques to Train Deep Neural Network Models More Efficiently


               James Le
  

      Deep learning and unsupervised feature learning have shown
      great promise in many practical applications. State-of-the-art
      performance has been reported in several domains, ranging
      from speech recognition and image recognition to text
      processing and beyond.


      It’s also been observed that increasing the scale of deep
      learning—with respect to numbers of training examples, model
      parameters, or both—can drastically improve accuracy. These
      results have led to a surge of interest in scaling up the training
      and inference algorithms used for these models and in
      improving optimization techniques for both.


      The use of GPUs is a significant advance in recent years that
      makes the training of modestly-sized deep networks practical.
      A known limitation of the GPU approach is that the training
      speed-up is small when the model doesn’t fit in a GPU’s
      memory (typically less than 6 gigabytes).


      To use a GPU effectively, researchers often reduce the size of
      the dataset or parameters so that CPU-to-GPU transfers are not
      a significant bottleneck. While data and parameter reduction
      work well for small problems (e.g. acoustic modeling for speech
      recognition), they are less attractive for problems with a large
      number of examples and dimensions (e.g., high-resolution
      images).


      In the previous post, we talked about 5 different algorithms for
      efficient deep learning inference. In this article, we’ll discuss
      the upper right part of the quadrant on the left. What are the
      best research techniques to train deep neural networks more
      efficiently?


      1 — Parallelization Training
      Let’s start with parallelization. As the figure below shows, the
      number of transistors keeps increasing over the years. But
      single-threaded performance and frequency are plateauing in
      recent years. Interestingly, the number of cores is increasing.

      So what we really need to know is how to parallelize the
      problem to take advantage of parallel processing. There are a
      lot of opportunities to do that in deep neural networks.


      For example, we can do data parallelism: feeding 2 images
      into the same model and running them at the same time. This
      does not affect latency for any single input. It doesn’t make it
      shorter, but it makes the batch size larger. It also requires
      coordinated weight updates during training.


      For example, in Jeff Dean’s paper “Large Scale Distributed Deep
      Networks,” there’s a parameter server (as a master) and a
      couple of model workers (as slaves) running their own pieces of
      training data and updating the gradient to the master.
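
To make the parameter-server idea concrete, here is a minimal single-process sketch. It is synchronous for simplicity, which is an assumption on our part and not how the paper's system is described here; the toy model `y = w * x`, the data, and the function names are all made up for illustration.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x on `batch`."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def parameter_server_step(w, shards, lr):
    """One synchronous update: each worker sends a gradient, the server averages."""
    worker_grads = [gradient(w, shard) for shard in shards]  # parallel in practice
    return w - lr * sum(worker_grads) / len(worker_grads)


# Toy data from y = 3x, split evenly across two hypothetical workers.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = parameter_server_step(w, shards, lr=0.01)
```

With equally sized shards, the averaged worker gradient equals the gradient over the full batch, which is why this scheme behaves like training with a larger batch.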

      Another idea is model parallelism — splitting up the model
      and distributing each part to different processors or different
      threads. For example, imagine we want to run convolution in
      the image below by doing a 6-dimension “for” loop. What we
      can do is cut the input image by 2x2 blocks, so that each
      thread/processor handles 1/4 of the image. Also, we can
      parallelize the convolutional layers by the output or input
      feature map regions, and the fully-connected layers by the
      output activation.
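
Here is a small sketch of the six-loop convolution and of splitting it by output feature map. It handles a single image with stride 1 and no padding, and, as is usual in deep learning, it does not flip the kernel. The tiny image, the kernels, and the `out_channels` argument are assumptions for illustration only.

```python
def conv2d(image, kernels, out_channels=None):
    """Direct convolution of one image: image is [C][H][W], kernels is [K][C][R][S].

    `out_channels` restricts the work to a subset of output feature maps.
    """
    n_in = len(image)
    height, width = len(image[0]), len(image[0][0])
    r_size, s_size = len(kernels[0][0]), len(kernels[0][0][0])
    out_h, out_w = height - r_size + 1, width - s_size + 1
    ks = range(len(kernels)) if out_channels is None else out_channels
    out = {}
    for k in ks:                                  # 1: output channel
        for y in range(out_h):                    # 2: output row
            for x in range(out_w):                # 3: output column
                acc = 0.0
                for c in range(n_in):             # 4: input channel
                    for r in range(r_size):       # 5: kernel row
                        for s in range(s_size):   # 6: kernel column
                            acc += image[c][y + r][x + s] * kernels[k][c][r][s]
                out[(k, y, x)] = acc
    return out


# Hypothetical 1-channel 3x3 image and two 2x2 kernels.
image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
kernels = [[[[1.0, 0.0], [0.0, 1.0]]], [[[0.0, 1.0], [1.0, 0.0]]]]

full = conv2d(image, kernels)
# Model parallelism by output feature map: each "worker" computes one channel.
part_a = conv2d(image, kernels, out_channels=[0])
part_b = conv2d(image, kernels, out_channels=[1])
merged = {**part_a, **part_b}
```

Splitting by output channel needs no communication between workers, since each output feature map depends only on its own kernel and the shared input.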

     2 — Mixed Precision Training
     Larger models usually require more compute and memory
     resources to train. These requirements can be lowered by using
     reduced precision representation and arithmetic.

     Performance (speed) of any program, including neural network
     training and inference, is limited by one of three factors:
     arithmetic bandwidth, memory bandwidth, or latency.
     Reduced precision addresses two of these limiters. Memory
     bandwidth pressure is lowered by using fewer bits to store the
     same number of values. Arithmetic time can also be lowered on
     processors that offer higher throughput for reduced precision
     math. For example, half-precision math throughput in recent
     GPUs is 2× to 8× higher than for single-precision. In addition
     to speed improvements, reduced precision formats also reduce
     the amount of memory required for training.

     Modern deep learning training systems use a single-precision
     (FP32) format. In their paper “Mixed Precision Training,”
     researchers from NVIDIA and Baidu addressed training with
     reduced precision while maintaining model accuracy.

     Specifically, they trained various neural networks using the
     IEEE half-precision format (FP16). Since FP16 format has a
     narrower dynamic range than FP32, they introduced three
      techniques to prevent model accuracy loss: maintaining a
      master copy of weights in FP32, loss-scaling that minimizes
      gradient values becoming zeros, and FP16 arithmetic with
      accumulation in FP32.
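
A tiny sketch can show why loss scaling and an FP32 master copy help. It uses the half-precision format code of Python's `struct` module to round values to FP16; the gradient value, the scale factor of 1024, and the learning rate are made-up numbers, and a Python float (double precision) stands in for the FP32 master copy. FP32 accumulation is not shown.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8                     # a small gradient value
loss_scale = 1024.0             # hypothetical scaling factor

unscaled = to_fp16(grad)                    # underflows to zero in FP16
scaled = to_fp16(grad * loss_scale)         # survives as an FP16 subnormal
recovered = scaled / loss_scale             # unscale in higher precision

master_weight = 1.0                         # higher-precision master copy
master_weight -= 0.1 * recovered            # update applied to the master copy
fp16_weight = to_fp16(master_weight)        # FP16 copy used for the next forward pass
```

Without scaling, the gradient is lost entirely. With scaling it is recovered to within about one percent, and the update is visible only in the master copy because it is too small to change the FP16 weight.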


      Using these techniques, they demonstrated that a wide variety
      of network architectures and applications can be trained to
      match the accuracy of FP32 training. Experimental results
      include convolutional and recurrent network architectures,
      trained for classification, regression, and generative tasks.


      Applications include image classification, image generation,
      object detection, language modeling, machine translation, and
      speech recognition. The proposed methodology requires no
      changes to models or training hyperparameters.


      3 — Model Distillation
      Model distillation refers to the idea of model compression by
      teaching a smaller network exactly what to do, step-by-step,
      using a bigger, already-trained network. The ‘soft labels’ refer
      to the output feature maps by the bigger network after every
      convolution layer. The smaller network is then trained to learn
      the exact behavior of the bigger network by trying to replicate
      its outputs at every level (not just the final loss).


      The method was first proposed by Bucila et al., 2006 and
      generalized by Hinton et al., 2015. In distillation, knowledge is
      transferred from the teacher model to the student by
      minimizing a loss function in which the target is the
      distribution of class probabilities predicted by the teacher
      model. That is — the output of a softmax function on the
      teacher model’s logits.
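
The loss described above can be sketched as a cross-entropy between the teacher's and the student's softmax outputs. The `temperature` argument follows the softened softmax of Hinton et al., 2015 and defaults to 1; the logits below are made-up numbers, and this is a sketch of the idea, not anyone's reference implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of `logits` divided by `temperature`, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's and the student's class distributions."""
    target = softmax(teacher_logits, temperature)
    predicted = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


teacher = [5.0, 2.0, 0.5]       # hypothetical logits
far_student = [0.0, 3.0, 1.0]
close_student = [4.8, 2.1, 0.4]

loss_far = distillation_loss(far_student, teacher, temperature=2.0)
loss_close = distillation_loss(close_student, teacher, temperature=2.0)
loss_self = distillation_loss(teacher, teacher, temperature=2.0)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, so a student whose logits are close to the teacher's scores lower than one that prefers a different class.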


      So how do teacher-student networks exactly work?


         The highly-complex teacher network is first trained
         separately using the complete dataset. This step requires
         high computational performance and thus can only be done
         offline (on high-performing GPUs).

         While designing a student network, correspondence needs
         to be established between intermediate outputs of the
         student network and the teacher network. This
         correspondence can involve directly passing the output of a
         layer in the teacher network to the student network, or
         performing some data augmentation before passing it to the
         student network.

         Next, the data are forward-passed through the teacher
         network to get all intermediate outputs, and then data
         augmentation (if any) is applied to the same.

         Finally, the outputs from the teacher network are back-
         propagated through the student network so that the student
         network can learn to replicate the behavior of the teacher
         network.

      4 — Dense-Sparse-Dense Training
      The research paper “Dense-Sparse-Dense Training for Deep
      Neural Networks” was published back in 2017 by researchers
      from Stanford, NVIDIA, Baidu, and Facebook. Applying Dense-
      Sparse-Dense (DSD) takes 3 sequential steps:

         Dense: Normal neural net training…business as usual. It’s
         notable that even though DSD acts as a regularizer, the
         usual regularization methods such as dropout and weight
         regularization can be applied as well. The authors don’t
         mention batch normalization, but it would work as well.

         Sparse: We regularize the network by removing connections
         with small weights. From each layer in the network, a
         percentage of the layer’s weights that are closest to 0 in
         absolute value is selected to be pruned. This means that
         they are set to 0 at each training iteration. It’s worth
         noting that the pruned weights are selected only once, not
         at each SGD iteration. Eventually, the network recovers the
         pruned weights’ knowledge and condenses it in the remaining
         ones. We train this sparse net until convergence.

         Dense: First, we re-enable the pruned weights from the
         previous step. The net is again trained normally until
         convergence. This step increases the capacity of the model.
         It can use the recovered capacity to store new knowledge.
         The authors note that the learning rate should be 1/10th of
         the original. Since the model is already performing well, the
         lower learning rate helps preserve the knowledge gained in
         the previous step.
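
The sparse and final dense phases above can be sketched on a flat list of weights. The gradients are held constant purely to keep the example short, and the sparsity of 50%, the step counts, and the helper names are assumptions, not values from the DSD paper.

```python
def sparsity_mask(weights, sparsity):
    """Mask that zeroes the `sparsity` fraction of weights closest to 0."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else 1.0 for i in range(len(weights))]


def sgd_step(weights, grads, lr, mask=None):
    """One SGD update; if a mask is given, pruned weights are held at 0."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    if mask is not None:
        updated = [w * m for w, m in zip(updated, mask)]
    return updated


weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05]   # after the first dense phase
grads = [0.1, 0.1, -0.1, 0.1, 0.1, -0.1]         # hypothetical, held constant here
base_lr = 0.1

mask = sparsity_mask(weights, sparsity=0.5)      # chosen once, before sparse training
sparse = weights
for _ in range(3):                               # sparse phase
    sparse = sgd_step(sparse, grads, base_lr, mask)

dense = sparse
for _ in range(3):                               # final dense phase, 1/10 learning rate
    dense = sgd_step(dense, grads, base_lr / 10)
```

The mask is computed a single time and reapplied after every update in the sparse phase. Dropping it in the last phase lets the pruned weights move away from zero again.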
<|endoftext|>


<|startoftext|>
                 THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS


                  Jonathan Frankle                    Michael Carbin
                  MIT CSAIL                         MIT CSAIL
                  jfrankle@csail.mit.edu           mcarbin@csail.mit.edu


                                              ABSTRACT

                       Neural network pruning techniques can reduce the parameter counts of trained net-
                       works by over 90%, decreasing storage requirements and improving computational
                       performance of inference without compromising accuracy. However, contemporary
                       experience is that the sparse architectures produced by pruning are difﬁcult to train
                       from the start, which would similarly improve training performance.
                       We ﬁnd that a standard pruning technique naturally uncovers subnetworks whose
                       initializations made them capable of training effectively. Based on these results, we
                       articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward
                       networks contain subnetworks (winning tickets) that—when trained in isolation—
                       reach test accuracy comparable to the original network in a similar number of
                       iterations. The winning tickets we find have won the initialization lottery: their
                       connections have initial weights that make training particularly effective.
                       We present an algorithm to identify winning tickets and a series of experiments
                       that support the lottery ticket hypothesis and the importance of these fortuitous
                       initializations. We consistently ﬁnd winning tickets that are less than 10-20% of
                       the size of several fully-connected and convolutional feed-forward architectures
                       for MNIST and CIFAR10. Above this size, the winning tickets that we ﬁnd learn
                       faster than the original network and reach higher test accuracy.


                  1 INTRODUCTION

                 Techniques for eliminating unnecessary weights from neural networks (pruning) (LeCun et al., 1990;
                 Hassibi & Stork, 1993; Han et al., 2015; Li et al., 2016) can reduce parameter-counts by more than
                 90% without harming accuracy. Doing so decreases the size (Han et al., 2015; Hinton et al., 2015)
                 or energy consumption (Yang et al., 2017; Molchanov et al., 2016; Luo et al., 2017) of the trained
                 networks, making inference more efﬁcient. However, if a network can be reduced in size, why do we
                 not train this smaller architecture instead in the interest of making training more efﬁcient as well?
                 Contemporary experience is that the architectures uncovered by pruning are harder to train from the
                 start, reaching lower accuracy than the original networks. 1

                 Consider an example. In Figure 1, we randomly sample and train subnetworks from a fully-connected
                 network for MNIST and convolutional networks for CIFAR10. Random sampling models the effect
                 of the unstructured pruning used by LeCun et al. (1990) and Han et al. (2015). Across various levels
                 of sparsity, dashed lines trace the iteration of minimum validation loss 2 and the test accuracy at that
                 iteration. The sparser the network, the slower the learning and the lower the eventual test accuracy.

                    1 “Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate
                 the difﬁculty of training a network with a small capacity.” (Li et al., 2016) “During retraining, it is better to retain
                 the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the
                 pruned layers...gradient descent is able to ﬁnd a good solution when the network is initially trained, but not after
                 re-initializing some layers and retraining them.” (Han et al., 2015)
                    2 As a proxy for the speed at which a network learns, we use the iteration at which an early-stopping criterion
                 would end training. The particular early-stopping criterion we employ throughout this paper is the iteration of
                 minimum validation loss during training. See Appendix C for more details on this choice.

                                                  <<FIGURE>>

                 Figure 1. The iteration at which early-stopping would occur (left) and the test accuracy at that iteration
                 (right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for
                 CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled
                 sparse networks (average of ten trials). Solid lines are winning tickets (average of ﬁve trials).


                 In this paper, we show that there consistently exist smaller subnetworks that train from the start and
                 learn at least as fast as their larger counterparts while reaching similar test accuracy. Solid lines in
                 Figure 1 show networks that we ﬁnd. Based on these results, we state the lottery ticket hypothesis.
                 The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork
                 that is initialized such that, when trained in isolation, it can match the test accuracy of the
                 original network after training for at most the same number of iterations.

                 More formally, consider a dense feed-forward neural network $f(x; \theta)$ with initial parameters $\theta = \theta_0 \sim \mathcal{D}_\theta$.
                 When optimizing with stochastic gradient descent (SGD) on a training set, $f$ reaches
                 minimum validation loss $l$ at iteration $j$ with test accuracy $a$. In addition, consider training $f(x; m \odot \theta)$
                 with a mask $m \in \{0, 1\}^{|\theta|}$ on its parameters such that its initialization is $m \odot \theta_0$. When optimizing
                 with SGD on the same training set (with $m$ fixed), $f$ reaches minimum validation loss $l'$ at iteration $j'$
                 with test accuracy $a'$. The lottery ticket hypothesis predicts that $\exists\, m$ for which $j' \le j$ (commensurate
                 training time), $a' \ge a$ (commensurate accuracy), and $\|m\|_0 \ll |\theta|$ (fewer parameters).
                 We ﬁnd that a standard pruning technique automatically uncovers such trainable subnetworks from
                 fully-connected and convolutional feed-forward networks. We designate these trainable subnetworks,
                 $f(x; m \odot \theta_0)$, winning tickets, since those that we find have won the initialization lottery with a
                 combination of weights and connections capable of learning. When their parameters are randomly
                 reinitialized ($f(x; m \odot \theta'_0)$ where $\theta'_0 \sim \mathcal{D}_\theta$), our winning tickets no longer match the performance of
                 the original network, offering evidence that these smaller networks do not train effectively unless
                 they are appropriately initialized.

                 Identifying winning tickets. We identify a winning ticket by training a network and pruning its
                 smallest-magnitude weights. The remaining, unpruned connections constitute the architecture of the
                 winning ticket. Unique to our work, each unpruned connection’s value is then reset to its initialization
                 from the original network before it was trained. This forms our central experiment.
                     1. Randomly initialize a neural network $f(x; \theta_0)$ (where $\theta_0 \sim \mathcal{D}_\theta$).
                     2. Train the network for $j$ iterations, arriving at parameters $\theta_j$.
                     3. Prune $p\%$ of the parameters in $\theta_j$, creating a mask $m$.
                     4. Reset the remaining parameters to their values in $\theta_0$, creating the winning ticket $f(x; m \odot \theta_0)$.
                 As described, this pruning approach is one-shot: the network is trained once, $p\%$ of weights are
                 pruned, and the surviving weights are reset. However, in this paper, we focus on iterative pruning,
                 which repeatedly trains, prunes, and resets the network over $n$ rounds; each round prunes $p^{1/n}\%$ of the
                 weights that survive the previous round. Our results show that iterative pruning ﬁnds winning tickets
                 that match the accuracy of the original network at smaller sizes than does one-shot pruning.
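
                 The train, prune, and reset loop described above can be sketched in a few lines of Python. This is
                 a minimal illustration under stated assumptions, not the authors' implementation: weights are a
                 flat list, `toy_train` is a hypothetical stand-in for SGD, and each round prunes a fixed fraction
                 `prune_frac` of the surviving weights. Setting `rounds=1` corresponds to one-shot pruning.

```python
import random


def magnitude_mask(weights, mask, prune_frac):
    """Return a new 0/1 mask that also prunes the `prune_frac` fraction of
    still-unmasked weights with the smallest magnitudes."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    alive.sort(key=lambda i: abs(weights[i]))  # smallest magnitude first
    n_prune = int(round(prune_frac * len(alive)))
    new_mask = list(mask)
    for i in alive[:n_prune]:
        new_mask[i] = 0
    return new_mask


def find_winning_ticket(theta0, train, prune_frac, rounds=1):
    """Train, prune, and reset `rounds` times (rounds=1 is one-shot)."""
    mask = [1] * len(theta0)
    for _ in range(rounds):
        # Start every round from the ORIGINAL initialization, masked.
        start = [w * m for w, m in zip(theta0, mask)]
        theta_j = train(start, mask)
        # Prune the smallest-magnitude surviving trained weights.
        mask = magnitude_mask(theta_j, mask, prune_frac)
    # Reset the survivors to their values in theta0.
    ticket = [w * m for w, m in zip(theta0, mask)]
    return ticket, mask


def toy_train(theta, mask):
    """Hypothetical placeholder for SGD: rescales the surviving weights."""
    return [1.5 * w * m for w, m in zip(theta, mask)]


rng = random.Random(0)
theta0 = [rng.gauss(0.0, 1.0) for _ in range(100)]
ticket, mask = find_winning_ticket(theta0, toy_train, prune_frac=0.2, rounds=3)
print(sum(mask), "of", len(mask), "weights remain")
```

                 In a real experiment `train` would run SGD with the mask held fixed, so that pruned weights stay
                 at zero throughout training.
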
                 Results. We identify winning tickets in a fully-connected architecture for MNIST and convolutional
                 architectures for CIFAR10 across several optimization strategies (SGD, momentum, and Adam) with
                 techniques like dropout, weight decay, batchnorm, and residual connections. We use an unstructured
                 pruning technique, so these winning tickets are sparse. In deeper networks, our pruning-based strategy
                 for finding winning tickets is sensitive to the learning rate: it requires warmup to find winning tickets
                 at higher learning rates. The winning tickets we ﬁnd are 10-20% (or less) of the size of the original

                                                  <<FIGURE>>

                 Figure 2. Architectures tested in this paper. Convolutions are 3x3. Lenet is from LeCun et al. (1998).
                 Conv-2/4/6 are variants of VGG (Simonyan & Zisserman, 2014). Resnet-18 is from He et al. (2016).
                 VGG-19 for CIFAR10 is adapted from Liu et al. (2019). Initializations are Gaussian Glorot (Glorot
                 & Bengio, 2010). Brackets denote residual connections around layers.


                 network (smaller size). Down to that size, they meet or exceed the original network’s test accuracy
                 (commensurate accuracy) in at most the same number of iterations (commensurate training time).
                 When randomly reinitialized, winning tickets perform far worse, meaning structure alone cannot
                 explain a winning ticket’s success.
                 The Lottery Ticket Conjecture. Returning to our motivating question, we extend our hypothesis
                 into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense,
                 randomly-initialized networks are easier to train than the sparse networks that result from pruning
                 because there are more possible subnetworks from which training might recover a winning ticket.
                 Contributions.
                    • We demonstrate that pruning uncovers trainable subnetworks that reach test accuracy comparable
                      to the original networks from which they derived in a comparable number of iterations.
                    • We show that pruning finds winning tickets that learn faster than the original network while
                      reaching higher test accuracy and generalizing better.
                    • We propose the lottery ticket hypothesis as a new perspective on the composition of neural
                      networks to explain these findings.
                 Implications. In this paper, we empirically study the lottery ticket hypothesis. Now that we have
                 demonstrated the existence of winning tickets, we hope to exploit this knowledge to:
                 Improve training performance. Since winning tickets can be trained from the start in isolation, a hope
                 is that we can design training schemes that search for winning tickets and prune as early as possible.
                 Design better networks. Winning tickets reveal combinations of sparse architectures and initializations
                 that are particularly adept at learning. We can take inspiration from winning tickets to design new
                 architectures and initialization schemes with the same properties that are conducive to learning. We
                 may even be able to transfer winning tickets discovered for one task to many others.
                 Improve our theoretical understanding of neural networks. We can study why randomly-initialized
                 feed-forward networks seem to contain winning tickets and potential implications for theoretical
                 study of optimization (Du et al., 2019) and generalization (Zhou et al., 2018; Arora et al., 2018).

                  2 WINNING TICKETS IN FULLY-CONNECTED NETWORKS

                 In this Section, we assess the lottery ticket hypothesis as applied to fully-connected networks trained
                 on MNIST. We use the Lenet-300-100 architecture (LeCun et al., 1998) as described in Figure 2.
                 We follow the outline from Section 1: after randomly initializing and training a network, we prune
                 the network and reset the remaining connections to their original initializations. We use a simple
                 layer-wise pruning heuristic: remove a percentage of the weights with the lowest magnitudes within
                 each layer (as in Han et al. (2015)). Connections to outputs are pruned at half of the rate of the rest of
                 the network. We explore other hyperparameters in Appendix G, including learning rates, optimization
                 strategies (SGD, momentum), initialization schemes, and network sizes.
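
                 The layer-wise heuristic can be illustrated as follows. This is a sketch under assumptions, not
                 the paper's code: the layer names, the toy weight values, and the helper names `prune_smallest`
                 and `prune_layerwise` are hypothetical. Each layer is pruned independently by magnitude, and
                 the output layer is pruned at half the rate of the others.

```python
def prune_smallest(weights, mask, rate):
    """Zero the mask entries of the `rate` fraction of surviving weights
    that have the smallest magnitudes within this one layer."""
    alive = sorted(
        (i for i, m in enumerate(mask) if m == 1),
        key=lambda i: abs(weights[i]),
    )
    pruned = set(alive[: int(round(rate * len(alive)))])
    return [0 if i in pruned else m for i, m in enumerate(mask)]


def prune_layerwise(layers, masks, rate, output_layer):
    """Prune every layer at `rate`, except `output_layer` at rate / 2."""
    return {
        name: prune_smallest(
            w, masks[name], rate / 2 if name == output_layer else rate
        )
        for name, w in layers.items()
    }


# Toy weights, chosen only to make the result easy to read.
layers = {
    "hidden": [0.5, -0.1, 0.3, 0.05, -0.9, 0.2, 0.7, -0.4, 0.6, 0.8],
    "out": [0.02, -0.3, 0.4, 0.1, -0.6, 0.9, 0.25, -0.7, 0.35, 0.15],
}
masks = {name: [1] * len(w) for name, w in layers.items()}
masks = prune_layerwise(layers, masks, rate=0.2, output_layer="out")
print(masks["hidden"])  # two smallest-magnitude weights removed
print(masks["out"])     # one smallest-magnitude weight removed
```
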

                                                  <<FIGURE>>

                 Figure 3. Test accuracy on Lenet (iterative pruning) as training proceeds. Each curve is the average
                 of five trials. Labels are Pm, the fraction of weights remaining in the network after pruning. Error
                 bars are the minimum and maximum of any trial.


                 Notation. $P_m = \frac{\|m\|_0}{|\theta|}$ is the sparsity of mask m, e.g., $P_m = 25\%$ when 75% of weights are pruned.
                 Iterative pruning. The winning tickets we find learn faster than the original network. Figure 3 plots
                 the average test accuracy when training winning tickets iteratively pruned to various extents. Error
                 bars are the minimum and maximum of ﬁve runs. For the ﬁrst pruning rounds, networks learn faster
                 and reach higher test accuracy the more they are pruned (left graph in Figure 3). A winning ticket
                 comprising 51.3% of the weights from the original network (i.e., Pm = 51.3%) reaches higher test
                 accuracy faster than the original network but slower than when Pm = 21.1%. When Pm < 21.1%,
                 learning slows (middle graph). When Pm = 3.6%, a winning ticket regresses to the performance of
                 the original network. A similar pattern repeats throughout this paper.
                 Figure 4a summarizes this behavior for all pruning levels when iteratively pruning by 20% per
                 iteration (blue). On the left is the iteration at which each network reaches minimum validation loss
                 (i.e., when the early-stopping criterion would halt training) in relation to the percent of weights
                 remaining after pruning; in the middle is test accuracy at that iteration. We use the iteration at which
                 the early-stopping criterion is met as a proxy for how quickly the network learns.
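
                 As a small illustration of this proxy, the early-stopping iteration is simply the position of the
                 minimum in a recorded validation-loss curve. The loss values and the evaluation interval below
                 are made-up toy numbers, not results from the paper, and `early_stopping_iteration` is a
                 hypothetical helper name.

```python
def early_stopping_iteration(val_losses, eval_every):
    """Return the training iteration with the lowest recorded validation loss.

    val_losses[k] is the loss measured after (k + 1) * eval_every iterations.
    """
    best = min(range(len(val_losses)), key=lambda k: val_losses[k])
    return (best + 1) * eval_every


# Toy curves (illustrative only): the second reaches its minimum sooner.
dense_curve = [0.90, 0.40, 0.25, 0.18, 0.15, 0.16, 0.19]
ticket_curve = [0.70, 0.22, 0.14, 0.15, 0.17, 0.20, 0.24]

dense_it = early_stopping_iteration(dense_curve, eval_every=1000)
ticket_it = early_stopping_iteration(ticket_curve, eval_every=1000)
print(dense_it, ticket_it)
```
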
                 The winning tickets learn faster as Pm decreases from 100% to 21%, at which point early-stopping
                 occurs 38% earlier than for the original network. Further pruning causes learning to slow, returning
                 to the early-stopping performance of the original network when Pm = 3.6%. Test accuracy increases
                 with pruning, improving by more than 0.3 percentage points when Pm = 13.5%; after this point,
                 accuracy decreases, returning to the level of the original network when Pm = 3.6%.
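
                 The sparsity levels quoted in this section (51.3%, 21.1%, 13.5%, 3.6%) are consistent with
                 pruning 20% of the surviving weights per round while pruning the output layer at half that rate.
                 The check below is a sketch that assumes Lenet-300-100 has weight matrices of shape 784x300,
                 300x100, and 100x10 and that only weights (not biases) are counted; these sizes are an assumption
                 and are not stated in this section. It also ignores rounding to whole weights.

```python
def fraction_remaining(rounds, rate=0.20):
    """Fraction of weights left after `rounds` of iterative pruning."""
    hidden = 784 * 300 + 300 * 100  # assumed: pruned at the full rate
    output = 100 * 10               # assumed: pruned at half the rate
    kept = hidden * (1 - rate) ** rounds + output * (1 - rate / 2) ** rounds
    return kept / (hidden + output)


for n in (3, 7, 9, 15):
    print(n, f"{100 * fraction_remaining(n):.1f}%")
# 3 51.3%
# 7 21.1%
# 9 13.5%
# 15 3.6%
```
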
                 At early stopping, training accuracy (Figure 4a, right) increases with pruning in a similar pattern to
                 test accuracy, seemingly implying that winning tickets optimize more effectively but do not generalize
                 better. However, at iteration 50,000 (Figure 4b), iteratively-pruned winning tickets still see a test
                 accuracy improvement of up to 0.35 percentage points in spite of the fact that training accuracy
                 reaches 100% for nearly all networks (Appendix D, Figure 12). This means that the gap between
                 training accuracy and test accuracy is smaller for winning tickets, pointing to improved generalization.
                 Random reinitialization. To measure the importance of a winning ticket’s initialization, we retain
                 the structure of a winning ticket (i.e., the mask m) but randomly sample a new initialization <<FORMULA>>.
                 We randomly reinitialize each winning ticket three times, making 15 total per point in Figure 4. We
                 ﬁnd that initialization is crucial for the efﬁcacy of a winning ticket. The right graph in Figure 3
                 shows this experiment for iterative pruning. In addition to the original network and winning tickets at
                 Pm = 51% and 21% are the random reinitialization experiments. Where the winning tickets learn
                 faster as they are pruned, they learn progressively slower when randomly reinitialized.
                 The broader results of this experiment are the orange line in Figure 4a. Unlike winning tickets, the
                 reinitialized networks learn increasingly slower than the original network and lose test accuracy after
                 little pruning. The average reinitialized iterative winning ticket’s test accuracy drops off from the
                 original accuracy when Pm = 21.1%, compared to 2.9% for the winning ticket. When Pm = 21%,
                 the winning ticket reaches minimum validation loss 2.51x faster than when reinitialized and is half a
                 percentage point more accurate. All networks reach 100% training accuracy for Pm ≥ 5%; Figure

                                                  <<FIGURE>>

                 Figure 4. Early-stopping iteration and accuracy of Lenet under one-shot and iterative pruning.
                 Average of ﬁve trials; error bars for the minimum and maximum values. At iteration 50,000, training
                 accuracy ≈ 100% for Pm ≥ 2% for iterative winning tickets (see Appendix D, Figure 12).


                 4b therefore shows that the winning tickets generalize substantially better than when randomly
                 reinitialized. This experiment supports the lottery ticket hypothesis’ emphasis on initialization:
                 the original initialization withstands and benefits from pruning, while the random reinitialization’s
                 performance immediately suffers and diminishes steadily.
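
                 The random reinitialization control can be sketched as follows: the mask is kept, but the surviving
                 weights are drawn afresh from the same initialization distribution. This is an illustration under
                 assumptions. It uses the Gaussian Glorot scheme named in Figure 2 for a single 300x100 layer, and
                 the mask here is a random placeholder, whereas in the experiment it would come from pruning.

```python
import math
import random


def glorot_normal(fan_in, fan_out, rng):
    """Sample fan_in * fan_out weights from N(0, 2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [rng.gauss(0.0, std) for _ in range(fan_in * fan_out)]


def apply_mask(theta, mask):
    """Elementwise product of a weight list with a 0/1 mask."""
    return [w * m for w, m in zip(theta, mask)]


rng = random.Random(0)
fan_in, fan_out = 300, 100

theta0 = glorot_normal(fan_in, fan_out, rng)  # original initialization
# Placeholder mask keeping roughly 21% of weights; a real mask comes from pruning.
mask = [1 if rng.random() < 0.21 else 0 for _ in theta0]

winning_ticket = apply_mask(theta0, mask)      # same structure, original values
theta0_new = glorot_normal(fan_in, fan_out, rng)
reinitialized = apply_mask(theta0_new, mask)   # same structure, new values
```

                 Both networks share exactly the same connectivity, so any difference in how they train is
                 attributable to the initial values rather than to the sparse structure.
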
                 One-shot pruning. Although iterative pruning extracts smaller winning tickets, repeated training
                 means they are costly to ﬁnd. One-shot pruning makes it possible to identify winning tickets
                 without this repeated training. Figure 4c shows the results of one-shot pruning (green) and randomly
                 reinitializing (red); one-shot pruning does indeed find winning tickets. When 67.5% ≥ Pm ≥ 17.6%,
                 the average winning tickets reach minimum validation loss earlier than the original network.
                 When 95.0% ≥ Pm ≥ 5.17%, test accuracy is higher than the original network. However,
                 iteratively-pruned winning tickets learn faster and reach higher test accuracy at smaller network sizes. The
                 green and red lines in Figure 4c are reproduced on the logarithmic axes of Figure 4a, making this
                 performance gap clear. Since our goal is to identify the smallest possible winning tickets, we focus
                 on iterative pruning throughout the rest of the paper.

                  3 WINNING TICKETS IN CONVOLUTIONAL NETWORKS

                 Here, we apply the lottery ticket hypothesis to convolutional networks on CIFAR10, increasing
                 both the complexity of the learning problem and the size of the networks. We consider the Conv-2,
                 Conv-4, and Conv-6 architectures in Figure 2, which are scaled-down variants of the VGG (Simonyan
                 & Zisserman, 2014) family. The networks have two, four, or six convolutional layers followed by
                 two fully-connected layers; max-pooling occurs after every two convolutional layers. The networks
                 cover a range from near-fully-connected to traditional convolutional networks, with less than 1% of
                 parameters in convolutional layers in Conv-2 to nearly two thirds in Conv-6. 3

                 Finding winning tickets. The solid lines in Figure 5 (top) show the iterative lottery ticket experiment
                 on Conv-2 (blue), Conv-4 (orange), and Conv-6 (green) at the per-layer pruning rates from Figure 2.
                 The pattern from Lenet in Section 2 repeats: as the network is pruned, it learns faster and test accuracy
                 rises as compared to the original network. In this case, the results are more pronounced. Winning

                    3 Appendix H explores other hyperparameters, including learning rates, optimization strategies (SGD,
                 momentum), and the relative rates at which to prune convolutional and fully-connected layers.

                                                  <<FIGURE>>

                 Figure 5. Early-stopping iteration and test and training accuracy of the Conv-2/4/6 architectures when
                 iteratively pruned and when randomly reinitialized. Each solid line is the average of ﬁve trials; each
                 dashed line is the average of ﬁfteen reinitializations (three per trial). The bottom right graph plots test
                 accuracy of winning tickets at iterations corresponding to the last iteration of training for the original
                 network (20,000 for Conv-2, 25,000 for Conv-4, and 30,000 for Conv-6); at this iteration, training
                 accuracy ≈ 100% for Pm ≥ 2% for winning tickets (see Appendix D).


                  tickets reach minimum validation loss at best 3.5x faster for Conv-2 (Pm = 8.8%), 3.5x for Conv-4
                 (Pm = 9.2%), and 2.5x for Conv-6 (Pm = 15.1%). Test accuracy improves at best 3.4 percentage
                 points for Conv-2 (Pm = 4.6%), 3.5 for Conv-4 (Pm = 11.1%), and 3.3 for Conv-6 (Pm = 26.4%).
                 All three networks remain above their original average test accuracy when Pm > 2%.
                 As in Section 2, training accuracy at the early-stopping iteration rises with test accuracy. However, at
                 iteration 20,000 for Conv-2, 25,000 for Conv-4, and 30,000 for Conv-6 (the iterations corresponding
                 to the ﬁnal training iteration for the original network), training accuracy reaches 100% for all networks
                 when Pm ≥ 2% (Appendix D, Figure 13) and winning tickets still maintain higher test accuracy
                 (Figure 5 bottom right). This means that the gap between test and training accuracy is smaller for
                 winning tickets, indicating they generalize better.
                 Random reinitialization. We repeat the random reinitialization experiment from Section 2, which
                 appears as the dashed lines in Figure 5. These networks again take increasingly longer to learn upon
                 continued pruning. Just as with Lenet on MNIST (Section 2), test accuracy drops off more quickly
                 for the random reinitialization experiments. However, unlike Lenet, test accuracy at early-stopping
                 time initially remains steady and even improves for Conv-2 and Conv-4, indicating that—at moderate
                 levels of pruning—the structure of the winning tickets alone may lead to better accuracy.
                 Dropout. Dropout (Srivastava et al., 2014; Hinton et al., 2012) improves accuracy by randomly
                 disabling a fraction of the units (i.e., randomly sampling a subnetwork) on each training iteration. Baldi
                 & Sadowski (2013) characterize dropout as simultaneously training the ensemble of all subnetworks.
                 Since the lottery ticket hypothesis suggests that one of these subnetworks comprises a winning ticket,
                 it is natural to ask whether dropout and our strategy for ﬁnding winning tickets interact.
                 Figure 6 shows the results of training Conv-2, Conv-4, and Conv-6 with a dropout rate of 0.5. Dashed
                 lines are the network performance without dropout (the solid lines in Figure 5). 4 We continue to ﬁnd
                 winning tickets when training with dropout. Dropout increases initial test accuracy (2.1, 3.0, and 2.4
                 percentage points on average for Conv-2, Conv-4, and Conv-6, respectively), and iterative pruning
                 increases it further (up to an additional 2.3, 4.6, and 4.7 percentage points, respectively, on average).
                 Learning becomes faster with iterative pruning as before, but less dramatically in the case of Conv-2.


                    4 We choose new learning rates for the networks as trained with dropout—see Appendix H.5.

                                                  <<FIGURE>>

                 Figure 6. Early-stopping iteration and test accuracy at early-stopping of Conv-2/4/6 when iteratively
                 pruned and trained with dropout. The dashed lines are the same networks trained without dropout
                 (the solid lines in Figure 5). Learning rates are 0.0003 for Conv-2 and 0.0002 for Conv-4 and Conv-6.

                               <<FIGURE>>

                   Figure 7. Test accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively pruned.


                 These improvements suggest that our iterative pruning strategy interacts with dropout in a complementary
                 way. Srivastava et al. (2014) observe that dropout induces sparse activations in the ﬁnal
                 network; it is possible that dropout-induced sparsity primes a network to be pruned. If so, dropout
                 techniques that target weights (Wan et al., 2013) or learn per-weight dropout probabilities (Molchanov
                 et al., 2017; Louizos et al., 2018) could make winning tickets even easier to ﬁnd.

                  4 VGG AND RESNET FOR CIFAR10

                 Here, we study the lottery ticket hypothesis on networks evocative of the architectures and techniques
                 used in practice. Speciﬁcally, we consider VGG-style deep convolutional networks (VGG-19 on
                 CIFAR10—Simonyan & Zisserman (2014)) and residual networks (Resnet-18 on CIFAR10—He
                 et al. (2016)). 5 These networks are trained with batchnorm, weight decay, decreasing learning
                 rate schedules, and augmented training data. We continue to ﬁnd winning tickets for all of these
                 architectures; however, our method for ﬁnding them, iterative pruning, is sensitive to the particular
                 learning rate used. In these experiments, rather than measure early-stopping time (which, for these
                 larger networks, is entangled with learning rate schedules), we plot accuracy at several moments
                 during training to illustrate the relative rates at which accuracy improves.
                 Global pruning. On Lenet and Conv-2/4/6, we prune each layer separately at the same rate. For
                 Resnet-18 and VGG-19, we modify this strategy slightly: we prune these deeper networks globally,
                 removing the lowest-magnitude weights collectively across all convolutional layers. In Appendix
                 I.1, we ﬁnd that global pruning identiﬁes smaller winning tickets for Resnet-18 and VGG-19. Our
                 conjectured explanation for this behavior is as follows. For these deeper networks, some layers have
                 far more parameters than others. For example, the ﬁrst two convolutional layers of VGG-19 have
                 1728 and 36864 parameters, while the last has 2.35 million. When all layers are pruned at the same
                 rate, these smaller layers become bottlenecks, preventing us from identifying the smallest possible
                 winning tickets. Global pruning makes it possible to avoid this pitfall.
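
                 The difference from layer-wise pruning can be seen in a toy sketch. Here all weights are ranked
                 together by magnitude, so a small layer whose weights are all comparatively large is left intact.
                 The layer names and weight values are hypothetical and chosen only to make the effect visible.

```python
def global_masks(layers, prune_frac):
    """Prune the `prune_frac` fraction of smallest-magnitude weights,
    ranked collectively across all layers rather than within each layer."""
    scored = sorted(
        (abs(w), name, i)
        for name, weights in layers.items()
        for i, w in enumerate(weights)
    )
    n_prune = int(round(prune_frac * len(scored)))
    masks = {name: [1] * len(weights) for name, weights in layers.items()}
    for _, name, i in scored[:n_prune]:
        masks[name][i] = 0
    return masks


# A small early layer with large weights and a bigger late layer with small ones.
layers = {
    "conv_first": [0.9, -0.8, 0.7, -0.6],
    "conv_last": [0.01 * k for k in range(1, 17)],  # 0.01 ... 0.16
}
masks = global_masks(layers, prune_frac=0.5)
print(sum(masks["conv_first"]), "of", len(layers["conv_first"]))  # 4 of 4
print(sum(masks["conv_last"]), "of", len(layers["conv_last"]))    # 6 of 16
```

                 Pruning each layer separately at 50% would instead remove two of the four weights in the small
                 layer, which is the bottleneck effect described above.
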
                 VGG-19. We study the variant VGG-19 adapted for CIFAR10 by Liu et al. (2019); we use the
                 same training regime and hyperparameters: 160 epochs (112,480 iterations) with SGD with
                    5 See Figure 2 and Appendix I for details on the networks, hyperparameters, and training regimes.

                                                  <<FIGURE>>

                   Figure 8. Test accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively pruned.


                 momentum (0.9) and decreasing the learning rate by a factor of 10 at 80 and 120 epochs. This
                 network has 20 million parameters. Figure 7 shows the results of iterative pruning and random
                 reinitialization on VGG-19 at two initial learning rates: 0.1 (used in Liu et al. (2019)) and 0.01. At the
                 higher learning rate, iterative pruning does not ﬁnd winning tickets, and performance is no better than
                 when the pruned networks are randomly reinitialized. However, at the lower learning rate, the usual
                 pattern reemerges, with subnetworks that remain within 1 percentage point of the original accuracy
                 while Pm ≥ 3.5%. (They are not winning tickets, since they do not match the original accuracy.)
                 When randomly reinitialized, the subnetworks lose accuracy as they are pruned in the same manner as
                 other experiments throughout this paper. Although these subnetworks learn faster than the unpruned
                 network early in training (Figure 7 left), this accuracy advantage erodes later in training due to the
                 lower initial learning rate. However, these subnetworks still learn faster than when reinitialized.
                 To bridge the gap between the lottery ticket behavior of the lower learning rate and the accuracy
                 advantage of the higher learning rate, we explore the effect of linear learning rate warmup from 0 to
                 the initial learning rate over k iterations. Training VGG-19 with warmup (k = 10000, green line) at
                 learning rate 0.1 improves the test accuracy of the unpruned network by about one percentage point.
                 Warmup makes it possible to find winning tickets, exceeding this initial accuracy when Pm ≥ 1.5%.
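
                 The schedule described here, linear warmup followed by step decay, can be written as a single
                 function of the iteration count. This is a sketch rather than the authors' code; the function
                 name and signature are hypothetical. The milestones assume 703 iterations per epoch, which
                 follows from 112,480 iterations over 160 epochs.

```python
def learning_rate(iteration, base_lr, warmup_iters, milestones=(), gamma=0.1):
    """Linear warmup from 0 to base_lr over `warmup_iters` iterations,
    then multiply by `gamma` at each milestone iteration reached."""
    if warmup_iters > 0:
        lr = base_lr * min(1.0, iteration / warmup_iters)
    else:
        lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= gamma
    return lr


iters_per_epoch = 112480 // 160  # 703
vgg_milestones = (80 * iters_per_epoch, 120 * iters_per_epoch)

for it in (0, 5000, 10000, 60000, 100000):
    lr = learning_rate(it, base_lr=0.1, warmup_iters=10000,
                       milestones=vgg_milestones)
    print(it, lr)
```
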
                 Resnet-18. Resnet-18 (He et al., 2016) is a 20-layer convolutional network with residual connections
                 designed for CIFAR10. It has 271,000 parameters. We train the network for 30,000 iterations with
                 SGD with momentum (0.9), decreasing the learning rate by a factor of 10 at 20,000 and 25,000
                 iterations. Figure 8 shows the results of iterative pruning and random reinitialization at learning
                 rates 0.1 (used in He et al. (2016)) and 0.01. These results largely mirror those of VGG: iterative
                 pruning ﬁnds winning tickets at the lower learning rate but not the higher learning rate. The accuracy
                 of the best winning tickets at the lower learning rate (89.5% when 41.7% ≥ Pm ≥ 21.9%) falls
                 short of the original network’s accuracy at the higher learning rate (90.5%). At the lower learning rate,
                 the winning ticket again initially learns faster (left plots of Figure 8), but falls behind the unpruned
                 network at the higher learning rate later in training (right plot). Winning tickets trained with warmup
                 close the accuracy gap with the unpruned network at the higher learning rate, reaching 90.5% test
                 accuracy with learning rate 0.03 (warmup, k = 20000) at Pm = 27.1%. For these hyperparameters,
                 we still find winning tickets when Pm ≥ 11.8%. Even with warmup, however, we could not find
                 hyperparameters for which we could identify winning tickets at the original learning rate, 0.1.
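The Resnet-18 schedule described above (30,000 iterations, with the learning rate divided by 10 at iterations 20,000 and 25,000) can be written as a small step function. This is a sketch under assumptions: the name `step_decay_lr` is hypothetical, and the text does not say whether a drop takes effect at the milestone iteration or just after it, so the sketch treats milestones as inclusive.

```python
def step_decay_lr(iteration, base_lr, milestones=(20000, 25000), factor=10.0):
    """Divide base_lr by `factor` once for every milestone already reached.

    Hypothetical helper. Milestones are inclusive (an assumption): the drop
    applies from the milestone iteration onward.
    """
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr / (factor ** drops)


if __name__ == "__main__":
    # The two initial learning rates compared in the text.
    for base in (0.1, 0.01):
        print(base, [step_decay_lr(it, base) for it in (0, 19999, 20000, 25000, 29999)])
```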

                  5 DISCUSSION

                 Existing work on neural network pruning (e.g., Han et al. (2015)) demonstrates that the function
                 learned by a neural network can often be represented with fewer parameters. Pruning typically
                 proceeds by training the original network, removing connections, and further ﬁne-tuning. In effect,
                 the initial training initializes the weights of the pruned network so that it can learn in isolation during
                 ﬁne-tuning. We seek to determine if similarly sparse networks can learn from the start. We ﬁnd that
                 the architectures studied in this paper reliably contain such trainable subnetworks, and the lottery
                 ticket hypothesis proposes that this property applies in general. Our empirical study of the existence
                 and nature of winning tickets invites a number of follow-up questions.
                 The importance of winning ticket initialization. When randomly reinitialized, a winning ticket
                 learns more slowly and achieves lower test accuracy, suggesting that initialization is important to
                 its success. One possible explanation for this behavior is that these initial weights are close to their final
                 values after training—that in the most extreme case, they are already trained. However, experiments
                 in Appendix F show the opposite—that the winning ticket weights move further than other weights.
                 This suggests that the beneﬁt of the initialization is connected to the optimization algorithm, dataset,
                 and model. For example, the winning ticket initialization might land in a region of the loss landscape
                 that is particularly amenable to optimization by the chosen optimization algorithm.
                 Liu et al. (2019) ﬁnd that pruned networks are indeed trainable when randomly reinitialized, seemingly
                 contradicting conventional wisdom and our random reinitialization experiments. For example, on
                 VGG-19 (for which we share the same setup), they ﬁnd that networks pruned by up to 80% and
                 randomly reinitialized match the accuracy of the original network. Our experiments in Figure 7
                 conﬁrm these ﬁndings at this level of sparsity (below which Liu et al. do not present data). However,
                 after further pruning, initialization matters: we find winning tickets when VGG-19 is pruned by up
                 to 98.5%; when reinitialized, these tickets reach much lower accuracy. We hypothesize that—up
                 to a certain level of sparsity—highly overparameterized networks can be pruned, reinitialized, and
                 retrained successfully; however, beyond this point, extremely pruned, less severely overparameterized
                 networks only maintain accuracy with fortuitous initialization.
                 The importance of winning ticket structure. The initialization that gives rise to a winning ticket
                 is arranged in a particular sparse architecture. Since we uncover winning tickets through heavy
                 use of training data, we hypothesize that the structure of our winning tickets encodes an inductive
                 bias customized to the learning task at hand. Cohen & Shashua (2016) show that the inductive bias
                 embedded in the structure of a deep network determines the kinds of data that it can separate more
                 parameter-efﬁciently than can a shallow network; although Cohen & Shashua (2016) focus on the
                 pooling geometry of convolutional networks, a similar effect may be at play with the structure of
                 winning tickets, allowing them to learn even when heavily pruned.
                 The improved generalization of winning tickets. We reliably find winning tickets that generalize
                 better, exceeding the test accuracy of the original network while matching its training accuracy.
                 Test accuracy increases and then decreases as we prune, forming an Occam’s Hill (Rasmussen &
                 Ghahramani, 2001) where the original, overparameterized model has too much complexity (perhaps
                 overﬁtting) and the extremely pruned model has too little. The conventional view of the relationship
                 between compression and generalization is that compact hypotheses can better generalize (Rissanen,
                 1986). Recent theoretical work shows a similar link for neural networks, proving tighter generalization
                 bounds for networks that can be compressed further (Zhou et al. (2018) for pruning/quantization
                 and Arora et al. (2018) for noise robustness). The lottery ticket hypothesis offers a complementary
                 perspective on this relationship—that larger networks might explicitly contain simpler representations.
                 Implications for neural network optimization. Winning tickets can reach accuracy equivalent to
                 that of the original, unpruned network, but with signiﬁcantly fewer parameters. This observation
                 connects to recent work on the role of overparameterization in neural network training. For example,
                 Du et al. (2019) prove that sufﬁciently overparameterized two-layer relu networks (with ﬁxed-size
                 second layers) trained with SGD converge to global optima. A key question, then, is whether the
                 presence of a winning ticket is necessary or sufﬁcient for SGD to optimize a neural network to a
                 particular test accuracy. We conjecture (but do not empirically show) that SGD seeks out and trains a
                 well-initialized subnetwork. By this logic, overparameterized networks are easier to train because
                 they have more combinations of subnetworks that are potential winning tickets.


                  6 LIMITATIONS AND FUTURE WORK

                 We only consider vision-centric classiﬁcation tasks on smaller datasets (MNIST, CIFAR10). We do
                 not investigate larger datasets (namely Imagenet (Russakovsky et al., 2015)). Iterative pruning is
                 computationally intensive, requiring training a network 15 or more times consecutively for multiple
                 trials. In future work, we intend to explore more efﬁcient methods for ﬁnding winning tickets that
                 will make it possible to study the lottery ticket hypothesis in more resource-intensive settings.
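To give a sense of the cost of iterative pruning, the arithmetic below relates the number of train-prune rounds to the fraction of weights that survive. It is a sketch under stated assumptions: this section does not give the per-round pruning rate, so the 20% used here is an assumed value, and the sketch further assumes the same rate is applied to the surviving weights in every round. The function names are hypothetical.

```python
import math


def remaining_fraction(rate, rounds):
    """Fraction of weights left after `rounds` rounds, each pruning `rate` of the survivors."""
    return (1.0 - rate) ** rounds


def rounds_to_reach(target_fraction, rate):
    """Smallest number of rounds after which at most `target_fraction` of the weights remain.

    Uses floating-point logarithms, so a target that is an exact power of
    (1 - rate) may be off by one round.
    """
    return math.ceil(math.log(target_fraction) / math.log(1.0 - rate))


if __name__ == "__main__":
    assumed_rate = 0.2  # assumption: 20% of surviving weights pruned per round
    print(remaining_fraction(assumed_rate, 15))  # roughly 3.5% of weights remain
    print(rounds_to_reach(0.015, assumed_rate))  # rounds needed to reach 1.5%
```

Because each round requires a full training run, the number of rounds is a direct multiplier on training cost, before accounting for repeated trials.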
                 Sparse pruning is our only method for ﬁnding winning tickets. Although we reduce parameter-counts,
                 the resulting architectures are not optimized for modern libraries or hardware. In future work, we
                 intend to study other pruning methods from the extensive contemporary literature, such as structured
                 pruning (which would produce networks optimized for contemporary hardware) and non-magnitude
                 pruning methods (which could produce smaller winning tickets or ﬁnd them earlier).
                 The winning tickets we ﬁnd have initializations that allow them to match the performance of the
                 unpruned networks at sizes too small for randomly-initialized networks to do the same. In future
                 work, we intend to study the properties of these initializations that, in concert with the inductive
                 biases of the pruned network architectures, make these networks particularly adept at learning.
                 On deeper networks (Resnet-18 and VGG-19), iterative pruning is unable to ﬁnd winning tickets
                 unless we train the networks with learning rate warmup. In future work, we plan to explore why
                 warmup is necessary and whether other improvements to our scheme for identifying winning tickets
                 could obviate the need for these hyperparameter modiﬁcations.

                  7 RELATED WORK

                  In practice, neural networks tend to be dramatically overparameterized. Distillation (Ba & Caruana,
                 2014; Hinton et al., 2015) and pruning (LeCun et al., 1990; Han et al., 2015) rely on the fact that
                 parameters can be reduced while preserving accuracy. Even with sufﬁcient capacity to memorize
                 training data, networks naturally learn simpler functions (Zhang et al., 2016; Neyshabur et al., 2014;
                 Arpit et al., 2017). Contemporary experience (Bengio et al., 2006; Hinton et al., 2015; Zhang et al.,
                 2016) and Figure 1 suggest that overparameterized networks are easier to train. We show that dense
                 networks contain sparse subnetworks capable of learning on their own starting from their original
                 initializations. Several other research directions aim to train small or sparse networks.
                 Prior to training. Squeezenet (Iandola et al., 2016) and MobileNets (Howard et al., 2017) are
                 speciﬁcally engineered image-recognition networks that are an order of magnitude smaller than
                 standard architectures. Denil et al. (2013) represent weight matrices as products of lower-rank factors.
                 Li et al. (2018) restrict optimization to a small, randomly-sampled subspace of the parameter space
                 (meaning all parameters can still be updated); they successfully train networks under this restriction.
                 We show that one need not even update all parameters to optimize a network, and we ﬁnd winning
                 tickets through a principled search process involving pruning. Our contribution to this class of
                 approaches is to demonstrate that sparse, trainable networks exist within larger networks.
                 After training. Distillation (Ba & Caruana, 2014; Hinton et al., 2015) trains small networks to mimic
                 the behavior of large networks; small networks are easier to train in this paradigm. Recent pruning
                 work compresses large models to run with limited resources (e.g., on mobile devices). Although
                 pruning is central to our experiments, we study why training needs the overparameterized networks
                 that make pruning possible. LeCun et al. (1990) and Hassibi & Stork (1993) ﬁrst explored pruning
                 based on second derivatives. More recently, Han et al. (2015) showed per-weight magnitude-based
                 pruning substantially reduces the size of image-recognition networks. Guo et al. (2016) restore
                 pruned connections as they become relevant again. Han et al. (2017) and Jin et al. (2016) restore
                 pruned connections to increase network capacity after small weights have been pruned and surviving
                 weights ﬁne-tuned. Other proposed pruning heuristics include pruning based on activations (Hu et al.,
                 2016), redundancy (Mariet & Sra, 2016; Srinivas & Babu, 2015a), per-layer second derivatives (Dong
                 et al., 2017), and energy/computation efﬁciency (Yang et al., 2017) (e.g., pruning convolutional
                 ﬁlters (Li et al., 2016; Molchanov et al., 2016; Luo et al., 2017) or channels (He et al., 2017)). Cohen
                 et al. (2016) observe that convolutional ﬁlters are sensitive to initialization (“The Filter Lottery”);
                 throughout training, they randomly reinitialize unimportant ﬁlters.
                 During training. Bellec et al. (2018) train with sparse networks and replace weights that reach
                 zero with new random connections. Srinivas et al. (2017) and Louizos et al. (2018) learn gating
                 variables that minimize the number of nonzero parameters. Narang et al. (2017) integrate magnitude-
                 based pruning into training. Gal & Ghahramani (2016) show that dropout approximates Bayesian
                 inference in Gaussian processes. Bayesian perspectives on dropout learn dropout probabilities during
                 training (Gal et al., 2017; Kingma et al., 2015; Srinivas & Babu, 2016). Techniques that learn per-
                 weight, per-unit (Srinivas & Babu, 2016), or structured dropout probabilities naturally (Molchanov
                 et al., 2017; Neklyudov et al., 2017) or explicitly (Louizos et al., 2017; Srinivas & Babu, 2015b)
                 prune and sparsify networks during training as dropout probabilities for some weights reach 1. In
                 contrast, we train networks at least once to ﬁnd winning tickets. These techniques might also ﬁnd
                 winning tickets, or, by inducing sparsity, might beneﬁcially interact with our methods.

                  REFERENCES
                 Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for
                   deep nets via a compression approach. ICML, 2018.
                 Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S
                   Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at
                   memorization in deep networks. In International Conference on Machine Learning, pp. 233–242,
                   2017.
                 Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information
                   processing systems, pp. 2654–2662, 2014.
                 Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in neural information
                   processing systems, pp. 2814–2822, 2013.
                 Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training
                   very sparse deep networks. Proceedings of ICLR, 2018.
                 Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex
                   neural networks. In Advances in neural information processing systems, pp. 123–130, 2006.
                 Joseph Paul Cohen, Henry Z Lo, and Wei Ding. Randomout: Using a convolutional gradient norm to
                   win the filter lottery. ICLR Workshop, 2016.
                 Nadav Cohen and Amnon Shashua. Inductive bias of deep convolutional networks through pooling
                   geometry. arXiv preprint arXiv:1605.06743, 2016.
                 Misha Denil, Babak Shakibi, Laurent Dinh, Nando De Freitas, et al. Predicting parameters in deep
                   learning. In Advances in neural information processing systems, pp. 2148–2156, 2013.
                 Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise
                   optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4860–4874,
                   2017.
                 Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes
                   over-parameterized neural networks. In International Conference on Learning Representations,
                   2019. URL https://openreview.net/forum?id=S1eK3i09YQ.
                 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model
                   uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059,
                   2016.
                 Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information
                   Processing Systems, pp. 3584–3593, 2017.
                 Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
                   networks. In Proceedings of the thirteenth international conference on artificial intelligence and
                   statistics, pp. 249–256, 2010.
                 Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances
                   In Neural Information Processing Systems, pp. 1379–1387, 2016.
                 Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for
                   efficient neural network. In Advances in neural information processing systems, pp. 1135–1143,
                   2015.
                 Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John
                   Tran, and William J Dally. Dsd: Regularizing deep neural networks with dense-sparse-dense
                   training flow. Proceedings of ICLR, 2017.
                 Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain
                   surgeon. In Advances in neural information processing systems, pp. 164–171, 1993.
                 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
                   recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
                   pp. 770–778, 2016.
                 Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks.
                   In International Conference on Computer Vision (ICCV), volume 2, pp. 6, 2017.
                 Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
                   preprint arXiv:1503.02531, 2015.
                 Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov.
                   Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint
                   arXiv:1207.0580, 2012.
                 Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
                   Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for
                   mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
                 Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven
                   neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250,
                   2016.
                 Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt
                   Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size.
                   arXiv preprint arXiv:1602.07360, 2016.
                 Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks
                   with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.
                 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
                   arXiv:1412.6980, 2014.
                 Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameteri-
                   zation trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
                 Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
                 Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural
                   information processing systems, pp. 598–605, 1990.
                 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
                   document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
                 Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension
                   of objective landscapes. Proceedings of ICLR, 2018.
                 Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for
                   efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
                 Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value
                   of network pruning. In International Conference on Learning Representations, 2019. URL
                   https://openreview.net/forum?id=rJlnB3C5Ym.
                 Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In
                   Advances in Neural Information Processing Systems, pp. 3290–3300, 2017.
                 Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through
                   l_0 regularization. Proceedings of ICLR, 2018.
                 Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural
                   network compression. arXiv preprint arXiv:1707.06342, 2017.
                 Zelda Mariet and Suvrit Sra. Diversity networks. Proceedings of ICLR, 2016.
                 Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural
                   networks. arXiv preprint arXiv:1701.05369, 2017.
                 Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional
                   neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 2016.
                 Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent
                   neural networks. Proceedings of ICLR, 2017.
                 Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured bayesian
                   pruning via log-normal multiplicative noise. In Advances in Neural Information Processing
                   Systems, pp. 6778–6787, 2017.
                 Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the
                   role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
                 Carl Edward Rasmussen and Zoubin Ghahramani. Occam’s razor. In T. K. Leen, T. G. Dietterich,
                   and V. Tresp (eds.), Advances in Neural Information Processing Systems 13, pp. 294–300. MIT
                   Press, 2001. URL http://papers.nips.cc/paper/1925-occams-razor.pdf.
                 Jorma Rissanen. Stochastic complexity and modeling. The annals of statistics, pp. 1080–1100, 1986.
                 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
                   Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition
                   challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
                 Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
                   recognition. arXiv preprint arXiv:1409.1556, 2014.
                 Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv
                   preprint arXiv:1507.06149, 2015a.
                 Suraj Srinivas and R Venkatesh Babu. Learning neural network architectures using backpropagation.
                   arXiv preprint arXiv:1511.05497, 2015b.
                 Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
                 Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks.
                   In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
                   pp. 138–145, 2017.
                 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
                   Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine
                   Learning Research, 15(1):1929–1958, 2014.
                 Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural
                   networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066,
                   2013.
                 Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural
                   networks using energy-aware pruning. arXiv preprint, 2017.
                 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding
                   deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
                 Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Compressibility
                   and generalization in large-scale deep learning. arXiv preprint arXiv:1804.05862, 2018.

                  A ACKNOWLEDGMENTS

                 We gratefully acknowledge IBM, which—through the MIT-IBM Watson AI Lab—contributed the
                 computational resources necessary to conduct the experiments in this paper. We particularly thank
                 IBM researchers German Goldszmidt, David Cox, Ian Molloy, and Benjamin Edwards for their
                 generous contributions of infrastructure, technical support, and feedback. We also wish to thank
                 Aleksander Madry, Shaﬁ Goldwasser, Ed Felten, David Bieber, Karolina Dziugaite, Daniel Weitzner,
                 and R. David Edelman for support, feedback, and helpful discussions over the course of this project.
                 This work was supported in part by the Ofﬁce of Naval Research (ONR N00014-17-1-2699).

                  B ITERATIVE PRUNING STRATEGIES

                 In this Appendix, we examine two different ways of structuring the iterative pruning strategy that we
                 use throughout the main body of the paper to ﬁnd winning tickets.

                 Strategy 1. Iterative pruning with resetting.

                     1. Randomly initialize a neural network <<FORMULA>> where <<FORMULA>> and <<FORMULA>> is a mask.
                     2. Train the network for j iterations, reaching parameters <<FORMULA>>.
                     3. Prune s% of the parameters, creating an updated mask m' where <<FORMULA>>.
                     4. Reset the weights of the remaining portion of the network to their values in <<FORMULA>>. That is, let
                        <<FORMULA>>.
                     5. Let <<FORMULA>> and repeat steps 2 through 4 until a sufficiently pruned network has been
                        obtained.

                 Strategy 2. Iterative pruning with continued training.

                     1. Randomly initialize a neural network <<FORMULA>> where <<FORMULA>> and <<FORMULA>> is a mask.
                     2. Train the network for j iterations.
                     3. Prune s% of the parameters, creating an updated mask m' where <<FORMULA>>.
                     4. Let <<FORMULA>> and repeat steps 2 and 3 until a sufficiently pruned network has been obtained.
                     5. Reset the weights of the remaining portion of the network to their values in <<FORMULA>>. That is, let
                        <<FORMULA>>.

                 The difference between these two strategies is that, after each round of pruning, Strategy 2 retrains
                 using the already-trained weights, whereas Strategy 1 resets the network weights back to their initial
                 values before retraining. In both cases, after the network has been sufﬁciently pruned, its weights are
                 reset back to the original initializations.
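The control flow of the two strategies can be sketched on a toy problem. This is an illustration under assumptions, not the authors' implementation: `train` is a stand-in that runs gradient descent on a simple quadratic instead of training a network, `prune` assumes that "s%" means a fraction of the surviving weights and that the lowest-magnitude weights are removed, and all function names are hypothetical.

```python
def train(weights, mask, targets, steps=50, lr=0.1):
    """Toy stand-in for training: gradient descent on sum((w - t)^2).

    Pruned positions (mask == 0) are held at zero and never updated.
    """
    w = [wi * mi for wi, mi in zip(weights, mask)]
    for _ in range(steps):
        w = [wi - lr * 2.0 * (wi - ti) * mi for wi, ti, mi in zip(w, targets, mask)]
    return w


def prune(trained, mask, s):
    """Remove the fraction s of surviving weights with the smallest trained magnitude."""
    alive = [i for i, mi in enumerate(mask) if mi == 1]
    n_remove = int(round(s * len(alive)))
    new_mask = list(mask)
    for i in sorted(alive, key=lambda i: abs(trained[i]))[:n_remove]:
        new_mask[i] = 0
    return new_mask


def strategy_1(theta0, targets, s, rounds):
    """Iterative pruning with resetting: every round trains from theta0."""
    mask = [1] * len(theta0)
    for _ in range(rounds):
        trained = train(theta0, mask, targets)  # train from the reset weights
        mask = prune(trained, mask, s)          # update the mask
        # The reset is implicit: the next round starts from theta0 again.
    return mask, [w * m for w, m in zip(theta0, mask)]


def strategy_2(theta0, targets, s, rounds):
    """Iterative pruning with continued training: reset only once, at the end."""
    mask = [1] * len(theta0)
    weights = list(theta0)
    for _ in range(rounds):
        weights = train(weights, mask, targets)  # continue from trained weights
        mask = prune(weights, mask, s)
    return mask, [w * m for w, m in zip(theta0, mask)]  # final reset to theta0


if __name__ == "__main__":
    theta0 = [0.5, -0.5, 0.3, 0.9, -0.1, 0.2, 0.7, -0.8, 0.4, -0.6]
    targets = [0.1, -0.9, 0.5, -0.2, 0.8, 0.3, -0.7, 0.05, 0.6, -0.4]
    print(strategy_1(theta0, targets, s=0.2, rounds=2))
    print(strategy_2(theta0, targets, s=0.2, rounds=2))
```

On this convex, separable toy objective both strategies select the same mask, so the sketch only shows where the reset happens in each loop; it does not reproduce the accuracy difference between the strategies.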
                 Figures 9 and 10 compare the two strategies on the Lenet and Conv-2/4/6 architectures with the
                 hyperparameters we select in Appendices G and H. In all cases, Strategy 1 maintains higher
                 validation accuracy and faster early-stopping times to smaller network sizes.

                  C EARLY STOPPING CRITERION

                 Throughout this paper, we are interested in measuring the speed at which networks learn. As a proxy
                 for this quantity, we measure the iteration at which an early-stopping criterion would end training.
                 The speciﬁc criterion we employ is the iteration of minimum validation loss. In this Subsection, we
                 further explain that criterion.
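The criterion amounts to taking the argmin of the recorded validation losses. The sketch below assumes validation loss is logged at a fixed interval; the function name and the `eval_every` parameter are hypothetical, since the logging frequency is not stated here.

```python
def early_stopping_iteration(val_losses, eval_every=1):
    """Return (iteration, loss) at which the recorded validation loss is lowest.

    val_losses[i] is assumed to be the loss measured at iteration
    (i + 1) * eval_every. Ties resolve to the earliest evaluation.
    Raises ValueError if val_losses is empty.
    """
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (best_index + 1) * eval_every, val_losses[best_index]


if __name__ == "__main__":
    # Made-up loss curve: falls, bottoms out, then rises as the model overfits.
    losses = [0.90, 0.50, 0.30, 0.28, 0.31, 0.40]
    print(early_stopping_iteration(losses, eval_every=100))
```

A network whose minimum occurs at a smaller iteration is the one treated as having learned faster.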
                 Validation and test loss follow a pattern where they decrease early in the training process, reach a
                 minimum, and then begin to increase as the model overﬁts to the training data. Figure 11 shows an
                 example of the validation loss as training progresses; these graphs use Lenet, iterative pruning, and
                 Adam with a learning rate of 0.0012 (the learning rate we will select in the following subsection).
                 This Figure shows the validation loss corresponding to the test accuracies in Figure 3.

                                                  <<FIGURE>>

                 Figure 9. The early-stopping iteration and accuracy at early-stopping of the iterative lottery ticket
                 experiment on the Lenet architecture when iteratively pruned using the resetting and continued
                 training strategies.

                           <<FIGURE>>

                 Figure 10. The early-stopping iteration and accuracy at early-stopping of the iterative lottery ticket
                 experiment on the Conv-2, Conv-4, and Conv-6 architectures when iteratively pruned using the
                 resetting and continued training strategies.

                              <<FIGURE>>

                  Figure 11. The validation loss data corresponding to Figure 3, i.e., the validation loss as training
                  progresses for several different levels of pruning in the iterative pruning experiment. Each line is
                  the average of ﬁve training runs at the same level of iterative pruning; the labels are the percentage
                  of weights from the original network that remain after pruning. Each network was trained with
                 Adam at a learning rate of 0.0012. The left graph shows winning tickets that learn increasingly faster
                  than the original network and reach lower loss. The middle graph shows winning tickets that learn
                  increasingly slower after the fastest early-stopping time has been reached. The right graph contrasts
                  the loss of winning tickets to the loss of randomly reinitialized networks.

                                                  <<FIGURE>>

                 Figure 12. Figure 4 augmented with a graph of the training accuracy at the end of 50,000 iterations.


                 In all cases, validation loss initially drops, after which it forms a clear bottom and then begins
                 increasing again. Our early-stopping criterion identiﬁes this bottom. We consider networks that reach
                 this moment sooner to have learned “faster.” In support of this notion, the ordering in which each
                 experiment meets our early-stopping criterion in Figure 3 is the same order in which each experiment
                 reaches a particular test accuracy threshold in Figure 3.
                 Throughout this paper, in order to contextualize this learning speed, we also present the test accuracy
                 of the network at the iteration of minimum validation loss. In the main body of the paper, we ﬁnd
                 that winning tickets both arrive at early-stopping sooner and reach higher test accuracy at this point.


                  D TRAINING ACCURACY FOR LOTTERY TICKET EXPERIMENTS

                 This Appendix accompanies Figure 4 (the accuracy and early-stopping iterations of Lenet on MNIST
                 from Section 2) and Figure 5 (the accuracy and early-stopping iterations of Conv-2, Conv-4, and
                 Conv-6 in Section 3) in the main body of the paper. Those figures show the iteration of
                 early-stopping, the test accuracy at early-stopping, the training accuracy at early-stopping, and the
                 test accuracy at the end of the training process. However, we did not have space to include a graph
                 of the training accuracy at the end of the training process, which we assert in the main body of the
                 paper to be 100% for all but the most heavily pruned networks. In this Appendix, we include those
                 additional graphs in Figure 12 (corresponding to Figure 4) and Figure 13 (corresponding to Figure 5).
                 As we describe in the main body of the paper, training accuracy reaches 100% in all cases for all but
                 the most heavily pruned networks. However, training accuracy remains at 100% longer for winning
                 tickets than for randomly reinitialized networks.


                  E COMPARING RANDOM REINITIALIZATION AND RANDOM SPARSITY

                 In this Appendix, we aim to understand the relative performance of randomly reinitialized winning
                 tickets and randomly sparse networks.

                     1. Networks found via iterative pruning with the original initializations (blue in Figure 14).
                     2. Networks found via iterative pruning that are randomly reinitialized (orange in Figure 14).
                     3. Random sparse subnetworks with the same number of parameters as those found via iterative
                        pruning (green in Figure 14).

                                                  <<FIGURE>>

                      Figure 13. Figure 5 augmented with a graph of the training accuracy at the end of the training process.


                 Figure 14 shows this comparison for all of the major experiments in this paper. For the fully-connected
                 Lenet architecture for MNIST, we ﬁnd that the randomly reinitialized networks outperform random
                 sparsity. However, for all of the other, convolutional networks studied in this paper, there is no
                 signiﬁcant difference in performance between the two. We hypothesize that the fully-connected
                 network for MNIST sees these beneﬁts because only certain parts of the MNIST images contain
                 useful information for classiﬁcation, meaning connections in some parts of the network will be more
                 valuable than others. This is less true with convolutions, which are not constrained to any one part of
                 the input image.
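The three kinds of networks compared in this Appendix can be sketched for a single layer as follows. This is a rough illustration under stated assumptions: weights and masks are flat Python lists, fresh initializations are drawn from a zero-mean Gaussian with a caller-supplied standard deviation, and the random sparse network also receives a fresh random initialization. The helper names (`apply_mask`, `random_mask_like`, `three_conditions`) are hypothetical.

```python
import random


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


def random_mask_like(mask, rng):
    """Random mask with the same length and the same number of ones."""
    shuffled = list(mask)
    rng.shuffle(shuffled)
    return shuffled


def three_conditions(theta0, mask, init_std, rng):
    """Build one layer's starting weights for the three compared conditions."""
    fresh_a = [rng.gauss(0.0, init_std) for _ in theta0]
    fresh_b = [rng.gauss(0.0, init_std) for _ in theta0]
    return {
        # 1. Pruned structure with the original initialization.
        "winning_ticket": apply_mask(theta0, mask),
        # 2. Same pruned structure, new random initialization.
        "random_reinit": apply_mask(fresh_a, mask),
        # 3. Random structure with the same parameter count (assumed fresh init).
        "random_sparse": apply_mask(fresh_b, random_mask_like(mask, rng)),
    }


rng = random.Random(0)
conditions = three_conditions([0.5, -0.2, 0.1, -0.7], [1, 0, 0, 1], 0.1, rng)
```

Conditions 2 and 3 differ only in which connections are kept, so comparing them isolates the contribution of the structure that pruning found.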


                  F EXAMINING WINNING TICKETS

                 In this Appendix, we examine the structure of winning tickets to gain insight into why winning tickets
                 are able to learn effectively even when so heavily pruned. Throughout this Appendix, we study the
                 winning tickets from the Lenet architecture trained on MNIST. Unless otherwise stated, we use the
                 same hyperparameters as in Section 2: glorot initialization and adam optimization.

                  F.1 WINNING TICKET INITIALIZATION (ADAM)

                 Figure 15 shows the distributions of winning ticket initializations for four different levels of P_m. To
                 clarify, these are the distributions of the initial weights of the connections that have survived the
                 pruning process. The blue, orange, and green lines show the distribution of weights for the ﬁrst
                 hidden layer, second hidden layer, and output layer, respectively. The weights are collected from ﬁve
                 different trials of the lottery ticket experiment, but the distributions for each individual trial closely
                 mirror those aggregated from across all of the trials. The histograms have been normalized so that the
                 area under each curve is 1.
                 The left-most graph in Figure 15 shows the initialization distributions for the unpruned networks. We
                 use glorot initialization, so each of the layers has a different standard deviation. As the network is
                 pruned, the ﬁrst hidden layer maintains its distribution. However, the second hidden layer and the
                 output layer become increasingly bimodal, with peaks on either side of 0. Interestingly, the peaks
                 are asymmetric: the second hidden layer has more positive initializations remaining than negative
                 initializations, and the reverse is true for the output layer.
                 The connections in the second hidden layer and output layer that survive the pruning process tend
                 to have higher-magnitude initializations. Since we find winning tickets by pruning the connections
                 with the lowest magnitudes in each layer at the end, the connections with the lowest-magnitude
                 initializations must still have the lowest-magnitude weights at the end of training. A different trend
                 holds for the input layer: it maintains its distribution, meaning a connection’s initialization has less
                 relation to its ﬁnal weight.
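The distributions discussed above can be approximated by collecting the initial values of surviving connections and building a histogram whose area is 1. The sketch below is illustrative only: it assumes flat lists for weights and masks, equal-width bins over a fixed range, and hypothetical helper names (`surviving_initializations`, `density_histogram`). The paper does not say how its curves were produced.

```python
def surviving_initializations(theta0, mask):
    """Initial values of the connections that the mask keeps."""
    return [w for w, m in zip(theta0, mask) if m == 1]


def density_histogram(values, num_bins, low, high):
    """Equal-width histogram over [low, high], normalized to unit area.

    Values outside the range are dropped before normalizing.
    """
    width = (high - low) / num_bins
    counts = [0] * num_bins
    kept = 0
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), num_bins - 1)
            counts[index] += 1
            kept += 1
    if kept == 0:
        return [0.0] * num_bins
    return [c / (kept * width) for c in counts]


# Hypothetical layer in which only larger-magnitude initializations survive.
theta0 = [-0.3, -0.05, 0.02, 0.4, 0.25]
mask = [1, 0, 0, 1, 1]
density = density_histogram(surviving_initializations(theta0, mask), 4, -0.5, 0.5)
```

In this toy example the two central bins are empty, which is the bimodal shape described for the second hidden layer and output layer.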

                  F.2 WINNING TICKET INITIALIZATIONS (SGD)

                 We also consider the winning tickets obtained when training the network with SGD at learning rate 0.8
                 (selected as described in Appendix G). The bimodal distributions from Figure 15 are present across
                 all layers (see Figure 16). The connections with the highest-magnitude initializations are more likely
                 to survive the pruning process, meaning winning ticket initializations have a bimodal distribution
                 with peaks on opposite sides of 0. Just as with the adam-optimized winning tickets, these peaks are
                 of different sizes, with the ﬁrst hidden layer favoring negative initializations and the second hidden
                 layer and output layer favoring positive initializations. Just as with the adam results, we conﬁrm that
                 each individual trial evidences the same asymmetry as the aggregate graphs in Figure 16.

                  F.3 REINITIALIZING FROM WINNING TICKET INITIALIZATIONS

                 Considering that the initialization distributions of winning tickets D_m are so different from the
                 Gaussian distribution D used to initialize the unpruned network, it is natural to ask whether randomly
                 reinitializing winning tickets from D_m rather than D will improve winning ticket performance. We do
                 not ﬁnd this to be the case. Figure 17 shows the performance of winning tickets whose initializations
                 are randomly sampled from the distribution of initializations contained in the winning tickets for

                                                  <<FIGURE>>

                        Figure 14. The test accuracy at the ﬁnal iteration for each of the networks studied in this paper.

                                                              <<FIGURE>>

                 Figure 15. The distribution of initializations in winning tickets pruned to the levels speciﬁed in the
                 titles of each plot. The blue, orange, and green lines show the distributions for the ﬁrst hidden layer,
                 second hidden layer, and output layer of the Lenet architecture for MNIST when trained with the
                 adam optimizer and the hyperparameters used in Section 2. The distributions have been normalized so that
                 the area under each curve is 1.

                                                                          <<FIGURE>>

                        Figure 16. Same as Figure 15 where the network is trained with SGD at rate 0.8.


                 adam. More concretely, let <<FORMULA>> be the set of initializations found in the winning ticket with mask m. We sample a new set of parameters
                 <<FORMULA>> and train the network <<FORMULA>>. We perform this sampling on a per-layer basis. The results of this experiment are in Figure 17.
                 Winning tickets reinitialized from D_m perform little better than when randomly reinitialized from D.
                 We attempted the same experiment with the SGD-trained winning tickets and found similar results.
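The per-layer resampling described above can be sketched as follows. This is one possible reading rather than the authors' implementation: it treats D_m as the empirical set of surviving initial values in each layer and draws from it uniformly with replacement. The function name `resample_from_winning_ticket` and the layer names are hypothetical.

```python
import random


def resample_from_winning_ticket(theta0_by_layer, mask_by_layer, rng):
    """Reinitialize each surviving weight from its own layer's D_m.

    D_m is taken to be the list of initial values of surviving
    connections in the layer; pruned positions stay at 0.
    """
    new_weights = {}
    for name, theta0 in theta0_by_layer.items():
        mask = mask_by_layer[name]
        pool = [w for w, m in zip(theta0, mask) if m == 1]
        new_weights[name] = [rng.choice(pool) if m == 1 else 0.0 for m in mask]
    return new_weights


rng = random.Random(0)
theta0 = {"hidden1": [0.3, -0.01, -0.4, 0.02], "output": [0.9, -0.8]}
masks = {"hidden1": [1, 0, 1, 0], "output": [1, 1]}
resampled = resample_from_winning_ticket(theta0, masks, rng)
```

The new weights follow the same bimodal distribution as the winning ticket, but each value is no longer tied to the specific connection it originally initialized.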

                  F.4 PRUNING AT ITERATION 0

                 One other way of interpreting the graphs of winning ticket initialization distributions is as follows:
                 weights that begin small stay small, get pruned, and never become part of the winning ticket. (The
                 only exception to this characterization is the ﬁrst hidden layer for the adam-trained winning tickets.)
                 If this is the case, then perhaps low-magnitude weights were never important to the network and can
                 be pruned from the very beginning. Figure 18 shows the result of attempting this pruning strategy.
                 Winning tickets selected in this fashion perform even worse than when they are found by iterative

                                                                <<FIGURE>>

                 Figure 17. The performance of the winning tickets of the Lenet architecture for MNIST when the
                 layers are randomly reinitialized from the distribution of initializations contained in the winning
                 ticket of the corresponding size.

                                                  <<FIGURE>>

                 Figure 18. The performance of the winning tickets of the Lenet architecture for MNIST when
                 magnitude pruning is performed before the network is ever trained. The network is subsequently
                 trained with adam.

                                             <<FIGURE>>

                 Figure 19. Between the ﬁrst and last training iteration of the unpruned network, the magnitude by
                 which weights in the network change. The blue line shows the distribution of magnitudes for weights
                 that are not in the eventual winning ticket; the orange line shows the distribution of magnitudes for
                 weights that are in the eventual winning ticket.


                 pruning and randomly reinitialized. We attempted the same experiment with the SGD-trained winning
                 tickets and found similar results.

                  F.5 COMPARING INITIAL AND FINAL WEIGHTS IN WINNING TICKETS

                 In this subsection, we consider winning tickets in the context of the larger optimization process. To
                 do so, we examine the initial and ﬁnal weights of the unpruned network from which a winning ticket
                 derives to determine whether weights that will eventually comprise a winning ticket exhibit properties
                 that distinguish them from the rest of the network.
                 We consider the magnitude of the difference between initial and ﬁnal weights. One possible rationale
                 for the success of winning tickets is that they already happen to be close to the optimum that gradient
                 descent eventually ﬁnds, meaning that winning ticket weights should change by a smaller amount
                 than the rest of the network. Another possible rationale is that winning tickets are well placed in the
                 optimization landscape for gradient descent to optimize productively, meaning that winning ticket
                 weights should change by a larger amount than the rest of the network. Figure 19 shows that winning
                 ticket weights tend to change by a larger amount than weights in the rest of the network, evidence
                 that does not support the rationale that winning tickets are already close to the optimum.
                 It is notable that such a distinction exists between the two distributions. One possible explanation for
                 this distinction is that the notion of a winning ticket may indeed be a natural part of neural network
                 optimization. Another is that magnitude-pruning biases the winning tickets we ﬁnd toward those
                 containing weights that change in the direction of higher magnitude. Regardless, it offers hope that
                 winning tickets may be discernible earlier in the training process (or after a single training run),
                 meaning that there may be more efﬁcient methods for ﬁnding winning tickets than iterative pruning.
                 Figure 20 shows the directions of these changes. It plots the difference between the magnitude of the
                 ﬁnal weight and the magnitude of the initial weight, i.e., whether the weight moved toward or away

                                                  <<FIGURE>>

                 Figure 20. Between the ﬁrst and last training iteration of the unpruned network, the magnitude by
                 which weights move away from 0. The blue line shows the distribution of magnitudes for weights
                 that are not in the eventual winning ticket; the orange line shows the distribution of magnitudes for
                 weights that are in the eventual winning ticket.


                                                                  <<FIGURE>>

                 Figure 21. The fraction of incoming connections that survive the pruning process for each node in
                 each layer of the Lenet architecture for MNIST as trained with adam.


                 from 0. In general, winning ticket weights are more likely to increase in magnitude (that is, move
                 away from 0) than are weights that do not participate in the eventual winning ticket.
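The two quantities compared in this subsection can be written down directly. The sketch below assumes flat lists of initial and final weights from the unpruned network plus a 0/1 mask marking the eventual winning ticket; the function names and sample numbers are hypothetical.

```python
def change_magnitude(initial, final):
    """|final - initial| per weight, the quantity described for Figure 19."""
    return [abs(f - i) for i, f in zip(initial, final)]


def movement_from_zero(initial, final):
    """|final| - |initial| per weight, the quantity described for Figure 20.

    Positive values mean the weight moved away from 0.
    """
    return [abs(f) - abs(i) for i, f in zip(initial, final)]


def split_by_mask(values, mask):
    """Separate per-weight values into (in ticket, not in ticket)."""
    in_ticket = [v for v, m in zip(values, mask) if m == 1]
    not_in_ticket = [v for v, m in zip(values, mask) if m == 0]
    return in_ticket, not_in_ticket


# Hypothetical weights; the mask marks the eventual winning ticket.
initial = [0.1, -0.2, 0.05, 0.3]
final = [0.6, -0.25, 0.01, -0.1]
mask = [1, 0, 0, 1]
ticket_change, other_change = split_by_mask(change_magnitude(initial, final), mask)
```

Note that a weight can change by a large amount and still move toward 0 (the last weight above), which is why the two quantities are plotted separately.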

                  F.6 WINNING TICKET CONNECTIVITY

                 In this Subsection, we study the connectivity of winning tickets. Do some hidden units retain a
                 large number of incoming connections while others fade away, or does the network retain relatively
                 even sparsity among all units as it is pruned? We ﬁnd the latter to be the case when examining the
                 incoming connectivity of network units: for both adam and SGD, each unit retains a number of
                 incoming connections approximately in proportion to the amount by which the overall layer has
                 been pruned. Figures 21 and 22 show the fraction of incoming connections that survive the pruning
                 process for each node in each layer. Recall that we prune the output layer at half the rate of the rest of
                 the network, which explains why it has more connectivity than the other layers of the network.

                                 <<FIGURE>>

                        Figure 22. Same as Figure 21 where the network is trained with SGD at rate 0.8.

                                                  <<FIGURE>>

                 Figure 23. The fraction of outgoing connections that survive the pruning process for each node in
                 each layer of the Lenet architecture for MNIST as trained with adam. The blue, orange, and green
                 lines are the outgoing connections from the input layer, ﬁrst hidden layer, and second hidden layer,
                 respectively.

                                <<FIGURE>>

                        Figure 24. Same as Figure 23 where the network is trained with SGD at rate 0.8.


                 However, this is not the case for the outgoing connections. To the contrary, for the adam-trained
                 networks, certain units retain far more outgoing connections than others (Figure 23). The distributions
                 are far less smooth than those for the incoming connections, suggesting that certain features are far
                 more useful to the network than others. This is not unexpected for a fully-connected network on a
                 task like MNIST, particularly for the input layer. MNIST images contain centered digits, so the pixels
                 around the edges are not likely to be informative for the network. Indeed, the input layer has two
                 peaks, one larger peak for input units with a high number of outgoing connections and one smaller
                 peak for input units with a low number of outgoing connections. Interestingly, the adam-trained
                 winning tickets develop a much more uneven distribution of outgoing connectivity for the input layer
                 than does the SGD-trained network (Figure 24).
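The per-unit connectivity fractions discussed in this subsection can be computed from a layer's mask. The sketch below assumes the mask is a list of rows in which `mask[i][j]` is 1 when the connection from unit i of the previous layer to unit j of the next layer survives; this layout and the function names are assumptions made for illustration.

```python
def incoming_fractions(mask):
    """Fraction of surviving incoming connections for each destination unit."""
    n_in = len(mask)
    n_out = len(mask[0])
    return [sum(mask[i][j] for i in range(n_in)) / n_in for j in range(n_out)]


def outgoing_fractions(mask):
    """Fraction of surviving outgoing connections for each source unit."""
    return [sum(row) / len(row) for row in mask]


# Hypothetical 4-by-2 layer: both destination units keep half of their
# inputs, but the source units differ widely in how many outputs they keep.
mask = [
    [1, 1],
    [1, 0],
    [0, 0],
    [0, 1],
]
incoming = incoming_fractions(mask)   # [0.5, 0.5]
outgoing = outgoing_fractions(mask)   # [1.0, 0.5, 0.0, 0.5]
```

The toy mask mirrors the pattern reported above: incoming connectivity is even across units while outgoing connectivity is not.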


                  F.7 ADDING NOISE TO WINNING TICKETS

                 In this Subsection, we explore the extent to which winning tickets are robust to Gaussian noise added
                 to their initializations. In the main body of the paper, we ﬁnd that randomly reinitializing a winning
                 ticket substantially slows its learning and reduces its eventual test accuracy. In this Subsection,
                 we study a less extreme way of perturbing a winning ticket. Figure 25 shows the effect of adding
                 Gaussian noise to the winning ticket initializations. The standard deviation of the noise distribution
                 of each layer is a multiple of the standard deviation σ of the layer’s initialization; Figure 25 shows noise
                 distributions with standard deviations 0.5σ, σ, 2σ, and 3σ. Adding Gaussian noise reduces the test
                 accuracy of a winning ticket and slows its ability to learn, again demonstrating the importance of
                 the original initialization. As more noise is added, accuracy decreases. However, winning tickets
                 are surprisingly robust to noise. Adding noise of 0.5σ barely changes winning ticket accuracy. Even
                 after adding noise of 3σ, the winning tickets continue to outperform the random reinitialization
                 experiment.
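The perturbation studied here can be sketched as follows. The code assumes that noise is zero-mean Gaussian, that its standard deviation is the chosen multiple times the standard deviation of the layer's initialization distribution, and that it is added only to surviving weights. The last point and the name `add_scaled_noise` are assumptions, not details given in the text.

```python
import random


def add_scaled_noise(theta0, mask, layer_init_std, multiple, rng):
    """Perturb surviving initial weights with Gaussian noise.

    The noise standard deviation is multiple * layer_init_std.
    Pruned positions stay at 0.
    """
    noise_std = multiple * layer_init_std
    return [
        (w + rng.gauss(0.0, noise_std)) if m == 1 else 0.0
        for w, m in zip(theta0, mask)
    ]


rng = random.Random(0)
theta0 = [0.30, -0.20, 0.05, -0.45]
mask = [1, 0, 0, 1]
# The four noise levels named in the caption of Figure 25.
noisy = {k: add_scaled_noise(theta0, mask, 0.07, k, rng) for k in (0.5, 1, 2, 3)}
```

A multiple of 0 recovers the unperturbed winning ticket, so this perturbation interpolates between the original initialization and something closer to a random reinitialization.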

                                                  <<FIGURE>>

                 Figure 25. The performance of the winning tickets of the Lenet architecture for MNIST when
                 Gaussian noise is added to the initializations. The standard deviations of the noise distributions for
                 each layer are a multiple of the standard deviations of the initialization distributions; in this Figure,
                 we consider multiples 0.5, 1, 2, and 3.


                  G HYPERPARAMETER EXPLORATION FOR FULLY-CONNECTED NETWORKS

                 This Appendix accompanies Section 2 of the main paper. It explores the space of hyperparameters
                 for the Lenet architecture evaluated in Section 2 with two purposes in mind.

                     1. To explain the hyperparameters selected in the main body of the paper.
                     2. To evaluate the extent to which the lottery ticket experiment patterns extend to other choices
                        of hyperparameters.

                  G.1 EXPERIMENTAL METHODOLOGY

                 This Section considers the fully-connected Lenet architecture (LeCun et al., 1998), which comprises
                 two fully-connected hidden layers and a ten unit output layer, on the MNIST dataset. Unless otherwise
                 stated, the hidden layers have 300 and 100 units each.
                 The MNIST dataset consists of 60,000 training examples and 10,000 test examples. We randomly
                 sampled a 5,000-example validation set from the training set and used the remaining 55,000 training
                 examples as our training set for the rest of the paper (including Section 2). The hyperparameter
                 selection experiments throughout this Appendix are evaluated using the validation set for determining
                 both the iteration of early-stopping and the accuracy at early-stopping; the networks in the main body
                 of this paper (which make use of these hyperparameters) have their accuracy evaluated on the test set.
                 The training set is presented to the network in mini-batches of 60 examples; at each epoch, the entire
                 training set is shufﬂed.
                 Unless otherwise noted, each line in each graph comprises data from three separate experiments. The
                 line itself traces the average performance of the experiments and the error bars indicate the minimum
                 and maximum performance of any one experiment.
                 Throughout this Appendix, we perform the lottery ticket experiment iteratively with a pruning rate of
                 20% per iteration (10% for the output layer); we justify the choice of this pruning rate later in this
                 Appendix. Each layer of the network is pruned independently. On each iteration of the lottery ticket
                 experiment, the network is trained for 50,000 training iterations regardless of when early-stopping
                 occurs; in other words, no validation or test data is taken into account during the training process, and
                 early-stopping times are determined retroactively by examining validation performance. We evaluate
                 validation and test performance every 100 iterations.
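The iterative procedure described in this paragraph (train, prune each layer independently, reset the survivors to their original initializations, repeat) can be sketched as below. This is a schematic under stated assumptions: layers are flat lists keyed by name, `train` is a caller-supplied hypothetical callback standing in for 50,000 iterations of training, and the number of weights pruned per round is rounded with Python's `round`. None of these details come from the paper.

```python
def prune_lowest_magnitude(weights, mask, rate):
    """Prune `rate` of the still-unpruned weights with the smallest |w|."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    num_to_prune = round(rate * len(alive))
    alive.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in alive[:num_to_prune]:
        new_mask[i] = 0
    return new_mask


def iterative_lottery_ticket(theta0, train, rates, rounds):
    """Return the per-layer masks after `rounds` train-and-prune rounds.

    theta0 and rates are dicts keyed by layer name.
    train(weights, masks) must return a dict of trained weights.
    """
    masks = {name: [1] * len(w) for name, w in theta0.items()}
    for _ in range(rounds):
        # Reset surviving weights to their original initializations.
        start = {n: [w * m for w, m in zip(theta0[n], masks[n])] for n in theta0}
        trained = train(start, masks)
        # Prune each layer independently at its own rate.
        masks = {
            n: prune_lowest_magnitude(trained[n], masks[n], rates[n])
            for n in theta0
        }
    return masks
```

For example, passing `rates={"hidden": 0.2, "output": 0.1}` mirrors the 20% and 10% per-round rates stated above. Early-stopping iterations would be read off afterwards from the recorded validation curves rather than inside this loop.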
                 For the main body of the paper, we opt to use the Adam optimizer (Kingma & Ba, 2014) and Gaussian
                 Glorot initialization (Glorot & Bengio, 2010). Although we can achieve more impressive results on
                 the lottery ticket experiment with other hyperparameters, we intend these choices to be as generic
                 as possible in an effort to minimize the extent to which our main results depend on hand-chosen
                 hyperparameters. In this Appendix, we select the learning rate for Adam that we use in the main body
                 of the paper.
                 In addition, we consider a wide range of other hyperparameters, including other optimization
                 algorithms (SGD with and without momentum), initialization strategies (Gaussian distributions

                                                  <<FIGURE>>

                 Figure 26. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                 ticket experiment on the Lenet architecture trained with MNIST using the Adam optimizer at various
                 learning rates. Each line represents a different learning rate.

                               <<FIGURE>>

                 Figure 27. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                 ticket experiment on the Lenet architecture trained with MNIST using stochastic gradient descent at
                 various learning rates.


                 with various standard deviations), network sizes (larger and smaller hidden layers), and pruning
                 strategies (faster and slower pruning rates). In each experiment, we vary the chosen hyperparameter
                 while keeping all others at their default values (Adam with the chosen learning rate, Gaussian Glorot
                 initialization, hidden layers with 300 and 100 units). The data presented in this appendix was collected
                 by training variations of the Lenet architecture more than 3,000 times.

                  G.2 LEARNING RATE

                 In this Subsection, we perform the lottery ticket experiment on the Lenet architecture as optimized
                 with Adam, SGD, and SGD with momentum at various learning rates.
                 Here, we select the learning rate that we use for Adam in the main body of the paper. Our criteria for
                 selecting the learning rate are as follows.

                     1. On the unpruned network, it should minimize training iterations necessary to reach
                        early-stopping and maximize validation accuracy at that iteration. That is, it should be a reasonable
                        hyperparameter for optimizing the unpruned network even if we are not running the lottery
                        ticket experiment.
                     2. When running the iterative lottery ticket experiment, it should make it possible to match
                        the early-stopping iteration and accuracy of the original network with as few parameters as
                        possible.
                     3. Of those options that meet (1) and (2), it should be on the conservative (slow) side so that it is
                        more likely to productively optimize heavily pruned networks under a variety of conditions
                        with a variety of hyperparameters.

                                                  <<FIGURE>>

                 Figure 28. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                 ticket experiment on the Lenet architecture trained with MNIST using stochastic gradient descent
                 with momentum (0.9) at various learning rates.


                 Figure 26 shows the early-stopping iteration and validation accuracy at that iteration of performing
                 the iterative lottery ticket experiment with the Lenet architecture optimized with Adam at various
                 learning rates. According to the graph on the right of Figure 26, several learning rates between 0.0002
                 and 0.002 achieve similar levels of validation accuracy on the original network and maintain that
                 performance to similar levels as the network is pruned. Of those learning rates, 0.0012 and 0.002
                 produce the fastest early-stopping times and maintain them to the smallest network sizes. We choose
                 0.0012 due to its higher validation accuracy on the unpruned network and in consideration of criterion
                 (3) above.
                 We note that, across all of these learning rates, the lottery ticket pattern (in which learning becomes
                 faster and validation accuracy increases with iterative pruning) remains present. Even those
                 learning rates that did not satisfy the early-stopping criterion within 50,000 iterations (2.5e-05 and
                 0.0064) still showed accuracy improvements with pruning.

                  G.3 OTHER OPTIMIZATION ALGORITHMS

                  G.3.1 SGD
                 Here, we explore the behavior of the lottery ticket experiment when the network is optimized with
                 stochastic gradient descent (SGD) at various learning rates. The results of doing so appear in Figure
                 27. The lottery ticket pattern appears across all learning rates, including those that fail to satisfy the
                 early-stopping criterion within 50,000 iterations. SGD learning rates 0.4 and 0.8 reach early-stopping
                 in a similar number of iterations as the best Adam learning rates (0.0012 and 0.002) but maintain
                 this performance when the network has been pruned further (to less than 1% of its original size for
                 SGD vs. about 3.6% of the original size for Adam). Likewise, on pruned networks, these SGD
                 learning rates achieve equivalent accuracy to the best Adam learning rates, and they maintain that
                 high accuracy when the network is pruned as much as the Adam learning rates.

                  G.3.2 MOMENTUM
                 Here, we explore the behavior of the lottery ticket experiment when the network is optimized with
                 SGD with momentum (0.9) at various learning rates. The results of doing so appear in Figure 28.
                 Once again, the lottery ticket pattern appears across all learning rates, with learning rates between
                 0.025 and 0.1 maintaining high validation accuracy and faster learning for the longest number of
                 pruning iterations. Learning rate 0.025 achieves the highest validation accuracy on the unpruned
                 network; however, its validation accuracy never increases as it is pruned, instead decreasing gradually,
                 and higher learning rates reach early-stopping faster.

                  G.4 ITERATIVE PRUNING RATE

                 When running the iterative lottery ticket experiment on Lenet, we prune each layer of the network
                 separately at a particular rate. That is, after training the network, we prune k% of the weights in

                                                  <<FIGURE>>

                 Figure 29. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                 ticket experiment when pruned at different rates. Each line represents a different pruning rate: the
                 percentage of lowest-magnitude weights that are pruned from each layer after each training iteration.

                                     <<FIGURE>>

                 Figure 30. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                 ticket experiment initialized with Gaussian distributions with various standard deviations. Each line
                 is a different standard deviation for a Gaussian distribution centered at 0.


                 each layer (k/2% of the weights in the output layer) before resetting the weights to their original
                 initializations and training again. In the main body of the paper, we find that iterative pruning finds
                 smaller winning tickets than one-shot pruning, indicating that pruning too much of the network at
                 once diminishes performance. Here, we explore different values of k.
                 Figure 29 shows the effect of the amount of the network pruned on each pruning iteration on early-
                 stopping time and validation accuracy. There is a tangible difference in learning speed and validation
                 accuracy at early-stopping between the lowest pruning rates (0.1 and 0.2) and higher pruning rates (0.4
                 and above). The lowest pruning rates reach higher validation accuracy and maintain that validation
                 accuracy to smaller network sizes; they also maintain fast early-stopping times to smaller network
                 sizes. For the experiments throughout the main body of the paper and this Appendix, we use a
                 pruning rate of 0.2, which maintains much of the accuracy and learning speed of 0.1 while reducing
                 the number of training iterations necessary to get to smaller network sizes.
                 In all of the Lenet experiments, we prune the output layer at half the rate of the rest of the network.
                 Since the output layer is so small (1,000 weights out of 266,000 for the overall Lenet architecture),
                 we found that pruning it reaches a point of diminishing returns much earlier than the other layers.
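The weight counts and pruning rates quoted above can be checked with a short calculation. The sketch assumes 784 input pixels (28 by 28 MNIST images), counts weights only and not biases, and ignores rounding to whole weights in each round; the function names are hypothetical.

```python
def lenet_layer_sizes(inputs=784, hidden1=300, hidden2=100, outputs=10):
    """Weight counts per layer of the fully-connected Lenet (no biases)."""
    return {
        "hidden1": inputs * hidden1,
        "hidden2": hidden1 * hidden2,
        "output": hidden2 * outputs,
    }


def fraction_remaining(sizes, rates, rounds):
    """Idealized overall fraction of weights left after `rounds` rounds.

    Each layer keeps (1 - rate) of its remaining weights per round.
    """
    total = sum(sizes.values())
    left = sum(size * (1.0 - rates[name]) ** rounds for name, size in sizes.items())
    return left / total


sizes = lenet_layer_sizes()          # output layer: 1,000 of 266,200 weights
rates = {"hidden1": 0.2, "hidden2": 0.2, "output": 0.1}
after_15 = fraction_remaining(sizes, rates, 15)
```

Under these assumptions, fifteen rounds at k = 20% (10% for the output layer) leave roughly 3.6% of the weights. Because the output layer holds so few weights, pruning it at half the rate barely changes the overall fraction.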

                  G.5 INITIALIZATION DISTRIBUTION

                 To this point, we have considered only a Gaussian Glorot (Glorot & Bengio, 2010) initialization
                 scheme for the network. Figure 30 performs the lottery ticket experiment while initializing the Lenet
                 architecture from Gaussian distributions with a variety of standard deviations. The networks were
                 optimized with Adam at the learning rate chosen earlier. The lottery ticket pattern continues to appear
                 across all standard deviations. When initialized from a Gaussian distribution with standard deviation
                 0.1, the Lenet architecture maintained high validation accuracy and low early-stopping times for the
                 longest, approximately matching the performance of the Glorot-initialized network.
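For reference, the two kinds of initialization compared here can be sketched as below. The code assumes the standard Gaussian Glorot formula, in which the standard deviation is sqrt(2 / (fan_in + fan_out)), and the same 784-300-100-10 layer shapes as above (an assumption). The function names are hypothetical.

```python
import numpy as np

def glorot_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation of the Gaussian Glorot scheme: sqrt(2 / (fan_in + fan_out))."""
    return float(np.sqrt(2.0 / (fan_in + fan_out)))

def init_gaussian(fan_in, fan_out, rng, std=None):
    """Sample a weight matrix from N(0, std^2); std=None selects the Glorot value."""
    if std is None:
        std = glorot_std(fan_in, fan_out)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
layers = [(784, 300), (300, 100), (100, 10)]  # assumed Lenet layer shapes
for fan_in, fan_out in layers:
    glorot = init_gaussian(fan_in, fan_out, rng)
    fixed = init_gaussian(fan_in, fan_out, rng, std=0.1)
    print(f"{fan_in}x{fan_out}: Glorot std {glorot_std(fan_in, fan_out):.4f}, "
          f"sample std {glorot.std():.4f}, fixed-0.1 sample std {fixed.std():.4f}")
```

With these shapes, the per-layer Glorot standard deviations fall between roughly 0.04 and 0.13, the same order of magnitude as the fixed value of 0.1.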

                  G.6 NETWORK SIZE

                                                    <<FIGURE>>

                 Figure 31. The early-stopping iteration and validation accuracy at that iteration of the iterative
                 lottery ticket experiment on the Lenet architecture with various layer sizes. The label for each line
                 is the size of the ﬁrst and second hidden layers of the network. All networks had Gaussian Glorot
                 initialization and were optimized with Adam (learning rate 0.0012). Note that the x-axis of this plot
                 charts the number of weights remaining, while all other graphs in this section have charted the percent
                 of weights remaining.

                 Throughout this section, we have considered the Lenet architecture with 300 units in the ﬁrst hidden
                 layer and 100 units in the second hidden layer. Figure 31 shows the early-stopping iterations and
                 validation accuracy at that iteration of the Lenet architecture with several other layer sizes. All
                 networks we tested maintain the 3:1 ratio between units in the first hidden layer and units in the
                 second hidden layer.
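Because Figure 31 plots the absolute number of weights, it helps to see how the weight count scales with the layer sizes. The sketch below counts connection weights only (no biases) and assumes 784 inputs and 10 outputs. Apart from 300/100, the layer sizes shown are illustrative 3:1 examples and are not necessarily the sizes plotted in the figure.

```python
def lenet_weight_count(hidden1: int, hidden2: int, n_inputs: int = 784, n_outputs: int = 10) -> int:
    """Connection weights (biases excluded) in a fully-connected network with two hidden layers."""
    return n_inputs * hidden1 + hidden1 * hidden2 + hidden2 * n_outputs

# 300/100 is the configuration used throughout this appendix; the others are hypothetical 3:1 sizes.
for hidden1, hidden2 in [(150, 50), (300, 100), (600, 200)]:
    print(f"{hidden1}-{hidden2}: {lenet_weight_count(hidden1, hidden2):,} weights")
```

For the 300/100 network this gives 266,200 weights, of which 100 x 10 = 1,000 are in the output layer, consistent with the counts quoted in Section G.4.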
                 The lottery ticket hypothesis naturally invites a collection of questions related to network size.
                 Generalizing, those questions tend to take the following form: according to the lottery ticket hypothesis, do
                 larger networks, which contain more subnetworks, ﬁnd “better” winning tickets? In line with the
                 generality of this question, there are several different answers.
                 If we evaluate a winning ticket by the accuracy it achieves, then larger networks do ﬁnd better
                 winning tickets. The right graph in Figure 31 shows that, for any particular number of weights (that
                 is, any particular point on the x-axis), winning tickets derived from initially larger networks reach
                 higher accuracy. Put another way, in terms of accuracy, the lines are approximately arranged from
                 bottom to top in increasing order of network size. It is possible that, since larger networks have
                 more subnetworks, gradient descent found a better winning ticket. Alternatively, the initially larger
                 networks have more units even when pruned to the same number of weights as smaller networks,
                 meaning they are able to contain sparse subnetwork conﬁgurations that cannot be expressed by
                 initially smaller networks.
                 If we evaluate a winning ticket by the time necessary for it to reach early-stopping, then larger
                 networks have less of an advantage. The left graph in Figure 31 shows that, in general, early-stopping
                 iterations do not vary greatly between networks of different initial sizes that have been pruned to the
                 same number of weights. Upon exceedingly close inspection, winning tickets derived from initially
                 larger networks tend to learn marginally faster than winning tickets derived from initially smaller
                 networks, but these differences are slight.
                 If we evaluate a winning ticket by the size at which it returns to the same accuracy as the original
                 network, the large networks do not have an advantage. Regardless of the initial network size, the
                 right graph in Figure 31 shows that winning tickets return to the accuracy of the original network
                 when they are pruned to between about 9,000 and 15,000 weights.

                  H HYPERPARAMETER EXPLORATION FOR CONVOLUTIONAL NETWORKS

                 This Appendix accompanies Section 3 of the main paper. It explores the space of optimization
                 algorithms and hyperparameters for the Conv-2, Conv-4, and Conv-6 architectures evaluated in
                 Section 3 with the same two purposes as Appendix G: explaining the hyperparameters used in the main
                 body of the paper and evaluating the lottery ticket experiment on other choices of hyperparameters.

                  H.1 EXPERIMENTAL METHODOLOGY

                 The Conv-2, Conv-4, and Conv-6 architectures are variants of the VGG (Simonyan & Zisserman,
                 2014) network architecture scaled down for the CIFAR10 (Krizhevsky & Hinton, 2009) dataset. Like
                 VGG, the networks consist of a series of modules. Each module has two layers of 3x3 convolutional
                 ﬁlters followed by a maxpool layer with stride 2. After all of the modules are two fully-connected
                 layers of size 256 followed by an output layer of size 10; in VGG, the fully-connected layers are of
                 size 4096 and the output layer is of size 1000. Like VGG, the ﬁrst module has 64 convolutions in
                 each layer, the second has 128, the third has 256, etc. The Conv-2, Conv-4, and Conv-6 architectures
                 have 1, 2, and 3 modules, respectively.
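The module structure described above determines the number of convolutional weights in each architecture. The following sketch counts only convolutional filter weights, assuming 3x3 kernels, three input channels, a width that doubles with each module, and no bias terms. The function name is hypothetical.

```python
def conv_weight_count(n_modules: int, in_channels: int = 3, kernel: int = 3, first_width: int = 64) -> int:
    """Convolutional weights (biases excluded) for `n_modules` modules of two conv layers each."""
    total = 0
    channels_in = in_channels
    for module in range(n_modules):
        width = first_width * 2 ** module  # 64, 128, 256, ...
        for _ in range(2):                 # two convolutional layers per module
            total += kernel * kernel * channels_in * width
            channels_in = width
    return total

for name, n_modules in [("Conv-2", 1), ("Conv-4", 2), ("Conv-6", 3)]:
    print(f"{name}: {conv_weight_count(n_modules):,} convolutional weights")
```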
                 The CIFAR10 dataset consists of 50,000 32x32 color (three-channel) training examples and 10,000
                 test examples. We randomly sampled a 5,000-example validation set from the training set and used the
                 remaining 45,000 training examples as our training set for the rest of the paper. The hyperparameter
                 selection experiments throughout this Appendix are evaluated on the validation set, and the examples
                 in the main body of this paper (which make use of these hyperparameters) are evaluated on the test set.
                 The training set is presented to the network in mini-batches of 60 examples; at each epoch, the entire
                 training set is shufﬂed.
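The data handling described above can be sketched on example indices alone. This is an illustrative NumPy version with hypothetical helper names; the random seed is arbitrary.

```python
import numpy as np

def split_train_validation(n_examples: int, n_validation: int, rng):
    """Randomly hold out `n_validation` example indices; return (train_indices, validation_indices)."""
    order = rng.permutation(n_examples)
    return order[n_validation:], order[:n_validation]

def epoch_batches(train_indices, batch_size: int, rng):
    """Reshuffle the training indices and yield them in mini-batches (one epoch)."""
    shuffled = rng.permutation(train_indices)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

rng = np.random.default_rng(0)
train_idx, val_idx = split_train_validation(50_000, 5_000, rng)
batches = list(epoch_batches(train_idx, 60, rng))
print(len(train_idx), len(val_idx), len(batches))
```

With 45,000 training examples and mini-batches of 60, one epoch corresponds to 750 training iterations.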
                 The Conv-2, Conv-4, and Conv-6 networks are initialized with Gaussian Glorot initialization (Glorot
                 & Bengio, 2010) and are trained for the number of iterations speciﬁed in Figure 2. The number
                 of training iterations was selected such that heavily-pruned networks could still train in the time
                 provided. On dropout experiments, the number of training iterations is tripled to provide enough time
                 for the dropout-regularized networks to train. We optimize these networks with Adam, and select the
                 learning rate for each network in this Appendix.
                 As with the MNIST experiments, validation and test performance is only considered retroactively
                 and has no effect on the progression of the lottery ticket experiments. We measure validation and test
                 loss and accuracy every 100 training iterations.
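The early-stopping iteration reported in the figures can be recovered retroactively from the logged measurements. The sketch below assumes that early stopping is defined as the iteration of minimum validation loss, a criterion that is not restated in this Appendix, and that the first measurement is taken at iteration 100. The function name and the example values are hypothetical.

```python
import numpy as np

def early_stopping_point(val_losses, val_accuracies, eval_every: int = 100):
    """Return (iteration, accuracy) at the logged measurement with the lowest validation loss."""
    losses = np.asarray(val_losses, dtype=float)
    best = int(np.argmin(losses))
    # Measurement k (0-based) is assumed to be taken at iteration (k + 1) * eval_every.
    return (best + 1) * eval_every, float(val_accuracies[best])

# Made-up log of five measurements, one every 100 iterations.
losses = [0.90, 0.60, 0.50, 0.55, 0.70]
accuracies = [0.70, 0.78, 0.81, 0.80, 0.79]
print(early_stopping_point(losses, accuracies))
```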
                 Each line in each graph of this section represents the average of three separate experiments, with
                 error bars indicating the minimum and maximum value that any experiment took on at that point.
                 (Experiments in the main body of the paper are conducted ﬁve times.)
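The aggregation behind each plotted line can be sketched as follows: the line traces the mean across trials, and the error bars span the minimum and maximum. The function name and the accuracy values are made up for illustration.

```python
import numpy as np

def summarize_trials(trials):
    """Given an array of shape (n_trials, n_pruning_levels), return (mean, minimum, maximum) per level."""
    data = np.asarray(trials, dtype=float)
    return data.mean(axis=0), data.min(axis=0), data.max(axis=0)

# Three hypothetical trials measured at three pruning levels.
trials = [
    [0.70, 0.72, 0.69],
    [0.68, 0.74, 0.70],
    [0.72, 0.73, 0.71],
]
mean, low, high = summarize_trials(trials)
for level, (m, lo, hi) in enumerate(zip(mean, low, high)):
    print(f"level {level}: mean={m:.3f}, error bar=[{lo:.2f}, {hi:.2f}]")
```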
                 We allow convolutional layers and fully-connected layers to be pruned at different rates; we select
                 those rates for each network in this Appendix. The output layer is pruned at half of the rate of the
                 fully-connected layers for the reasons described in Appendix G.

                  H.2 LEARNING RATE

                 In this Subsection, we perform the lottery ticket experiment on the Conv-2, Conv-4, and Conv-6
                 architectures as optimized with Adam at various learning rates.

                                                  <<FIGURE>>

                         Figure 32. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                         ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained
                         using the Adam optimizer at various learning rates. Each line represents a different learning rate.


                 Here, we select the learning rate that we use for Adam in the main body of the paper. Our criteria
                 for selecting the learning rate are the same as in Appendix G: minimizing training iterations and
                 maximizing accuracy at early-stopping, ﬁnding winning tickets containing as few parameters as
                 possible, and remaining conservative enough to apply to a range of other experiments.
                 Figure 32 shows the results of performing the iterative lottery ticket experiment on the Conv-2 (top),
                 Conv-4 (middle), and Conv-6 (bottom) architectures. Since we have not yet selected the pruning rates
                 for each network, we temporarily pruned fully-connected layers at 20% per iteration, convolutional
                 layers at 10% per iteration, and the output layer at 10% per iteration; we explore this part of the
                 hyperparameter space in a later subsection.
                 For Conv-2, we select a learning rate of 0.0002, which has the highest initial validation accuracy,
                 maintains both high validation accuracy and low early-stopping times for among the longest,
                 and reaches the fastest early-stopping times. This learning rate also leads to a 3.3 percentage point
                 improvement in validation accuracy when the network is pruned to 3% of its original size. Other
                 learning rates, such as 0.0004, have lower initial validation accuracy (65.2% vs. 67.6%) but eventually
                 reach higher absolute levels of validation accuracy (71.7%, a 6.5 percentage point increase, vs. 70.9%,
                 a 3.3 percentage point increase). However, learning rate 0.0002 shows the highest proportional
                 decrease in early-stopping times: 4.8x (when pruned to 8.8% of the original network size).
                 For Conv-4, we select learning rate 0.0003, which has among the highest initial validation accuracy,
                 maintains high validation accuracy and fast early-stopping times when pruned by among the most,
                 and balances improvements in validation accuracy (3.7 percentage point improvement to 78.6%
                 when 5.4% of weights remain) and improvements in early-stopping time (4.27x when 11.1% of
                 weights remain). Other learning rates reach higher validation accuracy (0.0004—3.6 percentage point
                 improvement to 79.1% accuracy when 5.4% of weights remain) or show better improvements in
                 early-stopping times (0.0002—5.1x faster when 9.2% of weights remain) but not both.
                 For Conv-6, we also select learning rate 0.0003 for similar reasons to those provided for Conv-4.
                 Validation accuracy improves by 2.4 percentage points to 81.5% when 9.31% of weights remain
                 and early-stopping times improve by 2.61x when pruned to 11.9%. Learning rate 0.0004 reaches
                 high ﬁnal validation accuracy (81.9%, an increase of 2.7 percentage points, when 15.2% of weights
                 remain) but with smaller improvements in early-stopping times, and learning rate 0.0002 shows
                 greater improvements in early-stopping times (6.26x when 19.7% of weights remain) but reaches
                 lower overall validation accuracy.
                 We note that, across nearly all combinations of learning rates, the lottery ticket pattern (where
                 early-stopping times were maintained or decreased and validation accuracy was maintained or increased
                 during the course of the lottery ticket experiment) continues to hold. This pattern fails to hold at
                 the very highest learning rates: early-stopping times decreased only briefly (in the case of Conv-2 or
                 Conv-4) or not at all (in the case of Conv-6), and accuracy increased only briefly (in the case of all
                 three networks). This pattern is similar to that which we observe in Section 4: at the highest learning
                 rates, our iterative pruning algorithm fails to ﬁnd winning tickets.

                  H.3 OTHER OPTIMIZATION ALGORITHMS

                  H.3.1 SGD
                 Here, we explore the behavior of the lottery ticket experiment when the Conv-2, Conv-4, and Conv-6
                 networks are optimized with stochastic gradient descent (SGD) at various learning rates. The results
                 of doing so appear in Figure 33. In general, these networks—particularly Conv-2 and Conv-4—
                 proved challenging to train with SGD and Glorot initialization. As Figure 33 reﬂects, we could not
                 ﬁnd SGD learning rates for which the unpruned networks matched the validation accuracy of the
                 same networks when trained with Adam; at best, the SGD-trained unpruned networks were typically
                 2-3 percentage points less accurate. At higher learning rates than those in Figure 33, gradients tended
                 to explode when training the unpruned network; at lower learning rates, the networks often failed to
                 learn at all.
                 At all of the learning rates depicted, we found winning tickets. In all cases, early-stopping times
                 initially decreased with pruning before eventually increasing again, just as in other lottery ticket
                 experiments. The Conv-6 network also exhibited the same accuracy patterns as other experiments,
                 with validation accuracy initially increasing with pruning before eventually decreasing again.

                                                  <<FIGURE>>

                         Figure 33. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                         ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained
                         using SGD at various learning rates. Each line represents a different learning rate. The legend for
                         each pair of graphs is above the graphs.

                 However, the Conv-2 and Conv-4 architectures exhibited a different validation accuracy pattern
                 from other experiments in this paper. Accuracy initially declined with pruning before rising as
                 the network was further pruned; it eventually matched or surpassed the accuracy of the unpruned
                 network. When they eventually did surpass the accuracy of the original network, the pruned networks
                 reached early-stopping in about the same or fewer iterations than the original network, constituting
                 a winning ticket by our deﬁnition. Interestingly, this pattern also appeared for Conv-6 networks at
                 slower SGD learning rates, suggesting that faster learning rates for Conv-2 and Conv-4 than those in
                 Figure 33 might cause the usual lottery ticket accuracy pattern to reemerge. Unfortunately, at these
                 higher learning rates, gradients exploded on the unpruned networks, preventing us from running these
                 experiments.

                  H.3.2 MOMENTUM

                 Here, we explore the behavior of the lottery ticket experiment when the network is optimized with
                 SGD with momentum (0.9) at various learning rates. The results of doing so appear in Figure 34.
                 In general, the lottery ticket pattern continues to apply, with early-stopping times decreasing and
                 accuracy increasing as the networks are pruned. However, there were two exceptions to this pattern.

                     1. At the very lowest learning rates (e.g., learning rate 0.001 for Conv-4 and all but the highest
                       learning rate for Conv-2), accuracy initially decreased before increasing to higher levels
                       than reached by the unpruned network; this is the same pattern we observed when training
                       these networks with SGD.
                     2. At the very highest learning rates (e.g., learning rates 0.005 and 0.008 for Conv-2 and
                       Conv-4), early-stopping times never decreased and instead remained stable before increasing; this
                       is the same pattern we observed for the highest learning rates when training with Adam.


                  H.4 ITERATIVE PRUNING RATE

                 For the convolutional network architectures, we select different pruning rates for convolutional and
                 fully-connected layers. In the Conv-2 and Conv-4 architectures, convolutional parameters make up a
                 relatively small portion of the overall number of parameters in the models. By pruning convolutions
                 more slowly, we are likely to be able to prune the model further while maintaining performance.
                 In other words, we hypothesize that, if all layers were pruned evenly, convolutional layers would
                 become a bottleneck that would make it more difﬁcult to ﬁnd lower parameter-count models that are
                 still able to learn. For Conv-6, the opposite may be true: since nearly two thirds of its parameters are
                 in convolutional layers, pruning fully-connected layers could become the bottleneck.
                 Our criterion for selecting hyperparameters in this section is to ﬁnd a combination of pruning rates
                 that allows networks to reach the lowest possible parameter-counts while maintaining validation
                 accuracy at or above the original accuracy and early-stopping times at or below that for the original
                 network.
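The selection criterion above can be expressed as a simple filter over the results of one experiment. The sketch below is one possible reading of the criterion: it returns the smallest network size whose validation accuracy is at or above the unpruned baseline and whose early-stopping iteration is at or below it. The function name and all numbers are made up for illustration.

```python
def smallest_qualifying_size(results, baseline_accuracy, baseline_early_stop):
    """results: iterable of (percent_remaining, validation_accuracy, early_stop_iteration).

    Returns the smallest percent_remaining that meets both criteria, or None if no size qualifies.
    """
    qualifying = [
        percent for percent, accuracy, early_stop in results
        if accuracy >= baseline_accuracy and early_stop <= baseline_early_stop
    ]
    return min(qualifying) if qualifying else None

# Hypothetical results for one combination of pruning rates.
results = [
    (100.0, 0.70, 5000),
    (51.2, 0.72, 4000),
    (26.2, 0.73, 3000),
    (13.4, 0.71, 3500),
    (6.9, 0.68, 6000),
]
print(smallest_qualifying_size(results, baseline_accuracy=0.70, baseline_early_stop=5000))
```

Comparing this value across combinations of pruning rates gives a single number per combination, though the figures also show how steadily performance is maintained along the way.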
                 Figure 35 shows the results of performing the iterative lottery ticket experiment on Conv-2 (top),
                 Conv-4 (middle), and Conv-6 (bottom) with different combinations of pruning rates.
                 According to our criteria, we select an iterative convolutional pruning rate of 10% for Conv-2, 10% for
                 Conv-4, and 15% for Conv-6. For each network, any rate between 10% and 20% seemed reasonable.
                 Across all convolutional pruning rates, the lottery ticket pattern continued to appear.

                  H.5 LEARNING RATES (DROPOUT)

                 In order to train the Conv-2, Conv-4, and Conv-6 architectures with dropout, we repeated the exercise
                 from Section H.2 to select appropriate learning rates. Figure 36 shows the results of performing
                 the iterative lottery ticket experiment on Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) with
                 dropout and Adam at various learning rates. A network trained with dropout takes longer to learn, so
                 we trained each architecture for three times as many iterations as in the experiments without dropout:
                 60,000 iterations for Conv-2, 75,000 iterations for Conv-4, and 90,000 iterations for Conv-6. We
                 iteratively pruned these networks at the rates determined in Section H.4.

                                                  <<FIGURE>>

                 Figure 34. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                 ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained
                 using SGD with momentum (0.9) at various learning rates. Each line represents a different learning
                 rate. The legend for each pair of graphs is above the graphs. Lines that are unstable and contain large
                 error bars (large vertical lines) indicate that some experiments failed to learn effectively, leading to
                 very low accuracy and very high early-stopping times; these experiments reduce the averages that the
                 lines trace and lead to much wider error bars.

                                                              <<FIGURE>>

                         Figure 35. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                         ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures with an
                         iterative pruning rate of 20% for fully-connected layers. Each line represents a different iterative
                         pruning rate for convolutional layers.


                 The Conv-2 network proved to be difﬁcult to consistently train with dropout. The top right graph
                 in Figure 36 contains wide error bars and low average accuracy for many learning rates, especially
                 early in the lottery ticket experiments. This indicates that some or all of the training runs failed to
                 learn; when they were averaged into the other results, they produced the aforementioned pattern
                 in the graphs. At learning rate 0.0001, none of the three trials learned productively until pruned to
                 more than 26.5%, at which point all three trials started learning. At learning rate 0.0002, some of the
                 trials failed to learn productively until several rounds of iterative pruning had passed. At learning
                 rate 0.0003, all three networks learned productively at every pruning level. At learning rate 0.0004,
                 one network occasionally failed to learn. We selected learning rate 0.0003, which seemed to allow
                 networks to learn productively most often while achieving among the highest initial accuracy.
                 It is interesting to note that networks that were unable to learn at a particular learning rate (for
                 example, 0.0001) eventually began learning after several rounds of the lottery ticket experiment (that
                 is, training, pruning, and resetting repeatedly). It is worth investigating whether this phenomenon
                 was entirely due to pruning (that is, removing any random collection of weights would put the
                 network in a conﬁguration more amenable to learning) or whether training the network provided
                 useful information for pruning, even if the network did not show improved accuracy.
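The loop referred to above (training, pruning, and resetting repeatedly) can be sketched generically. This is a minimal illustration and not the experimental code: it assumes magnitude-based pruning of a single weight array, and `train_fn` is a hypothetical callable that takes the masked initial weights and the mask and returns trained weights.

```python
import numpy as np

def lottery_ticket_experiment(init_weights, train_fn, prune_rate, rounds):
    """Repeat `rounds` times: train from the original initialization, prune, and reset."""
    mask = np.ones(init_weights.shape, dtype=bool)
    history = []  # number of surviving weights at the start of each round
    for _ in range(rounds):
        history.append(int(mask.sum()))
        # Resetting: every round restarts from the same initial values, restricted to the mask.
        trained = train_fn(init_weights * mask, mask)
        surviving = np.flatnonzero(mask)
        n_prune = int(round(prune_rate * surviving.size))
        order = np.argsort(np.abs(trained.ravel()[surviving]))
        mask.flat[surviving[order[:n_prune]]] = False  # drop the smallest trained magnitudes
    return mask, history

def toy_train(weights, mask):
    # Hypothetical stand-in for real training: nudges weights and keeps pruned entries at zero.
    return (weights + 0.01 * np.sign(weights)) * mask

rng = np.random.default_rng(0)
init = rng.normal(size=(50, 20))
final_mask, history = lottery_ticket_experiment(init, toy_train, prune_rate=0.2, rounds=3)
print(history, int(final_mask.sum()))
```

Replacing the magnitude ordering with a random permutation of the surviving indices would give the random-removal variant raised in the question above.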
                 For both the Conv-4 and Conv-6 architectures, a slightly slower learning rate (0.0002 as opposed to
                 0.0003) leads to the highest accuracy on the unpruned networks in addition to the highest sustained
                 accuracy and fastest sustained learning as the networks are pruned during the lottery ticket experiment.
                 With dropout, the unpruned Conv-4 architecture reaches an average validation accuracy of 77.6%, a
                 2.7 percentage point improvement over the unpruned Conv-4 network trained without dropout and
                 one percentage point lower than the highest average validation accuracy attained by a winning ticket.
                 The dropout-trained winning tickets reach 82.6% average validation accuracy when pruned to 7.6%.
                 Early-stopping times improve by up to 1.58x (when pruned to 7.6%), a smaller improvement than
                 the 4.27x achieved by a winning ticket obtained without dropout.
                 With dropout, the unpruned Conv-6 architecture reaches an average validation accuracy of 81.3%,
                 an improvement of 2.2 percentage points over the accuracy without dropout; this nearly matches
                 the 81.5% average accuracy obtained by Conv-6 trained without dropout and pruned to 9.31%.
                 The dropout-trained winning tickets further improve upon these numbers, reaching 84.8% average
                 validation accuracy when pruned to 10.5%. Improvements in early-stopping times are less dramatic
                 than without dropout: a 1.5x average improvement when the network is pruned to 15.1%.
                 At all learning rates we tested, the lottery ticket pattern generally holds for accuracy, with improve-
                 ments as the networks are pruned. However, not all learning rates show the decreases in early-stopping
                 times. To the contrary, none of the learning rates for Conv-2 show clear improvements in early-
                 stopping times as seen in the other lottery ticket experiments. Likewise, the faster learning rates for
                 Conv-4 and Conv-6 maintain the original early-stopping times until pruned to about 40%, at which
                 point early-stopping times steadily increase.

                  H.6 PRUNING CONVOLUTIONS VS PRUNING FULLY-CONNECTED LAYERS

                 Figure 37 shows the effect of pruning convolutions alone (green), fully-connected layers alone
                 (orange) and pruning both (blue). The x-axis measures the number of parameters remaining to
                 emphasize the relative contributions made by pruning convolutions and fully-connected layers to
                 the overall network. In all three cases, pruning convolutions alone leads to higher test accuracy
                 and faster learning; pruning fully-connected layers alone generally causes test accuracy to worsen
                 and learning to slow. However, pruning convolutions alone has limited ability to reduce the overall
                 parameter-count of the network, since fully-connected layers comprise 99%, 89%, and 35% of the
                 parameters in Conv-2, Conv-4, and Conv-6.

                                                              <<FIGURE>>

                         Figure 36. The early-stopping iteration and validation accuracy at that iteration of the iterative lottery
                         ticket experiment on the Conv-2 (top), Conv-4 (middle), and Conv-6 (bottom) architectures trained
                         using dropout and the Adam optimizer at various learning rates. Each line represents a different
                         learning rate.

                                                            <<FIGURE>>

                      Figure 37. Early-stopping iteration and accuracy of the Conv-2 (top), Conv-4 (middle), and Conv-6
                      (bottom) networks when only convolutions are pruned, only fully-connected layers are pruned, and
                      both are pruned. The x-axis measures the number of parameters remaining, making it possible to
                      see the relative contributions to the overall network made by pruning FC layers and convolutions
                      individually.


                  I HYPERPARAMETER EXPLORATION FOR VGG-19 AND RESNET-18 ON CIFAR10

                 This Appendix accompanies the VGG-19 and Resnet-18 experiments in Section 4. It details the
                 pruning scheme, training regimes, and hyperparameters that we use for these networks.

                  I.1 GLOBAL PRUNING

                 In our experiments with the Lenet and Conv-2/4/6 architectures, we separately prune a fraction of
                 the parameters in each layer (layer-wise pruning). In our experiments with VGG-19 and Resnet-18,
                 we instead prune globally; that is, we prune all of the weights in convolutional layers collectively
                 without regard for the speciﬁc layer from which any weight originated.
                 Figures 38 (VGG-19) and 39 (Resnet-18) compare the winning tickets found by global pruning
                 (solid lines) and layer-wise pruning (dashed lines) for the hyperparameters from Section 4. When
                 training VGG-19 with learning rate 0.1 and warmup to iteration 10,000, we ﬁnd winning tickets when
                 P_m ≥ 6.9% for layer-wise pruning vs. P_m ≥ 1.5% for global pruning. For other hyperparameters,
                 accuracy similarly drops off sooner for layer-wise pruning than for global pruning. Global
                 pruning also ﬁnds smaller winning tickets than layer-wise pruning for Resnet-18, but the difference is
                 less extreme than for VGG-19.
                 In Section 4, we discuss the rationale for the efﬁcacy of global pruning on deeper networks. In
                 summary, the layers in these deep networks have vastly different numbers of parameters (particularly
                 severely so for VGG-19); if we prune layer-wise, we conjecture that layers with fewer parameters
                 become bottlenecks on our ability to ﬁnd smaller winning tickets.
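The difference between the two schemes can be illustrated on a toy pair of layers. The sketch below assumes magnitude-based pruning with a quantile threshold. The layer names, sizes, and weight scales are hypothetical and chosen only to make the bottleneck effect visible.

```python
import numpy as np

def layerwise_prune(weights, rate):
    """Remove the `rate` fraction of smallest-magnitude weights within each layer separately."""
    masks = {}
    for name, w in weights.items():
        threshold = np.quantile(np.abs(w), rate)
        masks[name] = np.abs(w) > threshold
    return masks

def global_prune(weights, rate):
    """Remove the `rate` fraction of smallest-magnitude weights across all layers collectively."""
    all_magnitudes = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(all_magnitudes, rate)
    return {name: np.abs(w) > threshold for name, w in weights.items()}

rng = np.random.default_rng(0)
weights = {
    "small_layer": rng.normal(scale=1.0, size=100),     # few weights, large magnitudes
    "large_layer": rng.normal(scale=0.1, size=10_000),  # many weights, small magnitudes
}
for label, masks in [("layer-wise", layerwise_prune(weights, 0.2)),
                     ("global", global_prune(weights, 0.2))]:
    kept = {name: int(m.sum()) for name, m in masks.items()}
    print(label, kept)
```

In this toy setting, layer-wise pruning removes 20% of the small layer regardless of how large its weights are, whereas global pruning removes almost all of its quota from the large layer and leaves the small layer nearly intact.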
                 Regardless of whether we use layer-wise or global pruning, the patterns from Section 4 hold: at
                 learning rate 0.1, iterative pruning ﬁnds winning tickets for neither network; at learning rate 0.01, the
                 lottery ticket pattern reemerges; and when training with warmup to a higher learning rate, iterative
                 pruning ﬁnds winning tickets. Figures 40 (VGG-19) and 41 (Resnet-18) present the same data as
                 Figures 7 (VGG-19) and 8 (Resnet-18) from Section 4 with layer-wise pruning rather than global
                 pruning. The graphs follow the same trends as in Section 4, but the smallest winning tickets are larger
                 than those found by global pruning.

                  I.2 VGG-19 DETAILS

                 The VGG19 architecture was ﬁrst designed by Simonyan & Zisserman (2014) for Imagenet. The
                 version that we use here was adapted by Liu et al. (2019) for CIFAR10. The network is structured
                 as described in Figure 2: it has five groups of 3x3 convolutional layers, the first four of which are
                 followed by max-pooling (stride 2) and the last of which is followed by average pooling. The network
                 has one ﬁnal dense layer connecting the result of the average-pooling to the output.
                 We largely follow the training procedure for Resnet-18 described in Appendix I:

                      We use the same train/test/validation split.
                      We use the same data augmentation procedure.
                      We use a batch size of 64.
                      We use batch normalization.
                      We use a weight decay of 0.0001.
                      We use three stages of training at decreasing learning rates. We train for 160 epochs (112,480
                       iterations), decreasing the learning rate by a factor of ten after 80 and 120 epochs.
                      We use Gaussian Glorot initialization.
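The epoch and iteration counts in the list above can be cross-checked with a short calculation. The sketch assumes the 45,000-example training set and that a final partial batch is dropped in each epoch, which is consistent with the stated total of 112,480 iterations. The function names are hypothetical.

```python
def iterations_per_epoch(n_train: int, batch_size: int) -> int:
    """Full mini-batches per epoch (assumes the final partial batch is dropped)."""
    return n_train // batch_size

def learning_rate(iteration: int, base_lr: float, iters_per_epoch: int,
                  drop_epochs=(80, 120), factor: float = 0.1) -> float:
    """Step schedule: multiply the rate by `factor` at the start of each epoch in `drop_epochs`."""
    epoch = iteration // iters_per_epoch
    drops = sum(epoch >= e for e in drop_epochs)
    return base_lr * factor ** drops

per_epoch = iterations_per_epoch(45_000, 64)
print(per_epoch, per_epoch * 160)  # 703 iterations per epoch, 112480 in total
for epoch in (0, 80, 120):
    print(epoch, learning_rate(epoch * per_epoch, 0.1, per_epoch))
```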

                 We globally prune the convolutional layers of the network at a rate of 20% per iteration, and we do
                 not prune the 5120 parameters in the output layer.
                 Liu et al. (2019) use an initial learning rate of 0.1. We train VGG19 with both this learning rate and
                 a learning rate of 0.01.


                  I.3 RESNET-18 DETAILS

                 The Resnet-18 architecture was ﬁrst introduced by He et al. (2016). The architecture comprises 20
                 total layers as described in Figure 2: a convolutional layer followed by nine pairs of convolutional
                 layers (with residual connections around the pairs), average pooling, and a fully-connected output
                 layer.
                 We follow the experimental design of He et al. (2016).

                      We divide the training set into 45,000 training examples and 5,000 validation examples. We
                       use the validation set to select hyperparameters in this appendix and the test set to evaluate
                       in Section 4.
                      We augment training data using random ﬂips and random four pixel pads and crops.
                      We use a batch size of 128.
                      We use batch normalization.
                      We use weight decay of 0.0001.
                      We train using SGD with momentum (0.9).
                      We use three stages of training at decreasing learning rates. Our stages last for 20,000,
                       5,000, and 5,000 iterations each, shorter than the 32,000, 16,000, and 16,000 used in He
                       et al. (2016). Since each of our iterative pruning experiments requires training the network
                       15-30 times consecutively, we select this abbreviated training schedule to make it possible
                       to explore a wider range of hyperparameters.
                      We use Gaussian Glorot initialization.

                 We globally prune convolutions at a rate of 20% per iteration. We do not prune the 2560 parameters
                 used to downsample residual connections or the 640 parameters in the fully-connected output layer,
                 as they comprise such a small portion of the overall network.
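To make the 20%-per-iteration rate concrete, the sketch below computes the fraction of prunable weights that survives a given number of pruning rounds. It assumes each round removes 20% of the weights that are still present (rather than 20% of the original count); the function name is illustrative.

```python
PRUNE_RATE = 0.20  # fraction of the surviving weights removed per round


def fraction_remaining(rounds, rate=PRUNE_RATE):
    """Fraction of prunable weights left after `rounds` pruning rounds."""
    return (1.0 - rate) ** rounds


for rounds in (1, 5, 15, 30):
    print(f"{rounds:2d} rounds: {100 * fraction_remaining(rounds):.2f}% remaining")
```

Under this assumption, the 15 to 30 consecutive training runs mentioned above correspond to roughly 3.5% and 0.12% of the prunable weights remaining.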

                  I.4 LEARNING RATE

                 In Section 4, we observe that iterative pruning is unable to ﬁnd winning tickets for VGG-19 and
                 Resnet-18 at the typical, high learning rate used to train the network (0.1) but it is able to do so at a
                 lower learning rate (0.01). Figures 42 and 43 explore several other learning rates. In general, iterative
                 pruning cannot ﬁnd winning tickets at any rate above 0.01 for either network; for higher learning
                 rates, the pruned networks with the original initialization perform no better than when randomly
                 reinitialized.

                  I.5 WARMUP ITERATION

                 In Section 4, we describe how adding linear warmup to the initial learning rate makes it possible to
                 ﬁnd winning tickets for VGG-19 and Resnet-18 at higher learning rates (and, thereby, winning tickets
                 that reach higher accuracy). In Figures 44 and 45, we explore the number of iterations k over which
                 warmup should occur.
                 For VGG-19, we were able to find values of k for which iterative pruning could identify winning
                 tickets when the network was trained at the original learning rate (0.1). For Resnet-18, warmup made
                 it possible to increase the learning rate from 0.01 to 0.03, but no further. When exploring values of k,
                 we therefore use learning rate 0.1 for VGG-19 and 0.03 for Resnet-18.
                 In general, the greater the value of k, the higher the accuracy of the eventual winning tickets.
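A minimal sketch of linear warmup over k iterations follows. It assumes the rate ramps from zero to the target learning rate and that the usual schedule applies unchanged afterwards; the function name `warmup_lr` is hypothetical.

```python
def warmup_lr(iteration, k, base_lr):
    """Linearly ramp the learning rate from 0 to base_lr over k iterations."""
    if k <= 0 or iteration >= k:
        return base_lr
    return base_lr * iteration / k
```

For example, with k = 10000 and a target rate of 0.1, this sketch reaches half the target rate at iteration 5000.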

                 Resnet-18. For values of k below 5000, accuracy improves rapidly as k increases. This relationship
                 reaches a point of diminishing returns above k = 5000. For the experiments in Section 4, we select
                 k = 20000, which achieves the highest validation accuracy.

                 VGG-19. For values of k below 5000, accuracy improves rapidly as k increases. This relationship
                 reaches a point of diminishing returns above k = 5000. For the experiments in Section 4, we select
                 k = 10000, as there is little benefit to larger values of k.

                                          <<FIGURE>>

                      Figure 38. Validation accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively
                      pruned with global (solid) and layer-wise (dashed) pruning.

                                     <<FIGURE>>

                      Figure 39. Validation accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively
                      pruned with global (solid) and layer-wise (dashed) pruning.

                                                      <<FIGURE>>

                      Figure 40. Test accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively pruned with
                      layer-wise pruning. This is the same as Figure 7, except with layer-wise pruning rather than global
                      pruning.

                                       <<FIGURE>>

                      Figure 41. Test accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively pruned with
                      layer-wise pruning. This is the same as Figure 8 except with layer-wise pruning rather than global
                      pruning.

                                                          <<FIGURE>>

                      Figure 42. Validation accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively
                      pruned and trained with various learning rates.

                                      <<FIGURE>>

                      Figure 43. Validation accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively
                      pruned and trained with various learning rates.

                          <<FIGURE>>

                      Figure 44. Validation accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively
                      pruned and trained with varying amounts of warmup at learning rate 0.03.

                          <<FIGURE>>

                      Figure 45. Validation accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively
                      pruned and trained with varying amounts of warmup at learning rate 0.1.
<|endoftext|>


<|startoftext|>
The State of Sparsity in Deep Neural Networks 

Trevor Gale *1  Erich Elsen *2 Sara Hooker 1  

Abstract 

We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. Additionally, we repeat the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification. 

1. Introduction 
Deep neural networks achieve state-of-the-art performance in a variety of domains including image classification (He et al., 2016), machine translation (Vaswani et al., 2017), and text-to-speech (van den Oord et al., 2016; Kalchbrenner et al., 2018). While model quality has been shown to scale with model and dataset size (Hestness et al., 2017), the resources required to train and deploy large neural networks can be prohibitive. State-of-the-art models for tasks like image classification and machine translation commonly have tens of millions of parameters, and require billions of floating-point operations to make a prediction for a single input sample.
Sparsity has emerged as a leading approach to address these challenges. By sparsity, we refer to the property that a subset of the model parameters have a value of exactly zero². With zero-valued weights, any multiplications (which dominate neural network computation) can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. It has been shown empirically that deep neural networks can tolerate high levels of sparsity (Han et al., 2015; Narang et al., 2017; Ullrich et al., 2017), and this property has been leveraged to significantly reduce the cost associated with the deployment of deep neural networks, and to enable the deployment of state-of-the-art models in severely resource-constrained environments (Theis et al., 2018; Kalchbrenner et al., 2018; Valin & Skoglund, 2018).
Over the past few years, numerous techniques for inducing sparsity have been proposed and the set of models and datasets used as benchmarks has grown too large to reasonably expect new approaches to explore them all. In addition to the lack of standardization in modeling tasks, the distribution of benchmarks tends to slant heavily towards convolutional architectures and computer vision tasks, and the tasks used to evaluate new techniques are frequently not representative of the scale and complexity of real-world tasks where model compression is most useful. These characteristics make it difficult to come away from the sparsity literature with a clear understanding of the relative merits of different approaches.
In addition to practical concerns around comparing techniques, multiple independent studies have recently proposed that the value of sparsification in neural networks has been misunderstood (Frankle & Carbin, 2018; Liu et al., 2018). While both papers suggest that sparsification can be viewed as a form of neural architecture search, they disagree on what is necessary to achieve this. Specifically, Liu et al. (2018) re-train learned sparse topologies with a random weight initialization, whereas Frankle & Carbin (2018) posit that the exact random weight initialization used when the sparse architecture was learned is needed to match the test set performance of the model sparsified during optimization.
2 The term sparsity is also commonly used to refer to the proportion of a neural network's weights that are zero-valued. Higher sparsity corresponds to fewer weights, and smaller computational and storage requirements. We use the term in this way throughout this paper.
In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and l0 regularization to models of this scale. While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification³.

2. Sparsity in Neural Networks 

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks. 

Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Strom, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017). Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b;a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.
3 https://bit.ly/2ExE8Yj

Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018). 
Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).
A key feature of a sparsity-inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3. Evaluating Sparsification Techniques at Scale

As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and l0 regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing a lower bound on the compression-accuracy trade-off any method should be expected to achieve.
Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A. 

3.1. Magnitude Pruning 
Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model pruning library⁴. This technique allows for masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user-specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).
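The two ingredients named above, a gradual sparsification schedule and sorting-based thresholding, can be sketched as follows. This is an illustrative sketch, not the TensorFlow library's code: the cubic ramp follows the form given by Zhu & Gupta (2017), and the function names, argument names, and flat-list weight representation are assumptions made for the example.

```python
def target_sparsity(step, begin_step, end_step, final_sparsity,
                    initial_sparsity=0.0):
    """Cubic ramp from initial_sparsity to final_sparsity between two steps."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask zeroing the `sparsity` fraction of smallest-|w| weights."""
    num_pruned = int(round(sparsity * len(weights)))
    # Sorting-based thresholding: rank indices by magnitude, prune the smallest.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:num_pruned]:
        mask[i] = 0
    return mask
```

In this sketch the mask is recomputed from the underlying dense weights at each pruning step, which is what would let a previously masked weight return if its magnitude grows; how masked weights are updated between pruning steps is left out.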

3.2. Variational Dropout 
Variational dropout was originally proposed as a reinterpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).
Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model. 
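The final removal step can be sketched as a threshold on the per-weight quantity log α = log σ² − log θ², where θ is the learned mean and σ² the learned variance. The threshold value of 3.0 is an assumption of this sketch (it is a value commonly used with this method, corresponding to a dropout rate of roughly 0.95), and the function names are illustrative.

```python
import math


def log_alpha(theta, log_sigma2):
    """log(alpha) = log(sigma^2) - log(theta^2) for a single weight."""
    return log_sigma2 - math.log(theta * theta + 1e-8)  # epsilon avoids log(0)


def sparsify(thetas, log_sigma2s, threshold=3.0):
    """Zero every weight whose log(alpha) exceeds `threshold`."""
    return [0.0 if log_alpha(t, s) > threshold else t
            for t, s in zip(thetas, log_sigma2s)]
```

A weight with a small mean and comparatively large variance has a large log α and is removed, while a weight whose mean dominates its noise is kept.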

3.3. l0 Regularization 
l0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.

4 https://bit.ly/2T8hBGn

Table 1. Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

<<TABLE>>
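The hard-concrete gate described in Section 3.3 can be sketched in a few lines. The constants below (temperature 2/3 and stretch limits −0.1 and 1.1) are the values suggested by Louizos et al. (2017b) as best recalled here and should be treated as assumptions; the function names are illustrative, and a real implementation would operate on tensors and backpropagate through the sample.

```python
import math
import random

BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1  # assumed temperature and stretch limits


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng=random):
    """Draw one hard-concrete gate z in [0, 1] via the reparameterization trick."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)   # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA          # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))            # hard clip to [0, 1]


def prob_nonzero(log_alpha):
    """P(z != 0), i.e. one minus the CDF of the stretched variable at zero."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_l0(log_alphas):
    """Expected number of non-zero gates, used as the sparsity penalty."""
    return sum(prob_nonzero(a) for a in log_alphas)
```

Because `expected_l0` is a smooth function of the gate parameters, it can be added to the training loss in place of the non-differentiable count of non-zero weights.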

3.4. Random Pruning Baseline 
For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned each step at random rather than based on magnitude, and does not allow pruned weights to reactivate. This technique is intended to represent a lower bound of the accuracy-sparsity trade-off curve.

3.5. Experimental Framework 
For magnitude pruning, we used the TensorFlow model pruning library. We implemented variational dropout and l0 regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our l0 regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and l0 regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model pruning library to randomly select weights as opposed to sorting them based on magnitude. 
For each model, we kept the number of training steps constant across all techniques and performed extensive hyper-parameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and l0-regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments. 

<<FIGURE>>

Figure 1. Sparsity-BLEU trade-off curves for the Transformer. 
Top: Pareto frontiers for each of the four sparsification techniques applied to the Transformer. Bottom: All experimental results with each technique. Despite the diversity of approaches, the relative performance of all three techniques is remarkably consistent. Magnitude pruning notably outperforms more complex techniques for high levels of sparsity. 

4. Sparse Neural Machine Translation 

We adapted the Transformer (Vaswani et al., 2017) model for neural machine translation to use these four sparsification techniques, and trained the model on the WMT 2014 English-German dataset. We sparsified all fully-connected layers and embeddings, which make up 99.87% of all of the parameters in the model (the other parameters coming from biases and layer normalization). The constant hyperparameters used for all experiments are listed in Table 1. We followed the standard training procedure used by Vaswani et al. (2017), but did not perform checkpoint averaging. This setup yielded a baseline BLEU score of 27.29 averaged across five runs.
We extensively tuned the remaining hyperparameters for each technique. Details on which hyperparameters we explored and which settings produced the best models can be found in Appendix D.

4.1. Sparse Transformer Results & Analysis 
All results for the Transformer are plotted in Figure 1. Despite the vast differences in these approaches, the relative performance of all three techniques is remarkably consistent. While l0 regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models. While all techniques were able to outperform the random pruning technique, randomly removing weights produces surprisingly reasonable results, which is perhaps indicative of the model's ability to recover from damage during optimization.
What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights for each layer. This is in stark contrast to variational dropout and l0 regularization, where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a non-uniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and l0 regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.
Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both l0 regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, l0 regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.
It is also important to note that these results maintain a constant number of training steps across all techniques and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than l0 regularization and variational dropout respectively. While the standard Transformer training scheme produces excellent results for machine translation, it has been shown that training the model for longer can improve its performance by as much as 2 BLEU (Ott et al., 2018). Thus, when compared for a fixed training cost, magnitude pruning has a distinct advantage over these more complicated techniques.

<<FIGURE>>

Figure 2. Average sparsity in Transformer layers. Distributions calculated on the top performing model at 90% sparsity for each technique. l0 regularization and variational dropout are able to learn non-uniform distributions of sparsity, while magnitude pruning induces user-specified sparsity distributions (in this case, uniform). 

Table 2. Constant hyperparameters for all RN50 experiments. 

<<TABLE>>

5. Sparse Image Classification

To benchmark these four sparsity techniques on a large-scale computer vision task, we integrated each method into ResNet-50 and trained the model on the ImageNet large-scale image classification dataset. We sparsified all convolutional and fully-connected layers, which make up 99.79% of all of the parameters in the model (the other parameters coming from biases and batch normalization). 
The hyperparameters we used for all experiments are listed in Table 2. Each model was trained for 128000 iterations with a batch size of 1024 images, stochastic gradient descent with momentum, and the standard learning rate schedule (see Appendix E.1). This setup yielded a baseline top-1 accuracy of 76.69% averaged across three runs. We trained each model with 8-way data parallelism across 8 accelerators. Due to the extra parameters and operations required for variational dropout, the model was unable to fit into device memory in this configuration. For all variational dropout experiments, we used a per-device batch size of 32 images and scaled the model over 32 accelerators. 

5.1. ResNet-50 Results & Analysis 
Figure 3 shows results for magnitude pruning, variational dropout, and random pruning applied to ResNet-50. Surprisingly, we were unable to produce sparse ResNet-50 models with l0 regularization that did not significantly damage model quality. Across hundreds of experiments, our models were either able to achieve full test set performance with no sparsification, or sparsification with test set performance akin to random guessing. Details on all hyperparameter settings explored are included in Appendix E. 
This result is particularly surprising given the success of l0 regularization on Transformer. One nuance of the l0 regularization technique of Louizos et al. (2017b) is that the model can have varying sparsity levels between the training and test-time versions of the model. At training time, a parameter with a dropout rate of 10% will be zero 10% of the time when sampled from the hard-concrete distribution. However, under the test-time parameter estimator, this weight will be non-zero⁵. Louizos et al. (2017b) reported results applying l0 regularization to a wide residual network (WRN) (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset, and noted that they observed small accuracy loss at as low as 8% reduction in the number of parameters during training. Applying our weight-level l0 regularization implementation to WRN produces a model with comparable training-time sparsity, but with no sparsity in the test-time parameters. For models that achieve test-time sparsity, we observe significant accuracy degradation on CIFAR-10. This result is consistent with our observation for l0 regularization applied to ResNet-50 on ImageNet.

Figure 3. Sparsity-accuracy trade-off curves for ResNet-50.
Top: Pareto frontiers for variational dropout, magnitude pruning, and random pruning applied to ResNet-50. Bottom: All experimental results with each technique. We observe large variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50. Magnitude pruning and variational dropout achieve comparable performance for most sparsity levels, with variational dropout achieving the best results for high sparsity levels.
The variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50 is striking. While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind l0 regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces models on par with or better than magnitude pruning, and that l0 regularization is not able to produce sparse models at all. Variational dropout achieved particularly notable results in the high sparsity range, maintaining a top-1 accuracy over 70% with less than 4% of the parameters of a standard ResNet-50.
5 The fraction of time a parameter is set to zero during training depends on other factors, e.g. the β parameter of the hard-concrete distribution. However, the general point holds: the training and test-time sparsities are not necessarily equivalent, and there exists some dropout rate threshold below which a weight that is sometimes zero during training will be non-zero at test-time.

Figure 4. Average sparsity in ResNet-50 layers. Distributions calculated on the top performing model at 95% sparsity for each technique. Variational dropout is able to learn non-uniform distributions of sparsity, decreasing sparsity in the input and output layers that are known to be disproportionately important to model quality.
The distribution of sparsity across different layer types in the best variational dropout and magnitude pruning models at 95% sparsity is plotted in Figure 4. While we kept sparsity constant across all layers for magnitude and random pruning, variational dropout significantly reduces the amount of sparsity induced in the first and last layers of the model.
It has been observed that the first and last layers are often disproportionately important to model quality (Han et al., 2015; Bellec et al., 2017). In the case of ResNet-50, the first convolution comprises only 0.037% of all the parameters in the model. At 98% sparsity the first layer has only 188 non-zero parameters, for an average of less than 3 parameters per output feature map. With magnitude pruning uniformly sparsifying each layer, it is surprising that it is able to achieve any test set performance at all with so few parameters in the input convolution.
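The arithmetic behind these figures can be checked directly. The sketch below assumes the standard ResNet-50 stem (a 7x7 convolution from 3 input channels to 64 output channels, without a bias) and an approximate total of 25.6 million parameters; neither number is stated in the text above.

```python
# First convolution of ResNet-50: 7x7 kernel, 3 input channels, 64 output channels.
KERNEL, IN_CH, OUT_CH = 7, 3, 64
first_conv_params = KERNEL * KERNEL * IN_CH * OUT_CH       # 9408

TOTAL_PARAMS = 25.6e6  # assumed approximate ResNet-50 parameter count

share_percent = 100 * first_conv_params / TOTAL_PARAMS     # share of all parameters
nonzero_at_98 = round(first_conv_params * (1 - 0.98))      # weights kept at 98% sparsity
per_feature_map = nonzero_at_98 / OUT_CH                   # average per output channel
```

Under these assumptions the first convolution holds about 0.037% of the parameters, keeps 188 weights at 98% sparsity, and averages just under 3 weights per output feature map, in line with the numbers quoted above.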
While variational dropout is able to learn to distribute sparsity non-uniformly across the layers, it comes at a significant increase in resource requirements. For ResNet-50 trained with variational dropout we observed a greater than 2x increase in memory consumption. When scaled across 32 accelerators, ResNet-50 trained with variational dropout completed training in 9.75 hours, compared to ResNet-50 with magnitude pruning finishing in 12.50 hours on only 8 accelerators. Scaled to a 4096 batch size and 32 accelerators, ResNet-50 with magnitude pruning can complete the same number of epochs in just 3.15 hours.
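One way to compare these timings is to convert them to accelerator-hours. This is only a rough proxy for cost, and an assumption of this sketch: it ignores the differing batch sizes and any scaling inefficiency.

```python
# (number of accelerators, wall-clock hours) as quoted in the text.
runs = {
    "variational dropout, 32 accelerators": (32, 9.75),
    "magnitude pruning, 8 accelerators": (8, 12.50),
    "magnitude pruning, 32 accelerators": (32, 3.15),
}

accelerator_hours = {name: n * hours for name, (n, hours) in runs.items()}

for name, cost in accelerator_hours.items():
    print(f"{name}: {cost:.1f} accelerator-hours")
```

Under this proxy, variational dropout uses roughly three times the accelerator-hours of magnitude pruning (312 versus about 100).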
Figure 5. Sparsity-accuracy trade-off curves for ResNet-50 with modified sparsification scheme. Altering the distribution of sparsity across the layers and increasing training time yield significant improvement for magnitude pruning. 

5.2. Pushing the Limits of Magnitude Pruning 
Given that a uniform distribution of sparsity is suboptimal, and the significantly smaller resource requirements for ap.plying magnitude pruning to ResNet-50 it is natural to won.der how well magnitude pruning could perform if we were to distribute the non-zero weights more carefully and increase training time. 
To understand the limits of the magnitude pruning heuristic, we modify our ResNet-50 training setup to leave the first convolutional layer fully dense, and only prune the final fully-connected layer to 80% sparsity. This heuristic is reasonable for ResNet-50, as the first layer makes up a small fraction of the total parameters in the model and the final layer makes up only .03% of the total FLOPs. While tuning the magnitude pruning ResNet-50 models, we observed that the best models always started and ended pruning during the third learning rate phase, before the second learning rate drop. To take advantage of this, we increase the number of training steps by 1.5x by extending this learning rate region. Results for ResNet-50 trained with this scheme are plotted in Figure 5. 
With these modifications, magnitude pruning outperforms variational dropout at all but the highest sparsity levels while still using fewer resources. However, variational dropout's performance in the high sparsity range is particularly notable. With very few non-zero weights, we find it likely that the model's performance on the test set is closely tied to the precise allocation of weights across the different layers, and that variational dropout's ability to learn this distribution enables it to better maintain accuracy at high sparsity levels. This result indicates that efficient sparsification techniques that are able to learn the distribution of sparsity across layers are a promising direction for future work. 
It's also worth noting that these changes produced models at 80% sparsity with top-1 accuracy of 76.52%, only .17% off our baseline ResNet-50 accuracy and .41% better than the results reported by He et al. (2018), without the extra complexity and computational requirements of their reinforcement learning approach. This represents a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 trained on ImageNet. 

6. Sparsification as Architecture Search 
While sparsity is traditionally thought of as a model compression technique, two independent studies have recently suggested that the value of sparsification in neural networks is misunderstood, and that once a sparse topology is learned it can be trained from scratch to the full performance achieved when sparsification was performed jointly with optimization. 
Frankle & Carbin (2018) posited that over-parameterized neural networks contain small, trainable subsets of weights, deemed "winning lottery tickets". They suggest that sparsity inducing techniques are methods for finding these sparse topologies, and that once found the sparse architectures can be trained from scratch with the same weight initialization that was used when the sparse architecture was learned. They demonstrated that this property holds across different convolutional neural networks and multi-layer perceptrons trained on the MNIST and CIFAR-10 datasets. 
Liu et al. (2018) similarly demonstrated this phenomenon for a number of activation sparsity techniques on convolutional neural networks, as well as for weight level sparsity learned with magnitude pruning. However, they demonstrate this result using a random initialization during re-training. 
The implications of being able to train sparse architectures from scratch once they are learned are large: once a sparse topology is learned, it can be saved and shared as with any other neural network architecture. Re-training then can be done fully sparse, taking advantage of sparse linear algebra to greatly accelerate time-to-solution. However, the combination of these two studies does not clearly establish how this potential is to be realized. 
Beyond the question of whether or not the original random weight initialization is needed, both studies only explore convolutional neural networks (and small multi-layer perceptrons in the case of Frankle & Carbin (2018)). The majority of experiments in both studies also limited their analyses to the MNIST, CIFAR-10, and CIFAR-100 datasets. While these are standard benchmarks for deep learning models, they are not indicative of the complexity of real-world tasks where model compression is most useful. Liu et al. (2018) do explore convolutional architectures on the ImageNet dataset, but only at two relatively low sparsity levels (30% and 60%). They also note that weight level sparsity on ImageNet is the only case where they are unable to reproduce the full accuracy of the pruned model. 

<<FIGURE>>

Figure 6. Scratch and lottery ticket experiments with magnitude pruning. Top: results with Transformer. Bottom: results with ResNet-50. Across all experiments, training from scratch using a learned sparse architecture is unable to reproduce the performance of models trained with sparsification as part of the optimization process. 
To clarify the questions surrounding the idea of sparsification as a form of neural architecture search, we repeat the experiments of Frankle & Carbin (2018) and Liu et al. (2018) on ResNet-50 and Transformer. For each model, we explore the full range of sparsity levels (50%-98%) and compare to our well-tuned models from the previous sections. 

6.1. Experimental Framework 
The experiments of Liu et al. (2018) involve taking the final learned weight mask from a magnitude pruning model, randomly re-initializing the weights, and training the model with the normal training procedure (i.e., learning rate, number of iterations, etc.). To account for the presence of sparsity at the start of training, they scale the variance of the initial weight distribution by the number of non-zeros in the matrix. They additionally train a variant where they increase the number of training steps (up to a factor of 2x) such that the re-trained model uses approximately the same number of FLOPs during training as a model trained with sparsification as part of the optimization process. They refer to these two experiments as "scratch-e" and "scratch-b" respectively. 
Frankle & Carbin (2018) follow a similar procedure, but use the same weight initialization that was used when the sparse weight mask was learned and do not perform the longer training time variant. 
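The difference between these re-training setups lies in the starting weights, which can be sketched as follows. This is a minimal illustration, not the code used in either study: the helper names, the Gaussian fan-in initializer, and the exact variance rescaling rule (dividing the fan-in by the mask density) are all assumptions.

```python
import math
import random


def dense_init(n_in, n_out, seed):
    """Gaussian fan-in initialization for an n_in x n_out weight matrix."""
    rng = random.Random(seed)
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


def scaled_init(mask, seed):
    """Fresh init whose variance is rescaled by the mask density (assumed rule)."""
    n_in, n_out = len(mask), len(mask[0])
    density = sum(sum(row) for row in mask) / (n_in * n_out)  # must be > 0
    rng = random.Random(seed)
    std = math.sqrt(1.0 / (n_in * density))  # fewer surviving inputs -> larger std
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def starting_weights(experiment, mask, original_seed, new_seed):
    """Starting weights for one layer under each re-training setup."""
    n_in, n_out = len(mask), len(mask[0])
    if experiment == "lottery_ticket":
        # Reuse the initialization from the run that learned the mask.
        weights = dense_init(n_in, n_out, original_seed)
    elif experiment == "scratch":
        # New random initialization, unchanged weight distribution.
        weights = dense_init(n_in, n_out, new_seed)
    elif experiment == "scratch_scaled_init":
        # New random initialization with sparsity-aware variance.
        weights = scaled_init(mask, new_seed)
    else:
        raise ValueError(f"unknown experiment: {experiment}")
    return apply_mask(weights, mask)
```

The scratch-e and scratch-b variants would share the same starting weights and differ only in the number of training steps.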
For our experiments, we repeat the scratch-e, scratch-b and lottery ticket experiments with magnitude pruning on Transformer and ResNet-50. For scratch-e and scratch-b, we also train variants that do not alter the initial weight distribution. For the Transformer, we re-trained five replicas of the best magnitude pruning hyperparameter settings at each sparsity level and saved the weight initialization and final sparse weight mask. For each of the five learned weight masks, we train five identical replicas for the scratch-e, scratch-b, scratch-e with augmented initialization, scratch-b with augmented initialization, and the lottery ticket experiments. For ResNet-50, we followed the same procedure with three re-trained models and three replicas at each sparsity level for each of the five experiments. Figure 6 plots the averages and min/max of all experiments at each sparsity level (footnote 6). 

6.2. Scratch and Lottery Ticket Results & Analysis 
Across all of our experiments, we observed that training from scratch using a learned sparse architecture is not able to match the performance of the same model trained with sparsification as part of the optimization process. 
Across both models, we observed that doubling the number of training steps did improve the quality of the results for the scratch experiments, but was not sufficient to match the test set performance of the magnitude pruning baseline. As sparsity increased, we observed that the deviation between the models trained with magnitude pruning and those trained from scratch increased. For both models, we did not observe a benefit from using the augmented weight initialization for the scratch experiments. 
For ResNet-50, we experimented with four different learning rate schemes for the scratch-b experiments. We found that scaling each learning rate region to double the number of epochs produced the best results by a wide margin. These results are plotted in Figure 6. Results for the ResNet-50 scratch-b experiments with the other learning rate variants are included with our release of hyperparameter tuning results. 
For the lottery ticket experiments, we were not able to replicate the phenomenon observed by Frankle & Carbin (2018). The key difference between our experiments is the complexity of the tasks and scale of the models, and it seems likely that this is the main factor contributing to our inability to train these architectures from scratch. 
For the scratch experiments, our results are consistent with the negative result observed by Liu et al. (2018) for ImageNet and ResNet-50 with unstructured weight pruning. By replicating the scratch experiments at the full range of sparsity levels, we observe that the quality of the models degrades relative to the magnitude pruning baseline as sparsity increases. For unstructured weight sparsity, it seems likely that the phenomenon observed by Liu et al. (2018) was produced by a combination of low sparsity levels and small-to-medium sized tasks. We'd like to emphasize that this result is only for unstructured weight sparsity, and that prior work (Liu et al., 2018) provides strong evidence that activation pruning behaves differently. 

6 Two of the 175 Transformer experiments failed to train from scratch at all and produced BLEU scores less than 1.0. We omit these outliers in Figure 6. 

7. Limitations of This Study 
Hyperparameter exploration. For all techniques and models, we carefully hand-tuned hyperparameters and performed extensive sweeps encompassing thousands of experiments over manually identified ranges of values. However, the number of possible settings vastly outnumbers the set of values that can be practically explored, and we cannot eliminate the possibility that some techniques significantly outperform others under settings we did not try. 
Neural architectures and datasets. Transformer and ResNet-50 were chosen as benchmark tasks to represent a cross section of large-scale deep learning tasks with diverse architectures. We can't exclude the possibility that some techniques achieve consistently high performance across other architectures. More models and tasks should be thoroughly explored in future work. 

8. Conclusion 
In this work, we performed an extensive evaluation of three state-of-the-art sparsification techniques on two large-scale learning tasks. Notwithstanding the limitations discussed in section 7, we demonstrated that complex techniques shown to yield state-of-the-art compression on small datasets perform inconsistently, and that simple heuristics can achieve comparable or better results on a reduced computational budget. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 with only magnitude pruning and highlight promising directions for research in sparsity inducing techniques. 
Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression. As such, we open-source our code, checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work. 

Acknowledgements 
We would like to thank Benjamin Caine, Jonathan Frankle, 
Raphael Gontijo Lopes, Sam Greydanus, and Keren Gu for 
helpful discussions and feedback on drafts of this paper. 

References 
Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. Deep Rewiring: Training Very Sparse Deep Networks. CoRR, abs/1711.05136, 2017. 
Collins, M. D. and Kohli, P. Memory Bounded Deep Convolutional Networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442. 
Dai, B., Zhu, C., and Wipf, D. P. Compressing Neural Networks using the Variational Information Bottleneck. CoRR, abs/1802.10399, 2018. 
Frankle, J. and Carbin, M. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. CoRR, abs/1803.03635, 2018. URL http://arxiv.org/abs/1803.03635. 
Gray, S., Radford, A., and Kingma, D. P. Block-sparse gpu kernels. https://blog.openai.com/block-sparse-gpu-kernels/, 2017. 
Guo, Y., Yao, A., and Chen, Y. Dynamic Network Surgery for Efficient DNNs. In NIPS, 2016. 
Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both Weights and Connections for Efficient Neural Network. In NIPS, pp. 1135-1143, 2015. 
Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, pp. 164-171. Morgan Kaufmann, 1992. 
He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770-778, 2016. 
He, Y., Lin, J., Liu, Z., Wang, H., Li, L., and Han, S. AMC: automl for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815-832, 2018. 
Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. 
Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2415-2424, 2018. 
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013. 
Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015. 
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal Brain Damage. In NIPS, pp. 598-605. Morgan Kaufmann, 1989. 
Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In NIPS, pp. 2178-2188, 2017. 
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2755-2763, 2017. 
Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the Value of Network Pruning. CoRR, abs/1810.05270, 2018. 
Louizos, C., Ullrich, K., and Welling, M. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 3290-3300, 2017a. 
Louizos, C., Welling, M., and Kingma, D. P. Learning Sparse Neural Networks through L0 Regularization. CoRR, abs/1712.01312, 2017b. 
Luo, J., Wu, J., and Lin, W. Thinet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5068-5076, 2017. 
Mitchell, T. J. and Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988. 
Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science. Nature Communications, 2018. 
Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2498-2507, 2017. 
Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR, abs/1611.06440, 2016. 
Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. Exploring Sparsity in Recurrent Neural Networks. CoRR, abs/1704.05119, 2017. 
Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 1-9, 2018. 
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pp. 1278-1286. JMLR.org, 2014. 
Strom, N. Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks. In EUROSPEECH, 1997. 
Theis, L., Korshunova, I., Tejani, A., and Huszar, F. Faster gaze prediction with dense networks and Fisher pruning. CoRR, abs/1801.05787, 2018. URL http://arxiv.org/abs/1801.05787. 
Ullrich, K., Meeds, E., and Welling, M. Soft Weight-Sharing for Neural Network Compression. CoRR, abs/1702.04008, 2017. 
Valin, J. and Skoglund, J. Lpcnet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR, abs/1810.11846, 2018. URL http://arxiv.org/abs/1810.11846. 
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. Wavenet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, pp. 125, 2016. 
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000-6010, 2017. 
Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016. 
Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017. URL http://arxiv.org/abs/1710.01878. 

The State of Sparsity in Deep Neural Networks: Appendix 

A. Overview of Sparsity Inducing Techniques 

Here we provide a more detailed review of the three sparsity techniques we benchmarked. 

A.1. Magnitude Pruning 
Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. Many variants have been proposed (Collins & Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017), with the key differences lying in when weights are removed, whether weights should be sorted to remove a precise proportion or thresholded based on a fixed or decaying value, and whether or not weights that have been pruned still receive gradient updates and have the potential to return after being pruned. 
Han et al. (2015) use iterative magnitude pruning and re-training to progressively sparsify a model. The target model is first trained to convergence, after which a portion of weights are removed and the model is re-trained with these weights fixed to zero. This process is repeated until the target sparsity is achieved. Guo et al. (2016) improve on this approach by allowing masked weights to still receive gradient updates, enabling the network to recover from incorrect pruning decisions during optimization. They achieve higher compression rates and interleave pruning steps with gradient update steps to avoid expensive re-training. Zhu & Gupta (2017) similarly allow gradient updates to masked weights, and make use of a gradual sparsification schedule with sorting-based weight thresholding to maintain accuracy while achieving a user specified level of sparsification. 
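A gradual schedule with sorting-based thresholding can be sketched as follows. This is an illustrative sketch on a flat list of weights: the cubic ramp follows the form proposed by Zhu & Gupta (2017), while the function names, step values, and toy weights are assumptions.

```python
def target_sparsity(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity ramp: prunes quickly at first, then more slowly."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that removes the `sparsity` fraction of smallest-|w| weights."""
    num_pruned = int(round(sparsity * len(weights)))
    # Sort indices by magnitude; the first `num_pruned` indices are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:num_pruned]:
        mask[i] = 0
    return mask


weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.2, -0.3, 0.6]
for step in (0, 2000, 4000, 6000, 8000):
    s = target_sparsity(step, begin_step=2000, end_step=8000, final_sparsity=0.75)
    print(step, round(s, 3), magnitude_mask(weights, s))
```

Because the mask is recomputed from the current weights at every pruning step, a weight that keeps receiving gradient updates while masked can return if its magnitude grows.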
It's worth noting that magnitude pruning can easily be adapted to induce block or activation level sparsity by removing groups of weights based on their p-norm, average, max, or other statistics. Variants have also been proposed that maintain a constant level of sparsity during optimization to enable accelerated training (Mocanu et al., 2018). 

A.2. Variational Dropout 
Consider the setting of a dataset D of N i.i.d. samples (x, y) and a standard classification problem where the goal is to learn the parameters w of the conditional probability p(y|x, w). Bayesian inference combines some initial belief over the parameters w in the form of a prior distribution p(w) with observed data D into an updated belief over the parameters in the form of the posterior distribution p(w|D). In practice, computing the true posterior using Bayes' rule is computationally intractable and good approximations are needed. In variational inference, we optimize the parameters <<FORMULA>> of some parameterized model <<FORMULA>> such that <<FORMULA>> is a close approximation to the true posterior distribution p(w|D) as measured by the Kullback-Leibler divergence between the two distributions. The divergence of our approximate posterior from the true posterior is minimized in practice by maximizing the variational lower-bound 

<<FORMULA>>

where <<FORMULA>>

Using the Stochastic Gradient Variational Bayes (SGVB) (Kingma et al., 2015) algorithm to optimize this bound, <<FORMULA>> reduces to the standard cross-entropy loss, and the KL divergence between our approximate posterior and prior over the parameters serves as a regularizer that enforces our initial belief about the parameters w. 
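For reference, the variational lower bound in this setting is commonly written in the following form (see Kingma et al., 2015; Molchanov et al., 2017). The symbols \phi, q_\phi(w), and L_D(\phi) are assumed notation for the variational parameters, the approximate posterior, and the expected log-likelihood term.

```latex
\mathcal{L}(\phi) = L_{\mathcal{D}}(\phi) - D_{\mathrm{KL}}\left(q_\phi(w) \,\|\, p(w)\right),
\qquad
L_{\mathcal{D}}(\phi) = \sum_{(x, y) \in \mathcal{D}} \mathbb{E}_{q_\phi(w)}\left[\log p(y \mid x, w)\right]
```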
In the standard formulation of variational dropout, we assume the weights are drawn from a fully-factorized Gaussian approximate posterior. 

<<FORMULA>> 

where <<FORMULA>> and <<FORMULA>> are neural network parameters. For each training step, we sample weights from this distribution and use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) to differentiate the loss w.r.t. the parameters through the sampling operation. Given that the weights are normally distributed, the distribution of the activations B after a linear operation like matrix multiplication or convolution is also Gaussian and can be calculated in closed form (footnote 7). 

<<FORMULA>>

with <<FORMULA>> and <<FORMULA>> where <<FORMULA>> are the inputs to the layer. 

7 We ignore correlation in the activations, as is done by Molchanov et al. (2017). 

Thus, rather than sample weights, we can directly sample the activations at each layer. This step is known as the local reparameterization trick, and was shown by Kingma et al. (2015) to reduce the variance of the gradients relative to the standard formulation in which a single set of sampled weights must be shared for all samples in the input batch for efficiency. Molchanov et al. (2017) showed that the variance of the gradients could be further reduced by using an additive noise reparameterization, where we define a new parameter 

<<FORMULA>>

Under this parameterization, we directly optimize the mean and variance of the neural network parameters. 
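Sampling activations directly from per-weight means and variances can be sketched for a single dense layer as below. This is a minimal sketch assuming a fully-factorized Gaussian over the weights and ignoring correlations between activations; the names `theta` and `log_sigma2` are illustrative and not taken from any released code.

```python
import math
import random


def sample_activations(inputs, theta, log_sigma2, rng):
    """Sample pre-activations for one dense layer without sampling weights.

    inputs:     batch of input rows, shape [batch][n_in]
    theta:      weight means, shape [n_in][n_out]
    log_sigma2: log of the weight variances, shape [n_in][n_out]
    """
    n_in, n_out = len(theta), len(theta[0])
    outputs = []
    for a in inputs:
        row = []
        for j in range(n_out):
            # Mean and variance of the Gaussian over this activation.
            mean = sum(a[i] * theta[i][j] for i in range(n_in))
            var = sum(a[i] ** 2 * math.exp(log_sigma2[i][j]) for i in range(n_in))
            # One independent noise draw per (example, output unit).
            row.append(mean + math.sqrt(var) * rng.gauss(0.0, 1.0))
        outputs.append(row)
    return outputs


rng = random.Random(0)
theta = [[0.5], [0.25]]
log_sigma2 = [[math.log(0.1)], [math.log(0.1)]]
print(sample_activations([[1.0, 2.0]], theta, log_sigma2, rng))
```

Each example in the batch receives its own noise draw, which is the source of the reduced gradient variance relative to sharing one set of sampled weights across the batch.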
Under the assumption of a log-uniform prior on the weights w, the KL divergence component of our objective function <<FORMULA>> can be accurately approximated (Molchanov et al., 2017): 

<<FORMULA>>

After training a model with variational dropout, the weights with the highest α values can be removed. For all their experiments, Molchanov et al. (2017) removed weights with log α larger than 3.0, which corresponds to a dropout rate greater than 95%. Although they demonstrated good results, it is likely that the optimal <<FORMULA>> threshold varies across different models and even different hyperparameter settings of the same model. We address this question in our experiments. 
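The link between the log α threshold and the dropout rate, together with the KL approximation, can be sketched as follows. The sketch assumes the Gaussian dropout relation α = p / (1 - p) and the approximation constants published by Molchanov et al. (2017); the function names and the small `eps` stabilizer are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_alpha(theta, log_sigma2, eps=1e-8):
    """log(alpha) = log(sigma^2) - log(theta^2) for a single weight."""
    return log_sigma2 - math.log(theta * theta + eps)


def dropout_rate(log_a):
    """Equivalent dropout rate p = alpha / (1 + alpha) = sigmoid(log alpha)."""
    return sigmoid(log_a)


def neg_kl_approx(log_a):
    """Per-weight approximation of -KL under a log-uniform prior."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    return k1 * sigmoid(k2 + k3 * log_a) - 0.5 * math.log1p(math.exp(-log_a)) - k1


def keep_mask(thetas, log_sigma2s, threshold=3.0):
    """Keep a weight unless its log(alpha) is larger than the threshold."""
    return [1 if log_alpha(t, s) <= threshold else 0
            for t, s in zip(thetas, log_sigma2s)]


print(round(dropout_rate(3.0), 4))            # dropout rate at the default threshold
print(keep_mask([1.0, 0.01], [-4.0, -4.0]))   # large-mean weight kept, tiny-mean weight pruned
```

Lowering the threshold removes more weights, which is the accuracy-for-sparsity trade explored in the experiments.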

A.3. l0 Regularization 
To optimize the l0-norm, we reparameterize the model weights θ as the product of a weight and a random variable drawn from the hard-concrete distribution. 

<<FORMULA>> where <<FORMULA>> and <<FORMULA>> 

In this formulation, the <<FORMULA>> parameter that controls the position of the hard-concrete distribution (and thus the probability that zj is zero) is optimized with gradient descent. <<FORMULA>> and <<FORMULA>> are fixed parameters that control the shape of the hard-concrete distribution. <<FORMULA>> controls the curvature or temperature of the hard-concrete probability density function, and <<FORMULA>> and <<FORMULA>> stretch the distribution s.t. zj takes value 0 or 1 with non-zero probability. 

On each training iteration, zj is sampled from this distribution and multiplied with the standard neural network weights. The expected l0-norm L_C can then be calculated using the cumulative distribution function of the hard-concrete distribution and optimized directly with stochastic gradient descent. 

<<FORMULA>> 

At test-time, Louizos et al. (2017b) use the following estimate for the model parameters. 

<<FORMULA>>

Interestingly, Louizos et al. (2017b) showed that their objective function under the l0 penalty is a special case of a variational lower-bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior. 

B. Variational Dropout Implementation Verification 

To verify our implementation of variational dropout, we applied it to LeNet-300-100 and LeNet-5-Caffe on MNIST and compared our results to the original paper (Molchanov et al., 2017). We matched our hyperparameters to those used in the code released with the paper (footnote 8). All results are listed in Table 3. 

Table 3. Variational Dropout MNIST Reproduction Results. 

<<TABLE>> 

8 https://github.com/ars-ashuha/variational-dropout-sparsifies-dnn 

<<FIGURE>>

Figure 7. Forward pass FLOPs for WRN-28-10 trained with l0 regularization. Our implementation achieves FLOPs reductions comparable to those reported in Louizos et al. (2017b). 

Our baseline LeNet-300-100 model achieved test set accuracy of 98.42%, slightly higher than the baseline of 98.36% reported by Molchanov et al. (2017). Applying our variational dropout implementation to LeNet-300-100 with these hyperparameters produced a model with 97.52% global sparsity and 98.42% test accuracy. The original paper produced a model with 98.57% global sparsity and 98.08% test accuracy. While our model achieves .34% higher test accuracy with 1% lower sparsity, we believe the discrepancy is mainly due to differences in our software packages: Molchanov et al. (2017) used Theano and Lasagne for their experiments, while we use TensorFlow. 
Given that our model achieves higher accuracy, we can decrease the log α threshold to trade accuracy for more sparsity. With a <<FORMULA>> threshold of 2.0, our model achieves 98.5% global sparsity with a test set accuracy of 98.40%. With a log α threshold of 0.1, our model achieves 99.1% global sparsity with 98.13% test set accuracy, exceeding the sparsity and accuracy of the originally published results. 
On LeNet-5-Caffe, our implementation achieved a global sparsity of 99.29% with a test set accuracy of 99.26%, versus the originally published results of 99.6% sparsity with 99.25% accuracy. Lowering the <<FORMULA>> threshold to 2.0, our model achieves 99.5% sparsity with 99.25% test accuracy. 

C. l0 Regularization Implementation Verification 

The original l0 regularization paper uses a modified version of the proposed technique for inducing group sparsity in models, so our weight-level implementation is not directly comparable. However, to verify our implementation we trained a Wide ResNet (WRN) (Zagoruyko & Komodakis, 2016) on CIFAR-10 and compared results to those reported in the original publication for group sparsity. 
As done by Louizos et al. (2017b), we apply l0 to the first convolutional layer in the residual blocks (i.e., where dropout would normally be used). We use the weight decay formulation for the re-parameterized weights, and scale the weight decay coefficient to maintain the same initial length scale of the parameters. We use the same batch size of 128 samples and the same initial <<FORMULA>>, and train our model on a single GPU. 
Our baseline WRN-28-10 implementation trained on CIFAR-10 achieved a test set accuracy of 95.45%. Using our l0 regularization implementation and an l0-norm weight of .0003, we trained a model that achieved 95.34% accuracy on the test set while achieving a consistent training-time FLOPs reduction comparable to that reported by Louizos et al. (2017b). Floating-point operations (FLOPs) required to compute the forward pass over the course of training WRN-28-10 with l0 are plotted in Figure 7. 
During our re-implementation of the WRN experiments from Louizos et al. (2017b), we identified errors in the original publication's FLOP calculations that caused the number of floating-point operations in WRN-28-10 to be miscalculated. We've contacted the authors, and hope to resolve this issue to clarify their performance results. 

D. Sparse Transformer Experiments 

D.1. Magnitude Pruning Details 
For our magnitude pruning experiments, we tuned four key hyperparameters: the starting iteration of the sparsification process, the ending iteration of the sparsification process, the frequency of pruning steps, and the combination of other regularizers (dropout and label smoothing) used during training. We trained models with 7 different target sparsities: 50%, 60%, 70%, 80%, 90%, 95%, and 98%. At each of these sparsity levels, we tried pruning frequencies of 1000 and 10000 steps. During preliminary experiments we identified that the best settings for the training step to stop pruning at were typically closer to the end of training. Based on this insight, we explored every possible combination of start and end points for the sparsity schedule in increments of 100000 steps with an ending step of 300000 or greater. 

By default, the Transformer uses dropout with a dropout rate of 10% on the input to the encoder, decoder, and before each layer and performs label smoothing with a smoothing parameter of <<FORMULA>>. We found that decreasing these other regularizers produced higher quality models in the mid to high sparsity range. For each hyperparameter combination, we tried three different regularization settings: standard label smoothing and dropout, label smoothing only, and no regularization. 

D.2. Variational Dropout Details 
For the Transformer trained with variational dropout, we extensively tuned the coefficient for the KL divergence component of the objective function to find models that achieved high accuracy with sparsity levels in the target range. We found that KL divergence weights in the range <<FORMULA>>, where N is the number of samples in the training set, produced models in our target sparsity range. Molchanov et al. (2017) noted difficulty training some models from scratch with variational dropout, as large portions of the model adopt high dropout rates early in training before the model can learn a useful representation from the data. To address this issue, they use a gradual ramp-up of the KL divergence weight, linearly increasing the regularizer coefficient until it reaches the desired value. 
For our experiments, we explored using a constant regularizer weight, linearly increasing the regularizer weight, and also increasing the regularizer weight following the cubic sparsity function used with magnitude pruning. For the linear and cubic weight schedules, we tried each combination of possible start and end points in increments of 100000 steps. For each hyperparameter combination, we also tried the three different combinations of dropout and label smoothing as with magnitude pruning. For each trained model, we evaluated the model with 11 <<FORMULA>> thresholds in the range [0, 5]. For all experiments, we initialized all <<FORMULA>> parameters to the constant value <<FORMULA>>. 
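The three regularizer weight schedules can be sketched as a single function of the training step. This is an illustrative sketch: the function name and signature are assumptions, and the cubic case mirrors the shape of the cubic sparsity ramp rather than reproducing any released implementation.

```python
def regularizer_weight(step, final_weight, schedule, begin_step=0, end_step=1):
    """Regularizer coefficient at a given training step."""
    if schedule not in ("constant", "linear", "cubic"):
        raise ValueError(f"unknown schedule: {schedule}")
    if schedule == "constant":
        return final_weight
    # Fraction of the ramp completed, clamped to [0, 1].
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))
    if schedule == "linear":
        return final_weight * progress
    # Cubic: rises quickly at first, then flattens toward the final value.
    return final_weight * (1.0 - (1.0 - progress) ** 3)


for step in (0, 100_000, 200_000, 300_000):
    print(step,
          regularizer_weight(step, 1.0, "linear", 100_000, 300_000),
          regularizer_weight(step, 1.0, "cubic", 100_000, 300_000))
```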

D.3. l0 Regularization Details 
For Transformers trained with l0 regularization, we similarly tuned the coefficient for the l0-norm in the objective function. We observed that much higher magnitude regularization coefficients were needed to produce models with the same sparsity levels relative to variational dropout. We found that l0-norm weights in the range <<FORMULA>> produced models in our target sparsity range. 
For all experiments, we used the default settings for the parameters of the hard-concrete distribution: <<FORMULA>>, and <<FORMULA>>. We initialized the <<FORMULA>> parameters to 2.197, corresponding to a 10% dropout rate. 
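The correspondence between the initial value 2.197 and a 10% dropout rate can be checked with a short calculation. The sketch assumes that the sigmoid of the initialized parameter is read as the probability of keeping a weight, which is an interpretation consistent with the stated figure rather than something spelled out in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


keep_probability = sigmoid(2.197)
dropout = 1.0 - keep_probability

print(round(dropout, 4))       # dropout rate implied by the initial value
print(round(logit(0.9), 3))    # initial value implied by a 90% keep probability
```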
For each hyperparameter setting, we explored the three regularizer coefficient schedules used with variational dropout and each of the three combinations of dropout and label smoothing. 

D.4. Random Pruning Details 
We identified in preliminary experiments that random pruning typically produces the best results by starting and ending pruning early and allowing the model to finish the rest of the training steps with the final sparse weight mask. For our experiments, we explored all hyperparameter combinations that we explored with magnitude pruning, and also included start/end pruning step combinations with an end step of less than 300000. 

E. Sparse ResNet-50

E.1. Learning Rate 
For all experiments, we used the learning rate scheme of the official TensorFlow ResNet-50 implementation (see footnote 9). With our batch size of 1024, this includes a linear ramp-up for 5 epochs to a learning rate of 0.4, followed by learning rate drops by a factor of 0.1 at epochs 30, 60, and 80. 
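The learning rate scheme above can be written as a small piecewise function. This is a sketch under stated assumptions: the function name is hypothetical, the warm-up is assumed to start from zero, and the epoch is treated as a continuous quantity.

```python
def resnet50_learning_rate(epoch, peak_lr=0.4, warmup_epochs=5,
                           drop_epochs=(30, 60, 80), drop_factor=0.1):
    """Learning rate at a (possibly fractional) epoch.

    Linear warm-up to peak_lr, then one multiplicative drop at each boundary.
    Starting the warm-up at zero is an assumption.
    """
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    # Count how many drop boundaries have been passed.
    num_drops = sum(1 for boundary in drop_epochs if epoch >= boundary)
    return peak_lr * drop_factor ** num_drops
```

For example, the rate is 0.4 between epochs 5 and 30, roughly 0.04 between epochs 30 and 60, and roughly 0.0004 after epoch 80.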

E.2. Magnitude Pruning Details 
For magnitude pruning on ResNet-50, we trained models with a target sparsity of 50%, 70%, 80%, 90%, 95%, and 98%. At each sparsity level, we tried starting pruning at steps 8k, 20k, and 40k. For each potential starting point, we tried ending pruning at steps 68k, 76k, and 100k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k steps and explored training with and without label smoothing. During preliminary experiments, we observed that removing weight decay from the model consistently caused significant decreases in test accuracy. Thus, for all hyperparameter combinations, we left weight decay on with the standard coefficient. 
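To give a sense of the size of this search, the sketch below enumerates the settings listed above. It assumes the search is a full cross product of the listed values, which the text suggests but does not state outright; the variable names are hypothetical.

```python
from itertools import product

sparsities = [0.50, 0.70, 0.80, 0.90, 0.95, 0.98]
start_steps = [8_000, 20_000, 40_000]
end_steps = [68_000, 76_000, 100_000]
pruning_frequencies = [2_000, 4_000, 8_000]
label_smoothing = [False, True]

# One dictionary per hyperparameter combination (assumed full cross product).
grid = [
    dict(sparsity=s, start=b, end=e, frequency=f, label_smoothing=ls)
    for s, b, e, f, ls in product(
        sparsities, start_steps, end_steps, pruning_frequencies, label_smoothing
    )
]
print(len(grid))  # 6 * 3 * 3 * 3 * 2
```

Under that assumption the grid contains 324 configurations, all trained with weight decay left on.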
For a target sparsity of 98%, we observed that very few hyperparameter combinations were able to complete training without failing due to numerical issues. Out of all the hyperparameter configurations we tried, only a single model was able to complete training without erroring from the presence of NaNs. As explained in the main text, at high sparsity levels the first layer of the model has very few non-zero parameters, leading to instability during training and low test set performance. Pruned ResNet-50 models with the first layer left dense did not exhibit these issues. 

E.3. Variational Dropout Details 
For variational dropout applied to ResNet-50, we explored the same combinations of start and end points for the KL-divergence weight ramp-up as we did for the start and end points of magnitude pruning. For all Transformer experiments, we did not observe a significant gain from using a cubic KL-divergence weight ramp-up schedule and thus only explored the linear ramp-up for ResNet-50. For each combination of start and end points for the KL-divergence weight, we explored 9 different coefficients for the KL-divergence loss term: 0.01/N, 0.03/N, 0.05/N, 0.1/N, 0.3/N, 0.5/N, 1/N, 10/N, and 100/N. 
Contrary to our experience with Transformer, we found ResNet-50 with variational dropout to be highly sensitive to the initialization for the <<FORMULA>> parameters. With the standard setting of -10, we could not match the baseline accuracy, and with an initialization of -20 our models achieved 

9 https://bit.ly/2Wd2Lk0 

good test performance but no sparsity. After some experimentation, we were able to produce good results with an initialization of -15. 
While with Transformer we saw a reasonable amount of variance in test set performance and sparsity with the same model evaluated at different log α thresholds, we did not observe the same phenomenon for ResNet-50. Across a range of log α values, we saw consistent accuracy and nearly identical sparsity levels. For all of the results reported in the main text, we used a <<FORMULA>> threshold of 0.5, which we found to produce slightly better results than the standard threshold of 3.0. 

E.4. l0 Regularization Details 
For l0 regularization, we explored four different initial <<FORMULA>> values corresponding to dropout rates of 1%, 5%, 10%, and 30%. For each dropout rate, we extensively tuned the l0-norm weight to produce models in the desired sparsity range. After identifying the proper range of l0-norm coefficients, we ran experiments with 20 different coefficients in that range. For each combination of these hyperparameters, we tried all four combinations of other regularizers: standard weight decay and label smoothing, only weight decay, only label smoothing, and no regularization. For weight decay, we used the formulation for the reparameterized weights provided in the original paper, and followed their approach of scaling the weight decay coefficient based on the initial dropout rate to maintain a constant length-scale between the l0 regularized model and the standard model. 
Across all of these experiments, we were unable to produce ResNet models that achieved a test set performance better than random guessing. For all experiments, we observed that training proceeded reasonably normally until the l0-norm loss began to drop, at which point the model incurred severe accuracy loss. We include the results of all hyperparameter combinations in our data release. 
Additionally, we tried a number of tweaks to the learning process to improve the results to no avail. We explored training the model for twice the number of epochs, training with much higher initial dropout rates, modifying the <<FORMULA>> parameter for the hard-concrete distribution, and a modified test-time parameter estimator. 

E.5. Random Pruning Details 
For random pruning on ResNet-50, we shifted the set of possible start and end points for pruning earlier in training relative to those we explored for magnitude pruning. At each of the sparsity levels tried with magnitude pruning, we tried starting pruning at step 0, 8k, and 20k. For each potential starting point, we tried ending pruning at steps 40k, 68k, and 76k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k and explored training with and without label smoothing. 

E.6. Scratch-B Learning Rate Variants 
For the scratch-b (Liu et al., 2018) experiments with ResNet-50, we explored four different learning rate schemes for the extended training time (2x the default number of epochs). 

The first learning rate scheme we explored was uniformly scaling each of the five learning rate regions to last for double the number of epochs. This setup produced the best results by a wide margin. We report these results in the main text. 
The second learning rate scheme was to keep the standard learning rate, and maintain the final learning rate for the extra training steps as is common when fine-tuning deep neural networks. The third learning rate scheme was to maintain the standard learning rate, and continually drop the learning rate by a factor of 0.1 every 30 epochs. The last scheme we explored was to skip the learning rate warm-up, and drop the learning rate by 0.1 every 30 epochs. This learning rate scheme is closest to the one used by Liu et al. (2018). We found that this scheme underperformed relative to the scaled learning rate scheme with our training setup. 
Results for all learning rate schemes are included with the released hyperparameter tuning data. 
<|endoftext|>


<|startoftext|>
                            NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications

                              Tien-Ju Yang 1⋆[0000−0003−4728−0321], Andrew Howard 2, Bo Chen 2,
                          Xiao Zhang 2, Alec Go 2, Mark Sandler 2, Vivienne Sze 1, and Hartwig Adam 2

                                         1 Massachusetts Institute of Technology
                                                   2 Google Inc.
                          {tjy,sze}@mit.edu,{howarda,bochen,andypassion,ago,sandler,hadam}@google.com


                                                Abstract.

                              This work proposes an algorithm, called NetAdapt, that 
                              automatically adapts a pre-trained deep neural network to a mobile platform
                              given a resource budget. While many existing algorithms simplify
                              networks based on the number of MACs or weights, optimizing those
                              indirect metrics may not necessarily reduce the direct metrics, such as
                              latency and energy consumption. To solve this problem, NetAdapt
                              incorporates direct metrics into its adaptation algorithm. These direct metrics
                              are evaluated using empirical measurements, so that detailed knowledge
                              of the platform and tool chain is not required. NetAdapt automatically
                              and progressively simplifies a pre-trained network until the resource budget
                              is met while maximizing the accuracy. Experimental results show that
                              NetAdapt achieves better accuracy versus latency tradeoffs on both 
                              mobile CPU and mobile GPU, compared with the state-of-the-art automated
                              network simpliﬁcation algorithms. For image classiﬁcation on the
                              ImageNet dataset, NetAdapt achieves up to a 1.7× speedup in measured
                              inference latency with equal or higher accuracy on MobileNets (V1&V2).


                        1 Introduction

                        Deep neural networks (DNNs or networks) have become an indispensable component
                        of artificial intelligence, delivering near or super-human accuracy on common
                        vision tasks such as image classification and object detection. However,
                        DNN-based AI applications are typically too computationally intensive to be
                        deployed on resource-constrained platforms, such as mobile phones. This hinders
                        the enrichment of a large set of user experiences.
                           A signiﬁcant amount of recent work on DNN design has focused on improving
                        the eﬃciency of networks. However, the majority of works are based on optimizing
                        the “indirect metrics”, such as the number of multiply-accumulate operations
                        (MACs) or the number of weights, as proxies for the resource consumption of
                        a network. Although these indirect metrics are convenient to compute and 
                        integrate into the optimization framework, they may not be good approximations
                        to the “direct metrics” that matter for the real applications such as latency

                         <<FIGURE>>

                      Fig. 1. NetAdapt automatically adapts a pretrained network to a mobile platform
                      given a resource budget. This algorithm is guided by the direct metrics for resource
                      consumption. NetAdapt eliminates the requirement of platform-specific knowledge by
                      using empirical measurements to evaluate the direct metrics. At each iteration,
                      NetAdapt generates many network proposals and measures the proposals on the target
                      platform. The measurements are used to guide NetAdapt to generate the next set of
                      network proposals at the next iteration.


                      and energy consumption. The relationship between an indirect metric and the
                      corresponding direct metric can be highly non-linear and platform-dependent as
                      observed by [15, 25, 26]. In this work, we will also demonstrate empirically that
                      a network with fewer MACs can be slower when actually running
                      on mobile devices; specifically, we will show that a network with 19% fewer MACs
                      incurs 29% longer latency in practice (see Table 1).
                        There are two common approaches to designing eﬃcient network architectures.
                      The ﬁrst is designing a single architecture with no regard to the underlying
                      platform. It is hard for a single architecture to run optimally on all the platforms
                      due to the diﬀerent platform characteristics. For example, the fastest architecture
                      on a desktop GPU may not be the fastest one on a mobile CPU with the
                      same accuracy. Moreover, there is little guarantee that the architecture could
                      meet the resource budget (e.g., latency) on all platforms of interest. The second
                      approach is manually crafting architectures for a given target platform based
                      on the platform’s characteristics. However, this approach requires deep knowledge
                      about the implementation details of the platform, including the toolchains,
                      the conﬁguration and the hardware architecture, which are generally unavailable
                      given the proprietary nature of hardware and the high complexity of modern
                      systems. Furthermore, manually designing a different architecture for each platform
                      can be taxing for researchers and engineers.
                        In this work, we propose a platform-aware algorithm, called NetAdapt, to
                      address the aforementioned issues and facilitate platform-specific DNN deployment.
                      NetAdapt (Fig. 1) incorporates direct metrics in the optimization loop, so
                      it does not suﬀer from the discrepancy between the indirect and direct metrics.
                      The direct metrics are evaluated by the empirical measurements taken from the
                      target platform. This enables the algorithm to support any platform without
                      detailed knowledge of the platform itself, although such knowledge could still be
                      incorporated into the algorithm to further improve results. In this paper, we use
                      latency as the running example of a direct metric and resource to target even
                      though our algorithm is generalizable to other metrics or a combination of them
                      (Sec. 4.3).
                        The network optimization of NetAdapt is carried out in an automatic way to
                      gradually reduce the resource consumption of a pretrained network while
                      maximizing the accuracy. The optimization runs iteratively until the resource budget
                      is met. Through this design, NetAdapt can generate not only a network that
                      meets the budget, but also a family of simplified networks with different
                      trade-offs, which allows dynamic network selection and further study. Finally,
                      instead of being a black box, NetAdapt is designed to be easy to interpret. For
                      example, through studying the proposed network architectures and the corresponding
                      empirical measurements, we can understand why a proposal is chosen and this
                      sheds light on how to improve the platform and network design.
                        The main contributions of this paper are:
                       - A framework that uses direct metrics when optimizing a pretrained network
                         to meet a given resource budget. Empirical measurements are used to evaluate
                         the direct metrics such that no platform-specific knowledge is required.
                       - An automated constrained network optimization algorithm that maximizes
                         accuracy while satisfying the constraints (i.e., the predefined resource
                         budget). The algorithm outperforms the state-of-the-art automatic network
                         simplification algorithms by up to 1.7× in terms of reduction in measured
                         inference latency while delivering equal or higher accuracy. Moreover, a family
                         of simplified networks with different trade-offs is generated to allow
                         dynamic network selection and further study.
                       - Experiments that demonstrate the effectiveness of NetAdapt on different
                         platforms and on real-time-class networks, such as the small MobileNetV1,
                         which is more difficult to simplify than larger networks.


                      2 Related Work

                      There is a large body of work that aims to simplify DNNs. We refer the readers
                      to [21] for a comprehensive survey, and summarize the main approaches below.
                        The most related works are pruning-based methods. [6, 14, 16] aim to remove
                      individual redundant weights from DNNs. However, most platforms cannot fully
                      take advantage of unstructured sparse ﬁlters [26]. Hu et al. [10] and Srinivas et
                      al. [20] focus on removing entire filters instead of individual weights. The
                      drawback of these methods is the requirement of manually choosing the compression
                      rate for each layer. MorphNet [5] leverages the sparsifying regularizers to
                      automatically determine the layerwise compression rate. ADC [8] uses reinforcement        
                      learning to learn a policy for choosing the compression rates. The crucial 
                      difference between all the aforementioned methods and ours is that they are not
                      guided by the direct metrics, and thus may lead to sub-optimal performance, as
                      we see in Sec. 4.3.
                        Energy-aware pruning [25] uses an energy model [24] and incorporates the
                      estimated energy numbers into the pruning algorithm. However, this requires
                      designing models to estimate the direct metrics of each target platform, which
                      requires detailed knowledge of the platform including its hardware architecture [3],
                      and the network-to-array mapping used in the toolchain [2]. NetAdapt does not
                      have this requirement since it can directly use empirical measurements.
                        DNNs can also be simpliﬁed by approaches that involve directly designing 
                      efﬁcient network architectures, decomposition or quantization. MobileNets [9, 18]
                      and ShuffleNets [27] provide efficient layer operations and reference architecture
                      design. Layer-decomposition-based algorithms [13, 23] exploit matrix 
                      decomposition to reduce the number of operations. Quantization [11, 12, 17] reduces
                      the complexity by decreasing the computation accuracy. The proposed
                      algorithm, NetAdapt, is complementary to these methods. For example, NetAdapt
                      can adapt MobileNets to further push the frontier of eﬃcient networks as shown
                      in Sec. 4 even though MobileNets are more compact and much harder to simplify
                      than the other larger networks, such as VGG [19].

                      3 Methodology: NetAdapt

                      We propose an algorithm, called NetAdapt, that will allow a user to automatically
                      simplify a pretrained network to meet the resource budget of a platform
                      while maximizing the accuracy. NetAdapt is guided by direct metrics for resource
                      consumption, and the direct metrics are evaluated by using empirical measurements,
                      thus eliminating the requirement of detailed platform-speciﬁc knowledge.

                      3.1 Problem Formulation
                      NetAdapt aims to solve the following non-convex constrained problem:

                                  <<FORMULA>>                                      (1)

                      where Net is a simpliﬁed network from the initial pretrained network, <<FORMULA>>
                      computes the accuracy, <<FORMULA>> evaluates the direct metric for resource
                      consumption of the jth resource, and <<FORMULA>> is the budget of the jth resource and
                      the constraint on the optimization. The resource can be latency, energy, memory
                      footprint, etc., or a combination of these metrics.
                        Based on an idea similar to progressive barrier methods [1], NetAdapt breaks
                      this problem into the following series of easier problems and solves it iteratively:

                            <<FORMULA>>                                             (2)
                            

                       Algorithm 1: NetAdapt
                       
                        <<ALGORITHM>>

                      where <<FORMULA>> is the network generated by the ith iteration, and Net_0 is the initial
                      pretrained network. As the number of iterations increases, the constraints (i.e.,
                      current resource budget <<FORMULA>>) gradually become tighter. <<FORMULA>>,
                      which is larger than zero, indicates how much the constraint tightens for the jth
                      resource in the ith iteration and can vary from iteration to iteration. This is
                      referred to as the “resource reduction schedule”, which is similar to the concept
                      of a learning rate schedule. The algorithm terminates when Res <<FORMULA>>
                      is equal to or smaller than Bud_j for every resource type. It outputs the final
                      adapted network and can also generate a sequence of simpliﬁed networks (i.e.,
                      the highest accuracy network from each iteration <<FORMULA>>) to provide the
                      eﬃcient frontier of accuracy and resource consumption trade-oﬀs.
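
                      Written out from the definitions in this section, the two problems plausibly take
                      the following form. This is a reconstruction in LaTeX rather than a verbatim copy:
                      the number of resource types m and the symbol \Delta R_{i,j} for the amount by
                      which the jth constraint tightens in the ith iteration are notational assumptions.

```latex
% Eq. (1): the overall constrained problem
\begin{aligned}
\underset{Net}{\text{maximize}} \quad & Acc(Net) \\
\text{subject to} \quad & Res_j(Net) \le Bud_j, \quad j = 1, \dots, m
\end{aligned}

% Eq. (2): the easier problem solved in the i-th iteration
\begin{aligned}
\underset{Net_i}{\text{maximize}} \quad & Acc(Net_i) \\
\text{subject to} \quad & Res_j(Net_i) \le Res_j(Net_{i-1}) - \Delta R_{i,j}, \quad j = 1, \dots, m
\end{aligned}
```

                      In this form, the termination condition above corresponds to the right-hand side
                      of the constraint in Eq. 2 falling to or below Bud_j for every j.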

                      3.2 Algorithm Overview

                      For simplicity, we assume that we only need to meet the budget of one resource,
                      speciﬁcally latency. One method to reduce the latency is to remove ﬁlters from
                      the convolutional (CONV) or fully-connected (FC) layers. While there are other
                      ways to reduce latency, we will use this approach to demonstrate NetAdapt.
                        The NetAdapt algorithm is detailed in pseudo code in Algorithm 1 and in
                      Fig. 2. Each iteration solves Eq. 2 by reducing the number of ﬁlters in a single
                      CONV or FC layer (the Choose # of Filters and Choose Which Filters
                      blocks in Fig. 2). The number of ﬁlters to remove from a layer is guided by
                      empirical measurements. NetAdapt removes entire ﬁlters instead of individual
                      weights because most platforms can take advantage of removing entire ﬁlters,                      

                                        <<FIGURE>>

                      Fig. 2. This figure visualizes the algorithm flow of NetAdapt. At each iteration,
                      NetAdapt decreases the resource consumption by simplifying (i.e., removing filters from)
                      one layer. In order to maximize accuracy, it tries to simplify each layer individually
                      and picks the simpliﬁed network that has the highest accuracy. Once the target budget
                      is met, the chosen network is then ﬁne-tuned again until convergence.

                      and this strategy allows reducing both ﬁlters and feature maps, which play an
                      important role in resource consumption [25]. The simpliﬁed network is then
                      ﬁne-tuned for a short length of time in order to restore some accuracy (the
                      Short-Term Fine-Tune block).
                        In each iteration, the previous three steps (highlighted in bold) are applied on
                      each of the CONV or FC layers individually (see footnote 3). As a result, NetAdapt generates
                      K (i.e., the number of CONV and FC layers) network proposals in one iteration,
                      each of which has a single layer modiﬁed from the previous iteration. The network
                      proposal with the highest accuracy is carried over to the next iteration (the
                      Pick Highest Accuracy block). Finally, once the target budget is met, the
                      chosen network is fine-tuned again until convergence (the Long-Term Fine-Tune block).
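
                      The iteration described in this section can be sketched as a short loop. This is
                      a minimal illustration for a single resource, not the authors' implementation:
                      every argument name is hypothetical, the four callables stand in for the blocks
                      of Fig. 2, and long-term fine-tuning is left to the caller.

```python
def netadapt(net, budget, num_layers, reduction, decay,
             measure, simplify_layer, short_term_finetune, accuracy):
    """Sketch of the NetAdapt outer loop for one resource (e.g., latency).

    measure(net)                   -> resource consumption of a network
    simplify_layer(net, k, target) -> network with layer k simplified so that the
                                      constraint `target` is met, or None if impossible
    short_term_finetune(net)       -> briefly fine-tuned network
    accuracy(net)                  -> accuracy used to rank the proposals
    """
    history = []
    while measure(net) > budget:
        target = measure(net) - reduction      # tightened constraint for this iteration
        proposals = []
        for k in range(num_layers):            # one proposal per layer
            candidate = simplify_layer(net, k, target)
            if candidate is not None:
                proposals.append(short_term_finetune(candidate))
        if not proposals:
            raise RuntimeError("no layer can satisfy the current constraint")
        net = max(proposals, key=accuracy)     # pick the highest-accuracy proposal
        history.append(net)
        reduction *= decay                     # resource reduction schedule
    return net, history                        # caller applies long-term fine-tuning
```

                      The returned history holds the chosen network from each iteration, which is the
                      family of simplified networks with different trade-offs.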


                      3.3 Algorithm Details

                      This section describes the key blocks in the NetAdapt algorithm (Fig. 2).
                        Choose Number of Filters. This step focuses on determining how many
                      ﬁlters to preserve in a speciﬁc layer based on empirical measurements. NetAdapt
                      gradually reduces the number of ﬁlters in the target layer and measures the
                      resource consumption of each of the simpliﬁed networks. The maximum number
                      3 The algorithm can also be applied to a group of multiple layers as a single unit
                        (instead of a single layer). For example, in ResNet [7], we can treat a residual block
                        as a single unit to speed up the adaptation process.                                                                  

                                            <<FIGURE>>

                      Fig. 3. This figure illustrates how layer-wise look-up tables are used for fast resource
                      consumption estimation.


                      of ﬁlters that can satisfy the current resource constraint will be chosen. Note
                      that when some ﬁlters are removed from a layer, the associated channels in the
                      following layers should also be removed. Therefore, the change in the resource
                      consumption of other layers needs to be factored in.
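
                      A minimal sketch of this step is given below. The function and argument names
                      are hypothetical; latency_with(n) is assumed to return the network-wise resource
                      consumption when the target layer keeps n filters, including the change in the
                      following layers.

```python
def choose_num_filters(current_filters, latency_with, target_latency):
    """Return the largest filter count whose latency meets the constraint.

    latency_with(n) must account for the whole network, since removing filters
    from one layer also removes the matching channels of the following layers.
    Returns None when even a single filter cannot meet the constraint.
    """
    for n in range(current_filters, 0, -1):   # gradually reduce the filter count
        if latency_with(n) <= target_latency:
            return n
    return None
```

                      Scanning from the current count downwards returns the maximum number of filters
                      that satisfies the constraint, so no more filters are removed than necessary.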
                        Choose Which Filters. This step chooses which filters to preserve based on
                      the architecture from the previous step. There are many methods proposed in
                      the literature, and we choose the magnitude-based method to keep the algorithm
                      simple. In this work, the N filters that have the largest ℓ2-norm magnitude will
                      be kept, where N is the number of filters determined by the previous step. More
                      complex methods can be adopted to increase the accuracy, such as removing the
                      ﬁlters based on their joint inﬂuence on the feature maps [25].
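
                      The magnitude-based selection can be sketched in a few lines. The example below
                      uses plain Python lists and a hypothetical function name; each filter is assumed
                      to be given as a flat list of its weights.

```python
import math

def keep_largest_l2_filters(filters, num_keep):
    """Return the indices of the num_keep filters with the largest l2-norm.

    filters: list of flat weight lists, one per output filter.
    The indices are returned in ascending order so the layer layout is preserved.
    """
    norms = [math.sqrt(sum(w * w for w in weights)) for weights in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:num_keep])
```

                      The kept indices also identify which input channels of the following layer
                      remain, since the other channels are removed together with their filters.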
                        Short-/Long-Term Fine-Tune. Both the short-term fine-tune and long-term
                      fine-tune steps in NetAdapt involve network-wise end-to-end fine-tuning.
                      Short-term ﬁne-tune has fewer iterations than long-term ﬁne-tune.
                        At each iteration of the algorithm, we ﬁne-tune the simpliﬁed networks with
                      a relatively smaller number of iterations (i.e., short-term) to regain accuracy, in
                      parallel or in sequence. This step is especially important while adapting small
                      networks with a large resource reduction because otherwise the accuracy will
                      drop to zero, which can cause the algorithm to choose the wrong network proposal.
                        As the algorithm proceeds, the network is continuously trained but does not
                      converge. Once the ﬁnal adapted network is obtained, we ﬁne-tune the network
                      with more iterations until convergence (i.e., long-term) as the ﬁnal step.


                      3.4 Fast Resource Consumption Estimation

                      As mentioned in Sec. 3.3, NetAdapt uses empirical measurements to determine
                      the number of ﬁlters to keep in a layer given the resource constraint. In theory,
                      we can measure the resource consumption of each of the simpliﬁed networks
                      on the ﬂy during adaptation. However, taking measurements can be slow and
                      diﬃcult to parallelize due to the limited number of available devices. Therefore,
                      it may be prohibitively expensive and become the computation bottleneck.                      
                      
                                            <<FIGURE>>

                      Fig. 4. The comparison between the estimated latency (using layer-wise look-up tables)
                      and the real latency on a single large core of Google Pixel 1 CPU while adapting the
                      100% MobileNetV1 with the input resolution of 224 [9].


                        We solve this problem by building layer-wise look-up tables with pre-measured
                      resource consumption of each layer. When executing the algorithm, we look up
                      the table of each layer, and sum up the layer-wise measurements to estimate
                      the network-wise resource consumption, which is illustrated in Fig. 3. The
                      reason for not using a network-wise table is that the size of the table will grow
                      exponentially with the number of layers, which makes it intractable for deep
                      networks. Moreover, layers with the same shape and feature map size, which are
                      common in modern deep networks, only need to be measured once.
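
                      The look-up-table estimate amounts to a sum over layers. The sketch below assumes
                      a hypothetical table keyed by (layer name, input channels, number of filters);
                      the key layout and the latency values are illustrative only.

```python
def estimate_latency(layer_configs, lookup_table):
    """Estimate network-wise latency as the sum of pre-measured layer latencies.

    layer_configs: one table key per layer of the candidate network.
    lookup_table:  maps a key to the latency measured once for that layer shape.
    """
    return sum(lookup_table[config] for config in layer_configs)

# Illustrative table: each entry would be measured once on the target device.
table = {
    ("conv1", 3, 32): 1.5,
    ("conv1", 3, 24): 1.25,
    ("conv2", 32, 64): 3.0,
    ("conv2", 24, 64): 2.25,
}
# Removing filters from conv1 also shrinks the input channels of conv2.
print(estimate_latency([("conv1", 3, 24), ("conv2", 24, 64)], table))
```

                      A table per layer grows with the number of shapes of that layer, whereas a single
                      network-wise table would need one entry per combination of shapes across layers.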
                        Fig. 4 compares the estimated latency (the sum of layer-wise latency from the
                      layer-wise look-up tables) and the real latency on a single large core of Google
                      Pixel 1 CPU while adapting the 100% MobileNetV1 with the input resolution of
                      224 [9]. The real and estimated latency numbers are highly correlated, and the
                      diﬀerence between them is suﬃciently small to be used by NetAdapt.


                      4 Experiment Results

                      In this section, we apply the proposed NetAdapt algorithm to MobileNets [9, 18],
                      which are designed for mobile applications, and experiment on the ImageNet
                      dataset [4]. We did not apply NetAdapt on larger networks like ResNet [7] and
                      VGG [19] because networks become more diﬃcult to simplify as they become
                      smaller; these networks are also seldom deployed on mobile platforms. We benchmark
                      NetAdapt against three state-of-the-art network simpliﬁcation methods:
                       - Multipliers [9] are simple but effective methods for simplifying networks.
                         Two commonly used multipliers are the width multiplier and the resolution
                         multiplier; they can also be used together. The width multiplier scales the
                         number of filters by a percentage across all convolutional (CONV) and
                         fully-connected (FC) layers, and the resolution multiplier scales the resolution
                         of the input image. We use the notation “50% MobileNetV1 (128)” to denote
                         applying a width multiplier of 50% on MobileNetV1 with the input image
                         resolution of 128.
                       - MorphNet [5] is an automatic network simplification algorithm based on sparsifying regularization.
                       - ADC [8] is an automatic network simplification algorithm based on reinforcement learning.

                        We will show the performance of NetAdapt on the small MobileNetV1 (50%
                      MobileNetV1 (128)) to demonstrate the effectiveness of NetAdapt on
                      real-time-class networks, which are much more difficult to simplify than larger networks.
                      To show the generality of NetAdapt, we will also measure its performance on
                      the large MobileNetV1 (100% MobileNetV1 (224)) across diﬀerent platforms.
                      Lastly, we adapt the large MobileNetV2 (100% MobileNetV2 (224)) to push the
                      frontier of eﬃcient networks.


                      4.1 Detailed Settings for MobileNetV1 Experiments

                      We perform most of the experiments and study on MobileNetV1 and detail the
                      settings in this section.
                        NetAdapt Conﬁguration. MobileNetV1 [9] is based on depthwise separable
                      convolutions, which factorize an m×m standard convolution layer into an m×m
                      depthwise layer and a 1×1 standard convolution layer called a pointwise layer. In
                      the experiments, we adapt each depthwise layer with the corresponding pointwise
                      layer and choose the ﬁlters to keep based on the pointwise layer. When adapting
                      the small MobileNetV1 (50% MobileNetV1 (128)), the latency reduction (<<FORMULA>>
                      in Eq. 2) starts at 0.5 and decays at the rate of 0.96 per iteration. When adapting
                      other networks, we use the same decay rate but scale the initial latency reduction
                      proportional to the latency of the initial pretrained network.
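
As a minimal sketch of the schedule described above, the following assumes that "decays at the rate of 0.96 per iteration" means a geometric decay (each reduction is 0.96 times the previous one) and that "proportional" scaling is a simple ratio of initial latencies. The function names are hypothetical.

```python
def latency_reduction_schedule(initial_reduction, decay, num_iterations):
    """Per-iteration latency reductions under an assumed geometric decay."""
    return [initial_reduction * decay ** i for i in range(num_iterations)]


def scaled_initial_reduction(base_reduction, base_latency, new_latency):
    """Scale the initial reduction in proportion to the network's initial latency."""
    return base_reduction * new_latency / base_latency


# Small MobileNetV1: initial reduction 0.5, decay 0.96.
small_schedule = latency_reduction_schedule(0.5, 0.96, 3)

# A network with 8x the latency of the small one gets an 8x larger initial reduction.
# The latencies 1.0 and 8.0 are placeholders that only encode the 8x ratio.
large_initial = scaled_initial_reduction(0.5, base_latency=1.0, new_latency=8.0)
large_schedule = latency_reduction_schedule(large_initial, 0.96, 3)
print(small_schedule, large_schedule)
```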
                        Network Training. We preserve ten thousand images from the training
                      set, ten images per class, as the holdout set. The new training set without the
                      holdout images is used to perform short-term ﬁne-tuning, and the holdout set is
                      used to pick the highest accuracy network out of the simpliﬁed networks at each
                      iteration. The whole training set is used for the long-term ﬁne-tuning, which is
                      performed once in the last step of NetAdapt.
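
The holdout construction can be sketched as a per-class sample. This is an illustrative version only: the function name, the fixed seed, and the toy label list are assumptions, and the source does not say how the ten images per class were chosen.

```python
import random
from collections import defaultdict


def split_holdout(labels, per_class, seed=0):
    """Return (train_indices, holdout_indices) with `per_class` holdout items per label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    holdout = []
    for label in sorted(by_class):
        # Assumption: uniform random choice within each class.
        holdout.extend(rng.sample(by_class[label], per_class))

    holdout_set = set(holdout)
    train = [i for i in range(len(labels)) if i not in holdout_set]
    return train, holdout


# Toy data: 4 classes with 25 images each (ImageNet would have 1000 classes).
labels = [c for c in range(4) for _ in range(25)]
train_idx, holdout_idx = split_holdout(labels, per_class=10)
print(len(train_idx), len(holdout_idx))
```

With one thousand classes and ten images per class, the same procedure yields the ten-thousand-image holdout set described above.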
                        Because the training conﬁguration can have a large impact on the accuracy,
                      we apply the same training conﬁguration to all the networks unless otherwise
                      stated to have a fairer comparison. We adopt the same training conﬁguration as
                      MorphNet [5] (except that the batch size is 128 instead of 96). The learning rate
                      for the long-term ﬁne-tuning is 0.045 and that for the short-term ﬁne-tuning is
                      0.0045. This conﬁguration improves ADC network’s top-1 accuracy by 0.3% and
                      almost all multiplier networks’ top-1 accuracy by up to 3.8%, except for one data
                      point, whose accuracy is reduced by 0.2%. We use these numbers in the following
                      analysis. Moreover, all accuracy numbers are reported on the validation set to
                      show the true performance.
                        Mobile Inference and Latency Measurement. We use Google’s TensorFlow Lite
                      engine [22] for inference on a mobile CPU and Qualcomm’s Snapdragon Neural
                      Processing Engine (SNPE) for inference on a mobile GPU. For experiments on
                      mobile CPUs, the latency is measured on a single large core of a

                                      <<FIGURE>>

                      Fig. 5. The ﬁgure compares NetAdapt (adapting the small MobileNetV1) with the
                      multipliers [9] and MorphNet [5] on a mobile CPU of Google Pixel 1.


                      Google Pixel 1 phone. For experiments on mobile GPUs, the latency is measured
                      on the mobile GPU of Samsung Galaxy S8 with SNPE’s benchmarking tool. For
                      each latency number, we report the median of 11 latency measurements.
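
Reporting the median of repeated measurements can be sketched as follows. This is a host-side stand-in using Python's standard timer, not the on-device TensorFlow Lite or SNPE benchmarking used in the experiments; `median_latency_ms` and `dummy_inference` are hypothetical names.

```python
import statistics
import time


def median_latency_ms(run_inference, num_runs=11):
    """Time `run_inference` num_runs times and return the median in milliseconds."""
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(samples)


def dummy_inference():
    """Placeholder workload standing in for one forward pass."""
    sum(i * i for i in range(10_000))


latency = median_latency_ms(dummy_inference)
print(f"median latency: {latency:.3f} ms")
```

An odd number of runs, such as 11, makes the median a single observed measurement rather than an average of two.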

                      4.2 Comparison with Benchmark Algorithms
                      Adapting Small MobileNetV1 on a Mobile CPU. In this experiment, we
                      apply NetAdapt to adapt the small MobileNetV1 (50% MobileNetV1 (128)) to
                      a mobile CPU. It is one of the most compact networks and achieves real-time
                      performance. It is more challenging to simplify than other larger networks
                      (including the large MobileNetV1). The results are summarized and compared with
                      the multipliers [9] and MorphNet [5] in Fig. 5. We observe that NetAdapt is up
                      to 1.7× faster than the multipliers with the same or higher accuracy.
                      Compared with MorphNet, NetAdapt’s result is 1.6× faster with 0.3% higher accuracy.

                      Adapting Large MobileNetV1 on a Mobile CPU. In this experiment, we
                      apply NetAdapt to adapt the large MobileNetV1 (100% MobileNetV1 (224))
                      on a mobile CPU. It is the largest MobileNetV1 and achieves the highest
                      accuracy. Because its latency is approximately 8× higher than that of the small
                      MobileNetV1, we scale the initial latency reduction by 8×. The results are shown
                      and compared with the multipliers [9] and ADC [8] in Fig. 6. NetAdapt achieves
                      higher accuracy than the multipliers and ADC while increasing the speed by
                      1.4× and 1.2×, respectively.
                        While the training conﬁguration is kept the same when comparing to the
                      benchmark algorithms discussed above, we also show in Fig. 6 that the accuracy
                      of the networks adapted using NetAdapt can be further improved with a better
                      training conﬁguration. After simply adding dropout and label smoothing, the
                      accuracy can be increased by 1.3%. Further tuning the training conﬁguration
                      for each adapted network can give higher accuracy numbers, but it is not the
                      focus of this paper.                                                                   
                      
                                                        <<FIGURE>>

                      Fig. 6. The ﬁgure compares NetAdapt (adapting the large MobileNetV1) with the
                      multipliers [9] and ADC [8] on a mobile CPU of Google Pixel 1. Moreover, the accuracy
                      of the adapted networks can be further increased by up to 1.3% through using a better
                      training conﬁguration (simply adding dropout and label smoothing).

                            <<FIGURE>>

                      Fig. 7. This ﬁgure compares NetAdapt (adapting the large MobileNetV1) with the
                      multipliers [9] and ADC [8] on a mobile GPU of Samsung Galaxy S8. Moreover, the
                      accuracy of the adapted networks can be further increased by up to 1.3% through using
                      a better training conﬁguration (simply adding dropout and label smoothing).


                      Adapting Large MobileNetV1 on a Mobile GPU. In this experiment, we
                      apply NetAdapt to adapt the large MobileNetV1 on a mobile GPU to show the
                      generality of NetAdapt. Fig. 7 shows that NetAdapt outperforms other benchmark
                      algorithms with up to a 1.2× speed-up at higher accuracy. Due to the
                      limitation of the SNPE tool, the layerwise latency breakdown only considers the
                      computation time and does not include the latency of other operations, such as
                      feature map movement, which can be expensive [25]. This aﬀects the precision
                      of the look-up tables used for this experiment. Moreover, we observe a
                      non-reducible latency of approximately 6.2 ms (38% of the latency of the network
                      before applying NetAdapt). These factors cause a smaller improvement on
                      the mobile GPU compared with the experiments on the mobile CPU. Moreover,
                      when the better training conﬁguration is applied as previously described, the
                      accuracy can be further increased by 1.3%.                      

                                      <<FIGURE>>                           <<FIGURE>>

                      Fig. 8. The accuracy of diﬀerent short-term ﬁne-tuning iterations when adapting
                      the small MobileNetV1 (without long-term ﬁne-tuning) on a mobile CPU of
                      Google Pixel 1. Zero iterations means no short-term ﬁne-tuning.

                      Fig. 9. The comparison between before and after long-term ﬁne-tuning when
                      adapting the small MobileNetV1 on a mobile CPU of Google Pixel 1. Although the
                      short-term ﬁne-tuning preserves the accuracy well, the long-term ﬁne-tuning gives
                      the extra 3.4% on average (from 1.8% to 4.5%).


                      4.3 Ablation Studies
                      Impact of Direct Metrics. In this experiment, we use the indirect metric (i.e.,
                      the number of MACs) instead of the direct metric (i.e., the latency) to guide
                      NetAdapt to investigate the importance of using direct metrics. When computing
                      the number of MACs, we only consider the CONV and FC layers because batch
                      normalization layers can be folded into the corresponding CONV layers, and the
                      other layers are negligibly small. Table 1 shows that NetAdapt outperforms the
                      benchmark algorithms with lower numbers of MACs and higher accuracy. This
                      demonstrates the eﬀectiveness of NetAdapt. However, we also observe that the
                      network with lower numbers of MACs may not necessarily be faster. This shows
                      the necessity of incorporating direct measurements into the optimization ﬂow.
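
To make the indirect metric concrete, the sketch below counts MACs for CONV and FC layers only, as described above, and compares a standard convolution with its depthwise separable factorization from Sec. 4.1. The layer shape is hypothetical and the helper names are assumptions.

```python
def conv_macs(out_h, out_w, kernel, in_channels, out_channels):
    """MACs of a standard kernel x kernel convolution."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels


def depthwise_macs(out_h, out_w, kernel, channels):
    """MACs of a depthwise convolution: one kernel x kernel filter per channel."""
    return out_h * out_w * kernel * kernel * channels


def fc_macs(in_features, out_features):
    """MACs of a fully-connected layer."""
    return in_features * out_features


# Hypothetical layer: 14x14 output, 3x3 kernel, 256 input and 256 output channels.
standard = conv_macs(14, 14, 3, 256, 256)
separable = depthwise_macs(14, 14, 3, 256) + conv_macs(14, 14, 1, 256, 256)
print(standard, separable, standard / separable)
```

A count like this depends only on layer shapes, which is why two networks with the same number of MACs can still differ in measured latency on a given platform.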

                      Impact of Short-Term Fine-Tuning. Fig. 8 shows the accuracy of adapting
                      the small MobileNetV1 with diﬀerent short-term ﬁne-tuning iterations (without
                      long-term ﬁne-tuning). The accuracy rapidly drops to nearly zero if no
                      short-term ﬁne-tuning is performed (i.e., zero iterations). In this low accuracy region,
                      the algorithm picks the best network proposal solely based on noise and hence

                                  <<FIGURE>>

                      Fig. 10. NetAdapt and the multipliers generate diﬀerent simpliﬁed networks when
                      adapting the small MobileNetV1 to match the latency of 25% MobileNetV1 (128).


                      gives poor performance. After ﬁne-tuning a network for a short amount of time
                      (ten thousand iterations), the accuracy is always kept above 20%, which allows
                      the algorithm to make a better decision. Although further increasing the number
                      of iterations improves the accuracy, we ﬁnd that using forty thousand iterations
                      leads to a good accuracy versus speed trade-oﬀ for the small MobileNetV1.

                      Impact of Long-Term Fine-Tuning. Fig. 9 illustrates the importance of
                      performing the long-term ﬁne-tuning. Although the short-term ﬁne-tuning preserves
                      the accuracy well, the long-term ﬁne-tuning can still increase the accuracy by
                      up to another 4.5% or 3.4% on average. Since the short-term ﬁne-tuning has a
                      short training time, the training is terminated far before convergence. Therefore,
                      it is not surprising that the ﬁnal long-term ﬁne-tuning can further increase the
                      accuracy.

                      Impact of Resource Reduction Schedules. Table 2 shows the impact of
                      using three diﬀerent resource reduction schedules, which are deﬁned in Sec. 3.1.
                      Empirically, using a larger resource reduction at each iteration increases the
                      adaptation speed (i.e., reducing the total number of adaptation iterations) at the
                      cost of accuracy. With the same number of total iterations, the result suggests
                      that a smaller initial resource reduction with a slower decay is preferable.

                      4.4 Analysis of Adapted Network Architecture
                      The network architectures obtained by adapting the small MobileNetV1 with NetAdapt
                      and with the multipliers are compared in Fig. 10. Both have a latency similar
                      to that of 25% MobileNetV1 (128). There are two interesting observations.

                                <<TABLE>>

                      Table 3. The comparison between NetAdapt (adapting the large MobileNetV2 (100%
                      MobileNetV2 (224))) and the multipliers [18] on a mobile CPU of Google Pixel 1. We
                      compare the latency at similar accuracy and the accuracy at similar latency.


                        First, NetAdapt removes more ﬁlters in layers 7 to 10, but fewer in layer 6.
                      Since the feature map resolution is reduced in layer 6 but not in layers 7 to 10,
                      we hypothesize that when the feature map resolution is reduced, more ﬁlters are
                      needed to avoid creating an information bottleneck.
                        The second observation is that NetAdapt keeps more ﬁlters in layer 13 (i.e.
                      the last CONV layer). One possible explanation is that the ImageNet dataset
                      contains one thousand classes, so more feature maps are needed by the last FC
                      layer to do the correct classiﬁcation.

                      4.5 Adapting Large MobileNetV2 on a Mobile CPU
                      In this section, we show encouraging early results of applying NetAdapt to 
                      MobileNetV2 [18]. MobileNetV2 introduces the inverted residual with linear 
                      bottleneck into MobileNetV1 and becomes more eﬃcient. Because MobileNetV2
                      utilizes residual connections, we only adapt individual inner (expansion) layers
                      or reduce all bottleneck layers of the same resolution in lockstep. The main
                      differences between the MobileNetV1 and MobileNetV2 experiment settings are that
                      each network proposal is short-term ﬁne-tuned with ten thousand iterations, the
                      initial latency reduction is 1ms, the latency reduction decay is 0.995, the batch
                      size is 96, and dropout and label smoothing are used. NetAdapt achieves 1.1%
                      higher accuracy or 1.2× faster speed than the multipliers as shown in Table 3.

                      5 Conclusion

                      In summary, we proposed an automated algorithm, called NetAdapt, to adapt a
                      pretrained network to a mobile platform given a real resource budget. NetAdapt
                      can incorporate direct metrics, such as latency and energy, into the optimization
                      to maximize the adaptation performance based on the characteristics of the
                      platform. By using empirical measurements, NetAdapt can be applied to any
                      platform as long as we can measure the desired metrics, without any knowledge
                      of the underlying implementation of the platform. We demonstrated empirically
                      that the proposed algorithm can achieve better accuracy versus latency trade-oﬀ
                      (up to 1.7× faster with equal or higher accuracy) compared with other
                      state-of-the-art network simpliﬁcation algorithms. In this work, we aimed to highlight
                      the importance of using direct metrics in the optimization of eﬃcient networks;
                      we hope that future research eﬀorts will take direct metrics into account in order
                      to further improve the performance of eﬃcient networks.                                           
                      
                      
                      Bibliography

                       [1] Audet, C., Dennis, Jr., J.E.: A progressive barrier for derivative-free
                         nonlinear programming. SIAM Journal on Optimization 20(1), 445–472 (2009)
                       [2] Chen, Y.H., Emer, J., Sze, V.: Eyeriss: A Spatial Architecture for Energy-
                         Eﬃcient Dataﬂow for Convolutional Neural Networks. In: Proceedings of the
                         43rd Annual International Symposium on Computer Architecture (ISCA)
                         (2016)
                       [3] Chen, Y.H., Krishna, T., Emer, J., Sze, V.: Eyeriss: An Energy-Eﬃcient
                         Reconﬁgurable Accelerator for Deep Convolutional Neural Networks. IEEE
                         Journal of Solid-State Circuits 52, 127–138 (2016)
                       [4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A
                         large-scale hierarchical image database. In: IEEE Conference on Computer
                         Vision and Pattern Recognition (CVPR). pp. 248–255. IEEE (2009)
                       [5] Gordon, A., Eban, E., Nachum, O., Chen, B., Yang, T.J., Choi, E.: Mor-
                         phnet: Fast & simple resource-constrained structure learning of deep net-
                         works. In: IEEE Conference on Computer Vision and Pattern Recognition
                         (CVPR) (2018)
                       [6] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections
                         for eﬃcient neural network. In: Advances in Neural Information Processing
                         Systems. pp. 1135–1143 (2015)
                       [7] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image
                         Recognition. In: IEEE Conference on Computer Vision and Pattern Recog-
                         nition (CVPR) (2016)
                       [8] He, Y., Han, S.: Adc: Automated deep compression and acceleration with
                         reinforcement learning. arXiv preprint arXiv:1802.03494 (2018)
                       [9] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand,
                         T., Andreetto, M., Adam, H.: Mobilenets: Eﬃcient convolutional neural
                         networks for mobile vision applications. arXiv preprint arXiv:1704.04861
                         (2017)
                      [10] Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network Trimming: A Data-
                         Driven Neuron Pruning Approach towards Eﬃcient Deep Architectures.
                         arXiv preprint arXiv:1607.03250 (2016)
                      [11] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized
                         neural networks. In: Advances in Neural Information Processing Systems.
                         pp. 4107–4115 (2016)
                      [12] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H.,
                         Kalenichenko, D.: Quantization and training of neural networks for eﬃcient
                         integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877 (2017)
                      [13] Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of
                         deep convolutional neural networks for fast and low power mobile applica-
                         tions. arXiv preprint arXiv:1511.06530 (2015)
                      [14] Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances
                         in Neural Information Processing Systems (1990)
                      [15] Lai, L., Suda, N., Chandra, V.: Not all ops are created equal! In: SysML
                         (2018)
                      [16] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolu-
                         tional neural networks for resource eﬃcient transfer learning. arXiv preprint
                         arXiv:1611.06440 (2016)
                      [17] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet
                         classiﬁcation using binary convolutional neural networks. In: European Con-
                         ference on Computer Vision (ECCV) (2016)
                      [18] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Inverted
                         residuals and linear bottlenecks: Mobile networks for classiﬁcation, detection
                         and segmentation. In: IEEE Conference on Computer Vision and Pattern
                         Recognition (CVPR) (2018)
                      [19] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-
                         Scale Image Recognition. In: International Conference on Learning Repre-
                         sentations (ICLR) (2014)
                      [20] Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural net-
                         works. arXiv preprint arXiv:1507.06149 (2015)
                      [21] Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Eﬃcient processing of deep
                         neural networks: A tutorial and survey. Proceedings of the IEEE 105(12),
                         2295–2329 (Dec 2017). https://doi.org/10.1109/JPROC.2017.2761740
                      [22] TensorFlow Lite: https://www.tensorﬂow.org/mobile/tﬂite/
                      [23] Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L.,
                         Wang, Z.: Deep fried convnets. In: Proceedings of the IEEE International
                         Conference on Computer Vision. pp. 1476–1483 (2015)
                      [24] Yang, T.J., Chen, Y.H., Emer, J., Sze, V.: A
                         Method to Estimate the Energy Consumption of Deep Neural Networks.
                         In: Asilomar Conference on Signals, Systems and Computers (2017)
                      [25] Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-eﬃcient
                         convolutional neural networks using energy-aware pruning. In:
                         IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
                         (2017)
                      [26] Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel:
                         Customizing dnn pruning to the underlying hardware parallelism. In: Pro-
                         ceedings of the 44th Annual International Symposium on Computer Archi-
                         tecture (2017)
                      [27] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shuﬄenet: An extremely
                         eﬃcient convolutional neural network for mobile devices. arXiv preprint
                         arXiv:1707.01083 (2017)
<|endoftext|>


<|startoftext|>
            TOWARDS THE SYSTEMATIC REPORTING OF THE ENERGY AND CARBON FOOTPRINTS OF MACHINE LEARNING 

                                  Peter Henderson†, Jieru Hu‡, Joshua Romoff♦
                                 Emma Brunskill†, Dan Jurafsky†, Joelle Pineau‡,♦
                              †Stanford University, ‡Facebook, ♦Mila, McGill University


                                            February 14, 2020

                                              ABSTRACT

                 Accurate reporting of energy and carbon usage is essential for understanding the potential climate
                 impacts of machine learning research. We introduce a framework that makes this easier by providing a
                 simple interface for tracking realtime energy consumption and carbon emissions, as well as generating
                 standardized online appendices. Utilizing this framework, we create a leaderboard for energy efﬁcient
                 reinforcement learning algorithms to incentivize responsible research in this area as an example for
                 other areas of machine learning. Finally, based on case studies using our framework, we propose
                 strategies for mitigation of carbon emissions and reduction of energy consumption. By making
                 accounting easier, we hope to further the sustainable development of machine learning experiments
                 and spur more research into energy efﬁcient algorithms.

           1 Introduction

           Global climate change is a scientiﬁcally well-recognized phenomenon and appears to be accelerated due to greenhouse
           gas (GHG) emissions such as carbon dioxide or equivalents (CO2eq) (Crowley, 2000; IPCC, 2018). The harmful health
           and safety impacts of global climate change are projected to “fall disproportionately on the poor and vulnerable” (IPCC,
           2018). Energy production remains a large factor in GHG emissions, contributing about 25% of GHG emissions in
           2010 (IPCC, 2018). With the compute and energy demands of many modern machine learning (ML) methods growing
           exponentially (Amodei and Hernandez, 2018), ML systems have the potential to signiﬁcantly contribute to carbon
           emissions. Recent work has demonstrated these potential impacts through case studies and suggested various mitigating
           strategies (Strubell et al., 2019; Schwartz et al., 2019).
           Systematic and accurate measurements are needed to better estimate the broader energy and carbon footprints of ML –
           in both research and production settings. Accurate accounting of carbon and energy impacts aligns incentives with
           energy efﬁciency (Schwartz et al., 2019), raises awareness, and drives mitigation efforts (Sundar et al., 2018; LaRiviere
           et al., 2016), among other beneﬁts. 1 Yet, most ML research papers do not regularly report energy or carbon emissions
           metrics. 2

           We hypothesize that part of the reason that much research does not report energy and carbon metrics is due to the
           complexities of collecting them. Collecting carbon emission metrics requires understanding emissions from energy
           grids, recording power outputs from GPUs and CPUs, and navigating among different tools to accomplish these tasks.
           To reduce this overhead, we present experiment-impact-tracker, 3 a lightweight framework for consistent, easy, and
           more accurate reporting of energy, compute, and carbon impacts of ML systems.
           In Section 4, we introduce the design and capabilities of our framework and the issues with accounting we aim to solve
           with this new framework. Section 5 expands on the challenges of using existing accounting methods and discusses our
           learnings from analyzing experiments with experiment-impact-tracker. For example, in an empirical case study on
           image classiﬁcation algorithms, we demonstrate that ﬂoating point operations (FPOs), a common measure of efﬁciency,
           are often uncorrelated with energy consumption, using energy metrics gathered by experiment-impact-tracker.

              1 See Section 4.1 for an extended discussion on the importance of accounting.
              2 See Section 3 and Appendix B for more information.
              3 https://github.com/Breakend/experiment-impact-tracker

           In Section 6, we focus on recommendations for promoting energy-efﬁcient research and mitigation strategies for carbon
           emissions. Using our framework, we present a Reinforcement Learning Energy Leaderboard in Section 6.1 to encourage
           development of energy efﬁcient algorithms. We also present a case study in machine translation to show how regional
           energy grid differences can result in large variations in CO2eq emissions. Emissions can be reduced by up to 30x just
           by running experiments in locations powered by more renewable energy sources (Section 6.2). Finally, we suggest
           systemic and immediate changes based on our ﬁndings:

                • incentivizing energy-efﬁcient research through leaderboards (Section 6.1)
                • running experiments in carbon-friendly regions (Section 6.2)
                • reducing overheads for utilizing efﬁcient algorithms and resources (Section 7.1)
                • considering energy-performance trade-offs before deploying energy hungry models (Section 7.2)
                • selecting efﬁcient test environments, especially in RL (Section 7.3)
                • ensuring reproducibility to reduce energy consumption from replication difﬁculties (Section 7.4)
                • consistently reporting energy and carbon metrics (Section 7.5)
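
The regional effect mentioned above follows from a simple relation: emissions are the energy consumed multiplied by the carbon intensity of the grid that supplies it. The sketch below uses made-up intensity values, chosen only so that the ratio between the two regions is 30; the region names, the numbers, and the function name are all hypothetical and are not measurements from this paper.

```python
def co2eq_kg(energy_kwh, carbon_intensity_g_per_kwh):
    """Estimated emissions in kg CO2eq for a given energy use and grid intensity."""
    return energy_kwh * carbon_intensity_g_per_kwh / 1000.0


# Hypothetical grid carbon intensities in g CO2eq per kWh.
hypothetical_intensity = {"region_a": 30.0, "region_b": 900.0}

# The same experiment, consuming the same energy, run in each region.
energy_kwh = 100.0
emissions = {
    region: co2eq_kg(energy_kwh, intensity)
    for region, intensity in hypothetical_intensity.items()
}
ratio = emissions["region_b"] / emissions["region_a"]
print(emissions, ratio)
```

Because the energy term is identical in both cases, the emissions ratio equals the ratio of grid intensities, which is why moving a workload can matter as much as making it more efficient.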

           2 Related Work

           Estimating GHG emissions and their downstream consequences is important for setting regulatory standards (U.S.
           Environment Protection Agency, 2013) and encouraging self-regulation (Byerly et al., 2018). In particular, these
           estimates are used to set carbon emissions reduction targets and in turn set carbon prices for taxes or emissions trading
           systems. 4 A large body of work has examined modeling and accounting of carbon emissions 5 at different levels of
           granularity: at the global scale (IPCC, 2018); using country-speciﬁc estimates (Ricke et al., 2018); targeting a particular
           industrial sector like Information and Communication Technologies, for example, modeled by Malmodin et al. (2013);
           or even targeting a particular application like bitcoin mining, for example, modeled by Mora et al. (2018).
           At the application level, some work has already modeled carbon impacts speciﬁcally in computationally intensive
           settings like bitcoin mining (Krause and Tolaymat, 2018; Stoll et al., 2019; Zade et al., 2019; Mora et al., 2018).
           Such application-speciﬁc efforts are important for prioritizing emissions mitigation strategies: without understanding
           projected impacts, policy decisions could focus on ineffective regulation. However, with large amounts of heterogeneity
           and endogeneity in the underlying data, it can be difﬁcult to model all aspects of an application’s usage. For example,
           one study suggested that “bitcoin emissions alone could push global warming above 2°C” (Mora et al., 2018). But
           Masanet et al. (2019), Houy (2019), and others criticized the underlying modeling assumptions which led to such large
           estimates of carbon emissions. This shows that it is vital that these models provide accurate measurements if they are to
           be used for informed decision making.
           With ML models getting more computationally intensive (Amodei and Hernandez, 2018), we want to better understand
           how machine learning in research and industry impacts climate change. However, estimating aggregate climate change
           impacts of ML research and applications would require many assumptions due to a current lack of reporting and
           accounting. Instead, we aim to emphasize and aid systematic reporting strategies such that accurate ﬁeld-wide estimates
           can be conducted in the future.
           Some recent work investigates climate impacts of machine learning research. Specifically, Strubell et al. (2019)
           demonstrate the issue of carbon and energy impacts of large NLP models by evaluating estimated power usage and
           carbon emissions for a set of case studies. The authors suggest that “authors should report training time and sensitivity
           to hyperparameters”, “academic researchers need equitable access to computation resources”, and “researchers should
           prioritize computationally efficient hardware and algorithms”. Schwartz et al. (2019) provide similar proposals,
           suggesting floating point operations (FPOs) as a guiding efficiency metric. Lacoste et al. (2019) recently provided a
           website for estimating carbon emissions based on GPU type, experiment length, and cloud provider. In Section 5, we
           discuss how, while the estimation methods of these works provide some understanding of carbon and energy impacts,
           nuances in the estimation methods may make them inaccurate – particularly in experiments which utilize combined CPU
           and GPU workloads heavily. We build a framework aiming to provide more accurate and easier systematic reporting of
           carbon and energy footprints. We also provide additional mitigation and reporting strategies – beyond those discussed
           by these prior works – to emphasize how both companies and research labs can be more carbon and energy efficient.
           It is worth noting that prior work has also examined the carbon impacts of research in other fields, focusing mostly on
           emissions from conference travel (Spinellis and Louridas, 2013; Astudillo and AzariJafari, 2018; Hackel and Sparkman,
           2018). We provide a brief discussion on ML-related conference travel in Appendix A, but will focus mainly on accurate
           accounting of energy and carbon footprints of ML compute.

              4 An emissions trading system is a cap on total allowed carbon emissions for a company with permits issued. When a company
           emits a certain amount of carbon, they trade in a permit, creating a market for emissions permits. This is a market-based approach to
           incentivize emission reductions. See Ramstein et al. (2019) for a description of such carbon pricing efforts across different countries.
              5 See also assorted examinations on carbon accounting, standardized reporting, and policy recommendations (Stechemesser and
           Guenther, 2012; Dayarathna et al., 2015; IPCC, 2018; Ajani et al., 2013; Bellassen and Stephan, 2015; Andrew and Cortese, 2011;
           Tang and Demeritt, 2018; Cotter et al., 2011; Tol, 2011; U.S. Environment Protection Agency, 2013; Ricke et al., 2018).

           3 Background

           We brieﬂy provide a primer on energy and carbon accounting, which form the basis of our proposed framework for
           measuring and reporting the ecological footprint of ML research.

           3.1 Energy Accounting

           Energy accounting is fairly straightforward. The energy consumption of a system can be measured in Joules (J) or
           Watt-hours (Wh), 6 representing the amount of energy needed to power the system. Life-cycle accounting might also
           consider the energy required to manufacture components of the system – for example, the production of GPUs or
           CPUs (Jones et al., 2013). However, we largely ignore life-cycle aspects of energy accounting due to the difficulties in
           attributing manufacturing impacts on a per-experiment basis. Measuring data-center energy impacts also involves several
           layers, focusing on hardware-centric and software-centric analyses. Many parts contribute to the power consumption
           of any computational system. Dayarathna et al. (2015) survey energy consumption components of a data center and
           their relative consumption: cooling (50%), lighting (3%), power conversion (11%), network hardware (10%), and
           server/storage (26%). The server and storage component can further be broken down into contributions from DRAM
           and CPUs, among other compute components. Accurate accounting for all of these components requires complex modeling
           and varies depending on workload. Since we aim to provide a framework at the per-experiment software level, we only
           account for aspects of energy consumption which expose interfaces for energy metrics. For the purpose of our work, this
           is constrained to DRAM, CPUs, and GPUs. To account for all other components, we rely on a power usage effectiveness
           (PUE) factor (Strubell et al.,2019). This factor rescales the available power metrics by an average projected overhead
           of other components. With more available software interfaces, more robust modeling can be performed as reviewed by
           Dayarathna et al.(2015).

           3.2 Carbon Accounting

           Carbon accounting can be all-expansive, so we focus on a narrow definition provided by Stechemesser and Guenther
           (2012): “carbon accounting at the project scale can be defined as the measuring and non-monetary valuation of carbon
           and GHG emissions and offsetting from projects, and the monetary assessment of these emissions with offset credits to
           inform project-owners and investors but also to establish standardized methodologies.” Carbon and GHG emissions are
           typically measured in some form close to units of CO2eq. This is the amount of carbon – and other GHG converted to
           carbon amounts – released into the atmosphere as a result of the project. Carbon offsetting is the amount of carbon
           emissions saved as a result of the project. For example, a company may purchase renewable energy in excess of
           the energy required for their project to offset the carbon emissions they contributed. Since our goal is to inform
           and assess carbon emissions of machine learning systems, we ignore carbon offsetting. 7 We also do not consider
           carbon accounting in the financial sense, but do provide metrics on monetary impacts through the social cost of carbon
           (SC-CO2). The U.S. Environment Protection Agency (2013) uses this metric when developing administrative rules and
           regulations. According to the EPA, “The SC-CO2 is a measure, in dollars, of the long-term damage done by a ton of
           carbon dioxide (CO2) emissions in a given year. This dollar figure also represents the value of damages avoided for
           a small emission reduction (i.e., the benefit of a CO2 reduction).” We rely on the per-country social cost of carbon
           developed by Ricke et al. (2018), which accounts for different risk profiles of country-level policies and GDP growth in
           their estimates of SC-CO2.
           Carbon emissions from a project can also consider life-cycle emissions (for example, manufacturing of CPUs may emit
           carbon as part of the process). We do not consider these aspects of emissions. Instead, we consider only carbon emissions
           from energy consumption. A given energy grid powering an experiment will have a carbon intensity: the grams of
           CO2 emitted per kWh of energy used. This carbon intensity is determined based on the energy sources supplying the
           grid. Each energy source has its own carbon intensity accounted for through a full life-cycle analysis (IPCC, 2015). For
           example, coal power has a median carbon intensity of 820 gCO2eq / kWh, while hydroelectricity has a mean carbon
           intensity of 24 gCO2eq / kWh. Carbon emissions for a compute system can be estimated by understanding the carbon
           intensity of the local energy grid and the energy consumption of the system. Similar analyses have been done for
           bitcoin (Krause and Tolaymat, 2018). These analyses, however, attempt to extrapolate impacts of bitcoin mining in
           general, while in this work we attempt to examine machine learning impacts on a per-experiment basis.

              6 One Watt is a unit of power – equivalent to one Joule per second.
              7 See discussion in Appendix C for more information on why.
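
           The estimate described above (energy consumed multiplied by the carbon intensity of the grid) can be sketched in a
           few lines of Python. This is only an illustration: the function name and the 10 kWh experiment are assumptions made
           for the example, while the two intensities are the coal and hydroelectricity figures quoted in the text.

```python
def carbon_emissions_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Estimate emissions in kg CO2eq from energy use and grid carbon intensity."""
    if energy_kwh < 0 or intensity_g_per_kwh < 0:
        raise ValueError("energy and carbon intensity must be non-negative")
    return energy_kwh * intensity_g_per_kwh / 1000.0  # grams -> kilograms


# Life-cycle carbon intensities quoted in the text (gCO2eq / kWh).
COAL_G_PER_KWH = 820.0
HYDRO_G_PER_KWH = 24.0

# Hypothetical experiment energy use, chosen only for illustration.
energy_kwh = 10.0

coal_kg = carbon_emissions_kg(energy_kwh, COAL_G_PER_KWH)
hydro_kg = carbon_emissions_kg(energy_kwh, HYDRO_G_PER_KWH)
print(f"coal-powered grid:  {coal_kg:.2f} kg CO2eq")
print(f"hydro-powered grid: {hydro_kg:.2f} kg CO2eq")
```

           The same energy use differs by more than an order of magnitude in emissions depending on the grid, which is why the
           region of an experiment matters for accounting.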

           3.3 Current State of Reporting in Machine Learning Research

           We brieﬂy examine the current state of accounting in the machine learning literature and review commonly reported
           computational metrics. Here we look at a non-exhaustive list of reported metrics from papers we surveyed and group
           them into different categories.

                • Energy
                   – Energy in Joules (Assran et al., 2019)
                   – Power consumption in Watts (Canziani et al., 2016)
                • Compute
                   – PFLOPs-hr (Amodei and Hernandez, 2018), the floating point operations per second needed to run the
                     experiment in one hour
                   – Floating Point Operations (FPOs) or Multiply-Additions (Madds), typically reported as the computations
                     required to perform one forward pass through a neural network (Howard et al., 2017; Sandler et al., 2018;
                     Schwartz et al., 2019)
                   – The number of parameters defined by a neural network (often reported together with FPOs) (Howard
                     et al., 2017; Sandler et al., 2018)
                   – GPU/CPU utilization as a percentage (Assran et al., 2019; Dalton et al., 2019)
                   – GPU-hours or CPU-hours, the processor cycles utilized (or in the case of the GPU percentage utilized),
                     times the runtime (Soboczenski et al., 2018)
                • Runtime
                   – Inference time, the time it takes to run one forward pass through a neural network (Jeon and Kim, 2018;
                     Qin et al., 2018)
                   – Wall clock training time, the total time it takes to train a network (Assran et al., 2019; Dalton et al., 2019)
                   – Hardware and time together (e.g., 8 V100 GPUs for 5 days) (Krizhevsky et al., 2012; Ott et al., 2018;
                     Gehring et al., 2017)
                • Carbon Emissions
                   – US-average carbon emissions (Strubell et al., 2019)

           Example 1 To get a rough estimate of the prevalence of these metrics, we randomly sampled 100 NeurIPS papers from
           the 2019 proceedings. In addition to the metrics above, we also investigate whether hardware information was reported
           (important for extrapolating energy and carbon information with partial information). Of these papers, we found 1
           measured energy in some way, 45 measured runtime in some way, 46 provided the hardware used, 17 provided some
           measure of computational complexity (e.g., compute-time, FPOs, parameters), and 0 provided carbon metrics. See
           Appendix B for more details on methodology.

           Some of these metrics, when combined, can also be used to roughly estimate energy or carbon metrics. For example,
           the experiment time (h) can be multiplied by the thermal design power (TDP) of the GPUs used (W). 8 This results
           in a Watt-hour energy metric. This can then be multiplied by the carbon intensity of the local energy grid to assess
           the amount of CO2eq emitted. This method of estimation omits CPU usage and assumes 100% GPU utilization.
           Alternatively, Amodei and Hernandez (2018) use a utilization factor of 33% for GPUs. Similarly, the PFLOPs-hr metric
           can be multiplied by TDP (Watts) and divided by the maximum computational throughput of the GPU (in PFLOPs).
           This once again provides a Watt-hour energy metric. This, however, makes assumptions based on maximum efficiency
           of a GPU and disregards variations in optimizations made by underlying frameworks (e.g., TensorFlow versus PyTorch;
           AMD versus NVIDIA drivers).
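
           As a minimal sketch of the two rough estimates just described, the Python below computes Watt-hours from runtime
           and TDP, and from PFLOPs-hr and peak throughput. The function names, the 250 W TDP, and the 0.015 PFLOPs peak
           throughput are hypothetical values picked for illustration, not measurements of any particular GPU.

```python
def energy_from_runtime_wh(hours: float, tdp_watts: float,
                           n_gpus: int = 1, utilization: float = 1.0) -> float:
    """Runtime-based estimate: hours x TDP x number of GPUs x utilization factor."""
    return hours * tdp_watts * n_gpus * utilization


def energy_from_pflops_hr_wh(pflops_hr: float, tdp_watts: float,
                             peak_pflops: float) -> float:
    """Compute-based estimate: PFLOPs-hr / peak PFLOPs gives hours at peak throughput."""
    return pflops_hr / peak_pflops * tdp_watts


# Hypothetical setup: 8 GPUs for 5 days, each with an assumed 250 W TDP.
HOURS = 5 * 24
TDP_WATTS = 250.0

full_util_wh = energy_from_runtime_wh(HOURS, TDP_WATTS, n_gpus=8)
# Same setup with the 33% utilization factor mentioned in the text.
third_util_wh = energy_from_runtime_wh(HOURS, TDP_WATTS, n_gpus=8, utilization=0.33)
# Hypothetical 1.5 PFLOPs-hr workload on a GPU with an assumed 0.015 PFLOPs peak.
compute_wh = energy_from_pflops_hr_wh(1.5, TDP_WATTS, peak_pflops=0.015)

print(f"100% utilization: {full_util_wh / 1000:.1f} kWh")
print(f"33% utilization:  {third_util_wh / 1000:.1f} kWh")
print(f"from PFLOPs-hr:   {compute_wh / 1000:.1f} kWh")
```

           Neither estimate includes CPU or DRAM energy, and both depend heavily on the assumed utilization.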

              8 This is a rough estimate of the maximum operating capacity of a GPU.

           As we will demonstrate using our framework (see Section 5.2), the assumptions of these estimation methods lead to
           significant inaccuracies. However, aggregating all necessary accounting information is not straightforward or easy; it
           requires finding compatible tools and handling nuances on shared machines, among other challenges.
           It is worth noting that some metrics focus on the computational requirements of training (which require additional
           resources to compute gradients and backpropagate, in the case of neural networks) versus the computational requirements
           of inference. The former is often more energy and carbon intensive in machine learning research, while the latter is more
           intensive in production systems (the cost of training is insignificant when compared to the lifetime costs of running
           inference millions of times per day, every day). We will remain largely agnostic to this differentiation until some
           discussions in Sections 6.2 and 7.2.

           4 A New Framework for Tracking Machine Learning Impacts

           4.1 Motivation

           The goal of our experiment-impact-tracker framework is to provide an easy to deploy, reproducible, and quickly
           understood mechanism for all machine learning papers to report carbon impact summaries, along with additional
           appendices showing detailed energy, carbon, and compute metrics.

           Example 2 A carbon impact summary generated by our framework can be found at the end of this paper in the Carbon
           Impact Statement section. In brief, the experiments in our paper contributed 8.021 kg of CO2eq to the atmosphere and
           used 24.344 kWh of electricity, having a USA-specific social cost of carbon of $0.38 ($0.00, $0.95) (Ricke et al., 2018).

           Such statements and informational reporting are important for, among other reasons, awareness, aligning incentives,
           and enabling accurate cost-beneﬁt analyses.
           Awareness. Informational labels and awareness campaigns have been shown to be effective drivers of eco-friendly
           behaviors (depending on the context) (Banerjee and Solomon,2003;Sundar et al.,2018;Newell and Siikamäki,2014;
           Byerly et al.,2018). Without consistent and accurate accounting, many researchers will simply be unaware of the
           impacts their models might have and will not pursue mitigating strategies. Consistent reporting also may provide social
           incentives to reduce carbon impacts in research communities.
           Aligning Incentives. While current reporting often focuses solely on performance metrics (accuracy in classification,
           perplexity in language modeling, average return in reinforcement learning, etc.), standardized reporting of energy in
           addition to these metrics aligns incentives towards energy efficient models in research output (Schwartz et al., 2019).
           Those who accurately report carbon emissions may have more incentive to reduce their carbon footprint. This may also
           drive traffic to low-emission regions, spurring construction of more carbon-friendly data centers. 9

           Cost-Beneﬁt Analysis and Meta-Analysis. Cost-beneﬁt analyses can be conducted with accurate energy metrics
           reporting, but are impossible without it. For example, the estimated generated revenue of a model can be weighed
           against the cost of electricity. In the case of models suggested by Rolnick et al.(2019), the carbon emissions saved by a
           model can be weighed against the emissions generated by the model. Consistent reporting also opens the possibility for
           performing meta-analyses on energy and carbon impacts (Henderson and Brunskill,2018). Larger extrapolations to
           ﬁeld-wide impacts of research conferences can also be assessed with more frequent reporting.

           4.2 Design Considerations

           We consider five main principles when designing the framework for systematic reporting: usability, interpretability,
           extensibility, reproducibility, and fault tolerance.
           Usability. Perceived ease-of-use can be an important factor in adoption of new technologies and methods (Gefen and
           Straub, 2000). Since gathering key energy (kWh) and carbon (CO2eq) metrics requires specific knowledge about – and
           aggregation of – different sources of information, there may be a barrier to the ease-of-use in the current status quo. As
           a result, a core design consideration in developing tools for these metrics is usability, or ease-of-use. We accomplish
           this by abstracting away and distilling required knowledge of information sources, keeping the amount of required action
           from the user to a minimum.
           Interpretability. Along with ease-of-use, a key factor in adoption is perceived usefulness (Gefen and Straub, 2000).
           Since we wish for the reporting of carbon and energy metrics to become widespread, we consider perceived usefulness
           through interpretability. We aim to make reporting tools within the framework useful through simple generation of
           graphs and web pages from metrics for easy interpretation. We also provide a mechanism to generate a carbon impact
           statement with the social cost of carbon. This dollar amount represents the projected damage from the experiment’s
           carbon emissions and helps ground results in values that may be more interpretable.

              9 See discussion in Section 6.2 on regional carbon emission differences. See discussion by LaRiviere et al. (2016) on how more accurate carbon accounting can result in reduced carbon emissions.

           Extensibility. We design the framework in a modular fashion to handle evolving driver support (see Section 5) and
           new metrics. The ML community can add new metrics, carbon intensity information, and other capabilities easily. For
           each metric, a central data router stores a description, the function which gathers metric data, and a list of compatibility
           checks (e.g., the metric can only be gathered on a Linux system). New metrics can be added to this router. 10 Similarly,
           new carbon region and electricity grid information can be added as needed to similar centralized locations. 11
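
           To make the router idea concrete, the sketch below stores, for each metric, a description, a gathering function, and
           a list of compatibility checks. It is a hypothetical illustration of the pattern only: names such as MetricEntry,
           METRIC_ROUTER, and register_metric are assumptions and are not taken from the experiment-impact-tracker code.

```python
import platform
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class MetricEntry:
    description: str
    gather: Callable[[], Any]                       # function that collects the metric
    compatibility_checks: List[Callable[[], bool]]  # all must pass to gather

    def is_compatible(self) -> bool:
        return all(check() for check in self.compatibility_checks)


def is_linux() -> bool:
    """Example compatibility check: the metric is only available on Linux."""
    return platform.system() == "Linux"


# Central router: metric name -> entry. New metrics are added by registering here.
METRIC_ROUTER: Dict[str, MetricEntry] = {}


def register_metric(name: str, description: str, gather: Callable[[], Any],
                    compatibility_checks=()) -> None:
    METRIC_ROUTER[name] = MetricEntry(description, gather, list(compatibility_checks))


def gather_compatible_metrics() -> Dict[str, Any]:
    """Collect every registered metric whose compatibility checks all pass."""
    return {name: entry.gather()
            for name, entry in METRIC_ROUTER.items()
            if entry.is_compatible()}


register_metric("python_version", "Version of the running Python interpreter",
                platform.python_version)
register_metric("kernel_release", "Kernel release string (Linux only)",
                platform.release, [is_linux])
register_metric("never_available", "Metric whose check always fails (for illustration)",
                lambda: None, [lambda: False])

results = gather_compatible_metrics()
print(sorted(results))
```

           Keeping the checks next to the gathering function means an unsupported metric is skipped rather than raising
           an error on systems that lack the required tooling.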

           Reproducibility. Running an algorithm on different sets of hardware has been shown to affect the reproducibility of
           algorithmic results (Gundersen and Kjensmo, 2018; Sukhoy and Stoytchev, 2019). Our framework aids in automating
           reproducibility by logging additional metrics like hardware information, Python package versions, etc. These metrics can
           help future work assess statistically significant differences in model energy requirements by accounting for controlled
           and random variates (Boquet et al., 2019).
           Fault tolerance. Mistakes in software are inevitable – as is discussed in Sidor and Schulman (2017). We try to log all
           raw information so that accounting can be recreated and updated based on new information. We also log the version
           number of the tool itself, to ensure future comparisons do not mismatch information between versions that may have
           changed.

           4.3 Proposed Framework

           The experiment-impact-tracker requires a simple code change to automatically gather available metrics and a script to
           generate online appendices for reporting the data. Currently, on compatible Linux systems, we gather:

                • all Python packages and version numbers
                • CPU and GPU hardware information
                • experiment start and end-times
                • the version of the experiment-impact-tracker framework used
                • the energy grid region the experiment is being run in (based on IP address)
                • the average carbon intensity in the energy grid region
                • CPU- and GPU-package power draw
                • per-process utilization of CPUs and GPUs
                • GPU performance states
                • memory usage
                • the realtime CPU frequency (in Hz)
                • realtime carbon intensity (only supported in CA right now)
                • disk write speed

           The code change required for immediate logging of metrics can be seen in Listing 1. In the background, the framework
           launches a thread which polls system supported tools. For example, the thread polls psutil (Rodola, 2016) for measuring
           CPU utilization. All of these metrics are logged in parallel with the main machine learning process as described in
           Figure 1. A script 12 is provided to generate an HTML web page showing graphs and tables for all these metrics, meant
           to serve as an online appendix for research papers. 13 Results in the generated appendix can be aggregated across
           multiple experiments to show averages along with standard error as recommended in prior work (Henderson et al.,
           2018; Colas et al., 2018; Reimers and Gurevych, 2017).

             10 See https://breakend.github.io/experiment-impact-tracker/contributing_new_metric.html
             11 See https://breakend.github.io/experiment-impact-tracker/contributing_carbon_region.html.
             12 https://github.com/Breakend/experiment-impact-tracker/blob/master/scripts/create-compute-appendix
             13 Appendices generated by our framework for Figure 7 and Figure 3 are available at: https://breakend.github.io/ClimateChangeFromMachineLearningResearch/measuring_and_mitigating_energy_and_carbon_footprints_in_machine_learning/. Experiments in Figure 5 are available at https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html.

                                        <<FORMULA>>
            
                      Listing 1. Simple code addition required to log experiment details via our framework.


                    <<FORMULA>>

           Figure 1. A diagram demonstrating how the released version of the tool works. The main process launches a monitoring
           thread which iterates over a list of metrics associated with function calls to other tools. For example, if available, we
           call Intel RAPL to collect CPU power draw or query caiso.org to get realtime carbon intensity data for California.
           Once all the data that is compatible with the current system is gathered, it is logged to a standardized log file and the
           process repeats. The main thread may check in on this thread for exceptions, but the thread will not interrupt the main
           process. Once the main thread exits, an atexit hook (which is called whenever the main process exits, either successfully
           or through an exception) gathers the final information (such as the time the experiment ended), logs it, and then ends
           both the monitor and main process.
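
           The control flow in Figure 1 can be approximated with the Python standard library alone. The sketch below is a
           simplified, hypothetical stand-in for the released tool: the Monitor class, its JSON-lines log format, and the
           metric functions passed to it are assumptions made for illustration. It shows a background thread that polls
           metric functions, records rather than propagates exceptions, and an atexit hook that logs the end time.

```python
import atexit
import json
import threading
import time


class Monitor:
    """Poll metric functions in a background thread and append records to a log file."""

    def __init__(self, log_path, metrics, interval_s=1.0):
        self.log_path = log_path
        self.metrics = metrics        # mapping: metric name -> zero-argument function
        self.interval_s = interval_s
        self.error = None             # the main thread may inspect this for exceptions
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _log(self, record):
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def _run(self):
        try:
            while not self._stop.is_set():
                record = {"timestamp": time.time()}
                for name, gather in self.metrics.items():
                    record[name] = gather()
                self._log(record)
                self._stop.wait(self.interval_s)  # sleep, but wake early on stop
        except Exception as exc:
            # Store the failure instead of interrupting the main process.
            self.error = exc

    def start(self):
        self._log({"experiment_start": time.time()})
        self._thread.start()
        # Runs on normal exit and on exit through an unhandled exception.
        atexit.register(self.stop)

    def stop(self):
        if self._stop.is_set():
            return  # already stopped; makes the atexit hook safe to call twice
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
        self._log({"experiment_end": time.time()})
```

           A caller would construct Monitor with a log path and a dictionary of metric functions, call start() before
           training, and let the atexit hook (or an explicit stop()) write the final record.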


           4.3.1 Tracking Energy Consumption
           Different hardware vendors provide different tooling for tracking energy consumption. Our framework hides these
           complications from users. We currently use Intel’s RAPL tool with the powercap interface (David et al., 2010) to gather
           CPU/DRAM power draw and Nvidia’s nvidia-smi 14 for GPU power draw. We use psutil for gathering per-process CPU
           utilization and nvidia-smi for per-process GPU utilization. We found that on a shared machine – as when running a
           job on Slurm – using Intel’s RAPL would provide energy metrics for the entire machine (including other jobs running
           on the worker). If two experiments were launched with Slurm to the same worker, using measurements from RAPL
           without corrections would double count energy usage from the CPU.
           As a result, we assign energy credits on a per-process basis (though we log system-wide information as well). We
           track the parent process, and any children spawned. Power credits are provided based on relative usage of system
           resources. If a process uses 25% of the CPU (relative to the entire system’s usage), we will credit the process with 25%
           of the CPU-based power draw. This ensures that any non-experiment-related background processes – software updates,
           weekly jobs, or multiple experiments on the same machine – will not be taken into account during training.

             14 https://developer.nvidia.com/nvidia-system-management-interface

           We calculate total energy as:
                                          <<FORMULA>>                                      (1) 

           where p_resource are the percentages of each system resource used by the attributable processes relative to the total in-use
           resources and e_resource is the energy usage of that resource. This is the per-process equivalent of the method which
           Strubell et al. (2019) use. We assume the same constant power usage effectiveness (PUE) as Strubell et al. (2019). This
           value compensates for excess energy from cooling or heating the data-center.
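
           Read literally, the description above weights each resource's energy by the share attributable to the experiment's
           processes and rescales the sum by the PUE. The Python sketch below follows that reading; it is an assumption about
           the form of the calculation rather than the framework's code, and the shares, energy values, and PUE of 1.5 are
           hypothetical numbers chosen for illustration.

```python
from typing import Dict


def attributable_energy_kwh(usage_share: Dict[str, float],
                            energy_kwh: Dict[str, float],
                            pue: float) -> float:
    """PUE times the sum over resources of (attributable share x resource energy)."""
    if set(usage_share) != set(energy_kwh):
        raise ValueError("usage_share and energy_kwh must cover the same resources")
    for resource, share in usage_share.items():
        if not 0.0 <= share <= 1.0:
            raise ValueError(f"share for {resource!r} must be between 0 and 1")
    return pue * sum(usage_share[r] * energy_kwh[r] for r in usage_share)


# Hypothetical machine-wide energy readings (kWh) over one experiment.
machine_energy_kwh = {"cpu": 2.0, "dram": 0.4, "gpu": 6.0}
# Hypothetical shares: the experiment's processes used 25% of in-use CPU and DRAM
# and all of the in-use GPU.
experiment_share = {"cpu": 0.25, "dram": 0.25, "gpu": 1.0}
ASSUMED_PUE = 1.5  # illustrative value, not the constant used in the paper

total_kwh = attributable_energy_kwh(experiment_share, machine_energy_kwh, ASSUMED_PUE)
print(f"attributable energy: {total_kwh:.2f} kWh")
```

           With a share of 1.0 for every resource this reduces to the machine-wide estimate, which is the double-counting
           case described for shared Slurm workers.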

           4.3.2 Carbon Accounting

                                                        <<FIGURE>>

           Figure 2. Realtime carbon intensity (CO2 / kWh) collected during one experiment using our framework. As the
           experiment continued, the sun rose in California, and with it the carbon intensity decreased.

           For calculating carbon emissions, we use the energy estimate from the previous section in kilowatt-hours (kWh) and
           multiply it by the carbon intensity of the local energy grid (CO2 / kWh). To gather carbon intensity metrics
           for energy grids, we build on the open-source portions of https://www.electricitymap.org and define regions
           based on map-based geometries, using the smallest bounding region for a given location as the carbon intensity
           estimate of choice. For example, for an experiment run in San Francisco, if the average carbon intensity is available
           for both the USA and California, the latter will be used. We estimate the region the experiment is conducted in
           based on the machine’s IP address. Carbon intensities are gathered from the average fallback values provided in the
           https://www.electricitymap.org code where available and supplemented with additional metrics from various
           governmental or corporate reports. We note that electricitymap.org estimates are based on a closed-source system
           and use the methodology described by Tranberg et al. (2019). All estimates from electricitymap.org are of
           the regional supply, rather than production (accounting for imports from other regions). Since https://caiso.com
           provides realtime intensities including imports for free, for experiments run in California, we also provide realtime
           carbon intensity information. We do this by polling https://caiso.com for the current intensity of the California
           energy grid every five minutes. This helps gather even more accurate estimates of carbon emissions to account for daily
           shifts in supply. For example, experiments run in California during the day time produce roughly 2/3 of the carbon
           emissions of night-time experiments. This is because much of California’s renewable energy comes from solar plants.
           Figure 2 is an automatically generated graph showing this phenomenon from an experiment using our framework. We
           hope that as users find more accurate realtime or average measurements of regional supply-based carbon intensities,
           they will add them to the tool for even more accurate measurements in the future.
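
           When the carbon intensity is polled at intervals, emissions can be accumulated per interval instead of applying
           one average to the whole run. The sketch below illustrates this under stated assumptions: the sample format, the
           constant 0.3 kW power draw, and the intensities of 300 and 200 gCO2eq / kWh are hypothetical values chosen so that
           the day-time run comes out at roughly 2/3 of the night-time run, mirroring the example in the text.

```python
from typing import Iterable, Tuple

# One sample per polling interval:
# (interval length in hours, mean power draw in kW, grid intensity in gCO2eq / kWh)
Sample = Tuple[float, float, float]


def emissions_kg(samples: Iterable[Sample]) -> float:
    """Sum per-interval emissions so that intensity changes over time are respected."""
    total_grams = 0.0
    for interval_hours, mean_power_kw, intensity_g_per_kwh in samples:
        energy_kwh = interval_hours * mean_power_kw
        total_grams += energy_kwh * intensity_g_per_kwh
    return total_grams / 1000.0


POLL_HOURS = 5 / 60  # five-minute polling interval, as described in the text

# Hypothetical one-hour runs at a constant 0.3 kW.
night_run = [(POLL_HOURS, 0.3, 300.0)] * 12
day_run = [(POLL_HOURS, 0.3, 200.0)] * 12

night_kg = emissions_kg(night_run)
day_kg = emissions_kg(day_run)
print(f"night: {night_kg:.3f} kg CO2eq, day: {day_kg:.3f} kg CO2eq")
print(f"day / night ratio: {day_kg / night_kg:.2f}")
```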

           5 The Importance and Challenges of Accounting: Why a New Framework?

           5.1 FPOs Can Be Misleading

           Floating Point Operations (FPOs) are the de facto standard for reporting “efficiency” of a deep learning model (Schwartz
           et al., 2019), and intuitively they should be correlated with energy efficiency – after all, fewer operations should result
           in faster and more energy efficient processing. However, this is not always the case.
           Previously, Jeon and Kim (2018) demonstrated mechanisms for constructing networks with larger FPOs, but lower
           inference time – discussing the “Trap of FLOPs”. Similarly, Qin et al. (2018) show how Depthwise 3x3 Convolutions
           comprised just 3.06% of an example network’s Multiply-Add operations, while utilizing 82.86% of the total training
           time in the FPO-efficient MobileNet architecture (Howard et al., 2017). Underlying optimizations at the firmware, deep
           learning framework, memory, or even hardware level can change energy efficiency and run-time. This discrepancy has
           led to GitHub Issues where users expect efficiency gains from FPO-efficient operations, but do not observe them. 15

           Example 3 To investigate this empirically, we repeatedly run inference through pre-trained image classification models
           and measure FPOs, parameters, energy usage, and experiment length using the experiment-impact-tracker framework.
           As described in Figure 3, we find little correlation between FPOs and energy usage or experiment runtime when
           comparing across different neural network architectures. However, within an architecture – relying on the same
           operation types, but with different numbers of operations – FPOs are almost perfectly correlated with energy and
           runtime efficiency. Thus, while FPOs are useful for measuring relative ordering within architecture classes, they are not
           adequate on their own to measure energy or even runtime efficiency.

                                                    <<FIGURE>>

           Figure 3. We run 50,000 rounds of inference on a single sampled image through pre-trained image classification models
           and record kWh, experiment time, FPOs, and number of parameters (repeating 4 times on different random seeds).
           References for models, code, and expanded experiment details can be found in Appendix D. We run a similar analysis
           to Canziani et al. (2016) and find (left) that FPOs are not strongly correlated with energy consumption (R² = 0.083,
           Pearson 0.289) nor with time (R² = 0.005, Pearson 0.074) when measured across different architectures. However,
           within an architecture (right) correlations are much stronger. Only considering different versions of VGG, FPOs are
           strongly correlated with energy (R² = 0.999, Pearson 1.0) and time (R² = 0.998, Pearson 0.999). Comparing parameters
           against energy yields similar results (see Appendix D for these results and plots against experiment runtime).
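
           The two statistics reported in the caption are closely related: for a simple linear fit, R² equals the square of the
           Pearson coefficient, and the reported values are consistent with this (0.289² ≈ 0.083). The sketch below computes
           both from paired measurements. The FPO and energy values are hypothetical placeholders standing in for a
           within-architecture comparison, not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements: energy grows linearly with FPOs, as one might
# expect when only the number of operations changes within one architecture.
fpos = [1.0, 2.0, 3.0, 4.0]
energy_kwh = [0.5, 1.0, 1.5, 2.0]

r = pearson_r(fpos, energy_kwh)
r_squared = r ** 2  # equals R² of a simple linear regression of energy on FPOs
print(f"Pearson r = {r:.3f}, R² = {r_squared:.3f}")
```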


           5.2 Estimates with Partial Information Can Be Inaccurate

           The current state of accounting for energy and carbon varies across ﬁelds and papers (see Section 3). Few works, if any,
           report all of the metrics that our framework collects. However, it is possible to extrapolate energy and carbon impacts
           from some subsets of these metrics. This can give a very rough approximation of the energy used by an experiment in
           kWh (see Section 3 for background).

           Example 4 We demonstrate how several such estimation methods compare against the more fine-grained accounting
           methods we describe in Section 4. 16 As seen in Figure 4, we find significant differences from when we track all data
           (as through the experiment-impact-tracker framework) to when we use partial data to extrapolate energy and carbon
           emissions. Only using GPUs and the experiment time ignores memory or CPU effects; only using the average case US
           region ignores regional differences. More details for this experiment can be found in Appendix E.

           We also note that the possible estimation differences in Figure4do not include possible errors from counting multiple
           processes at once, as described in Section4.3.1. Clearly, without detailed accounting, it is easy to severely over- or
           underestimate carbon or energy emissions by extrapolating from partial information.
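
           To make the partial-information approach concrete, the sketch below shows one common form of such a rough
           estimate: GPU power draw times experiment length times a utilization factor, converted to carbon with a regional
           carbon intensity. It is not the script provided with the framework; the function names and all numeric inputs
           are hypothetical, and the formula deliberately ignores CPU, memory, and other overheads, which is exactly why it
           can diverge from full tracking.

```python
def rough_energy_kwh(gpu_tdp_watts, num_gpus, hours, utilization):
    """Rough GPU-only energy estimate in kWh (ignores CPU, RAM and overheads)."""
    return gpu_tdp_watts * num_gpus * hours * utilization / 1000.0

def rough_carbon_kg(energy_kwh, carbon_intensity_g_per_kwh):
    """Convert energy to kg CO2eq using a regional carbon intensity in g/kWh."""
    return energy_kwh * carbon_intensity_g_per_kwh / 1000.0

# Hypothetical inputs: one 250 W GPU, 10 hours, 50% utilization, 400 gCO2eq/kWh grid.
energy = rough_energy_kwh(gpu_tdp_watts=250.0, num_gpus=1, hours=10.0, utilization=0.5)
carbon = rough_carbon_kg(energy, carbon_intensity_g_per_kwh=400.0)
```

           With these made-up inputs the estimate is 1.25 kWh and 0.5 kg CO2eq; changing only the assumed carbon intensity
           changes the carbon figure proportionally.
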
             15 See, for example, https://github.com/tensorflow/tensorflow/issues/12132 and https://github.com/tensorflow/tensorflow/issues/12940
             16 We also provide a script to do the rough calculation of energy and carbon footprints based on GPU type, IP address (which
           is used to retrieve the location of the machine and that region’s carbon intensity), experiment length, and utilization factor.
           https://github.com/Breakend/experiment-impact-tracker/blob/master/scripts/get-rough-emissions-estimate

                                                  <<FIGURE>>

           Figure 4. We compare carbon emissions (left) and kWh (right) of our Pong PPO experiment (see Appendix E for more
           details) by using different estimation methods. By only using country-wide or even regional average estimates, carbon
           emissions may be over- or under-estimated (respectively). Similarly, by using partial information to estimate energy
           usage (right, for more information about the estimation methods see Appendix E), estimates significantly differ from
           when collecting all data in real time (as in our method). Clearly, without detailed accounting, it is easy to over- or
           under-estimate carbon or energy emissions in a number of situations. Stars indicate level of significance: * p < .05, ** p
           < .01, *** p < .001, **** p < .0001. Annotation provided via https://github.com/webermarcolivier/statannot.


           6 Encouraging Efficiency and Mitigating Carbon Impacts: Immediate Mitigation Strategies

           With experiment-impact-tracker, we hope to ease the burden of standardized reporting. We have demonstrated
           differences in more detailed estimation strategies from the current status quo. In this Section, we examine how accurate
           reporting can be used to drive immediate mitigating strategies for energy consumption and carbon emissions.

           6.1 Energy Efficiency Leaderboards

           A body of recent work has emphasized making more computationally efficient models (Wu et al., 2019; Coleman
           et al., 2019; Jiang et al., 2019), yet another line of work has focused on the opposite: building larger models with
           more parameters to tackle more complex tasks (Amodei and Hernandez, 2018; Sutton, 2019). We suggest leaderboards
           which utilize carbon emissions and energy metrics to promote an informed balance of performance and efficiency.
           DawnBench (Wu et al., 2019) has done this in terms of runtime and cost, 17 but by doing the same for energy and carbon
           emissions, baseline implementations can converge to more efficient climate-friendly settings. This can also help spread
           information about the most energy and climate-friendly combinations of hardware, software, and algorithms such that
           new work can be built on top of these systems instead of more energy-hungry configurations.
           A Deep RL Energy Leaderboard.
           To demonstrate how energy leaderboards can be used to disseminate information on energy efficiency, we create a Deep
           RL Energy Leaderboard. 18 The website is generated using the same tool for creating HTML appendices described in
           Section 4. All information (except for algorithm performance on tasks) comes from the experiment-impact-tracker
           framework. We populate the leaderboard for two common RL benchmarking environments, PongNoFrameskip-v4 and
           BreakoutNoFrameskip-v4 (Bellemare et al., 2013; Brockman et al., 2016; Mnih et al., 2013), and four baseline algorithms:
           PPO (Schulman et al., 2017), A2C (Mnih et al., 2016), A2C with V-Traces (Espeholt et al., 2018; Dalton et al., 2019),
           and DQN (Mnih et al., 2013). The experimental details and results can also be found in Figure 5. We find that no
           algorithm is the energy efficiency winner across both environments, though the PPO implementation provided by Hill
           et al. (2018) attains a balance between efficiency and performance when using default settings across algorithms.

           Example 5 To see how such a leaderboard might help save energy, consider a Deep RL class of 235 students. 19 For a
           homework assignment, each student must run an algorithm 5 times on Pong. The class would save 888 kWh of energy

             17 For image classification and question answering tasks.
             18 https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html
             19 See, for example, Stanford’s CS 234.

                                      <<FIGURE>>

           Figure 5. We evaluate A2C, PPO, DQN, and A2C+VTraces on PongNoFrameskip-v4 (left) and BreakoutNoFrameskip-v4
           (right), two common evaluation environments included in OpenAI Gym. We train for only 5M timesteps, less than
           prior work, to encourage energy efficiency and evaluate for 25 episodes every 250k timesteps. We show the Average
           Return across all evaluations throughout training (giving some measure of both ability and speed of convergence of an
           algorithm) as compared to the total energy in kWh. Weighted rankings of Average Return per kWh place A2C+VTraces
           first on Pong and PPO first on Breakout. Using PPO versus DQN can yield significant energy savings, while retaining
           performance on both environments (in the 5M samples regime). See Appendix F for more details and results in terms of
           asymptotic performance.


           by using PPO versus DQN, while achieving similar performance. 20 This is roughly the same amount needed to power a
           US home for one month. 21
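
           The arithmetic behind Example 5 can be unpacked as follows. The class size, runs per student, and total savings
           are taken from the example; the per-run figure is back-calculated from them here and is not a number reported in
           the text.

```python
students = 235            # class size from Example 5
runs_per_student = 5      # runs of the algorithm on Pong per student
class_savings_kwh = 888.0 # total savings stated in Example 5 (PPO versus DQN)

total_runs = students * runs_per_student
# Implied average energy gap between a DQN run and a PPO run (derived, not reported).
implied_savings_per_run_kwh = class_savings_kwh / total_runs
```

           This gives 1175 runs in total and an implied gap of roughly 0.76 kWh per run.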

           We thus encourage the community to submit more data to the leaderboard to find even more energy efficient algorithms
           and configurations.

           6.2 Running In Carbon-Friendly Regions

           We noted in Section 4 that it is important to assess which energy grid experiments are run on due to the large differences
           in carbon emissions between energy grids. Figure 6 shows CO2eq intensities for an assortment of locations, cloud-
           provider regions, and energy production methods. We note that an immediate drop in carbon emissions can be made by
           moving all training jobs to carbon-efficient energy grids. In particular, Quebec is the cleanest available cloud region
           to our knowledge. Running a job in Quebec would result in carbon emissions 30x lower than running a job in Estonia
           (based on 2017 averages).

           Example 6 To demonstrate this in practice, we run inference on two translation models 1000 times and measure energy
           usage. We extrapolate the amount of emissions and the difference between the two algorithms if run in different energy
           grids, seen in Figure 7. The absolute difference in emissions between the two models is fairly small (though significant)
           if run in Quebec (0.09 gCO2eq), yet the gap increases as one runs the jobs in less carbon-friendly regions (3.04 gCO2eq
           in Estonia).

           We provide a script with our framework to show all cloud provider regions with emission statistics to make this decision-
           making process easier. 22 We note that Lacoste et al. (2019) provide a website using partial information estimation to
           extrapolate carbon emissions based on cloud provider region, GPU type, and experiment length in hours. Their tool
           may also be used for estimating carbon emissions in cloud-based experiments ahead of time.
           For companies that train and deploy large models often, shifting these resources is especially important. ML training
           is not usually latency bound: companies can run training in cloud regions geographically far away since training
           models usually does not require round-trip communication. Contrary to some opinions, 23 there is no
           need to eliminate computation-heavy models entirely, as shifting training resources to low carbon regions will
           immediately reduce carbon emissions with little impact to production systems. For companies seeking to hit climate

             20 These rankings may change with different code-bases and hyperparameters.
             21 https://www.eia.gov/tools/faqs/faq.php?id=97&t=3
             22 See the get-region-emissions-info script and the lookup-cloud-region-info script.
             23 https://www.theguardian.com/technology/2019/sep/17/tech-climate-change-luddites-data

                                                  <<FIGURE>>

           Figure 6. Carbon Intensity (gCO2eq/kWh) of selected energy grid regions is shown from least carbon emissions (left) to
           most carbon emissions (right). Red/unshaded boxes indicate carbon intensities of cloud provider regions. Blue/shaded
           boxes indicate carbon intensities of various generation methods. Oil shale is the most carbon emitting method of energy
           production in the Figure. Estonia is powered mainly by oil shale and thus is close to it in carbon intensity. Similarly,
           Québec is mostly powered by hydroelectric methods and is close to it in carbon intensity. Cloud provider carbon
           intensities are based on the regional energy grid in which they are located. Thus, us-west-1, located in California, has
           the same carbon intensity as the state. See https://github.com/Breakend/experiment-impact-tracker/ for
           data sources of regional information. Energy source information from Krey et al. (2014); International Energy Agency
           (2015).


           change policy targets, promotion of carbon neutral regions and shifting of all machine learning systems to those regions
           would accelerate reaching targets significantly and reduce the amount of offset purchasing required to meet goals (thus
           saving resources). 24 It is worth noting that some companies like Google already purchase offsets (Google, 2016), so it
           may be unclear why shifting resources is necessary. We provide an extended discussion on this in Appendix C. As a
           matter of total emissions reductions, running compute in carbon-friendly regions prevents emissions now, while offsets
           may not come into effect for several years. Moreover, continuing offset purchasing at current levels while shifting
           resources to green regions would result in a net-negative carbon footprint.


           7 Discussion: Systemic Changes


           We demonstrated several use cases for accounting which can drive immediate mitigation strategies. However, the
           question remains: how can we encourage systemic changes which lead to energy and carbon efficiency in ML systems?


           7.1 Green Defaults for Common Platforms and Tools

           Energy leaderboards help provide information on energy efficient configurations for the whole stack. However, to truly
           spread energy efficient configurations, underlying frameworks should by default use the most energy-efficient settings
           possible. Green defaults have been shown to be an effective way to drive pro-environmental behavior (Pichert and
           Katsikopoulos, 2008). For example, Nvidia apex provides easy mixed-precision computing as an add-on which yields efficiency
           gains. 25 However, users must know that it exists and opt in to it. Merity (2019) also discusses the current difficulties in using
           highly efficient components. Making such resources supported as defaults in frequently used frameworks, like PyTorch,
           would immediately improve the efficiency of all downstream projects. We encourage maintainers of large projects to
           prioritize and support such changes.


             24 See, for example, Amazon’s goal: https://press.aboutamazon.com/news-releases/news-release-details/amazon-co-founds-climate-pledge-setting-goal-meet-paris
             25 https://github.com/NVIDIA/apex

                                                  <<FIGURE>>

           Figure 7. We use pre-trained En-Fr translation models downloaded from PyTorch Hub: a convolutional network (Gehring
           et al., 2017) and transformer (Ott et al., 2018). We generate 1000 random sequences, each between 3 and 50 words in
           length, using the essential_generators Python package: https://pypi.org/project/essential-generators/.
           We repeat with 20 random seeds. [Left] We show the true difference in energy consumption. [Right] We show estimated
           kgCO2eq released if the experiment had been conducted in a number of increasingly carbon-intensive energy grids.
           Differences remain significant throughout, but the absolute difference increases as more carbon-intensive regions are
           assumed.

           7.2 How much is your performance gain worth? Balancing gains with cost

           While training jobs can easily be shifted to run in clean regions, there are often restrictions for inference-time use of
           machine learning models which prevent such a move. Many companies are deploying large machine learning models
           powered by GPUs for everyday services.

           Example 7 Production translation services can process 100B words per day (Turovsky, 2016), roughly 4.2 million
           times our experiment in Figure 7. If all translation traffic were in Estonia, 12,768 kgCO2eq (the carbon sequestered by
           16.7 acres of forest in one year (Agency, 2008)) would be saved per day by using the more efficient model, yet if all
           traffic were in Québec, 378 kgCO2eq would be saved (the carbon sequestered by 0.5 acres of forest in one year (Agency,
           2008)). Considering the amounts of required compute, small differences in efficiency can scale to large emissions and
           energy impacts.
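
           The daily figures in Example 7 follow from scaling the per-experiment emission gaps reported in Example 6 (0.09
           gCO2eq in Quebec, 3.04 gCO2eq in Estonia) by the factor of roughly 4.2 million. A minimal sketch of that scaling,
           with variable names chosen here for illustration:

```python
SCALE_FACTOR = 4.2e6  # production traffic relative to the Figure 7 experiment

# Per-experiment emission gap between the two models, in gCO2eq (from Example 6).
gap_g_per_experiment = {"Quebec": 0.09, "Estonia": 3.04}

# Daily savings in kgCO2eq if all traffic ran in the given region.
daily_savings_kg = {
    region: gap_g * SCALE_FACTOR / 1000.0
    for region, gap_g in gap_g_per_experiment.items()
}
```

           This reproduces approximately 378 kgCO2eq per day for Quebec and 12,768 kgCO2eq per day for Estonia.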

           These services are latency-bound at inference time and thus cannot mitigate carbon emissions by shifting to different
           regions. Instead, energy-efficiency is key. We encourage companies to consider weighing energy costs (both social and
           monetary) with the performance gains of a new model before deploying it. In the case of our translation experiment in
           Figure 7, the pre-trained convolutional model we use is significantly more energy hungry than the transformer
           model we use. When deploying a new energy-hungry translation model, we ask companies to consider: is the BLEU
           score improvement really worth the energy cost of deploying it? Are there ways to route to different models to balance
           this trade-off? For example, suppose an energy-hungry model only improves performance in some subset of the data.
           Routing to this model only in that subset would maximize performance while minimizing energy footprint. We note
           that considering such trade-offs is of increased importance for models aiming to reduce carbon emissions as described
           by Rolnick et al. (2019). Deploying a large deep learning model for, say, improving the energy efficiency of a building,
           is not worth it if the energy costs of the model outweigh the gains. We also leave an open question to economists to
           help assess the welfare benefits of gains on a particular machine learning metric (e.g., how much is BLEU score worth
           in a translation service). This would allow the social welfare of the metric to be balanced against the social cost of
           carbon (Ricke et al., 2018) for deployment decisions.
           Central to all of these cost-benefit analyses is accurate accounting. Our tool provides one step in consistent and
           accurate accounting for such purposes.
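
           The routing idea above can be made concrete with a small expected-energy calculation. This is only an
           illustrative sketch: the function names, the routing rule, and the energy figures are hypothetical and do not
           come from our experiments.

```python
def choose_model(benefits_from_large_model):
    """Route a request to the large model only when it is expected to help."""
    return "large" if benefits_from_large_model else "small"

def expected_energy_per_request(frac_to_large, energy_small, energy_large):
    """Average energy per request when a fraction of traffic uses the large model."""
    if not 0.0 <= frac_to_large <= 1.0:
        raise ValueError("frac_to_large must be between 0 and 1")
    return frac_to_large * energy_large + (1.0 - frac_to_large) * energy_small

# Hypothetical numbers: the large model costs 3x the energy of the small one,
# and only 20% of requests fall in the subset where it improves quality.
always_large = expected_energy_per_request(1.0, energy_small=1.0, energy_large=3.0)
routed = expected_energy_per_request(0.2, energy_small=1.0, energy_large=3.0)
```

           With these made-up values, routing uses 1.4 energy units per request on average instead of 3.0, while the
           requests that benefit from the larger model still receive it.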

           7.3 Efficient testing environments

           In Section 7.1 we discuss the adoption of green default configurations and Section 7.2 discusses cost-benefit analyses for
           deployments. Another consideration particular to research – especially RL – is the selection of the most efficient testing
           environments which assess the mechanism under test. For example, if an RL algorithm solves a particularly complex task
           in an interesting way, like solving a maze environment, is there a way to demonstrate the same phenomenon in a more
           efficient environment? Several works have developed efficient versions of RL environments which reduce run-times
           significantly. In particular, Dalton et al. (2019) improve the efficiency of Atari experiments by keeping resources on
           the GPU (and thus avoiding energy and time overheads from moving memory back and forth). Chevalier-Boisvert
           et al. (2018) develop a lightweight Grid World environment with efficient runtimes for low-overhead experiments. An
           important cost-benefit question for researchers is whether the same point can be proven in a more efficient setting.

           7.4 Reproducibility

           A key aspect to our work is helping to promote reproducibility by aiding in consistent reporting of experimental details.
           We encourage all researchers to release code and models (when it is socially and ethically responsible to do so), to
           prevent further carbon emissions. Replicating results is an important, if not required, part of research. If replication
           resources are not available, then more energy and emissions must be spent to replicate results – in the case of extremely
           large models, the social cost of carbon may be equivalently large. Thus, we ask researchers to also consider energy and
           environmental impacts from replication efforts, when weighing model and code release. We note that there may very
           well be cases where safety makes this trade-off lean in the direction of withholding resources, but this is likely rare
           in most current research. For production machine learning systems, we encourage developers to release models and
           codebases internally within a company. This may encourage re-use rather than spending energy resources developing
           similar products.

             26 See, for example, search, which now uses transformer networks at both Microsoft and Google:
           https://www.blog.google/products/search/search-language-understanding-bert/ and https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
             27 Efficient routing of traffic to regions has been considered before by Nguyen et al. (2012) and Berral et al. (2010). It may be
           worth considering efficient routing of traffic to particular models as well.
                                                  
           7.5 Standardized reporting

           We suggest that all papers include standardized reporting of energy and carbon emissions. We also suggest adding a
           Carbon Impact Statement at the end of papers (just like ours below) which estimates the carbon emissions of the paper.
           This can be reported in a dollar amount via the country-specific social cost of carbon (Ricke et al., 2018). We provide a
           script 28 to parse logs from the experiment-impact-tracker framework and generate such a statement automatically. We
           suggest this to spread awareness and bring such considerations to the forefront. We also emphasize that research, even
           when compute intensive, is immensely important for progress. It is unknown what sequence of papers may inspire a
           breakthrough (Stanley and Lehman, 2015) which would reduce emissions by more than any suggestion here. While
           emissions should be minimized when possible, we suggest that impact statements be only used for awareness.
           We also suggest that, when developing features which visualize compute intensity for cloud or internal workloads,
           developers consider providing built-in tools to visualize energy usage and carbon emissions. For example, the Colab
           Research Environment shows RAM and Disk capacity, 29 but could also show and provide access to these other metrics
           more easily. Providing similar informational labels (Byerly et al., 2018) within internal tooling could mitigate some
           energy and carbon impacts within companies.

           7.6 Badging

           Informational labeling has had a long history of being used in public policy (Banerjee and Solomon, 2003). In the
           USA, the “Energy Star” label has been used to guide customers to eco-friendly products. More recently, “badges”
           awarded by the Psychological Science journal were shown to be effective, with a jump from 3% of articles reporting
           open data to 39% one year later. ACM has introduced similar reproducibility badges. 30 With consistent reporting of
           carbon and energy metrics, climate friendly research badges can be introduced by conferences to recognize any paper
           that demonstrates a significant effort to mitigate its impacts. For example, a compute intensive paper that shows
           evidence of explicitly running resources in a clean region can be rewarded with such a badge. Another example badge
           can be awarded to papers that create energy-friendly algorithms with similar performance to the state-of-the-art. 31
           The goal of these badges is to draw further attention to efficient versions of state-of-the-art systems and to encourage
           mitigation efforts while, again, not punishing compute-intensive experiments.

           7.7 Driver and Implementation Difficulties

           The experiment-impact-tracker framework abstracts away many of the previously mentioned difficulties in estimating
           carbon and energy impacts: it handles routing to appropriate tools for collecting information, aggregates information
           across tools to handle carbon calculations, finds carbon intensity information automatically, and corrects for multiple
           processes on one machine. Yet, a few other challenges may be hidden by using the framework which remain difficult to
           circumvent.
           As Khan et al. (2018) discuss, and we encounter ourselves, poor driver support makes tracking energy difficult. Not
           every chipset supports RAPL, nor does every Linux kernel. Neither NVIDIA nor Intel provide first party supported python
           libraries for access to measurements. nvidia-smi per-process measurements in docker containers are not supported. 32
           A body of work has also looked at improving estimates of energy usage from RAPL by fitting a regression model to
           real energy usage patterns (Povoa et al., 2019; Kavanagh and Djemame, 2019; Ghosh et al., 2013; Song et al., 2013).
           The Slurm workload manager provides an energy accounting plugin, 33 but requires administrator access to add. For
           those without access to Slurm, Intel’s RAPL supports access to measurements through three mechanisms, but only one
           of these (the powercap interface only available on Linux systems) does not require root access (see more discussion
           by Khan et al. (2018)). To promote widespread reporting, we avoid any tool which requires administrative access or
           would not be accessible on most Linux systems. Providing better supported tools for user-level access to power metrics
           would make it possible to more robustly measure energy usage. Aggregating metrics and handling the intricacies of
           these downstream tools requires time and knowledge. We try to abstract as many of these challenges away as possible in the
           experiment-impact-tracker, though some driver-related issues require driver developer support.
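
           As an illustration of the powercap mechanism mentioned above, the sketch below reads the cumulative RAPL energy
           counter that Linux exposes under sysfs and converts the difference between two readings to joules. It is a
           minimal sketch, not the framework's implementation: the directory path is an assumption (it names package 0 on
           many systems but varies by machine), the helper names are hypothetical, and read permissions on these files
           depend on the kernel and system configuration.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL zone; this differs between machines.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")

def read_energy_uj(rapl_dir=RAPL_DIR):
    """Read the cumulative energy counter in microjoules."""
    return int((rapl_dir / "energy_uj").read_text())

def read_max_range_uj(rapl_dir=RAPL_DIR):
    """Read the value at which the energy counter wraps around."""
    return int((rapl_dir / "max_energy_range_uj").read_text())

def energy_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy used between two readings, allowing for one counter wrap-around."""
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:
        # The counter wrapped: count up to the maximum, then from zero to end.
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj / 1e6
```

           A caller would take one reading before and one after the workload and pass both to energy_delta_joules; sampling
           often enough that the counter wraps at most once between readings keeps the correction valid.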

             28 https://github.com/Breakend/experiment-impact-tracker/blob/master/scripts/generate-carbon-impact-statement
             29 https://colab.research.google.com/
             30 https://www.acm.org/publications/policies/artifact-review-badging
             31 See, for example, Clark et al. (2020), which creates a more efficient version of text encoder pre-training.
             32 https://github.com/NVIDIA/nvidia-docker/issues/179#issuecomment-242150861
             33 https://slurm.schedmd.com/acct_gather_energy_plugins.html
                                              
           We also note that carbon intensities for machines in cloud data centers may not reflect the regional carbon intensities.
           Some providers buy clean energy directly for some data centers, changing the realtime energy mix for that particular
           data center. We were unable to find any information regarding realtime energy mixes in such cases and thus could not
           account for these scenarios. If providers exposed realtime APIs for such information this would help in generating
           more accurate estimates. Moreover, customized hardware in cloud provider regions does not always provide energy
           accounting mechanisms or interfaces. If cloud providers supported libraries for custom hardware, this could be used for
           more detailed accounting in a wider range of cloud-based compute scenarios.

           8 Concluding Remarks and Recommendations

           We have shown how the experiment-impact-tracker and associated tools can help ease the burden of consistent
           accounting and reporting of energy, compute, and carbon metrics; we encourage contribution to help expand the
           framework. We hope the Deep RL Energy Leaderboard helps spread information on energy efficient algorithms and
           encourages research in efficiency. While we focus on compute impacts of machine learning production and research, a
           plethora of other work considers costs of transportation for conferences (Holden et al., 2017; Spinellis and Louridas,
           2013; Bossdorf et al., 2010) and compute hardware manufacturing (Venkatesan, 2015). We encourage researchers and
           companies to consider these other sources of carbon impacts as well. Finally, we recap several points that we have
           highlighted in mitigating emissions and supporting consistent accountability.
           What can machine learning researchers do?

                • Run cloud jobs in low carbon regions only (see Section 6.2).
                • Report metrics as we do here, make energy-efficient configurations more accessible by reporting these results
                  (see Section 7.5).
                • Work on energy-efficient systems, create energy leaderboards (see Section 6).
                • Release code and models whenever safe to do so (see Section 7.4).
                • Integrate energy efficient configurations as defaults in baseline implementations (see Section 7.1).
                • Encourage climate-friendly initiatives at conferences (see Sections 7.6 and 7.5).

           What can industry machine learning developers and framework maintainers do?

                • Move training jobs to low carbon regions immediately. Make default launch configurations and documentation
                  point to low carbon regions (see Section 6.2).
                • Provide more robust tooling for energy tracking and carbon intensities (see Section 7.7).
                • Integrate energy efficient operations as default in frameworks (see Section 7.1).
                • Release code and models (even just internally in the case of production systems) whenever safe to do so (see
                  Section 7.4).
                • Consider energy-based costs versus benefits of deploying new models (see Section 7.2).
                • Report model-related energy metrics (see Section 7.5).

           We hope that regardless of which tool is used to account for carbon and energy emissions, the insights we provide here
           will help promote responsible machine learning research and practices.

           Carbon Impact Statement

           This work contributed 8.021 kg of CO2eq to the atmosphere and used 24.344 kWh of electricity, having a
           USA-specific social cost of carbon of $0.38 ($0.00, $0.95). Carbon accounting information can be found
           here: https://breakend.github.io/ClimateChangeFromMachineLearningResearch/measuring_and_mitigating_energy_and_carbon_footprints_in_machine_learning/
           and https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html. The social cost
           of carbon uses models from Ricke et al. (2018). This statement and carbon emissions information was generated using
           experiment-impact-tracker described in this paper.
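
           For readers who want to relate the dollar figures in the statement above to a per-tonne rate, the sketch below
           back-calculates the implied social cost of carbon from the reported emissions and costs. The resulting rates are
           derived here from rounded numbers and are an approximation, not values quoted from Ricke et al. (2018); the
           function name is hypothetical.

```python
def implied_usd_per_tonne(cost_usd, emissions_kg):
    """Per-tonne social cost implied by a total cost and total emissions."""
    return cost_usd / (emissions_kg / 1000.0)

emissions_kg = 8.021     # kg CO2eq reported in the statement
central_cost_usd = 0.38  # central social cost estimate in the statement
upper_cost_usd = 0.95    # upper end of the reported range

central_rate = implied_usd_per_tonne(central_cost_usd, emissions_kg)
upper_rate = implied_usd_per_tonne(upper_cost_usd, emissions_kg)
```

           The central figure corresponds to roughly $47 per tonne of CO2eq and the upper figure to roughly $118 per tonne.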

           References
           US Environmental Protection Agency. Greenhouse gas equivalencies calculator, 2008. URLhttps.//www.epa.gov/
             energy/greenhouse-gas-equivalencies-calculator.
           Judith I Ajani, Heather Keith, Margaret Blakers, Brendan G Mackey, and Helen P King. Comprehensive carbon stock
             and ﬂow accounting. a national framework to support climate change mitigation policy.Ecological Economics, 89.
             61–72, 2013.
           Dario Amodei and Danny Hernandez. AI and Compute.https.//blog.openai.com/openai-five/, 2018.
           Jane Andrew and Corinne Cortese. Accounting for climate change and the self-regulation of carbon disclosures. In
             Accounting Forum, volume 35, pages 130–138. Taylor & Francis, 2011.
           Mahmoud ("Mido") Assran, Joshua Romoff, Nicolas Ballas, Joelle Pineau, and Mike Rabbat. Gossip-based
             actor-learner architectures for deep reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
             E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13299–13309. Curran
             Associates, Inc., 2019.
           Miguel F. Astudillo and Hessam AzariJafari. Estimating the global warming emissions of the LCAXVII conference:
             connecting flights matter. The International Journal of Life Cycle Assessment, 23(7):1512–1516, Jul 2018. ISSN
             1614-7502.
           Abhijit Banerjee and Barry D Solomon. Eco-labeling for energy efficiency and sustainability: a meta-evaluation of US
             programs. Energy policy, 31(2):109–123, 2003.
           Valentin Bellassen and Nicolas Stephan. Accounting for Carbon. Cambridge University Press, 2015.
           Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An
             Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
           Josep Ll. Berral, Íñigo Goiri, Ramón Nou, Ferran Julià, Jordi Guitart, Ricard Gavaldà, and Jordi Torres. Towards
             energy-aware scheduling in data centers using machine learning. In Proceedings of the 1st International Conference on
             Energy-Efficient Computing and Networking, e-Energy ’10, pages 215–224, New York, NY, USA, 2010. Association
             for Computing Machinery. ISBN 9781450300421.
           Thomas Boquet, Laure Delisle, Denis Kochetkov, Nathan Schucher, Parmida Atighehchian, Boris Oreshkin, and
             Julien Cornebise. DECoVaC: Design of Experiments with Controlled Variability Components. arXiv preprint
             arXiv:1909.09859, 2019.
           Oliver Bossdorf, Madalin Parepa, and Markus Fischer. Climate-neutral ecology conferences: just do it! Trends in
             Ecology & Evolution, 25(2):61, 2010.
           Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
             OpenAI Gym, 2016.
           Hilary Byerly, Andrew Balmford, Paul J Ferraro, Courtney Hammond Wagner, Elizabeth Palchak, Stephen Polasky,
             Taylor H Ricketts, Aaron J Schwartz, and Brendan Fisher. Nudging pro-environmental behavior: evidence and
             opportunities. Frontiers in Ecology and the Environment, 16(3):159–168, 2018.
           Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical
             applications. arXiv preprint arXiv:1605.07678, 2016.
           Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. Hardnet: A low memory traffic
             network. In Proceedings of the IEEE International Conference on Computer Vision, pages 3552–3561, 2019.
           Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic Gridworld Environment for OpenAI Gym.
             https://github.com/maximecb/gym-minigrid, 2018.
           Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders
             as discriminators rather than generators. In International Conference on Learning Representations, 2020.
           Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How many random seeds? statistical power analysis in deep
             reinforcement learning experiments. arXiv preprint arXiv:1806.08295, 2018.
           Cody Coleman, Daniel Kang, Deepak Narayanan, Luigi Nardi, Tian Zhao, Jian Zhang, Peter Bailis, Kunle Olukotun,
             Chris Ré, and Matei Zaharia. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance
             Benchmark. SIGOPS Oper. Syst. Rev., 53(1):14–25, July 2019. ISSN 0163-5980.
           Julie Cotter, Muftah Najah, and Shihui Sophie Wang. Standardized reporting of climate change information in Australia.
             Sustainability accounting, management and policy journal, 2(2):294–321, 2011.
           Thomas J Crowley. Causes of climate change over the past 1000 years. Science, 289(5477):270–277, 2000.
           Steven Dalton, Iuri Frosio, and Michael Garland. GPU-Accelerated Atari Emulation for Reinforcement Learning, 2019.
           Howard David, Eugene Gorbatov, Ulf R Hanebutte, Rahul Khanna, and Christian Le. RAPL: memory power estimation
             and capping. In 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), pages
             189–194. IEEE, 2010.
           Miyuru Dayarathna, Yonggang Wen, and Rui Fan. Data center energy consumption modeling: A survey. IEEE
             Communications Surveys & Tutorials, 18(1):732–794, 2015.
           Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu,
             Tim Harley, Iain Dunning, et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner
             Architectures. In International Conference on Machine Learning, pages 1406–1415, 2018.
           David Gefen and Detmar W Straub. The relative importance of perceived ease of use in IS adoption: A study of
             e-commerce adoption. Journal of the association for Information Systems, 1(1):8, 2000.
           Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence
             learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252.
             JMLR.org, 2017.
           Sayan Ghosh, Sunita Chandrasekaran, and Barbara Chapman. Statistical modeling of power/energy of scientific kernels
             on a multi-gpu system. In 2013 International Green Computing Conference Proceedings, pages 1–6. IEEE, 2013.
           Google. Google’s Green PPAs: What, How, and Why.
             https://static.googleusercontent.com/media/www.google.com/en//green/pdfs/renewable-energy.pdf, 2013.
           Google. Achieving Our 100% Renewable Energy Purchasing Goal and Going Beyond.
             https://static.googleusercontent.com/media/www.google.com/en//green/pdf/achieving-100-renewable-energy-purchasing-goal.pdf,
             2016.
           Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. In Thirty-Second
             AAAI Conference on Artificial Intelligence, 2018.
           Leor Hackel and Gregg Sparkman. Evaluating the climate impact of psychological science: Costs and opportunities.
             Affective Seminar, 2018. URL https://osf.io/dg5ap/?show=view.
           Peter Henderson and Emma Brunskill. Distilling information from a flood: A possibility for the use of meta-analysis
             and systematic review in machine learning research. In Critiquing and Correcting Trends in Machine Learning
             Workshop (CRACT) at NeurIPS, 2018.
           Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement
             learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
           Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal,
             Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and
             Yuhuai Wu. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.
           Matthew H Holden, Nathalie Butt, Alienor Chauvenet, Michaela Plein, Martin Stringer, and Iadine Chadès. Academic
             conferences urgently need environmental policies. Nature ecology & evolution, 2017.
           Nicolas Houy. Rational mining limits bitcoin emissions. Nature Climate Change, 9(9):655–655, 2019.
           Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto,
             and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv
             preprint arXiv:1704.04861, 2017.
           Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional
             networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708,
             2017.
           Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet:
             AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
           International Energy Agency. CO2 Emissions from Fuel Combustion. 2015.
           IPCC. Climate Change 2014: Mitigation of Climate Change. Working Group III Contribution to the IPCC Fifth
             Assessment Report. Cambridge University Press, 2015.
           IPCC. Global Warming of 1.5 °C. 2018.
           Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. In Advances in Neural
             Information Processing Systems, pages 5951–5961, 2018.
           Angela H. Jiang, Daniel L. K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi,
             Michael Kaminksy, Michael Kozuch, Zachary C. Lipton, and Padmanabhan Pillai. Accelerating Deep Learning by
             Focusing on the Biggest Losers. arXiv e-prints, art. arXiv:1910.00762, Oct 2019.
           Alex K Jones, Liang Liao, William O Collinge, Haifeng Xu, Laura A Schaefer, Amy E Landis, and Melissa M Bilec.
             Green computing: A life cycle perspective. In 2013 International Green Computing Conference Proceedings, pages
             1–6. IEEE, 2013.
           Richard Kavanagh and Karim Djemame. Rapid and accurate energy models through calibration with ipmi and rapl.
             Concurrency and Computation: Practice and Experience, 31(13):e5124, 2019.
           Kashif Nizam Khan, Mikael Hirki, Tapio Niemi, Jukka K. Nurminen, and Zhonghong Ou. RAPL in Action: Experiences
             in Using RAPL for Power Measurements. ACM Trans. Model. Perform. Eval. Comput. Syst., 3(2):9:1–9:26, March
             2018. ISSN 2376-3639.
           Max J Krause and Thabet Tolaymat. Quantification of energy and carbon costs for mining cryptocurrencies. Nature
             Sustainability, 1(11):711, 2018.
           V. Krey, O. Masera, G. Blanford, T. Bruckner, R. Cooke, K. Fisher-Vanden, H. Haberl, E. Hertwich, E. Kriegler,
             D. Mueller, S. Paltsev, L. Price, S. Schlömer, D. Ürge-Vorsatz, D. van Vuuren, and T. Zwickel. Annex 2 - metrics and
             methodology. In Climate Change 2014: Mitigation of Climate Change. IPCC Working Group III Contribution to
             AR5. Cambridge University Press, November 2014. URL http://pure.iiasa.ac.at/id/eprint/11109/.
           Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural
             Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information
             Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
           Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of
             machine learning. arXiv preprint arXiv:1910.09700, 2019.
           Jacob LaRiviere, Gavin Mccormick, and Sho Kawano. How better accounting can more cheaply reduce carbon
             emissions. Policy Brief, 4, 2016.
           Jens Malmodin, Pernilla Bergmark, and Dag Lundén. The future carbon footprint of the ICT and E&M sectors. on
             Information and Communication Technologies, page 12, 2013.
           Eric Masanet, Arman Shehabi, Nuoa Lei, Harald Vranken, Jonathan Koomey, and Jens Malmodin. Implausible
             projections overestimate near-term bitcoin CO2 emissions. Nature Climate Change, 9(9):653–654, 2019.
           Stephen Merity. Single Headed Attention RNN: Stop Thinking With Your Head. arXiv preprint arXiv:1911.11423,
             2019.
           Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin
             Riedmiller. Playing Atari With Deep Reinforcement Learning. In NIPS Deep Learning Workshop. 2013.
           Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
             and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on
             machine learning, pages 1928–1937, 2016.
           Camilo Mora, Randi L Rollins, Katie Taladay, Michael B Kantar, Mason K Chock, Mio Shimada, and Erik C Franklin.
             Bitcoin emissions alone could push global warming above 2 °C. Nature Climate Change, 8(11):931, 2018.
           Richard G Newell and Juha Siikamäki. Nudging energy efficiency behavior: The role of information labels. Journal of
             the Association of Environmental and Resource Economists, 1(4):555–598, 2014.
           Kim Khoa Nguyen, Mohamed Cheriet, Mathieu Lemay, Victor Reijs, Andrew Mackarel, and Alin Pastrama.
             Environmental-aware virtual data center network. Computer Networks, 2012.
           Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the
             Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 2018. Association for Computational
             Linguistics.
           Daniel Pichert and Konstantinos V. Katsikopoulos. Green defaults: Information presentation and pro-environmental
             behaviour. Journal of Environmental Psychology, 28(1):63–73, 2008. ISSN 0272-4944. doi:
             https://doi.org/10.1016/j.jenvp.2007.09.004. URL http://www.sciencedirect.com/science/article/pii/S0272494407000758.
           Lucas Venezian Povoa, Cesar Marcondes, and Hermes Senger. Modeling energy consumption based on resource
             utilization. In International Conference on Computational Science and Its Applications, pages 225–240. Springer,
             2019.
           Zheng Qin, Zhaoning Zhang, Dongsheng Li, Yiming Zhang, and Yuxing Peng. Diagonalwise Refactorization: An
             Efficient Training Method for Depthwise Convolutions. In 2018 International Joint Conference on Neural Networks
             (IJCNN), pages 1–8. IEEE, 2018.
           Celine Ramstein, Goran Dominioni, Sanaz Ettehad, Long Lam, Maurice Quant, Jialiang Zhang, Louis Mark, Sam
             Nierop, Tom Berg, Paige Leuschner, et al. State and trends of carbon pricing 2019, 2019.
           Nils Reimers and Iryna Gurevych. Reporting Score Distributions Makes a Difference: Performance Study of
             LSTM-networks for Sequence Tagging. In EMNLP, 2017.
           Katharine Ricke, Laurent Drouet, Ken Caldeira, and Massimo Tavoni. Country-level social cost of carbon. Nature
             Climate Change, 2018.
           Giampaolo Rodola. Psutil package: a cross-platform library for retrieving information on running processes and system
             utilization, 2016.
           David Rolnick, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran, Andrew Slavin
             Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan Maharaj,
             Evan D. Sherwin, S. Karthik Mukkavilli, Konrad P. Kording, Carla Gomes, Andrew Y. Ng, Demis Hassabis, John C.
             Platt, Felix Creutzig, Jennifer Chayes, and Yoshua Bengio. Tackling Climate Change with Machine Learning. arXiv
             e-prints, art. arXiv:1906.05433, Jun 2019.
           Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
             residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
             pages 4510–4520, 2018.
           John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
             algorithms. arXiv preprint arXiv:1707.06347, 2017.
           Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. arXiv e-prints, art. arXiv:1907.10597, Jul
             2019.
           Sam Shead. AI Researchers Left Disappointed As NIPS Sells Out In Under 12 Minutes. Forbes, Sep 2018. URL
             https://www.forbes.com/sites/samshead/2018/09/05/ai-researchers-left-disappointed-as-nips-sells-out-in-under-12-minutes/#7dda67fc20e9.
           Yoav Shoham, Erik Brynjolfsson, Jack Clark, John Etchemendy, Barbara Grosz, Terah Lyons, James Manyika, Saurabh
             Mishra, and Juan Carlos Niebles. The AI index 2019 annual report. AI Index Steering Committee, Human-Centered
             AI Initiative, Stanford University, 2019.
           Szymon Sidor and John Schulman. Openai baselines: Dqn (blogpost). 2017.
           Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv
             preprint arXiv:1409.1556, 2014.
           Frank Soboczenski, Michael D Himes, Molly D O’Beirne, Simone Zorzan, Atilim Gunes Baydin, Adam D Cobb,
             Yarin Gal, Daniel Angerhausen, Massimo Mascaro, Giada N Arney, et al. Bayesian deep learning for exoplanet
             atmospheric retrieval. arXiv preprint arXiv:1811.03390, 2018.
           Shuaiwen Leon Song, Kevin Barker, and Darren Kerbyson. Unified performance and power modeling of scientific
             workloads. In Proceedings of the 1st International Workshop on Energy Efficient Supercomputing, page 4. ACM,
             2013.
           Diomidis Spinellis and Panos Louridas. The carbon footprint of conference papers. PloS one, 8(6):e66508, 2013.
           Kenneth O Stanley and Joel Lehman. Why greatness cannot be planned: The myth of the objective. Springer, 2015.
           Kristin Stechemesser and Edeltraud Guenther. Carbon accounting: a systematic literature review. Journal of Cleaner
             Production, 36:17–38, 2012.
           Christian Stoll, Lena Klaaßen, and Ulrich Gallersdörfer. The carbon footprint of bitcoin. Joule, 2019.
           Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP.
             arXiv preprint arXiv:1906.02243, 2019.
           Vladimir Sukhoy and Alexander Stoytchev. Eliminating the Variability of Cross-Validation Results with LIBLINEAR
             due to Randomization and Parallelization. 2019.
           Shyam Sundar, Ashish Kumar Mishra, and Ram Naresh. Modeling the impact of media awareness programs on
             mitigation of carbon dioxide emitted from automobiles. Modeling Earth Systems and Environment, 4(1):349–357,
             2018.
           Richard Sutton. The bitter lesson. Incomplete Ideas (blog), March 13, 2019.
           Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
             Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition
             (CVPR), 2015.
           Samuel Tang and David Demeritt. Climate change and mandatory carbon reporting: Impacts on business process and
             performance. Business Strategy and the Environment, 27(4):437–455, 2018.
           Richard SJ Tol. The social cost of carbon. Annu. Rev. Resour. Econ., 3(1):419–443, 2011.
           Bo Tranberg, Olivier Corradi, Bruno Lajoie, Thomas Gibon, Iain Staffell, and Gorm Bruun Andresen. Real-time carbon
             accounting method for the European electricity markets. Energy Strategy Reviews, 26:100367, 2019.
           Barak Turovsky. Ten years of Google Translate. Google Official Blog, 2016.
           U.S. Environment Protection Agency. Social Cost of Carbon.
             https://www.epa.gov/sites/production/files/2016-12/documents/social_cost_of_carbon_fact_sheet.pdf, 2013.
           Chandramouli Venkatesan. Comparative Carbon Footprint Assessment of the Manufacturing and Use Phases of Two
             Generations of AMD Accelerated Processing Units, 2015.
           Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay Less Attention with Lightweight and
             Dynamic Convolutions. In International Conference on Learning Representations, 2019.
           Michel Zade, Jonas Myklebost, Peter Tzscheutschler, and Ulrich Wagner. Is bitcoin the only problem? a scenario model
             for the power demand of blockchains. Frontiers in Energy Research, 7, 2019.
           Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
           Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural
             network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
             pages 6848–6856, 2018.


           A Conference Travel

           Prior work has also examined conference travel in various fields as a major source of impact (Spinellis and Louridas,
           2013; Astudillo and AzariJafari, 2018; Hackel and Sparkman, 2018). For example, Spinellis and Louridas (2013)
           found that the emissions from travel per conference participant were about 801 kg CO2eq, Astudillo and AzariJafari
           (2018) estimated around 883 kg CO2eq per participant, and Hackel and Sparkman (2018) estimated around 910
           kg CO2eq per participant. Interestingly, these separate papers all arrive at similar carbon emissions
           numbers per conference participant. Using these estimates and ML conference participant statistics, we can gain some (very) rough
           insight into the carbon emissions caused by conference travel (not including food purchases, accommodations, and
           travel within the conference city).
           Conference participation has grown particularly popular in ML research, attracting participants from industry and
           academia. In 2018 the Neural Information Processing Systems (NeurIPS) conference sold out registrations in 12
           minutes (Shead, 2018). In 2019, according to the AI Index Report 2019 (Shoham et al., 2019), conferences had the
           following attendance: CVPR (9,227); IJCAI (3,015); AAAI (3,227); NeurIPS (13,500); IROS (3,509); ICML (6,481);
           ICLR (2,720); AAMAS (701); ICAPS (283); UAI (334). The larger conferences also showed continued growth:
           NeurIPS showed year-over-year growth of 41% from 2018 to 2019. Taking only these conferences and their 2019
           attendances, together with the lowest average estimate of 801 kg CO2eq per participant (Spinellis and Louridas, 2013), this adds up
           to roughly 34,440,597 kg CO2eq emitted in 2019 from ML-related conferences (not considering co-location and many
           other factors).
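
           The rough total above can be reproduced directly from the quoted figures. The following is a minimal sketch rather
           than the authors' code: the variable and function names are illustrative assumptions, and only the attendance
           counts and the per-participant estimates are taken from the text.

```python
# Attendance figures quoted above from the AI Index Report 2019.
attendance_2019 = {
    "CVPR": 9227, "IJCAI": 3015, "AAAI": 3227, "NeurIPS": 13500, "IROS": 3509,
    "ICML": 6481, "ICLR": 2720, "AAMAS": 701, "ICAPS": 283, "UAI": 334,
}

# Per-participant travel estimates (kg CO2eq) quoted above.
kg_per_participant = {
    "Spinellis and Louridas (2013)": 801,
    "Astudillo and AzariJafari (2018)": 883,
    "Hackel and Sparkman (2018)": 910,
}


def total_travel_emissions_kg(attendance, kg_each):
    """Total participants times a flat per-participant estimate (hypothetical helper)."""
    return sum(attendance.values()) * kg_each


total_participants = sum(attendance_2019.values())
totals_kg = {
    source: total_travel_emissions_kg(attendance_2019, kg)
    for source, kg in kg_per_participant.items()
}
low_estimate_kg = totals_kg["Spinellis and Louridas (2013)"]

for source, total in totals_kg.items():
    print(f"{source}: {total:,} kg CO2eq")
```

           Using the lowest estimate gives the figure stated in the text; the other two estimates give correspondingly
           larger totals under the same simplifying assumptions.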

           B NeurIPS Sampling on Metric Reporting

           We randomly sampled 100 NeurIPS papers from the 2019 proceedings. Of these papers, we found that 1 measured
           energy in some way, 45 measured runtime in some way, 46 provided the hardware used, 17 provided
           some measure of computational complexity (e.g., compute-time, FPOs, parameters), and 0 provided
           carbon metrics. We sampled from the NeurIPS proceedings page:
           https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019. We first automatically checked for key
           words (below) related to energy, compute, and carbon. We then examined the context of each word to classify it
           as relating to hardware details (e.g., Nvidia Titan X GPU), computational efficiency (e.g., FPOs, MAdds, GPU-hours),
           runtime (e.g., the experiment ran for 8 hours), energy (e.g., a plot of performance over Joules or Watts), or carbon (e.g.,
           we estimate 10 kg CO2eq were emitted). We also manually validated papers for similar metrics that did not appear in the
           keyword search. If a paper did not contain experiments, we removed it and randomly drew a new paper. In many cases,
           metrics are only provided for some subset of experiments (or for particular ablation experiments). We nonetheless count
           these as reporting the metric. Where a neural network diagram or architecture description was provided, we did not
           consider this to be reporting a compute metric.

           compute_terms = ["flop", "fpo", "pflop", "tflops", "tflop", "parameters", "params", "pflops", "flops", "fpos", "gpu-hours",
           "cpu-hours", "cpu-time", "gpu-time", "multiply-add", "madd"]
           hardware_terms = ["nvidia", "intel", "amd", "radeon", "gtx", "titan", "v100", "tpu", "ryzen", "cpu", "gpu"]
           time_terms = ["seconds", "second", "hour", "hours", "day", "days", "time", "experiment length", "run-time", "runtime"]
           energy_terms = ["watt", "kWh", "joule", "joules", "wh", "kwhs", "watts", "rapl", "energy", "power"]
           carbon_terms = ["co2", "carbon", "emissions"]
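
           The automatic first pass described above could be sketched as follows. This is an illustration under stated
           assumptions, not the script used for the study: the function name, the case-insensitive substring matching, and
           the size of the context window are all hypothetical choices. Substring matching deliberately over-triggers (for
           example, "wh" or "time" inside longer words), which is why each hit is returned with its surrounding context for
           the manual classification step.

```python
import re

# Keyword lists as given in the text (terms are lowercased before matching).
TERM_GROUPS = {
    "compute": ["flop", "fpo", "pflop", "tflops", "tflop", "parameters", "params",
                "pflops", "flops", "fpos", "gpu-hours", "cpu-hours", "cpu-time",
                "gpu-time", "multiply-add", "madd"],
    "hardware": ["nvidia", "intel", "amd", "radeon", "gtx", "titan", "v100", "tpu",
                 "ryzen", "cpu", "gpu"],
    "time": ["seconds", "second", "hour", "hours", "day", "days", "time",
             "experiment length", "run-time", "runtime"],
    "energy": ["watt", "kWh", "joule", "joules", "wh", "kwhs", "watts", "rapl",
               "energy", "power"],
    "carbon": ["co2", "carbon", "emissions"],
}


def find_keyword_hits(text, term_groups=TERM_GROUPS, context_chars=40):
    """Return {category: [(term, context), ...]} for case-insensitive substring hits.

    Only categories with at least one hit appear in the result. The context
    snippet is meant for a human reader to decide whether the hit is relevant.
    """
    lowered = text.lower()
    hits = {}
    for category, terms in term_groups.items():
        for term in terms:
            for match in re.finditer(re.escape(term.lower()), lowered):
                start = max(0, match.start() - context_chars)
                end = min(len(lowered), match.end() + context_chars)
                hits.setdefault(category, []).append((term, lowered[start:end]))
    return hits


# Hypothetical sentence built from the examples mentioned in the text.
sample = ("We trained on a single Nvidia Titan X GPU for 8 hours and "
          "estimate 10 kg CO2eq were emitted.")
hits = find_keyword_hits(sample)
for category, found in hits.items():
    print(category, sorted({term for term, _ in found}))
```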

           C Carbon Discussion

           But cloud providers claim 100% carbon neutrality in my region, why do I need to shift my resources?
           While we estimate energy mixes based on regional grids, cloud providers sometimes aim for carbon neutrality through
           a mixture of mechanisms which may change the energy mix being provided to a data center in an otherwise carbon
           intensive energy grid or otherwise offset unclean energy usage. Data centers draw energy from the local energy grids,
           and as a result the mix of energy they consume largely depends on the composition of the power running in the grids. If
           the local energy grids are powered by a mix of fuel and renewable energy, a data center will inevitably consume fuel
           energy as well.
           Because consumers do not know the origin of the physical electricity from the utility grid, it is difficult to
           assign ownership of the renewable energy consumption. The Environmental Protection Agency (EPA) uses renewable
           energy certificates (RECs) to track the generation and consumption of renewable energy: one REC is issued when
           one megawatt-hour (MWh) of electricity is generated from a renewable source and delivered to the energy grid. 34
           Consumers can then purchase RECs from a renewable energy provider and apply them to their electricity usage. This
           means consumers can claim they run on renewable energy by purchasing RECs from providers that do not actually
           power the energy grids that they draw electricity from. Although this means that the consumers’ realtime carbon
           footprints will still be decided by the composition of renewable and fuel energy in their local energy grids, purchasing
           RECs allows more renewable energy to flow onto the grid and supports future development of renewable
           sources. Google, to offset its carbon emissions, uses RECs and power purchase agreements (PPAs) with renewable
           energy providers to ensure that more renewable energy powers the same electricity grids that its data centers are in. 35
           Google then sells the renewable energy as it becomes available back to the electricity grids and strips away the RECs.
           Over one year, Google applies equal amounts of RECs to its data centers’ total energy consumption. This method
           helps green energy provider development by creating long-term demand. However, PPAs provide RECs for future
           renewables, not only current energy on the grid, which may remain unchanged. As Google states: “While the renewable
           facility output is not being used directly to power a Google data center, the PPA arrangement assures that additional
           renewable generation sufficient to power the data center came on line in the area.” 36
           We can see that even if a cloud provider’s data centers are carbon neutral, the actual CO2eq emissions can vary widely
           depending on the region and even the time of day (solar energy cannot be generated at night). We suggest that cloud
           providers release tools for understanding the carbon intensity for each data center region regardless of offset purchasing.
           While the purchases of PPAs and RECs are valuable for driving towards renewable energy in otherwise dirty regions,
           for machine learning model training, where the resources can be moved, we believe shifting resources to low intensity
           regions is more beneficial to long-term carbon impacts. Other cloud-based jobs where latency requirements prevent
           shifting resources will continue to drive PPA/REC purchasing, and consequently renewable energy demand.

           D ImageNet Experiments

           We load pre-trained models available through PyTorch Hub (see https://pytorch.org/hub) – namely
           AlexNet (Krizhevsky et al., 2012), DenseNet (Huang et al., 2017), GoogLeNet (Szegedy et al., 2015), HardNet (Chao
           et al., 2019), MobileNetv2 (Sandler et al., 2018), ShuffleNet (Zhang et al., 2018), SqueezeNet (Iandola et al., 2016),
           VGG (Simonyan and Zisserman, 2014), and Wide ResNets (Zagoruyko and Komodakis, 2016). We run 50,000 rounds
           of inference on a single image through pre-trained image classification models and run an analysis similar to Canziani et al.
           (2016). We repeat experiments on 4 random seeds.

             34 https://www.epa.gov/greenpower/renewable-energy-certificates-recs
             35 We note that this process is likely similar for most cloud providers, but Google is the most open with their methodology, so we
           are able to gain more insight from the materials they publish. Information described here is mainly put together from Google (2016)
           and Google (2013).
             36 https://static.googleusercontent.com/media/www.google.com/en/us/green/pdfs/renewable-energy.pdf

           We count flops and parameters using the thop package, https://github.com/Lyken17/pytorch-OpCounter (for package
           version numbers see automated logs in the online appendix linked above).
           Code for running the experiment is available at:
           https://github.com/Breakend/ClimateChangeFromMachineLearningResearch/blob/master/paper_specific/run_inference.py
           An online appendix showing all per-experiment details can be seen here:
           https://breakend.github.io/ClimateChangeFromMachineLearningResearch/measuring_and_mitigating_energy_and_carbon_footprints_in_machine_learning/

           The plot of FPOs versus runtime can be seen in Figure 8, and plots against the number of parameters can be seen in Figure 9.
           The number of parameters similarly has no strong correlation with energy consumption (R² = 0.002, Pearson 0.048)
           nor with time (R² = 0.14, Pearson 0.373). We note that our runtime results likely differ from Canziani et al. (2016) due to
           the architectural differences in the model sets we use.
           For parameter plots, see Figure 9; for extended time and energy figures, see Figure 8.
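
           For readers who want to recompute such correlation values from the per-experiment data, the two statistics can be
           sketched as below. This is a minimal illustration, not the analysis code: it assumes that the reported R² comes from
           a one-variable least-squares fit with an intercept, in which case it equals the square of the Pearson coefficient
           (the reported pairs are consistent with this, e.g. 0.373² ≈ 0.14). The function names and the toy numbers are
           hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def r_squared_linear(xs, ys):
    """R^2 of a simple linear fit y ~ a + b*x, which equals pearson_r squared."""
    return pearson_r(xs, ys) ** 2


# Toy, made-up values (NOT measurements from the paper): parameter counts in
# millions and energy in kWh for four imaginary models.
params_millions = [1.0, 2.0, 3.0, 4.0]
energy_kwh = [2.0, 4.0, 6.0, 8.0]

r = pearson_r(params_millions, energy_kwh)
r2 = r_squared_linear(params_millions, energy_kwh)
print(f"Pearson = {r:.3f}, R^2 = {r2:.3f}")
```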

                                <<FIGURE>>

           Figure 8. We seek to investigate the connection between FPOs, energy usage, and experiment time, similarly to Canziani
           et al. (2016). We run 50,000 rounds of inference on a single image through pre-trained image classification models
           available through PyTorch Hub (see https://pytorch.org/hub) – namely (Krizhevsky et al., 2012; Huang et al.,
           2017; Szegedy et al., 2015; Chao et al., 2019; Sandler et al., 2018; Zhang et al., 2018; Iandola et al., 2016; Simonyan
           and Zisserman, 2014; Zagoruyko and Komodakis, 2016). We record experiment time and the kWh of energy used to run
           the experiments and repeat experiments 4 times, averaging results. We find that FPOs are not strongly correlated with
           energy consumption (R² = 0.083, Pearson 0.289) nor with time (R² = 0.005, Pearson 0.074). Number of parameters
           (plotted in Appendix) similarly have no strong correlation with energy consumption (R² = 0.002, Pearson 0.048), nor
           time (R² = 0.14, Pearson 0.373). We note, however, that within an architecture correlations are much stronger. For
           example, only considering different versions of VGG, FPOs are strongly correlated with energy (R² = 0.999, Pearson
           1.0) and time (R² = 0.998, Pearson 0.999). See Appendix for experiment details, code, and data links. Our runtime
           results likely differ from Canziani et al. (2016) due to the architectural differences in the model sets we use.

           E Estimation Methods

           We use our PPO Pong experiment (see Appendix F for more details) as the experiment under comparison. For carbon
           emission estimates, we use three estimation methods: realtime emissions data for California (collected by our framework
           from caiso.org) times the power usage at that time, integrated over the length of the experiment; multiplying total
           energy usage recorded by our method by the California average carbon intensity; and multiplying total energy usage
           recorded by our method by the EPA US average carbon intensity (Strubell et al., 2019). For energy estimates, we use:
           (1) the experiment time multiplied by the number of GPUs, a utilization factor of 1/3 or 1, and the Thermal Design
           Power (TDP) – which can be thought of as the maximum Watt draw – of the GPU (Amodei and Hernandez, 2018); (2)
           the measured GPU-hrs of our tool multiplied by the TDP; a rough calculation of PFLOPs-hr (following the methodology

                                                  <<FIGURE>>

           Figure 9. The same experiments as in Figure 3, plotting parameters as the varying factor instead. See Figure 3 for
           correlation values.


           of Amodei and Hernandez, 2018) by the PFLOPs/TDP of the GPU; (3) our tool’s accounting method which tracks
           energy from GPU readings, accounts for CPU time/energy, and measures utilization in realtime.
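           The TDP-based energy estimate and the carbon estimates described above reduce to simple products. The sketch below illustrates them; the function names and all numeric inputs (TDP, carbon intensities, power samples) are hypothetical placeholders, not values from these experiments.

```python
def tdp_energy_kwh(hours, num_gpus, tdp_watts, utilization=1.0):
    """Energy estimate: time x number of GPUs x utilization factor x TDP."""
    return hours * num_gpus * utilization * tdp_watts / 1000.0

def average_carbon_kg(energy_kwh, intensity_g_per_kwh):
    """Carbon estimate from total energy and an average carbon intensity."""
    return energy_kwh * intensity_g_per_kwh / 1000.0

def realtime_carbon_kg(power_watts, intensity_g_per_kwh, interval_hours):
    """Carbon estimate that integrates power x realtime intensity over time.

    Both lists hold one sample per interval of length `interval_hours`.
    """
    total_g = 0.0
    for watts, intensity in zip(power_watts, intensity_g_per_kwh):
        total_g += (watts / 1000.0) * interval_hours * intensity
    return total_g / 1000.0

# Hypothetical run: 2 hours on one GPU with an assumed 250 W TDP.
full_util = tdp_energy_kwh(2.0, 1, 250.0, utilization=1.0)
third_util = tdp_energy_kwh(2.0, 1, 250.0, utilization=1.0 / 3.0)
avg_kg = average_carbon_kg(full_util, 400.0)
rt_kg = realtime_carbon_kg([1000.0, 1000.0], [200.0, 400.0], 0.5)
```

           The utilization factor only scales the TDP-based estimate, so the 1/3 and 1 variants differ by exactly a factor of three.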

           F Reinforcement Learning

           We investigate the energy efficiency of four baseline RL algorithms: PPO (Hill et al., 2018; Schulman et al., 2017),
           A2C (Hill et al., 2018; Mnih et al., 2016), A2C with VTraces (Espeholt et al., 2018; Dalton et al., 2019), and DQN (Hill
           et al., 2018; Mnih et al., 2016). We evaluate on PongNoFrameskip-v4 (left) and BreakoutNoFrameskip-v4 (right), two
           common evaluation environments included in OpenAI Gym (Bellemare et al., 2013; Brockman et al., 2016; Mnih et al.,
           2013).
           We train for only 5M timesteps, less than prior work, to encourage energy efficiency (Mnih et al., 2016, 2013). We use
           default settings from code provided in stable-baselines (Hill et al., 2018) and cule (Dalton et al., 2019); we only modify
           evaluation code slightly. Modifications can be found here:

                 • https://github.com/Breakend/rl-baselines-zoo-1 (for stable-baselines modifications)
                 • https://github.com/Breakend/cule (for cule modifications)

           Since we compare both on-policy and off-policy methods, for fairness all evaluation is based on 25 separate rollouts
           completed every 250k timesteps. This is to ensure parity across algorithms. We execute these in parallel together as
           seen in the cule code: https://github.com/Breakend/cule/blob/master/examples/a2c/test.py
           While average return across all evaluation episodes (e.g., averaging together the step at 250k timesteps and every
           evaluation step until 5M timesteps) can be seen in the main text, the asymptotic return (for the final round of evaluation
           episodes) can be seen in Figure 10. Plots comparing experiment runtime to asymptotic and average returns (respectively)
           can be seen in Figure 11 and Figure 12.
           Our online leaderboard can be seen at:
           https://breakend.github.io/RL-Energy-Leaderboard/reinforcement_learning_energy_leaderboard/index.html
           We note that while DQN underperforms as compared to PPO here, better hyperparameters may be found such that DQN
           is the more energy efficient algorithm. Moreover, we only use the 5M samples regime, whereas prior work has used
           10M or more samples for training, so DQN results seen here would correspond to earlier points in training in other
           papers.

                                                  <<FIGURE>>

                              Figure 10. Pong (left) and Breakout (right) asymptotic return.

                                          <<FIGURE>>

                 Figure 11. Pong (left) and Breakout (right) as a function of experiment length and asymptotic return.

                                                              <<FIGURE>>

                  Figure 12. Pong (left) and Breakout (right) as a function of experiment length and average return.
<|endoftext|>


<|startoftext|>
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design 

Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler
NVIDIA, Santa Clara, CA 95050
{mrhu, ngimelshein, jclemons, azulfiqar, skeckler}@nvidia.com  

Abstract

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.

I. INTRODUCTION 
Deep neural networks (DNNs) have recently been successfully deployed in various application domains such as computer vision [1], speech recognition [2], and natural language processing [3] thanks to their superior performance compared to traditional state-of-the-art approaches. Such proliferation of deep learning techniques has led several software frameworks to be developed in recent years to analyze and facilitate the design of neural networks [4, 5, 6, 7]. The list of available frameworks continues to expand with developers constantly adding more features and improving computational efficiency to foster research in the area of deep learning. Due to the tremendous compute horsepower offered by graphics processing units (GPUs), these frameworks provide strong backend support for GPU software libraries such as cuDNN [8]. In fact, almost every group today involved in training neural networks is deploying GPUs for accelerated deep learning [9].
While these popular machine learning (ML) frameworks facilitate the study of DNNs, a major limitation of the use of these frameworks is that the DRAM capacity limits of the GPU(s) in the system eventually limit the size of the DNN that can be trained (Section II-C). To work around the memory capacity bottleneck [10, 11], ML practitioners must either use less desirable DNN architectures (e.g., smaller number of layers, smaller batch sizes, less performing but more memory-efficient convolutional algorithms) or parallelize the DNN across multiple GPUs [12]. Figure 1 highlights how the memory consumption trends of the ImageNet [13] winning DNNs have evolved over time. AlexNet [1], for instance, only contained 5 convolutional layers with 2 fully-connected layers and required a "mere" 1.1 GB of memory allocation for training, which is well below the 12 GB memory capacity of the state-of-the-art NVIDIA Titan X. The more recent VGG-16 [14], on the other hand, contains 16 convolutional layers and 3 fully-connected layers, incurring a total of 28 GB of memory usage for batch size 256. Because a single GPU can only accommodate a batch size of 64 for VGG-16, training with batch 256 requires parallelization across multiple GPUs or the network must be sequentially executed multiple times with smaller batches. With the most recent ImageNet winning network adopting more than a hundred convolutional layers [15], the trend in deep learning is to move towards larger and deeper network designs [14, 16, 17, 18]. As a result, alleviating the rigid physical memory limitations of GPUs is becoming increasingly important.

Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016.

<<FIGURE>>

Fig. 1: GPU memory usage when using the baseline, network-wide allocation policy (left axis). The right axis shows the maximum fraction of this baseline allocation actually utilized when traversing through the network layer-wise. The numbers next to the names of each network refer to the batch size throughout this paper. Studied DNNs are detailed in Section IV-C.
In this paper, we propose virtualized Deep Neural Network (vDNN), a runtime memory management solution that virtualizes the memory usage of deep neural networks across both GPU and CPU memories. Our vDNN allows ML practitioners to deploy larger and deeper networks beyond the physical capacity of available GPUs, enabling them to focus more on their algorithms while the system architecture and runtime system transparently manage the allocation, placement, movement, and release of their data. The motivation behind vDNN is based on the following three key observations:

1) DNNs trained via stochastic gradient-descent (SGD) are designed and structured with multiple layers [19];
2) the training of these neural networks involves a series of layer-wise computations, the order of which is statically fixed and repeated for millions to billions of iterations throughout the entire training process; and
3) even though the GPU can, at any given time, only process a single layer's computation (due to the layer-wise computational characteristics of SGD-based DNN training), popular ML frameworks adopt a network-wide memory allocation policy because DNN training requires the intermediate feature maps of all the layers in the network to be backed up in GPU memory for gradient updates (Section II-C).

In other words, existing memory management schemes overprovision the memory allocations to accommodate the usage of the entire network's layers, even though the GPU is only using a subset of this allocation for the layer-wise requirements. We observe that this memory underutilization issue becomes more severe for deeper networks, leading to 53% to 79% of allocated memory not being used at all at any given time (Figure 1). The goal of vDNN is to conservatively allocate GPU memory for the immediate usage of a given layer's computation so that the maximum and average memory usage is drastically reduced, allowing researchers to train larger networks. To achieve this goal, vDNN exploits the data dependencies of allocated data structures, particularly the intermediate feature maps that account for the majority of memory usage (Section II-C), and either releases or moves these intermediate data between GPU and CPU memory. Specifically, vDNN either 1) aggressively releases these feature maps from the GPU memory if no further reuse exists, or 2) offloads (and later prefetches) to (from) CPU memory if further reuse does exist but is not immediately required.
By exploiting the inter-layer memory access and reuse patterns of DNNs, our vDNN memory manager intelligently overlaps the normal DNN computations with the offload/prefetch/release operations, effectively virtualizing the memory usage of DNNs with little to no performance loss. The operations of vDNN are completely transparent to programmers and enable them to train larger and deeper neural networks that consume memory well beyond the limits of physical memory of GPUs today. The key contributions of our work are: 
• This work is the first to present a detailed, quantitative analysis on GPU-based DNN training, as opposed to recent literature targeting energy-efficient accelerators for DNN inference [20, 21, 22, 23, 24, 25, 26, 27, 28, 29].

• To the best of our knowledge, our work is the first that provides an in-depth characterization study on the memory access characteristics of DNNs and their effect on the GPU memory system from an architectural perspective.

• This work identifies the key limitations of current ML frameworks' memory management policies as they require the network-wide memory usage of the target DNN to monolithically fit within the physical capacity of the GPU. We demonstrate this by showing that existing frameworks fail in training 6 out of the 10 studied DNNs when their memory allocation size (14 GB to 67 GB) exceeds the GPU memory budget (12 GB in NVIDIA's Titan X).

• We propose, implement, and evaluate a runtime memory manager called vDNN that virtualizes the memory usage of neural networks across CPU and GPU memories. Our vDNN solution reduces the average GPU memory usage of these 6 memory hungry networks by 73% to 98%, allowing them to be trained on a single Titan X card. Compared to a hypothetical, oracular GPU containing enough memory to hold the entire DNN, vDNN incurs 1% to 18% performance overhead.

II. BACKGROUND AND MOTIVATION 

This section provides an overview of modern DNNs, the memory management policies of current ML frameworks, and their key limitations that motivate this work. 

A. DNN Architecture 
Convolutional neural networks are one of the most popular ML algorithms for high accuracy computer vision tasks. While other types of networks are also gaining traction (e.g., recurrent neural networks for natural language processing), all of these DNNs are trained using a backward propagation algorithm [19] via stochastic gradient-descent (SGD). For clarity of exposition and owing to their state-of-the-art performance in the ImageNet competition, this paper mainly focuses on the feedforward style convolutional neural networks commonly seen in AlexNet [1], OverFeat [30], GoogLeNet [17], and VGG [14]. However, the key intuitions of our work are equally applicable to any neural network that exhibits layer-wise computational characteristics and is trained via SGD, detailed later in this section.
DNNs are designed using a combination of multiple types of layers, which are broadly categorized as convolutional layers (CONV), activation layers (ACTV), pooling layers (POOL), and fully-connected layers (FC). A neural network is structured as a sequence of multiple instances of these layers. DNNs for computer vision tasks in particular are broadly structured into the following two modules: 1) the feature extraction layers that detect distinguishable features across input images, and 2) the classification layers that analyze the extracted features and classify the image into a given image category. Feature extraction layers are generally designed using CONV/ACTV/POOL layers and are positioned as the initial part of the DNN. The classification layers are built up using the FC layers and are found at the end of the DNN computation sequence. The general trend in deep learning is to design the network with a large number of feature extraction layers so that a deep hierarchy of features is trained for robust image classification [14, 15, 17].

Fig. 2: Memory allocations required for linear networks using the baseline memory manager (bold arrows). For inference, the sum of all green (W) and red (X) arrows are allocated. For training, two additional data structures for dX and dY are required: both are sized to the maximum of all blue (dY) arrows and are reused while traversing back the layers during backward propagation. An optional temporary buffer, called workspace in cuDNN [8] (yellow arrow, WS), is needed in certain convolutional algorithms. The workspace buffer is sized with the maximum workspace requirement among all layers and is reused during backward propagation. 
B. DNN Training vs. Inference 
A neural network needs to be trained before it can be deployed for an inference or classification task. Training entails learning and updating the weights of the layers of a neural network by performing the operations of forward and backward propagation algorithms [19]. The direction of traversal, as well as the mathematical operations that must be performed, differ for forward and backward propagation. 
Forward Propagation. Forward propagation is performed from the first (input) layer to the last (output) layer, whereas backward propagation is performed in the opposite direction (last to first layer), from right to left in Figure 2. Intuitively, forward propagation traverses the network layer-wise and performs the aforementioned feature extraction and classification tasks on a given input, leading to an image classification. During forward propagation, each layer applies a mathematical operation to its input feature maps (X) and stores the results as output feature maps (Y). For linear feedforward DNNs, the resulting Y of layer(n-1) is directly used as the input X by layer(n) (Figure 2). The computation flow of forward propagation is therefore a serialized process, as layer(n) can initiate its operation only when the preceding layer(n-1) has finished its computation and forwarded its output Y to layer(n)'s input X. Non-linear network topologies can contain one-to-many (fork) and many-to-one (join) inter-layer dependencies, but forward propagation still involves a series of layer-wise computations as detailed in Figure 3. Note that the GPU can only process a single layer's computation at any given time due to such inter-layer data dependencies. As a result, the minimum per layer memory allocations required are determined by the layer's input-output relationships and its mathematical function¹. For instance, a CONV layer using the most memory-efficient convolutional algorithm (e.g., implicit GEMM in cuDNN [8]²) requires three data structures, the input/output feature maps (X and Y) and the weights of the layer (W) for forward propagation. Employing a fast Fourier transform (FFT) based convolution algorithm, however, requires an additional, temporary workspace (WS) buffer to manage transformed maps.

¹ Popular activation functions (sigmoid/tanh/ReLU [1]) can be refactored into an in-place algorithm using element-wise computation. Both Caffe and Torch leverage this in-place memory optimization and only allocate memory space for Y and dY for forward (Y) and backward (both Y and dY) propagation [31]. This paper adopts this in-place optimization for both baseline and vDNN for a conservative evaluation.

<<FIGURE>>

Fig. 3: (a) The computation graph and its inter-layer dependencies of a GoogLeNet-style, non-linear feedforward network during forward propagation. Refcnt refers to the number of consumer layers that depend on the current, producer layer's Y. The order in which the GPU processes each layer's forward computation is shown in (b), from layer(1) to layer(5), highlighting the layer-wise computation of DNN training. The producer-consumer relationship is reversed during backward propagation.
Backward Propagation. For DNNs that are not fully trained, the inferred image category might be incorrect. As a result, a loss function is used to derive the magnitude of the inference error at the end of forward propagation. Specifically, the gradient of the loss function is derived with respect to the last layer(N)'s output: 

<<FORMULA>>            (1)

The value in Equation 1 is forwarded to the last layer(N) as its input gradient maps (dY), and the output gradient maps (dX) are derived based on the chain rule [19]: 
 
<<FORMULA>>                      (2)

Because the output <<FORMULA>> is the product of the input <<FORMULA>> with <<FORMULA>>, deriving the value of dX for layer(N) generally requires memory for both its input/output gradient maps (dY and dX) and also the input/output feature maps (X and Y) for this layer. For linear networks, the calculated dX of layer(N) is directly passed on to the preceding layer(N-1) to be used as dY for layer(N-1)'s dX derivation (Figure 2).
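Under notation assumed here (Loss for the loss function, X^(N) and Y^(N) for the input and output feature maps of layer(N)), Equations 1 and 2 can be read as the following sketch. It follows the prose description (the gradient with respect to the last layer's output, then the chain rule producing dX from dY) and is not notation confirmed by the source.

```latex
% Equation 1: gradient of the loss with respect to layer(N)'s output, used as dY
\frac{\partial \mathit{Loss}}{\partial Y^{(N)}} \tag{1}

% Equation 2: chain rule giving dX of layer(N) from its dY
\frac{\partial \mathit{Loss}}{\partial X^{(N)}}
  = \frac{\partial \mathit{Loss}}{\partial Y^{(N)}}
    \cdot \frac{\partial Y^{(N)}}{\partial X^{(N)}} \tag{2}
```

In this reading, the left-hand side of Equation 2 is dX, the first factor on the right is dY, and the second factor depends on the layer's own X and Y, which is why all four data structures are needed during a layer's backward computation.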

² cuDNN (version 4.0) provides six different convolutional algorithms. Implicit GEMM requires the least memory allocation as no additional workspace is needed. FFT-based convolutional algorithms on the other hand incur larger memory allocations because of the additional data structures required to store the feature maps transformed into the frequency domain. More details are available in [8, 32].

<<FIGURE>>

Fig. 4: Breakdown of GPU memory usage based on its functionality (left axis). The right axis shows the fraction of allocated memory consumed by feature maps. 
This chain rule is similarly used to derive the gradients of the weights to update the network model. 

Similar to forward propagation, backward propagation is also performed layer-wise to the respective incoming gradient maps, dYs. Once backward propagation reaches the first layer, the weights are adjusted using the weight gradients so that the prediction error is reduced for the next classification task. Hence, training a network involves both forward and backward propagation, which are repeated for millions to billions of iterations. Because of the stochastic nature of SGD-based backward propagation, the network input is generally batched with hundreds of images (e.g., 128 and 256 images for best performing AlexNet and VGG-16), which increases memory allocation size but helps the network model better converge to an optimal solution. 
C. Motivation: Scalable and Memory-Efficient DNN Design 
To aid the design and deployment of neural networks, a variety of ML frameworks have been developed in recent years, including Caffe, Torch, Neon, TensorFlow, and Theano [9]. The rich set of features offered by these frameworks coupled with their ability to accelerate DNN training and inference using GPUs greatly simplifies the process of implementing neural networks. Despite their flexibility, popular ML frameworks suffer from severe limitations in the way they allocate and manage memory.
To illustrate the shortcomings of ML frameworks in managing memory, consider the example shown in Figure 2. When training a DNN using existing ML frameworks, the memory required across all of the layers of the network must fit within the physical GPU memory capacity. The key reason for this GPU-side, network-wide memory allocation strategy is to reap performance benefits. More specifically, page-migration based virtualization solutions that expose both CPU and GPU memory for page allocations (regardless of whether the virtualization feature is provided by future CUDA runtime extensions or programming models such as OpenMP (4.0) [33]) must transfer pages via PCIe, which involves several latency-intensive processes such as CPU interrupts for system calls, page-table updates, TLB updates/shootdowns, and the actual page transfer. Prior work [34] reported that the latency to page-in a single 4 KB page to the GPU is 20 to 50 µs, meaning the PCIe bandwidth utilization using page-migration is 80 to 200 MB/sec, as opposed to the DMA initiated cudaMemcpy that achieves an average 12.8 GB/sec out of the 16 GB/sec maximum PCIe bandwidth. As the amount of data to be paged in/out via PCIe can be 10s of GBs for very deep networks (Figure 15), ML frameworks will suffer from huge performance penalties when relying on page-migration for training DNNs.

Fig. 5: Per layer memory usage of VGG-16 (256). For brevity, we only show the memory usage during forward propagation and for layers that contain weights (CONV and FC). Left axis corresponds to the sum of workspace and per layer input/output feature maps. The right axis corresponds to the memory consumption for storing weights. The memory usage during backward propagation follows similar trends to this figure.
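The page-migration bandwidth range quoted above follows directly from dividing the page size by the per-page latency. A small worked sketch is given below; the 10 GB transfer volume is a hypothetical figure chosen only to illustrate the gap, and the function names are illustrative.

```python
PAGE_BYTES = 4 * 1024  # one 4 KB page

def page_migration_mb_per_s(latency_us):
    """Effective bandwidth when each 4 KB page costs `latency_us` microseconds."""
    return PAGE_BYTES / (latency_us * 1e-6) / 1e6

def transfer_seconds(num_bytes, bytes_per_s):
    """Time to move `num_bytes` at a sustained rate of `bytes_per_s`."""
    return num_bytes / bytes_per_s

slow = page_migration_mb_per_s(50)  # about 82 MB/s
fast = page_migration_mb_per_s(20)  # about 205 MB/s

# Hypothetical 10 GB of feature-map traffic over PCIe.
dma_s = transfer_seconds(10e9, 12.8e9)        # DMA-initiated copy at 12.8 GB/s
paged_s = transfer_seconds(10e9, fast * 1e6)  # best-case page migration
```

Even at the optimistic 20 µs latency, page migration in this sketch is more than 60 times slower than the DMA copy, consistent with the rounded 80 to 200 MB/sec range in the text.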
Note that because of the layer-wise gradient update rule of the backward propagation algorithm (property of the chain rule, Section II-B), each layer's feature maps (X) are later reused during its own backward propagation pass. This means that all Xs must still be available in GPU memory until backward computation is completed. Figure 4 shows the amount of memory usage based on its functionality and the growing significance of feature maps as networks become deeper. Because deeper networks need to keep track of a larger number of Xs, the fraction of memory allocated for feature maps grows monotonically as the number of layers increases. Training the network itself is still done layer-wise, however, regardless of the depth of the neural network. The baseline network-wide memory allocation policy is therefore both extremely wasteful and not scalable because it does not take into account the layer-wise DNN training. Figure 5 shows the per layer memory usage of VGG-16 during forward propagation, which provides the following key observations. First, the intermediate feature maps and workspace (left axis) incur an order of magnitude higher memory usage compared to the weights (right axis) of each layer. Second, most of these intermediate data structures are concentrated on the feature extraction layers and are less significant in the later classifier layers. Third, the weights, while smaller in size compared to these intermediate data, are mostly concentrated on the classifier layers due to their full connectivity. Lastly, the per layer memory usage is much smaller than the 28 GB of memory required by the baseline policy (Figure 1), showing significant opportunities for memory savings with a fine-grained, layer-wise memory management policy.

<<FIGURE>>

Fig. 6: VGG-16's per layer computation latency for forward and backward propagation (left axis). Right axis shows the reuse distance of each layer's input feature maps, X. We define the reuse distance of a layer(n)'s X as the latency between the completion of layer(n)'s forward propagation and the start of the same layer(n)'s backward propagation.
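The gap between a network-wide allocation and the largest single-layer requirement can be illustrated with a small accounting sketch. It covers forward propagation of a linear network only (no dX/dY buffers), and the layer names and sizes are hypothetical, not measurements from the paper.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    x_mb: float         # input feature maps (X)
    y_mb: float         # output feature maps (Y)
    w_mb: float         # weights (W)
    ws_mb: float = 0.0  # optional workspace (WS)

def network_wide_mb(layers):
    """Baseline policy: every X and W stays resident; one shared workspace."""
    # In a linear network layer(n)'s Y is layer(n+1)'s X, so each X is
    # counted once and only the final Y is added on top.
    feature_maps = sum(l.x_mb for l in layers) + layers[-1].y_mb
    weights = sum(l.w_mb for l in layers)
    workspace = max(l.ws_mb for l in layers)
    return feature_maps + weights + workspace

def per_layer_peak_mb(layers):
    """Layer-wise view: the most any single layer needs while it executes."""
    return max(l.x_mb + l.y_mb + l.w_mb + l.ws_mb for l in layers)

# Hypothetical three-layer network (sizes in MB).
net = [
    Layer("conv1", x_mb=100, y_mb=400, w_mb=1, ws_mb=50),
    Layer("conv2", x_mb=400, y_mb=200, w_mb=5, ws_mb=50),
    Layer("fc", x_mb=200, y_mb=1, w_mb=300),
]
baseline = network_wide_mb(net)
peak = per_layer_peak_mb(net)
```

With these made-up sizes the baseline reserves noticeably more memory than any single layer uses, and the gap widens as more layers (and hence more resident Xs) are added.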

III. VIRTUALIZED DNN 
The design objective of our virtualized DNN (vDNN) memory manager is to virtualize the memory usage of DNNs, using both GPU and CPU memory, while minimizing its impact on performance. vDNN is completely transparent to the programmer as the allocation, placement, movement, and release of data is seamlessly orchestrated by the system architecture and the runtime system. Such abstraction enables ML practitioners to focus more on their ML algorithm and not have to worry about the low level details of GPU memory management. vDNN primarily optimizes the memory usage of the feature extraction layers as the majority of memory usage is concentrated on these layers, accounting for 81% of memory usage on AlexNet and 96% on VGG-16 (256). More specifically, we target the feature maps of these feature extraction layers as these intermediate data structures account for the majority of GPU memory usage (Figure 4 and Figure 5). The intuitions of vDNN can also be applied to weights and to the classification layers, but with less of a memory saving benefit.

A. Design Principle 
Previous sections highlighted the fact that the memory requirement per individual layer is substantially smaller than what is actually provisioned with the baseline, network-wide memory allocation policy. vDNN adopts a sliding-window based, layer-wise memory management strategy in which the runtime memory manager conservatively allocates memory from its memory pool for the immediate usage of the layer that is currently being processed by the GPU. Intermediate data structures that are not needed by the current layer are targeted for memory release to reduce memory usage. 
Forward Propagation. As discussed in Section II-C, deep networks have to keep track of a large number of the intermediate feature maps (Xs) that are extracted during forward propagation. Once a given layer(n)'s forward computation is complete, however, layer(n)'s X is not reused until the GPU comes back to the same layer(n)'s corresponding backward computation. Because the reuse distance of layer(n)'s X is on the order of milliseconds to seconds (e.g., more than 60 ms and 1200 ms for the first layer of AlexNet and VGG-16 (64), respectively), deep networks end up allocating a significant number of Xs that effectively camp inside the GPU memory without immediate usage (Figure 6). As a result, tackling these Xs for memory optimization is crucial for efficient utilization of GPU memory as these intermediate data account for a significant fraction of memory allocations (Figure 4). vDNN therefore conditionally offloads these intermediate Xs to CPU memory via the system interconnect (e.g., PCIe, NVLINK [35]) if they are targeted for memory release. Section III-C details the vDNN memory transfer policy that decides which layers are chosen for offloading their X. Once the offload operation is complete, vDNN releases the offloaded X from the memory pool to reduce GPU memory usage.

<<FIGURE>>

Fig. 7: Execution flow of a linear network during forward propagation. The figure assumes that layer(N) is currently being processed by the GPU. During this layer's forward computation, the data associated with the arrows marked with black Xs (all preceding layer's input feature maps) are not used and can safely be released from the memory pool.

<<FIGURE>>

Fig. 8: Execution flow of a linear network during backward propagation. The figure assumes that layer(2) is currently being processed by the GPU. Data associated with the arrows marked with black Xs can safely be released because they will not be reused during the training of this input image batch.
Care must be taken, however, when evaluating the feasibility of offloading a layer's input X. This is because, for non-linear network topologies, multiple layers can be the consumers of a previously computed layer's output feature maps (Y). For instance, layer(2) and layer(3) in Figure 3 are both using the output Y of layer(1) as their input X. Offloading and consequently releasing the input X of layer(2), before reaching layer(3)'s forward computation, is problematic as these two layers share the same data structure for the input X. vDNN therefore keeps track of the inter-layer dependencies in the form of a dataflow graph (e.g., Refcnt in Figure 3) and allows the offload/release operation to be initiated only when the currently processing layer is the last consumer of its input feature maps. Figure 7 is an example execution flow of a linear DNN during forward propagation, highlighting when it becomes safe to release a layer's X.

<<FIGURE>>

Fig. 9: Performance effect of offload and prefetch. FWD(n) and BWD(n) are the forward and backward computations for layer(n), respectively. OFF(n) is the offloading of layer(n)'s X and PRE(n) is the corresponding prefetch operation for layer(n).
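The last-consumer rule described above can be sketched with a simple reference count per producer layer. The graph below is a hypothetical fork/join topology in the spirit of Figure 3, not the exact graph from the figure, and the function name is illustrative.

```python
def release_schedule(inputs_of, order):
    """For each layer in execution order, list the producers whose output Y
    (the shared input X of their consumers) becomes safe to offload/release
    once that layer's forward computation is done.

    inputs_of maps a layer id to the producer layers it reads from.
    """
    # Refcnt of a producer = number of consumer layers that read its Y.
    refcnt = {}
    for layer in order:
        for producer in inputs_of.get(layer, []):
            refcnt[producer] = refcnt.get(producer, 0) + 1

    schedule = {}
    for layer in order:
        freed = []
        for producer in inputs_of.get(layer, []):
            refcnt[producer] -= 1
            if refcnt[producer] == 0:  # current layer was the last consumer
                freed.append(producer)
        schedule[layer] = freed
    return schedule

# Hypothetical graph: layer 1 forks into layers 2 and 3, which join at 4.
inputs_of = {2: [1], 3: [1], 4: [2, 3], 5: [4]}
schedule = release_schedule(inputs_of, order=[1, 2, 3, 4, 5])
```

In this example layer 1's output is not released after layer 2 finishes, because layer 3 still needs it; it becomes releasable only after layer 3, its last consumer.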
Backward Propagation. Similar to forward propagation, vDNN aggressively releases data structures that are not needed for training the remaining layers' backward computation. During layer(n)'s backward propagation, layer(n+1)'s Y and dY are no longer required because the GPU has already completed the gradient updates for this layer (Figure 8). Again, by leveraging the layer-wise DNN backward propagation, vDNN immediately frees up a layer's Y and dY once this layer's backward computation is complete. X and dX are not released as the preceding layer's backward propagation will need these values for gradient derivation. Note that if a layer has offloaded its X to host memory, vDNN should guarantee that the offloaded data is copied back to GPU memory before the gradient update is initiated. Naively copying back the data on-demand will serialize the backward computation behind the memory copying operation of X. vDNN therefore launches a prefetch operation for layer(n)'s offloaded feature maps, which is overlapped with layer(m)'s backward computation, with n<m, so that prefetching is launched before its actual usage, hiding prefetching latency.

B. Core Operations And Its Design 

vDNN is prototyped as a layer on top of cuDNN [8]. Each layer keeps track of the cross-layer data dependencies of input/output feature maps so that the vDNN offload and release operations are properly scheduled. vDNN employs two separate CUDA streams [36] to overlap normal DNN computations with the memory allocation, movement, and release operations of vDNN. stream_compute is the CUDA stream that interfaces to the cuDNN handle and sequences all the layers' forward and backward computations. stream_memory manages the three key components of vDNN: the memory allocation/release, offload, and prefetch.
Memory Allocation/Release. The CUDA library only supports synchronous memory (de)allocations, meaning that any calls to cudaMalloc() or cudaFree() will enforce an additional synchronization across all the GPUs within a node. To safely enable vDNN memory operations without falling into the pitfalls of synchronous CUDA APIs, we employ the open-source asynchronous memory allocation/release API library distributed by NVIDIA [37]. When the program launches, the vDNN memory manager is allocated with a memory pool that is sized to the physical GPU memory capacity. Whenever vDNN allocates (and releases) data structures, the underlying memory manager will reserve (and free) memory regions from this memory pool without having to call cudaMalloc() or cudaFree().
Memory Offload. Offloading input feature maps is one of the key enablers of vDNN's memory savings. When a layer is chosen for offloading, vDNN first allocates a pinned host-side memory region using cudaMallocHost(). stream_memory then launches a non-blocking memory transfer of this layer's X to the pinned memory via PCIe using cudaMemcpyAsync(), overlapping it with the same layer's forward computation in cuDNN. The current implementation of vDNN synchronizes stream_compute and stream_memory at the end of each layer's forward computation if stream_memory has offloaded its feature maps. This approach guarantees that the offloaded data is safely released from the memory pool before the next layer begins forward computation, maximizing the memory saving benefits of offloading. Because the Xs of CONV and POOL layers are read-only data structures, overlapping layer(n)'s offload operation with the same layer's forward propagation does not create any correctness issues. ACTV layers are already refactored into an in-place algorithm and only use Y and dY for gradient updates, obviating the need for memory offloading (Section II-B). Figure 9 provides an overview of vDNN's offload operation. Here, the baseline system is able to immediately launch layer(2)'s forward computation once layer(1) is complete. Under vDNN, the execution of layer(2) is stalled because stream_compute must wait until the offloading operation of stream_memory is complete. The computation of layer(3) is not delayed, however, because the offload latency for layer(2) is completely hidden inside the latency of the same layer's forward propagation. 
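The effect of the end-of-layer synchronization can be sketched with a small timing model in Python. The function name and the latency values are hypothetical; the only modeling assumption is that a layer which offloads its X lasts as long as the slower of its forward computation and its overlapped transfer.

```python
def forward_time_with_offload(layers):
    """layers: list of (compute_ms, offload_ms); offload_ms is 0 if not offloaded.

    Returns (baseline_ms, vdnn_ms) for one forward pass.
    """
    baseline = sum(compute for compute, _ in layers)
    # The synchronization at the end of each layer means the next layer starts
    # only after both the computation and the offload transfer have finished.
    vdnn = sum(max(compute, offload) for compute, offload in layers)
    return baseline, vdnn


# Hypothetical latencies: layer(1)'s offload outlasts its computation and
# stalls layer(2); layer(2)'s offload is fully hidden; layer(3) does not offload.
baseline_ms, vdnn_ms = forward_time_with_offload([(2.0, 3.0), (5.0, 4.0), (1.0, 0.0)])
stall_ms = vdnn_ms - baseline_ms
```

In this toy trace the only stall comes from the first layer, mirroring the situation described for Figure 9.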
Memory Prefetch. Similar to offloading, prefetching the offloaded Xs back to GPU memory is implemented using cudaMemcpyAsync() to overlap data transfers with the computations of backward propagation. However, stream_memory launches prefetch operations in the reverse order relative to the offload operations from forward propagation (Figure 9). As mentioned in Section III-A, the general rule of prefetching is to overlap the memory copy operation of layer(n)'s offloaded data with layer(m)'s backward computation, with layer ID m always being higher than n to maximize the benefit of both prefetching and latency hiding. In other words, when the GPU starts the backward propagation of layer(m), vDNN determines the best layer to prefetch among the preceding layers (as n<m). 
If the distance between the prefetched layer(n) and the overlapping layer(m) is too large, the memory saving benefit of vDNN offloading will be reduced because the prefetched data will not be reused until far in the future. In other words, prefetching data too early will again utilize GPU memory suboptimally, as the prefetched data will once again camp inside GPU memory without immediate usage. We carefully designed the vDNN prefetch algorithm to avoid this pitfall and to balance the memory saving benefits of offloading with the timeliness of prefetching. Figure 10 shows pseudo-code of the vDNN prefetch algorithm that determines the best candidate layer for prefetching. Before stream_compute starts a layer's backward computation, vDNN first searches for a potential layer that requires prefetching of its X. If the search operation is successful (line 11), the layer ID to be prefetched is returned by the findPrefetchLayer routine and is used to launch its prefetch operation via stream_memory. Similar to offloading, vDNN synchronizes stream_compute and stream_memory so that the next layer's backward computation is stalled until the prefetch operation is finalized. Consequently, any prefetch operation launched during layer(n)'s backward computation is guaranteed to be ready before layer(n-1)'s computation. This benefit of course comes at the cost of a potential performance loss when the prefetch latency is longer than the overlapped computation, which we detail in Section V-C. 
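The search described above can be sketched in Python as follows. This is an illustrative reading of the text, not a transcription of Figure 10: the dictionary keys are hypothetical, and ending the search at the nearest preceding CONV layer is an assumption used here to keep prefetches from being issued too far ahead of their use.

```python
def find_prefetch_layer(curr_layer_id, layers):
    """Return the ID of the layer whose X should be prefetched next, or -1.

    layers: list of dicts with keys 'type', 'offloaded', and 'prefetched'
    (hypothetical bookkeeping, indexed by layer ID).
    """
    # Walk backwards over the layers that precede the current one (n < m).
    for layer_id in range(curr_layer_id - 1, -1, -1):
        layer = layers[layer_id]
        if layer["offloaded"] and not layer["prefetched"]:
            layer["prefetched"] = True
            return layer_id
        # Assumption: do not look past the nearest preceding CONV layer, so
        # that prefetched data does not sit idle in GPU memory.
        if layer["type"] == "CONV":
            return -1
    return -1


layers = [
    {"type": "CONV", "offloaded": True, "prefetched": False},
    {"type": "ACTV", "offloaded": False, "prefetched": False},
    {"type": "POOL", "offloaded": True, "prefetched": False},
    {"type": "CONV", "offloaded": True, "prefetched": False},
    {"type": "ACTV", "offloaded": False, "prefetched": False},
]
# Backward propagation visits the layers from last to first.
schedule = [find_prefetch_layer(m, layers) for m in range(len(layers) - 1, -1, -1)]
```

Each entry of schedule is the layer prefetched while the corresponding layer(m) runs its backward computation; -1 means no prefetch is launched for that step.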

C. vDNN Memory Transfer Policy 

Determining the best layers to offload their feature maps is a multi-dimensional optimization problem that must consider: 1) GPU memory capacity, 2) the convolutional algorithms used and the overall layer-wise memory usage, and 3) the network-wide performance. The first two factors determine whether we are able to train the network at all (which we refer to as the trainability of a network), while the last factor decides overall training productivity. If vDNN were to use the most memory-efficient algorithm for all layers (e.g., implicit GEMM in cuDNN [8], which does not require any WS allocations) while also having all layers offload/prefetch, the GPU memory usage would be the lowest. Performance will likely suffer, however, compared to a baseline with the fastest convolutional algorithm adopted for each layer; the performance loss primarily comes from 1) the additional latency possibly incurred due to offload/prefetch, and 2) the performance difference between the memory-optimal implicit GEMM and the performance-optimal convolutional algorithm. Going with the fastest algorithm, without any offload/prefetch, will result in the highest possible performance, but the potential memory overheads of the faster algorithm's workspace and the cumulative Xs that camp inside the GPU memory will likely overflow GPU memory. Given that optimizing the layer-wise memory usage and its performance is in itself a multi-dimensional optimization problem, selecting the optimal hyperparameters across the entire network is non-trivial. We therefore adopt the following heuristic-based memory transfer policies that narrow the parameter choices and simplify the optimization problem, while still performing robustly in practice. 
Static vDNN. Feature extraction layers are mostly composed of CONV and ACTV layers with intermittent POOL layers that downsize the dimensionality of the feature maps. More than 70% to 80% of the (forward/backward) computation time, however, is spent on the CONV layers for deep neural networks. We therefore evaluate two static vDNN memory transfer options that exploit this computational characteristic. The first option we explore is to have the vDNN memory manager offload all of the Xs of all of the layers. This policy, vDNNall, is our most memory-efficient solution as all Xs are offloaded and released from the GPU, drastically reducing device memory usage. The second vDNN policy is to only offload Xs for the CONV layers and leave the remaining layers' Xs resident inside GPU memory (vDNNconv). The vDNNconv policy is based on the observation that CONV layers have a much longer computation latency than ACTV/POOL layers and are therefore more likely to hide the latency of offload/prefetch. Not surprisingly, the performance of vDNNconv is generally higher than that of vDNNall. But vDNNall has the advantage of consuming the least GPU memory, significantly enhancing the trainability of a DNN. We later evaluate the memory usage and performance of these two static policies with both the memory-optimal and performance-optimal convolutional algorithms employed. 
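The two static policies amount to a simple selection rule over layer types, sketched below in Python. The function name and policy strings are hypothetical, and skipping ACTV layers under the "all" policy is an assumption that follows the earlier remark that in-place ACTV layers need no offloading.

```python
def layers_to_offload(layer_types, policy):
    """Return the indices of layers whose input feature maps X are offloaded.

    policy: 'all' (vDNNall), 'conv' (vDNNconv), or 'none' (no offloading).
    """
    if policy == "all":
        # Assumption: ACTV layers are computed in place and have no X to offload.
        eligible = ("CONV", "POOL")
    elif policy == "conv":
        eligible = ("CONV",)
    elif policy == "none":
        eligible = ()
    else:
        raise ValueError(f"unknown policy: {policy}")
    return [i for i, kind in enumerate(layer_types) if kind in eligible]


network = ["CONV", "ACTV", "POOL", "CONV", "ACTV"]
offload_all = layers_to_offload(network, "all")
offload_conv = layers_to_offload(network, "conv")
```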
Dynamic vDNN. While static vDNN is simple and easy to implement, it does not account for the system architectural components that determine the trainability and performance of a DNN (e.g., maximum compute FLOPs and memory bandwidth, memory size, effective PCIe bandwidth, etc.). For DNNs that comfortably fit within GPU memory, neither vDNNall nor vDNNconv is optimal, as the best approach is to have all the memory allocations resident in the GPU without any offloading and to employ the fastest possible convolutional algorithm. Large and deep networks, on the other hand, might not have the luxury of using faster convolutional algorithms, so being able to fit such a network on the GPU is the best optimization vDNN could make. We therefore develop a dynamic vDNN policy that automatically determines the offloading layers and the convolutional algorithms employed, at runtime, to balance the trainability and performance of a DNN. Dynamic vDNN leverages several properties of DNN training. First, we exploit the millions to billions of iterations of the same forward/backward propagation pass that are required for training. NVIDIA's cuDNN provides a runtime API that experiments with all available convolution algorithms for a given layer, evaluating each algorithm's performance and its memory usage. Current ML frameworks leverage this API in an initial profiling stage to determine the best algorithm to deploy for each CONV layer. The overhead of such profiling is on the order of a few tens of seconds, which is negligible relative to the days to weeks required for DNN training. Our dynamic vDNN augments this profiling stage with a number of additional profiling passes to select the best layers to offload and the best per-layer algorithm. Once the baseline profile stage is completed and the fastest possible convolutional algorithms are derived for all CONV layers, dynamic vDNN employs the following additional profiling passes: 

1) First, the static vDNNall is tested for a single training pass with all CONV layers using the memory-optimal, no-WS incurred algorithm. This initial pass determines if the target DNN can be trained at all as it requires the least GPU memory. 
2) If vDNNall passes, another training phase is launched with all CONV layers employing the fastest algorithms but without any offloading. Such a configuration, if it passes successfully, will be adopted for the rest of the full training procedure as it provides the highest performance while guaranteeing trainability. If this profiling phase fails due to memory oversubscription, two additional training passes are tested with the same fastest algorithms, but with vDNN offloading enabled for vDNNconv and vDNNall respectively. If one of them succeeds, vDNN employs the successful configuration for the rest of training. If both vDNNconv and vDNNall fail, we move on to the next profiling pass below to further reduce memory usage. 
3) The last phase is based on a greedy algorithm that tries to locally reduce a layer's memory usage, seeking a global optimum state in terms of trainability and performance. When traversing through each layer, vDNN first calculates whether using the fastest algorithm will overflow the GPU memory budget. If so, then the given layer's convolutional algorithm will be locally downgraded into a less performant but more memory-efficient one, until it reaches the memory-optimal implicit GEMM. This greedy-based approach first tries vDNNconv with each CONV layer initially using its own performance-optimal algorithm. If vDNNconv fails, then another training pass is launched with the more memory-efficient vDNNall. If vDNNall also fails with this greedy algorithm, then vDNN resorts back to the very first vDNNall solution, with the memory-optimal, no-WS algorithms applied across the entire network. 
While other possible settings might better balance performance and trainability, we find that our dynamic vDNN performs competitively without having to exhaustively search for globally optimal parameter selections. 
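The order in which the three profiling passes try configurations can be summarized with the following Python sketch. The callback fits(policy, algo) is a hypothetical stand-in for launching one trial training pass and reporting whether it stayed within GPU memory; the policy and algorithm labels are likewise illustrative names, not identifiers from the vDNN implementation.

```python
def choose_dynamic_config(fits):
    """Pick a (policy, algo) pair following the profiling passes described above.

    fits(policy, algo) -> bool is a hypothetical callback that runs one trial
    pass. policy is 'none', 'conv', or 'all'; algo is 'memory_optimal',
    'performance_optimal', or 'greedy' (per-layer downgraded algorithms).
    Returns None if the network cannot be trained at all.
    """
    # Pass 1: vDNNall with memory-optimal algorithms decides trainability.
    if not fits("all", "memory_optimal"):
        return None
    # Pass 2: fastest algorithms, first without offloading, then with
    # vDNNconv and vDNNall offloading.
    for policy in ("none", "conv", "all"):
        if fits(policy, "performance_optimal"):
            return (policy, "performance_optimal")
    # Pass 3: greedy per-layer algorithm downgrades, vDNNconv before vDNNall.
    for policy in ("conv", "all"):
        if fits(policy, "greedy"):
            return (policy, "greedy")
    # Fall back to the configuration that passed in pass 1.
    return ("all", "memory_optimal")


# Hypothetical outcome table: only these trial passes fit in GPU memory.
passing = {("all", "memory_optimal"), ("all", "performance_optimal")}
chosen = choose_dynamic_config(lambda policy, algo: (policy, algo) in passing)
```

The cascade tries the fastest configurations first and only gives up performance when a trial pass oversubscribes memory, which is why the search stays short compared with an exhaustive one.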

IV. METHODOLOGY 

A. vDNN Memory Manager 
We implemented a host-side memory manager that interacts with cuDNN 4.0 [8], the latest and fastest version available, which serves as the GPU back-end. All the layers that constitute a DNN's feature extraction layers have been implemented using cuDNN, and the execution of each layer is orchestrated using two CUDA streams, stream_compute and stream_memory, as discussed in Section III-B. The classification layers remain unchanged and use the same cuBLAS routines used in Torch. The vDNN API closely resembles that of Torch and Caffe, providing high-level abstractions of the target DNN and each of its layer compositions. 
While there are subtle differences between the memory allocation schemes of Torch, Caffe, and Theano, prior work [9] quantitatively demonstrated that all three frameworks exhibit comparable performance and memory requirements (footnote 3). We therefore choose Torch's memory management policy as the baseline to compare against vDNN given its widespread deployment across both academia and industry (e.g., Facebook and Google DeepMind). This baseline policy adopts the network-wide allocation policy discussed in Section II-C. However, we further improve this baseline policy using the following strategy to reduce memory consumption during the backward propagation phase [38, 39]: rather than allocating separate dY and dX for all individual layers, we only allocate the minimally required number of each of these data structures and reuse them after each layer's backward computation is complete (Figure 2). 

B. GPU Node Topology 
We conducted experiments on NVIDIA's Titan X [40], which provides the highest math throughput (single precision throughput of 7 TFLOPS), memory bandwidth (max 336 GB/sec), and memory capacity (12 GB) in the family of Maxwell GPUs. The GPU communicates with an Intel i7-5930K (containing 64 GB of DDR4 memory) via a PCIe switch (gen3), which provides a maximum 16 GB/sec data transfer bandwidth. 

3 Because TensorFlow is the least performant in terms of GPU memory usage and training speed [9], we do not discuss its memory management policy further in this paper. 


C. DNN Benchmarks 
Conventional DNNs. First, we evaluate existing, state-of-the-art ImageNet-winning DNNs: AlexNet [1], OverFeat [30], GoogLeNet [17], and three different batch sizes for VGG-16 (the deepest network with 16 CONV and 3 FC layers) [14]. The network configurations of these DNNs (e.g., layer type, batch size, etc.) are identical to the reference models maintained by the researchers at Facebook [41]. While the memory usage of AlexNet, OverFeat, and GoogLeNet is already below the 12 GB memory capacity of Titan X (Figure 4), we use them to evaluate the performance regression on these networks with vDNN. VGG-16 is one of the largest and deepest DNN architectures to date, requiring substantial memory capacity for trainability (using up to 28 GB of memory when trained with the best performing batch size of 256). Accordingly, Simonyan and Zisserman [14] parallelized VGG-16 (256) across four GPUs, with each GPU training VGG-16 (64), which fits within a single GPU memory budget. We therefore study VGG-16 with three batch sizes (64/128/256) and use it as a representative, future-looking DNN architecture that stresses the memory capacity limits of today's GPUs. 
Very Deep Networks. To highlight vDNN's scalability in training very deep networks, we collected a second set of benchmarks by extending the number of CONV layers of VGG, from 16 CONV layers to 416 CONV layers. The original VGG network features a homogeneous architecture that only uses 3×3 convolutions (with stride 1 and pad 1) and 2×2 pooling operations (with stride 2), from the first to the last feature extraction layer. The feature extraction layers are conceptually divided into five groups of CONV layers, separated by the intermediate POOL layers. The only difference among these CONV layer groups is that the number of output feature maps grows from 64 to 512, from the first to the last layer group. Simonyan and Zisserman [14] studied the effect of layer depth on classification accuracy by incrementally adding more CONV layers to each of these layer groups, going from 8 CONV layers to 16 CONV layers. We follow similar measures to deepen VGG by gradually adding 100 more CONV layers to VGG-16, resulting in VGG-116/216/316/416 configurations. Each addition of 100 CONV layers is done by adding 20 more CONV layers to each of the five CONV layer groups. The added CONV layers have the same number of output feature maps that are employed for that layer group. We use these four VGG-style networks to perform a case study on vDNN's scalability in training very deep networks that require much more memory. Compared to conventional DNNs whose input batch size is on the order of hundreds of images, we study these very deep networks with a relatively small batch size of 32 in order to highlight the memory scaling effect of layer depth on DNNs. 
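The construction of the deeper configurations can be checked with a short Python sketch. The per-group CONV counts in BASE_GROUPS are an assumption (five groups summing to 16 CONV layers); only the rule of adding 20 CONV layers to each of the five groups per step comes from the text.

```python
# Assumption: CONV layers per group in the 16-CONV VGG baseline.
BASE_GROUPS = (2, 2, 4, 4, 4)


def deepen(base_groups, steps):
    """Add 20 CONV layers to every group, `steps` times (100 layers per step)."""
    return tuple(count + 20 * steps for count in base_groups)


# Name each configuration after its total number of CONV layers.
configs = {}
for steps in range(1, 5):
    groups = deepen(BASE_GROUPS, steps)
    configs[f"VGG-{sum(groups)}"] = groups
```

Under this assumption, every step adds exactly 100 CONV layers, which yields the four VGG-116/216/316/416 networks studied in the case study.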

V. RESULTS 

This section evaluates the effect of vDNN on GPU memory usage, off-chip memory bandwidth utilization, GPU power consumption, and overall performance. The static vDNNall and vDNNconv policies are denoted as all and conv in all the figures discussed in this section and are each evaluated with both memory-optimal and performance-optimal (denoted as (m) and (p)) convolutional algorithms across the network. The baseline memory manager (base) is similarly evaluated with both memory-optimal and performance-optimal algorithms. The algorithms are dynamically chosen for vDNNdyn (denoted as dyn) as discussed in Section III-C. Memory management policies that fail in training the network, due to memory oversubscription, are marked with (*). 

A. GPU Memory Usage 
Because vDNN adopts a layer-wise memory allocation policy, the GPU memory usage during forward/backward propagation will fluctuate depending on the memory offloading policy chosen and the convolutional algorithm employed for a given layer (Figure 5). We therefore discuss both the maximum and average memory usage as shown in Figure 11. The maximum memory usage corresponds to the largest memory allocated across the entire run, which decides whether the target DNN application can be trained at all. The average memory usage, on the other hand, reflects how much memory has been used on average, and conversely, freed up during forward/backward propagation. The smaller the average memory usage becomes, the more likely vDNN will have headroom to improve performance by: 1) employing performance-efficient convolutional algorithms that require larger workspace, and 2) reducing the total number of offload layers and preventing potential performance drops due to offloading (Figure 9). 
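The two metrics can be illustrated with a small Python sketch over a memory usage trace. The trace values are hypothetical, and sampling once per layer computation with equal weight is an assumption; a time-weighted average would be a natural alternative.

```python
def memory_stats(usage_gb):
    """Return (maximum, average) GPU memory usage for a trace sampled once
    per layer computation (equal weighting is an assumption)."""
    return max(usage_gb), sum(usage_gb) / len(usage_gb)


# Hypothetical traces in GB: a network-wide allocation stays flat, while a
# layer-wise allocation fluctuates as data is allocated, offloaded, and freed.
network_wide_trace = [10.0, 10.0, 10.0, 10.0]
layer_wise_trace = [3.0, 5.0, 4.0, 2.0]

network_wide_max, network_wide_avg = memory_stats(network_wide_trace)
layer_wise_max, layer_wise_avg = memory_stats(layer_wise_trace)
```

The maximum determines whether the run fits in GPU memory, while the gap between the GPU capacity and the average indicates how much headroom remains for faster algorithms or fewer offloads.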
Because the baseline policy provisions the memory allocations to accommodate the entire network usage, the maximum and average memory usage are identical. The baseline policy therefore is not able to train networks like VGG-16 with batch 128 and 256, which require more than the physically available 12 GB of memory. Our vDNN enhances the trainability of a network by significantly reducing its memory requirements. Overall, the memory-optimal vDNNall(m) shows both the smallest average and maximum memory usage as it always offloads a layer's input feature maps while using the most memory-efficient algorithms. As a result, vDNNall exhibits the highest offload traffic being sent to host memory, reaching up to 16 GB of GPU memory savings for VGG-16 (256) (Figure 12). Such aggressive offloading significantly improves memory efficiency and achieves an average 73% and 93% reduction in the maximum and average memory usage of the six networks shown in Figure 11. When employed with the performance-optimal algorithm, the average memory savings of vDNNall are slightly reduced to 64% and 90% for the maximum and average memory usage. Because vDNNconv only offloads the feature maps for the CONV layers, its memory savings are not as high as those of vDNNall. However, vDNNconv still reduces the maximum and average memory usage by 52% and 76% on average, even with the performance-optimal algorithms employed across the network. 
vDNNdyn allocates the most memory among the three vDNN policies, reducing the maximum and average memory consumption by 49% and 69% on average compared to baseline. This is because vDNNdyn tries to balance its memory usage and performance, seeking to fit the network within GPU memory while still optimizing performance by minimizing the number of offload layers and employing the fastest possible convolutional algorithms. The static vDNNall and vDNNconv, on the other hand, do not consider the overall performance when the offloaded layers are chosen. For instance, VGG-16 (128) trained with memory-optimal vDNNall only uses up to 4.8 GB out of the 12 GB of available memory. This configuration leads to a 61% performance loss (Section V-C) as vDNNall fails to exploit the remaining 7.2 GB of the memory for performance optimizations. vDNNdyn tries to bridge this gap by dynamically deriving the offload layers as well as the best convolutional algorithms to employ for each layer. We further discuss vDNN's impact on performance in detail in Section V-C. 

B. Impact on Memory System 
Fig. 13: Maximum DRAM bandwidth utilization for each CONV layer's forward and backward propagation. 

While vDNN helps virtualize DNN's memory usage, it does come at the cost of adding more read (offload) and write (prefetch) traffic to the GPU memory subsystem, potentially interfering with the normal cuDNN operations. Because the additional vDNN memory traffic can be up to the bandwidth of the PCIe (maximum of 16 GB/sec for gen3), its effect on performance will be determined by the normal cuDNN operation's memory bandwidth intensity. Figure 13 shows the baseline's maximum DRAM bandwidth utilization for VGG-16, which is measured separately for each CONV layer's forward and backward propagation. The feature extraction layers rarely saturate the 336 GB/sec of peak memory bandwidth, providing more than enough headroom for vDNN's offload/prefetch traffic. Even if a hypothetical, future convolutional algorithm were to completely saturate the off-chip DRAM bandwidth, vDNN's additional traffic would incur a worst-case performance overhead of approximately (16/336) = 4.7%, which we believe is reasonable given the benefit of virtualized memory. 
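The worst-case bound is simple arithmetic, reproduced here as a Python sketch. The constant names are illustrative; the two bandwidth figures are the ones quoted above. The ratio evaluates to roughly 0.0476, which the text reports as 4.7%.

```python
PCIE_GEN3_GB_PER_SEC = 16.0    # maximum offload/prefetch traffic over PCIe
DRAM_PEAK_GB_PER_SEC = 336.0   # peak off-chip memory bandwidth of the GPU

# If cuDNN alone saturated DRAM bandwidth, the extra vDNN traffic could take
# away at most this fraction of it.
worst_case_overhead = PCIE_GEN3_GB_PER_SEC / DRAM_PEAK_GB_PER_SEC
```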

C. Performance 
Figure 14 summarizes the performance of vDNN compared to baseline. For a conservative evaluation, we only compare the latencies incurred in the feature extraction layers because the classifier layers are executed identically for baseline and vDNN. Because the baseline policy requires more than 12 GB of memory for VGG-16 (128) and VGG-16 (256) with performance-optimal algorithms (15 GB and 28 GB respectively), it is impossible to train these two networks on a Titan X. We therefore establish an oracular baseline that removes the memory capacity bottlenecks of these two networks for a conservative evaluation. The performance of this oracular baseline is estimated by configuring all CONV layers with the fastest algorithms and evaluating the latency of each layer individually. The latencies are then accumulated to estimate overall performance. Overall, vDNNall and vDNNconv with memory-optimal algorithms exhibit an average 58% and 55% performance loss (maximum 65% and 63% degradation) compared to baseline, an expected result as the memory manager puts no effort into balancing memory usage and overall performance. The dynamic vDNNdyn does much better in terms of balancing memory efficiency and overall throughput, closing the performance gap between the static vDNN and baseline and reaching an average 97% of baseline's throughput (worst case 82% of the oracular baseline, for VGG-16 (256)). 
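The oracular baseline estimate can be sketched in Python as a sum of per-layer latencies. The function names and latency values are hypothetical, and expressing relative performance as the ratio of baseline time to vDNN time is an assumption about how throughput is normalized.

```python
def oracular_baseline_ms(fastest_layer_latencies_ms):
    """Estimate total latency by accumulating each layer's latency, measured
    individually with its fastest convolutional algorithm."""
    return sum(fastest_layer_latencies_ms)


def relative_performance(baseline_ms, vdnn_ms):
    """Throughput of vDNN normalized to the baseline (assumed definition)."""
    return baseline_ms / vdnn_ms


# Hypothetical per-layer latencies in milliseconds.
baseline_total = oracular_baseline_ms([4.0, 1.0, 6.0, 1.0])
normalized = relative_performance(baseline_total, vdnn_ms=15.0)
```

Measuring layers individually lets the estimate ignore the memory capacity limit, since no single layer needs the whole network's allocations to be resident at once.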

D. Power 
This section discusses the effect of vDNNdyn on overall GPU power consumption. We use the system profiling utility nvprof [42] to measure the average and maximum GPU power consumption. Each network is executed for 50 iterations of forward and backward propagation, and the reported average and maximum power consumption values are averaged across these iterations. All but VGG-16 (128) have been executed with the performance-optimal convolutional algorithms because VGG-16 (128) can only be trained with the memory-optimal algorithms under baseline (Figure 11). Note that the results for VGG-16 (256) are not discussed as this configuration can only be trained with vDNN, making it impossible to compare against baseline. Overall, vDNNdyn incurs 1% to 7% maximum power overheads. As discussed in Section V-B, the offload/prefetch memory traffic of vDNN is one of the biggest contributors to the instantaneous rise in peak power consumption. Nonetheless, the average power consumption (energy/time) is rarely affected because of the following two factors: 1) vDNNdyn does not incur any noticeable performance overhead for these five networks, and 2) the studied DNNs rarely saturate the peak DRAM bandwidth (Figure 13), so the additional energy overhead of vDNN memory traffic is expected to be negligible on average (Section V-B). 

E. Case Study: Training Very Deep Networks 
To highlight vDNN's scalability in training very deep networks, we perform a case study on four VGG-style networks that contain hundreds of CONV layers and scale up the network memory requirements. As mentioned in Section IV-C, the batch size is set to be much smaller than those studied in previous subsections (which range from batch 128 to 256) as a means to highlight the memory scaling effect of layer depth despite the small batch size. Figure 15 shows the memory allocation requirements of baseline and vDNNdyn for these very deep neural networks. As the number of CONV layers increases from 16 to 416, the baseline memory requirements monotonically increase by 14 times (from 4.9 GB to 67.1 GB), even with a small batch size of 32. Thanks to its layer-wise memory allocation policy, vDNNdyn significantly reduces the memory usage of all four networks, only using up to 4.2 GB of GPU memory and having the remaining 81% to 92% of overall memory allocations reside in CPU memory. Compared to the oracular baseline, vDNNdyn also did not incur any noticeable performance degradation because the offload and prefetch latency is completely hidden inside the layer's DNN computations while still being able to employ the performance-optimal algorithms across the network. 

VI. RELATED WORK 

There have been a variety of proposals aiming to reduce the memory usage of neural networks. Network pruning techniques [43, 44, 45, 46, 47] remove small-valued weight connections from the network as a means to reduce network redundancy, leading to a reduction in memory consumption. Another redundancy mitigating approach uses quantization [48] or reduced precision [49] to reduce the number of bits required to model the network. A variety of network compression techniques have also been explored by Gong et al. [50] to reduce the memory usage of DNNs. 
Although these prior studies reduce DNN memory requirements, they fall short in several respects. First, weights only account for a small fraction of the memory usage in state-of-the-art DNNs, as shown in Figure 4. Thus, proposals that optimize memory usage of weights, while beneficial in terms of memory bandwidth utilization and energy-efficiency, provide only limited opportunity for memory capacity savings. Second, using reduced precision occasionally results in loss of classification accuracy unless carefully tuned for the given network and task. Our proposal optimizes the memory consumption of the intermediate feature maps, which are the most dominant data structures in DNNs. 
Several prior works discussed mechanisms to support virtualized memory on GPUs. Pichai et al. [51] and Power et al. [52] proposed TLB implementations that consider the unique memory access patterns of GPUs, improving the throughput of address translations as well as overall system throughput. Zheng et al. [34] discuss features needed in the GPU hardware and software stack to close the performance gap of GPU paged memory versus legacy programmer-directed memory management techniques. As discussed in Section II-C, page-migration based virtualization solutions are likely to underutilize PCIe bandwidth significantly and incur performance overheads when training networks that oversubscribe GPU memory. 
While less directly related to vDNN, a variety of accelerator architectures have also been proposed for DNNs [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 53]. While these custom ASIC designs drastically improve the energy-efficiency of DNNs, none of these address the memory capacity bottlenecks of DNN training, a unique contribution of our work. 

VII. CONCLUSION 

Existing machine learning frameworks require users to carefully manage their GPU memory usage so that the network-wide memory requirements fit within the physical GPU memory size. We propose vDNN, a scalable, memory-efficient runtime memory manager that virtualizes the memory usage of a network across CPU and GPU memories. Our vDNN solution reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, substantially improving the memory-efficiency of DNNs. Similar experiments on VGG-16 (256) result in an average 90% reduction in memory usage at a cost of an 18% performance penalty compared to an oracular baseline. We also study the scalability of vDNN to extremely deep neural networks, showing that vDNN can train networks with hundreds of layers without any performance loss. 

ACKNOWLEDGMENT 

We thank our colleagues at NVIDIA for their feedback on this work, and in particular John Tran, Sharan Chetlur, Simon Layton, and Cliff Woolley for their contributions to vDNN concepts and infrastructure. 

REFERENCES 
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with Deep Convolutional Neural Networks, in Proceedings of the Advances in Neural Information Processing Systems, 2012. 
[2] A. Graves and J. Schmidhuber, Framewise Phoneme classification With Bidirectional LSTM and Other Neural Network Architectures, in Neural Networks, 2005. 
[3] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, Natural Language Processing (Almost) From Scratch, in arxiv.org, 2011. 
[4] Tensorflow, https://www.tensorflow.org, 2016. 
[5] Torch, http://torch.ch, 2016. 
[6] Theano, http://deeplearning.net/tutorial, 2016. 
[7] Caffe, http://caffe.berkeleyvision.org, 2016. 
[8] NVIDIA, cuDNN: GPU Accelerated Deep Learning, 2016. 
[9] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning, in arxiv.org, 2016. 
[10] The Next Platform (www.nextplatform.com), Baidu Eyes Deep Learning Strategy in Wake of New GPU Options, 2016. 
[11] G. Diamos, S. Sengupta, B. Catanzaro, M. Chrzanowski, A. Coates, E. Elsen, J. Engel, A. Hannun, and S. Satheesh, Persistent RNNs: Stashing Recurrent Weights On-Chip, in Proceedings of the International Conference on Machine Learning, 2016. 
[12] A. Krizhevsky, One Weird Trick For Parallelizing Convolutional Neural Networks, in arxiv.org, 2014. 
[13] ImageNet, http://image-net.org, 2016. 
[14] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in Proceedings of the International Conference on Learning Representations, 2015. 
[15] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in arxiv.org, 2015. 
[16] Wired (www.wired.com), Microsoft Neural Net Shows Deep Learning Can Get Way Deeper, 2016. 
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going Deeper with Convolutions, in arxiv.org, 2014. 
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, Deep Networks with Stochastic Depth, in arxiv.org, 2016. 
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-Based Learning Applied to Document Recognition, in Proceedings of the IEEE, 1998. 
[20] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2014. 
[21] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, DaDianNao: A Machine-Learning Supercomputer, in Proceedings of ACM/IEEE International Symposium on Microarchitecture, 2014. 
[22] Y. Chen, T. Krishna, J. Emer, and V. Sze, Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, in IEEE International Conference on Solid-State Circuits, 2016. 
[23] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally, EIE: Efficient Inference Engine on Compressed Deep Neural Network, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016. 
[24] Y. Chen, J. Emer, and V. Sze, Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016.
[25] R. LiKamWa, Y. Hou, M. Polansky, Y. Gao, and L. Zhong, RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016. 
[26] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. Lee, J. Miguel, H. Lobato, G. Wei, and D. Brooks, Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016.
[27] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016.
[28] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016.
[29] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2016.
[30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, in arxiv.org, 2013.
[31] S. Chintala, https://github.com/torch/nn/pull/235, 2015. 
[32] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, cuDNN: Efficient Primitives for Deep Learning, in Proceedings of the Advances in Neural Information Processing Systems, 2014.
[33] OpenMP Architecture Review Board, OpenMP Application Program Interface (version 4.0), 2013.
[34] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler, Toward High-Performance Paged-Memory for GPUs, in Proceedings of IEEE International Symposium on High-Performance Computer Architecture, 2016.
[35] NVIDIA, NVIDIA NVLINK High-Speed Interconnect, 2016.
[36] NVIDIA, NVIDIA CUDA Programming Guide, 2016. 
[37] NVIDIA, https://github.com/NVIDIA/cnmem, 2016. 
[38] S. Gross and M. Wilber, Training and Investigating Residual Nets, 2016.
[39] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, in Proceedings of the 2015 Workshop on Machine Learning Systems, 2015. 
[40] NVIDIA, GeForce GTX Titan X (Maxwell), 2015. 
[41] S. Chintala, https://github.com/soumith/convnet-benchmarks, 2016.
[42] NVIDIA, CUDA Toolkit 7.5 Documentation: profiler, 2016. 
[43] S. Hanson and L. Pratt, Comparing Biases for Minimal Network Construction with Back-propagation, in Proceedings of the Advances in Neural Information Processing Systems, 1989.
[44] Y. LeCun, S. Denker, and S. Solla, Optimal Brain Damage, in Proceedings of the Advances in Neural Information Processing Systems, 1990. 
[45] B. Hassibi and D. Stork, Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, in Proceedings of the Advances in Neural Information Processing Systems, 1993.
[46] S. Han, J. Pool, J. Tran, and W. Dally, Learning Both Weights and Connections for Efficient Neural Networks, in Proceedings of the Advances in Neural Information Processing Systems, 2015. 
[47] S. Han, H. Mao, and W. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, in Proceedings of the International Conference on Learning Representations, 2016.
[48] V. Vanhoucke, A. Senior, and M. Mao, Improving the Speed of Neural Networks on CPUs, in Proceedings of Deep Learning and Unsupervised Feature Learning, 2011. 
[49] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets, in arxiv.org, 2016.
[50] Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing Deep Convolutional Networks Using Vector Quantization, in arxiv.org, 2014.
[51] B. Pichai, L. Hsu, and A. Bhattacharjee, Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces, in Proceedings of ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[52] J. Power, M. Hill, and D. Wood, Supporting x86-64 Address Translation for 100s of GPU Lanes, in Proceedings of IEEE International Symposium on High-Performance Computer Architecture, 2014.
[53] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, ShiDianNao: Shifting Vision Processing Closer to the Sensor, in Proceedings of ACM/IEEE International Symposium on Computer Architecture, 2015.
<|endoftext|>


<|startoftext|>
     You Cannot Improve What You Do not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

     ANDREW BOUTROS, SADEGH YAZDANSHENAS, and VAUGHN BETZ,
     Department of Electrical and Computer Engineering, University of Toronto

     Recently, deep learning (DL) has become best-in-class for numerous applications but at a high computational
     cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is
     appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared
     to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional
     neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations,
     we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current
     FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16
     and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly from
     2.8× to 6.3×, while the area gap is consistent across CAs with an 8.7 average FPGA-to-ASIC area ratio. Among
     different blocks of the CAs, the convolution engine, constituting up to 60% of the total area, has a high area
     ratio ranging from 13 to 31. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural
     changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking
     the on-chip memories to reduce the programmability gap for DL applications.

     1 INTRODUCTION

     Recent advances in deep learning (DL) have led to breakthroughs in a myriad of fields, achieving
     unprecedented accuracy in tasks that were thought to be inherently unsuitable for our 
     computing machines to perform. It has become, in a very short time span, the de-facto standard for
     numerous applications ranging from simple image classification [36], machine translation [44],
       and speech recognition [10] to generating artistic paintings [9], composing music [7], and beating
       world champions in complex board games [41]. Interestingly, the basic foundations of DL and the
       algorithm currently used to train deep neural networks (DNNs), known as back-propagation, were
       established in the 1980s [35]. But it was not until recent years that it experienced a resurgence of
       interest [20], powered by both the abundance of data required for training and the availability of
       the tremendous compute-power necessary to train and deploy those models.
          However, the main drawback of DNNs remains their high computational complexity when
       compared to conventional detection and classification computer vision algorithms based on hand-
       crafted features. For example, a relatively simple eight-layer convolutional neural network (CNN),
        AlexNet [20], has a computational complexity of 25.8 GOP/Mpixel for its convolutional layers,
        which is 36.9× higher than that of a conventional histogram of oriented gradients feature extractor
       [43]. This gap grows even wider as we seek to improve the accuracy of CNNs by building deeper,
       bigger and more complex models that can surpass human-level performance on visual recognition
        tasks [14]. The ImageNet large-scale visual recognition challenge witnessed a 15× increase
       in operations required per image inference in return for an 11.7% reduction in classification error
       between 2012 and 2015 [15,36]. This substantial increase in compute requirements motivates high-
       performance and energy-efficient hardware accelerators to replace or co-exist with conventional
       CPUs in executing both CNN training and inference tasks.
         The training of CNN models is commonly performed in floating-point representation on graph-
       ics processing units (GPUs) having thousands of cores and large external memory bandwidth. It
       does not require much effort to deploy existing models or train new ones on GPUs using various
       frameworks (e.g., Caffe [18] and TensorFlow [1]) that exploit highly optimized GPU libraries such
       as Nvidia CuDNN [5] for dense and sparse matrix operations. Although GPUs can deliver high
       performance by performing batch computations, they are extremely power-hungry. This is afford-
       able for training, which has no constraints on output latency and is carried out a limited number
       of times during the development phase. However, when it comes to inference, this is not ideal for
       a wide class of applications that have limited power budget and tight latency constraints such as
       mobile embedded platforms, self-driving cars or large-scale data center services.
         To achieve the best performance and energy-efficiency, many researchers have focused on building
        custom application-specific integrated circuits (ASICs) for accelerating CNN inference workloads.
        Some examples are DaDianNao [3], which accelerates different types of DNNs using a multi-chip
        architecture, and Eyeriss [4], which focuses on energy-efficient acceleration of convolutional
       layers by maximizing data re-use, performing data compression and using a zero-skipping technique.
       Despite being an attractive solution, ASICs do not offer enough flexibility to accommodate
       the rapid evolution of CNN models and the emergence of new types of layers used in them including
       the branching, elementwise addition and batch normalization layers as in more recent models
        (e.g., GoogLeNet [45] and ResNet [15]). In addition, the high non-recurring engineering (NRE) cost
        and time for design, verification and fabrication of a large ASIC chip make it difficult to keep pace
       with the rapid model improvements in this space.
         As a trade-off between performance, power-efficiency, and flexibility, FPGAs offer an interest-
       ing design point between GPUs and ASICs and recently have had much success in accelerating
       data center workloads in general [32] and more specifically CNN inference tasks [30]. In contrast
       to GPUs, FPGAs are generally more energy-efficient. A high-end Titan X Nvidia GPU can consume
        up to 5× more power compared to a high-end Intel Arria 10 FPGA running AlexNet inference tasks
       [2]. Several studies have also shown that CNN inference does not require high-precision floating-
       point computations and can be carried out using fixed-point arithmetic for less than 1% accuracy
        degradation [13]. This wide variety of precisions used in CNN inference matches well with FPGAs,
        as they can execute non-standard custom bit-width data paths with much higher efficiency
        and flexibility than GPUs. Compared to ASIC accelerators, FPGAs also have a shorter turn-around
        time and lower NRE cost, and they can be re-configured to support new models and layer types.
       Another interesting advantage for FPGAs is that they offer a variety of I/Os that support different
       communication protocols. This is useful when the CNN accelerator is a part of a larger system
        and receives inputs from different types of digital and analog sensors, as is the case in automotive
       applications. However, FPGAs run at significantly lower frequencies due to their reconfigurability
       overhead and thus have lower raw performance compared to both GPUs and ASICs.
         For this reason and despite their drawbacks, several companies have developed ASIC solutions
        to meet the processing needs of high-performance DL applications. A recent example is
        Google’s Tensor Processing Unit [19], which was deployed in data centers to accelerate inference tasks
        for various types of DNNs. It has almost 17× more multiply-accumulate (MAC) units, 5.6× more
        on-chip memory and runs at 3.5× higher frequency when compared to Microsoft’s Catapult V1
       [32] that uses Intel Stratix V FPGAs. In this work, we study the area and performance gap between
       FPGAs and ASICs in accelerating inference tasks using multiple CNN computing architectures
       (CAs) to highlight the limitations of current FPGA architectures and how they affect the overall
        performance of DL accelerators. The motive behind this study is twofold. First, it shows which
       design practices are more suitable for FPGA platforms and make the best use of current FPGA
       architectures. Second, it provides FPGA architects with data on where FPGAs have the largest
       efficiency gap compared to ASICs, which can lead to insights on how current FPGA architectures
       could be modified to shrink this gap and deliver higher performance in a domain with extremely
       high demand such as DL.
         In this article, we make the following contributions:
          • We implement highly optimized RTL designs for three state-of-the-art CAs that use different
             parallelization schemes to accelerate CNNs. We then extend each of these previously
             published architectures to support all layer types required to implement three different CNN
             models: AlexNet, VGG-16, and ResNet-50 to ensure our comparisons consider a broadly
             representative set of CNN models and implementations.
          • We present a quantitative comparison of area and performance results to measure the gap
             between the same CAs implemented on a high-end Intel Arria 10 FPGA and a 28nm ASIC.
          • We trace back the bottlenecks resulting in this gap and pinpoint the limitations of current
            FPGA architectures in accelerating CNNs.

       2 BACKGROUND

       Deep Neural Networks are a class of machine-learning algorithms that were developed to mimic
        the information-processing paradigm in biological nervous systems. The human brain, as an
        example, has an average of around 86 billion neurons [16] connected in a complex network in which
        each neuron receives inputs from its surrounding neurons and fires an activation if those inputs
        are greater than a specific threshold. Inspired by this system, DNNs typically consist of several
        layers, each of which has d^(l) neurons, where l is the layer number ranging from 1 to <<FORMULA>>. Each
        artificial neuron performs a biased weighted sum of all its inputs followed by a non-linear activation
        function to produce its output as shown in Equation (1), where x_i^(l) is the output of neuron i of
        layer l, w_ij^(l) is the weight parameter between neuron j in layer l and neuron i in layer l−1,
        w_0j^(l) is the bias term, and θ is the non-linear activation function that can be a sigmoid, tanh, or
       rectified linear unit (ReLU) function. This equation can be viewed as a series of MAC operations,
       which form the majority of computations in DNNs:

                                  <<FORMULA>>                    (1)
                                      
                                    <<FIGURE>>

                         Fig. 1. Different layer types in an example CNN.
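
        The neuron computation described by Equation (1) can be sketched in a few lines of Python. This is
        only an illustration under assumed conventions: the function names and the weights[j][i] layout
        (weight from neuron i in layer l−1 to neuron j in layer l) are hypothetical and not taken from the
        article.

```python
def relu(v):
    """Rectified linear unit, one of the activation choices named in the text."""
    return max(0.0, v)


def layer_forward(x_prev, weights, biases, activation=relu):
    """Compute the outputs of one fully connected layer.

    x_prev     : outputs of layer l-1
    weights[j] : weights feeding neuron j of layer l (assumed layout)
    biases[j]  : bias term of neuron j
    """
    outputs = []
    for w_row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(w_row, x_prev):
            acc += w * x  # one multiply-accumulate (MAC) operation
        outputs.append(activation(acc))
    return outputs


y = layer_forward([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.25, -1.0])
```

        Each inner-loop iteration is a single MAC, which is why the MAC count dominates the cost of a DNN.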

         CNNs are a subset of DNNs in which the connections between neurons of successive layers
        are sparse. Each neuron receives inputs only from neighboring neurons of the previous layer,
        its so-called receptive field. This significantly reduces the number of weights and MAC operations
        required and achieves high accuracy in applications with spatial or temporal correlation between
       input samples such as image classification, gesture and speech recognition. Sections 2.1 and 2.2
       describe the main layers of a CNN and present a summary of the previous related work on 
       accelerating CNNs on FPGAs.

       2.1 Overview of CNN Layers
       CNN models typically consist of different layer types cascaded together such that the output of a
       specific layer is consumed by the subsequent one in a feed-forward scheme during inference. In
        Figure 1, we show an example CNN, and we illustrate the functionality of each of the layer types
       subsequently explained in this section.

          2.1.1 Convolutional (CONV) Layers. A CONV layer takes a set of NIM two-dimensional input
        feature maps. It accumulates the results of 2D convolutions with stride S between each input
        feature map and its corresponding K×K kernel of learnable weights to produce a two-dimensional
        output feature map. This is performed using NOM different sets of kernels to generate NOM output
       feature maps that are consumed by the subsequent layer. CONV layers are very compute-intensive
       and represent the majority of computation in a CNN, which motivated many designers to focus
        on accelerating only the CONV and not all CNN layers [55]. We also notice that as CNN models
        get deeper, the share of CONV layer operations in the total operation count increases:
        they constitute 91.6%, 99.1%, and 99.8% of the total operation count for AlexNet,
        VGG-16, and ResNet-50, respectively.
          The computation of CONV layers can be summarized using the six nested loops in Algorithm 1;
        they are highly parallelizable and can achieve high gains through hardware acceleration. However,
        it is a non-trivial optimization problem to choose the tiling and unrolling factors of those loops
        to achieve the best performance within the limited available hardware resources [27]. Typically, a
        non-linear activation function such as the ReLU function <<FORMULA>> is applied to the outputs
        of a CONV layer before passing them to the next layers.

        ALGORITHM 1: Nested loops for CONV layers computation

                     <<ALGORITHM>>
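
        As a rough software counterpart to the six nested loops, the sketch below computes a CONV layer
        on plain Python lists. The loop order (output maps, output rows, output columns, input maps,
        kernel rows, kernel columns) is an assumption, since the algorithm listing is not reproduced
        here; the function name and the no-padding choice are likewise illustrative.

```python
def conv_layer(inputs, kernels, stride=1):
    """Direct convolution without padding.

    inputs[ni][y][x]        : NIM input feature maps
    kernels[no][ni][ky][kx] : NOM sets of K x K kernels, one kernel per input map
    returns out[no][oy][ox] : NOM output feature maps
    """
    n_im = len(inputs)
    n_iy, n_ix = len(inputs[0]), len(inputs[0][0])
    n_om = len(kernels)
    k = len(kernels[0][0])
    n_oy = (n_iy - k) // stride + 1
    n_ox = (n_ix - k) // stride + 1
    out = [[[0.0] * n_ox for _ in range(n_oy)] for _ in range(n_om)]
    for no in range(n_om):                      # output feature maps
        for oy in range(n_oy):                  # output rows
            for ox in range(n_ox):              # output columns
                for ni in range(n_im):          # input feature maps
                    for ky in range(k):         # kernel rows
                        for kx in range(k):     # kernel columns
                            out[no][oy][ox] += (
                                inputs[ni][oy * stride + ky][ox * stride + kx]
                                * kernels[no][ni][ky][kx]
                            )
    return out


feature_map = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3x3 input map
ones_kernel = [[[[1, 1], [1, 1]]]]                  # one 2x2 kernel of ones
result = conv_layer(feature_map, ones_kernel)
```

        No iteration depends on another except through the accumulation into out[no][oy][ox], so the
        three outer loops can be unrolled freely in hardware; choosing how far to unroll and tile them
        is the optimization problem mentioned above.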
          2.1.2 Local Response Normalization (LRN) and Batch Normalization (BNORM) Layers. LRN is an
        arithmetic-heavy layer that was used in the early CNN models such as AlexNet to normalize each
       element in its input feature maps with respect to the elements at the same location in the adjacent
       KN maps using the formula in Equation (2). The function of the LRN layer is to create lateral
       inhibition for the output values especially when using ReLU as an unbounded activation function
        [20]. However, this layer is removed in newer models and is sometimes replaced by a BNORM layer
       followed by scaling, as in ResNets, which cuts down the required training steps and achieves the
       same accuracy. The computation for the BNORM layer is shown in Equation (3) where μ and σ^2
       are statistically computed over the training data set and γ and β are learned during the training
       phase of the CNN [17] but are all constants for inference:

                                                           <<FORMULA>>                   (3)

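        Because μ, σ^2, γ, and β are all constants at inference time, the BNORM computation reduces to one
        multiply and one add per element. The sketch below assumes the standard batch-normalization form
        from [17], y = γ(x − μ)/sqrt(σ^2 + ε) + β, which may differ in detail from Equation (3); the small
        constant eps and both function names are assumptions made for illustration.

```python
import math


def bnorm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Standard batch normalization of a single element (assumed form)."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_bnorm(mean, var, gamma, beta, eps=1e-5):
    """Precompute a scale and shift so that y = scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    return scale, shift


scale, shift = fold_bnorm(mean=0.5, var=4.0, gamma=2.0, beta=-1.0)
folded = scale * 3.0 + shift
reference = bnorm_inference(3.0, mean=0.5, var=4.0, gamma=2.0, beta=-1.0)
```

        Folding the constants this way means an accelerator only needs a single MAC per element for this
        layer, rather than a subtraction, a division, and a square root.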
          2.1.3 Pooling (POOL) Layers. Another key layer in CNNs is the POOL layer, which acts as a
       down-sampling function such that its input feature maps of size NX×NY are reduced in size but
       the number of input and output feature maps stays the same. There are different variations for
       POOL layers such as Max-POOL and Average-POOL, where each element in the output feature
       map represents the maximum or average value of a window of size KP×KP in the original input
       feature map, respectively.
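
        A minimal sketch of both pooling variants on a single feature map follows. The function name,
        the default stride equal to the window size KP, and the absence of padding are assumptions for
        illustration only.

```python
def pool2d(fmap, kp, stride=None, mode="max"):
    """Down-sample one feature map with a kp x kp window.

    mode="max" keeps the largest value in each window,
    any other mode averages the window.
    """
    stride = stride or kp
    n_y, n_x = len(fmap), len(fmap[0])
    out = []
    for oy in range(0, n_y - kp + 1, stride):
        row = []
        for ox in range(0, n_x - kp + 1, stride):
            window = [fmap[oy + dy][ox + dx] for dy in range(kp) for dx in range(kp)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
max_pooled = pool2d(fmap, 2, mode="max")
avg_pooled = pool2d(fmap, 2, mode="avg")
```

        The same function is applied independently to every feature map, which is why the number of
        maps is unchanged while each map shrinks.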

          2.1.4 Element-Wise (ELTWISE) Layers. Recent CNNs have more complex models with branching
        layers and skip connections forming a directed acyclic graph as shown in Figure 1 after the
        CONV2 layer. An ELTWISE layer combines two branches by performing an element-wise
        addition of the elements of a skip branch and the results of a CONV layer. Reference [38] proposed
       the use of weighted addition in ELTWISE layers for deeper networks with more than 100 layers;
       however, we focus on the unweighted variation of ELTWISE layers in this work. For this layer, the
       dimensions of the output feature maps match those of the input feature maps.

          2.1.5 Fully Connected (FC) Layers. The last layers of CNNs are typically FC layers, which are
        similar to those of conventional DNNs. The output of an FC layer is a one-dimensional vector
        of size NFC_out. Each element in this vector is a weighted sum of all the outputs of the previous
        layer, which were re-shaped into a one-dimensional vector of size NFC_in. As shown in Figure 1,
        it is characterized by the large number of weights involved in computation (<<FORMULA>>) that,
        unlike the convolution kernels, cannot be re-used. Therefore, FC layers are usually memory-bound,
       but more recent CNN models have a smaller number of FC layers with fewer weights making them
       less problematic. For instance, ResNet-50 has only 1 FC layer that has about 8% of the total number
       of weights in the network compared to three FC layers with 96% of the network weights in AlexNet,
       which further prioritizes the acceleration of CONV layers over other types of layers.
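
        The 96% figure for AlexNet can be checked with a short calculation. The layer dimensions below
        are those of the original two-group AlexNet as commonly reported; they are not listed in this
        article and are an assumption of the sketch, and bias terms are ignored.

```python
# (output maps, input maps per group, kernel size) for the five CONV layers
ALEXNET_CONV = [(96, 3, 11), (256, 48, 5), (384, 256, 3), (384, 192, 3), (256, 192, 3)]
# (input vector size, output vector size) for the three FC layers
ALEXNET_FC = [(9216, 4096), (4096, 4096), (4096, 1000)]


def conv_weights(layers):
    """Total number of kernel weights across the given CONV layers."""
    return sum(n_om * n_im * k * k for n_om, n_im, k in layers)


def fc_weights(layers):
    """Total number of weights across the given FC layers."""
    return sum(n_in * n_out for n_in, n_out in layers)


conv_total = conv_weights(ALEXNET_CONV)
fc_total = fc_weights(ALEXNET_FC)
fc_share = fc_total / (conv_total + fc_total)
```

        Under these assumptions the FC layers hold roughly 58.6 million of about 61 million weights, in
        line with the share quoted above, and each of those weights is used only once per inference.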

       2.2 Related Work
       Research efforts to accelerate CNNs on FPGAs can be classified into two major categories. The
       first category of work focuses on optimizing the mapping of CNN models to current FPGA
       architectures. For example, Reference [55] presents an analytical design methodology for design space
       exploration using the roof-line model to find the optimal loop unrolling and tiling parameters for
        the CONV loops shown in Algorithm 1. This work is extended in Reference [56] to a multi-FPGA
       cluster using dynamic programming with the target of maximizing throughput or minimizing 
       latency. To overcome under-utilization of resources resulting from different sizes of CONV layers,
       References [39] and [40] partition the available resources using a dynamic programming technique
       into multiple convolutional layer processors, each of which is optimized for a subset of CONV 
       layers. Another aspect of optimizing CNNs for FPGA acceleration is model compression by using
       techniques such as Singular Value Decomposition for FC layers [33]. Another compression 
       technique reduces precision down to ternary [31,49] or binary [29,47] networks that are inherently
       more FPGA-friendly, and exhibit little or no accuracy degradation by increasing the size of the
       network as in Reference [28]. The use of non-standard floating-point number representations has
       also been proposed by Microsoft’s BrainWave project [6] that uses its custom 8-bit/9-bit floating-
       point precision without suffering any accuracy loss. Recent work has also proposed the use of
        mathematical optimizations such as the Winograd and Fast Fourier Transforms to decrease the
       number of MAC operations required in CONV layers as in References [2,24,57].
         The second category seeks to ease development of DL accelerators on FPGAs such that it
       requires minimal hardware design expertise. Some works have investigated the use of High-Level
       Synthesis FPGA tools to implement CNNs in high-level programming languages that are synthesized
       into hardware [42]. Another widely investigated approach is to build automatic compilers
       to produce an end-to-end optimized accelerator for a specific CNN model and a specific FPGA
       platform [23,25,26]. In Reference [48], the authors present a framework that takes a CNN model
       described in a domain-specific language, converts it to a synchronous dataflow graph, optimizes
       performance and resource utilization via algebraic transformations, and finally generates a Vivado
       HLS hardware design. An open-source RTL template-based compiler that transforms a high-level
        description of the CNN model in the same prototxt format used by Caffe into an FPGA accelerator
       is also presented in Reference [37]. Similar frameworks were presented in References [51] and [11]
       that use Caffe-described and TensorFlow-described models along with RTL and RTL-HLS hybrid
       templates, respectively, to implement FPGA accelerators for not only CNN models but also Multi-
       Layer Perceptrons and Recurrent Neural Networks. The authors of Reference [52] implement
        an automated design flow that generates high-performance systolic array CNN architectures
       and a two-phase design space exploration scheme using analytical models as well as on-board
       implementations.
         Our work is complementary to these studies and serves as the first step toward improving the
       current FPGA architecture, which was considered a constant factor by all previous works, for
        more efficient acceleration of emerging, high-demand applications such as DL. To the best of
       our knowledge, this work is the first attempt to quantify the area and performance gap between
        FPGA and ASIC implementations of state-of-the-art CNN CAs, highlight the architectural features
        of current FPGA architectures causing it, and present suggested architectural solutions that can
        reduce this gap.

                          Table 1. Main Differences between the Three CAs

                                           <<TABLE>>

       3 COMPUTING ARCHITECTURES

        We implement three different highly optimized state-of-the-art CAs for accelerating CNN inference
        tasks in RTL using parameterizable SystemVerilog HDL. We refer to the three CAs as ASU-like
        [26,27], Intel-DLA-like [2], and Chain-NN-like [50]. We implement all the hardware computational
        blocks required to execute all the layers described in Section 2.1 for three different CNN models:
       AlexNet, VGG-16, and ResNet-50. We also implement the control logic required to run the CAs
       starting from reading the input features and weights from on-chip buffers, transferring them to
        the computational blocks, and writing the final results in the output feature buffers. The on-chip
       buffer sizes and the parallelization factors for each of the nested CONV loops are fixed in both
       the FPGA and ASIC implementations for each of these CAs according to the optimal design point
       originally reported in References [2,27,50]. For consistency and to enable fair comparisons, we
       also use a fixed-point data representation for all three CAs with 16-bit features and 8-bit weights
       as in Reference [27], which causes less than 2% accuracy degradation. We consider the external
       memory interface and direct memory access engines to be out of the scope of this work, as they
        do not affect the conclusions we seek to draw about the performance and area gaps or the bottlenecks
        of current FPGA architectures in accelerating CNNs. However, our performance models
        take off-chip data transfer into account according to any external memory interface that we
        specify. In our experiments, we report two sets of results: one assuming infinite bandwidth and the
        other assuming one bank of DDR4 memory at 1200 MHz with a total bandwidth of 17 GB/s similar
       to that used in Reference [2].
         We carefully chose those three CAs out of numerous architectures proposed in the literature
        to be diverse; the wide variations between them help ensure our analysis of FPGA vs. ASIC efficiency
        has broad applicability. The main differences between the three CAs, summarized in Table 1,
        are:
           • All three CAs have different parallelization schemes. In other words, the array of MAC units
             in each CA has a different number of dimensions leading to different execution orders, tiling
             and unrolling factors for the CONV loops in Algorithm 1. Output tiles of size (POM × POX ×
             POY), (POM × POX × 1), and (POM × 1 × 1) are produced by the ASU-like, Intel-DLA-like,
             and Chain-NN-like PE arrays, respectively.
           • The Intel-DLA-like CA uses a mathematical optimization for CONV layers with kernels of
             size 3×3 known as the Winograd Transform [22], which reduces the number of MAC operations
             needed to compute convolutions. However, the ASU-like and Chain-NN-like CAs
             perform conventional sliding-window convolution operations. This enables us to explore
             different convolution schemes with different degrees of control logic complexity and observe
             their effect on the area and performance gaps.

                                 <<FIGURE>>

                      Fig. 2. ASU-like CA tiling schemes and hardware architecture.

          • The three CAs implement their weight buffers differently. The Chain-NN-like CA stores the
            kernel weights in small distributed buffers such that every MAC unit has its local scratch-
            pad for weights implemented in the FPGA’s soft logic (MLABs). In contrast, both the ASU-
            like and Intel-DLA-like CAs have larger weight buffers implemented using on-chip memory
            blocks (BRAMs) to feed a group of MAC units. In FC layers, the Intel-DLA-like CA also
            interchanges the roles of weight and feature buffers.
          • The CAs differ in whether and how they use double-buffering to hide memory transfer
            time. The ASU-like CA uses double-buffering for weights to hide the computation time of
             FC layers by filling one buffer from off-chip memory while using the weights in the other
             buffer for computations. The Intel-DLA-like CA uses double-buffering by interchanging
             input and output buffers after each layer to eliminate any external memory transfers if all
             the output feature maps of a layer can fit in on-chip buffers. The Chain-NN-like CA does
            not use any double-buffering techniques.
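
        To give a feel for how the Winograd Transform mentioned in the list above saves MAC operations,
        the sketch below uses the smallest one-dimensional case, usually written F(2,3): two outputs of
        a 3-tap filter are produced with 4 multiplications instead of 6. This is an illustration only;
        the tile size actually used by the Intel-DLA-like CA is not stated in this section, and the
        function names are hypothetical.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap sliding-window filter: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs using 4 data-dependent multiplications."""
    # Filter transform: depends only on the weights, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Element-wise products of transformed inputs and transformed filter.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


data = [1, 2, 3, 4]
taps = [2, 4, 6]
```

        The saving in multipliers comes at the price of extra additions and more involved control logic,
        which is the trade-off the comparison between the three CAs is meant to expose.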
         None of the three CAs is available as an open-source implementation, and hence we imple-
       mented them from scratch to carry out the study presented in this article under controlled condi-
       tions (e.g., RTL implementation, same FPGA platform, same weight and activation precisions, etc.)
        to enable fair comparisons and focus only on the architectural aspects of these CAs. In Sections 3.1,
        3.2, and 3.3, we describe the details of the three CAs we implemented and any extensions added
       to them for the sake of our study.

       3.1 ASU-like CA
       This CA was proposed in Reference [27] by Ma et al. from Arizona State University (ASU) and
       then expanded in Reference [26] to support the ELTWISE and BNORM layers used in recent CNN
       models. The core of this CA, shown in Figure 2 (c), is a three-dimensional MAC unit array of size
       POM ×POX ×POY that can compute both CONV and FC layers.
         Feature maps and weights are tiled to minimize external memory transfers by either buffering
       all weights or all input feature maps in on-chip memory at any layer of the CNN model. In the
       shallower layers of the network, all the weights but only NOY+K−1 rows of the input feature maps
       are buffered on-chip such that <<FORMULA>>, as shown in Figure 2(a). In the deeper layers

       <<FIGURE>>

       Fig. 3. Data re-use shift register network operation for ASU-like CA with POX=POY=K=3 and POM=1.

       with smaller input and output feature maps and more weights, all features but only NOM sets
       of weight kernels are buffered on-chip such that <<FORMULA>>, as shown in Figure 2(b). The
       on-chip input and weight buffers, all implemented in BRAMs, are organized to supply the MAC
       units in the convolution engine with enough inputs to keep them busy at every clock cycle. There
       are POY input buffers, each of which supplies the MAC units with POX input features that get
       multiplied by weights from POM different weight buffers, as shown in Figure 2(c).
         The convolution engine performs the computation of Loops 1, 2, and 3 in Algorithm 1 in parallel
       using the three-dimensional array of MAC units. Each MAC unit sequentially accumulates the
       results of one kernel (Loops 5 and 6) across all input feature maps (Loop 4) and stores the partial
       sum locally in the accumulator. This means that after K × K × NIM cycles, each MAC unit outputs
       its final result, producing POM × POX × POY outputs at the same time. This parallelization scheme
       has several advantages: it does not require any movement of partial sums, as every MAC unit locally
       accumulates the results across Loops 4, 5, and 6 without the need for communication between MAC
       units or any intermediate on-chip storage. It also allows flexible implementation of convolutions
       of any input feature map count and any kernel size as a result of sequentially executing Loops 4,
       5, and 6. For example, convolutions of size 3×3 and 5×5 are executed in 9 and 25 cycles per
       input feature map, respectively. The convolution engine is preceded by a complex
       network of POY circular shift registers of size (POX+K−1) each. Figure 3 shows how POX × POY
       convolution results are computed using this shift register network over K × K time steps, where
       colored boxes are input/output features, white numbered boxes are kernel weights, and colored
       numbered boxes indicate a multiplication operation between an input feature and a kernel weight.
       At every time step, the multiplication result is accumulated inside the MAC unit and a shift left of
       the input data is performed. Every K time steps, a new row is loaded from the input buffers and
       data is re-arranged and transferred between the circular shift registers as indicated by the dashed
       arrows in the figure. After K × K time steps, this is repeated for NIM input maps before each MAC
       unit produces its final result. The convolution engine is followed by an output serializer that takes
       POM × POX × POY results and serializes them over POY cycles. After the output serializer, there
       can be a normalization block that is either LRN or BNORM according to the implemented CNN
       model, then a max pooling block, and finally the output buffers. An optional ELTWISE block is used
       in the ResNet-50 model.
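
         The cycle count implied by the parallelization scheme described above can be sketched as
       follows. This is a minimal illustration rather than the performance model used in this study:
       the function name, the rounding-up of each tiled dimension, and the example layer and array
       sizes are all assumptions, and the sketch ignores pipeline fill, buffer loading, and external
       memory transfers.

```python
import math

def asu_like_conv_cycles(n_om, n_ox, n_oy, n_im, k, p_om, p_ox, p_oy):
    """Approximate compute cycles of one CONV layer on a POM x POX x POY MAC array.

    Loops 4, 5, and 6 run sequentially inside each MAC unit, so one pass takes
    K * K * NIM cycles. Loops 1, 2, and 3 are unrolled across the array, so one
    pass yields POM * POX * POY outputs.
    """
    cycles_per_pass = k * k * n_im
    # Assumption: each output dimension is tiled independently and rounded up.
    passes = (math.ceil(n_om / p_om)
              * math.ceil(n_ox / p_ox)
              * math.ceil(n_oy / p_oy))
    return passes * cycles_per_pass

# Hypothetical layer: 64 output maps of 56x56, 64 input maps, 3x3 kernels,
# mapped onto a hypothetical 16 x 7 x 7 MAC array.
print(asu_like_conv_cycles(64, 56, 56, 64, 3, 16, 7, 7))  # 147456
```

         With these hypothetical numbers, the array needs 4 × 8 × 8 = 256 passes of 9 × 64 = 576
       cycles each.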

         Extensions: Both References [27] and [26] originally implement this CA for several CNN models
       including ResNet-50 and VGG-16. Therefore, they implement all the hardware blocks shown

                                        <<FIGURE>>

               Fig. 4. Intel-DLA-like CA and the internal architecture of each processing element.

       in Figure 2(c) except for the LRN block used in the AlexNet model. The LRN block is an
       arithmetic-heavy block, as it contains squaring, addition, multiplication, and exponentiation operations.
       Since all the DSP blocks are consumed by the convolution engine, we implement all multiplication
       operations in the LRN block using soft multipliers, which we found do not limit the maximum
       operating frequency. We implement the exponentiation operation of Equation (2) using a
       piecewise-linear function consisting of 20 points that we computed using the α and β values from
       the AlexNet model, similar to Reference [42].

       3.2 Intel-DLA-like CA
       In Reference [2], Intel presented the Deep Learning Accelerator (DLA), which is considered to be
       the state-of-the-art FPGA accelerator for the AlexNet CNN model. The core of this CA is an array
       of POM processing elements (PEs) connected in a daisy chain scheme where each PE receives input
       features and passes them to the subsequent PE in the next clock cycle, as shown in Figure 4.
         This CA uses double-buffered stream buffers such that input features of a CONV layer are read
       from one buffer and its outputs are stored in the other one, which then serves as the input buffer
       for the next layer. The two buffers continue to interchange roles as input and output buffers after
       every layer without the need to store any intermediate results in external memory. After the last
       CONV layer, outputs are stored in off-chip memory before starting the computations of FC layers.
       Each PE contains local weight buffers that feed its dot product units with inputs at every clock
       cycle. For the FC layers, batch processing is used to allow weight re-use among multiple input
       features. In contrast to the CONV layers, features of a batch of size B inputs are stored in “weight”
       buffers inside the PEs while the weights are stored in the stream buffers and are passed between
       the PEs using the daisy chain connection. For our study, we report the results for both B=1, which
       minimizes latency and can be compared to other CAs that do not support batch processing, and
       B=96, which maximizes throughput and aligns with the reported results in Reference [2].
         A major feature of this CA is its use of a mathematical optimization known as the Winograd
       Transform to reduce the number of MAC operations required to compute a convolution [22]. In
       Reference [2], an F(4×4, 3×3) transform is performed using a weight matrix of size 3×3 and an
       input feature matrix of size 6×6, resulting in an output matrix of size 4×4. Equation (4) shows the
       Winograd transform and inverse transform for these sizes, where G and B^T are used to transform
       the weight matrix W and the input feature matrix X, respectively, ⊙ is the element-wise
       multiplication operator, and A^T is then used to perform the inverse transform and obtain the output
       matrix Y. For this CA, the transform of the learned weights is done beforehand for the CONV layers of
       kernel size 3×3, since they are fixed after training the model, while the transform of input features
       and the inverse transform of the final result cannot be performed in advance and hence are performed
       on chip:
                                      <<TABLE>>
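
         As an illustration of how such a transform works, the following is a minimal sketch of the
       standard 2D Winograd form Y = A^T [(G W G^T) ⊙ (B^T X B)] A, which matches the roles of G,
       B^T, and A^T described above. Note the assumption: for brevity it uses the smaller, widely
       published F(2×2, 3×3) matrices (4×4 input tile, 2×2 output tile) rather than the F(4×4, 3×3)
       transform used by this CA, and the helper names are hypothetical. The result is checked
       against a direct sliding-window computation.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

# Standard F(2x2, 3x3) Winograd matrices (not the F(4x4, 3x3) ones used by the CA).
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def winograd_2x2_3x3(x, w):
    """x: 4x4 input tile, w: 3x3 kernel; returns the 2x2 output tile."""
    u = matmul(matmul(G, w), transpose(G))      # weight transform (can be precomputed)
    v = matmul(matmul(BT, x), transpose(BT))    # input transform (done at run time)
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # 16 multiplications
    return matmul(matmul(AT, m), transpose(AT))  # inverse transform

def direct_2x2_3x3(x, w):
    """Reference sliding-window computation (36 multiplications)."""
    return [[sum(x[i + r][j + c] * w[r][c] for r in range(3) for c in range(3))
             for j in range(2)] for i in range(2)]

x = [[4 * r + c for c in range(4)] for r in range(4)]
w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(winograd_2x2_3x3(x, w) == direct_2x2_3x3(x, w))  # True
```

         In this smaller case, the element-wise product needs 16 multiplications instead of the 36
       needed by the direct computation of the same 2×2 output tile.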
       
         Each PE in the convolution engine of this CA consists of a buffer for the Winograd-transformed
       weights, POX dot-product units, and their corresponding circular shift registers for storing partial
       sums. Each dot-product unit is pipelined into L stages and uses the dedicated chain between DSP
       blocks on the FPGA to multiply and accumulate PIM Winograd features and weights and then
       store the partial result in a circular shift register (CSR) of size L, as shown in Figure 4. Therefore,
       each dot-product unit can interleave the computation of L different MACs such that after L cycles,
       it takes as an input the partial sum previously produced and adds to it the MAC result of the next
       PIM features and weights. After all NIM features are processed, the final result is produced and
       the circular shift register is reset to zeros before starting the processing of the next set of input
       features. The convolution engine consists of POM PEs connected in a daisy chain scheme, allowing
       a better floorplan of the design on the FPGA with less fan-out from the input stream buffer to
       the convolution engine, and thus enabling a higher operating frequency. The convolution engine
       is followed by an inverse Winograd block that transforms POX × POM inputs into P′OX × POM
       outputs. This is followed by LRN and POOL blocks that process P′OX × POM results in parallel
       before storing them back into the output stream buffer. The POX and P′OX parameters are
       specified to be 6 and 4, respectively, according to the Winograd transform size used. Design space
       exploration was carried out in Reference [2] to find the optimal values for PIM and POM, and they
       were chosen to be 8 and 48, respectively.
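
         The multiplier budget implied by these parameters can be checked with a short calculation.
       This is a sketch under two assumptions: that every multiplication in a dot-product unit maps
       to one hard 18-bit multiplier, and that each DSP block provides two such multipliers (as implied
       by the 1,518 DSP blocks and 3,036 multipliers quoted for the Arria 10 device). The function
       name is hypothetical.

```python
def dla_like_dot_product_resources(p_om, p_ox, p_im, multipliers_per_dsp=2):
    """Return (multipliers, DSP blocks) used by the dot-product units.

    POM PEs, each with POX dot-product units, each multiplying PIM
    feature/weight pairs per cycle.
    """
    multipliers = p_om * p_ox * p_im
    dsp_blocks = -(-multipliers // multipliers_per_dsp)  # ceiling division
    return multipliers, dsp_blocks

print(dla_like_dot_product_resources(p_om=48, p_ox=6, p_im=8))  # (2304, 1152)
```

         Under these assumptions, the 1,152 DSP blocks obtained here agree with the number of DSP
       blocks reported for the dot-product units in Section 5.1.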
         Extensions: This CA was originally implemented for the relatively small AlexNet CNN model
       in which input and output feature maps can fit in on-chip buffers. This enables the use of
       interchangeable input and output buffers, which eliminates the need to store any intermediate results in
       external memory. However, this feature is inapplicable to at least the first layers of the other CNN
       models used in our study as their feature maps exceed the capacity of on-chip buffers. For this
       case, we use a scheme similar to that of the ASU-like CA to tile input and output feature maps and
       store intermediate results in off-chip memory. For layers that have small enough feature maps,
       we maintain the double buffering technique to eliminate data transfers from and to the external
       memory. We also carried out an experiment in which we increased the size of stream buffers such
       that more layers can make use of the double buffering technique. However, this resulted in de-
       grading the maximum operating frequency of the design, leading to a net loss in performance, and
       therefore we decided to keep the sizes of the stream buffers the same as that used for the AlexNet
       model. In addition, we implemented BNORM and ELTWISE blocks for this CA that were not part
       of the original implementation in Reference [2].

       3.3 Chain-NN-like CA
       This CA was proposed in Reference [50] by Wang et al. from Waseda University. It was imple-
       mented as an ASIC (using TSMC 28nm process technology), specifically for accelerating the CONV
       layers of AlexNet. It uses a dual-channel 1D systolic chain of Nchain PEs to flexibly compute 2D
       convolutions of any kernel size. Each PE has a multiplier and a set of input multiplexers controlled

                                    <<FIGURE>>

         Fig. 5. Chain-NN CA with (Nchain=16, K=2, Nsub=4) and the internal architecture of each PE.

       by complex central control logic that splits the PE chain into Nsub smaller sub-chains according
       to the size of the convolution kernel, where Nsub = Nchain/(K×K), as shown in Figure 5. We
       implemented this CA for our study because, despite being originally proposed as an ASIC
       implementation, it bears a compelling resemblance to FPGA architectures, which can efficiently
       implement 1D systolic chains of multipliers using the on-chip hard DSP blocks.
         This CA separates the input feature maps into odd and even columns and uses two separate
       input buffers to store them. The two input buffers supply inputs to the first PE of every sub-chain
       (i.e., the first of every 9, 25, and 121 PEs to implement convolutions of kernel size 3×3, 5×5,
       and 11×11, respectively). There are Nsub−MAX output buffers, each of which stores the outputs
       produced by a sub-chain, where Nsub−MAX = Nchain/(3×3), since 3×3 is the smallest kernel
       size used in AlexNet CONV layers. Each PE in the chain contains both a multiplier and a small local
       buffer of 512 words for storing the weights needed for the computations performed in this specific
       PE. The largest Arria 10 FPGA contains 3,036 multipliers but only 2,713 BRAMs. We therefore
       implement the local weight buffers in the soft logic (MLABs) and use the BRAMs to implement
       input and output feature buffers.
         Figure 5 shows the details of the dual-channel PE used in the 1D systolic chain of this CA. The
       two input channels receive odd-column and even-column input features through an input multiplexer,
       either from the odd and even input buffers, respectively, if it is the first PE of a sub-chain, or
       from the channels of the previous PE otherwise. The odd-column and even-column inputs
       propagate to the next PE after two cycles due to the systolic registers added to the chain. Another
       odd/even multiplexer chooses the MAC unit input to be either the odd-column or even-column
       input feature. The MAC unit multiplies the chosen input with the corresponding weight from the
       local weight buffer and adds the output to the previous partial result from the output buffers if
       it is the first PE of a sub-chain, or to the output of the previous PE otherwise. For a CONV layer
       with kernel size K, the convolution engine produces the partial results of a tile of size NOX × K
       across Nsub output feature maps. This is then repeated NIM times (Loop 4 in Algorithm 1), with
       the partial results used as inputs to the MAC units of the first PE in each sub-chain, to produce the
       final results of this tile. The next tile of the same Nsub output feature maps is processed in the same
       manner (Loop 3) until the whole NOX × NOY × Nsub outputs are computed, after which the computation
       of the next Nsub output feature maps (Loop 1) starts.
         The selection lines for the input multiplexer and output de-multiplexer of each PE are generated
       by a central control unit and are dynamically changed after each CONV layer according to the

                                          <<FIGURE>>

             Fig. 6. Odd-column and even-column input selection schemes for NOX=3 and K=3.

       layer’s kernel size. The control logic to choose between odd-column and even-column inputs is
       explained in Figure 6, which shows, as an example, a sub-chain of 9 PEs in the case of a CONV
       layer with K=3 and NOX=3. Computing a tile of NOX × K outputs requires an input tile
       of size (NOX+K−1) × (2K−1). The figure shows the inputs streamed from the input buffers to
       the sub-chain at every time step, starting from time step 9, when the pipeline is filled. Input features
       from the even-column buffer lag behind those from the odd-column buffer by K cycles, as shown
       in the first time step in Figure 6. After streaming a complete column of the input tile (2K−1 input
       features), no new inputs are fed into the pipeline for the next time step, after which features from
       the next column of the same type (odd or even) are fed into the sub-chain. The thick boxes in Figure 6
       show the odd/even selection for each PE in every time step. At any time step, the input selections
       alternate between odd and even for every K PEs in the sub-chain. After every K time steps, the
       selections are toggled to form all the convolution windows required.
         Extensions: Since it was originally proposed as an ASIC architecture only for CONV layers, we
       migrated and optimized this CA for FPGAs and added POOL, LRN, BNORM, and ELTWISE blocks
       that were not part of the original implementation in Reference [50]. The POOL block buffers the
       final results of the Nsub output feature maps until a pooling window is ready to be computed.
       The LRN block operates on results of KN adjacent maps, and the BNORM and ELTWISE blocks
       operate on single results separately, so their integration into this CA was straightforward. Since the
       other two CAs compute both CONV and FC layers using the same hardware, to provide a fair
       comparison, we extended this CA by mapping both the 1×1 CONV layers used in ResNet-50
       and the FC layers to its convolution engine instead of implementing a dedicated engine for those
       layers. Unlike the conventional CONV layers, each output feature in these layers is the result of a
       dot-product of two vectors. Therefore, we use sub-chains of size 9 PEs as dot-product units that
       multiply and accumulate an input feature vector with Nsub weight vectors to produce Nsub partial
       results in parallel. The main drawback of this approach is that it does not exploit the dual-channel
       architecture and the complex control logic, since there is no need to arrange data in convolutional
       windows as previously explained. Also, the effective efficiency of the PEs is significantly degraded
       when executing these layers due to wasting the majority of cycles filling and flushing the pipeline
       of the systolic sub-chain to produce the result of one dot-product.

       4 METHODOLOGY
       We implement the three CAs described in Section 3 using parameterizable SystemVerilog, in which
       we specify the CA variation to be BSC, LRN, or ELT, which is the notation we will use for the rest of
         
                         Table 2. CA Parameters and Experimental Setup
        
                                          <<TABLE>>

       the article to refer to CAs that implement VGG-16, AlexNet, and ResNet-50 CNN models,
       respectively. The variations of each CA contain only the blocks required for each of their corresponding
       CNN models. For instance, the BSC variation will not contain LRN, BNORM, or ELTWISE blocks as
       there are no normalization or elementwise layers in the VGG-16 model. For all the CAs, we use
       16-bit and 8-bit fixed-point features and weights, respectively. For the ASU-like and Intel-DLA-like
       architectures, we use the same parameters reported in References [27] and [2]. For the
       Chain-NN-like CA, since it was originally implemented as an ASIC, the parameters used in Reference
       [50] would leave most of the FPGA’s DSP blocks unutilized. Therefore, we set the number of
       PEs (Nchain) to the minimum value that achieves the highest performance given the available
       DSP block count constraint. As an example, for an Arria 10 device with 3,036 hard multipliers,
       in the case of VGG-16, which has 3×3 CONV layers with 512 output channels, we can fit a maximum
       of ⌊3,036÷(3×3)⌋=337 sub-chains that occupy 3,033 multipliers and compute this CONV layer
       in ⌈512÷337⌉=2 rounds. However, we can instead use only 2,304 hard multipliers (i.e., 256
       sub-chains), which computes the same layer also in 2 rounds but uses fewer DSP blocks and does not
       affect the performance of other layers. Table 2 summarizes the experimental setup and the
       parameters used in each CA.
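
         The sizing argument in the VGG-16 example can be reproduced with a few lines of arithmetic.
       This is a minimal sketch for a single layer only; the function name is hypothetical, and it
       does not check the effect of the chosen chain length on the other layers of the model.

```python
import math

def size_chain(hard_multipliers, k, n_om):
    """Find the smallest sub-chain count that keeps the minimum number of rounds.

    Returns (max_subchains, rounds, min_subchains, multipliers_used), where one
    sub-chain of K*K PEs computes one output feature map per round.
    """
    max_subchains = hard_multipliers // (k * k)
    rounds = math.ceil(n_om / max_subchains)
    # Fewest sub-chains that still finish all n_om output maps in `rounds` rounds.
    min_subchains = math.ceil(n_om / rounds)
    return max_subchains, rounds, min_subchains, min_subchains * k * k

# Numbers from the text: 3,036 multipliers, 3x3 kernels, 512 output channels.
print(size_chain(3036, 3, 512))  # (337, 2, 256, 2304)
```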
         We optimize the performance of the three CAs implemented on the FPGA to achieve the highest
       possible operating frequency for each one. We then migrate the exact same RTL implementations
       to ASICs using the same architecture parameters indicated in Table 2. One might argue that an
       optimized ASIC design can achieve higher performance by, for example, building a custom highly
       efficient inter-PE network-on-chip such as in Reference [4] or fitting significantly more MACs
       on-chip [19]. However, the purpose of this study is not to benchmark FPGAs vs. ASICs in accelerating
       CNN inference, but rather highlight the bottlenecks of current FPGA architectures when imple-
       menting those CAs. Therefore, the ASIC implementations in this study serve as an upper-bound
       on the performance and area-efficiency of FPGA-optimized CNN accelerators where all the FPGA
       programmability has been removed. Comparing the same CAs on FPGAs and ASICs enables us
       to quantify the effect of FPGA programmability on the performance and area of those CAs and
       pinpoint the causes of this gap in current FPGA architectures; this would not be possible if we
       instead compared existing ASIC implementations to totally different state-of-the-art FPGA ones.

       4.1 Performance Modeling
       To obtain the performance results of the three CAs, we build analytical performance models based
       on our RTL simulations that calculate the number of cycles required for the computation of each
       layer as well as the time required for any necessary memory transfers of weights and features.
       We assume that the layout of the features and weights in the external memory is optimized for

                                          <<FIGURE>>

             Fig. 7. Processing time breakdown of one image for the LRN variation of the three CAs.

       the parallelization schemes of each CA, which allows us to utilize the burst capabilities and all
       the external memory bandwidth available. Given a high-level description of the CNN model, the
       operating frequency of the accelerator, the bit-widths of weights/features, and the available exter-
       nal memory bandwidth, our performance models produce the computation and memory transfer
       time required for each layer of the CNN. Our performance models assume either a single bank of
       DDR4x64 memory at 1,200MHz (for a total bandwidth of 17GB/s) or unlimited bandwidth to obtain
       effective performance and computational performance results, respectively. As an example, Figure 7
       shows the performance model output for AlexNet on the three CAs. We then use this output to
       calculate the throughput in GOPS, counting each MAC as two operations (i.e., a multiplication and
       an addition). We verified our performance models against the results reported in References [27]
       and [2], and we found that our models align well with the published results.
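
         The quantities produced by such a model can be sketched as follows. This is an illustrative
       outline only, not the models used in this study: the function names and example numbers are
       hypothetical, and how computation and memory transfer time are combined (overlapped or
       summed) is deliberately left to the caller because it depends on each CA's buffering scheme.

```python
def layer_times(compute_cycles, freq_hz, bytes_transferred, bandwidth_bytes_per_s=None):
    """Return (computation time, memory transfer time) of one layer in seconds.

    Passing bandwidth_bytes_per_s=None models unlimited external bandwidth,
    which corresponds to the computational-performance case.
    """
    compute_s = compute_cycles / freq_hz
    if bandwidth_bytes_per_s is None:
        memory_s = 0.0
    else:
        memory_s = bytes_transferred / bandwidth_bytes_per_s
    return compute_s, memory_s

def throughput_gops(total_macs, total_seconds):
    """Throughput in GOPS, counting each MAC as two operations."""
    return 2 * total_macs / total_seconds / 1e9

# Hypothetical layer: 1e6 cycles at 250 MHz, 34 MB moved at 17 GB/s.
compute_s, memory_s = layer_times(1_000_000, 250e6, 34e6, 17e9)
print(compute_s, memory_s)
# Hypothetical workload: 0.5e9 MACs completed in 5 ms.
print(throughput_gops(0.5e9, 5e-3))
```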

       4.2 ASIC Flow
       For the ASIC implementations, we use Synopsys Design Compiler 2013.03 to synthesize the CAs
       using 28nm STMicroelectronics standard-cell libraries; we target an unachievable clock period
       of 0ns to achieve the highest possible frequency and then perform area recovery by setting the
       maximum area to 0 and carrying out an incremental compilation. The standard-cell library comes
       with a wide variety of variations for different processes, voltages and operating temperatures, from
       which we choose the 1.0V, 125°C, and worst-case process corner for our experiments.
         Memory Compiler: We use COFFE’s memory compiler [46] to generate on-chip memories for
       our ASIC implementations. Although this memory compiler was previously used to design FPGA
       BRAM blocks, it is capable of designing custom memory blocks for ASICs with any required word
       size and depth, without any FPGA-specific circuitry. The memory cell layout as well as the
       verification of its area and timing results against state-of-the-art industrial and academic designs are
       detailed in Reference [46]. Our experiments also show that the area of memory blocks generated
       by COFFE’s memory compiler aligns well with that generated by the OpenRAM [12] memory
       compiler for memories having different word sizes and depths. The ASIC CAs have the flexibility to
       implement on-chip memories of the required size and type (i.e., simple or dual port) unlike the
       FPGA implementations, which are constrained by the fixed size of BRAM blocks.
         Place and Route Correction Factors: Using synthesis-only results for ASIC designs can
       overestimate frequency and underestimate area, as synthesis only predicts routing effects. However,
       pushing all nine designs that we implemented through multiple iterations of the place-and-route flow
       proved computationally infeasible due to the very high runtime of such large designs and the limited
       tool licenses available. Instead, we exploit the modular nature of the three architectures and place
       and route smaller instances of the CAs with fewer PEs (1/8 to 1/4 of the full-size designs) to obtain
       correction factors for our synthesis-only results of the full-size CAs. We use Cadence Innovus 16
       to place and route our designs. Our experiments show that the frequency achieved in synthesis is
       degraded after placement and routing by factors of 0.65, 0.74, and 0.73 for the ASU-like,
       Intel-DLA-like, and Chain-NN-like CAs, respectively. We observed that the area of the CAs scales linearly and

                  Table 3. Frequency, Effective Performance, and Image Processing for the

                            <<TABLE>>

       that the correction factors are consistent across different sizes of the CAs, as we expected given
       the modular nature of these architectures, and this increases our confidence in the correction
       factors. We also needed to bloat the area of the ASIC implementations by 5% for ASU-like and 11%
       for both Intel-DLA-like and Chain-NN-like architectures to achieve a successful routing that met
       timing. We apply those correction factors to our synthesis-only results to obtain more accurate
       and realistic area and performance numbers for the placed and routed ASIC implementations.
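
         Applying these correction factors amounts to a simple scaling of the synthesis-only numbers,
       as in the sketch below. The factor values are those quoted in this section; the function name,
       the dictionary layout, the assumption that the area bloat is applied as a multiplicative
       (1 + percentage) factor, and the example inputs are all illustrative.

```python
# Frequency degradation after place and route, per CA (values from the text).
FREQ_FACTOR = {"ASU-like": 0.65, "Intel-DLA-like": 0.74, "Chain-NN-like": 0.73}
# Area bloat needed for successful routing: 5% and 11% (values from the text).
AREA_FACTOR = {"ASU-like": 1.05, "Intel-DLA-like": 1.11, "Chain-NN-like": 1.11}

def correct_synthesis_results(ca, synth_freq_mhz, synth_area_mm2):
    """Scale synthesis-only frequency and area to post-place-and-route estimates."""
    return synth_freq_mhz * FREQ_FACTOR[ca], synth_area_mm2 * AREA_FACTOR[ca]

# Hypothetical synthesis-only result: 1,000 MHz and 10 mm^2.
freq, area = correct_synthesis_results("ASU-like", 1000.0, 10.0)
print(round(freq, 1), round(area, 2))
```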

       4.3 FPGA Flow
       For the FPGA implementations, we use Intel Quartus Prime 17.0 to synthesize, place and route the
       three variations of each CA for the largest and fastest speed-grade Arria 10 device. The
       functionality of all the designs is verified using ModelSim Intel FPGA Starter Edition 10.5b. To estimate
       the area occupied by the CAs on the FPGA, we first convert all the utilized resources to equivalent
       ALMs (eALMs). It is reported in Reference [34] that the costs of an M20K block and a DSP block
       in Stratix V architecture are 40 and 30 eALMs, respectively. For the Arria 10 architecture, which
       uses the same M20K blocks as Stratix V, we use the same cost for BRAMs; however, we account for
       the 10% increase in DSP block area compared to Stratix V due to adding support for floating-point
       arithmetic [21] leading to a DSP block cost of 33 eALMs. After that, we use the publicly available
       area of the 65 nm Stratix III ALM [53] and scale it down to 28nm to get an area estimate in squared
       millimeters that is comparable to the area of the ASIC implementations. Although the ALM ar-
       chitecture has only minor changes from Stratix III to Arria 10, we believe that the area results of
       the FPGA implementations in squared millimeters can still be optimistic, since we assume ideal
       scaling from 65 to 28nm. However, we are most interested in relative trends in our area gap anal-
       ysis, which can help us identify the blocks that have relatively higher gap than others, rather than
       finding the absolute area results in squared millimeters with high accuracy.
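
         The area estimation steps above can be sketched as follows. The block costs (40 eALMs per
       M20K and 33 eALMs per DSP block) come from the text; the function names are hypothetical, the
       65 nm ALM area is left as an argument because its value from Reference [53] is not restated
       here, and ideal scaling is assumed to shrink area with the square of the feature size.

```python
def equivalent_alms(alms, m20k_blocks, dsp_blocks, m20k_cost=40, dsp_cost=33):
    """Convert utilized FPGA resources to equivalent ALMs (eALMs)."""
    return alms + m20k_cost * m20k_blocks + dsp_cost * dsp_blocks

def fpga_area_mm2(ealms, alm_area_um2_at_65nm, from_nm=65, to_nm=28):
    """Estimate silicon area in mm^2, assuming ideal (quadratic) process scaling."""
    scale = (to_nm / from_nm) ** 2
    return ealms * alm_area_um2_at_65nm * scale / 1e6  # um^2 -> mm^2

# Hypothetical utilization: 100,000 ALMs, 1,000 M20K blocks, 1,518 DSP blocks.
ealms = equivalent_alms(100_000, 1_000, 1_518)
print(ealms)  # 190094
```

         Multiplying the eALM count by a per-ALM area then gives an estimate that is comparable, in
       relative terms, to the ASIC area numbers.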

       5 RESULTS
       In this section, we first compare the FPGA implementations of the different variations of the three
       CAs in terms of performance, resource utilization, and area breakdown. Then, we study the per-
       formance and area gap compared to the ASIC implementations. Finally, we analyze these results
       and suggest FPGA architectural changes to achieve more efficient CNN inference acceleration.

       5.1 FPGA Results
       Table 3 summarizes the maximum frequency and the processing time of one image, and Figure 8(a)
       shows the performance results in TOPS for all variations of the three CAs. We show the
       performance results of the Intel-DLA-like CA for both processing a batch of size B=96 images,
       similar to what was reported in Reference [2], and B=1, similar to the other CAs. Besides using
       the Winograd transform that significantly reduces the amount of required operations and reduc-
       ing external memory transfers by using double-buffered stream buffers, the Intel-DLA-like CA also

                                              <<FIGURE>>

            Fig. 8. FPGA Results: (a) Performance in TOPS. (b) Resource utilization. (c) Area breakdown.

       achieves the highest frequency because of its pipelined daisy-chain architecture that allows an
       optimized placement of the PEs with less fan-out from the feature/weight buffers to the PEs when
       compared to the other CAs. Therefore, the Intel-DLA-like CA achieves the highest performance
       with 1.54× and 1.07× more TOPS than that achieved by the ASU-like CA (which uses more PEs)
       for the BSC and ELT and LRN variations, respectively, in case of a single image inference.
         The Intel-DLA-like CA has the highest advantage over the ASU-like CA in the BSC variation,
       since all the CONV layers of VGG-16 are of size 3×3, which benefit the most from the Winograd
       transform. This advantage decreases in the ELT variation as the ratio of 3×3 CONV layers to all
       layers decreases in ResNet-50, and we cannot fully make use of the double-buffering technique due
       to the ELTWISE layers that require storing intermediate results to the external memory. However,
       despite the significantly higher performance reported in Reference [2] in the case of batch processing
       of FC layers, it achieves only slightly more TOPS than the ASU-like CA in the case of single
       image inference using AlexNet. Figure 8(a) also shows that the gains from batch processing (4.2×
       and 1.8× more TOPS in the LRN and BSC variations, respectively) almost vanish in ELT, since
       the ResNet-50 model has only one small FC layer compared to three larger FC layers in AlexNet
       and VGG-16.
         The Chain-NN-like CA has the lowest performance results in all variations, since it runs at a
       significantly lower frequency than the other CAs. We believe that this is due to the high utilization of
       the FPGA’s soft fabric (between 74% and 77%, as shown in Figure 8(b)), leading to physically stretched
       critical paths. The large fan-out from the odd/even input buffers to the first PE of all sub-chains
       and the large multiplexers used for selecting the outputs of sub-chains for different convolution
       sizes (i.e., selecting between every 9th, 25th, 49th, or 121st PE for CONV layers of size K=3, 5, 7,
       or 11, respectively) also negatively affect the frequency. Finally, the performance of this CA is
       significantly degraded in FC layers and 1×1 CONV layers, since it was originally implemented for
       accelerating only the CONV layers, as explained in Section 3.3.
         Figure8(b) shows the percentage utilization of ALMs, M20K BRAM blocks, and DSP blocks for
       each CA variation. The highest utilization percentage in most cases is for the DSP blocks, which
       are the core of the convolution engine in all CAs. The ASU-like CA uses all the 1,518 DSP blocks
       (3,03618-bitmultipliers)toimplementthethree-dimensionalarrayofMACunitsinitsconvolution
       engine and off-loads 100 MAC units to the FPGA’s soft fabric. The BSC and ELT variations of the
       Intel-DLA-like CA use 91% of the DSP blocks, 224 of which are used for the Winograd transform
       and inverse transform, while 1,152 blocks are used to implement the dot product units in its PEs. In
       addition, its LRN variation uses the remaining DSP blocks to implement some of the multiplication
       operations of the LRN layers. The Chain-NN-like CA uses significantly more soft logic, because it
       implements the weight buffers as distributed memories in MLABs. In Figure 8(c), we show the
       area in square millimeters estimated according to the methodology of Section 4.3 and its breakdown
       for all the CAs. With the exception of the Chain-NN-like CA that uses a significant amount of
       the soft fabric to implement weight buffers, the area of the two other CAs is dominated by the
       computational blocks such as the convolution, pooling and normalization blocks. In the Intel-
       DLA-like CA, the Winograd transform and inverse transform blocks contribute 29–33% of the
       total area, which is almost as much as the convolution engine, which consumes 32–37% of
       the total area.

          ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, No. 3, Article 20. Pub. date: December 2018.

                      Table 4. Summary of Area and Performance Ratios

                      Var       CA           AR (1)    CPR (2)    EPR (3)
                      BSC       ASU            9.38      4.44       2.09
                      BSC       Intel-DLA      7.87      2.83       1.26
                      BSC       Chain-NN       8.16      6.33       3.63
                      LRN       ASU           11.02      4.63       1.25
                      LRN       Intel-DLA      8.48      2.91       1.15
                      LRN       Chain-NN       8.38      5.98       2.29
                      ELT       ASU            9.48      4.58       2.08
                      ELT       Intel-DLA      7.93      2.82       1.5
                      ELT       Chain-NN       8.27      6.26       5.35
                      Geomean                  8.73      4.31       2.01

                      (1) Area Ratio (FPGA/ASIC).
                      (2) Computational Performance Ratio (ASIC/FPGA).
                      (3) Effective Performance Ratio (ASIC/FPGA).

              Fig. 9. Area and performance gaps.
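
The DSP block counts quoted above for Figure 8(b) can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original study; the constant and function names are hypothetical, and it assumes two 18-bit multipliers per Arria 10 DSP block, as stated in the text.

```python
# Sanity check of the DSP block figures quoted in the text.
# All names here are illustrative; only the numbers come from the article.

TOTAL_DSP_BLOCKS = 1518      # DSP blocks on the Arria 10 device, per the text
MULTIPLIERS_PER_DSP = 2      # two 18-bit multipliers per DSP block


def utilization(used_blocks: int, total_blocks: int = TOTAL_DSP_BLOCKS) -> float:
    """Return the fraction of DSP blocks that are in use."""
    return used_blocks / total_blocks


# ASU-like CA: all DSP blocks, each contributing two 18-bit multipliers.
asu_multipliers = TOTAL_DSP_BLOCKS * MULTIPLIERS_PER_DSP

# Intel-DLA-like CA (BSC and ELT): Winograd blocks plus dot-product blocks.
intel_dla_blocks = 224 + 1152
intel_dla_util = utilization(intel_dla_blocks)

print(f"ASU-like multipliers: {asu_multipliers}")
print(f"Intel-DLA-like DSP blocks: {intel_dla_blocks} ({intel_dla_util:.1%})")
```

The 224 + 1,152 blocks come to roughly 91% of the device, which matches the utilization reported for the BSC and ELT variations.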


       5.2 Performance Gap
       Figure 9 illustrates the area and computational performance gap between the FPGA and ASIC
       implementations of the three variations of each CA. The FPGA implementations are represented
       as triangles while the ASIC implementations are represented as squares. The colors and patterns of
       the data points represent the variation and the CA, respectively, and the dotted lines connect each
       FPGA implementation to its ASIC counterpart. The closer a data point is to the upper left corner
       of the graph, the better it is, as it will have smaller area and higher performance. Table 4 summarizes
       the FPGA-to-ASIC area ratios as well as the computational performance and effective performance
       ASIC-to-FPGA ratios for each CA variation. The computational performance ratio (CPR) represents
       the performance gap between the FPGA and ASIC implementations assuming infinite external
       memory bandwidth. The effective performance ratio (EPR), in contrast, represents the performance
       gap assuming a single-bank external memory interface as specified previously. We believe that
       the computational performance ratio better captures the cost of FPGA programmability and its
       effect on the computational core performance of the three CAs as it is not limited by a relatively
       low-performance external memory interface. The values of EPR are less than those of the CPR as
       shown in Table 4 due to the external memory bandwidth constraints. As the performance of the
       computational engine increases, the CAs can use multiple DDR memory banks or high-bandwidth
       memory to enhance the overall performance. Therefore, EPR and CPR represent lower and upper
       bounds for design points using different external memory systems. Since the main focus of this
       work is studying the computational gap caused by the FPGA programmability, we believe that the
       CPR is the more important metric.
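
The Geomean row of Table 4 can be reproduced from the per-design ratios. The sketch below is an illustrative check rather than the authors' tooling; the values are transcribed from Table 4, and the dictionary layout and function name are assumptions made for this example.

```python
from statistics import geometric_mean

# (variation, CA) -> (AR, CPR, EPR), transcribed from Table 4.
TABLE4 = {
    ("BSC", "ASU"): (9.38, 4.44, 2.09),
    ("BSC", "Intel-DLA"): (7.87, 2.83, 1.26),
    ("BSC", "Chain-NN"): (8.16, 6.33, 3.63),
    ("LRN", "ASU"): (11.02, 4.63, 1.25),
    ("LRN", "Intel-DLA"): (8.48, 2.91, 1.15),
    ("LRN", "Chain-NN"): (8.38, 5.98, 2.29),
    ("ELT", "ASU"): (9.48, 4.58, 2.08),
    ("ELT", "Intel-DLA"): (7.93, 2.82, 1.5),
    ("ELT", "Chain-NN"): (8.27, 6.26, 5.35),
}


def column_geomeans(table):
    """Return the geometric mean of each ratio column (AR, CPR, EPR)."""
    columns = zip(*table.values())  # transpose rows into three columns
    names = ("AR", "CPR", "EPR")
    return {name: geometric_mean(column) for name, column in zip(names, columns)}


geomeans = column_geomeans(TABLE4)
for name, value in geomeans.items():
    print(f"{name}: {value:.2f}")
```

The computed values should agree with the reported geometric means (8.73, 4.31, and 2.01) to within rounding. Every EPR entry is also smaller than the corresponding CPR entry, consistent with EPR and CPR acting as lower and upper bounds.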

                                      <<FIGURE>>

       Fig. 10. Area gap between FPGA and ASIC implementations for different blocks of: (a) BSC, (b) LRN, and
       (c) ELT. The percentages represent the contribution of each component to the total area of the FPGA
       implementation.

         Interestingly, the computational performance gap is not consistent among different CAs; however,
       different variations of the same CA have similar gap results. The Intel-DLA-like CA has
       the smallest ASIC-to-FPGA computational performance ratio (≈2.9) compared to the ASU-like
       and Chain-NN-like CAs (≈4.6 and 6.2, respectively). We believe that the reason is that the Intel-
       DLA-like CA has a modular daisy-chain architecture, which is more routing-friendly and benefits
       the FPGA implementation more than the ASIC one due to the relatively slow speed of FPGA
       routing.
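
The per-CA figures quoted above can be approximated by averaging the CPR column of Table 4 over the three variations. This is a minimal sketch under the assumption that a plain arithmetic mean is an adequate summary (the article does not state how the approximate values were obtained); the names are hypothetical.

```python
from statistics import mean

# CPR (ASIC/FPGA) per CA and variation, transcribed from Table 4.
CPR = {
    "ASU": {"BSC": 4.44, "LRN": 4.63, "ELT": 4.58},
    "Intel-DLA": {"BSC": 2.83, "LRN": 2.91, "ELT": 2.82},
    "Chain-NN": {"BSC": 6.33, "LRN": 5.98, "ELT": 6.26},
}


def summarize(cpr):
    """Return {CA: (mean CPR, max - min across variations)}."""
    summary = {}
    for ca, by_variation in cpr.items():
        values = list(by_variation.values())
        summary[ca] = (mean(values), max(values) - min(values))
    return summary


summary = summarize(CPR)
for ca, (average, spread) in summary.items():
    print(f"{ca}: mean CPR = {average:.2f}, spread across variations = {spread:.2f}")
```

The spread within each CA is well below the differences between CAs, which is the pattern the paragraph describes.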

       5.3 Area Gap
       On average, the FPGA implementations have 8.7× larger area than their ASIC counterparts and
       the gap is, in contrast to the performance gap, fairly similar across different variations of the three
       CAs. To understand the reasons for this gap, Figures 10(a), 10(b), and 10(c) illustrate the area ratio
       of different components in the FPGA implementations to those in the ASIC implementations for
       the BSC, LRN, and ELT variations, respectively. The percentages written above the bars represent
       the area breakdown of each FPGA implementation into different components and hence indicate
       the contribution of each component to the overall area gap. We notice that the convolution engine,
       which has the largest contribution to total area (up to 60% in some cases) and thus the strongest
       impact on the total area gap, has an FPGA-to-ASIC area ratio ranging from 13 to 31 for different
       variations of the three CAs. The Intel-DLA-like CA uses the Winograd transform to significantly reduce
       MAC operations in convolution, which costs almost the same area as the convolution engine in the
       FPGA implementation. However, the Winograd transform and inverse transform blocks in this CA
       have FPGA-to-ASIC area ratios of 28 and 26, respectively, which are almost twice the area gap for
       the convolution engine, since they contain a large number of multi-input adders implemented in
       the FPGA’s soft fabric compared to the convolution engine, which is mostly implemented in hard
       DSP blocks. The smallest area gap is in the feature and weight buffers, since the RAMs in the FPGA
       and the ASIC implementations are both custom SRAM blocks. However, the buffer area ratios are
       still significant (≈3–5) because of the area overhead of the programmable routing in BRAM tiles
       as well as the underutilization of some of the M20K blocks on the FPGA, whereas in the ASIC
       implementations, we use memories with the exact required sizes. The NORM block has an area
       ratio of 32 and 28 and consumes 22% and 14% of the total area in ASU-like and Intel-DLA-like CAs,
       respectively, since it is a heavily arithmetic block and is mostly implemented in the soft fabric.
       However, it only consumes 3% of the total area in the Chain-NN-like CA, which produces outputs
       in one dimension only and therefore does not normalize output features at different locations in
       parallel. The POOL, ELTWISE and BNORM blocks have large area ratios; however, they have small
       overall areas and hence limited impact on the total gap.
         An interesting observation is that the area gap in the convolution engine of the Intel-DLA-like
       CA is significantly less than that of the other two CAs: an area ratio of 13 compared to 20 and
       29 in ASU-like and Chain-NN-like CAs, respectively. This is because the Intel-DLA-like CA uses
       the hard adders in the DSP blocks to implement its dot-product unit, while the other two CAs
       pay for the area of the complete DSP block on the FPGA but only make use of the multipliers
       inside it and thus have a higher area gap compared to their ASIC counterparts. This observation
       motivates the investigation of new DSP block designs that could bring more of the convolution
       engine functionality inside the hard DSP block. For instance, the ASU-like CA needs two separate
       accumulators for the two independent 18-bit multipliers, which is not supported in current DSP
       blocks. Hence, the DSP block accumulators are wasted and soft logic is used to implement the
       accumulators. The convolution engine of the Chain-NN-like CA has the highest area gap as it
       implements input multiplexing, accumulation, and output de-multiplexing in the soft fabric.

       5.4 Architectural Insights
       Based on the results of Sections 5.1 and 5.2, we can draw several architectural insights:

          • According to the resource utilization results in Figure8(b), the limiting factor is the DSP
            block count available on-chip, with close to 100% resource utilization in most cases. One
            direct approach to gain higher performance is adding more DSP blocks to current FPGAs,
            especially given that a DSP-focused device spends only 5% of its core area on DSP blocks
            [21]. This requires a careful architectural study to determine the optimal ratio and area
            distribution between DSPs, BRAMs, and ALMs for DL-tuned FPGAs that are still flexible
            enough and suitable for other applications as well. These architectural explorations require
            a suite of DL benchmark circuits such as the one we developed in this work, and which we
            plan to expand and open-source in future work.
          • As shown in Figure 10, the area gap of the convolution engine of the Intel-DLA-like CA is
            significantly less than that of the other two CAs, since it makes better use of the DSP block
            available functionalities such as the internal adders and hard cascade chains. By looking
            at the ASIC area breakdown of the convolution engine, we can see that about 72% of the
            logic in the convolution engine of the Intel-DLA-like CA was implemented inside hard DSP
            blocks on the FPGA compared to only 32% and 35% in the ASU-like and Chain-NN-like CAs,
            respectively, and the rest is implemented in the soft fabric. We believe that small changes to
            the DSP block architecture could capture more of the convolution engine hardware inside
            the hard circuitry of the DSP block. For example, adding an operation mode that configures
            the two internal adders as independent accumulators for two independent 18-bit MACs
            (such as in the ASU-like CA) or having a small circular shift register accumulator for inter-
            leaving dot-product operations (as in the Intel-DLA-like CA) would save soft logic. Neither
            of the DSP block enhancements would add much logic to the block, nor would they require
            more block routing ports (inputs and outputs) and, therefore, the DSP block area increase
            would be minimal. To increase the DSP block count on-chip, as mentioned in our first
            suggestion, we wish not only to avoid a significant block area increase, but also to remove DSP
            block functionalities that are unnecessary for DL applications and would not cause severe
            performance degradation when implemented in the soft fabric. For example, removing the
            built-in constant coefficient banks in the Arria 10 DSP blocks should be evaluated as they
            are not usable by any of our CAs.
          • In this study, we used 16- and 8-bit fixed-point precision for features and weights,
            respectively, in all CAs to ensure fair comparisons. However, the most suitable precision for CNN
            inference is debatable and varies widely in the literature from single-precision floating-
            point down to ternary and binary [28]. Currently, DSP blocks from Intel and Xilinx support
            a limited number of precisions. For instance, a DSP block in Intel Arria 10, and similarly
            Stratix 10, FPGAs supports two 18-bit, one 27-bit, or one single-precision floating-point
            multiplication. In contrast, a DSP slice in Xilinx Virtex UltraScale FPGAs supports one 27×18
            multiplication. Designers can sometimes fit more low-precision multiplies that match
            certain patterns using clever tricks such as performing two 8-bit multiplies that share one
            operand using a single Xilinx DSP slice [8]. Even with these operand packing tricks, using
            lower precision leaves a portion of the DSP block logic idle. We can avoid this by designing
            DSP blocks that natively support low-precision multiplications and reuse routing ports and
            multiplier sub-arrays to keep the area overhead minimal.
          • When implementing the three CAs, we noticed that the required on-chip buffers are either
            deep central buffers for input and output features or smaller and more distributed buffers
            for the weights. When we tried to extend the double-buffering technique used in the Intel-
            DLA-like CA to more layers of models larger than AlexNet by implementing deeper stream
            buffers, it resulted in a net performance degradation as the operating frequency dropped
            significantly due to depth stitching of M20K BRAMs to implement those deep buffers. How-
            ever, when implementing the small weight buffers of the Chain-NN-like CA in MLABs, the
            high utilization of the soft fabric also resulted in lower operating frequency. This observa-
            tion indicates that having only M20K BRAMs and MLABs to implement on-chip memories
            might not be a good fit for DL acceleration on FPGAs. This also requires a more detailed
            architectural study to determine the best size and ratio of on-chip BRAMs and their effect on
            the overall performance using DL-representative benchmarks, and we believe our parame-
            terized CAs can form the start of this benchmark set. In addition, the memory-richness of
            FPGAs can be enhanced by employing emerging technologies such as Magnetic Tunneling
            Junction memories, which can provide bigger yet more dense BRAMs for memory-intensive
            applications as shown in Reference [54].
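
The operand-packing trick mentioned in the third insight above (two 8-bit multiplies that share one operand, computed with a single wide multiplication) can be illustrated in software. The sketch below is an assumption-laden model, not the implementation from Reference [8]: the 18-bit packing offset, the signed 8-bit operand range, and all function names are chosen here for illustration only.

```python
from typing import Tuple

SHIFT = 18  # assumed packing offset; leaves room for the signed low product


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def packed_multiply(a: int, b: int, c: int) -> Tuple[int, int]:
    """Compute (a * c, b * c) using one wide multiplication.

    a and b are packed into a single wide operand; c is the shared operand.
    """
    for operand in (a, b, c):
        if not -128 <= operand <= 127:
            raise ValueError("operands must fit in signed 8 bits")
    packed = (a << SHIFT) + b        # wide operand: a in the high bits, b in the low bits
    product = packed * c             # the single wide multiplication
    bc = to_signed(product, SHIFT)   # low field holds b * c (fits in 18 signed bits)
    ac = (product - bc) >> SHIFT     # remove the low field, then recover a * c
    return ac, bc


print(packed_multiply(-7, 5, 3))
```

In this model the packed operand fits in 27 signed bits and the shared operand in 18, yet only a fraction of the multiplier array contributes to the two useful products, which illustrates why such packing still leaves part of the DSP block logic idle.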

       6 CONCLUSION

       In this article, we implemented three highly optimized state-of-the-art CAs for accelerating CNN
       inference, which are: ASU-like, Intel-DLA-like, and Chain-NN-like CAs. We implemented three
       variations of each CA (BSC, LRN, and ELT) for three different CNN models (VGG-16, AlexNet, and
       ResNet-50, respectively) on an Intel Arria 10 FPGA device and compared them to 28nm ASIC im-
       plementations of the same CAs to quantify the programmability cost that comes with using FPGAs
       on the performance and area of DL accelerators. Across different variations of the three CAs, we
       observed a consistent area gap with an average FPGA-to-ASIC area ratio of 8.7×, to which the con-
       volution engine contributes the most with area ratios ranging from 13 to 31 for different CAs. The
       performance gap, unlike the area gap, varies significantly across different CAs. The ASIC
       implementations are 2.8× to 6.3× faster than the FPGA implementations in computational
       performance when assuming infinite external memory bandwidth. We find that the Intel-DLA-like
       CA has the smallest performance gap compared to its ASIC counterpart indicating that focusing
       on modular and routing-friendly designs is of great importance for building efficient FPGA-based
       DL accelerators. Finally, we suggest several FPGA DSP and RAM architecture changes for future
       work that could reduce the area and performance gaps and enable more efficient DL acceleration
       on FPGAs.

       ACKNOWLEDGMENTS

       The authors thank Martin Langhammer, Debbie Marr, and Eriko Nurvitadhi for helpful discussions,
       as well as Huawei, Intel, and NSERC for funding support.

       REFERENCES
        [1] M. Abadi et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the OSDI. 265–283.
        [2] U. Aydonat et al. 2017. An OpenCL (TM) deep learning accelerator on Arria 10. In Proceedings of the FPGA. 55–64.
        [3] Y. Chen et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the MICRO. 609–622.
        [4] Y. Chen et al. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of the JSSC, Vol. 52. 127–138.
        [5] S. Chetlur et al. 2014. CuDNN: Efficient primitives for deep learning. arXiv:1410.0759.
        [6] E. Chung and J. Fowers. 2017. Accelerating persistent neural networks at data center scale. In Proceedings of the HOT CHIPS, Vol. 29.
        [7] F. Colombo et al. 2017. Deep artificial composer: A creative neural network model for automated melody generation. In Proceedings of the EvoMUSART. 81–96.
        [8] Y. Fu et al. 2016. Deep learning with INT8 optimization on Xilinx devices. In white paper of Xilinx.
        [9] L. Gatys et al. 2015. A neural algorithm of artistic style. arXiv:1508.06576.
       [10] A. Graves et al. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the ICASSP. 6645–6649.
       [11] Y. Guan et al. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In Proceedings of the FCCM. 152–159.
       [12] Matthew R. Guthaus et al. 2016. OpenRAM: An open-source memory compiler. In Proceedings of the ICCAD.
       [13] P. Gysel et al. 2016. Hardware-oriented approximation of convolutional neural networks. arXiv:1604.03168.
       [14] K. He et al. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the ICCV. 1026–1034.
       [15] K. He et al. 2016. Deep residual learning for image recognition. In Proceedings of the CVPR. 770–778.
       [16] S. Herculano-Houzel. 2009. The human brain in numbers: A linearly scaled-up primate brain. In Frontiers in Human
          Neuroscience, Vol. 3.
       [17] S. Ioffe and C. Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate
          shift. In Proceedings of the ICML. 448–456.
       [18] Y. Jia et al. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.
       [19] N. Jouppi et al. 2017. In-data center performance analysis of a tensor processing unit. In Proceedings of the ISCA. 1–12.
       [20] A. Krizhevsky et al. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the
          NIPS. 1097–1105.
       [21] M. Langhammer and B. Pasca. 2015. Floating-point DSP block architecture for FPGAs. In Proceedings of the FPGA.
          117–125.
       [22] A. Lavin and S. Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the CVPR. 4013–4021.
       [23] Z. Liu et al. 2016. Automatic code generation of convolutional neural networks in FPGA implementation. In
          Proceedings of the FPT. 61–68.
       [24] L. Lu et al. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Proceedings of the FCCM.
          101–108.
       [25] Y. Ma et al. 2016. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA. In
          Proceedings of the FPL. 1–8.
       [26] Y. Ma et al. 2017. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep
          convolutional neural networks. In Proceedings of the FPL. 1–8.
       [27] Y. Ma et al. 2017. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural
          networks. In Proceedings of the FPGA. 45–54.
       [28] A. Mishra et al. 2017. WRPN: Wide reduced-precision networks. arXiv:1709.01134.
       [29] E. Nurvitadhi et al. 2016. Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. In
          Proceedings of the FPT. 77–84.
       [30] K. Ovtcharov et al. 2015. Accelerating deep convolutional neural networks using specialized hardware. In Microsoft
          Research Whitepaper, Vol. 2.
       [31] A. Prost-Boucle et al. 2017. Scalable high-performance architecture for convolutional ternary neural networks on
          FPGA. In Proceedings of the FPL. 1–7.
       [32] A. Putnam et al. 2014. A reconfigurable fabric for accelerating large-scale data center services. In Proceedings of the
          ISCA. 13–24.
       [33] J. Qiu et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of
          the FPGA. 26–35.
       [34] R. Rashid et al. 2014. Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL
          HLS. In Proceedings of the FPT. 20–27.
       [35] D. E. Rumelhart et al. 1985. Learning Internal Representations by Error Propagation. Technical Report.
       [36] O. Russakovsky et al. 2015. ImageNet large scale visual recognition challenge. In Proceedings of the IJCV, Vol. 115.
          211–252.
       [37] H. Sharma et al. 2016. From high-level deep neural models to FPGAs. In Proceedings of the MICRO. 1–12.
       [38] F. Shen et al. 2016. Weighted residuals for very deep networks. In Proceedings of the ICSAI. 936–941.
       [39] Y. Shen et al. 2016. Overcoming resource underutilization in spatial CNN accelerators. In Proceedings of the FPL. 1–4.
       [40] Y. Shen et al. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In Proceedings of the ISCA.
          535–547.
       [41] D. Silver et al. 2017. Mastering the game of Go without human knowledge. In Nature, Vol. 550. 354–359.
       [42] N. Suda et al. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural
          networks. In Proceedings of the FPGA. 16–25.
       [43] A. Suleiman et al. 2017. Towards closing the energy gap between HOG and CNN features for embedded vision.
          arXiv:1703.05853.
       [44] I. Sutskever et al. 2014. Sequence to sequence learning with neural networks. In Proceedings of the NIPS. 3104–3112.
       [45] C. Szegedy et al. 2015. Going deeper with convolutions. In Proceedings of the CVPR.
       [46] Kosuke Tatsumura et al. 2016. High density, low energy, magnetic tunnel junction based block RAMs for memory-rich
          FPGAs. In Proceedings of the FPT. 4–11.
       [47] Y. Umuroglu et al. 2017. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of
          the FPGA. 65–74.
       [48] S. Venieris and C. Bouganis. 2016. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs.
          In Proceedings of the FCCM. 40–47.
       [49] G. Venkatesh et al. 2017. Accelerating deep convolutional networks using low-precision and sparsity. In Proceedings
          of the ICASSP. 2861–2865.
       [50] S. Wang et al. 2017. Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural
          networks. In Proceedings of the DATE. 1032–1037.
       [51] Y. Wang et al. 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network
          family. In Proceedings of the DAC. 1–6.
       [52] X. Wei et al. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In
          Proceedings of the DAC. 1–6.
       [53] H. Wong et al. 2011. Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture. In
          Proceedings of the FPGA. 5–14.
       [54] S. Yazdanshenas et al. 2017. Don’t forget the memory: Automatic block RAM modelling, optimization, and
          architecture exploration. In Proceedings of the FPGA. 115–124.
       [55] C. Zhang et al. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In
          Proceedings of the FPGA. 161–170.
       [56] C. Zhang et al. 2016. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the
          ISLPED. 326–331.
       [57] C. Zhang and V. Prasanna. 2017. Frequency domain acceleration of convolutional neural networks on CPU-FPGA
          shared memory system. In Proceedings of the FPGA. 35–44.